Modeling Linkage Disequilibrium 1 Increases Accuracy of Polygenic
Total Page:16
File Type:pdf, Size:1020Kb
1 Modeling Linkage Disequilibrium 2 Increases Accuracy of Polygenic Risk 3 Scores 4 Bjarni J. Vilhjálmsson1,2,3,4,*, Jian Yang5,6, Hilary K. Finucane1,2,3,7, 5 Alexander Gusev1,2,3, Sara Lindström1,2, Stephan Ripke8,9, Giulio 6 Genovese3,8,10, Po-Ru Loh1,2,3, Gaurav Bhatia1,2,3, Ron Do11,12, Tristan 7 Hayeck1,2,3, Hong-Hee Won3,13, Schizophrenia Working Group of the 8 Psychiatric Genomics Consortium, the Discovery, Biology, and Risk of 9 Inherited Variants in Breast Cancer (DRIVE) study, Sekar Kathiresan3,13, 10 Michele Pato14, Carlos Pato14, Rulla Tamimi1,2,15, Eli Stahl16, Noah 11 Zaitlen17, Bogdan Pasaniuc18, Gillian Belbin11,12, Eimear Kenny11,12,19,20, 12 Mikkel H. Schierup4, Philip De Jager3,21,22 , Nikolaos A. Patsopoulos3,21,22, 13 Steve McCarroll3,8,10, Mark Daly3,8, Shaun Purcell15, Daniel Chasman23, 14 Benjamin Neale3,8, Michael Goddard24,25, Peter M. Visscher5,6, Peter 15 Kraft1,2,3,26, Nick Patterson3, Alkes L. Price1,2,3,26,* 16 17 18 1. Department of Epidemiology, Harvard T.H. Chan School of Public 19 Health, Boston, 02115 MA, USA. 20 2. Program in Genetic Epidemiology and Statistical Genetics, Harvard 21 T.H. Chan School of Public Health, Boston, 02115 MA, USA. 22 3. Program in Medical and Population Genetics, Broad Institute of 23 Harvard and MIT, Cambridge, 02142 MA, USA. 24 4. Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, 25 Denmark. 26 5. Queensland Brain Institute, The University of Queensland, 27 Brisbane, 4072 Queensland, Australia. 28 6. The Diamantina Institute, The Translational Research Institute, 29 University of Queensland, Brisbane, 4101 Queensland, Australia. 30 7. Department of Mathematics, Massachusetts Institute of Technology, 31 Cambridge, 02139 MA, USA. 32 8. Stanley Center for Psychiatric Research, Broad Institute of MIT and 33 Harvard, Cambridge, 02142 MA, USA. 34 9. Analytic and Translational Genetics Unit, Massachusetts General 35 Hospital, Boston, 02114 MA, USA. 36 10.Department of Genetics, Harvard Medical School, Boston, 02115 37 MA, USA. 38 11.The Charles Bronfman Institute of Personalized Medicine, The 39 Icahn School of Medicine at Mount Sinai, New York, NY, USA. 40 12.Department of Genetics and Genomic Sciences, The Icahn School of 41 Medicine at Mount Sinai, New York, NY, USA. 1 1 13.Cardiovascular Research Center, Massachusetts General Hospital, 2 Harvard Medical School, Boston, 02114 MA, USA. 3 14.Department of Psychiatry and Behavioral Sciences, Keck School of 4 Medicine at University of Southern California, Los Angeles, 90089 5 CA, USA. 6 15.Channing Division of Network Medicine, Brigham and Women’s 7 Hospital, Boston, MA, 02115 USA. 8 16.The Department of Psychiatry at Mount Sinai School of Medicine, 9 New York, 10029 NY, USA 10 17.Department of Medicine, Lung Biology Center, University of 11 California San Francisco, San Francisco, 94143 CA, USA. 12 18.Department of Pathology and Laboratory Medicine, University of 13 California Los Angeles, Los Angeles, 90095 CA, USA. 14 19.The Icahn Institute of Genomics and Multiscale Biology, The Icahn 15 School of Medicine at Mount Sinai, New York, NY, USA. 16 20.The Center of Statistical Genetics, The Icahn School of Medicine at 17 Mount Sinai, New York, NY, USA. 18 21.Department of Medicine, Harvard Medical School, Boston, 02115 19 MA, USA. 20 22.Program in Translational NeuroPsychiatric Genomics, Ann Romney 21 Center for Neurologic Diseases, Department of Neurology, Brigham 22 and Women’s Hospital, Boston, 02115 MA, USA 23 23.Division of Preventive Medicine, Brigham and Women’s Hospital, 24 Boston, 02215 MA, USA. 25 24.Department of Food and Agricultural Systems, University of 26 Melbourne, Parkville, 3010 Victoria, Australia. 27 25.Biosciences Research Division, Department of Primary Industries, 28 Bundoora, 3083 Victoria, Australia. 29 26.Department of Biostatistics, Harvard T.H. Chan School of Public 30 Health, Boston, 02115 MA, USA. 31 32 Correspondence should be addressed to B.J.V. 33 ([email protected]) or A.L.P. ([email protected]). 34 ∗ 35 36 37 38 39 40 41 42 43 2 1 Abstract: 2 Polygenic risk scores have shown great promise in predicting complex disease 3 risk, and will become more accurate as training sample sizes increase. The 4 standard approach for calculating risk scores involves LD-pruning markers 5 and applying a P-value threshold to association statistics, but this discards 6 information and may reduce predictive accuracy. We introduce a new method, 7 LDpred, which infers the posterior mean effect size of each marker using a 8 prior on effect sizes and LD information from an external reference panel. 9 Theory and simulations show that LDpred outperforms the 10 pruning/thresholding approach, particularly at large sample sizes. 11 Accordingly, prediction R2 increased from 20.1% to 25.3% in a large 12 schizophrenia data set and from 9.8% to 12.0% in a large multiple sclerosis 13 data set. A similar relative improvement in accuracy was observed for three 14 additional large disease data sets and when predicting in non-European 15 schizophrenia samples. The advantage of LDpred over existing methods will 16 grow as sample sizes increase. 17 18 Introduction 19 Polygenic risk scores (PRS) computed from genome-wide association study (GWAS) 20 summary statistics have proven valuable for predicting disease risk and 21 understanding the genetic architecture of complex traits. PRS were used to predict 22 genetic risk in a schizophrenia GWAS for which there was only one genome-wide 23 significant locus1 and have been widely used to predict genetic risk for many traits1- 24 15. PRS can also be used to draw inferences about genetic architectures within and 25 across traits12,13,16-18. As GWAS sample sizes grow the prediction accuracy of PRS 26 will increase and may eventually yield clinically actionable predictions16,19-21. 27 However, as noted in recent work19, current PRS methods do not account for effects 28 of linkage disequilibrium (LD), which limits their predictive value, especially for 29 large samples. Indeed, our simulations show that, in the presence of LD, the 30 prediction accuracy of the widely used approach of LD-pruning followed by P-value 31 thresholding1,6,8,9,12,13,15,16,19,20 falls short of the heritability explained by the SNPs 32 (Figure 1 and Supplementary Figure 1; see Materials and Methods). 33 34 One possible solution to this problem is to use one of the many available prediction 35 methods that require genotype data as input, including genomic BLUP—which 36 assumes an infinitesimal distribution of effect sizes—and its extensions to non- 37 infinitesimal mixture priors22-29. However, these methods are not applicable to 38 GWAS summary statistics when genotype data are unavailable due to privacy 39 concerns or logistical constraints, as is often the case. In addition, many of these 40 methods become computationally intractable at the very large sample sizes (>100K 41 individuals) that would be required to achieve clinically relevant predictions for 42 most common diseases16,19,20. 43 3 1 In this study we propose a Bayesian polygenic risk score, LDpred, which estimates 2 posterior mean causal effect sizes from GWAS summary statistics assuming a prior 3 for the genetic architecture and LD information from a reference panel. By using a 4 point-normal mixture prior26,30 for the marker effects, LDpred can be applied to 5 traits and diseases with a wide range of genetic architectures. Unlike LD-pruning 6 followed by P-value thresholding, LDpred has the desirable property that its 7 prediction accuracy converges to the heritability explained by the SNPs as sample 8 size grows (see below). Using simulations based on real genotypes we compare the 9 prediction accuracy of LDpred to the widely used approach of LD-pruning followed 10 by P-value thresholding1,6,8,9,12,13,15,16,19,20,31, as well as other approaches that train on 11 GWAS summary statistics. We apply LDpred to seven common diseases for which 12 raw genotypes are available in small sample size, and to five common diseases for 13 which only summary statistics are available in large sample size. 14 Materials and Methods 15 Overview of Methods 16 LDpred calculates the posterior mean effects from GWAS summary statistics 17 conditional on a genetic architecture prior and LD information from a reference 18 panel. The inner product of these re-weighted effect sizes with test sample 19 genotypes is the posterior mean phenotype and thus, under the model assumptions 20 and available data, an optimal (minimum variance and unbiased) predictor32. The 21 prior for the effect sizes is a point-normal mixture distribution, which allows for 22 non-infinitesimal genetic architectures. The prior has two parameters, the 23 heritability explained by the genotypes, and the fraction of causal markers (i.e. the 24 fraction of markers with non-zero effects). The heritability parameter is estimated 25 from GWAS summary statistics, accounting for sampling noise and LD33-35 (see 26 details below). The fraction of causal markers is allowed to vary and can be 27 optimized with respect to prediction accuracy in a validation data set, analogous to 28 how LD-pruning followed by P-value thresholding (P+T) is applied in practice. 29 Hence, similar to P+T, where P-value thresholds are varied and multiple PRS are 30 calculated, multiple LDpred risk scores are calculated using priors with varying 31 fractions of markers with non-zero effects. The value optimizing prediction 32 accuracy can then be determined in an independent validation data set. We 33 approximate LD using data from a reference panel (e.g. independent validation 34 data). The posterior mean effect sizes are estimated via Markov Chain Monte Carlo 35 (MCMC), and applied to validation data to obtain polygenic risk scores.