A General Statistic Framework for Genome-Based Disease Risk Prediction
Long Ma1, Nan Lin1 and Momiao Xiong1,*

1 Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA

Running Title: biomarker identification for risk prediction

Key Words: genetic variants, clinical utility, sufficient dimension reduction, risk prediction, convex optimization.

*Address for correspondence and reprints: Dr. Momiao Xiong, Human Genetics Center, The University of Texas Health Science Center at Houston, P.O. Box 20186, Houston, Texas 77225, (Phone): 713-500-9894, (Fax): 713-500-0900, E-mail: [email protected].

Abstract

Faster and more economical next-generation sequencing (NGS) technologies will generate unprecedentedly massive, high-dimensional genomic and epigenomic variation data. In the near future, sequenced genomes will be a routine part of medical records. How to efficiently extract biomarkers for risk prediction and treatment selection from millions or tens of millions of genomic variants poses a great challenge. The traditional paradigm for identifying variants of clinical validity is to test the variants for association. However, significantly associated genetic variants may or may not be useful for the diagnosis and prognosis of diseases. An alternative to association studies for finding genetic variants of predictive utility is to systematically search for variants that contain sufficient information for phenotype prediction. To achieve this, we introduce the concepts of sufficient dimension reduction (SDR) and the coordinate hypothesis, which project the original high-dimensional data onto a very low-dimensional space while preserving all information on the response phenotypes. We then formulate the discovery of clinically significant genetic variants as a sparse SDR problem and develop algorithms that can select significant genetic variants from millions, or even tens of millions, of predictors with the aid of a split-and-conquer approach.
The sparse SDR is in turn formulated as a sparse optimal scoring problem, but with a penalty that can remove row vectors from the basis matrix. To speed up computation, we apply the alternating direction method of multipliers (ADMM) to solve the sparse optimal scoring problem; the algorithm can easily be implemented in parallel. To illustrate its application, the proposed method is applied to a coronary artery disease (CAD) dataset from the Wellcome Trust Case Control Consortium (WTCCC) study, a rheumatoid arthritis (RA) dataset from the GWAS of the North American Rheumatoid Arthritis Consortium (NARAC), and the early-onset myocardial infarction (EOMI) exome sequence datasets of European origin from the NHLBI's Exome Sequencing Project.

Introduction

Although there is heated debate about whether DNA variation has value for predicting diseases1-4, the genetic variation data being generated by next-generation sequencing (NGS) technologies will provide invaluable information for disease prediction and prevention5. However, how to efficiently extract genetic variants for risk prediction and treatment selection from millions or tens of millions of genomic variants is among the biggest challenges in genomic research and medicine6. Innovative approaches should be developed to address these challenges.

Traditional paradigms for identifying variants of clinical validity test the variants for association7. The variants are ranked by P-values or odds ratios. A few "top SNPs" with the smallest P-values are then used to predict the disease risk of individuals8,9. However, most variants have small effect sizes, and the heritability of common diseases is often low to moderate. Many associated genetic variants, on their own, have little or no predictive utility. Studies based on a small number of loci demonstrate limited predictive value10-14. The best predictors may not always be at the top of the list, and variable selection by P-values will miss many important predictors8.
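The top-SNP paradigm criticized above can be sketched in a few lines. This is a minimal illustration under assumed inputs (a 0/1/2 genotype matrix, a binary case-control phenotype, and a single-SNP chi-square test), not the pipeline of any of the cited studies; the function name is our own:

```python
import numpy as np
from scipy.stats import chi2_contingency

def rank_snps_by_pvalue(genotypes, phenotype, top_k=10):
    """Rank SNPs by single-marker chi-square association P-value.

    genotypes: (n_samples, n_snps) array of 0/1/2 minor-allele counts.
    phenotype: (n_samples,) array of 0/1 case-control labels.
    Returns the indices of the top_k SNPs and their P-values.
    """
    pvalues = []
    for j in range(genotypes.shape[1]):
        # build the 2 x 3 phenotype-by-genotype contingency table
        table = np.zeros((2, 3))
        for y, g in zip(phenotype, genotypes[:, j]):
            table[int(y), int(g)] += 1
        # drop empty genotype columns so expected counts stay positive
        table = table[:, table.sum(axis=0) > 0]
        _, p, _, _ = chi2_contingency(table)
        pvalues.append(p)
    order = np.argsort(pvalues)
    return order[:top_k], np.asarray(pvalues)[order[:top_k]]
```

As the text notes, ranking markers one at a time in this way ignores joint effects, which is why the best multivariate predictors need not appear near the top of such a list.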
To overcome this limitation, rather than using a few highly significant SNPs from genome-wide association studies (GWAS), some researchers use penalized techniques such as penalized pseudo-likelihood approaches with the bridge penalty15, the LASSO16, the SCAD and other folded-concave penalties17, and the Dantzig selector18. These and related methods19 have recently been developed to directly search a large number of genetic variants20 for a set of SNPs predictive of disease risk, with mild success.

The widely used methods for disease risk prediction have several serious limitations. Dimension reduction, in which we attempt to identify linear or nonlinear combinations of the original set of variables or choose a subset of variables from the original set21, is an essential issue for the genome-based prediction of common diseases. The first limitation of the current methods for disease risk prediction is their use of unsupervised dimension reduction, which fails to make use of phenotype information. The second limitation of these methods is their lack of algorithms to systematically search millions of genetic variants for disease risk prediction. We observe that LASSO-type methods remain predictive for at most about 50,000 features. Simple application of the penalized pseudo-likelihood approaches to millions or tens of millions of features is infeasible. How to select an informative subset of features from millions of variables is still an open question. The genome-based prediction of common diseases often demands extremely intensive computation. A practical solution is to employ the power of parallel computing. The third limitation is that it is difficult to adapt the current methods for risk prediction to parallel computing. The purpose of this paper is to address these limitations of the traditional methods for disease risk prediction.
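As a concrete illustration of the LASSO-type penalization discussed above, the following is a minimal cyclic coordinate-descent LASSO solver for a linear-regression surrogate. It is a sketch under the assumption of roughly standardized, nonzero predictor columns, not the penalized pseudo-likelihood machinery of the cited papers, and the function name is our own:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X @ beta||^2 + lam * ||beta||_1
    by cyclic coordinate descent with soft-thresholding updates."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j removed from the current fit
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            # soft-threshold: weak partial correlations become exactly zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta
```

Predictors whose coefficients are driven exactly to zero are dropped, which is how LASSO-type methods select a subset of SNPs; the inner loop over all p coordinates is also why such methods become impractical when p reaches into the millions, as the text observes.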
Dimension reduction and variable selection are a dominant theme in genome-based disease risk prediction. To overcome the most important limitation of the currently used unsupervised dimension reduction in disease risk prediction, namely the loss of individual disease status information, we propose a supervised dimension reduction method in which both genetic variant information and disease information are used. Essential concepts in supervised dimension reduction are sufficient dimension reduction (SDR) and testing of the coordinate relation hypothesis22,23. Since the dimension of genomic variation data is extremely high and most genetic variants across the genome are irrelevant to disease, we replace the original predictor vector (genetic variants) with its projection onto a low-dimensional subspace of the predictor space, which results in a minimal set of linear combinations of the original predictors, without loss of information on disease or response24,25. Such a low-dimensional space is called a dimension reduction subspace22. The minimum dimension reduction subspace is referred to as the central subspace (CS). All information of the predictors about disease or response in the genetic dataset is contained in the central subspace. The central subspace provides sufficient information to predict the response or risk. We only need to study the relationships between responses (disease risk) and variables in the central subspace, regardless of the original predictors. A number of methods, including sliced inverse regression (SIR)26, sliced average variance estimation27, principal Hessian directions25,28, directional regression29, and likelihood-based SDR30, have been developed to estimate the central subspace. Although great progress in SDR has been made, most current methods for dimension reduction still have serious limitations in their practical applications.
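To make the central-subspace idea concrete, here is a minimal sliced inverse regression (SIR) sketch in the spirit of the cited method26. It assumes a nonsingular predictor covariance and a small number of predictors, which is exactly the regime whose limitations are discussed next; the function name and defaults are our own:

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=5, n_components=1):
    """Estimate central-subspace directions with sliced inverse regression.

    Slices the sorted response y, averages the standardized predictors
    within each slice, and eigen-decomposes the between-slice covariance
    of those slice means.
    """
    n, p = X.shape
    # standardize predictors: Z = (X - mean) @ Sigma^{-1/2}
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    # inverse square root via eigendecomposition (assumes cov nonsingular)
    w, V = np.linalg.eigh(cov)
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Z = Xc @ inv_sqrt
    # partition the samples into slices by sorted response value
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for s in slices:
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)
    # leading eigenvectors of M, mapped back to the original predictor scale
    _, vecs = np.linalg.eigh(M)
    directions = inv_sqrt @ vecs[:, -n_components:]
    return directions / np.linalg.norm(directions, axis=0)
```

The explicit inversion of the sample covariance in the whitening step is precisely what fails when the number of predictors is large, motivating the sparse formulation developed in this paper.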
The first limitation is that the currently popular methods for SDR require the inverse of the sample covariance matrix of the predictors, which is often near-singular for a large number of predictors31. Second, the CS estimation needs to use all the original predictors. Therefore, the estimates are not efficient, due to the presence of a large amount of irrelevant and noisy data, and the results are difficult to interpret32. These methods cannot be used to identify genetic variants of clinical significance.

In the past several years, variable selection methods have been explored in SDR analysis to address these two problems. The variable selection methods for SDR include model-free variable selection33, shrinkage sliced inverse regression34, sparse SDR35, and constrained canonical correlation with variable filtering for dimension reduction36. The common feature of these methods is that the identification of the CS and variable selection are separated. At the first step, the relevant SDR methods are used to estimate the CS. At the second step, regularization methods are applied to the estimated basis vectors forming the CS to select predictors. In addition, these methods still cannot solve the singularity problem of the covariance matrix of the predictors if all predictors are used for SDR at the first stage. Chen et al.32 proposed coordinate-independent sparse sufficient dimension reduction and variable selection for SDR, which can simultaneously reduce dimension and select variables. However, they need to partition all response variables into slices and still require the inverse of the sample covariance matrix of the predictors. As a consequence, their method can only solve SDR with a small number of predictors. To avoid the inverse of the covariance of the predictors, Wang and Zhu31 used the optimal scoring