A New Efficient Method to Detect Genetic Interactions for Lung Cancer
Total Page:16
File Type:pdf, Size:1020Kb
Luyapan et al. BMC Med Genomics (2020) 13:162 https://doi.org/10.1186/s12920-020-00807-9 TECHNICAL ADVANCE Open Access A new efcient method to detect genetic interactions for lung cancer GWAS Jennifer Luyapan1,2, Xuemei Ji2, Siting Li1,2, Xiangjun Xiao3, Dakai Zhu2,3, Eric J. Duell4, David C. Christiani5,6, Matthew B. Schabath7, Susanne M. Arnold8, Shanbeh Zienolddiny9, Hans Brunnström10, Olle Melander11, Mark D. Thornquist12, Todd A. MacKenzie1,2, Christopher I. Amos1,2,3* and Jiang Gui1,2* Abstract Background: Genome-wide association studies (GWAS) have proven successful in predicting genetic risk of disease using single-locus models; however, identifying single nucleotide polymorphism (SNP) interactions at the genome- wide scale is limited due to computational and statistical challenges. We addressed the computational burden encountered when detecting SNP interactions for survival analysis, such as age of disease-onset. To confront this problem, we developed a novel algorithm, called the Efcient Survival Multifactor Dimensionality Reduction (ES-MDR) method, which used Martingale Residuals as the outcome parameter to estimate survival outcomes, and imple- mented the Quantitative Multifactor Dimensionality Reduction method to identify signifcant interactions associated with age of disease-onset. Methods: To demonstrate efcacy, we evaluated this method on two simulation data sets to estimate the type I error rate and power. Simulations showed that ES-MDR identifed interactions using less computational workload and allowed for adjustment of covariates. We applied ES-MDR on the OncoArray-TRICL Consortium data with 14,935 cases and 12,787 controls for lung cancer (SNPs 108,254) to search over all two-way interactions to identify genetic interactions associated with lung cancer age-of-onset.= We tested the best model in an independent data set from the OncoArray-TRICL data. Results: Our experiment on the OncoArray-TRICL data identifed many one-way and two-way models with a single- base deletion in the noncoding region of BRCA1 (HR 1.24, P 3.15 10–15), as the top marker to predict age of lung cancer onset. = × Conclusions: From the results of our extensive simulations and analysis of a large GWAS study, we demonstrated that our method is an efcient algorithm that identifed genetic interactions to include in our models to predict survival outcomes. Keywords: Genetic interactions, Machine learning, Genome-wide association study, Lung cancer Background A fundamental aim of studying human genetics is to pre- dict disease risk from genomic data. Genome-wide asso- ciation studies (GWAS) that used single-locus models by testing each single nucleotide polymorphism (SNP) for *Correspondence: [email protected]; [email protected] association with a phenotype, proved to be instrumen- 1 Quantitative Biomedical Science Program, Geisel School of Medicine, tal in identifying thousands of genetic variants associ- Dartmouth College, Hanover, NH 03755, USA Full list of author information is available at the end of the article ated with human traits and disorders [1–4]. However, © The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/publi cdoma in/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Luyapan et al. BMC Med Genomics (2020) 13:162 Page 2 of 15 most of the fndings explained only a small proportion environmental toxins, and other confounding factors to of the genetic efects on diseases and traits [1, 5]. Te accurately identify genetic association relations [14]. complex biological mechanisms and genetic architec- Currently, there are limited methods capable of iden- tures of diseases motivated researchers to not only study tifying genome-wide genetic interactions efciently main additive efects of single genetic variations, but also with adjustment for covariates when studying age of interactions between multiple variants with non-additive disease-onset, such as a patient’s age at frst diagnosis efects to explain more of the heritability of complex dis- or recurrence of disease, for large-scale studies due to eases [6–10]. As the availability of large genome-wide computational demands. It is important to have reliable genotype and next generation sequencing data continues estimates on the age of frst diagnosis to understand the to grow, detecting genetic interactions (i.e., SNP interac- etiology of the disease and to tailor clinical practices, tions) will become more feasible with increasing power especially when determining the appropriate starting to detect signifcant associations [11]. At the same time, age for diagnostic screening, such as lung cancer screen- epistasis detection faces computational and statistical ing [15]. In this study, we demonstrated how the Efcient challenges in analyzing high-dimensional data and in Survival Multifactor Dimensionality Reduction (ES- testing millions of interaction models from an exhaustive MDR) method improved on the efciency of Surv-MDR search in GWAS [6, 12]. Te number of tests increases and allowed for adjustment of covariate efects to ana- exponentially when analyzing higher orders of interac- lyze large-scale survival and genetic data to analyze age tions, which require immense computing resources and of disease-onset in association with SNP interactions. processing time. Additionally, if the genotypic combina- Our method used Martingale Residuals as the estimated tions that confer risk are nonadditive, fnding the com- survival outcome with adjustment for confounding fac- binations of genotypes that increase risk can become a tors that provided an efcient and efective identifcation complex combinatorial challenge [7]. of genetic interactions associated with survival out- With the arrival of multi-dimensional and compli- comes. We demonstrated the strength of the proposed cated genetic data sets, researchers have adapted to this method by designing two simulations to evaluate the growth by integrating machine learning methods to ana- 5% type I error threshold through an evaluation of the lyze complex genetic architectures. In genetic epidemiol- empirical null distribution and to analyze the predictive ogy, a popular series of methods were centered around power of ES-MDR. To analyze the efectiveness of the a machine learning approach adapted to detect gene– ES-MDR method, we evaluated our approach using the gene interactions called the Multifactor Dimensionality genome-wide genotyped lung cancer OncoArray-TRICL Reduction (MDR) method. First introduced by Ritchie (Transdisciplinary Research Into Cancer of the Lung) et al. (2001), MDR aimed at reducing high-dimensional Consortium data to detect and characterize SNP interac- genetic interacting loci to a one-dimensional binary tions that were associated with lung cancer age-of-onset. variable that could be easily classifed into high and low risk groups [7]. While MDR have successfully facilitated Methods detection and characterization of multiple genetic loci, In this section, we discuss how we improved the compu- there were disadvantages to this algorithm that limited tational efciency without reducing accuracy to develop its use on diverse data structures such as survival data, the ES-MDR method when analyzing SNP interactions which is often a primary outcome of interest in cancer (i.e., joint efects of two SNPs) in association with age of research. Gui et al. [13, 14] have expanded on the MDR disease-onset. algorithm to diferent phenotypes, survival and continu- ous outcomes data. Survival MDR (Surv-MDR) extended Incorporating martingale residuals for age‑of‑onset the analysis of dichotomous traits in MDR to censored survival analysis and time-to-event survival data using a log-rank test to ES-MDR improved the efciency of Surv-MDR and classify sets of multi-loci combinations. Tis algorithm applied the QMDR algorithm to analyze age of dis- demonstrated profciency in identifying genetic interac- ease-onset in association with genetic interactions. tions associated with censored time-to-death or time- Our novel ES-MDR approach used a combination of to-event data; however, it was more computationally survival analysis and QMDR for continuous outcome demanding than MDR and it did not allow for covariate analysis in two steps. In the frst step, we started replac- adjustments important for controlling confounding fac- ing event time and status with Martingale Residuals tors [13]. Quantitative MDR (QMDR) ofered a computa- with covariate adjustment as a new continuous score. tionally efcient algorithm to identify genetic interactions In the second step, we applied QMDR to efciently cat- associated