Discovering Pure Gene-Environment Interactions In
Wang et al. BMC Proceedings 2014, 8(Suppl 1):S62
http://www.biomedcentral.com/1753-6561/8/S1/S62

PROCEEDINGS Open Access

Discovering pure gene-environment interactions in blood pressure genome-wide association studies data: a two-step approach incorporating new statistics

Maggie Haitian Wang1, Chien-Hsun Huang2, Tian Zheng2, Shaw-Hwa Lo2, Inchi Hu3*

From Genetic Analysis Workshop 18, Stevenson, WA, USA. 13-17 October 2012

Abstract

Environment has long been known to play an important part in disease etiology. However, not many genome-wide association studies take environmental factors into consideration. There is also a need for new methods to identify the gene-environment interactions. In this study, we propose a 2-step approach incorporating an influence measure that captures pure gene-environment effect. We found that pure gene-age interaction has a stronger association than considering the genetic effect alone for systolic blood pressure, measured by counting the number of single-nucleotide polymorphisms (SNPs) reaching a certain significance level. We analyzed the subjects by dividing them into two age groups and found no overlap in the top identified SNPs between them. This suggested that age might have a nonlinear effect on genetic association. Furthermore, the scores of the top SNPs for the two age subgroups were about 3 times those obtained when using all subjects for systolic blood pressure. In addition, the scores of the older age subgroup were much higher than those for the younger group. The results suggest that genetic effects are stronger in older age and that genetic association studies should take environmental effects into consideration, especially age.

Background

Gene-environment interactions (G × E) have long been known to play an important role in complex disease etiology. Understanding these will reduce the bias in variable selection because of different cohort exposure to the environment [1]. Previous methods of studying G × E effects have mainly included candidate genes, case-only design, and family-based association studies [1,2]. These methods have made their respective assumptions in terms of biological knowledge, independence of gene and environment, and kinship information. There is an urgent need for new methods to detect gene-environment effects. With the emergence of genome-wide association studies (GWAS), data mining methods, such as generalized linear models incorporating G × E terms, are becoming popular [3,4]. We do not know, however, how much of the association identified is a result of main effects and how much is a result of pure G × E interactions. In this study, we used a 2-step method that, first, aggressively removed main effects from both gene and environment, and then tested for the strength of pure G × E interaction. We found that, for systolic blood pressure (SBP), the pure gene-age interaction was stronger than the main effect of single-nucleotide polymorphisms (SNPs) alone. We also analyzed the genetic association separately in two age groups to test the effect of age. We found that the marker profiles were quite distinct in different age cohorts. This suggested that age might have a strong nonlinear effect on genetic association.

* Correspondence: [email protected]
3Department of ISOM, the Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, Hong Kong SAR
Full list of author information is available at the end of the article

© 2014 Wang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Methods

Dataset

The dataset adopted in this study was provided by Genetic Analysis Workshop 18 (GAW18), for which real phenotypes and genotypes from the San Antonio Family Studies are used. We focused on chromosome 3, which includes 62,915 SNPs. There were 142 unrelated individuals, for whom information is available on SBP, diastolic blood pressure (DBP), age, smoking status, and use of antihypertension medication. Although there were 4 longitudinal measurements of the phenotypes, we considered only the first measurement, which had the fewest missing values. After removing the missing values, the data for 130 unrelated individuals were retained for further analysis.

Detecting pure G × E effects

Step 1: Removal of main effect of gene and environment

For each SNP and an environmental factor, we remove their main effects on $y$ by taking the residual (res) of projection pursuit regression (PPR) [5]. The PPR smooths the regression surface following an additive model of (nonlinear) smoothing functions ($S$) based on a linear combination of predictors ($\alpha_m \cdot x$), expressed as follows:

\[ y = \sum_{m=1}^{M} S_{\alpha_m}(\alpha_m \cdot x) + \varepsilon \]

where $S_{\alpha_m}$ is the $m$th smooth function of a linear combination of $x$, and $\varepsilon$ is the residual term. Because PPR does not assume a linear relation of the predictors, both nonlinear and linear effects can be removed from the residual. It is calculated using the R function ppr in a stepwise manner by first removing the main effect of the environment, and then the main effect of the gene, without considering the interactions among them.

Step 2: Evaluation of interactions by an influence measure

An influence measure was introduced by Lo and Zheng [6] to capture the interaction effects based on partitions by a variable subset. It has been shown to be very effective in capturing joint effects, even when main effects are weak. Important SNPs were found for inflammatory bowel disease and confirmed by later experimental results [7]. This also worked in a classification algorithm that achieved the lowest error rates in predicting several cancer datasets [8].

Assuming that we have discrete explanatory variables, for a given subset of variables (either gene-gene [G × G] or G × E), a partition of the observations can be created. For example, if $x_1$ and $x_2$ take values of either 0 or 1, we will have a partition of four cells. If the phenotype of interest is $Y$, the influence measure takes the form

\[ I = n^{-1} \sum_{i} n_i^{2} \left( \bar{Y}_i - \bar{Y} \right)^{2} \]

where $i$ runs through the partition cells, $n_i$ is the number of observations in cell $i$, $n$ is the total number of observations, $\bar{Y}_i$ is the local mean of phenotypes in cell $i$, and $\bar{Y}$ is the overall mean. When the partition contains no association information, the cell mean $\bar{Y}_i$ should be very close to the overall mean $\bar{Y}$. By contrast, when a subset of variables has a joint influence on $Y$, the difference between $\bar{Y}_i$ and $\bar{Y}$ will be large. The effect will be captured by the squared deviation and weighted by $n_i^{2}$, resulting in an elevated I-score. The proposed method complements main effect methods, so one can find main effects first by using another method, and then add the interaction features back.

For each SNP and an environmental factor, the phenotype of interest ($Y$) is replaced by the residual calculated in Step 1, resulting in:

\[ I = n^{-1} \sum_{i} n_i^{2} \left( \overline{res}_i - \overline{res} \right)^{2} \]

The significance of the I-score is evaluated by permutation on the phenotype of the data set $10^{7}$ times.

Dichotomization of age

Smoking and medication are both discrete variables. We dichotomize age by a 2-mean clustering method (k-means in R). The cutoff value was found to be 55. Thus, if age is >55 years, the age is coded as 1; otherwise it is coded 0.

Nonlinear gene-age association

Current GWAS assume that a biomarker affects disease independent of age, so most SNPs identified in the literature are those with strong association across the whole age range. What if some genetic effect is nonlinear with age: in one's youth a group of SNPs influences the phenotype, whereas in old age some other group of SNPs takes effect? To test this hypothesis, we divided the individuals into two groups by the same 2-mean clustering threshold, at age 55 years. There were 76 subjects age ≤55 years (the younger group) and 54 subjects age >55 years (the older group). We selected the top SNPs (G effect) within each group by I-score and checked to what extent these SNPs overlapped.

Results

Detecting pure gene-environment effects

Pure gene-age association is stronger than SNP main effect for SBP

Table 2 Nonlinear age effect on genetic association for SBP
a. Age ≤55 years (76 observations)
Rank  SNP no  SNP name    Gene   I-score  Overall I-score
1     14166   rs9834970   NA     77.94    20.70
2     4557    rs159154    BRPF1  68.90    19.03
3     12457   rs12493391  NA     68.29    34.50
4     58703   rs2239626   DGKG   65.98    10.36
5     269     rs12715600  NA     65.86    39.59

Using the 2-step approach, the I-score of the pure interactions of G × E was calculated after the main effects of both SNP and environmental factors were removed; p values were obtained by permuting the phenotype $10^{7}$ times. Table 1 displays the number of SNPs for which the corresponding pure G × E interactions reached each significance level. The result for G alone appears in the last row of the table. Pure G × E interaction shows a strong association, even when the main effect has been taken away. Consider, for example, SBP: gene-age (G × age) interaction resulted in 150 SNPs with a p value <$10^{-3}$ and 29 SNPs with a p value <$10^{-4}$, far more than the main genetic effect, which had only 41 SNPs with a p value <$10^{-3}$ and 5 SNPs with a p value <$10^{-4}$.
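To make the influence measure and the permutation test described in Methods concrete, the following Python sketch computes the I-score for a partition formed by one SNP and one dichotomized environmental factor, and estimates a permutation p value for it. It is only an illustration under stated assumptions, not the authors' code (the paper reports using R): the function names `i_score` and `permutation_p_value`, the toy residuals, the small number of permutations, and the add-one p-value convention are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def i_score(y, cells):
    """I = (1/n) * sum over cells of n_i^2 * (cell mean - overall mean)^2.

    y     : phenotype values (or Step 1 residuals), one per subject
    cells : hashable cell label per subject, e.g. (snp, age_group)
    """
    n = len(y)
    overall_mean = sum(y) / n
    groups = defaultdict(list)
    for value, cell in zip(y, cells):
        groups[cell].append(value)
    total = 0.0
    for values in groups.values():
        n_i = len(values)
        cell_mean = sum(values) / n_i
        total += n_i ** 2 * (cell_mean - overall_mean) ** 2
    return total / n


def permutation_p_value(y, cells, n_perm=1000, seed=0):
    """Estimate a p value by shuffling the phenotype against fixed cells.

    Assumption: the add-one estimate (exceed + 1) / (n_perm + 1) is used;
    the paper does not state its convention and used 10^7 permutations.
    """
    rng = random.Random(seed)
    observed = i_score(y, cells)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if i_score(shuffled, cells) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Toy data (hypothetical): binary SNP coding, age coded 0/1 as in the paper,
# and residuals that carry a pure interaction (no main effect of either).
snp = [0, 0, 1, 1] * 5
age = [0, 1, 0, 1] * 5
res = [1.0, -1.0, -1.0, 1.0] * 5
cells = list(zip(snp, age))

score = i_score(res, cells)            # 4 cells, n_i = 5, means of +1 or -1
p_value = permutation_p_value(res, cells)
print(score, p_value)
```

In this toy example every cell has five subjects and a cell mean of +1 or -1 around an overall mean of 0, so the score is (1/20) × 4 × 25 × 1 = 5.0, while neither the SNP nor age alone shifts the mean. Weighting by $n_i^{2}$ rather than $n_i$ is what lets large cells with a consistent deviation dominate the score.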