DNA Methylation Loci Associated with Atopy and High Serum Ige: a Genome-Wide Application of Recursive Random Forest Feature Selection Todd M
Total Page:16
File Type:pdf, Size:1020Kb
Everson et al. Genome Medicine (2015) 7:89 DOI 10.1186/s13073-015-0213-8 RESEARCH Open Access DNA methylation loci associated with atopy and high serum IgE: a genome-wide application of recursive Random Forest feature selection Todd M. Everson1*, Genevieve Lyons2, Hongmei Zhang3, Nelís Soto-Ramírez3, Gabrielle A. Lockett4, Veeresh K. Patil5, Simon K. Merid6, Cilla Sӧderhӓll7, Erik Melén6,8, John W. Holloway4,9, S. Hasan Arshad5,9 and Wilfried Karmaus3 Abstract Background: The prevalence of allergic diseases are increasing worldwide, emphasizing the need to elucidate their pathogeneses. The aims of this study were to use a two-stage design to identify DNA methylation levels at cytosine–phosphate–guanine (CpG) sites across the genome associated with atopy and high serum immunoglobulin E (IgE), then to replicate our findings in an independent cohort. Methods: Atopy was assessed via skin prick tests and high serum IgE. Methylation levels were measured from whole blood using the Illumina Infinium HumanMethylation450 BeadChip from 18-year-old women (n = 245) and men (n = 122) in the Isle of Wight birth cohort. After data cleaning and processing, and removing probes with possible single nucleotide polymorphisms, DNA methylation levels from 254,460 CpG sites from the 245 women were subjected to recursive Random Forest feature selection for stage 1. The sites selected from stage 1 were tested in stage 2 for associations with atopy and high IgE levels (>200 kU/L) via logistic regression adjusted for predicted cell-type proportions and sex. Sites significantly associated with atopy in stage 2 underwent replication tests in the independent Swedish birth cohort BAMSE (n = 464). Results: In stage 1, 62 sites were selected, of which 22 were associated with atopy in stage 2 (P-value range 6.5E−9 to 1.4E−5) and 12 associated with high IgE levels (P-value range 1.1E−5 to 7.1E−4) at the Bonferroni adjusted alpha (0.05/62 = 0.0008). Of the 19 available sites, 13 were replicated. Conclusions: We identified 13 novel epigenetic loci associated with atopy and high IgE that could serve as candidate loci for future studies; four were within genes with known roles in the immune response (cg04983687 in the body of ZFPM1, cg18219873 in the 5′UTR of PRG2, cg27469152 in the 3′UTR of EPX, and cg09332506 in the body of COPA). Background environmental stimuli during critical prenatal and post- The prevalence of allergic disease is increasing world- natal periods can permanently alter metabolism and in- wide; approximately 40 % of the population of industri- fluence the risk of allergic diseases [4], yet the specific ally developed countries are considered to be affected molecular mechanisms through which this occurs are [1]. Many of these allergic diseases appear to have a poorly understood [1, 5]. hereditary component but are also influenced by envir- Epigenetics, changes in gene activity not caused by alter- onmental stimuli [2], and the origin of the immune ations to the sequence of DNA, may clarify some of these response, including allergen sensitization, is thought to mechanisms because much of cell lineage and tissue- start during the fetal period [3]. It is well recognized that specific gene expression is tightly regulated by epigenetic programming [1]. One of the most commonly studied * Correspondence: [email protected] epigenetic mechanisms is DNA methylation (DNA-M), 1 Department of Epidemiology and Biostatistics, Arnold School of Public the covalent addition of a methyl group to a cytosine Health, University of South Carolina, 915 Greene Street, Columbia, SC 29208, USA followed by a guanine (cytosine–phosphate–guanine; CpG). Full list of author information is available at the end of the article © 2015 Everson et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Everson et al. Genome Medicine (2015) 7:89 Page 2 of 16 Changes in DNA-M affect gene transcription and have selection when the number of variables exceeds the been associated with disease [6]. Some of DNA-M’srolesin number of observations, even when most variables are the development of the immune system, immune cell-fate, uninformative; RF can use both numeric and categorical and allergic diseases have been unlocked, but substantial variables; the algorithm can be optimized fairly easily by gaps in knowledge still exist [1]. varying the RF parameters; and adding trees does not Atopy is defined as a positive reaction to a skin prick cause the model to become over-fit [14, 15, 19]. In test (SPT) or immunoglobulin E (IgE) production in addition, biological processes are probably not linear in response to allergens [7]. IgE plays an important role in nature; rather, they involve interactions between many many, but not all, allergic diseases, for example, asthma, different molecules. Thus it is likely that methylation rhinitis, and eczema [7, 8]. High levels of IgE in the changes at a combination of CpG sites could influence blood are associated with both the risk and severity of disease states. RF allows for the identification of multiple asthma, and cord blood IgE levels have been studied as interacting predictors and identifies which of these are possible predictors of asthma and other atopic allergic most important without imposing a structure or model diseases [4]. Atopy is therefore connected to allergic dis- on the way that it takes place. ease, although many of the details of this relationship Despite its increased presence in the analysis of are still unknown. Epigenetic epidemiology can help to genomic data, few applications of the algorithm have clarify the role that DNA-M plays in atopy by confirm- utilized the variable importance measures (VIM) and its ing candidate loci and revealing novel loci associated potential for feature selection [18]. While RF lends itself with atopy [5]. to a variety of applications, we focused on using it for Advances in genetic biotechnology have made it feas- feature selection. ible to measure DNA-M throughout an individual’s In this study, we implemented a two-stage discovery epigenome and, consequently, epigenetic assessments study within the Isle of Wight (IOW) birth cohort are becoming feasible in larger epidemiologic studies [9]. to first select a set of atopy-candidate CpGs from A growing challenge with epigenetic epidemiology is epigenome-wide data using a subsample and then to test that a vast amount of data is generated and new statis- which of these sites were significantly associated with tical techniques are necessary to make sense of it. This atopy as defined by positive SPT or high total serum IgE is because of small-n-large-p (few observations relative levels in the joint sample. Then, to validate our findings, to the number of predictors) and because traditional we ran replication tests in the independent Swedish methods are not optimized for identifying complex bio- cohort BAMSE. logical processes. Because of the large-scale data gener- ated for each completed methylation assay, techniques Methods to select a subset of informative variables are needed The Isle of Wight birth cohort [10–12], particularly in cases of sparse data in which the The IOW birth cohort was established to study the nat- vast majority of predictors are uninformative [13]. ural history of allergic disease among children born be- Random Forest (RF), developed by Leo Breiman, is a tween 1 January 1989 and 28 February 1990 on the Isle machine learning algorithm used for classification that of Wight, UK. The study was approved by the local can handle the data issues discussed above [14]. A forest research ethics committee (now named the National composed of classification trees is grown using randomly Research Ethics Service, NRES Committee South selected bootstrap samples of the data to form training Central – Southampton B; 06/Q1701/34) and written and testing sets of study participants. At each node informed consent was provided by the infants’ parents. within each tree, the training set is partitioned into dif- After exclusion of adoptions, perinatal deaths, and re- ferent classes with the split determined by a subset of fusals, 1,456 children (95 %) were enrolled. Participants randomly chosen predictors. These two levels of ran- were followed-up at ages 1 (n = 1,167), 2 (n = 1,174), domness, random selection of training/testing sets and 4 (n = 1,218), 10 (n = 1,373), and 18 years (n = 1,313); random testing of predictors, allow the RF to produce detailed questionnaires were administered at each robust classification predictions. Once the forest is follow-up. Details of the birth cohort have been described grown using the training sets, the observations in the elsewhere [20, 21]. At age 18 years, 245 women and 122 testing sets are classified via the forest and misclassifica- men were randomly selected from the cohort for genome- tion rates can be used to evaluate the accuracy of the wide DNA methylation screening as part of another study forest [14]. assessing trans-generational inheritance of atopy. Utilization of RF to analyze array data has increased in recent years [15–18]; it is an ideal method for classifica- Data collection and DNA methylation tion with methylation data for several reasons. Unlike At the 18-year follow-up, most of those who were seen most traditional methods, RF can be used for feature in-person received SPTs using a standard method [22] Everson et al.