A Comparison of Feature Selection and Classification Methods in DNA

Zhuang et al. BMC Bioinformatics 2012, 13:59 http://www.biomedcentral.com/1471-2105/13/59 RESEARCHARTICLE Open Access A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform Joanna Zhuang1,2, Martin Widschwendter2 and Andrew E Teschendorff1* Abstract Background: The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known as to how best to perform feature selection or classification in the context of Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context. Results: Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we here provide an evaluation of popular feature selection, dimensional reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and b-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis. Conclusions: Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays. Keywords: DNA methylation, Classification, Feature selection, Beadarrays * Correspondence: [email protected] 1Statistical Genomics Group, Paul O’Gorman Building, UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK Full list of author information is available at the end of the article © 2012 Zhuang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Zhuang et al. BMC Bioinformatics 2012, 13:59 Page 2 of 14 http://www.biomedcentral.com/1471-2105/13/59 Background used in the context of other methylation array technolo- DNA methylation (DNAm) is one of the most important gies [37]. The M-value is also an analogue of the quan- epigenetic mechanisms regulating gene expression, and tity which has been widely used in expression aberrant DNAm has been implicated in the initiation microarray analysis, although there are two important and progression of human cancers [1,2]. DNAm changes differences: in the Infinium DNAm (type I) assay, both have also been observed in normal tissue as a function probes are (i) always measured in the same colour chan- of age [3-8], and age-associated DNAm markers have nel, thus dye bias does not need to be adjusted for, and been proposed as early detection or cancer risk markers (ii) the two probes are measured in the same sample. A [3,6-8]. Proper statistical analysis of genome-wide DNA recent study based on a titration experiment compared methylation profiles is therefore critically important for b andM-valuesandconcludedthatM-values,owingto the discovery of novel DNAm based biomarkers. How- their more homoscedastic nature (i.e. variance being ever, the nature of DNA methylation data presents independent of the mean), were a better measure to use novel statistical challenges and it is therefore unclear if [19]. However, this study was limited to one data set popular statistical methods used in the gene expression and feature selection was only investigated using fold- community can be translated to the DNAm context [9]. changes, while statistics and P-values were not The Illumina Infinium HumanMethylation27 Bead- considered. Chip assay is a relatively recent high-throughput tech- The purpose of our study is therefore two-fold: (i) to nology [10] that allows over 27,000 CpGs to be assayed. provide a comprehensive comparison of the perfor- While a growing number of Infinium 27k data sets have mance of b and M-value measures in the context of been deposited in the public domain [3,4,11-15], rela- popular feature selection tools that use actual statistics tively few studies have compared statistical analysis to rank features (CpGs), and (ii) to compare the perfor- methods for this platform. In fact, most statistical mance of some of the most popular feature selection reports on Infinium 27k DNAm data have focused on and classification methods on DNAm data from differ- unsupervised clustering and normalisation methods ent tissues and correlating with different phenotypes, so [16-19], but as yet no study has performed a compre- as to gain objective insights into how best to perform hensive comparison of feature selection and classifica- feature selection and classification in this novel context. tion methods in this type of data. This is surprising To address these goals, we make use of 7 independent given that feature selection and classification methods sets of human DNA methylation data, all generated with have been extensively explored in the context of gene the 27K Illumina Infinium platform, encompassing over expression data, see e.g. [20-33]. Moreover, feature 1000 samples and representing over 29 million data selection can be of critical importance, as demonstrated points. by gene expression studies, where for instance use of higher order statistics has helped identify important Materials and methods novel cancer subtypes [24,34]. Given that the high den- Data sets sity Illumina Infinium 450k methylation array is now The 7 human DNA methylation data sets are sum- starting to be used [10,35] and that this array offers the marised in Table 1. All the data were generated using coverage and scalability for epigenome wide association the Illumina Infinium HumanMethylation27 BeadChip studies (EWAS) [36], it has become a critical and urgent assay that enables the direct measurement of methyla- question to determine how best to perform feature tion at over 27,000 individual cytosines at CpG loci selection on these beadarrays. located primarily in the promoter regions of 14,495 The Illumina Infinium assay utilizes a pair of probes unique genes. All data sets, except the TCGA lung can- for each CpG site, one probe for the methylated allele cer set, followed the same quality control and and the other for the unmethylated version [10]. The methylation level is then estimated, based on the mea- Table 1 The Illumina Infinium 27 k data sets sured intensities of this pair of probes. Two quantifica- Data Set #CpGs Size Tissue #(N, C) Age range Reference tion methods have been proposed to estimate the methylation level: (i) b-values and (ii) M-values. While UKOPS 25,642 261 blood (148,113) 50-84 GSE19711 the b-value measures the percentage of methylation at ENDOM 25,998 87 tissue (23,64) 32-90 GSE33422 the given CpG site and is the manufacturer’s recom- CERVX 26,698 63 tissue (15,48) 25-92 GSE30760 mended measure to use [10], the M-value has recently OVC 27,578 177 tissue (0,177) 25-89 [3] been proposed as an alternative measure [19]. The M- T1D 22,486 187 blood (187,0) 24-74 GSE20067 value is defined by the log2 ratio of the intensities of the BC 24,206 136 tissue (23,113) 31-90 GSE32393 methylated to unmethylated probe and has also been LC 23,067 151 tissue (24,127) NA TCGA Zhuang et al. BMC Bioinformatics 2012, 13:59 Page 3 of 14 http://www.biomedcentral.com/1471-2105/13/59 normalisation strategy. Briefly, non background-cor- (i.e. 100 in b-value and 1 in M-value) have only a small rected data was used and only CpGs with intensity effect on both measurements. Hence, the relationship detection P-values less than 0.05 (i.e. significantly higher between the b and M-values is approximately logistic: intensity above the background determined by the nega- β ∧ ∧ β − β tive controls) were selected. Inter-array normalisation = 2 M / 2 M+1 ;M=log2 /1 . (3) correcting for beadchip effects and variations in bisulfite While the b-value has a more intuitive biological conversion efficiency were performed using a linear interpretation it suffers from severe heteroscedasticity model framework with explicit adjustment for these fac- (the intrinsic variability is much lower for features tors, but only if these factors were significantly corre- which are either unmethylated or methylated, with lated with PCA/SVD components.

A Comparison of Feature Selection and Classification Methods in DNA

High-Throughput Analysis Suggests Differences in Journal False

Tracking the Popularity and Outcomes of All Biorxiv Preprints

Event Extraction on Pubmed Scale Filip Ginter1*, Jari Björne1,2, Sampo Pyysalo3 from Workshop on Advances in Bio Text Mining Ghent, Belgium

Use and Mis-Use of Supplementary Material in Science Publications Mihai Pop1* and Steven L

Viewed: 8 Were Accepted As Full Papers, 4 As Poster Large Investment of Effort Needed to Develop Sufficiently Papers and One As a Demo Paper

Mining Locus Tags in Pubmed Central to Improve Microbial Gene Annotation Chris J Stubben* and Jean F Challacombe

Analyzing the Field of Bioinformatics with the Multi-Faceted Topic Modeling Technique Go Eun Heo1, Keun Young Kang1, Min Song1* and Jeong-Hoon Lee2

Semantic Web Applications and Tools for the Life Sciences: SWAT4LS 2010

Paperbot: Open-Source Web-Based Search and Metadata Organization of Scientific Literature Patricia Maraver1,2, Rubén Armañanzas1,2, Todd A

BMC Bioinformatics Biomed Central

Knowledge Discovery of Drug Data on the Example of Adverse Reaction Prediction Pinar Yildirim1*, Ljiljana Majnarić2, Ozgur Ilyas Ekmekci1, Andreas Holzinger3

Interactive Metagenomic Visualization in a Web Browser Ondov Et Al