Seq-Ing Improved Gene Expression Estimates from Microarrays Using Machine Learning Paul K

Korir et al. BMC Bioinformatics (2015) 16:286 DOI 10.1186/s12859-015-0712-z RESEARCH ARTICLE Open Access Seq-ing improved gene expression estimates from microarrays using machine learning Paul K. Korir1, Paul Geeleher2 and Cathal Seoighe3,4* Abstract Background: Quantifying gene expression by RNA-Seq has several advantages over microarrays, including greater dynamic range and gene expression estimates on an absolute, rather than a relative scale. Nevertheless, microarrays remain in widespread use, demonstrated by the ever-growing numbers of samples deposited in public repositories. Results: We propose a novel approach to microarray analysis that attains many of the advantages of RNA-Seq. This method, called Machine Learning of Transcript Expression (MaLTE), leverages samples for which both microarray and RNA-Seq data are available, using a Random Forest to learn the relationship between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression estimates. We trained MaLTE on data from the Genotype-Tissue Expression (GTEx) project, consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of human tissues. Conclusion: This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated. Keywords: RNA-Seq, Microarray, Machine learning, Statistical learning Background that directly leverage existing datasets have also been pro- Muchefforthasbeeninvestedindevelopingaccurate posed, for example, sequencing scaled microarray inten- methods to infer gene expression levels from the fluores- sities (SSMI) employs quantile-based scaling to produce cence intensities of microarray probes. Typical analysis intensities that result in higher statistical power to infer pipelines for oligonucleotide microarrays include subtrac- differential expression [5]. tion of background signals, normalization of the signal Although they remain in very widespread use, microar- intensities across samples and summarization of the flu- rays have a number of limitations in comparison to orescence intensities of probes that map to the same sequencing-based methods for the quantification of genomic feature (gene or transcript) [1, 2]. Numerous gene expression (generally referred to as RNA-Seq). approaches have been applied to obtain gene-level expres- Documented advantages of RNA-Seq include improved sion estimates from probe intensities [3] with robust reproducibility and dynamic range and gene expression linear models (RMA, parameter estimation by median- estimates on an absolute scale [6]. The latter attribute polish) and multiplicative models (PLIER, MAS5.0, and facilitates comparison of gene expression levels between dChip) suggested to give the best results [4]. Methods different experiments and also enables the expression of different genes within the same sample to be compared, *Correspondence: [email protected] which is useful, for instance in modeling regulatory net- 3School of Mathematics, Statistics and Applied Mathematics, National works [6]. Currently, the summarized probe intensity val- University of Ireland, Galway, University Road, Galway, Ireland ues obtained from microarrays, by contrast, are on an 4Institute of Infectious Disease and Molecular Medicine, Anzio Road, 7925 Cape Town, South Africa arbitrary scale, cannot easily be compared between exper- Full list of author information is available at the end of the article iments without renormalization and are not well suited to © 2015 Korir et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons. org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Korir et al. BMC Bioinformatics (2015) 16:286 Page 2 of 11 the comparison of expression levels of different genes in (http://www.affymetrix.com). Custom CDF files for the same sample [7–9]. In addition, the chips are designed Human Gene (HuGene11stv1_Hs_ENSG.cdf)that according to specific genome annotations and need to were used in the GTEx project [14] were also down- be adapted when these annotations change. Finally, the loaded from the Brainarray resource (http://brainarray. dynamic range of microarrays (a few hundred fold) is an mbni.med.umich.edu). Probes were extracted using order of magnitude lower than that of RNA-Seq (approx. Affymetrix Power Tools (APT v1.12.0) with and without 9,000 fold depending on the coverage) making them less quantile-normalization and with background correction sensitive [10]. However, because of their relatively low (http://www.affymetrix.com/estore/partners_programs/ cost, microarrays remain in widespread use and are par- programs/developer/tools/powertools.affx). Expression ticularly important for large-scale studies and in smaller estimates obtained using RMA were downloaded from and less well-funded laboratories [11–13]. GEO. Gene expression was also estimated using PLIER Here we investigate an alternative approach to estimat- [15] as implemented in APT using the custom CDF ing gene or transcript expression levels from microar- (HuGene11stv1_Hs_ENSG.cdf). ray probe level fluorescence intensities, which addresses We mapped genes to probes using two datasets: (i) the many of the limitations discussed above. We refer to mapping between probes and probe sets implied within this method as machine learning of transcript expression the probe intensities file and (ii) the mapping between (MaLTE). Given a set of samples for which both microar- probe sets and genes. The data for (ii) were constructed ray and RNA-Seq data are available, MaLTE uses machine- using the BEDTools intersect utility. This utility takes learning techniques to learn the relationship between the two BED files as input: one for gene and another for probe intensities of a set of probes associated with the gene set genomic coordinates. Gene coordinates were deter- and the expression level of the feature as estimated by mined from the Ensembl v.72 gene model [16] while probe RNA-Seq. The regression models can then be applied to set coordinates were downloaded from the Affymetrix estimate gene expression from microarray data for which website (http://www.affymetrix.com). RNA-Seq data is not available. We applied MaLTE to 716 Altogether, we assessed expression estimates for 26,215 samples from a broad range of tissues profiled by the genes. These corresponded to 140,212 transcript iso- Genotype-Tissue Expression (GTEx) project, training the forms. Quantile normalization of gene RPKM values for regression models on a random subset of the samples and the 26,215 genes was performed prior to training and testing on the remainder. Within individual test samples testing. MaLTE as well as RMA and PLIER expression esti- the gene expression estimates from MaLTE approximate mates were also quantile-normalized prior to downstream closely the values estimated by RNA-Seq. Cross-sample analyses. correlation between RNA-Seq and microarray expression estimates for individual genes was also significantly higher Selecting and tuning the best learning algorithm using MaLTE compared to existing methods to estimate We used an independent dataset to select and tune the gene expression from microarrays. Remarkably, MaLTE optimal learning algorithm. We obtained RNA-Seq data can also leverage transcript isoform expression estimates [17, 18] and Affymetrix Human Exon 1.0 ST array CEL produced from RNA-Seq to provide a means to esti- files [19] for 53 Yoruba (YRI) and 44 European (CEU) mate the expression levels of specific isoforms from the HapMap cell lines [20]. All individuals were unrelated. microarray data. Data was downloaded from the GEO database [21] under accessions GSE25030 (CEU RNA-Seq), GSE19480 (YRI Materials and Methods RNA-Seq) and GSE7851 (CEU and YRI exon array data). GTEx data preparation We compared the performance of several learning Affymetrix Human Gene 1.1 ST microarray and RNA- algorithms: classification and regression trees (CART), Seq expression data (derived from 837 samples across 29 multivariate adaptive regression splines (MARS), boosted tissue types and three cell lines) were downloaded from regression trees (BRT), random forest (RF), conditional the genotype-tissue expression (GTEx) project website random forest (CRF) and quantile regression random for- (http://www.broadinstitute.org/gtex) on 30th July 2013. est (QRF) [22–26]. All learning algorithms were compared For 91 of the samples, only microarray data was available to median-polish and PLIER. All methods were applied and multiple biological replicates were included for sev- in the R statistical computing environment [27]. Perfor- eral of the cell lines. A total of 716 of the samples were mance was evaluated based on the mean cross-sample used in this study. correlation over all genes (see Results for a definition). Raw and RMA (median-polish) processed microarray We identified CRF as the

Seq-Ing Improved Gene Expression Estimates from Microarrays Using Machine Learning Paul K

Birth Cohort Effects Among US-Born Adults Born in the 1980S: Foreshadowing Future Trends in US Obesity Prevalence

Median Polish with Covariate on Before and After Data

BIOINFORMATICS ORIGINAL PAPER Doi:10.1093/Bioinformatics/Btm145

Exploratory Data Analysis

Expression Summarization Interrogate for Each Gene, Called a Probe Set

Median Polish Algorithm

Introductory Methods for Area Data Analysis

Generalised Median Polish Based on Additive Generators

A New Statistical Model for Analyzing Rating Scale Data Pertaining to Word Meaning

Statistical Analysis of Some Gas Chromatography Measurements

Using Approximations to Scale Exploratory Data Analysis In

Universiti Putra Malaysia Median Polish Techniques