Biological Classification with RNA-Seq Data: Can Alternatively Spliced Transcript Expression Enhance Machine Learning Classifier?

Downloaded from rnajournal.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

Authors: Nathan T. Johnson¹, Andi Dhroso¹, Katelyn I. Hughes¹, and Dmitry Korkin¹,²

Affiliations
¹ Worcester Polytechnic Institute, Bioinformatics and Computational Biology Program, Worcester, MA 01609
² Worcester Polytechnic Institute, Department of Computer Science, Worcester, MA 01609

Abstract

The extent to which genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression and is expected to provide better insights into a number of biological and biomedical questions than DNA microarrays. Most importantly, RNA-Seq allows expression to be quantified at both the gene level and the alternatively spliced transcript level. However, leveraging RNA-Seq data requires the development of new data mining and analytics methods. Supervised machine learning methods are commonly used approaches for biological data analysis and have recently gained attention for their applications to RNA-Seq data.

In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that transcript-level expression data is more informative for biological classification tasks than gene-level expression data. Our large-scale assessment uses multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines.
Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for the samples from cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the transcript-based classifiers outperform or are comparable with gene expression-based methods. The top-performing supervised learning techniques reached near-perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis.

Introduction

Ever since the intrinsic role of RNA was proposed by Crick in his Central Dogma (Crick 1970), there has been a desire to accurately annotate and quantify the amount of RNA material in the cell. A decade ago, with the introduction of RNA sequencing (RNA-Seq) (Mortazavi, Williams et al. 2008), it became possible to quantify RNA levels on a whole-genome scale using a probe-free approach, gaining insights into cellular and disease processes and illuminating the details of many critical molecular events such as alternative splicing, gene fusion, single nucleotide variation, and differential gene expression (Conesa, Madrigal et al. 2016). The basic assessment of RNA-Seq is focused on utilizing the data for differential gene expression between the groups of biological importance (Trapnell, Hendrickson et al. 2013).
However, there are additional patterns that can be elucidated from the same raw sequencing data by extracting the expression levels of the alternatively spliced transcripts (Zhang, Pal et al. 2013).

Alternative splicing (AS) of pre-mRNA provides an important means of genetic control (Chen and Manley 2009, Nilsen and Graveley 2010). It is abundant across all eukaryotes and even occurs in some bacteria and archaea (Keren, Lev-Maor et al. 2010, Barbosa-Morais, Irimia et al. 2012, Reddy, Marquez et al. 2013). AS is defined by the rearrangement of exons, introns, and/or untranslated regions that yields multiple transcripts (Kelemen, Convertini et al. 2013). Furthermore, 86-95% of multi-exon human genes are estimated to undergo alternative splicing (Djebali, Davis et al. 2012). Genes tend to express many transcripts simultaneously, 70% of which encode important functional or structural changes for the protein (Djebali, Davis et al. 2012). RNA-Seq data encompasses expression at both gene and transcript levels: the gene-level expression amounts to the combined expression of all transcripts associated with a particular gene. It has been previously demonstrated that gene-level expression is an excellent indicator of the tissue of origin as well as certain cancer types (Wan, Qu et al. 2014, Wei, Shi et al. 2014, Achim, Pettit et al. 2015, Danielsson, James et al. 2015, Mele, Ferreira et al. 2015). However, transcript-level expression has been shown to provide a more precise measurement of gene product dosage, resulting in superior performance in predicting cancer patient prognosis or survival time, and providing further insights into the functional transformations driving cancer (Zhang, Pal et al. 2013, Shen, Wang et al. 2016, Trincado, Sebestyen et al. 2016, Climente-González, Porta-Pardo et al. 2017).
Differential AS depends on many factors, including the epigenetic state, genome sequence, RNA sequence specificity, activators and inhibitors from both proteins and RNAs, as well as post-translational modification (Edwards and Myers 2007, Chen and Manley 2009, Luco, Allo et al. 2011, Gamazon and Stranger 2014). These diverse mechanisms control AS to obtain developmental, cell-type, and tissue-specific expression. Furthermore, the patterns driven by AS and specific to cancer and other diseases have been recently identified (Cáceres and Kornblihtt 2002, Sebestyen, Zawisza et al. 2015).

Machine learning tools developed over the last several decades have significantly advanced the analysis of the vast amount of next-generation sequencing and microarray expression data by discovering the biologically relevant patterns (Tarca, Carey et al. 2007, Liu, Che et al. 2013, Neelima and Babu 2017). Previous studies have utilized unsupervised and supervised machine learning techniques on microarray gene expression data with variable success rates (Vandesompele, De Preter et al. 2002, Libbrecht and Noble 2015). Along with the individual approaches (Jagga and Gupta 2014), large-scale comparative studies have been carried out (Costa, de Carvalho et al. 2004, Pirooznia, Yang et al. 2008). Some studies evaluated both basic and advanced clustering techniques, such as hierarchical clustering, k-means, CLICK, dynamical clustering, and self-organizing maps, to identify the groups of genes that share similar functions or genes that are expressed during the same time point of a mitotic cell cycle (Mudge, Frankish et al. 2013, Consortium 2015, Mele, Ferreira et al. 2015).
Other studies compared the ability to perform disease/healthy sample classification tasks by state-of-the-art supervised methods, such as Support Vector Machines (SVM), Artificial Neural Nets (ANN), Bayesian Networks, Decision Trees, and Random Forest classifiers (Pirooznia, Yang et al. 2008).

When it comes to biological classification, RNA-Seq data present an attractive alternative to microarrays, since it is possible to quantify all RNA present in the sample without the need for a priori knowledge. With RNA-Seq rapidly replacing microarrays, it is necessary to assess the potential of the supervised machine learning methodology applied to RNA-Seq data across multiple datasets and biological questions (Byron, Van Keuren-Jensen et al. 2016). Recently, a limited number of studies have assessed RNA-Seq data with supervised and unsupervised machine learning techniques (Thompson, Tan et al. 2016). However, these studies utilized RNA-Seq data by leveraging only gene-level expression data rather than the more detailed transcript-level data available for the alternatively spliced transcripts (Chen and Manley 2009). Most recently, a study analyzed the utility of RNA-Seq transcript-level data for the disease/non-disease phenotype classification of the samples, showing the advantage of the transcript expression data for the disease phenotype prediction task (Labuzzetta, Antonio et al. 2016). However, the question of whether or not the utility of transcript-level expression presents a general trend across all main biological and biomedical classification tasks remains open.
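
The distinction between the two feature levels can be made concrete with a minimal sketch. It relies only on the statement above that gene-level expression is the combined expression of all transcripts of a gene. The gene and transcript identifiers, the expression values (arbitrary units), and the helper name `gene_level` are hypothetical and chosen purely for illustration; they are not taken from the datasets or pipelines used in this work.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene annotation (illustrative identifiers only).
TRANSCRIPT_TO_GENE = {
    "GENE_A.t1": "GENE_A",
    "GENE_A.t2": "GENE_A",
    "GENE_B.t1": "GENE_B",
}


def gene_level(transcript_expr, tx2gene=TRANSCRIPT_TO_GENE):
    """Sum transcript-level expression values into gene-level values."""
    totals = defaultdict(float)
    for transcript, value in transcript_expr.items():
        totals[tx2gene[transcript]] += value
    return dict(totals)


# Two made-up samples: GENE_A switches its dominant transcript between them,
# while the total amount of GENE_A stays the same.
sample_1 = {"GENE_A.t1": 9.0, "GENE_A.t2": 1.0, "GENE_B.t1": 5.0}
sample_2 = {"GENE_A.t1": 1.0, "GENE_A.t2": 9.0, "GENE_B.t1": 5.0}

genes_1 = gene_level(sample_1)
genes_2 = gene_level(sample_2)

print(genes_1)               # gene-level features of sample 1
print(genes_1 == genes_2)    # True: the gene-level profiles coincide
print(sample_1 == sample_2)  # False: the transcript-level profiles differ
```

In this toy case a classifier given only gene-level features could not tell the two samples apart, whereas the transcript-level features still differ. Whether such differences matter in real data is the empirical question the assessment addresses.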

This work aims to systematically assess how well state-of-the-art supervised machine learning methods perform in various biological classification tasks when utilizing either gene-level or transcript-level expression data obtained from the RNA-Seq experiments. The assessment
