Journal of Integrative Bioinformatics, 13(5):306, 2016 http://journal.imbio.de/

Current Progress of High-Throughput MicroRNA Differential Expression Analysis and Random Forest Gene Selection for Model and Non-Model Systems: an R Implementation

Jing Zhang1, Hanane Hadj-Moussa1, Kenneth B. Storey1,*

1Institute of Biochemistry and Department of Biology, Carleton University, 1125 Colonel By Drive, K1S 5B6, Ottawa, Ontario, Canada, http://carleton.ca/

Summary

MicroRNAs are short non-coding RNA transcripts that act as master cellular regulators with roles in orchestrating virtually all biological functions. The affordability and use of high-throughput microRNA profiling technologies have grown alongside the bioinformatics tools available for analyzing the mounting flow of data. While there are many computational resources available for the management of data from genome-sequenced animals, researchers are often faced with the challenge of identifying the biological implications of the daunting amount of data generated from these high-throughput technologies. In this article, we review the current state of high-throughput microRNA expression profiling platforms, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, our R package implementations for differential expression analysis and random forest-based gene selection. Detailed installation guides are available at kenstoreylab.com.

1 Introduction

Recently, microRNAs (miRNAs) have come to the forefront as dynamic gene regulators that are proving to be master regulators of most biological functions [1]. This group of highly conserved, short (17-22 nt) non-coding RNAs is known to regulate over 60% of protein-coding genes in humans.
The broad controls that miRNAs exert are in part due to the ability of an individual miRNA to target multiple mRNAs and the fact that a single mRNA transcript may be subject to regulation by various miRNAs [2]. MiRNAs regulate genes mainly at the post-transcriptional level through binding with partial or perfect complementarity to the 3' untranslated region of their mRNA targets, thereby modulating mRNA stability and translation [3]. Partial complementarity leads to translational repression via mechanisms that are not yet fully elucidated but that include sequestering mRNA transcripts in cytoplasmic loci such as stress granules and P-bodies. In contrast, perfect complementarity, a major silencing mechanism, results in mRNA target degradation by Argonaute endonucleases [3]. Other miRNA silencing mechanisms include cases of gene regulation at the transcriptional level in which miRNAs are able to bind directly to DNA regulatory elements, destabilize mRNAs through cleavage-independent processes, and inhibit mRNA:protein interactions by acting as decoys that directly bind to these RNA-binding proteins [4].

* To whom correspondence should be addressed. Email: [email protected]
Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
doi:10.2390/biecoll-jib-2016-306

On-going studies on miRNA biology have led to evaluation of their uses in diagnostic, prognostic, and therapeutic applications for diseases [5]. Their emerging roles in orchestrating development, cell cycle, metabolism, and the molecular and physiological adaptations required for organisms to respond to various environmental stresses have also attracted tremendous attention [6].
MiRNAs have even been shown to regulate pathogen and host interactions [7]. As such, multiple approaches have been developed for characterizing and assessing the functional roles and biomarker potential of miRNAs. The main technologies currently available for high-throughput miRNA expression profiling are quantitative reverse transcription polymerase chain reaction (qRT-PCR), microarray analysis, and next-generation sequencing (NGS)-based methods. These approaches have received immense attention in recent years, largely due to their increased accessibility and the advancement in computational capacity and data analysis tools.

In this article, we review the current state of miRNA high-throughput expression profiling techniques, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, two automated and easy-to-use R packages for the assessment of differential expression (DE) and for machine learning-based gene selection.

2 Current miRNA research approaches

The short length and uniqueness of miRNAs made large-scale parallel analysis a technical challenge, one that was initially addressed systematically using dot blots and northern blots. Currently, there are three main types of high-throughput miRNA expression profiling approaches: qRT-PCR, hybridization-based miRNA microarray, and NGS-based small RNA-Sequencing (RNA-Seq) [8].

qRT-PCR is appealing to many laboratories due to its simplicity, reproducibility, and low cost [9]. This approach typically features either polyadenylation or artificial stem-loop based target amplification [10]. While the targeted nature of detection enables assessment without the need for a reference genome, it also renders qRT-PCR approaches ineffective for the discovery of novel miRNAs, and makes qRT-PCR better suited as a validation method than as a discovery tool [8].
However, the sensitivity of qRT-PCR allows for the absolute quantification of miRNA transcript abundance levels. When compared to other strategies, the lack of scalability that is characteristic of qRT-PCR renders it inefficient for mass miRNA profiling.

Hybridization-based miRNA microarrays were among the first high-throughput miRNA profiling methods developed. Such methods use a surface fixed with thousands of DNA-based capture probes, each designed to be complementary to a specific fluorescently labelled mature miRNA target [11]. While miRNA microarrays are easily scalable and relatively less expensive than other platforms, they are considered less sensitive or specific, and are also not suited for novel miRNA discovery or absolute quantification of miRNA abundance [8].

The third major approach is the NGS-based massively parallel small RNA-Seq technology [12]. The general principle behind small RNA-Seq is the generation of a small RNA cDNA library from the biological samples, followed by adapter ligation and sequencing [13]. While RNA-Seq is relatively more expensive than other approaches, the continual introduction of newer models and DNA barcoding multiplexing technology has made it more accessible. Small RNA-Seq is also considered to be the main platform for novel miRNA discovery [14]. Limitations of RNA-Seq include the massive computational support required for data analysis and the increased dependence on genome availability that makes it challenging to use with non-sequenced animal models [8].
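Because small RNA-Seq libraries are built by adapter ligation, the first computational step on the resulting reads is usually to remove the 3' adapter and count identical inserts. The following is a minimal Python sketch of that idea, not the procedure of any particular tool. The adapter sequence, the 17-22 nt size window (taken from the miRNA length range quoted in the Introduction), the function names, and the example reads are all assumptions made for illustration.

```python
from collections import Counter

# Illustrative 3' adapter (assumption); each library kit defines its own sequence.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"


def trim_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Return the insert preceding the 3' adapter, or None if no adapter is found.

    Only an exact match of the first `min_overlap` adapter bases is searched for.
    Dedicated trimmers also tolerate mismatches and shorter adapter remnants.
    """
    pos = read.find(adapter[:min_overlap])
    return read[:pos] if pos != -1 else None


def collapse_reads(reads, min_len=17, max_len=22):
    """Trim every read, keep inserts inside the size window, count identical inserts."""
    counts = Counter()
    for read in reads:
        insert = trim_adapter(read.upper())
        if insert is not None and min_len <= len(insert) <= max_len:
            counts[insert] += 1
    return counts


# Made-up reads for demonstration only.
reads = [
    "TGAGGTAGTAGGTTGTATAGTT" + ADAPTER,       # 22-nt insert with full adapter
    "TGAGGTAGTAGGTTGTATAGTT" + ADAPTER[:15],  # same insert, adapter only partly sequenced
    "TAGCTTATCAGACTGATGTTGA" + ADAPTER,       # a second 22-nt insert
    "ACGT" + ADAPTER,                         # insert too short, discarded
    "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG",         # no adapter found, discarded
]
counts = collapse_reads(reads)

for sequence, n in counts.most_common():
    print(sequence, n)
```

Collapsing reads into a sequence-to-count table keeps later steps independent of a reference genome, which is relevant to the non-sequenced animal models discussed above.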
3 Computational tools and workflows for RNA-Seq-based miRNA analysis

The advent of high-throughput miRNA profiling technologies has led to the generation of large datasets that have made computational tools indispensable for miRNA studies. While numerous bioinformatics tools, both public and custom-made, are available, the same general miRNA-profiling data analysis approach can be readily applied to any of the platforms. This includes: (1) raw data processing, (2) quality assessments, (3) identification of conserved and novel miRNAs, (4) DE analysis, (5) target prediction and novel miRNA discovery, as well as (6) other higher level analyses such as gene set enrichment [8]. While specialized programs can be used to perform each of the steps summarized above, comprehensive miRNA analysis pipelines such as miRanalyzer [15] and DSAP [16] are also available. For a detailed discussion of these miRNA bioinformatics tools and others, see Akhtar et al. (2015) [4].

Homology-based search tools that rely on sequence and structure can be used to identify orthologues of conserved miRNAs in numerous species, including non-sequenced animals [17], [18]. These tools generally utilize miRBase [19], the miRNA repository of all known and annotated precursor and mature miRNAs from a range of species. Since many of these tools require well-annotated genomes to effectively identify miRNAs, their usefulness to researchers who work on non-sequenced animal models is limited. As such, more advanced approaches that leverage machine learning strategies and experimental data-driven methods are needed for assessing both conserved and novel miRNAs.
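The sequence component of a homology-based search can be illustrated with a short Python sketch that compares observed small RNA sequences against a set of known mature miRNAs and accepts near-identical matches. This is a deliberately simplified illustration under stated assumptions: the reference entries and their identifiers are hypothetical placeholders rather than records retrieved from a repository, the one-mismatch threshold is arbitrary, and the structural evidence that the tools cited above also evaluate is not modelled.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def annotate(query, reference, max_mismatches=1):
    """Return (reference_id, mismatches) for the closest reference, or None.

    Only references of the same length as the query are compared. Length
    variants and precursor hairpin structure are ignored in this sketch.
    """
    query = query.upper().replace("U", "T")
    best = None
    for ref_id, ref_seq in reference.items():
        ref_seq = ref_seq.upper().replace("U", "T")
        if len(ref_seq) != len(query):
            continue
        distance = hamming(query, ref_seq)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (ref_id, distance)
    return best


# Hypothetical reference entries standing in for annotated mature miRNAs
# from a related, sequenced species (identifiers are placeholders).
reference = {
    "ref-miR-A": "UGAGGUAGUAGGUUGUAUAGUU",
    "ref-miR-B": "UAGCUUAUCAGACUGAUGUUGA",
}

# Made-up query sequences, e.g. collapsed reads from a non-sequenced animal.
queries = {
    "seq1": "TGAGGTAGTAGGTTGTATAGTT",  # identical to ref-miR-A
    "seq2": "TAGCTTATCAGACTGATGTTGT",  # one mismatch to ref-miR-B
    "seq3": "ACGTACGTACGTACGTACGTAC",  # no close reference
}
hits = {name: annotate(seq, reference) for name, seq in queries.items()}

for name, hit in hits.items():
    print(name, hit)
```

Sequences left without a hit, such as `seq3` here, are the candidates that would be passed on to the machine learning and data-driven approaches for novel miRNA assessment.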
Machine learning tools use algorithms to ‘learn’ the sequence, structural, and thermodynamic characteristics of miRNAs [20], [21], and specialized tools such as SMIRP [22] are able to predict novel miRNAs in sequenced non-model organisms. Examples of computational tools that leverage transcriptomic and small RNA-Seq data to characterize conserved and novel miRNAs are miRDeep2 [23] and the machine-learning