Journal of Integrative Bioinformatics, 13(5):306, 2016
doi:10.2390/biecoll-jib-2016-306
http://journal.imbio.de/

Current Progress of High-Throughput MicroRNA Differential Expression Analysis and Random Forest Gene Selection for Model and Non-Model Systems: an R Implementation

Jing Zhang1, Hanane Hadj-Moussa1, Kenneth B. Storey1,*

1 Institute of Biochemistry and Department of Biology, Carleton University, 1125 Colonel By Drive, K1S 5B6, Ottawa, Ontario, Canada, http://carleton.ca/

* To whom correspondence should be addressed. Email: [email protected]

Summary

MicroRNAs are short non-coding RNA transcripts that act as master cellular regulators with roles in orchestrating virtually all biological functions. The recent affordability and widespread use of high-throughput microRNA profiling technologies have grown along with the advancement of bioinformatics tools available for analysis of the mounting data flow. While there are many computational resources available for the management of data from genome-sequenced animals, researchers are often faced with the challenge of identifying the biological implications of the daunting amount of data generated from these high-throughput technologies. In this article, we review the current state of high-throughput microRNA expression profiling platforms, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, our R package implementations for differential expression analysis and random forest-based gene selection. Detailed installation guides are available at kenstoreylab.com.

1 Introduction

Recently, microRNAs (miRNAs) have come to the forefront as dynamic regulators that are proving to be master regulators of most biological functions [1]. This group of highly-conserved, short (17-22 nt) non-coding RNAs is known to regulate over 60% of protein-coding genes in humans. The broad controls that miRNAs exert are in part due to the ability of an individual miRNA to target multiple mRNAs and the fact that a single mRNA transcript may be subject to regulation by various miRNAs [2]. MiRNAs regulate mainly at the post-transcriptional level through binding with partial or perfect complementarity to the 3' untranslated region of their mRNA targets, thereby modulating mRNA stability and translation [3]. Partial complementarity leads to translational repression via mechanisms that are yet to be fully elucidated but that include sequestering mRNA transcripts in cytoplasmic loci such as stress granules and P-bodies. However, perfect complementarity, a major silencing mechanism, results in mRNA target degradation by Argonaute endonucleases [3]. Other miRNA silencing mechanisms include cases of gene regulation at the transcriptional level in which miRNAs are able to directly bind to DNA regulatory elements, destabilize mRNAs through cleavage-independent processes, and inhibit mRNA:protein interactions by acting as decoys that directly bind to these RNA-binding proteins [4].

Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).

On-going studies on miRNA biology have led to evaluation of their uses in diagnostic, prognostic, and therapeutic applications for diseases [5]. Their emerging roles in orchestrating development, cell cycle, metabolism, and the molecular and physiological adaptations required for organisms to respond to various environmental stresses have also attracted tremendous attention [6]. MiRNAs have even been shown to regulate pathogen and host interactions [7]. As such, multiple technologies have been developed for characterizing the functional roles and potential of miRNAs. The main approaches currently available for assessing high-throughput miRNA expression profiling include: quantitative reverse transcription polymerase chain reaction (qRT-PCR), microarray analysis, and next generation sequencing (NGS) based methods. These approaches have received immense attention in recent years, largely due to their increased accessibility and the advancement in computational capacity and data analysis tools. In this article, we review the current state of high-throughput miRNA expression profiling techniques, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, two automated and easy-to-use R packages for the assessment of differential expression (DE) and for machine learning-based gene selection.

2 Current miRNA research approaches

The short length and uniqueness of miRNAs made large-scale parallel analysis a technical challenge that was initially systematically addressed using dot blots and northern blots. Currently, there are three main types of high-throughput miRNA expression profiling approaches: qRT-PCR, hybridization-based miRNA microarray, and NGS based small RNA-Sequencing (RNA-Seq) [8]. The qRT-PCR profiling is appealing to many laboratories due to its simplicity, reproducibility, and low cost [9]. This targeted approach typically features either polyadenylation or artificial stem-loop based target amplification [10]. While the targeted nature of detection enables assessment without the need for a reference genome, it also renders qRT-PCR approaches ineffective for the discovery of novel miRNAs, and makes it better suited as a validation method rather than a discovery tool [8]. However, the sensitivity of qRT-PCR allows for the absolute quantification of miRNA transcript abundance levels. When compared to other strategies, the lack of scalability that is characteristic of qRT-PCR renders it inefficient for mass miRNA profiling.

Hybridization-based miRNA microarrays were among the first high-throughput miRNA profiling methods developed. Such microarrays use a surface fixed with thousands of DNA-based capture probes, designed to be complementary to a specific fluorescently labelled mature miRNA target [11]. While miRNA microarrays are easily scalable and relatively less expensive than other platforms, they are considered less sensitive or specific, and are also not suited for novel miRNA discovery and absolute quantification of miRNA abundance [8].

The third major approach is the NGS-based massively parallel small RNA-Seq technology [12]. The general principle behind small RNA-Seq is the generation of a small RNA cDNA library from the biological samples, followed by adapter ligation, and sequencing [13]. While RNA-Seq is relatively more expensive than other approaches, the continual introduction of newer models and DNA barcoding multiplexing technology have made it more accessible. Small RNA-Seq is also considered to be the main platform for novel miRNA discovery [14]. Limitations of RNA-Seq include the massive computational support required for data analysis and the increased dependence on genome availability that makes it challenging to use with non-sequenced animal models [8].

3 Computational tools and workflows for RNA-Seq-based miRNA analysis

The advent of high-throughput miRNA profiling technologies has led to the generation of large datasets that have made computational tools indispensable for miRNA studies. While numerous bioinformatics tools, both public and custom-made, are available, the same general miRNA-profiling data analysis approach can be readily applied to any of the platforms. This includes (1) raw data processing, (2) quality assessments, (3) DE analysis, (4) identification of conserved and novel miRNAs, (5) target prediction and discovery, as well as (6) other higher level analyses such as gene set enrichment [8]. While specialized programs can be used to perform each of the steps summarized above, comprehensive miRNA analysis pipelines such as miRanalyzer [15] and DSAP [16] are also available. For a detailed discussion of these miRNA bioinformatics tools and others, see Akhtar et al. (2015) [4].

Homology based search tools that rely on conserved sequence and structure can be used to identify orthologues of miRNAs in numerous species, including non-sequenced animals [17], [18]. These tools generally utilize miRBase [19], the repository of all known annotated precursor and mature miRNAs from a range of species. Since many of these tools require well-annotated genomes to effectively identify miRNAs, their usefulness to researchers that work on non-sequenced animal models is limited. As such, more advanced approaches that involve leveraging machine learning algorithms and experimental data driven methods are needed for assessing both conserved and novel miRNAs. Machine learning tools use strategies to 'learn' the sequence, structural, and thermodynamic characteristics of miRNAs [20], [21], where specialized tools such as SMIRP [22] are able to predict novel miRNAs in non-sequenced non-model organisms. Examples of computational tools that leverage transcriptomic and small RNA-Seq data to characterize conserved and novel miRNAs are miRDeep2 [23] and the machine-learning tool miReader [24]. The level of error that is introduced in these studies has been shown to be inconsequential [23] and the DE results for most abundant miRNAs are acceptable on a per project basis [25].

3.1 General workflow for conserved miRNA expression analysis

Here we describe a general data processing and analysis workflow for small RNA-Seq based miRNA expression profiling (DE analysis) with a focus on conserved miRNAs and applications for comparative studies (Fig. 1). It should be noted that the procedure outlined herein is but one example of a miRNA data analysis pipeline and that the assembly of computational tools can be customized. Overall, the current pipeline features two stages: data processing (Unix Shell environment), and expression analysis (R environment). Since we are focusing on conserved miRNAs, we do not require a complete annotated genome with comprehensive transcriptomes for these analyses.

The data processing stage starts with performing read quality checks using FastQC [26], followed by adapter trimming and size filtering with fastq-mcf [27]. Since there is currently no species specific non-miRNA small RNA database for non-sequenced animals, the negative reference database was built using all sequences from the Rfam [28] and piRNA databases [29]. The trimmed and filtered reads are then aligned to this negative reference database using Bowtie [30] and the unaligned 'clean' reads are carried on to the next steps. Similarly, as the miRNAs from the species of interest have yet to be annotated, the entire mature miRNA database of all species is obtained from miRBase [31] as the positive reference database, to which the 'clean' reads are aligned using Bowtie. HTSeq [32] is then used to count the aligned reads for each of the identified miRNAs, serving as the basis for the upcoming DE analysis steps.
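The negative/positive reference logic of the data processing stage can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the RBioMIR pipeline: exact sequence lookup stands in for the Bowtie alignment step, the function name `filter_and_count` is hypothetical, and all sequences and miRNA names are made up for illustration.

```python
from collections import Counter


def filter_and_count(reads, negative_ref, positive_ref):
    """Drop reads that hit the negative reference, then count the remaining
    'clean' reads against the positive (mature miRNA) reference.

    Assumption: exact sequence lookup is a simplified stand-in for alignment.
    """
    # Reads matching non-miRNA small RNAs are discarded.
    clean = [r for r in reads if r not in negative_ref]
    # Clean reads matching a known mature miRNA are counted per miRNA name.
    counts = Counter(positive_ref[r] for r in clean if r in positive_ref)
    return clean, counts


# Hypothetical toy data (sequences and names are illustrative only).
negative_ref = {"GGGCCCAAATTT"}  # stands in for Rfam/piRNA entries
positive_ref = {"TGAGGTAGTAGG": "miR-A", "AAGCTGCCAGTT": "miR-B"}
reads = [
    "TGAGGTAGTAGG",
    "GGGCCCAAATTT",
    "TGAGGTAGTAGG",
    "AAGCTGCCAGTT",
    "CCCCCCCCCCCC",
]

clean, counts = filter_and_count(reads, negative_ref, positive_ref)
print(len(clean), dict(counts))
```

In a real run the lookups would be replaced by alignment against indexed reference databases, and reads that match neither reference (the last toy read here) would simply go uncounted.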

The data analysis stage uses the R programming language. First, read counts are imported to R, and in cases where a miRNA species is duplicated, the highest count should be used. Such read count discrepancy originates from the conservation between the species of interest and the species to which the reads are aligned. R packages edgeR [33] and limma [34] are then used to normalize the read counts before DE analysis with limma. The DE results are then exported as comma separated values (csv) files and visualized using ggplot2 and gplots [35], [36]. The results can then be used for downstream analysis such as machine learning-based gene selection (see below). To streamline this complex analysis, we have wrapped all steps and required tools in the Unix Shell commands and the R package RBioMIR. The RBioMIR user manual can be found at (http://kenstoreylab.com/?page_id=2540). The source code and R package can be downloaded through GitHub: (https://github.com/jzhangc/git_RBioMIR.git).

Figure 1: A flowchart depicting the general RNA-Seq data processing and analysis steps necessary for assessing conserved miRNA expression in non-model species.

4 Random forest-based feature selection for identifying important miRNAs

Small RNA-Seq and high-throughput qRT-PCR based miRNA profiling result in large high dimensional (i.e., high feature count, also denoted as p) datasets. In the context of comparative molecular physiology, preserving meaningful information from gene expression profiles by eliminating irrelevant features, a process known as feature selection (FS), is critical for mechanistic characterization and biomarker discovery. As such, a myriad of FS methods using, for example, univariate ranking, conventional modelling, or machine learning classifiers have been developed [37], [38].

Random forest (RF) is a machine learning classifier that is well-suited for the processing of high dimensional datasets, due to its robustness, versatility, and relatively high universality. The algorithm handles high dimensionality with low sample size datasets (known as small n, large p) well as compared to other methods [39]. The core concepts of RF include bootstrap resampling (or bagging), random feature partitioning (i.e., randomly selected subset of features to model at each tree node), and decision tree modelling [40] (Fig. 2). The bootstrap resampling step enables internal cross-validation by leaving some samples out (known as out-of-bag [OOB] samples), allowing OOB error rates to be used as a measurement for performance evaluation (Fig. 2). The random feature partitioning step reduces modelling bias to the dataset (or over-fitting) or to the features. For each subgroup of randomly selected features, the algorithm utilizes either non-parametric methods such as the classification and regression tree (CART) or parametric methods to generate a decision tree [40–42]. Therefore, RF represents a collection of decision trees, leading to the final unbiased feature evaluation and model classification power. It has been demonstrated that RF is capable of handling binary, categorical, and multi-class as well as regression FS tasks [40]. The algorithm provides variable importance (VI) metrics that act as a starting point for the FS workflow. It is worth noting that, due to the heuristic nature of RF, a bootstrapping repeat RF-FS run is highly valuable.

Figure 2: A flowchart of the core algorithm for random forest (RF).

Various FS algorithms such as Boruta and RF-based recursive feature elimination (RF-RFE) have been developed [43], [44], each of which has its distinct specialties. For example, the former focuses on retaining all the relevant features, whereas RF-RFE emphasizes the isolation of the minimal number of features. In the case of investigating and selecting key miRNA targets that facilitate physiological responses under select environmental conditions, identifying all relevant features might be more appropriate. Additionally, since the regulatory connection among miRNAs may be manifested as statistical correlation [45], it is crucial to reflect this consideration when establishing an RF-FS procedure. Since recursive RF approaches have been shown to demonstrate superior performance when handling correlated features [46], they might be preferred for the processing of gene expression datasets. It is also imperative to evaluate the computational power and time requirements.
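The two randomisation steps at the core of RF, bootstrap resampling with out-of-bag samples and random feature partitioning, can be sketched in a few lines of Python. This is a minimal illustration rather than the randomForest implementation: the helper names, the sample count of 8, and the fixed seed are assumptions made for the example, and no tree is actually fitted.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement).

    Returns the in-bag indices used to grow a tree and the out-of-bag (OOB)
    indices left out, which can serve as an internal validation set.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


def random_feature_subset(n_features, mtry, rng):
    """Pick the mtry candidate features considered at a single tree node."""
    return sorted(rng.sample(range(n_features), mtry))


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_samples, n_features = 8, 35   # hypothetical "small n, large p" setting
mtry = max(1, n_features // 3)  # p / 3 feature partitioning scale

for tree in range(3):
    in_bag, oob = bootstrap_split(n_samples, rng)
    feats = random_feature_subset(n_features, mtry, rng)
    print(f"tree {tree}: in-bag={in_bag} OOB={oob} node features={feats}")
```

Each tree sees a different resample and a different feature subset at each node, which is what allows the OOB error to act as a performance estimate without a separate validation set.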

For example, although being able to identify all relevant features, Boruta's requirements for large sample sizes and computational power can be demanding [44]. Moreover, applying a univariate statistics filter prior to RF FS may help to integrate prior knowledge when selecting features, leading to a more focused study around specific research goals. Accordingly, we consider the following when implementing an RF-FS workflow:

a. Selecting all relevant features;
b. Incorporating a univariate statistics filter;
c. Taking feature correlation into consideration;
d. Options for reducing computational time.

4.1 Implementation

Based on the principles above, we present RBioFS, an implementation of an RF-FS workflow for selecting miRNAs relevant to differentiating between physiological phenotypes of interest. The current workflow is based on the strategy first presented by Genuer et al. (2010) [47]. The workflow uses the original RF principle proposed by Breiman (2001) [40] via the R package randomForest [48], and is a combination of univariate statistical filter, recursive RF VI ranking, single CART modeling, and sequential forward selection (SFS)-like recursive RF feature elimination. The original method by Genuer et al. (2010) [47] also features an additional round of SFS, providing a minimal feature list for predictive modelling. However, since our focus is on selecting all relevant features, the list obtained from the first SFS round is considered sufficient.

Briefly, an analysis of variance (ANOVA) or Student's t-test, based on the univariate test, is conducted on validated high-throughput miRNA datasets and miRNAs showing no statistically significant changes between phenotypes are discarded. A bootstrapping (50 times) RF run is then performed on the filtered dataset to calculate VI values for each feature. All features are subsequently ranked in descending order based on the mean VI value. It is worth noting that the current implementation uses permutation-based VI values because they are less biased towards correlated features [49]. Specifically, the permutation-based VI value of a given feature is the average difference between OOB error rates from all RF trees with and without the permutation for the feature in random partition (Fig. 2). As described in Genuer et al. (2010) [47], the standard deviation (SD) for each feature and the corresponding VI mean ranks are subjected to CART regression modelling using ANOVA to estimate the minimum SD using the R package rpart [50]. This estimate is then used as a threshold for the initial VI rank-based feature elimination step, based on the observation that relevant features tend to exhibit a larger variance in VI values [47]. The remaining features are then subjected to a bootstrapping (50 times) SFS-like process, where features are recursively introduced to the RF runs starting from the feature ranked at the top of the VI ranking. The first subset of features resulting in a mean OOB error rate less than the minimum OOB error rate plus one standard deviation is reported as the final result. A schematic representation of this workflow is depicted in Fig. 3. We wrapped the RF-FS pipeline described herein in the automated R package RBioFS, a package that takes advantage of parallel computing. The detailed RBioFS user manual, sample dataset, and sample results can be found at (http://kenstoreylab.com/?page_id=2542). The package and source code are available on GitHub: (https://github.com/jzhangc/git_RBioFS.git).

Figure 3: The flowchart of the current RF-FS implementation.

4.2 Case study and discussion

Here we demonstrate the use of RBioFS for RF-FS based gene selection on part of the dataset from a study that explored changes in miRNA expression profiles in response to hibernation in a South American marsupial [51]. In the original paper, expression levels of 85 miRNAs were measured in liver and skeletal muscle of euthermic active (control) and torpid animals using a high-throughput qRT-PCR profiling approach.

We applied RBioFS, our in-house R implementation of the RF-FS workflow, to the liver dataset. For univariate statistical filtering, all miRNAs that failed to show significant changes (Student's t-test) were discarded; this reduced the miRNA list size from 85 to 35 targets (Table 1). The permutation based VI values were then calculated and ranked upon 50 times RF runs (Fig. 4A). Based on SD values and ranking, a CART model was established to estimate the threshold for the initial RF feature elimination; this resulted in the selection of 12 miRNAs (Table 1). An SFS-like selection step was conducted on these 12 miRNAs. Starting with miR-22-5p, recursive RF runs (50 times per run) were carried out adding one miRNA each time. Consequently, the first group featuring 6 miRNAs was considered the most important targets. The decreasing trend of the OOB error is shown in Fig. 4B. For both RF steps, a tree count, i.e., times of resampling (denoted as ntree in the randomForest package), of 501 was used. A value of p / 3 was used as the random feature partitioning scale, i.e., the number of features for random feature partitioning (denoted as mtry in the randomForest package), as suggested by Genuer et al. (2010) [47].
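The stopping rule of the SFS-like step (report the first feature subset whose mean OOB error falls below the minimum OOB error plus one standard deviation) can be sketched in Python. This is an illustration only, not the RBioFS code: the function name is hypothetical, the error values are made up, and it assumes that the standard deviation is taken over the repeated runs at the subset size with the minimum mean error.

```python
import statistics


def select_first_subset(oob_errors_by_size):
    """Return the size of the first subset passing the 'minimum + 1 SD' rule.

    oob_errors_by_size[k] holds the OOB error rates from the repeated RF runs
    that used the top k + 1 features of the VI ranking.
    """
    means = [statistics.mean(errs) for errs in oob_errors_by_size]
    # Subset size with the lowest mean OOB error.
    best = min(range(len(means)), key=means.__getitem__)
    # Assumption: the SD comes from the runs at that best subset size.
    threshold = means[best] + statistics.stdev(oob_errors_by_size[best])
    for k, mean_error in enumerate(means):
        if mean_error < threshold:
            return k + 1  # number of top-ranked features retained
    return best + 1  # fallback when the SD at the minimum is zero


# Hypothetical OOB error rates: three repeated runs per subset size.
toy_errors = [
    [0.40, 0.44, 0.42],    # top 1 feature
    [0.30, 0.34, 0.32],    # top 2 features
    [0.20, 0.22, 0.24],    # top 3 features
    [0.11, 0.12, 0.115],   # top 4 features
    [0.08, 0.10, 0.12],    # top 5 features (lowest mean error)
    [0.10, 0.11, 0.12],    # top 6 features
]

n_selected = select_first_subset(toy_errors)
print(n_selected)
```

With these toy values the rule keeps four features, a smaller subset than the one with the lowest mean error, because its mean error is already within one SD of the minimum.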

Figure 4: Results from RF-FS conducted on the expression profile of 85 miRNAs from liver of control versus torpid marsupials. (A) Histogram depicting mean VI values (± SD) and ranking of the remaining 35 miRNAs from the univariate filtering; (B) OOB error rate (± SEM) change based on SFS-like selection. The vertical line represents the set of features that resulted in the minimum OOB error rate (+ 1SD). Figures were generated using the R package RBioplot [50].

Table 1: The miRNAs selected by RBioFS after each FS step. "dgl" stands for the species name Dromiciops gliroides.

t-test (not ranked)                   VI-based selecting    SFS-like selection
                                      (ranked by VI)        (ranked by VI)
dgl-let-7f-5p     dgl-miR-99b-5p      dgl-miR-22-5p         dgl-miR-22-5p
dgl-miR-1a-5p     dgl-miR-106b-5p     dgl-miR-876-5p        dgl-miR-876-5p
dgl-miR-1b-5p     dgl-miR-125a-5p     dgl-miR-21a-3p        dgl-miR-21a-3p
dgl-miR-10b-5p    dgl-miR-134-5p      dgl-miR-18a-3p        dgl-miR-18a-3p
dgl-miR-16-3p     dgl-miR-137-5p      dgl-miR-16-3p         dgl-miR-16-3p
dgl-miR-18a-3p    dgl-miR-139-5p      dgl-miR-142b-5p       dgl-miR-142b-5p
dgl-miR-20a-5p    dgl-miR-142b-5p     dgl-miR-191-5p
dgl-miR-21a-3p    dgl-miR-145a-5p     dgl-miR-34a-5p
dgl-miR-22-5p     dgl-miR-152-5p      dgl-miR-23a-5p
dgl-miR-23a-5p    dgl-miR-181a-5p     dgl-miR-196a-5p
dgl-miR-26a-5p    dgl-miR-185-5p      dgl-miR-219a-5p
dgl-miR-27a-5p    dgl-miR-191-5p      dgl-miR-106b-5p
dgl-miR-29a-5p    dgl-miR-193b-5p
dgl-miR-34a-5p    dgl-miR-196a-5p
dgl-miR-34c-5p    dgl-miR-214-5p

The group of 6 RF-FS selected miRNAs was further examined using both available literature and miRNA target prediction tools highlighted in this work. It was revealed that these miRNAs, significantly downregulated during torpor, are likely involved in coordinating a compensatory hepatic thermoregulatory mechanism to facilitate arousal in hibernating marsupials. Indeed, miR-22-5p has been implicated as a regulator of lipid metabolism [53] and DIANA-miRPath [54] analysis of the 6 miRNAs revealed that miR-22-5p, miR-21a-3p, miR-34a-5p, and miR-876-5p directly target key elements of mitogen-activated protein kinase (MAPK) pathways.

This hypothesis is further supported by the reduced expression level of miR-142-5p that is known to be inversely related to MAPK activity [55]. Moreover, recent findings have identified miR-142-5p as a temperature-sensitive miRNA, and a known regulator of raised body temperature [56]. Our case study on hibernating marsupial data emphasizes the ability of RBioFS to effectively sort and reveal the most crucial and differentiating set of miRNAs that occur in response to metabolic or environmental perturbations in a particular system. Indeed, the utility of RBioFS is not limited to comparative molecular physiology but is also applicable for miRNA studies of development and disease. Furthermore, the ability of RBioFS to select the most important miRNA targets, capable of differentiating between healthy and patient groups, emphasizes its ability to aid in the discovery of biomarkers in various conditions.

5 Conclusion

High-throughput expression profiling and advanced computational tools have been instrumental to our understanding of miRNAs and the central role that they play in modulating gene expression. In this review, we highlight the functional characteristics of miRNAs, miRNA-profiling methods, and we outline the main data processing steps and computational tools required to undertake these analyses. We also present our implementations of a small RNA-Seq data analysis pipeline and an RF based feature selection workflow through the R packages RBioMIR and RBioFS, respectively.

Author contributions

All authors were involved in the assembly of the manuscript and approved the final version. JZ and HH performed the data analysis. JZ developed the Unix Shell codes, RBioMIR, and RBioFS.

Acknowledgments

This work was supported by a Discovery grant (Grant # 6793) from the Natural Sciences and Engineering Research Council (NSERC) of Canada to KBS. KBS holds the Canada Research Chair in Molecular Physiology and HH holds an Ontario Graduate Scholarship.

References

[1] P. Amaral, M. E. Dinger, and J. S. Mattick. Non-coding RNAs in homeostasis, disease and stress responses: an evolutionary perspective. Briefings in functional genomics, 12(3):254–78, May 2013.
[2] M. S. Ebert and P. A. Sharp. Roles for microRNAs in conferring robustness to biological processes. Cell, 149(3):515–24, Apr. 2012.
[3] D. Bartel. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell, 116(2):281–297, Jan. 2004.
[4] M. Akhtar, L. Micolucci, M. S. Islam, F. Olivieri, and A. D. Procopio. Bioinformatic tools for microRNA dissection. Nucleic acids research, 44(1):24–44, Jan. 2015.
[5] G. Bertoli, C. Cava, and I. Castiglioni. MicroRNAs: New Biomarkers for Diagnosis, Prognosis, Therapy Prediction and Therapeutic Tools for Breast Cancer. Theranostics, 5(10):1122–1143, 2015.

[6] K. B. Storey. Regulation of hypometabolism: insights into epigenetic controls. Journal of Experimental Biology, 218(1):150–159, 2015.
[7] M. D. Saçar, C. Bağcı, and J. Allmer. Computational Prediction of MicroRNAs from Toxoplasma gondii Potentially Regulating the Hosts' Gene Expression. Genomics, proteomics & bioinformatics, 12(5):228–238, Oct. 2014.
[8] C. Pritchard, H. Cheng, and M. Tewari. MicroRNA profiling: approaches and considerations. Nature Reviews Genetics, 13(5):358–369, 2012.
[9] R. J. Farr, A. S. Januszewski, M. V. Joglekar, H. Liang, K. McAulley, W. Hewitte, et al. A comparative analysis of high-throughput platforms for validation of a circulating microRNA signature in diabetic retinopathy. Scientific Reports, 5:10375, Jun. 2015.
[10] K. Biggar, C.-W. Wu, and B. Storey. High-throughput amplification of mature microRNAs in uncharacterized animal models using polyadenylated RNA and stem–loop reverse transcription polymerase chain reaction. 2014.
[11] P. T. Nelson, D. a Baldwin, L. M. Scearce, J. C. Oberholtzer, W. Tobias, and Z. Mourelatos. Microarray-based, high-throughput gene expression profiling of microRNAs. Nature methods, 1(2):155–161, 2004.
[12] J. Shendure and H. Ji. Next-generation DNA sequencing. Nature Biotechnology, 26(10):1135–1145, Oct. 2008.
[13] M. L. Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, Jan. 2010.
[14] C. J. Creighton, G. Reid, and P. H. Gunaratne. Expression profiling of microRNAs by deep sequencing. Briefings in Bioinformatics, 10(5):490–497, Sep. 2009.
[15] M. Hackenberg, N. Rodríguez-Ezpeleta, and A. Aransay. miRanalyzer: an update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic acids research, 39(Web Server issue):W132-8, Jul. 2011.
[16] P.-J. Huang, Y.-C. Liu, C.-C. Lee, W.-C. Lin, R. R.-C. Gan, P.-C. Lyu, and Tang. DSAP: deep-sequencing small RNA analysis pipeline. Nucleic acids research, 38(Web Server issue):W385-91, Jul. 2010.
[17] J. R. Brown and P. Sanseau. A computational view of microRNAs and their targets. Drug Discovery Today, 10(8):595–601, 2005.
[18] L. P. Lim, N. C. Lau, E. G. Weinstein, A. Abdelhakim, S. Yekta, M. W. Rhoades, B. Burge, and D. P. Bartel. The microRNAs of Caenorhabditis elegans. Genes & Development, 17(8):991–1008, Apr. 2003.
[19] A. Kozomara and S. Griffiths-Jones. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research, 42(D1):D68–D73, Jan. 2014.
[20] L. Li, J. Xu, D. Yang, X. Tan, and H. Wang. Computational approaches for microRNA studies: a review. Mammalian genome: official journal of the International Mammalian Genome Society, 21(1–2):1–12, Feb. 2010.
[21] M. D. Saçar and J. Allmer. Machine learning methods for microRNA gene prediction. Methods in molecular biology (Clifton, N.J.), 1107:177–87, Jan. 2014.
[22] R. J. Peace, K. Biggar, B. Storey, and J. R. Green. A framework for improving microRNA prediction in non-human genomes. Nucleic acids research, 43(20):e138, Nov. 2015.
[23] M. R. Friedländer, S. D. Mackowiak, N. Li, W. Chen, and Rajewsky. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic acids research, 40(1):37–52, Jan. 2012.
[24] A. Jha and R. Shankar. miReader: Discovering Novel miRNAs in Species without Sequenced Genome. PLoS ONE, 8(6), 2013.

[25] K. Etebari and S. Asgari. Accuracy of MicroRNA discovery pipelines in non-model organisms using closely related species genomes. PLoS ONE, 9(1):1–10, 2014.
[26] S. Andrews. FastQC: A quality control tool for high throughput sequence data. Bioinformatics, Babraham, 2010. [Online]. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
[27] E. Aronesty. ea-utils: Command-line tools for processing biological sequencing data. 2011.
[28] S. Griffiths-Jones, A. Bateman, M. Marshall, Khanna, and R. Eddy. Rfam: an RNA family database. Nucleic Acids Research, 31(1):439–41, Jan. 2003.
[29] S. Sai Lakshmi and Agrawal. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Research, 36(Database):D173–D177, Dec. 2007.
[30] B. Langmead. Aligning Short Sequencing Reads with Bowtie. In Current Protocols in Bioinformatics, Hoboken, NJ, USA: John Wiley & Sons, Inc., 2010, 11.7.1-11.7.14.
[31] S. Griffiths-Jones, R. J. Grocock, van Dongen, A. Bateman, and Enright. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(Database issue):D140-4, Jan. 2006.
[32] S. Anders, P. T. Pyl, and W. Huber. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2):166–169, Jan. 2015.
[33] M. D. Robinson, J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, Jan. 2010.
[34] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, Shi, and G. K. Smyth. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7):e47–e47, Apr. 2015.
[35] G. Warnes, B. Bolker, L. Bonebakker, R. Gentleman, W. Liaw, T. Lumley, M. Maechler, A. Magnusson, S. Moeller, M. Schwartz, and B. Venables. gplots: Various R programming tools for plotting data. R package version 3.0.1. 2016.
[36] H. Wickham. ggplot2: elegant graphics for data analysis. Springer, 2009.
[37] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46:389–422, 2002.
[38] C. Ambroise and G. J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences of the United States of America, 99(10):6562–6, May 2002.
[39] R. Díaz-Uriarte, S. Alvarez de Andrés, J. Lee, M. Park, et al. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, 2006.
[40] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[41] L. Breiman, J. Friedman, C. Stone, and R. A. Olshen. Classification and regression trees. Chapman & Hall, 1984.
[42] A. Zeileis, T. Hothorn, and K. Hornik. Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 17(2):492–514, 2008.
[43] M. B. Kursa and W. R. Rudnicki. Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11):1–13, 2010.
[44] M. B. Kursa, Y. Saeys, I. Inza, P. Larrañaga, R. Nielsen, J. Peña, et al. Robustness of Random Forest-based gene selection methods. BMC Bioinformatics, 15(1):8, 2014.
[45] S. G. Chaulk, H. A. Ebhardt, and R. P. Fahlman. Correlations of microRNA:microRNA expression patterns reveal insights into microRNA clusters and global microRNA expression patterns. Mol. BioSyst., 12(1):110–119, 2016.
[46] B. Gregorutti, Michel, and P. Saint-Pierre. Correlation and variable importance in random forests. Statistics and Computing, 1–20, Mar. 2016.

[47] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot. Variable selection using random forests. 2010.
[48] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News, 2(December):18–22, 2002.
[49] K. Nicodemus, J. D. Malley, C. Strobl, A. Ziegler, L. Breiman, T. Hothorn, et al. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics, 11(1):110, 2010.
[50] T. Therneau, B. Atkinson, and Ripley. rpart: Recursive partitioning and regression trees. R package version 4.1-10. 2015.
[51] H. Hadj-Moussa, J. A. Moggridge, B. E. Luu, F. Quintero-Galvis, D. Gaitán-Espitia, R. F. Nespolo, et al. The hibernating South American marsupial, Dromiciops gliroides, displays torpor-sensitive microRNA expression patterns. Scientific Reports, 6:24627, Apr. 2016.
[52] J. Zhang and K. B. Storey. RBioplot: an easy-to-use R pipeline for automated statistical analysis and data visualization in molecular biology and biochemistry. PeerJ, 4:e2436, Sep. 2016.
[53] C. Koufaris, G. N. Valbuena, Y. Pomyen, D. Tredwell, E. Nevedomskaya, C.-H. Lau, et al. Systematic integration of molecular profiles identifies miR-22 as a regulator of lipid and folate metabolism in breast cancer cells. Oncogene, 35(21):2766–2776, May 2016.
[54] I. S. Vlachos, K. Zagganas, M. D. Paraskevopoulou, G. Georgakilas, Karagkouni, T. Vergoulis, T. Dalamagas, and A. G. Hatzigeorgiou. DIANA-miRPath v3.0: deciphering microRNA function with experimental support. Nucleic Acids Research, 43(W1):W460-6, Jul. 2015.
[55] S. Sharma, J. Liu, Wei, H. Yuan, T. Zhang, and N. Bishopric. Repression of miR-142 by p300 and MAPK is required for survival signalling via gp130 during adaptive hypertrophy. EMBO molecular medicine, 4(7):617–32, Jul. 2012.
[56] J. J.-L. Wong, A. Y. M. Au, D. Gao, N. Pinello, C.-T. Kwok, Thoeng, et al. RBM3 regulates temperature sensitive miR-142-5p and miR-143 (thermomiRs), which target immune genes and control fever. Nucleic Acids Research, 44(6):2888–97, Apr. 2016.

Copyright 2016 The Author(s). Published by Journal of Integrative Bioinformatics. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).