Noncoding RNAs and Their Annotation Using Metagenomics Algorithms Shubhra Sankar Ray1,2∗ and Sonam Maiti1

Advanced Review

This article provides an overview of noncoding RNAs (ncRNAs), covering their structure, function, computational methods for structure prediction, and the algorithms for analyzing ncRNAs from metagenome samples. Different techniques for ncRNA structure prediction, such as dynamic programming (DP), genetic algorithm (GA), artificial neural network (ANN), and stochastic context-free grammar (SCFG), are discussed. The basic concepts of metagenomics along with their biological basis are mentioned, and the relevance of ncRNAs in metagenomics is also explored. Similarity and composition based computational methods for analyzing noncoding sequences in metagenomes are then mentioned along with their biological findings. An extensive bibliography is included. © 2015 John Wiley & Sons, Ltd.

How to cite this article: WIREs Data Mining Knowl Discov 2015, 5:1–20. doi: 10.1002/widm.1142

Volume 5, January/February 2015. wires.wiley.com/widm

∗Correspondence to: [email protected]
1Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India
2Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata, West Bengal, India
Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Noncoding RNAs (ncRNAs) are functional RNAs in the cell. Although they do not code for proteins, revealing their functions is necessary for understanding many biological processes like gene expression regulation,1 gene silencing,2 transcription,3 replication,4 processing,5 chromosome stability,6 protein stability, translocation, and localization,2,7 RNA modification,8 and so on. There are different types of ncRNAs such as transfer RNA (tRNA), ribosomal RNA (rRNA), micro RNA (miRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), small interfering RNA (siRNA), and piwi-interacting RNA (piRNA). NcRNAs can also reveal the relations among several organisms.9 For example, miRNAs can provide a richer functional spectrum and an in-depth explanation of how genes are regulated.10–12 The biological roles of ncRNAs reveal that they are not only a transitional pathway between the genome and the proteins, but also help in many action mechanisms in the cell.10

In general, the structure and function of ncRNA sequences can be predicted from a multiple sequence alignment of ncRNAs belonging to the same family with known conserved secondary structures.13 The function of new ncRNAs can also be identified from homologous RNAs by an inference method or from the base composition.14 The structure can also be predicted from the sequence itself by computational methods.15 However, due to the exponential number of possible solutions, RNA structure prediction through computational methods is a complex problem. High-throughput methods show that 90% of the human genome is transcribed at some time in some tissue. Although the functionality of this transcription is unclear in many instances,16 these transcripts suggest that many important ncRNA functions are yet to be discovered. In this regard, metagenomic databases can provide new directions for finding novel ncRNAs, annotating existing ones, and making other biological discoveries.

Metagenomics is a rapidly growing field of research that involves the study of genetic material recovered directly from environmental samples, because more than 98% of microbial genomes cannot be cultured and most microbial species live in mixed or complex environments. Metagenomics offers a powerful methodology for examining the microbial world that has the potential to revolutionize our understanding of the entire living world. Over the past few years, the major computational challenges associated with metagenomics have shifted from generating sequences to analyzing sequences. The objectives of a metagenomic study can be broadly viewed as:

• offering a window to observe genetic material where all of the parts can be examined individually or working as a whole;
• examining phylogenetic diversity of microorganisms for monitoring and predicting the changes in environmental conditions17;
• analyzing sequences for desirable enzyme candidates (e.g., cellulases, chitinases, lipases, and antibiotics) in medical applications17,18;
• examining secretory, regulatory, and signal transduction mechanisms associated with the samples or genes of interest19;
• understanding metabolic pathways and designing culture media for the growth of previously uncultured microbes18;
• examining potential lateral gene transfer events to acquire knowledge of genome plasticity, which may give us ideas of the selective pressures for gene capture and evolution within a habitat19;
• designing high-throughput experiments for defining the roles of genes and microorganisms using metadata.

The success of the aforementioned objectives in metagenomics relies on the efficiency of the following steps:

• the isolation of genetic material,
• manipulation of the genetic material, and
• library construction.

Metagenomic databases can be a rich source for identifying novel ncRNAs. Analyzing these databases requires not only RNA secondary structure prediction tools but also tools to extract the ncRNA sequences. Hence, the aim of this article is to provide the basic ideas of ncRNAs, the computational tools to analyze them, and how to identify new ncRNAs from metagenomic samples. One may note that the present survey not only describes different ncRNAs and various computational methods for RNA structure prediction, but also presents the basic tasks in metagenomics and the relevance of ncRNA in metagenomics, whereas the review in Ref 20 deals with in-depth analysis of ncRNA only. The description of different ncRNAs is not provided in Ref 20. Computational techniques like dynamic programming (DP) and stochastic context-free grammar (SCFG) are covered both in this survey and in Ref 20, although in a different way, while computational methods involving artificial neural networks (ANNs) and heuristic search techniques like genetic algorithms (GAs) and simulated annealing for RNA secondary structure prediction are described only in this article. Moreover, the relevance of ncRNA in metagenomics is one of the main focuses of this survey, which is a completely different issue from that in Ref 20. In this regard, similarity and composition based computational methods for analyzing ncRNA sequences in metagenomes are described in the later part of this manuscript.

First, the functions of different ncRNAs with their basic structures are described in Section II. Various computational methods for structure prediction are explained in Section III. In Section IV, basic concepts of metagenomics and the relevance of ncRNAs in metagenomics are mentioned. Computational methods for analyzing ncRNA sequences in metagenomics can be referred to as metagenomic algorithms and are described in Section V. Finally, conclusions are presented in Section VI.

ROLE AND STRUCTURE OF DIFFERENT NCRNAS

ncRNA is a special group of RNA which generally does not code for protein and is involved in many biological processes. Various sets of ncRNAs such as tRNA, rRNA, H/ACA box snoRNAs, C/D box snoRNAs, and most of the riboswitches are present in both prokaryotes and eukaryotes. While most known riboswitches are found in bacteria, the TPP riboswitches are also observed in plants, certain fungi, and archaea.21 NcRNAs such as 6S RNA and OxyS RNA are found in bacteria.22,23 Note that small ncRNAs like microRNA (miRNA), siRNA, and piRNA are found only in eukaryotes. Brief descriptions of some ncRNAs are provided below.

Riboswitch: It is typically located in a noncoding region of mRNA and contains an aptamer region and an expression platform.24 The aptamer directly binds a small molecule, called the ligand, and guides the expression platform to control gene expression by switching between two different secondary structures. This is accomplished by a common part, called the switching sequence. In the presence of a ligand, the switching sequence becomes a part of the aptamer region and a terminator stem-loop forms in the expression platform to stop the transcription process. When the ligand is not bound to the aptamer, the switching sequence becomes a part of the expression platform to form an anti-terminator stem-loop, and transcription proceeds through the mRNA. Riboswitches are categorized into families according to the type of ligand they bind and their secondary structures, such as four-way helical junction, H-type pseudoknot, three-way helical junction, and coaxial stacking. Loop-loop interaction in the tertiary structure also determines their family. A family is further classified into classes based on the common sequence that binds the ligand. A typical example

tRNA: This type of ncRNA is typically 73–94 nucleotides long.32 The secondary structure of a tRNA is a cloverleaf structure with four arms that form the 3D L-shaped structure through coaxial stacking of the helices. The four arms are denoted as the D arm, anticodon arm, TΨC arm, and amino acid arm.33 The anticodon arm forms an anticodon loop with seven unpaired bases at its end, and three of these bases can recognize and decode an mRNA codon. The amino
