Advanced Review

Noncoding RNAs and their annotation using metagenomics algorithms

Shubhra Sankar Ray1,2∗ and Sonam Maiti1

This article provides an overview of noncoding RNAs (ncRNAs), covering their structure, function, computational methods for structure prediction, and the algorithms for analyzing ncRNAs from metagenome samples. Different techniques for ncRNA structure prediction such as dynamic programming (DP), genetic algorithm (GA), artificial neural network (ANN) and stochastic context-free grammar (SCFG) are discussed. The basic concepts of metagenomics along with their biological basis are mentioned and the relevance of ncRNAs in metagenomics is also explored. Similarity and composition based computational methods for analyzing noncoding sequences in metagenomes are then mentioned along with their biological findings. An extensive bibliography is included. © 2015 John Wiley & Sons, Ltd.

How to cite this article: WIREs Data Mining Knowl Discov 2015, 5:1–20. doi: 10.1002/widm.1142

INTRODUCTION

Noncoding RNAs (ncRNAs) are functional RNAs in the cell. Although they do not code for proteins, revealing their functions is necessary for understanding many biological processes like expression regulation,1 gene silencing,2 ,3 replication,4 processing,5 chromosome stability,6 stability, translocation, and localization,2,7 RNA modification,8 and so on. There are different types of ncRNAs such as transfer RNA (tRNA), ribosomal RNA (rRNA), micro RNA (miRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), small interfering RNA (siRNA), and piwi-interacting RNA (piRNA). NcRNAs can also reveal the relations among several .9 For example, miRNAs can provide a richer functional spectrum and an in-depth explanation of how genes are regulated.10–12 Biological roles of ncRNAs revealed that they are not only a transitional pathway between the genome and the proteins, but also help in many action mechanisms in the cell.10 In general, the structure and function of ncRNA sequences can be predicted from multiple sequence alignment of ncRNAs belonging to the same family with known conserved secondary structures.13 The function of new ncRNAs can also be identified from homologous RNAs by an inference method or from the base composition.14 The structure, in turn, can be predicted from the sequence itself by computational methods.15 However, due to the exponential number of possible solutions, RNA structure prediction through computational methods is a complex problem. High-throughput methods show that in human 90% of the genome is transcribed at some time in some tissue. Although the functionality of this transcription is unclear in many instances,16 these transcripts suggest that many important ncRNA functions are yet to be discovered. In this regard, metagenomic databases can provide new directions in finding novel ncRNAs, annotating existing ones and possibilities for other biological discoveries.
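The exponential number of candidate structures mentioned above can be illustrated with a small counting sketch. It is not taken from the article. It assumes a simplified model in which any two bases may pair, pseudoknots are excluded, and a hairpin loop must contain at least `min_loop` unpaired bases; the function name `count_structures` and the default `min_loop=3` are illustrative choices. Under these assumptions the count is an upper bound for any real sequence of the same length.

```python
def count_structures(n: int, min_loop: int = 3) -> int:
    """Count pseudoknot-free secondary structures on n bases.

    Assumption: every pair of bases is allowed to pair (sequence ignored),
    and a base pair (i, j) needs at least `min_loop` unpaired bases between
    its two ends. counts[L] holds the number of structures on L bases.
    """
    counts = [1] * (max(n, 0) + 1)  # short chains admit only the open structure
    for length in range(min_loop + 2, n + 1):
        # Case 1: the last base is unpaired.
        total = counts[length - 1]
        # Case 2: the last base pairs with base k (1-based), which splits the
        # chain into k-1 bases to the left and length-k-1 bases enclosed.
        for k in range(1, length - min_loop):
            total += counts[k - 1] * counts[length - k - 1]
        counts[length] = total
    return counts[n]


if __name__ == "__main__":
    # Print how quickly the search space grows with sequence length.
    for n in (10, 20, 40, 80):
        print(n, count_structures(n))
```

Because the count grows exponentially with sequence length, exhaustive enumeration is impractical for all but very short sequences, which is why the structure prediction methods surveyed later rely on dynamic programming or heuristic search.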
∗Correspondence to: [email protected]
1Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India
2Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata, West Bengal, India
Conflict of interest: The authors have declared no conflicts of interest for this article.
Volume 5, January/February 2015. wires.wiley.com/widm

Metagenomics is a rapidly growing field of research that involves the study of genetic materials recovered directly from environmental samples, because more than 98% of microbial genomes cannot be cultured and most microbial species live in mixed or complex environments. Metagenomics offers a powerful methodology for examining the microbial world that has the potential to revolutionize our understanding of the entire living world. Over the past few years, the major computational challenges associated with metagenomics have shifted from generating sequences to analyzing sequences. The objectives of a metagenomic study can be broadly viewed as:

• offering a window to observe genetic material where all of the parts can be examined individually or working as a whole;
• examining phylogenetic diversity of microorganisms for monitoring and predicting the changes in environmental conditions17;
• analyzing sequences for desirable enzyme candidates (e.g., cellulases, chitinases, lipases, and antibiotics) in medical applications18,17;
• examining secretory, regulatory, and signal transduction mechanisms associated with the samples or genes of interest19;
• understanding metabolic pathways and designing culture media for the growth of previously uncultured microbes18;
• examining potential lateral gene transfer events to acquire knowledge of genome plasticity, which may give us ideas of the selective pressures for gene capture and evolution within a habitat19;
• designing high-throughput experiments for defining the roles of genes and microorganisms using metadata.

The success of the aforementioned objectives in metagenomics relies on the efficiency of the following steps:

• the isolation of genetic material,
• manipulation of the genetic material, and
• library construction.

Metagenomic databases can be a rich source for identifying novel ncRNAs. Analyzing these databases requires not only RNA secondary structure prediction tools, but also tools to extract the ncRNA sequences. Hence, the aim of this article is to provide the basic ideas of ncRNAs, the computational tools to analyze them, and how to identify new ncRNAs from metagenomic samples. Note that the present survey not only describes different ncRNAs and various computational methods for RNA structure prediction, but also presents the basic tasks in metagenomics and the relevance of ncRNA in metagenomics, whereas the review in Ref 20 deals with in-depth analysis of ncRNA only. The description of different ncRNAs is not provided in Ref 20. While computational techniques like dynamic programming (DP) and stochastic context-free grammar (SCFG) are covered both in this survey and in Ref 20, though in a different way, computational methods involving artificial neural networks (ANNs) and heuristic search techniques like genetic algorithms (GAs) and simulated annealing for RNA secondary structure prediction are described only in this article. Moreover, the relevance of ncRNA in metagenomics is one of the main focuses of this survey, which is a completely different issue from that in Ref 20. In this regard, similarity and composition based computational methods for analyzing ncRNA sequences in metagenomes are described in the later part of this manuscript. First, the functions of different ncRNAs with their basic structures are described in Section II. Various computational methods for structure prediction are explained in Section III. In Section IV, basic concepts of metagenomics and the relevance of ncRNAs in metagenomics are mentioned. Computational methods for analyzing ncRNA sequences in metagenomics, which can be referred to as metagenomic algorithms, are described in Section V. Finally, conclusions are presented in Section VI.

ROLE AND STRUCTURE OF DIFFERENT NCRNAS

ncRNA is a special group of RNA which generally does not code for protein and is involved in many biological processes. Various sets of ncRNAs such as tRNA, rRNA, H/ACA box snoRNAs, C/D box snoRNAs and most of the riboswitches are present in both prokaryotes and eukaryotes. While most known riboswitches are found in , the TPP riboswitches are observed in , certain fungi and .21 NcRNAs such as 6S RNA and OxyS RNA are found in bacteria.22,23 Note that small ncRNAs like microRNA (miRNA), siRNA and piRNA are found only in eukaryotes. Brief descriptions of some ncRNAs are provided below.

Riboswitch: It is typically located at a noncoding region of mRNA and it contains an aptamer region and an expression platform.24 The aptamer directly binds a small molecule, called ligand, and guides the expression platform to control gene expression by switching between two different secondary structures. This is accomplished by a common part, called switching sequence. In the presence of a ligand, the switching sequence becomes a part of the aptamer region and a terminator stem-loop forms in the expression platform to stop the transcription process. When the ligand is not bound to the aptamer then the switching

sequence becomes a part of the expression platform to form an anti-terminator stem loop and the transcription starts through mRNA. Riboswitches are categorized into families according to the type of ligand they bind and their secondary structures such as four-way helical junction, H-type pseudoknot, three-way helical junction and coaxial stacking. Loop-loop interaction in tertiary structure also determines their family. A family is further classified into classes based on the common sequence that binds the ligand. A typical example of a riboswitch family is the SAM riboswitch which binds with the ligand S-adenosylmethionine (SAM) and is classified into SAM-I, SAM-II and SAM-III classes, based on the helices and pseudoknot.

Long non protein coding RNAs (lncRNA): LncRNAs are a large class of RNA molecules, found in eukaryotic cells and greater than 200 nucleotides in length. In certain cases lncRNAs can be computationally identified from short secondary structures such as stem loops25 with sequence conservation and compensatory mutations for paired bases.26 While the large intergenic ncRNAs are observed within the intergenic region (IGR) between two genes, many lncRNAs overlap with the coding part of genes. In general, they can be assigned to any one or more of the following five categories: (1) intergenic, (2) intronic, (3) sense, (4) antisense, and (5) bidirectional.25 The lncRNAs belonging to the sense and antisense categories have overlapping regions with one or more coding parts of another sequence on the same or opposite strand, respectively. In the bidirectional category, the lncRNA and a coding region on the opposite strand express when they are close in genomic context. The intronic lncRNA lies completely within an intron. The secondary structure of a human lncRNA, steroid receptor RNA activator (SRA), consists of four major domains and 25 helices.27 These helices are separated by a junction or a large internal loop (greater than 12 nucleotides), or an asymmetric internal loop formed with greater than 6 nucleotides on one side and no nucleotides on the other side. Long ncRNAs interact with proteins or genomic DNA, and in the process their secondary structure plays an important role. They contain many types of transcripts that are structurally similar to mRNAs and are sometimes transcribed by RNA polymerase II28 and/or RNA polymerase III.29 They take part in post-transcriptional gene expression by binding with mRNA and masking the part required for binding with transcription factors. It is observed that lncRNAs can also take part in chromatin modification.30 For example, the Xist ncRNA31 can inactivate genes in the X-chromosome in female placental mammals by chromatin modification.

tRNA: This type of ncRNA is typically 73–94 nucleotides long.32 The secondary structure of a tRNA is a cloverleaf structure with four arms that form the 3D L-shaped structure through coaxial stacking of the helices. The four arms are denoted as D arm, Anticodon arm, TψC arm and Aminoacid arm.33 The Anticodon arm forms an anticodon loop with seven unpaired bases at its end, and three of these bases can recognize and decode an mRNA codon. The amino acid is attached to the CCA terminal group of the Aminoacid arm, which also contains a stem with seven base pairs. The overall folding of the tRNA molecule mainly depends on the D arm and the TψC arm. The tRNA is a necessary component for protein synthesis according to the genetic code in DNA. It acts as a carrier for amino acids to ribosomes for incorporation into a polypeptide chain undergoing synthesis.

rRNA: rRNA is the RNA component of the ribosome with two subunits, large subunit (LSU) and small subunit (SSU).34 This type of ncRNA molecule takes part in protein synthesis. The SSU binds with an mRNA to form the 'initiating aminoacyl-tRNA' and the LSU helps in pairing the 'initiating aminoacyl-tRNA' with the mRNA codon AUG that signals the start of a polypeptide.33 While in prokaryotes the LSU is 50S and the SSU is 30S, in eukaryotes those are 60S and 40S, respectively. In prokaryotes, the LSU contains 5S and 23S rRNAs, and the SSU contains 16S rRNA. In contrast, in eukaryotes the LSU contains three rRNA species (5S, 5.8S, and 28S in mammals and 25S in plants) and the SSU contains 18S rRNA. The secondary structure of the SSU in eukaryotes consists of four distinct domains, called the 5′, central, 3′ major, and 3′ minor. A conserved core, interspersed with variable regions, is observed in the structure of the LSU.35 These regions are also vastly different in closely related eukaryotes. Moreover, there are no helices in the LSU rRNAs present in the mitochondria of kinetoplastids and . In prokaryotes the 5′ and 3′ ends of the LSU rRNA are joined by a helix. This stem helix is followed by a large central multi-branched loop from which several helices originate.35 In contrast, the stem helix is not observed in eukaryotes.

miRNA: This type of small (21–25 nucleotides in length) ncRNA is found in eukaryotic cells and in some viruses. The miRNAs are produced from miRNA genes or introns lying within other genes. In animals, a part of a hundreds of nucleotides long precursor, denoted as primary miRNA (pri-miRNA), is produced from an 80 nucleotide long RNA hairpin.36 This pri-miRNA is processed in the nucleus by the RNase III enzyme, Drosha, to generate a shorter 65 nucleotide precursor miRNA (pre-miRNA) with stem-loop structure.37 It is then exported into the

cytoplasm and its loop is removed by the RNase III enzyme, Dicer, to form a 20-bp RNA duplex.38,39 One of the strands of this duplex acts as a functional miRNA (mature miRNA) and is integrated with a large protein complex, denoted as RNA-induced silencing complex (RISC), to interact with the mRNA target.37,40–42 The main functions of the miRNA are to downregulate gene expression, including translational repression, mRNA cleavage and deadenylation.43 The miRNA can bind to the complementary mRNA to inhibit protein translation.44

snoRNA: snoRNAs are ncRNAs that guide chemical changes in rRNA, tRNA and snRNAs.45 Based on sequence motifs and secondary structures, the snoRNAs are divided into two groups: the C/D box snoRNAs, which are associated with attachment or substitution of a methyl group onto various substrates, and the H/ACA box snoRNAs, which convert the nucleoside uridine to pseudouridine with exactly the same but rearranged atoms (pseudouridylation).46 The C/D box snoRNAs consist of two conserved sequence motifs C (RUGAUGA) and D (CUGA), which are found a few nucleotides away from the 5′ end and opposite to the 3′ end of the snoRNA, respectively.47 These two motifs have some complementary bases and are tied together at a very short region to form structures like stem-box, double helix, and so on. The H/ACA snoRNAs also consist of two conserved sequence motifs, the H motif (5′ANANNA3′ where N is any nucleotide) which is located in the hinge region and the ACA motif that is located in the tail region (three nucleotides away from the 3′ end of the snoRNA).48 Its secondary structure is defined as a hairpin–hinge–hairpin–tail49 as it contains two hairpins and two single-stranded regions.

snRNA: This type of small ncRNA is found in all eukaryotic cells. The snRNAs usually act as ribonucleoproteins in the nucleus of a cell and are involved in splicing of introns from messenger RNA (mRNA). They are approximately 150 nucleotides long and are conserved in both primary and secondary structures. Based on shared sequences and associated proteins, snRNA is divided into two groups. One is sm-class RNA, which has a 5′ trimethylguanosine cap and binds several sm proteins. The other is Lsm-RNA, which contains a monomethylphosphate 5′ cap and a uridine rich 3′ end, acting as a binding site for Lsm proteins.50 The term 'sm' (Smith) originated from the name Stephanie Smith, who was a patient generating antibodies to nuclear proteins (sm proteins), named as 'smith antigen' (Sm Ag). The term 'Lsm' refers to like-sm. These proteins form a ring like structure involving six or seven individual protein molecules. The Sm-class snRNA is transcribed by RNA polymerase II and exported to the cytoplasm from the nucleus to form a 3′ stem-loop structure, necessary to generate the stable ribonucleoprotein that goes back into the nucleus. This ribonucleoprotein then binds with a specific sequence on the pre-mRNA substrate. In contrast, the Lsm-class snRNA is transcribed by RNA polymerase III and always stays in the nucleus.

siRNA: These RNAs are one class of ncRNAs and are also known as short-interfering RNA or silencing RNA. In general, they are 20–25 base pairs in length. With the help of the Dicer enzyme, a long double-stranded RNA (dsRNA) or a small hairpin RNA is broken down into small siRNA fragments containing phosphorylated 5′ ends and hydroxylated 3′ ends with two overhanging nucleotides.51,52 The siRNA mainly guides argonaute proteins, the catalytic components of the RNA-induced silencing complex (RISC), for the gene silencing phenomenon known as RNA interference. Here, siRNA interferes with the expression of specific genes with a complementary nucleotide sequence through argonaute proteins. The other functions of siRNA are shaping the chromatin structure of a genome, silencing post-transcriptional genes in plants,53 and so on. The differences between siRNA and miRNA are in their origin and structure. The siRNA originates from double stranded RNA (dsRNA) and it is most commonly a response to foreign RNA (usually viral). This RNA is often 100% complementary to the target. In contrast, miRNA originates from either their own genes or introns with single-stranded (ssRNA) hairpin structure. Although both siRNA and miRNA regulate post-transcriptional gene expression, miRNA is often not 100% complementary to the target.52

piRNA: This type of small ncRNA molecules are located in groups across the genome in cells and they are 26–31 nucleotides in length. The sequences of piRNAs are not conserved across species and a clear secondary structure motif is not defined. While in invertebrates the piRNAs are surrounded by the protein coding genes, in vertebrates they are observed in areas without any protein coding genes. A 5′ uridine is usually observed in piRNAs of both vertebrates and invertebrates. Moreover, in piRNAs of some of the vertebrates and invertebrates the 2′ or the 3′ oxygen is blocked by 5′ monophosphate and a 3′ modification which helps in piRNA stability.54–57 These RNAs are connected to the argonaute family of proteins for gene silencing58 and act as guardians of the genome to protect it from invasive transposable elements in the germline.58

Bifunctional RNAs: RNAs that can act as both mRNA and ncRNA are called bifunctional RNA or dual-function RNA. Here, the mRNA encodes a

protein whereas the ncRNA does not encode protein. Some well-known examples of bifunctional RNAs are transfer-messenger RNA (tmRNA), SgrS RNA, and RNAIII. A tmRNA is a bifunctional RNA that performs functions of both tRNA and mRNA.59 The 5′ and 3′ ends of tmRNA fold into a structure containing an acceptor stem and a TΨC arm. The wobble base pair (G–U) is formed in the acceptor stem, which allows it to be recognized by alanyl-tRNA synthetase and charged with alanine. Being larger than a tRNA molecule, the tmRNA contains multiple pseudoknots and a specialized open reading frame in place of the anticodon loop. Some other bifunctional RNAs are activator/SRA,60 VegT RNA,61,62 Oskar RNA,63 ENOD40,64 p53 RNA,65 and SR1 RNA.66 A recent work predicted that in yeast as much as 5% of mRNAs function independently as RNAs and it is estimated that this proportion is greater in higher eukaryotes. There are some well-known long ncRNAs such as the co-activator SRA transcript, isoforms of which also encode a protein, and known protein coding genes such as p53 whose transcripts also act as regulatory RNAs.

It is observed that these ncRNAs are transcribed from many parts of the genomes of mammals and other complex organisms. Moreover, microRNAs, snoRNAs, small regulatory RNAs, and unknown RNAs are alternatively spliced and/or processed into smaller products.67

Similar to other RNAs, the ncRNA is also a polymer made of covalently linked nucleotides adenine (A), cytosine (C), guanine (G), and uracil (U). Using the letters 'A', 'C', 'G', and 'U' to represent the four nucleotides, the primary structure is modeled as a sequence of letters. The secondary structure of RNA, including ncRNA, is formed by base pair interactions such as A–U, C–G, and G–U. The A–U and C–G pairs are called Watson-Crick base pairs68 and the 'G–U' pair is known as the Wobble base pair.69 The number of hydrogen bonds in these pairs is two, three, and two, respectively. Wobble base pairs like hypoxanthine–uracil (I–U), hypoxanthine–adenine (I–A), and hypoxanthine–cytosine (I–C) are also observed in RNA where A to I editing takes place. Here, hypoxanthine is the nucleobase of inosine. Adenine is first converted to adenosine or inosine monophosphate (IMP) and then one of them is converted into inosine (I). Finally, it pairs with uracil (U), adenine (A), and cytosine (C). Other base pairs like U–U, U–C, G–A, A–A, and A–C, and base triples like GAC, GGC, and so on are also observed. These base pairs along with the van der Waals interactions enable the ssRNA to fold within itself, thereby forming important substructures like helical stem, hairpin loop, bulge loop, internal loop, multibranched loop, and so forth. Figure 1 represents a schematic view of the aforementioned substructures. The helical stem is frequently observed in many parts of RNA secondary structure and it is formed by contiguous stacking of adjacent base pairs.15 If no base pairs are formed in a part of an RNA sequence, such that it remains single-stranded and bounded by base pairs at the end, then the single-stranded region forms a loop. A loop which terminates at only one helical stem is called a hairpin loop. Single-stranded bases occurring on only one side of a stem are called a bulge loop. When a loop is terminated by two helical stems at two different sides then it is called an internal loop, and if three or more helical stems radiate from a loop then it is called a multibranched loop. Contiguous stacking of base pairs is also observed in a substructure called a pseudoknot. Here, unpaired bases of one substructure (e.g., the loop part in a hairpin) bind to unpaired bases of another substructure.

RNA tertiary structure is formed by hydrogen bonds or hydrophobic interactions between secondary structures like two helices, two unpaired regions, or one unpaired region and a double-stranded helix.70 Interactions between two helices may result in stacking on each other, or they can interact through their shallow grooves (sugar edge involving 2′-OH). Unpaired regions in an internal loop, a bulge, or a hairpin loop can interact with each other to form a pseudoknot. An unpaired region can interact with a double-stranded helix in its shallow or deep groove to form a triple helix tertiary structure. There are also instances where some unpaired bases in a GNRA-tetraloop (where N is any nucleotide and R is a purine) pair with the shallow grooves of double stranded helices to form triples (triple helix). Besides double and triple helices, quadruple helices are also observed in RNA. A quadruplex is formed by interactions between Watson-Crick pairing and noncanonical pairing in the minor groove.71 These various interactions in RNA, responsible for tertiary structure, are influenced by different types of cations such as magnesium. In the process, either the cations bind with the backbone twisting sites to prevent binding between two bases or vice versa. A sharp turn, known as kink-turn (k-turn), also helps in formation of RNA tertiary structure.72 It is observed in the phosphodiester backbone of an RNA helix. It is a common RNA structural motif where a three-nucleotide bulge is flanked by GA pairs. The folding involved in the kink turn is influenced by metal ions or by proteins that target the region for binding.73,72

FIGURE 1 | Different types of secondary substructures in RNA: single-stranded region, bulge loop, hairpin loop, multi-loop, pseudoknot, helical stem, and internal loop. (Reprinted with permission from Ref 15. Copyright 1992).

The base pair interactions within the primary sequence determine the secondary structure of an

RNA. The values of different parameters involved in these interactions can be experimentally obtained74 and tabulated. Hence, the unknown secondary structure of a known RNA sequence can be computationally predicted by using a list of known possible base pair interactions and related parameters. The structure can also be predicted by searching for sequence similarity with known RNAs. These tools are helpful when the sequence is known. Now we discuss the characteristics of different computational tools developed for RNA secondary structure prediction.

COMPUTATIONAL METHODS FOR NCRNA STRUCTURE PREDICTION

The sequences underlying RNA secondary structure are variable and thereby enable the prediction of ncRNA structures from different sequences using computational methods. Moreover, compensatory mutations and conserved sequences in some ncRNAs help in predicting their structures using sequence alignment based methods. We now describe some of the existing methods for predicting RNA secondary structure which are also applicable for ncRNA structure prediction.

Thermodynamic Folding of Single Sequences

Thermodynamic folding of RNA is based on the minimization of Gibbs free energy (ΔG) of the RNA structures.75 The nearest-neighbor energy model (NNEM) is often used for computing the free energy of an RNA. For example, the free energy change for the helix 5′GAUC3′/3′CUAG5′ can be computed as

\[
\Delta G(\text{pred}) = 2\,\Delta G\begin{pmatrix} 5'\text{GA}3' \\ 3'\text{CU}5' \end{pmatrix} + \Delta G\begin{pmatrix} 5'\text{AU}3' \\ 3'\text{UA}5' \end{pmatrix} + \Delta G_{\text{init}} + \Delta G_{\text{AU end penalty}}(\text{per AU end}) + \Delta G_{\text{sym}} \tag{1}
\]

where ΔG_init represents the loss of entropy during initial pairing between the first two bases, ΔG_AU end penalty (per AU end) = 0 as there are no AU base pairs at the end of this helix, and ΔG_sym corrects for twofold rotational symmetry, resulting from the self complementary strand. The energy values of these terms are then obtained from an experimentally estimated tabular list.74 A possible approach to

predict a thermodynamically stable RNA structure is to find all the possible structures and then select the one with the minimum free energy (MFE). At present, approaches involving DP, heuristic search methods like GA and machine-learning tools like ANNs are mainly used. These are now discussed below.

Dynamic Programming

Based on the works in Refs 76 and 77, a DP algorithm for predicting RNA structure was developed by Waterman and Smith.78 It works by minimizing the free energy using an NNEM. The secondary structure is obtained using a traceback procedure by constructing a matrix such that each element represents the free energy of each possible base pair plus the free energy of the secondary structure having the MFE among all the substructures formed from the previous subsequences. Nussinov et al.79 extended the method in Ref 78 for long (1000 nucleotides) ssRNAs.

Zuker and Stiegler80–82 developed mfold (multi fold), a DP-based algorithm, and also a web server for finding the MFE secondary structure. The algorithm also provides suboptimal structures calculated in terms of a percentage (provided by the user) of the free energy of the MFE structure. Similar methods involving additional structural components such as interior loop symmetry/asymmetry and coaxial stacking of helices are available in UNAFold (Unified Nucleic Acid Folding),83 RNAfold84 of the Vienna RNA85 package, and RNAstructure.86

McCaskill developed an equilibrium partition function87 by using the probability of any two nucleotides being paired in a secondary structure. The partition function and the probabilities of all the base pairs are computed by a DP algorithm of polynomial order N³. A probability matrix is then formed for examining the full ensemble of probable alternative equilibrium structures.

Sfold88,89 produces a sampling of the complete structure space using the Boltzmann distribution and a partition function to compensate for the excessively large number of substructures for longer sequences. The partition function is computed by a recursive sampling process. It uses conditional probabilities of substructures and statistically generates representative samples of secondary structures.

A partition function based algorithm for nucleic acid secondary structure prediction involving pseudoknots is developed in the Nucleic Acid Package (NUPACK).90,91 In Kinfold (Kinetic Folding),92 RNA structures are also predicted with pseudoknots. Moreover, the process of co-transcriptional folding is considered. In this method, single helices and individual base pairs are either added or removed as they are faster than nucleic acid folding and unfolding.

Other Computational Methods

As mentioned earlier, besides DP, different heuristic search methods such as GAs and simulated annealing (SA), and machine-learning tools like ANNs are also used for predicting secondary structure involving thermodynamic folding. These are discussed below.

Heuristic search techniques: GAs are one of the widely studied heuristic search methods for RNA secondary structure prediction.93–97 Using free energy minimization, van Batenburg et al.93 developed a GA with binary representation, standard selection, mutation, and crossover operators. A stem-array is developed with a possible list of stems, and a GA population of several possible solutions is constructed with these stem-arrays. The presence and absence of a stem in a solution are represented with 1 and 0, respectively, in the stem-array. Note that each possible solution in GA and some other heuristic search techniques is also called a chromosome as it is guided by the principles of evolution and natural genetics.15 The kinetic behavior of RNA folding is formulated by limiting each potential solution to small fragments of RNA and then increasing the length by 10% of the initial one in each iteration of the GA.

Instead of a conventional GA, a parallel GA involving a computer with 16,384 processors is developed in Ref 96. Initially, the GA is used in creating stems, and their start position, stop position, size, and energy are preserved. Various structures are then explored through each processor by randomly selecting stems and adding them to complete a chromosome, a possible solution in the GA framework. Extensions of this technique are available in Refs 98 and 95.

Another approach involving simulated annealing is used in Ref 99 for predicting secondary structures by iterative formation and disruption of single base pairs. This results in free energy changes. The structures with higher free energy are then probabilistically selected in subsequent iterations, depending on the Boltzmann factor. The concept of sequential folding during transcription is explained using RNA polymerase chain elongation rates. Other investigations using GAs, SAs, particle swarm optimization, ant colony optimization and tabu search are available in Refs 100, 101, 97, 102, 103, and 104.

ANNs: Le et al.105 predicted the secondary structure using the tree representation of RNA where stems are represented as edges, hairpins as vertices of degree one, internal loops and bulges as vertices of degree two, and junctions as vertices of degree three or more. The final tree representation is a combination of trees obtained from smaller structures like known RNA, RNA-like candidates, and not RNA-like candidates. The vertex identification results of the tree are then

used to train a three-layered back propagation ANN to distinguish a tree as RNA or not.

Nussinov et al.106 first represented the RNA structure as a circular graph and, using it, Liu et al.68 predicted RNA secondary structure in a Hopfield neural network (HNN) framework. The nucleotides of an RNA are arranged in a circular fashion. Each arc of the circle represents a base pair and each neuron in the HNN represents a base pair or arc. An output value of 1 is obtained for a neuron when the corresponding arc or the base pair is not considered to be a part of the predicted RNA structure. The concept of free energy minimization is incorporated by selecting or discarding a base pair (i.e., off or on state of a neuron) that lowers the energy of the network.

Using Probabilistic Models

The probabilistic models predict RNA structure from a set of known features. The features can be sequences, structures, and alignments that are distributed over specific RNA families.107 These features can be used for training the model, thereby enabling it to determine different unknown features. The SCFG-based approach is one of the probabilistic models that are widely used in RNA secondary structure prediction. It is a generalization of the hidden Markov model. Context-free grammars (CFG) are concepts from formal language theory. A CFG (G) is defined by the 4-tuple G = (V, N, S, P), where V is the set of nonterminal variables, N is a finite set of terminal variables, S is the start symbol, and P is the set of production rules. For an RNA sequence, the terminals are the elementary symbols A, U, C, and G, which cannot be changed with rules and are defined by a formal grammar in a language. Nonterminal symbols are replaced by terminal symbols to build up all possible RNA secondary structures according to the production rules.

In Ref 108, the RNA structure is represented as a parse tree by using CFG. The grammar is described in the text and each node in the tree corresponds to a production rule of the grammar, which in turn corresponds to a structure element of the RNA (base pair, single nucleotide, and bifurcation).

An SCFG-based method that uses prior knowledge about RNA structure is introduced in Ref 109. This is performed on a structural alignment of sequences. Information about the structure is revealed by the phylogenetic tree of the sequences and by using mutation processes in RNA. The phylogenetic tree of sequences is obtained by maximum likelihood (ML) estimation and it provides one common structural prediction for all the sequences. Pfold110 further improves the algorithm developed in Ref 109 by making it faster and more robust toward alignment errors.

CONTRAfold111 predicts the secondary structures using a conditional log-linear model (CLLM), which generalizes SCFGs. CLLM is based on discriminative training, flexible parameterization, and accuracy-adjustable optimization. The conditional probability of a structure is modeled as a log-linear function of parameters using CLLM. The parameters are represented as scores for 13 structural features such as base pairs, lengths of hairpins, helices, bulge loops, internal loop asymmetry, and so on. Finally, the predicted structure maximizes the expected accuracy, which is a user-defined trade-off between sensitivity and specificity of base pair predictions.

for Homologous Sequences

Homologous RNA sequences are expected to have similar structure, sequence and function. However, they may have some different but valid base pairs. These base pairs, resulting from mutations, may have some specific role and are considered as evidence of selective pressure.112

David Sankoff113 introduced a DP method that simultaneously aligns and folds multiple RNA sequences. The alignment and folding information are integrated through an objective function that makes a trade-off between free energy and alignment cost through a weighted sum. The algorithm can predict the secondary structure as well as the ancestral sequences on a phylogenetic tree. The known sequences are associated with the terminal vertices in the tree and ancestral sequences with the nonterminal ones. The algorithm counts the smallest number of possible state changes to minimize the cost of a tree. This is accomplished by optimizing a cost matrix C where each element $c_{ij}$ represents the cost of moving from a state i to a state j along any branch in the tree. The cost matrix is also used to find the set of optimal states at the interior nodes of the tree. Let $S_v(k)$ be the smallest number of steps needed to evolve the subtree at or above node v, where node v is in state k. For a leaf node v at a tip, $S_v(k) = 0$ if node v is in state k; otherwise, $S_v(k) = \infty$. If v is a node whose immediate descendants are u and w, then

$S_v(k) = \min_i \left( c_{ki} + S_u(i) \right) + \min_j \left( c_{kj} + S_w(j) \right).$

The minimal cost of the tree is given by $\min_k S_v(k)$, with v taken as the root. The method is useful for detection of homologous RNA structures in metagenomics when several sampled sequences may not have a reference species.

Some of the restricted versions of the Sankoff algorithm are Foldalign,114,115 Dynalign,116,117 PMmulti & PMcomp,118 Stemloc,119 and Murlet.120
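The recurrence for $S_v(k)$ given above can be sketched in a few lines. This is a minimal illustration, not the implementation of Ref 113: the function name `sankoff_cost`, the dictionary-based tree encoding, and the unit cost matrix in the example are assumptions made here for clarity.

```python
import math


def sankoff_cost(tree, leaf_states, cost, root):
    """Evaluate S_v(k) bottom-up on a rooted binary tree.

    tree        : dict mapping each internal node to its two children (u, w)
    leaf_states : dict mapping each leaf to its observed state index
    cost        : square matrix, cost[i][j] = cost of moving from state i to j
    root        : name of the root node
    Returns (minimal tree cost, table S with S[v][k] for every node v).
    """
    n_states = len(cost)
    S = {}

    def visit(v):
        if v not in tree:
            # Leaf: 0 for the observed state, infinity for every other state.
            S[v] = [0 if k == leaf_states[v] else math.inf
                    for k in range(n_states)]
            return
        u, w = tree[v]
        visit(u)
        visit(w)
        # S_v(k) = min_i (c_ki + S_u(i)) + min_j (c_kj + S_w(j))
        S[v] = [
            min(cost[k][i] + S[u][i] for i in range(n_states))
            + min(cost[k][j] + S[w][j] for j in range(n_states))
            for k in range(n_states)
        ]

    visit(root)
    return min(S[root]), S


if __name__ == "__main__":
    # Hypothetical example: states 0..3 stand for A, C, G, U; every change costs 1.
    unit = [[0 if a == b else 1 for b in range(4)] for a in range(4)]
    tree = {"r": ("x", "y"), "x": ("l1", "l2"), "y": ("l3", "l4")}
    leaves = {"l1": 0, "l2": 0, "l3": 1, "l4": 0}
    best, table = sankoff_cost(tree, leaves, unit, "r")
    print(best)  # 1: a single A-to-C change explains the four leaves
```

With a non-uniform cost matrix (for example, cheaper transitions than transversions) the same recursion applies unchanged; only `cost` differs.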

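The grammar-based view of RNA structure described earlier under the probabilistic models, where parse-tree nodes correspond to a base pair, a single nucleotide, or a bifurcation, can be illustrated with a toy context-free grammar. The sketch below is not the grammar of Ref 108 or Ref 109. The production set (S -> x S y, S -> x S, S -> S S, S -> empty), the node labels, and the function name `expand` are assumptions chosen for illustration, and no probabilities are attached to the rules.

```python
# Canonical and wobble pairs allowed by the pairing production (an assumption).
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def expand(node):
    """Expand a parse-tree node into (sequence, dot-bracket structure).

    Node shapes, one per production rule of the toy grammar:
      ("end",)                      S -> empty string
      ("pair", x, child, y)         S -> x S y   (x pairs with y)
      ("single", x, child)          S -> x S     (x is unpaired)
      ("bifurcation", left, right)  S -> S S
    """
    kind = node[0]
    if kind == "end":
        return "", ""
    if kind == "pair":
        _, x, child, y = node
        if (x, y) not in ALLOWED_PAIRS:
            raise ValueError(f"{x}-{y} is not an allowed base pair")
        seq, db = expand(child)
        return x + seq + y, "(" + db + ")"
    if kind == "single":
        _, x, child = node
        seq, db = expand(child)
        return x + seq, "." + db
    if kind == "bifurcation":
        _, left, right = node
        left_seq, left_db = expand(left)
        right_seq, right_db = expand(right)
        return left_seq + right_seq, left_db + right_db
    raise ValueError(f"unknown node type: {kind}")


if __name__ == "__main__":
    # A hairpin: two stacked pairs closing a three-nucleotide loop.
    loop = ("single", "A", ("single", "A", ("single", "A", ("end",))))
    hairpin = ("pair", "G", ("pair", "C", loop, "G"), "C")
    print(expand(hairpin))  # ('GCAAAGC', '((...))')
```

In a stochastic CFG each production would additionally carry a probability, so that the probability of a parse tree is the product of the probabilities of the rules it uses.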

LocARNA121 is one of the fastest and most accurate tools for computing local alignments of RNA. It uses a variation of the Sankoff algorithm and can handle several thousand candidate sequences. Here, multiple alignments are constructed from pairwise alignments using a progressive alignment strategy. LocARNA simultaneously folds and aligns the input RNA sequences and then provides a multiple alignment along with a consensus structure as output. For the folding, it uses RNAfold of the Vienna RNA package.

Computational methods like covariation analysis detect compensatory mutations. Here, the preserved base pairs in a multiple sequence alignment are analyzed by considering each sequence position as a random variable and calculating the mutual information of each pair of random variables. The accuracy of the method relies on conservation of sequences in the multiple sequence alignment and on sufficient diversity of sequences to show covariations.112

The method of Zuker and Stiegler82 can predict the fold of an RNA molecule with MFE and can also use phylogenetic data on secondary structure conservation. Hofacker et al. extended the method of Zuker and developed a web server,122 called RNAalifold, to compute the consensus structure from a set of aligned RNA sequences. The method is based on a thermodynamic model, covariance score, and phylogenetic information. It predicts the minimum energy structure from a set of aligned sequences. A matrix, A, is considered for the multiple sequence alignment of N sequences, where $A_i$ is the i-th column of the alignment. A function $f_i(X)$ is taken as the frequency of base X at aligned position i. A related function $f_{ij}(XY)$ is also considered, which indicates the frequency of co-occurrence of bases X and Y in the alignment columns i and j, respectively. The mutual information score123,124 is used to quantify sequence covariation for a large number of sequences. An improvement of this method in terms of prediction accuracy is available in Ref 125. Here, the alignment gaps are handled more efficiently and RIBOSUM-like scoring matrices126 are used instead of the covariance score.

Washietl et al.127 developed the RNAz program, which combines comparative sequence analysis and structure prediction tools for prediction of ncRNAs. RNAalifold125 is used for finding the consensus sequence. The consensus MFE is computed by combining the average MFE of the single sequences with sequence covariation terms added to the folding energy model. The sequence covariation terms are computed from compensatory and consistent mutations. If the sequences in the alignment fold into a conserved common structure, then the average of the individual MFEs, calculated using RNAfold, is close to the MFE of the alignment. In contrast, a higher MFE of the alignment than the average MFE indicates that no conserved structure exists. An improvement of RNAz is available in RNAz 2.0.128

The tool CMfinder129 can predict a consensus structure even from unaligned sequences using mutual information and single sequence structure prediction methods. For a reliable prediction, there should be a common motif in a subset of sequences. First, the method uses an expectation maximization (EM) framework to search the motif space and then it uses a Bayesian framework to combine the mutual information and folding energy to predict RNA structures.

As mentioned earlier, the computational tools for identifying secondary structures are useful when the sequence is known. A detailed analysis of RNA structure and a description of some of the computational tools discussed so far in this article are available in Ref 20. However, new databases of sequences as well as the noncoding part of unknown sequences also need to be identified for discovering new ncRNAs. In this regard, additional tools and databases like metagenomes are required. Thereafter, one can use the existing computational tools for secondary structure prediction. Additionally, the metagenomic sequences need to be processed to make them as accurate as possible before starting the structure prediction tasks. The processing involves some basic tasks which are described in the next section.

BASIC TASKS IN METAGENOMICS AND RELEVANCE OF NCRNAS

Recent studies130,131 on metagenomic sequences suggested that they are very useful in identifying new ncRNAs. However, metagenomic sequences cannot be used readily for identifying ncRNAs due to the presence of redundant sequences and bases, low-quality bases predicted by programs, and erroneous assembly of sequence fragments. In this regard, new tools are developed132–134 which can handle the aforementioned issues in a better fashion. This in turn leads to a new research area where the issues are addressed as the basic tasks in metagenomics. These tasks gained more significance with the demand for error-free metagenomic sequences for predicting ncRNAs and genes. As mentioned earlier, the imprecise functionality of many transcription processes points toward many important ncRNAs which are still unidentified and might be discovered by using metagenomic databases. Hence, the importance of metagenomics also increased for understanding the functionality of various ncRNAs. In the following paragraphs, first we describe the basic

tasks in metagenomics and then the relevance of ncRNAs in analyzing metagenomic sequences.

The basic tasks in metagenomics mainly involve execution of prefiltering steps, assembly of sequence fragments into contiguous stretches of DNA, and prediction of genes. A brief description of these tasks is provided below.

Execution of prefiltering steps: It mainly deals with identification of redundant and low-quality sequences and removal of them.135 While the redundant sequences are mainly those with repetition of the same bases and are handled with programs searching for exact matches, low-quality sequences are those which will not serve the purpose (say, annotation) or in which certain bases are not predicted with highest accuracy by fragment assembly programs. For example, PhymmBL (metagenomic phylogenetic classification with interpolated Markov models),132 uses interpolated Markov models to assign bases to sequences.

Assembly of sequence fragments: This process is based on sequence similarity where contiguous stretches of DNA, called contigs,18 are obtained using many fragments. The consensus sequence for a contig is either based on the highest-quality nucleotide in any given read at each position or based on majority voting, i.e., the most frequently observed nucleotide at each position. The assembly programs mainly use information from paired-end tags in order to improve the accuracy of assemblies. Programs such as Phrap (Phil’s Revised Assembly Program)136 and Celera Assembler134 can assemble not only single genomes from multiple fragments but also metagenomic fragments. Other programs such as Velvet137 use De Bruijn133 graphs for assembling the shorter reads produced by second-generation sequencing.

Prediction of genes: Gene prediction from metagenome fragments or sequences is mainly performed using two different approaches. The first approach deals with identifying genes from publicly available sequence databases, using sequence similarity based BLAST searches. This approach is used in MEGAN (MEtaGenome ANalyzer).138 The second approach uses a supervised learning procedure where genes in a sequence are predicted by training the related classifier with intrinsic features of genes in related organisms. Such classifiers are available in GeneMark139 and GLIMMER (Gene Locator and Interpolated Markov ModelER).140 The next step is to map those predicted genes into known and well-annotated gene families. In most of the cases, this mapping stops at clustering the genes into a group having similar bases. If the data are obtained using short-read (fragment)-based methods, then the annotation process becomes harder as only fragments of proteins are available there. MG-RAST (MetaGenome Rapid Annotations based on Subsystem Technology)141 and IMG/M (Integrated Microbial Genomes)142 perform comparative functional and sequence-based analysis for an uploaded metagenomic sequence and provide its annotation. The aforementioned methods mainly rely on annotated sequences available in TIGRFAM,143 Pfam (Protein FAMilies),144 PDB (Protein Data Bank),145 KEGG (Kyoto Encyclopedia of Genes and Genomes),146 and COGs (Clusters of Orthologous Groups).147

The tasks in metagenomics, described so far, are also crucial for identifying new ncRNAs as metagenomic sequences contain both coding and noncoding regions. The proportion of noncoding sequences in some metagenomes is up to 21%, and these contain many significant sequences.148 Revealing the functionality of these ncRNAs in metagenomes could provide new directions in biological research. For example, ncRNAs are found to be involved in sequence-specific recognition of other nucleic acids (e.g. mRNAs and DNAs) and siRNAs are used to protect our genome by recognizing and degrading the invading foreign RNAs/DNAs based on the sequence specificity.149,150 The computational methods involving DP, SCFG,109,151 GAs,93–97 ANNs,152,153 and so on, discussed so far, are important for studying ncRNA structures, but the functionality of ncRNAs involving metagenomics is the key to understanding the various biological processes. The processes include regulatory elements and signal transduction mechanisms in prokaryotic or simple eukaryotic organisms,19 phylogenetic diversity of microorganisms,17 metabolic pathways,18 and so on. Metagenomics also provides a platform for making accessible and studying the genetic makeup of thousands of microbial ncRNAs in a single environment that cannot be grown in the laboratory.154 Note that computational methods developed for protein coding genes often fail when searching for ncRNAs155,156 and one can assume that the roles of protein coding RNAs are already widely studied and available in the literature.157 In this regard, new computational methods are developed for analyzing the noncoding part in a sequence.

METHODS IN METAGENOMICS INVOLVING NCRNAS

The basic tasks in metagenomics provide a platform for processing the sequences and making them more accurate for ncRNA prediction. However, in analyzing these sequences, researchers faced new challenges

that mainly constitute two parts: (1) identifying the coding and noncoding regions, and (2) predicting the structure of the noncoding region. These make the task of ncRNA annotation more complex than the task of predicting a structure from a given sequence. On top of that, technologies like next-generation sequencing (NGS) are generating a huge amount of sequences with high sequencing efficiency and reduced cost from various metagenomic databases. Hence, researchers have put effort into either modifying the existing methodologies or developing new techniques for identifying novel ncRNAs from metagenomes. This constitutes a new research area involving recent methodologies for ncRNA analysis from metagenomic sequences.

In order to determine the functional relationships among the sequences that are often co-regulated and involved in the same cellular process, computational analysis involving clustering115,158 and classification is necessary. Clustering, classification, and visualization of similarity patterns are considered to be useful computational tasks to detect genes or parts of sequences that are co-expressed or are implicated in similar cellular functions. These tools provide a group of sequences that are functionally related. Understanding the function of ncRNAs, in particular in the age of high-throughput experiments, is enhanced with the development of computational approaches in metagenomics. Algorithms to annotate, organize and functionally characterize ncRNAs are of increasing relevance.

The available investigations in ncRNAs are mostly directed towards assembling the reads and identifying the ncRNAs using sequence comparison or nucleotide frequencies. Although a huge number of structure prediction tools are developed for RNA and ncRNA analysis, at present there is a lack of application of these tools in identifying ncRNAs from metagenomic datasets as it is a new research area. The tools include ANNs, heuristic search techniques and the equilibrium partition function. For example, GAs and simulated annealing can be used to predict not only the optimal structures but also other structures that may be closer to the natural fold. They can also be used to estimate certain energy parameters74 and can incorporate the concept of cotranscriptional folding by addition or removal of single helices and base pairs. The most advantageous application of GAs can be in massive reduction of computation time where the possible solutions (chromosomes) are implemented in multiple processors in parallel. Furthermore, another promising approach for secondary structure prediction involves DP and an equilibrium partition function that uses free energy, temperature, and number of substructures. These approaches have shown their utility in predicting RNA structure but until now they have not been investigated for identifying ncRNAs from metagenomes. The potential of the secondary structure prediction tools, involving free energy minimization of a single sequence and predicting the structure of new ncRNA, is explored in only a few cases. In general, methods exploring similarity and composition of sequences are used to analyze ncRNAs in metagenomes. While the similarity based methods provide information about functional and structural features of noncoding sequences, composition-based methods can reveal the codon and anticodon patterns, and group metagenomic fragments. The computational tools developed for these tasks are discussed below.

Metagenome Description by Similarity-based Methods

Similarity-based methods are used to analyze the taxonomic content and to find the relation between the functional and structural features of a metagenome. In Ref 159, 16S rRNAs and 18S rRNAs are used to determine the evolutionary relationships between known and unknown marker genes in pathogenic bacteria having reads within 1000 bp using BLAST.160 The webservers MEGAN (MEtaGenome ANalyzer)138 and CARMA (Computer-Aided Resource for Morphological Analysis)161 are used to handle the large number of short fragments (35 bp for MEGAN and 80 bp for CARMA). While MEGAN is used to align reads to databases of known sequences using BLAST, a phylogenetic classification of reads is performed with CARMA161 using all Pfam domains and protein families as phylogenetic markers.

To take advantage of the NGS technology, Tanaseichuk et al.162 developed a two-phase heuristic algorithm and related software for handling closely related sequences from distantly related species. This similarity-based algorithm can separate short paired-end reads obtained from different bacterial genomes in a metagenomic dataset. In the first phase of the algorithm, for each of the genomes a cluster is formed with unique l-mers, where the value of l is experimentally obtained as 20 for complete bacterial genomes downloaded from NCBI. In the process, unique l-mers and l-mers with repeats are identified using a threshold on the Poisson distribution of their counts. In the second phase, the l-mer repeat information is used to merge clusters belonging to the same genome. These final clusters are used to assign reads. The algorithm is able to separate genomes at various phylogenetic distances when the number of

common repeats is small as compared to the number of genome-specific repeats. The performance deteriorates for sequences of closely related species because the fraction of common repeats in genomes correlates with the phylogenetic distance between the genomes. It is pointed out that composition based methods cannot classify short reads and are only applicable for large fragments where the composition patterns are preserved.

Another class of NGS, the pyrosequencing technique, is explored in Ref 163. Here, RNA sequences are extracted from metagenomic samples available in the Pacific Ocean. A large fraction of complementary DNA (cDNA) sequences is detected which is comprised of well-known small RNAs (sRNAs) as well as unrecognized putative sRNAs (psRNAs). First, rRNA sequences are identified using BLASTN164 by comparing them with an rRNA database and are then removed from the subsequent analysis. Similarly, protein-coding cDNA reads are recognized by querying with BLASTX164 against peptide databases and a marine-specific peptide database. Next, the remaining sequences are compared with IGRs of known marine planktonic microorganisms to identify sRNAs with high nucleotide similarity. The comparison is performed using the covariance-model program INFERNAL (INFERence of RNA ALignment).165 Finally, psRNAs are identified by comparing with the genomes recovered from similar habitats and conserved secondary structures. In the process of annotating noncoding cDNAs, a self-clustering approach based on BLASTN is introduced. All unknown noncoding cDNAs are compared with each other and the cDNA read with the most matches is used as a seed sequence of the first cluster. After all matches of the seed sequence are grouped in a cluster, the next seed is identified in a similar way to form the next cluster. The process continues until all the cDNA reads are assigned to some cluster. In brief, these clusters are mainly used for BLAST comparisons and unknown sRNAs are annotated to 13 known sRNA families.

Like Ref 163, both the sequence and the secondary structure information is used in Ref 131 to identify ncRNAs from 108 bacterial sequences. Here, a combination of sequence and secondary structure similarity-based method is used to cluster IGRs. First, IGRs are extracted by finding and removing the coding regions annotated in RefSeq (Reference Sequence).166 BLAST is used in this step. Then a hierarchical clustering method, called folded-BLAST, is introduced that uses IGRs and RNALfold from the Vienna package to compute locally stable RNA secondary structures with a maximal base span L. Finally, these locally stable structures are used for motif prediction using CMfinder.129 A similar work in identifying two exceptional ncRNAs, GOLLD (Giant, Ornate, Lake- and Lactobacillales-Derived) and HEARO (HNH Endonuclease-Associated RNA and ORF), from bacterial metagenomes is available in Ref 167. They are exceptional in size and structural complexity, or in their unusually high abundance. GOLLD is commonly identified in a place adjacent to tRNAs. HEARO is predicted from an ORF that is often embedded within it and that codes for the HNH endonuclease, characterized by the HNH sequence.

The methodology in Ref 168 is similar to that in Ref 131 where IGRs are clustered using a BLAST-based method and secondary structures are inferred using CMfinder.129 Many ncRNAs with conserved secondary structure are identified and putative functions are inferred from metagenomic sequences of bacteria and archaea. These structures are further used for finding homologues and refinement of the structural alignment by CMfinder. Finally, the alignment is analyzed manually to identify structured ncRNAs and their functions. The general assumptions behind the manual analysis are: (1) cis-regulatory RNAs (noncoding region) are commonly located in 5′ UTRs, (2) motifs that constitute the strongest riboswitch candidates have tight and highly selective binding pockets in their secondary structure for metabolite ligands, and (3) ligand binding stabilizes the conserved structure of riboswitches and ‘higher ligand concentrations are expected to inhibit terminator stem formation and increase gene expression’.168 From the analysis, some noticeable structures are observed, such as a conserved stem-loop structure among JUMPstart sequences, many G–U base pairs and GNRA tetraloops in the ribosomal protein L17 motif, and repeated pseudoknots in some motifs. These findings may help researchers in incorporating domain-specific knowledge in MFE-based secondary structure prediction tools to predict ncRNAs from metagenomic sequences.

Another attempt at predicting ncRNAs using sequence and structure information is available in Ref 169. The method, called REAPR (RE-Alignment for Prediction of structural ncRNA), is developed to boost the performance of de novo ncRNA predictors by realigning whole genomes based on RNA sequence and structure. It is pointed out that for whole-genome alignments (WGAs) using conventional similarity-based methods there is a possibility of misalignment in regions of conserved structure if the sequences are different. Moreover, if the sequence identity is below 60% then purely sequence based methods fail to identify the regions. Using an existing alignment matrix (computed using DP), REAPR searches for an alignment within a

band around a reference alignment instead of searching within a band around the diagonal. This enables the method to find a structure-based alignment and to improve the original alignment by preserving the local dinucleotide content and the conservation patterns. The initial WGAs are performed using LocARNA121 and improved using REAPR. Finally, ncRNAs are predicted using RNAz.127

Besides the aforementioned similarity-based approaches, additional information involving ncRNA keywords in the literature is also used. An ncRNA database, excluding tRNAs and rRNAs, is developed in Ref 170 by searching the PubMed database with ncRNA keywords and extracting the ncRNA sequences from the relevant literature. Keywords are also extracted from that literature and further used to filter the GenBank data and to annotate the ncRNAs. In the final stage, the results are manually confirmed with the help of related literature and the information regarding function, cellular role, cellular location, and so on, is included in the database. The confirmed ncRNAs are assigned accession numbers and the Vienna RNA Package171 is used to predict the secondary structure of only nonredundant ncRNA sequences. Recent versions of this database provide more information about long ncRNA genes.

Metagenome Description by Composition-based Methods

Composition-based methods can be used for annotating metagenomic sequences when the codon usage and GC contents follow a particular frequency for a specific genome and the frequency varies for different organisms. The GC content is first used in Ref 172 to annotate coding and complete (coding and noncoding) metagenomic sequences. Codon, di-, tetra-, or pentanucleotide frequencies are also used in some investigations to identify regularities among microorganisms, compare structures and assign taxonomic groups. In general, the k-mer frequency is studied for each genome where k is the number of nucleotides in every fragment.

A study with known sequences is conducted in Ref 173 for 1 < k < 6 to find a value for which there is a stable distribution of k-mer frequencies in fragments above 1000 bp. A similar k-mer frequency distribution for two different genomes or fragments indicates that they are biologically highly correlated and can be seen as the same group or cluster. In contrast, the distribution is different for horizontally transferred genes, prokaryotes, eukaryotes, mitochondria, and plastids. Similarity between k-mer frequency distributions is referred to as phylogenetic closeness between genomes and this property is used for frequency distribution-based metagenome grouping. The frequency distribution for each genome is represented with a matrix where columns represent a function that varies for different values of k and rows represent the ratio between genome_length and size of fragments. In a part of the investigation, two 16S rRNAs are grouped into nine clusters and for each cluster the average sequence similarity for all possible pairs is computed. Plots of average sequence similarity of nine clusters versus barcode distance, computed using the k-mer frequency distribution, reveal that the barcode distance decreases with the increase of sequence similarity and that barcodes for known ncRNAs can be effective in identifying other ncRNAs in metagenomes. The details of the barcode distance calculation are available in Ref 173. The main challenge in the k-mer frequency approach is that these frequencies produce large feature vectors that can be even larger than the sizes of fragments. The length of fragments may also significantly influence the performance.

Different methods have been proposed to study the k-mer frequency distribution. CompostBin174 uses hexamer frequencies to bin raw and short sequence reads without assembly or training. Considering a sequence of length k in which each nucleotide has four possibilities, the feature vector of the sequence can have $4^k$ possible dimensions. In this regard, a weighted principal component analysis (PCA) algorithm is used to project the high-dimensional DNA composition data into an informative lower-dimensional space. Thereafter, the normalized cut clustering algorithm, involving semi-supervised information from phylogenetic relations, is used on this filtered data set to classify sequences into taxon-specific bins. The method mainly uses frequencies of hexamers (k = 6), which is motivated by the fact that the length of two codons is 6. The length is also restricted on the higher side by the memory and CPU performance of the computers available at that time.

Two special cases of the k-mer frequency-based approach, the tri- and tetranucleotide frequencies, are used in Ref 175 to cluster fragments of metagenomes. Here, an unsupervised growing self-organizing map (GSOM), called Seeded GSOM (S-GSOM), is implemented using a semi-supervised seeding method for binning (clustering) fragments according to phylogeny. In most cases the number of seeds is one per species. The method uses fragments around the highly conserved 16S rRNA sequences as seeds and is not dependent on the knowledge of completed genomes. If the seeds involving 16S rRNA are available then GSOM can identify sequences of unknown metagenomes by analyzing the phylogenetic

Volume 5, January/February 2015 © 2015 John Wiley & Sons, Ltd. 13 Advanced Review wires.wiley.com/widm relationships between the clusters associated with the found by searching IGRs with high GC content. The seeds. In contrast, when the seeds are not available RefSeq (Reference Sequence) database166 with known a certain percentage of fragments from a cluster are nucleotide annotations and the Rfam database13 considered and BLAST search is performed against with known conserved structures are used in the known databases to identify any known marker genes identification process. Their results indicate that from those fragments. If any marker gene is identified GC-enriched IGRs, longer than 100 bp in P. ubique, then phylogenetic analysis can relate those fragments are mostly known ncRNAs. Computational methods with unknown or known metagenomes. like RNAshapes,179 CMFinder,129 and RAVENNA180 Another k-mer frequency-based approach for are used for the analysis of RNA. metagenomic fragments is available with the program Another approach involving dinucleotide con- MetaCluster.176 It uses Spearman’s footrule distance tent of metagenomic sequences and additional infor- between trinucleotide frequency vectors to bin metage- mational like energy distribution of longer simulated nomic fragments of species. This includes species with sequences with similar average content is investigated balanced abundance ratios (say, 1:1) to very dif- in Soldatov et al.181 They developed a code, called ferent abundance ratios (e.g. 1:24). The method is RNASurface, to improve the precision in identifying based on the unsupervised top–down separation and known ncRNAs in . The method bottom–up merging strategy. However, the approach can detect structured ncRNA from a sequence when in TETRA177 is only useful when the abundance ratios homologous sequences are not available for compar- of the species in the metagenomic sample are almost isons. 
First, Zuker algorithm82 is used to construct the same. Here, z-scores are computed for tetranu- an MFE matrix that stores the MFEs for each subse- cleotide frequencies and fragments are classified by the quence, and then a Z-score matrix is computed which Pearson correlation of their z-scores. stores the Z-scores for the same subsequences. The The trinucleotide frequency as well as dinu- Z-score used in this investigation is a modified version cleotide frequency of sequences is used in Ref 172 of that in Ref 108. It involves MFE of the subsequence, to annotate noncoding sequences in metagenomes. and average and standard deviation of the energy dis- The method uses the codon for coding sequences tribution of simulated sequences with the same and trinucleotide contents for complete sequences. average dinucleotide content but sufficiently large in The radial distribution of the contents is used to length as compared to the subsequence. The idea is find the similarities and differences between coding to capture the bias of dinucleotide content of ncRNA and complete (coding and noncoding) sequences. as compared to its genomic background. For a given High peaks are identified from the distribution for sequence, the MFE matrix and the Z-score matrix are GC content in coding sequences at 68, 62, 56, and used to define a surface of structural potential. Locally 44.5% of the length of the sequence. For the complete optimal structures are represented as peaks in this sur- sequences, the peak for GC content is observed at face. The peaks are identified by comparing each cell 43%. Hence, noncoding sequences have higher pro- in Z-score matrix with its adjacent cells. A web server portion of GC content and the information is used is also developed to visualize the surface of structural to identify them. It is also pointed out that protein potential. 
synthesis can be performed by the codons and triplets at the same time and trinucleotide compositions CGC, CCG, TTT, and AAA, which are different CONCLUSION from codons in complete sequences, are highly used A review on some existing computational method- in noncoding sequences. These compositions may ologies for ncRNA analysis and how ncRNAs can be relevant structural features of metagenomes like be identified from metagenomic samples is presented. trinucleotide repeat sequences.154 For example, In this regard, different ncRNAs and structural aquatic metagenomes (Methylotrophic commu- elements of the RNA are discussed. Various compu- nity from Lake Washington sediment (MLWSF)) tational methods involving DP, GA, ANN, SCFG, have many trinucleotide repeat sequences and and sequence alignment-based methods for secondary they may indicate a new environmental-dependent structure prediction are explained along with their feature.172 merits. The basic concepts of metagenomics along A work similar to that in Ref 172 is available with their biological basis and the relevance of in Ref 178. Here, ncRNAs such as rRNA, tRNA, annotating ncRNAs from metagenomic databases and riboswitches are identified from RNA sequences are explored. Structure, function, and taxonomic extracted from marine metagenome P. ubique. These analyses for noncoding sequence in metagenomes ncRNAs show conserved secondary structures and are are mainly performed using computational methods

14 © 2015 John Wiley & Sons, Ltd. Volume 5, January/February 2015 WIREs Data Mining and Knowledge Discovery Noncoding RNAs and their annotation involving sequence similarity and base composition characteristics of the metagenomic datasets (e.g., frequency-based computational methods. From the short fragment length and repeated sequence). Hence, existing investigations, it is revealed that metagenomes designing specific tools may be the key for increasing can be a powerful database for identifying new ncR- the accuracy of the results. Even though the exist- NAs as well as new environmental and biological ing approaches in structure prediction are useful in features. identifying patterns and functions, the computational The performance of the existing ncRNA pre- problems related to ncRNAs and metagenomics will diction methods are restricted by some general remain challenging for the coming years.
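As an illustration of the composition-based features reviewed above, the following minimal Python sketch builds the 4^k-dimensional k-mer frequency vector of a fragment and compares two fragments with Spearman's footrule distance, the measure named for MetaCluster. It is not the code of any cited tool: the function names (`kmer_frequencies`, `ranks`, `footrule_distance`), the example fragments, and the rule of breaking rank ties by position are assumptions made only for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer (a 4**k-dimensional vector)."""
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


def ranks(values):
    """Rank of each entry (0 = smallest); ties are broken by position (assumption)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def footrule_distance(u, v):
    """Spearman's footrule: sum of absolute rank differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(ranks(u), ranks(v)))


# Hypothetical fragments, used only to exercise the functions.
frag_a = kmer_frequencies("ACGTACGTACGTACGT", 3)
frag_b = kmer_frequencies("ACGTACGAACGTACGT", 3)
frag_c = kmer_frequencies("TTTTTTTTGGGGGGGG", 3)

print(len(frag_a))  # dimension of the trinucleotide vector
print(footrule_distance(frag_a, frag_b), footrule_distance(frag_a, frag_c))
```

For hexamers (k = 6) the same function returns a vector with 4^6 = 4096 entries, which is why very short fragments leave most entries at zero and why a dimensionality reduction step such as PCA is applied before clustering.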

REFERENCES

1. Quintana ML, Rauhut R, Lendeckel W, Tuschl T. Identification of novel genes coding for small expressed RNAs. Science 2001, 294:853–858.
2. Hannon GJ. RNA interference. Nature 2002, 418:244–251.
3. Yang Z, Zhu Q, Luo K, Zhou Q. The 7SK small nuclear RNA inhibits the CDK9/cyclin t1 kinase to control transcription. Nature 2001, 414:317–322.
4. Paul S, Weaver SM, Chattopadhyay S, Sokurenk E, Merrikh H. Accelerated gene evolution through replication–transcription conflicts. Nature 2013, 495:512–515.
5. Saze H, Kitayama J, Takashima K, Miura S, Harukawa Y, Ito T, Kakutani T. Mechanism for full-length RNA processing of arabidopsis genes containing intragenic heterochromatin. Nat Commun 2013, 4:1–8.
6. van Gent DC, Hoeijmakers JH, Kanaar R. Chromosomal stability and the DNA double-stranded break connection. Nat Rev Genet 2001, 2:196–206.
7. Storz G. An expanding universe of noncoding RNAs. Science 2002, 296:1260–1263.
8. Lowe TM, Eddy SR. A computational screen for methylation guide snoRNAs in yeast. Science 1999, 283:1168–1171.
9. Liu J, Gough J, Rost B. Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet 2006, 2:29–37.
10. Huttenhofer A, Schattner P, Polacek N. Non-coding RNAs: hope or hype. Trends Genet 2005, 21:289–297.
11. Ray SS, Pal JK, Pal SK. Computational approaches for identifying cancer miRNA expressions. Gene Expr 2013, 15:243–253.
12. Wen J, Parker BJ, Weiller GF. In silico identification and characterization of mRNA-like noncoding transcripts in […]. In Silico Biol 2007, 7:485–505.
13. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, 33:121–124.
14. Ray SS, Halder S, Kaypee S, Bhattacharyya D. HD-RNAS: an automated hierarchical database of RNA structures. Front Genet 2012, 3:1–10.
15. Ray SS, Pal SK. RNA secondary structure prediction using soft computing. IEEE/ACM Trans Comput Biol Bioinform 2012, 10:2–17.
16. Kapranov P, Willingham AT, Gingeras TR. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet 2007, 8:413–423.
17. Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Genetics 2004, 38:525–552.
18. National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. Washington, DC: The National Academies Press; 2007.
19. Chen K, Pachter L. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol 2005, 1:106–112.
20. Washietl S, Will S, Hendrix DA, Goff LA, Rinn JL, Berger B, Kellis M. Computational analysis of noncoding RNAs. WIREs RNA 2012, 3:759–778.
21. Sudarsan N, Barrick JE, Breaker RR. Metabolite-binding RNA domains are present in the genes of eukaryotes. RNA 2003, 9:644–647.
22. Axmann IM, Kensche P, Vogel J, Kohl S, Herzel H, Hess WR. Identification of cyanobacterial non-coding RNAs by comparative genome analysis. Genome Biol 2005, 6:73.
23. Wilderman PJ, Sowa NA, FitzGerald DJ, FitzGerald PC, Gottesman S, Ochsner UA, Vasil ML. Identification of tandem duplicate regulatory small RNAs in pseudomonas aeruginosa involved in iron homeostasis. Proc Natl Acad Sci U S A 2004, 101:9792–9797.
24. Edwards AL, Batey RT. Riboswitches: a common RNA regulatory element. Nat Educ 2010, 3:9.
25. Ponting CP, Oliver PL, Reik W. Evolution and functions of long noncoding RNAs. Cell 2009, 136:629–641.


26. Babak T, Blencowe BJ, Hughes TR. Considerations in the identification of functional RNA structural elements in genomic alignments. BMC Bioinformatics 2007, 8:1–21.
27. Novikova IV, Hennelly SP, Sanbonmatsu KY. Structural architecture of the human long non-coding RNA, steroid receptor RNA activator. Nucleic Acids Res 2012, 40:5034–5051.
28. Goodrich JA, Kugel JF. Non-coding-RNA regulators of RNA polymerase II transcription. Nat Rev Mol Cell Biol 2006, 7:612–616.
29. Dieci G, Fiorino G, Castelnuovo M, Teichmann M, Pagano A. The expanding RNA polymerase III transcriptome. Trends Genet 2007, 23:614–622.
30. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008, 133:1106–1117.
31. Wutz A, Gribnau J. X inactivation Xplained. Curr Opin Genet Dev 2007, 17:387–393.
32. Donoghue PO, Schulten ZL. On the evolution of structure in aminoacyl-tRNA synthetases. Microbiol Mol Biol Rev 2003, 67:550–573.
33. Lehninger AL, Nelson DL, Cox MM. Principles of Biochemistry. 2nd ed. New York: Worth; 1993.
34. Penman S, Smith I, Holtzman E. Ribosomal RNA synthesis and processing in a particulate site in the hela […]. Science 1966, 154:786–789.
35. Rijk PD, Wuyts J, Peer YV, Winkelmans T, Wachter RD. The european large subunit ribosomal RNA database. Nucleic Acids Res 2000, 28:177–178.
36. Lee Y et al. The nuclear RNase III Drosha initiates microRNA processing. Nature 2003, 425:415–419.
37. Cai X, Hagedorn CH, Cullen BR. Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 2004, 10:1957–1966.
38. Grishok A, Pasquinelli AE, Conte D, Li N, Parrish S, Ha I, Baillie DL, Fire A, Ruvkun G, Mello CC. Genes and mechanisms related to RNA interference regulate expression of the small temporal RNAs that control C. elegans developmental timing. Cell 2001, 106:23–34.
39. Lee Y, Jeon K, Lee JT, Kim S, Kim VN. MicroRNA maturation: stepwise processing and subcellular localization. EMBO J 2002, 21:4663–4670.
40. Hammond SM, Boettcher S, Caudy AA, Kobayashi R, Hannon GJ. Argonaute2, a link between genetic and biochemical analyses of RNAi. Science 2001, 293:1146–1150.
41. Martinez J, Patkaniowska A, Urlaub H, Luhrmann R, Tuschl T. Single-stranded antisense siRNAs guide target RNA cleavage in RNAi. Cell 2002, 110:563–574.
42. Schwarz DS, Hutvagner G, Haley B, Zamore PD. Evidence that siRNAs function as guides, not primers, in the Drosophila and human RNAi pathways. Mol Cell 2002, 10:537–548.
43. Eulalio A, Huntzinger E, Nishihara T, Rehwinkel J, Fauser M, Izaurralde E. Deadenylation is a widespread effect of miRNA regulation. RNA 2009, 15:21–32.
44. Allen E, Xie ZX, Gustafson AM, Sung GH, Spatafora JW, Carrington JC. Evolution of microRNA genes by inverted duplication of target gene sequences in Arabidopsis thaliana. Nat Genet 2004, 36:1282–1290.
45. Kawasaki H, Taira K, Wadhwa R. World of small RNAs: from […] to siRNA and miRNA. Differentiation 2004, 72:58–64.
46. Maxwell ES, Fournier MJ. The small nucleolar RNAs. Annu Rev Biochem 1995, 64:897–934.
47. Samarsky DA, Fournier MJ, Singer RH, Bertrand E. The snoRNA box C/D motif directs nucleolar targeting and also couples snoRNA synthesis and localization. EMBO J 1998, 17:3747–3757.
48. Ganot P, Ferrer MC, Kiss T. The family of box ACA small nucleolar RNAs is defined by an evolutionarily conserved secondary structure and ubiquitous sequence elements essential for RNA accumulation. Genes Dev 1997, 11:941–956.
49. Bachellerie JP, Cavaillé J, Hüttenhofer A. The expanding snoRNA world. Biochimie 2002, 84:775–790.
50. Matera AG, Terns RM, Terns MP. Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nature 2007, 8:209–220.
51. Bernstein E, Caudy A, Hammond S, Hannon G. Role for a bidentate ribonuclease in the initiation step of RNA interference. Nature 2001, 409:363–366.
52. Phillips T. Small non-coding RNA and gene expression. Nat Educ 2008, 1:115.
53. Hamilton A, Baulcombe D. A species of small antisense RNA in posttranscriptional gene silencing in plants. Science 1999, 286:950–952.
54. Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP. Large-scale sequencing reveals 21U-RNAs and additional MicroRNAs and endogenous siRNAs in C elegans. Cell 2006, 127:1193–1207.
55. Houwing S, Kamminga LM, Berezikov E, Cronembold D, Girard A, van den Elst H, Filippov DV, Blaser H, Raz E, Moens CB et al. A Role for Piwi and piRNAs in germ cell maintenance and transposon silencing in zebrafish. Cell 2007, 129:69–82.
56. Vagin VV, Sigova A, Li C, Seitz H, Gvozdev V, Zamore PD. A distinct small RNA pathway silences selfish genetic elements in the germline. Science 2006, 313:320–324.
57. Kirino Y, Mourelatos Z. Mouse Piwi-interacting RNAs are 2′-o-methylated at their 3′ termini. Nat Struct Mol Biol 2007, 14:347–348.
58. Siomi MC, Sato K, Pezic D, Aravin AA. PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol 2011, 12:246–258.


59. Keiler KC, Ramadoss NS. Bifunctional transfer-messenger RNA. Biochimie 2011, 93:1993–1997.
60. Leygue E. Steroid receptor RNA activator (SRA1): unusual bifaceted gene products with suspected relevance to breast cancer. Nucl Recept Signal 2007, 5:6–19.
61. Kloc M, Wilk K, Vargas D, Shirato Y, Bilinski S, Etkin LD. Potential structural role of non-coding and coding RNAs in the organization of the cytoskeleton at the vegetal cortex of xenopus oocytes. Development 2005, 132:3445–3457.
62. Zhang J, King ML. Xenopus vegT RNA is localized to the vegetal cortex during oogenesis and encodes a novel T-box transcription factor involved in mesodermal patterning. Development 1996, 122:4419–4429.
63. Jenny A, Hachet O, Závorszky P, Cyrklaff A, Weston MD, Johnston DS, Erdélyi M, Ephrussi A. A translation-independent role of oskar RNA in early drosophila oogenesis. Development 2006, 133:2827–2833.
64. Gultyaev AP, Roussis A. Identification of conserved secondary structures and expansion segments in enod40 RNAs reveals new enod40 homologues in plants. Nucleic Acids Res 2007, 35:3144–3152.
65. Candeias MM, Colas LM, Powell DJ, Daskalogianni C, Maslon MM, Naski N, Bourougaa K, Calvo F, Fahraeus R. p53 mRNA controls p53 activity by managing Mdm2 functions. Nat Cell Biol 2008, 10:1098–1105.
66. Gimpel M, Preis H, Barth E, Gramzow L, Brantl S. SR1 – a small RNA with two remarkably conserved functions. Nucleic Acids Res 2012, 40:11659–11672.
67. Mattick JS, Makunin IV. Non-coding RNA. Hum Mol Genet 2006, 15:17–29.
68. Liu Q, Ye X, Zhang Y. A hopfield neural network based algorithm for RNA secondary structure prediction. Comput Comput Sci 2006, 1:10–16.
69. Crick FH. Codon-anticodon pairing: the wobble hypothesis. J Mol Biol 1966, 19:548–555.
70. Westhof E, Auffinger P. RNA tertiary structure. In: Meyers RA, ed. Encyclopedia of Analytical Chemistry. Chichester: John Wiley & Sons, Ltd; 2000, 5222–5232.
71. Batey RT, Gilbert SD, Montange RK. Structure of a natural guanine-responsive riboswitch complexed with the metabolite hypoxanthine. Nature 2004, 432:411–415.
72. Liu J, Lilley DMJ. The role of specific 2′-hydroxyl groups in the stabilization of the folded conformation of kink-turn RNA. RNA 2007, 13:200–210.
73. Goody TA, Melcher SE, Norman DG, Lilley DMJ. The kink-turn motif in RNA is dimorphic, and metal ion-dependent. RNA 2004, 10:254–264.
74. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A 2004, 101:7287–7292.
75. Turner DH, Sugimoto N. RNA structure prediction. Annu Rev Biophys Biophys Chem 1988, 17:167–192.
76. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443–453.
77. Tinoco I, Uhlenbeck OC, Levine MD. Estimation of secondary structure in ribonucleic acids. Nature 1971, 230:362–367.
78. Waterman MS, Smith TF. RNA secondary structure: A complete mathematical analysis. Math Biosci 1978, 42:257–266.
79. Nussinov R, Jacobson A. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci U S A 1980, 77:6303–6313.
80. Zuker M. On finding all suboptimal foldings of an RNA molecule. Science 1989, 244:48–52.
81. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31:3406–3415.
82. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 1981, 9:133–148.
83. Markham NR, Zuker M. UNAfold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008, 453:3–31.
84. Hofacker IL. Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31:3429–3431.
85. Gruber A, Lorenz R, Bernhart S, Neubock R, Hofacker I. The Vienna RNA website. Nucleic Acids Res 2008, 36:70–74.
86. Reuter J, Mathews D. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 2010, 11:1–9.
87. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 1990, 29:1105–1119.
88. Ding Y, Chan CY, Lawrence CE. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 2004, 32:135–141.
89. Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 2003, 31:7280–7301.
90. Dirks RM, Pierce NA. A partition function algorithm for nucleic acids secondary structure including pseudoknots. J Comput Chem 2003, 24:1664–1677.
91. Dirks RM, Pierce NA. An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. J Comput Chem 2004, 25:1295–1304.
92. Xayaphoummine A, Bucher T, Isambert H. Kinefold webserver for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res 2005, 33:605–610.
93. van Batenburg FH, Gultyaev AP, Pleij CW. An APL-programmed genetic algorithm for the prediction of RNA secondary structure. J Theor Biol 1995, 174:269–280.
94. Gultyaev AP, Van Batenburg FH, Pleij CW. The computer simulation of RNA folding pathways using a genetic algorithm. J Mol Biol 1995, 250:37–51.
95. Shapiro BA, Bengali D, Kasprzak W, Wu JC. RNA folding pathway functional intermediates: their prediction and analysis. J Mol Biol 2001, 312:27–41.
96. Shapiro BA, Navetta J. A massively parallel genetic algorithm for RNA secondary structure prediction. J Supercomput 1994, 8:195–207.
97. Wiese KC, Deschenes AA, Hendricks AG. RNApredict-an evolutionary algorithm for RNA secondary structure prediction. IEEE/ACM Trans Comput Biol Bioinform 2008, 5:25–41.
98. Shapiro BA, Wu JC. An annealing mutation operator in the genetic algorithms for RNA folding. Comput Appl Biosci 1996, 12:171–180.
99. Schmitz M, Steger G. Description of RNA folding by simulated annealing. J Mol Biol 1996, 255:254–266.
100. Neethling M, Engelbrecht AP. Determining RNA secondary structure using set-based particle swarm optimization. In: IEEE Congress on Evolutionary Computation, Vancouver, Canada; 2006, 6134–6141.
101. Tsang HH, Wiese KC. SARNA-predict: Accuracy improvement of RNA secondary structure prediction using permutation based simulated annealing. IEEE/ACM Trans Comput Biol Bioinform 2010, 7:727–740.
102. Xing C, Wang G, Wang Y, Zhou Y, Wang K, Fan L. PSOfold: a metaheuristic for RNA folding. J Comput Inf Syst 2012, 8:915–923.
103. Yu J, Zhang CH, Liu YN, Li X. Simulating the folding pathway of RNA secondary structure using the modified ant colony algorithm. J Bionic Eng 2011, 7:382–389.
104. Liu Y, Hao J, Peng J. Predicting RNA secondary structure with tabu search. IEEE Int Conf Cogn Inform, Beijing, China; 2010, 409–414.
105. Le S, Nussinov R, Maziel J. Tree graphs of RNA secondary structures and their comparison. Comput Biomed Res 1989, 22:461–473.
106. Nussinov R, Piecznik G, Grigg JR, Kleitman DJ. Algorithms for loop matching. SIAM J Appl Math 1978, 35:68–82.
107. Dowell RD. RNA structural alignment using stochastic context-free grammars. PhD Thesis, The institute of Biological Sciences, 2004.
108. Washiet MS. Prediction of structural non-coding RNAs by comparative sequence analysis. Dissertation, Zur Erlangung des akademischen Grades, 2005, 1–127.
109. Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999, 15:446–454.
110. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 2003, 31:3423–3428.
111. Do CB, Woods DA. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22:90–98.
112. James BD, Olsen GJ, Pace NR. Phylogenetic comparative analysis of RNA secondary structure. Meth Enzymol 1989, 180:227–239.
113. Sankoff D. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math 1985, 45:810–882.
114. Havgaard JH, Lyngs RB, Stormo GD, Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40. Bioinformatics 2005, 21:1815–1824.
115. Torarinsson E, Havgaard JH, Gorodkin J. Multiple structural alignment and clustering of RNA sequences. Bioinformatics 2007, 23:926–932.
116. Harmanci AO, Sharma G, Mathews DH. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in dynalign. BMC Bioinformatics 2007, 8:1–21.
117. Mathews DH, Turner DH. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol 2002, 317:191–203.
118. Hofacker IL, Bernhart SH, Stadler PF. Alignment of RNA base pairing probability matrices. Bioinformatics 2004, 20:2222–2227.
119. Holmes I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics 2005, 6:1–22.
120. Kiryu H, Tabei Y, Kin T, Asai K. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics 2007, 23:1588–1598.
121. Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 2007, 3:65–77.
122. Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319:1059–1066.
123. Chiu DK, Kolodziejczak T. Inferring consensus structure from nucleic acid sequences. Comput Appl Biosci 1991, 7:347–352.
124. Gutell RR, Power A, Hertz GZ, Putz EJ, Stormo GD. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res 1992, 20:5785–5795.
125. Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 2008, 9:474–474.
126. Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics 2003, 4:1–16.
127. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 2005, 102:2454–2459.
128. Gruber AR, Findeiß S, Washietl S, Hofacker IL, Stadler PF. RNAz 2.0: improved noncoding RNA detection. Pac Symp Biocomput 2010, 15:69–79.
129. Yao Z, Weinberg Z, Ruzzo WL. CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics 2006, 22:445–452.
130. Jung CH, Hansen MA, Makunin IV, Korbie DJ, Mattick JS. Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data. BMC Genomics 2010, 11:1–12.
131. Tseng HH, Weinberg Z, Gore J, Breaker RR, Ruzzo WL. Finding non-coding RNAs through genome-scale clustering. J Bioinform Comput Biol 2009, 7:373–388.
132. Brady A, Salzberg SL. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods 2011, 8:367–369.
133. de Bruijn NG. A combinatorial problem. Koninklijke Nederlandse Akademie Van Wetenschappen 1946, 49:758–764.
134. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al. The sequence of the human genome. Science 2001, 291:1304–1354.
135. Balzer S, Malde K, Grohme MA, Jonassen I. Filtering duplicate reads from 454 pyrosequencing data. Bioinformatics 2013, 29:830–836.
136. Machado M, Magalhães WCS, Sene A, Araújo B, Campos ACF, Chanock SJ, Scott L, Oliveira G, Santos ET, Rodrigues MR. Phred-Phrap package to analyses tools: a pipeline to facilitate population genetics re-sequencing studies. Investigative Genet 2011, 2:1–7.
137. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18:821–829.
138. Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res 2011, 6:1552–1560.
139. Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res 2010, 38:132–147.
140. Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated markov models. Nucleic Acids Res 1998, 26:544–548.
141. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M. The RAST server: rapid annotations using subsystems technology. BMC Genomics 2008, 9:1–15.
142. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen A, Grechkin Y, Dubchak I, Anderson I, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 2008, 36:534–538.
143. Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003, 31:371–373.
144. Sonnhammer EL, Eddy SR, Durbin R. Pfam: A comprehensive database of […] families based on seed alignments. Proteins 1997, 28:405–420.
145. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res 2000, 28:235–242.
146. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32:277–280.
147. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997, 278:631–637.
148. Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics 2011, 13:12–20.
149. Cupal J, Hofacker IL, Stadler PF. Dynamic programming algorithm for the density of states of RNA secondary structures. Comput Sci Biol 1996, 96:184–186.
150. Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 1999, 285:2053–2068.
151. Knudsen M. Stochastic context-free grammars and RNA secondary structure prediction. PhD Thesis, Bioinformatic Research Center, 2005.
152. Carter RJ, Dubchak I, Holbrook S. A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res 2001, 29:3928–3938.
153. Koessler DR, Knisley DJ, Knisley J, Haynes T. A predictive model for secondary RNA structure using graph theory and a neural network. BMC Bioinformatics 2010, 11:21–31.
154. Frank AC, Amiri H, Andersson SGE. Genome deterioration: loss of repeated sequences and accumulation of junk DNA. Genetics 2002, 115:1–12.
155. Machado AL, del Portillo HA, Durham AM. Computational methods in noncoding RNA research. Math Biol 2008, 56:15–49.


156. Ray SS, Bandyopadhyay S, Pal SK. A weighted power framework for integrating multi-source information: gene function prediction in yeast. IEEE Trans Biomed Eng 2012, 59:1162–1168.
157. Ray SS, Bandyopadhyay S, Pal SK. Combining multisource information through functional-annotation-based weighting: gene function prediction in yeast. IEEE Trans Biomed Eng 2009, 56:229–236.
158. Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 2007, 3:680–692.
159. Chakravorty S, Danica H, Michele B, Nancy C, David A. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Meth 2007, 69:330–339.
160. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389–3402.
161. Krause L, Diaz N, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards R, Stoye J. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res 2008, 36:2230–2239.
162. Tanaseichuk O, Borneman J, Jiang T. Separating metagenomic short reads into genomes via clustering algorithms. Mol Biol 2012, 7:27–42.
163. Shi Y, Tyson GW, DeLong EF. Metatranscriptomics reveals unique microbial small RNAs in the ocean’s water column. Nature 2009, 459:266–269.
164. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, 215:403–410.
165. Eddy S. INFERNAL User’s Guide, Version 0.72. 2007. Available at: http://infernal.janelia.org/. (Accessed January 15, 2015)
166. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35:61–65.
167. Weinberg Z, Perreault J, Meyer MM, Breaker RR. Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature 2009, 462:656–659.
168. Weinberg Z, Wang JX, Bogue J, Yang J, Corbino K, Moy RH, Breaker RR. Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biol 2010, 11:1–17.
169. Will S, Yu M, Berger B. Structure-based whole-genome realignment reveals many novel noncoding RNAs. Genome Res 2013, 23:1018–1027.
170. Liu C, Bai B, Skogerbø G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R. NONCODE: an integrated knowledge database of non-coding RNAs. Nucleic Acids Res 2005, 33:D112–D115.
171. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh Chem 1994, 125:167–188.
172. Tosse FT, Rodríguez AC, Vélez PE, Zambrano MM, Moreno PA. Exploration of noncoding sequences in metagenomes. PLoS One 2013, 8:1–12.
173. Zhou F, Olman V, Xu Y. Barcodes for genomes and applications. BMC Bioinformatics 2008, 9:546–557.
174. Chatterji S, Yamazaki I, Bai Z, Eisen J. CompostBin: a DNA composition-based algorithm for binning environmental shotgun reads. Proc RECOMB 2008, 4955:17–28.
175. Chan CKK, Hsu AL, Halgamuge SK, Tang SL. Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics 2008, 9:215–232.
176. Leung HC, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, Chen J, Qin J, Li R, Chin FY. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 2011, 27:1489–1495.
177. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 2004, 5:163–170.
178. Meyer MM, Ames TD, Smith DP, Weinberg Z, Schwalbach MS, Giovannoni SJ, Breaker RR. Identification of candidate structured RNAs in the marine […] ‘Candidatus Pelagibacter ubique’. BMC Genomics 2009, 10:1–16.
179. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R. RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22:500–503.
180. Weinberg Z, Ruzzo WL. Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics 2006, 22:35–39.
181. Soldatov RA, Vinogradova SV, Mironov AA. RNASurface: fast and accurate detection of locally optimal potentially structured RNA segments. Bioinformatics 2014, 30:457–463.
