Classifying Multigraph Models of Secondary RNA Structure Using Graph-Theoretic Descriptors
International Scholarly Research Network
ISRN Bioinformatics
Volume 2012, Article ID 157135, 11 pages
doi:10.5402/2012/157135

Research Article

Debra Knisley,1,2 Jeff Knisley,1,2 Chelsea Ross,2 and Alissa Rockney2

1 Institute for Quantitative Biology, East Tennessee State University, Johnson City, TN 37614-0663, USA
2 Department of Mathematics and Statistics, East Tennessee State University, Johnson City, TN 37614-0663, USA

Correspondence should be addressed to Debra Knisley, [email protected]

Received 26 August 2012; Accepted 11 September 2012

Academic Editors: J. Arthur and N. Lemke

Copyright © 2012 Debra Knisley et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The prediction of secondary RNA folds from primary sequences continues to be an important area of research given the significance of RNA molecules in biological processes such as gene regulation. To facilitate this effort, graph models of secondary structure have been developed to quantify and thereby characterize the topological properties of the secondary folds. In this work we utilize a multigraph representation of a secondary RNA structure to examine the ability of the existing graph-theoretic descriptors to classify all possible topologies as either RNA-like or not RNA-like. We use more than one hundred descriptors and several different machine learning approaches, including nearest neighbor algorithms, one-class classifiers, and several clustering techniques. We predict that many more topologies will be identified as those representing RNA secondary structures than currently predicted in the RAG (RNA-As-Graphs) database.
The results also suggest which descriptors and which algorithms are more informative in classifying and exploring secondary RNA structures.

1. Introduction

The need for a more complete understanding of the structural characteristics of RNA is evidenced by the increasing awareness of the significance of RNA molecules in biological processes such as their role in gene regulatory networks, which guide the overall expression of genes. Consequently, the number of studies investigating the structure and function of RNA molecules continues to rise, and the characterization of the structural properties of RNA remains a tremendous challenge in computational biology. RNA molecules are seemingly more sensitive to their environment and have greater degrees of backbone torsional freedom than proteins, resulting in even greater structural diversity [1]. Although the tertiary structure is of significant importance, it is much more difficult to predict than the tertiary structure of proteins. Advances in molecular modeling have resulted in accurate predictions of small RNAs. However, the structure prediction for large RNAs with complex topologies is beyond the reach of the current ab initio methods [2].

A coarse-grained model to refine tertiary RNA structure prediction was developed by Ding et al. [2] to produce useful candidate structures by integrating biochemical footprinting data with molecular dynamics. Although the focus is on tertiary folds, their method uses information about RNA base pairings from known secondary structures as a starting point. This, coupled with the understanding that the RNA folding mechanisms producing tertiary structure are believed to be hierarchical in nature, implies that much can be achieved by discovering all possible secondary structural RNA topologies.

Given the primary sequence of an RNA molecule, there are a number of algorithms and tools available to predict the most likely set of resulting secondary structures. The most widely used algorithms, such as Zuker's Mfold [3] and Vienna RNAfold [4], typically base their predictions on the minimum free energy paradigm. While these algorithms have been highly beneficial, it is not always the case that the predicted structure with minimum free energy is the correct one, and consequently some suggest that the actual RNA secondary structure may not have a minimum global free energy, only local ones [5]. Other means of characterizing the topology of secondary RNA structures are still an active avenue of pursuit.

Figure 1: Topological invariants for RNA multigraphs of order 4.

Figure 2: Variations on the Balaban index.

Figure 3: Line graph invariants.

The graph representations used in this work can be found in the database RAG: RNA-As-Graphs [6]. Secondary RNA structure is modeled by two graph-theoretic representations in the database resource RAG (see [6] for additional details on the differences between the two). In one of these representations, regions of the secondary structure that consist of unpaired bases such as junctions, hairpins, and bulges are represented by vertices. The connecting stems are represented quite naturally as connecting edges. The resulting graph is a connected, acyclic graph, that is, a tree. One advantage of this representation is the fact that trees have been highly studied in graph theory, thereby providing a wealth of information about the model. For instance, it is known by the generating function developed by Harary and Prins [7] exactly how many distinct trees can be constructed for a given number of vertices. This allows the entire space, that is, all possible configurations, to be considered.

Unfortunately, secondary RNA structures containing a pseudoknot cannot be represented as described above by the tree model. If, however, the model is reversed and stem regions are represented as vertices and connecting strings of unpaired bases as the edges, all secondary RNA structures can now be modeled, including those that contain a pseudoknot. This representation is called the dual graph in the RAG database. The resulting dual graph, however, is no longer a simple graph; instead this method produces a multigraph. Unlike a simple graph, a multigraph can have more than one edge connecting a pair of vertices. And, unlike simple graphs, multigraphs have not been as highly studied in the theoretical setting. In previous work [8], the authors of this paper, together with Koessler et al., capitalize on the knowledge afforded by graph theory and exploit the tree representation of the secondary RNA structure to build a predictive model that identifies whether a given tree structure is RNA-like or not RNA-like. In this work, we now consider the dual graph representation.

In particular, all possible dual graph representations of orders 2, 3, and 4 are given in the RAG database and the corresponding structures are classified as either (a) representing a known structure or (b) not representing a known structure. Those not representing a known structure are further classified as either likely to represent a structure in the future, that is, having the characteristics of RNA structure making it likely that such a structure will be identified at some point, or not RNA-like in structure. For the dual graphs of order 5, the database contains 18 structures that have been identified and states that there are 108 possible dual graphs of order 5. This number was determined by a graph growing algorithm. Eighteen of these 108 graphs are verified as representing existing RNA structures and the remaining 90 structures are classified as either RNA-like or not RNA-like in the most recent update for the database by Izzo et al. [9]. This update describes two methods by which the unverified structures are classified. The Laplacian eigenvalues for each structure were transformed using a linear regression to obtain two values for each structure, and then these values were applied in two clustering algorithms, namely, a partitioning method called PAM and a k-nearest neighbor algorithm [9]. They state that 63 are RNA-like and 36 are not and that 45 are RNA-like and 45 are not RNA-like by the two methods, respectively. Since only 18 structures are provided in the database, our objectives were to (1) combinatorially analyze the structures of the 90 dual graphs of order 5 not in the RAG database and (2) predict which of those 90 dual graphs of order 5 are RNA-like in structure via graph-theoretic information from chemical graph theory and mathematical graph theory.

Our findings differ significantly from those of Izzo et al. [9]. We find by using a combinatorial algorithm to construct all possible graphs with the given constraints that there are 118 instead of 108 possible dual graphs of order 5. Furthermore, we show that indeed almost all of the structures in the database with 5 vertices are RNA-like instead of approximately half as indicated in [9]. We feel that this is not too surprising. In the earlier version

Figure 4: Clustering of the 50 graphs most distant from the 18 verified as RNA-like (in red).

motifs (hairpin loops, bulges, internal loops, and junctions). Dual graphs may contain multiple edges and loops; however, neither of these structures is required.
Since a double-stranded RNA stem is connected to at most 2 strands on each side, every vertex v must have degree at most four. In fact, all vertices are of degree 4 except either (a) one of degree 2 or (b) two of degree 3. It follows that dual graphs of order n are of size 2n − 1 [6]. Given these constraints, we use a constructive graph algorithm to enumerate the dual graphs of order five. These 118 graphs may be found in Figure 6.

Figure 5: Two graphs from N f.
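The degree constraints stated above can be checked mechanically. In case (a) the degree sum is 4(n − 1) + 2 = 4n − 2, and in case (b) it is 4(n − 2) + 2·3 = 4n − 2, so by the handshake lemma either case gives (4n − 2)/2 = 2n − 1 edges. The following is a minimal sketch and not the constructive algorithm used in this work: the function names and the three example edge lists are hypothetical, chosen only to illustrate the constraints, and are not taken from the RAG database. A loop is assumed to contribute 2 to the degree of its vertex, and connectivity is not tested.

```python
def degree_sequence(n, edges):
    """Degrees of vertices 0..n-1 of a multigraph given as a list of (u, v) pairs.

    Repeated pairs are parallel edges; a pair (u, u) is a loop and adds 2 to deg(u).
    """
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def satisfies_dual_graph_degrees(n, edges):
    """True if all vertices have degree 4 except (a) one of degree 2 or (b) two of degree 3."""
    deg = sorted(degree_sequence(n, edges))
    one_of_degree_2 = deg == [2] + [4] * (n - 1)
    two_of_degree_3 = deg == [3, 3] + [4] * (n - 2)
    return one_of_degree_2 or two_of_degree_3


# Hypothetical examples (illustration only, not RAG entries).
triple_edge = [(0, 1)] * 3  # order 2, case (b): degrees 3, 3
doubled_path_with_loop = (  # order 5, case (a): degrees 2, 4, 4, 4, 4
    [(0, 1)] * 2 + [(1, 2)] * 2 + [(2, 3)] * 2 + [(3, 4)] * 2 + [(4, 4)]
)
single_edge = [(0, 1)]  # order 2, degrees 1, 1: violates the constraints

for name, n, edges in [
    ("triple_edge", 2, triple_edge),
    ("doubled_path_with_loop", 5, doubled_path_with_loop),
    ("single_edge", 2, single_edge),
]:
    ok = satisfies_dual_graph_degrees(n, edges)
    print(name, ok, "size:", len(edges), "2n - 1:", 2 * n - 1)
```

A check of this kind only filters candidate multigraphs by degree. Counting distinct dual graphs of a given order would additionally require a connectivity test and the removal of isomorphic duplicates, which this sketch does not attempt.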