<<

i “main” — 2018/4/3 — 9:38 — page 1 — #1 i i i

Published online XX XXX XXXX Nucleic Acids Research, XXXX, Vol. XX, No. XX 1–10 doi:10.1093/nar/gkn000

Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families Vladimir Reinharz 1,2, Antoine Soul´e 2,3,EricWesthof4,J´erˆomeWaldisp¨uhl2 and Alain Denise 5,6,ò

1Department of Computer Science, Ben-Gurion University of the Negev, P.O.B. 653 Beer-Sheva, 84105, Israel, 2School of Computer Science, McGill University, 3480 University, Montreal, Quebec, H3A 0E9, Canada, 3LIX, Ecole´ Polytechnique, CNRS, Inria, Palaiseau, 91120, France, 4ARN, Universit´ede Strasbourg, IBMC-CNRS, 15 rue Ren´eDescartes, Strasbourg Cedex, 67084, France, 5LRI, Universit´eParis-Sud, CNRS, Universit´eParis-Saclay, Bˆatiment 650, Orsay cedex, 91405, France and 6I2BC, Universit´eParis-Sud, CNRS, CEA, Universit´eParis-Saclay, Bˆatiment 400, Orsay cedex, 91405, France

Received XXX XX, XXXX; Revised XXX XX, XXXX; Accepted XXX XX, XXXX

ABSTRACT loops, bulges, terminal loops. Additional long-range interactions, those that connect distinct secondary structure The wealth of the combinatorics of base pairs elements (SSEs) in 3D structures, and non canonical base pairs enables RNA to assemble into sophisticated or interactions make the adopt its three-dimensional interaction networks, which are used to create complex 3D tertiary structure. substructures. These interaction networks are essential to RNA modules are small substructures which appear in shape the 3D architecture of the molecule, and also to multiple locations in a variety of different RNA molecules, provide the key elements to carry molecular functions such and which fold identically or almost identically. They as or ligand binding. They are made of organised are formed of assemblies of non-Watson-Crick base pairs, sets of long-range tertiary interactions which connect they mediate the folding of the molecule and they can distinct secondary structure elements in 3D structures. also constitute specific protein or ligand binding sites (21, Here, we present a de novo data-driven approach to 24, 25, 26, 27, 32). Well known RNA modules are, for example, GNRA loops, Kink-turns, G-bulges, and the A- extract automatically from large data sets of full RNA minor interactions. Identifying, characterizing RNA modules, 3D structures the recurrent interaction networks. Our understanding how they form and what are their relationships methodology enables us for the first time to detect the are key points for a better understanding of how RNA folds interaction networks connecting distinct components of the and interact with other molecules. RNA modules can be RNA structure, highlighting their diversity and conservation classified in two classes: through non-related functional . We use a graphical model to perform pairwise comparisons of all RNA structures • Local modules are located within secondary structure available and to extract recurrent interaction networks and elements: they are mainly formed of non-Watson-Crick base pairings inside the loops (internal, multiple or modules. Our analysis yields a complete catalogue of RNA terminal loops, or bulges) of the secondary structure. 3D structures available in the Protein Data Bank and reveals Most known modules are built mainly locally, as the G- the intricate hierarchical organization of the RNA interaction bulges and the Kink-turn loops (21, 25), but they can networks and modules. We assembled our results in an online also constitute an element of an interaction module. database (http://carnaval.lri.fr)whichwillbe regularly updated. Within the site, a tool allows users with • Interaction modules connect two distinct secondary a novel RNA structure to detect automatically whether structure elements (helices, loops or local modules). A the novel structure contains previously observed recurrent well known element of this class is the “A- minor” Type I/II (27, 28). interaction networks. Here we distinguish recurrent interaction networks (RINs) from interaction modules. As specified below, a INTRODUCTION recurrent interaction network does not contain any sequence RNA tertiary structures are highly modular. Canonical information, but only topological information about the Watson-Crick base pairs form what is called the secondary interactions between and the nature of these structure, composed of helices interspersed with other interactions. Thus, a given RIN may be a constituent element secondary structure elements such as multiloops, interior of several other RINs. Further, when embedded in sequence

òTo whom correspondence should be addressed. Tel: +33 (0) 1 69 15 63 69; Fax: +33 (0) 1 69 15 65 86; Email: [email protected]

© XXXX The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

i i i i i “main” — 2018/4/3 — 9:38 — page 2 — #2 i i i

2 Nucleic Acids Research, XXXX, Vol. XX, No. XX

space, a given RIN may participate in several types of other constraints which are developed below. These interaction modules. In other words, when mapped onto subgraphs are called interaction networks. sequence information, an identical recurrent interaction network can give rise to one or several interaction modules. 4. Finally we cluster the identical interaction networks A number of computational approaches have been together and create a network of direct inclusions. developed so far for finding automatically RNA modules in tertiary structures, either by geometric methods, or by We present in Sup. Mat. FigS1 a schema of the method, and algorithms based on graph theory (1, 2, 6, 9, 10, 12, 14, we detail it below. 15, 31, 33, 34, 36, 37). Most of these methods aim to find known modules in new structures. A few methods aim to Data search for modules without any prior knowledge of their The non-redundant RNA database maintained on geometry or topology (9, 15), but they only consider local RNA3DHub (30) on Sept. 9th 2016, version 2.92, was interactions. Databases, as the RNA 3D Motif Atlas (32), and used. It contains 845 all-atom molecular complexes with RNA Bricks (3) store information on the RNA modules which a resolution of at most 3A.˚ From these complexes, we have been found in experimentally determined RNA tertiary retrieved all RNA chains also marked as non-redundant by structures. RNA3DHub. Each chain was annotated by FR3D. Because Regarding especially recurrent interaction networks, apart FR3D cannot analyse modified nucleotides or those with a preliminary attempt (8), no automated method has been missing atoms, our present method does not include them developed up to now to detect them in tertiary structures either. If several models exist for a same chain, the first one and to classify them without any a priori knowledge of their only was considered. For the rest of this paper, the base geometry or topology. pairs extracted from the FR3D annotations are those defined We developed a graph-based methodology to extract all in the Leontis-Westhof geometric classification (23). They recurrent interaction networks in crystallized RNA tertiary are any combination of the orientation cis (c) (resp. trans structures and to cluster them according to their similarity. (t)) with the name of the side which interacts for each of We applied our methodology to a large set of experimentally the two nucleotides: Watson-Crick (W) cis c (or b for resolved RNA structures. Not only we retrieved the known trans), Hoogsteen (H) v (or u) or Sugar-Edge (S) Z (resp. recurrent interaction networks (as the different types of A- V). Thus, each base pair is annotated by a string from the minors), but we also extracted new ones. Our method gives 2 set: c,t ✓ W,S,H or by combining previous symbols. a global view on interaction networks and their modularity, { } { } by organizing them in families according to their inclusion To represent a canonical cWW interaction, a double line is relations. The publicly accessible database CaRNAval http: generally used instead of (cc). //carnaval.lri.fr allows to visually explore and study all the interaction networks and their intricate relationships. Secondary Structure We further analyze our data and expose the remarkable For each chain a secondary structure without pseudoknots diversity of the well known A-minor networks. In particular, was deduced from the annotated interactions, as follows. we show that an unexpected number of unrelated structures First all canonical Watson-Crick and wobble base pairs (i.e. form the exact same intricate network of interactions. A-U, G-C and G-U) were identified. Then, since many Furthermore, the diversity of the molecules in which several of structures are naturally pseudoknotted, we used the K2N (35) these networks are found (e.g. , ribozymes and other implementation in the PyCogent (18) Python module to non-functionally related RNAs) underlines the universality remove pseudoknots. Problems arise when a nucleotide and fundamental nature of these recurrent architectures. is involved in several Watson-Crick base pairs (which is geometrically not feasible), probably due to an error of the MATERIALS AND METHODS automatic annotation. Those discrepancies were removed with a ad hoc algorithm such that if a nucleotide is involved in Given an mmCIF file from the PDB describing an RNA chain, several Watson-Crick base pairs, we remove the base pair the method presented here works in five steps. which belongs to the shortest helix. 0. We first build for the chain a directed graph such that the edges represent the phosphodiester bonds as well as Secondary Structure Elements and Skeleton Graph the canonical and non-canonical interactions. From the secondary structure, four types of secondary 1. From the annotations all canonical base pairs are structure elements (SSEs) are defined. The simplest SSE is a identified and used to determine the secondary structure. stem, which is a stack of canonical Watson-Crick and Wobble The secondary structure is used to add on each edge a base pairs, containing at least two base pairs. The others are label to indicate whether it is local (inside one SSE) or the loops of the secondary structure, classified by the number long-range (between two SSEs). of strands inside them. The hairpins are single stranded and 2. Each pair of SSEs connected by a long-range interaction closed by a canonical base pair. An interior loop has two is extracted as a separate graph. These graphs are called stranded elements and is closed by two canonical base pairs; interaction graphs. we consider bulges as particular interior loops. Finally multi loops are composed of three or more strands. Any loop can 3. For each pair of interaction graphs, we compute also be seen as a cycle in the graph of the secondary structure, all maximal common subgraphs which obey some because the loops contain the closing canonical base pairs. The

i i i i i “main” — 2018/4/3 — 9:38 — page 3 — #3 i i i

Nucleic Acids Research, XXXX, Vol. XX, No. XX 3

only exceptions are the two external loops, that is the dangling Interaction Networks ends of the structure. The interaction networks are the RNA structural building The non pseudoknotted secondary structure is then blocks that capture the long-range interactions. They are represented as a skeleton graph (19) where the nodes are the subgraphs of the interaction graphs. We define here the notion SSEs, and there is an edge between two nodes if the two SSEs of interaction network. are consecutive in the secondary structure. Three observations Given two distinct interaction graphs g and h belonging to must be done: (1) given any two consecutive SSEs, one and F, m is a common interaction network of g and h if: only one must be a stem; (2) any two consecutive SSEs share one canonical base pair and (3) any nucleotide can at most a. It is a common edge-labelled subgraph of g and h. belong to two SSEs, which must be consecutive. For each pair of SSEs with at least two base pair interactions b. It is connected and each node belongs to a cycle in the between them, the interaction graph is built, as described in non-directed graph induced by m. (The non-directed the following section. In the case of consecutive SSEs, the graph induced by m is obtained by replacing every nucleotides in the shared canonical base pair belong to both directed edge of m by a non-directed edge and merging SSEs. those between the same nodes.) c. It contains at least two long-range interactions, i.e. four Interaction Graphs edges labeled as long-range since each interaction is described with two edges. For each pair of SSEs with at least two interactions between them, an ensemble of interaction graphs is identified. An d. Each node in m is involved in a canonical or a non- interaction graph g is a directed graph defined as follows: canonical interaction. each node represents a nucleotide in an SSE, and each edge represents an interaction or a between e. If two nodes, a and b in m, form a local canonical base two nucleotides. Every edge e in g has two attributes: pair, there exists a node c in m such that c is a neighbour to a or b, and c is involved in a long-range or non- a. The relation between the two nucleotides, i.e. a canonical interaction. In other words we do not extend phosphodiester bond or an interaction, canonical or not. stacks whose nucleotides are involved in canonical base The interactions are annotated, as will be seen below. pairs only. Each of the above constraints is justified as follows: b. Whether the relation is local or long-range. Local a. We are searching for recurrent sub-structures, whose interactions are the ones occurring between nucleotides geometry is constrained by the labelled edges. of the same SSE. Long-range interactions connect two distinct SSEs. b. This natural condition is to enforce the cohesiveness of the interaction network. The ensemble of interaction graphs of an SSE pair is c. This is a property of all known interaction networks (as built as follows. First a directed graph G is built. A node the A-minor and the zipper). is added to G for each nucleotide in the SSEs. For each canonical or non-canonical interaction inside each SSE, two d. The interaction networks are intended to capture edges are added to the graph, in both directions. Each of a representation of the geometry. Non interacting these edges has a label indicating the type of interaction, in nucleotides do not have geometric constraints. the order of its direction (e.g. cSH). Then, an edge for each ¨ ¨ e. Stacks of canonical base pairs (i.e. at least two G 5 3 phosphodiester bond is added to in the direction, consecutive cWW with no other interaction) form the with its corresponding label. All these edges have a second core of the structure and are either embedded in the attribute indicating that they are local (to one SSE). Finally, secondary structure with little geometric variation or for each interaction between the SSEs two edges are added result from the folding of the tertiary structure (co- G to , one in each direction with the appropriate label. These axial stacking between helices, loop-loop interactions or edges second attributes are marked as long-range. The nodes pseudoknots) with often a larger geometric variation. which are connected to the rest of the graph only through phosphodiester bonds are removed. The weakly connected components of G containing at least a long-range edge are Searching for Recurrent Interaction Networks (RINs) the interaction graphs between the two SSEs. The set of all We are interested in finding the maximal common interaction interaction graphs for all pairs of SSEs is denoted F.We networks of two graphs, that is the common interaction present in Fig. 1 an example of an atomic structure with its networks which cannot be extended in either graph. This annotated structure and its corresponding interaction graph. problem is an instance of the problem of finding a maximum We additionally define two subsets of the set of interaction edge isomorphism and has been shown to be induced by the graphs: adjacent interaction graphs involve two SSEs which node isomorphism when the degree is bounded by at least are adjacent in the secondary structure, that is they share a 5 (11). The maximal subgraph isomorphism has been proven cWW pair. The other interaction graphs are called distant to be NP-hard (4) even for many particular classes of graphs interaction graphs. including planar graphs (7), and the labels do not create any

i i i i i “main” — 2018/4/3 — 9:38 — page 4 — #4 i i i

4 Nucleic Acids Research, XXXX, Vol. XX, No. XX

Figure 1. Up: a stereo view of a GNRA tetraloop interacting with two Watson-Crick pairs of a helical fragment (from chain A in 4QCN). Below: left: the same substructure, coarse grained. Middle: the annotated substructure, with the Leontis-Westhof representation. In all figures, the long-range interaction are shown in red, local in blue with their Leontis-Westhof annotations, cWW are shown as double lines. The sugar- backbone is in black, connected by arrows and directed 5¨ 3¨. The nucleotides belonging to a network are numbered sequentially. Right: the interaction graph. To build the interaction graph there are two main operations. First the nucleotides without any interaction are removed (nucleotide 6), then the interactions are replaced by labelled doubled directed edges. Nodes with only backbone interactions are removed. The first label of each edge indicates the type of interaction of the base pair, first the interacting face of the source followed by the interacting face of the target. The second label indicates if the interaction is local to an SSE (blue) or long-range (red). This interaction graph is provided in larger size in the Sup. Mat. FigS3

evident restriction to leverage. We developed an algorithm to We are only interested in the graphs containing long-range solve the problem. Obviously it is exponential in the worst interactions, thus the initial set of smallest common subgraphs, case, but it performs well for our problem. Nevertheless, sete will be the set of long-range interactions shared between it requires over 200 GB of RAM for some of the largest them. Finally each maximal common subgraph computed comparisons. In the following, maximal common interaction which has some of its nodes not involved in a cycle is networks will be called recurrent interaction networks (RINs) removed to fulfill specification (b) of an interaction network. since they are found in more than one structure. The weakly connected components with at least two long- We describe here the procedure to detect automatically range interactions (i.e. four long-range edges) are returned, to all the maximal common interaction networks between two fulfill specification (c) of an interaction network. Note that, for interaction motifs g and h belonging to F. any pair of interaction graphs, there can be several different Given a graph g, a graph n is defined as being a subgraph maximal common interaction networks. of g, denoted as nNg, if n is isomorphic to a subgraph of g, The main algorithm is based on the following observation: taking into account the edges labels (the type of interaction Consider the three graphs g, h and n which is a subgraph of g and whether it is a long-range interaction or not). and h (noted as nNg, nNh) and e"Edges g . If the graph n The strategy implemented consists in starting from a augmented with the edge e, n+e, is not a subgraph of h, then ¨ ¨ ¨ smallest common subgraph of g and h, and adding to it for all graphs n such that nNn we know( that) n +e is not a one neighbouring edge at the time while considering all subgraph of h. possibilities, until maximality is obtained. The method whose To leverage the observation, the algorithm uses a set N of full procedure is detailed in Algorithm 1 takes as input two pairs of graphs n,n˜ , such that each graph n being grown is graphs g and h such that the number of edges in h is smaller associated with the set of unexplored admissible edges n˜. At (or equal) than in g without loss of generality, and a set of the beginning, n˜ is g minus the edges composing n. graphs such that each of them is a subgraph of g and h. ( )

i i i i i “main” — 2018/4/3 — 9:38 — page 5 — #5 i i i

Nucleic Acids Research, XXXX, Vol. XX, No. XX 5

Algorithm 1: MaxInteractionModules g,h,sete Algorithm 2: ReduceGrowing growing Data: Given g, h, sete, two graphs and a subset of the Data: Given a list growing of pairs of set of edges and edges of h ( ) graphs (n,n˜) ( ) Result: All maximal interaction networks between g Result: Returns a list L of pairs of set of edges and and h containing at least one edge of sete graphs (n,n˜) such that every n is unique growing=o; L=o; subIso=o; D =o; /* dict, key:set edges, for e"sete do value:list of graphs */ Add ( e , h e) to growing; for n,n˜ "growing do while growingjo do if näD then bigger{=}o; \ Add n to D; growing loop=copy growing ; Add graph n˜ to D n ; growing=o; for n"D do for n,n˜ "growing loop( do ) newn˜ =intersection[ ] of all graphs in D n ); maximal =True; Add n,new˜ n˜ n to L; neighb NeighbourEdges n,n˜ for " do return L [ ] LimitGrowth neighb,n True if j then ( \ ) newn =n+neighb; ( ) if newn Ng then ( ) Algorithm 3: LimitGrowth e,n maximal =False; Given an edge e and a set of edges n ( Add new) n to bigger; Data: else Result: Returns Trueif the new( edge) only prolongates n˜ =n˜ neighb; the backbone or canonical base pairs, else if maximal then False nodes all nodes in n; Add the nodes\ of n to subIso; = else e1,e2 = two extremities of e; e1 nodes0e2 nodes for n"set bigger do /* remove if " " then False bigger doubles */ return ; Add n,copy n˜ to growing; if e1"nodes then ( ) for e˜ n do growing= ReduceGrowing growing ; " if e1 e˜0label e˜ B53,CWW then return KeepCycle subIso ; " ä ( ( )) return False; ( ) if e2"nodes then ( ) { } ( ) for e˜"n do if e2"e˜0label e˜ ä B53,CWW then return False; return True; ( ) { } In each round, for each pair of graphs n,n˜ in N, each edge neighbour of n in n˜ is independently added to n. If it breaks the subgraph isomorphism property, the edge is removed from ( ) Implementation and Webserver n˜, else the updated graph n with its additional edge is kept for the next round. After each pair in N has been processed, N is The program is implemented in Python2.7 using the updated with the new ensemble of pairs of graphs. In order to networkx (13) Python module which implements the VF2 limit, in the next round, the set of neighbour edges admissible algorithm (5) for subgraph isomorphism testing. Software to grow the subgraph isomorphism, we pull together all and results are accessible through the website http:// identical subgraphs of g and compute the intersection of their carnaval.lri.fr. sets of admissibles edges. The implementation is presented in Algorithm 2 which receives as input a list of tuples of graphs Visualization Each RIN has its own page which provides the with their associated sets of admissibles edges. nucleobase composition over all observations, the secondary Another algorithm is needed to impede both the growth structures in which the RIN has been observed and the other of stacks of cWW base pairs, and prolongating the backbone RINs that either include or are included in the current RIN. chain with non interacting nucleotides, as specified in Section In addition we provide for each RIN a 3D display tool to Interaction Networks, item (e). An implementation, shown in align and compare the different observations, a 2D extensive Algorithm 3, impedes the growth of stacks of cWW base pairs display of the observations with PDB files of the RIN with or unless there exists at least one additional interaction in the without its context. We also provide a research tool allowing previous base pair. It similarly impedes the prolongation of the user to restrain the display to observations compatible with the backbone chain if previous nucleotides are not involved in a sequence specified with the IUPAC nomenclature. interactions. Given a graph n and a new edge e, it returns False The RINs can be accessed and browsed from two different if these conditions are not met, that is to say if the new edge perspectives. The first one is the catalog, a list of all the RINs can be added to the graph. (which can be restrained to distant SSE RINs or adjacent SSE

i i i i i “main” — 2018/4/3 — 9:38 — page 6 — #6 i i i

6 Nucleic Acids Research, XXXX, Vol. XX, No. XX

RINs by the user). The second one is a graph which represents interaction modules. For simplification, we adopted usual the network of RINs: a RIN r1 is linked to a RIN r2 if r1 is names but in a restrictive way. Thus, the largest component included in r2 and there is no other RIN which includes r1 and of the complete graph contains 201 RINs and we named it the is included in r2. This network of RINs can also be restrained A-minor mesh because it contains all occurrences of at least to distant SSEs RINs or adjacent SSEs RINs by the user. In one A-minor contact. both views, pictures representing the RINs are clickable and The second largest contains nested Watson-Crick base pairs open the RIN specific page. and, consequently, was named the pseudoknot mesh. The Older versions of the database are kept indexed and third largest component contains always one trans Watson- accessible. At the present time, the version of RNA3DHub Crick/Hoogsteen pair and was named accordingly. Within the 2.92 is available. A-minor mesh several RINs are present (see Figure 2a) and we named them according to the basic interaction they contain. It RIN Search by Interaction Features The large amount of RINs must be noticed that the GNRA RIN (see Figure 4 top left) makes the exploration of the results difficult. To ease this does not contain only tetraloop hairpins; it contains the typical process we offer a filter by type of interactions. A minimal trans Hoogsteen/Sugar edge of the GNRA tetraloop. or maximal amount of any type of edge—long-range or not, combined with the Leontis-Westhof classification—can be chosen. A catalogue with only those RINs fulfilling the The A-minor Mesh ensemble of constraints is then built. We show the A-minor mesh in Figure 2a. Each vertex is labeled with the number of the RIN it represents. Two vertices RIN identification in novel structures As an additional utility, have an edge between them if one of them is included we provide an automatic pipeline in which a structure file, in the other (directly or not). We used the ForceAtlas2 in the mmCIF format, can be uploaded with the name algorithm (16) for drawing this graph. This algorithm is a of a specific chain it contains. The structure is annotated force-directed layout: nodes tend to repulse each other, like by FR3D and all RINs found are extracted. An additional charged particles, while edges tend to make nodes closer, parameter allows to consider, or not, the annotations marked like springs. It was proved in (29) that such layouts tend as “near” by FR3D. (We remind the reader that “near” to cluster the nodes by minimizing the so-called modularity interactions are never considered for identifying the RINs in of the clusters. In other words, they put together sets of the CaRNAval database.) The identified RINs in the provided nodes which are interconnected by many edges. The nodes structure are presented in a similar interface as described in are colored according to the largest type of known RINs they Section Visualization. contain among: the A-minor Type I/II (blue), the A-minor Type I/II with an additional tSS interaction (black), the ribose Code availability The code is freely available at: zipper (yellow),the ribose zipper on top of an A-minor type I http://jwgitlab.cs.mcgill.ca/vreinharz/ (red). And in pink the A-minor type II. The ribose zippers do carnaval_code. not formally form base pairs, but geometrically they can be categorized in the cis Sugar/Sugar family introduced in (22). Figure 2b presents a synthetic view of the mesh, by RESULTS partitioning it in several sets of RINs, according to their The Full Graph of Recurrent Interaction Networks closeness in the layout of Figure 2a and to common subgraphs. All RINs in a same set share a common maximal subgraph The 845 structures extracted from the PDB contain 912 RNA which is shown in each set, and two sets have a common chains identified as non-redundant. From those all 1426 pairs boundary if there are edges between some RINs of both sides of SSEs having FR3D annotated interactions between them in the A-minor mesh (i.e. if there are inclusion relations were identified, belonging to 165 chains. In total 337 RINs between these RINs). Each node in Figure 2b is colored were identified, corresponding to 6056 occurrences inside according to the color of its set in Figure 2a, and the cardinality the non-redundant dataset. This number contains duplicate of each set is given. locations: if a recurrent interaction network has as subgraph Figure 2c shows more precisely the variations around the another interaction networks, both are counted. four main RINs of the A-minor mesh: ribose zipper (RIN 11), By connecting two RINs if one is a subgraph of the other, A-minor type I (RIN 2), A-minor Type I/II (RIN 17), and A- a graph can be drawn. The complete graph of RINs of direct minor Type I/II with an additional tSS (RIN 165). The figure inclusions can be visualized at http://carnaval.lri. represents the graph of the shortest paths between these RINs. fr. This graph is constituted of 28 connected components. More precisely, there is an edge between two RINs if (1) there Among them, 25 components are of small size: from 1 to 9 is a direct inclusion relation between them, and (2) this edge RINs each. The three other components are much larger and belongs to a shortest path between two of the four RINs listed are discussed in detail below. above. RINs contain only topological information about the Ad hoc rules for naming RINs interacting nucleotides. Thus, the sets shown in the A- As discussed above, a recurrent interaction network does minor mesh of Figure 2b and Figure 2c can represent (i) not contain any sequence information, but only topological the various components of the standard A-minor interaction information about the interactions between nucleotides and network, (ii) molecular instances of incomplete configurations the nature of these interactions. The naming of a given RIN present in the crystal structures, or (iii) molecular instances of brings along potential confusion with the usual names for complete sets of interaction networks (e .g. in ribose zipper,

i i i i i “main” — 2018/4/3 — 9:38 — page 7 — #7 i i i

Nucleic Acids Research, XXXX, Vol. XX, No. XX 7

#11 55

#16 #13 #15

45 3 #17 33 #107 #2

5

44

#165 #4

(a) (b) (c)

Figure 2. a) The A-minor mesh. In blue the A-minor Type I/II, in black the same RIN with an additional tSS. A ribose zipper (in yellow) on top of an A-minor type I (red). And in pink the A-minor type II. b) A synthetic view of the A-minor mesh. All RINs in a same set share a common maximal subgraph which is shown in each set, and two sets have a common boundary if there are edges between some RINs of both sides in the A-minor mesh (i.e. if there are inclusion relations between these RINs). Each set is colored according to the color of its nodes in Figure 2a. The number of RINs is given for each set. The pink set is subdivided in several parts corresponding to different common maximal subgraphs. c)The shortest path subgraph between some ‘canonical’ RINs : ribose zipper (RIN 11), A-minor type I (RIN 2), A-minor Type I/II (RIN 17), and A-minor Type I/II with an additional tSS (RIN 165).

only contacts between the occur depending on the them in two groups, depending whether they have a GNRA sequence). stem loop or not. Most instances pairs are below 1.5A.˚ We present in Figure 3 connections between some frequent In Figure 4 we present a more complete and detailed view RINs of the A-minor mesh. The RINs are annotated with the of the RIN interconnections. The adjunction of A-minor Type number of unique occurrences. At the top of each subfigure I with the ribose zipper contacts gives rise to three modes of is shown the isolated long-range contact and below the long-range contacts via the terminal GNRA hairpin loop, the same long-range contact surrounded by two canonical base A-minor type I/II, or the internal A-rich loop module. pairs. In the ribose zipper (Figure 3b) and A-minor (typeI/II) For each of them, depending on the nucleotides in the (Figure 3ac), framing with one base pair above or below colored positions, some preferences are exhibited in the or on both sides leads to the same order of magnitude in nucleotide composition and order of the interacting cWW occurrences. However, for the A-minor (type I), framing on base pairs. In the A-minor type I/II long-range contacts, both sides is one order of magnitude more frequent than the position in magenta is almost always an A that binds framing with a single base pair. In Figure 3d, on the bottom preferentially in cSS the C of a C=G pair (there is one right, we show the A-minor Type I/II with one missing contact. occurrence of a G at the magenta position and it also binds This situation may occur transiently during the formation of to the C of a C=G pair). At the orange position an A occurs 80 the contact or reflects a contact not fully formed or a lack times and binds mainly the C of a C=G pair also, but when the of resolution in the structures. It has been suggested (38) orange position is a G it binds preferentially to the U of a U- that A-minor contacts play an important role in the dynamics A pair. In the GNRA long-range contacts, the preferences are of internal movements in large RNA molecules and such identical. In the A-rich loop again there is a strong preference phenomena would require transient states. From the statistics for an A at positions orange and magenta with both contacting presented in Figure 3, it is also clear that A-minor type I/II in cSS the C of a C=G pair. and ribose zipper prefer to bind internally to base pairs within We conclude the analysis of the A-minor component a helix instead of binding at helical ends. with observations on the GNRA RIN (see also Figure 1). The A-minor Type I/II motif requires two As at the positions This RIN is presented in Figure 5 with two superimposed interacting with the cWW base pairs (Figure 3c topmost) and occurrences of three dimensional structures found in different our general method recovered 102 occurrences of the A-minor contexts, the c-di-gmp riboswitch 3UCZ and the Deinococcus (type I/II) RIN. All have one A involved in the double cSS radiodurans large ribosomal subunit 5DM6. The GNRA / tSS interactions, except one with a G. The position with a tetraloop is operationally defined by a sequence and its single cSS interaction has an A only in 80 occurrences, 21 context, with a potential imprecision in the experimental others have a G and one a U. We show in the Sup. Mat. FigS2 structure determination and base pair annotation. We focused the RMSD values between the elements of this RIN, dividing on a sub-element of the the GNRA RIN, the A-minor type I/II to study its three dimensional diversity. This stresses the

i i i i i “main” — 2018/4/3 — 9:38 — page 8 — #8 i i i

8 Nucleic Acids Research, XXXX, Vol. XX, No. XX

Figure 4. At the top left, connections between the major RINs, the minimal RINs A-minor Type I and ribose zipper, lead to the complete RIN A-minor Type I/II, which can further lead to the RINs GNRA and A-rich loop (the total number of occurrences in the database are indicated between parenthesis next Figure 3. Connections between some frequent RINs that are annotated with to the number excluding their occurrences in other shown RINs). Note that the number of unique occurrences. The subfigures represent three known the RIN GNRA is also included into the RIN A-rich loop. For each of those interaction networks, the A-minor type I (a), the A-minor type I/II (c), and three RINs, we present how the nucleotides in orange and magenta influence the ribose zipper (b). One can see that the RIN A-minor Type I/II is included the distribution of nucleotides in the interacting cWW pair. Each histogram within the RIN ribose zipper. In (d) are shown the A-minor type I/II with a has below the bar the nucleotide in the cWW base adjacent to the colored one, missing long-range interaction. Please note that for the ribose zipper, a single and the paired one above. long-range contact cannot be deduced by the algorithm since two contacts are required per interaction network (see above Interaction Networks (c)). In each subfigure the top RIN shows the minimal contact and the one at the bottom the same surrounded by canonical base pairs. The middle row shows the RIN with only one additional cWW base pair stacked on either side. a C=G pair and 1 and 11 are always A and U). This RIN belongs to the UA-handle family (17) and it is part of the trans Watson-Crick-Hoogsteen mesh also. points made above: (i) a given RIN may be a constituent element of several other RINs; and (ii) a given RIN may participate in several types of interaction modules. In short, The trans-Watson-Crick/Hoogsteen mesh and other RINs the same RIN can lead to one or several interaction modules. In Figure 5de, the diversity of contacts made by A-rich The third large component contains 22 RINS, we name it loop (20) is shown; when in the closing pair there is a U, the trans-Watson-Crick-Hoogsteen mesh because all of its a Watson-Crick/Hoogsteen trans and when there is a G, it members have such a long-range interaction (see Sup. Mat. forms a Hoogsteen/Sugar edge trans. At the same time the FigS4) for a major constituent of this mesh). The triple base long-range contacts interact with the module differently, but pair involves the trans Watson-Crick/Hoogsteen between the still maintaining the central contact. There is an apparent conserved U8 and A14 in tRNAs, as well as the Watson- tendency for the type I A-minor contact to occur with a base Crick/Sugar edge between A14 and A21. All instances of this (most preferred is a G, see Figure 4) interacting through the RIN occur in structures of tRNAs either alone or in protein Hoogsteen edge with another nucleotide. complexes. Other interesting RINs can also be found in the smaller connected components. In Figure 7 left we present a RIN The Pseudoknot Mesh composed of five nucleotides and three interactions, a cWW, The second main component of the Recurrent Interaction a long-range cWS, and a long-range tWS interaction. It is Networks contains 59 RINs and can be named the pseudoknot the smallest RIN in a component of 4 RINs and it has mesh since most of its RINs are parts of pseudoknots. The been observed 25 times in a variety of context, tRNAs, simplest of these RINs is a stack of two canonical cWW base riboswitches, ribozymes and ribosomal subunits, hinting at its pairs, and is the most frequent interaction motif. The pair of universality. Residues 4 and 5 are mostly As and, unlike the SSEs most found in this configuration, 28% of the time, occurs A-minor contacts, the two As present the Watson-Crick edge between two hairpin loops, stressing the importance of kissing for contacting the minor groove of two stacked base pairs. The hairpins as a structural feature in large RNA assemblies. same figure contains, right, the smallest RIN in a component Several more original RINs belong to this component, as of 9 RINs, with a local cWW and two long-range interactions, the one shown in Figure 6. It shows an interaction network a tWW and a cSS. It has been observed 11 times in ribosomal with two cWW and a tSS long-range interaction occurring RNAs, signal recognition particle RNAs and a riboswitch. In found 10 times in ribozymes, riboswitches and ribosomal these RINs, the typical trans Watson-Crick-Hoogsteen pair is subunits. This RIN can be described as T-loop-like, with disrupted so that the Watson-Crick edge of the A forms a trans similar sequence conservations (residues 4 and 10 form always Watson-Crick/Watson-Crick pair with another A.

i i i i i “main” — 2018/4/3 — 9:38 — page 9 — #9 i i i

Nucleic Acids Research, XXXX, Vol. XX, No. XX 9

8

5 9

1 4 10 7

2 3 11 6

(a) (a) (b) (b) 1 6 7

7 14 7 2 5 8 11

8 13 1 6 1 6 8 13 3 4 9 10

9 12 2 5 2 5 9 12 (c)

10 11 3 4 3 4 10 11

(c) (d) Figure 6. An interaction found 10 times is shown on top left and, on the right, a stereo view of the superposition of its occurrences in the 4V9F 12 11 Haloarcula marismortui large ribosomal subunit, the 4FRG cobalamin and FMN riboswitches and the twister ribozyme. The nucleotides 1 and 11 form a trans Watson-Crick/Hoogsteen pair and can be closing an hairpin (with a 5 4 13 10 variable number of nucleotides) or can even belong to different strands. The trans Watson-Crick/Hoogsteen pair stacks upon the 4-10 Watson-Crick pair 6 3 14 9 with residues 2 and 3 bulging out and forming a two-stack Watson-Crick pairs. Another example is shown below with a T-loop-like containing 5 nts in the

7 2 15 loop (two are not shown) instead of the 7 nts present in the usual T-loops.

8 1

(e)

1 3 4 2 3 4 Figure 5. Two superimposed occurrences of (a) are shown in a stereo view (b) as example of the sequence diversity within the RIN GNRA. In the 3UCZ c-di-GMP riboswitch bound to GpG (a), the RIN GNRA occurs between a 2 5 1 GAAA tetraloop in contact with two Watson-Crick pairs of a neighboring helix. However, in 5DM6 Deinococcus radiodurans large ribosomal subunit (c), an A-rich loop yields the same interaction contacts. Other RINs leading to similar types of contacts and with a common trans Hoogsteen/Sugar-edge Figure 7. (Left) Smallest RIN in a component of 4 RINs. Residue 2 is are shown in (d) and (e). WC base paired to a residue that is not shown. This RIN is found in 25 RNAs, riboswitches / ribozyme / ribosomal subunit. (Right) Smallest RIN in a network in a component of 9 RINs, found in 11 RNAs, ribosomal / signal recognition particle RNA / riboswitch CONCLUSION In this work, we present a fully automated method for extracting and classifying RNA substructures based on their interactions rather than sequence or context. Through a rigorous mathematical description of the RNA interactions, expected. Only 337 families have been found in all known making a distinction of those within a secondary structure and annotated RNA structures. The number of structurally element (local) and those between two secondary structure non redundant families is even smaller because several elements (long-range), our automatic ab initio method RINs are included within others or are part of larger ones. detects all recurrent interaction networks (RINs) between two Further, because of lack of crystallographic resolution or structure elements. The collection of all RINs is presented molecular dynamics within crystals, one or more contact(s) in a database called CaRNAval, freely accessible at http: in similar RINs may be missing leading to the appearance //carnaval.lri.fr. of a distinct RIN. In any case, these long-range contacts The principal novelty and key element of our methodology display an amazing potential in molecular accommodation and is to cluster motifs solely on the base of the similarity evolution with several neutral intermediate states. Finally, the of their interaction networks, regardless of the nucleotide fact that several complex RINs are found in ribosomes and composition. This approach enables us to demonstrate the ribozymes as well as in tRNAs and riboswitches, or other non- extraordinary versatility and diversity of the well known A- functionally related RNAs, demonstrates how fundamental minor contacts, where an unsuspected variety of sequences they are for RNA architecture. The extent to which a small fold into the exact same intricate network of interactions. We number of such structures is found, can be key for the design also show that the diversity of RINs is more limited than of novel artificial RNAs and structures.

i i i i i “main” — 2018/4/3 — 9:38 — page 10 — #10 i i i

10 Nucleic Acids Research, XXXX, Vol. XX, No. XX

FUNDING 15. Hung-Chung Huang, Uma Nagaswamy, and George E Fox. The application of cluster analysis in the intercomparison of loop structures This work was supported by the Natural Sciences and in RNA. RNA, 11(4):412–423, 2005. Engineering Research Council of Canada [RGPIN-2015- 16. Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu 03786, RGPAS 477873-15], Genome Canada [BCB 2015], Bastian. Forceatlas2, a continuous graph layout algorithm for handy the Canadian Institutes of Health Research [BOP-149429], network visualization designed for the gephi software. PloS one, 9(6):e98679, 2014. Fonds de recherche du Quebec´ [211485, FQ-175959]; by 17. Luc Jaeger, Erik J Verzemnieks, and Cody Geary. The UA handle: a an Erasmus Mundus, Azrieli and Fonds de recherche du versatile submotif in stable RNA architectures. Nucleic acids research, Quebec´ postdoctoral fellowship to VR; by the project “High- 37(1):215–230, 2008. throughput comparative approaches for modelling RNA 3D 18. Rob Knight, Peter Maxwell, Amanda Birmingham, Jason Carnes, J Gregory Caporaso, Brett C Easton, Michael Eaton, Micah Hamady, architecture using chemical probing” of the French Fondation Helen Lindsay, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, pour la Recherche Medicale.´ Michael Robeson, Raymond Sammut, Sandra Smit, Matthew J Wakefield, Jeremy Widmann, Shandy Wikman, Stephanie Wilson, Hua Ying, and Gavin A Huttley. PyCogent: a toolkit for making sense from sequence. ACKNOWLEDGEMENTS Genome , 8(8):R171, 2007. 19. Alexis Lamiable, Franck Quessette, Sandrine Vial, Dominique Barth, We thank Mahassine Djelloul, Alexis Lamiable, Alexis and Alain Denise. An algorithmic game-theory approach for coarse- Delabriere` for their help at a very early stage of the work; Yann grain prediction of RNA 3D structure. IEEE/ACM Transactions on Ponty for the RIN drawing software; Anton Petrov for the 3D Computational Biology and Bioinformatics, 10(1):193–199, 2013. 20. Jung C Lee, Robin R Gutell, and Rick Russell. The UAA/GAN internal alignment tool; Neocles Leontis for fruitful discussions; and loop motif: a new structural element that forms a cross-strand AAA Laurent Darre´ for technical help. stack and long-range tertiary interactions. Journal of molecular biology, 360(5):978–988, 2006. Conflict of interest statement. None declared. 21. Neocles B Leontis, Jesse Stombaugh, and Eric Westhof. Motif prediction in ribosomal RNAs lessons and prospects for automated motif prediction in homologous RNA molecules. Biochimie, 84(9):961–973, 2002. 22. Neocles B. Leontis, Jesse Stombaugh, and Eric Westhof. The non- REFERENCES Watson-Crick base pairs and their associated isostericity matrices. 1. Alberto Apostolico, Giovanni Ciriello, Concettina Guerra, Christine E. Nucleic Acids Research, 30(16):3497–3531, 2002. Heitsch, Chiaolong Hsiao, and Loren Dean Williams. Finding 3D motifs 23. Neocles B Leontis and Eric Westhof. Geometric nomenclature and in ribosomal RNA structures. Nucleic Acids Research, 37(4):e29, 2009. classification of RNA base pairs. RNA, 7(4):499–512, 2001. 2. Sri Devan Appasamy, Hazrina Yusof Hamdani, Effirul Ikhwan Ramlan, 24. Neocles B Leontis and Eric Westhof. Analysis of RNA motifs. Current and Mohd Firdaus-Raih. InterRNA: a database of base interactions in opinion in structural biology, 13(3):300–308, 2003. RNA structures. Nucleic acids research, 44(D1):D266–D271, 2015. 25. Aurelie´ Lescoute, Neocles B Leontis, Christian Massire, and Eric 3. Grzegorz Chojnowski, Tomasz Walen,´ and Janusz M Bujnicki. RNA Westhof. Recurrent structural RNA motifs, isostericity matrices and Bricks—a database of RNA 3D motifs and their interactions. Nucleic sequence alignments. Nucleic Acids Research, 33(8):2395–2409, 2005. Acids Research, 42(D1):D123–D131, 2014. 26. Aurelie´ Lescoute and Eric Westhof. The A-minor motifs in the decoding 4. Stephen A Cook. The complexity of theorem-proving procedures. In recognition process. Biochimie, 88(8):993–999, 2006. Proceedings of the third annual ACM symposium on Theory of computing, 27. Aurelie´ Lescoute and Eric Westhof. The interaction networks of pages 151–158. ACM, 1971. structured RNAs. Nucleic Acids Research, 34(22):6587–6604, 2006. 5. Luigi P Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. 28. Poul Nissen, Joseph A Ippolito, Nenad Ban, Peter B Moore, and A (sub)graph isomorphism algorithm for matching large graphs. IEEE Thomas A Steitz. RNA tertiary interactions in the large ribosomal subunit: Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367– the a-minor motif. Proceedings of the National Academy of Sciences, 1372, 2004. 98(9):4899–4903, 2001. 6. Jose´ Almeida Cruz and Eric Westhof. Sequence-based identification 29. Andreas Noack. Modularity clustering is force-directed layout. Physical of 3D structural modules in RNA with RMDetect. Nature methods, Review E, 79(2):026102, 2009. 8(6):513–519, 2011. 30. Anton Petrov. RNA 3D Motifs: Identification, Clustering, and Analysis. 7. Colin De La Higuera, Jean-Christophe Janodet, Emilie´ Samuel, PhD thesis, Bowling Green State University, 2012. Guillaume Damiand, and Christine Solnon. Polynomial algorithms for 31. Anton I Petrov, Craig L Zirbel, and Neocles B Leontis. WebFR3D— open plane graph and subgraph isomorphisms. Theoretical Computer a server for finding, aligning and analyzing recurrent RNA 3D motifs. Science, 498:76–99, 2013. Nucleic acids research, 39(suppl 2):W50–W55, 2011. 8. Mahassine Djelloul. Algorithmes de graphes pour la recherche de motifs 32. Anton I Petrov, Craig L Zirbel, and Neocles B Leontis. Automated recurrents´ dans les structures tertiaires d’ARN. PhD thesis, Universite´ classification of RNA 3D motifs and the RNA 3D motif atlas. RNA, Paris Sud-Paris XI, 2009. 19(10):1327–1340, 2013. 9. Mahassine Djelloul and Alain Denise. Automated motif extraction and 33. Karen Sargsyan and Carmay Lim. Arrangement of 3D structural motifs classification in RNA tertiary structures. RNA, 14(12):2489–2497, 2008. in ribosomal RNA. Nucleic Acids Research, 38(11):3512–3522, 2010. 10. Carlos M Duarte, Leven M Wadley, and Anna Marie Pyle. RNA structure 34. Michael Sarver, Craig L Zirbel, Jesse Stombaugh, Ali Mokdad, and comparison, motif search and discovery using a reduced representation of Neocles B Leontis. FR3D: finding local and composite recurrent RNA conformational space. Nucleic Acids Research, 31(16):4755–4761, structural motifs in rna 3d structures. Journal of mathematical biology, 2003. 56(1):215–252, 2008. 11. Marianne L Gardner. Hypergraphs and Whitney’s theorem on edge- 35. Sandra Smit, Kristian Rother, Jaap Heringa, and Rob Knight. From isomorphisms of graphs. Discrete mathematics, 51(1):1–9, 1984. knotted to nested RNA structures: a variety of computational methods 12. Patrick Gendron, Sebastien´ Lemieux, and Franc¸ois Major. Quantitative for pseudoknot removal. RNA, 14(3):410–416, 2008. analysis of three-dimensional structures. Journal of 36. Leven M Wadley and Anna Marie Pyle. The identification of novel molecular biology, 308(5):919–936, 2001. RNA structural motifs using COMPADRES: an automated approach to 13. Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structural discovery. Nucleic Acids Research, 32(22):6650–6659, 2004. structure, dynamics, and function using networkx. Technical report, Los 37. Cuncong Zhong, Haixu Tang, and Shaojie Zhang. RNAMotifScan: Alamos National Laboratory (LANL), 2008. automatic identification of RNA structural motifs using secondary 14. Anne-Marie Harrison, Darren R South, Peter Willett, and Peter J structural alignment. Nucleic Acids Research, 38(18):e176–e176, 2010. Artymiuk. Representation, searching and discovery of patterns of bases 38. Jie Zhou, Laura Lancaster, John Paul Donohue, and Harry F Noller. How in complex RNA structures. Journal of computer-aided molecular design, the hands the A-site tRNA to the P site during EF-G–catalyzed 17(8):537–549, 2003. translocation. Science, 345(6201):1188–1191, 2014.

i i i i Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families Supplementary Material

Vladimir Reinharz, Antoine Soul´e, Eric Westhof, J´erˆome Waldisp¨uhl and Alain Denise

1 Overview of the method

Figure S1: Overview of the method. (1) Extraction of the secondary structures from the crystallized tertiary structures. (2) All interaction graphs are extracted from the derived secondary structures. (3) The maximal interaction networks are extracted. (4) The recurrent interaction networks are clustered and gathered in a fully organized and searchable website.

1 2 A-minor Type I/II RIN RMSD

Figure S2: RMSD between the elements of the A-minor Type I/II RIN. Each plot is normalized to have an area of 1. The blue parts only consists of the elements that have a GNRA stem loop, the green part of all the others. The orange position is the nucleotide involved in a single cSS long-range interaction as shown in Fig. 6a.

2 3 Interaction Network

( , ) ( , ) 1 12 7 ( , ) ( , ) ( , ) ( , ) ( ( , ) , ) ( , ) ( , ) ( , ) 2 11 8 5 ( , ) ( , ) ( , ) ( , ) , ) ( ( , ) ( , ) ( , ) ( , ) ( , ) 3 10 9 4 ( , ) ( , )

Figure S3: Figure2 bottom right in full page. The details of an interaction network.

3 4 The trans-Watson-Crick/Hoogsteen mesh

5 2 3 2 7 6 3 5

3 2 1

4 3 1 4 1 8 5 2 4 1

10 11

8 7 7 3 6 9 12 3 2 8 7 4 5 1 5 3 2 8 7 4 3 1 9 6 2 6 4 2 10 9 5 2 1 4 1 9 6 6 3 4 1 9 6 5 2 5 3 1 10 5 1 11 8 7 4 5 7 2 4 11

8 3

6 10 6 7 3 5 3 7 5 4 1

5 7 2 12 11 7 5 2 6 4 2 10 9 4 2 6 6 3

8 4 1 13 8 4 1 13 10 5 1 11 8 1 7 2

9 3 9 3 12 11

7 3 9 3 11 7 2 6 4 2 11 10 6 11 10 10 4 9 8 5 1 12 9 7 5 2 6 12 9 5 3 6 10 8 8 4 1 13 10 7 5 2 13 8 6 2 14 13 5 3 1 11

9 3 12 11 8 4 1 17 14 7 1 15 12 4 12

16 15

9 3

11 10

6 12

7 5 2 13

8 4 1 17 14

16 15

Figure S4: A major cluster of the trans-Watson-Crick-Hoogsteen mesh is based on a triple base pair between conserved residues in the core of tRNA structure.

4