RNA Structure: Motif Search; RNA 3D Predictions Comparative RNA Structure Analysis

RNA structure: motif search; RNA 3D predictions Comparative RNA structure analysis

A powerful approach in RNA structure prediction, in particular, due to RNA-specific patterns of variation, nucleotide covariations.

An example of two covariations in three related RNA’s:

n n n n n n n n n n n n n-n n-n n-n G-C U-A A-U n-n n-n n-n n-n n-n n-n A-U G-C C-G

RNA 1 RNA 2 RNA 3

AnnGnnnnnnCnnU RNA 1 GnnUnnnnnnAnnC RNA 2 CnnAnnnnnnUnnG RNA 3 (((((....))))) consensus "bracket view" Genes 2018, 9, 302 5 of 9

depends on the time in evolution when it was first acquired. After the last common ancestor, the evolution of Dscam went differently in Drosophila and in Chelicerata [35], while the regulatory sequence in Nmnat remained remarkably conserved [38]. Most of the listed structures contain uninterrupted helices of at least 12 nucleotides, many of which are surrounded by more diffuse base pairings. This is likely due to the free energy constraints to maintain long-range interaction, although the intervening sequences could also be structured since there is no apparent correlation between the stem length and the loop size. Finally, almost all examples in the table are located in syntenic regions, e.g., in introns separating orthologous exons, possibly reflecting locations related to their function and evolution. In sum, functional long-range RNA structures have the following characteristic properties:

1. most long-range RNA structures are well-conserved; 2. the core of a long-range RNA structure is a long, nearly-perfect complementary match; 3. elements of long-range RNA structures are located in syntenic regions.

The first property justifies the so-called “first-align-then-fold” limit of Sankoff’s method (Figure 1), in which a set of orthologous sequences is first aligned and the alignment is then folded. It is the most frequent approach in comparative methods [49]. By construction, it disregards the hypothetical cases of sequences that have diverged beyond recognition, but their structure has remained unchanged. Its sensitivity is limited by the quality of the input alignment, particularly by the uncertainty of aligning mutually exclusive exons that arise from genomic duplications, or by misalignment of conserved structural elementsDetecting that are conserved too short relative structures to the size in of related non-conserved RNAs background [34,62]. Apart from these special circumstances,(prediction “first-align-then-fold”of “consensus” structures) is a simple, fast, and powerful approach that is used in many current comparative methods, including comparative RNA–RNA interaction Differentprediction strategies: [52,63] and probabilistic sampling [64].

fold

first-fold -the n-a lig n align align

f irs t-a lign -then-fold

fold

Figure 1. A “commutative diagram” of the alignment and folding tradeoff. Top left: unaligned RNA sequences. Bottom left: their structure-agnostic alignment; conserved regions are shown in gray. Top right: sparse folding identiﬁes candidate helices shown as arcs. Bottom right: conserved helices are matched by structure-aware alignment or identiﬁed in a multiple sequence alignment.

The advantage of folding a multiple sequence alignment compared to folding its consensus sequence is the covariation statistics. In general, it is possible to gain statistical power from observing covarying positions only when sequences mutate, e.g., in rapidly evolving viralfrom genomes Pervouchine [54 ].(2018) However, the examples from mammalian and insect genes (Table 1) show little or no variation, suggesting that functional structures evolve under strong negative selection [38]. In addition, compensatory patterns can arise not only to maintain base-pairing interactions, but also as a result of synchronized mutations that preserve binding of a common interaction partner in antisense genomic orientation [53]. An example of this is RP11-439A17.4 long non-coding RNA, which is located in antisense to HIST2H2BA gene and overlaps a transcription factor binding site, which also occurs in almost all human histone Detecting conserved structures in related RNAs (prediction of “consensus” structures)

Consensus structures can be computed from sequence alignments using information from suboptimal structures, base probabilities and covariation patterns

Input: Sequence alignment

Calculation: suboptimal structures/partition functions/base probabilities for individual sequences; detection of common patterns and their scoring

Output: The “consensus” structure, (ideally) conserved in all sequences of the dataset. ((...(((((((((((((..((((((...)))))).)))....((.(((((((((....( NP_gullMD77/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCCACAUCAUGGCGUCCCAAGGC 60 NP_gsGD96/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCAACAUCAUGGCGUCUCAGGGC 60 NP_eqMiami63/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCAAAGCCAUGGCGUCUCAAGGC 60 NP_Victoria75/1-1565 AGCAAAAGCAGGGUUAAUAAUCACUCACUGAGUGACAUCAAAAUCAUGGCGUCCCAAGGC 60 NP_swTN77/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCAAUGAGUGACAUCGAAAUCAUGGCGUCUCAAGGC 60 ...... 10...... 20...... 30...... 40...... 50......

((((.....((.(((...... )))))...))))))))).)))))))((((..((((.((( NP_gullMD77/1-1565 ACCAAACGAUCUUAUGAGCAGAUGGAAACUGGUGGCGAUCGCCAGAAUGCCAAUGAAAUC 120 NP_gsGD96/1-1565 ACCAAACGAUCUUAUGAACAGAUGGAAACUGGUGGAGAACGCCAGAAUGCUACUGAGAUC 120 NP_eqMiami63/1-1565 ACCAAACGAUCUUAUGAGCAGAUGGAAACUGGUGGGGAACGCCAGAAUGCAACUGAAAUC 120 NP_Victoria75/1-1565 ACCAAACGGUCUUAUGAACAGAUGGAAACUGAUGGGGAACGCCAGAAUGCAACUGAAAUC 120 NP_swTN77/1-1565 ACCAAACGAUCAUAUGAACAAAUGGAGACUGGUGGGGAACGCCAGGAUGCCACAGAAAUC 120 ...... 70...... 80...... 90...... 100...... 110......

((....(((.(((...... ((....((((((((....))))))))...))...((.(((( NP_gullMD77/1-1565 AGGGCAUCUGUCGGGAGAAUGGUUGGGGGAAUCGGAAGAUUCUACAUACAGAUGUGCACU 180 NP_gsGD96/1-1565 AGAGCAUCUGUUGGAAGAAUGGUUGGUGGAAUUGGGAGGUUUUAUAUACAGAUGUGCACU 180 NP_eqMiami63/1-1565 AGAGCAUCUGUUGGAAGGAUGGUGGGAGGAAUCGGGCGGUUCUACGUUCAAAUGUGUACU 180 NP_Victoria75/1-1565 AGGGCAUCCGUCGGGAAGAUGAUUAAUGGAAUUGGACGAUUCUACAUCCAAAUGUGCACU 180 NP_swTN77/1-1565 AGAGCAUCUGUCGGAAGAAUGAUUGGUGGAAUCGGAAGAUUCUACAUCCAAAUGUGCACU 180 ...... 130...... 140...... 150...... 160...... 170......

((...... ))))))))..))).)))...)))))))))....)))).....(((((((. NP_gullMD77/1-1565 GAACUCAAGCUCAGUGAUAAUGAGGGAAGAUUGAUCCAAAACAGCAUCACCAUAGAGAGA 240 NP_gsGD96/1-1565 GAACUCAAACUCAGCGACUAUGAAGGAAGGCUGAUUCAGAACAGCAUAACAAUAGAGAGA 240 NP_eqMiami63/1-1565 GAGCUCAAACUCAGCGACCACGAAGGGCGGCUGAUUCAGAACAGCAUAACAAUAGAGAGG 240 NP_Victoria75/1-1565 GAACUUAAACUCAGUGAUUAUGAAGGGCGGUUGAUCCAGAACAGCUUAACAAUAGAGAGA 240 NP_swTN77/1-1565 GAACUCAAACUCAGUGACUAUGAGGGACGACUGAUUCAAAAUAGCAUAACAAUAGAGAGA 240 ...... 190...... 200...... 210...... 220...... 230......

...... )))))))..((((.((((((...... ((.((((.((.((((.((.(( NP_gullMD77/1-1565 AUGGUCCUAUCAGCAUUUGAUGAGAGAAGGAACAAGUACCUGGAAGAGCACCCCAGUACU 300 NP_gsGD96/1-1565 AUGGUUCUCUCUGCAUUUGAUGAAAGGAGGAACAAAUACCUGGAAGAACAUCCCAGUGCG 300 NP_eqMiami63/1-1565 AUGGUGCUUUCGGCAUUUGACGAAAGAAGAAACAAGUACCUGGAGGAGCAUCCCAGUGCU 300 NP_Victoria75/1-1565 AUGGUGCUCUCUGCCUUUGAUGAGAGAAGGAAUAGAUAUCUGGAAGAACAUCCCAGCGCG 300 NP_swTN77/1-1565 AUGGUGCUCUCUGCUUUUGACGAGAGAAGGAAUAAAUACCUAGAAGAGCAUCCCAGUGCU 300 ...... 250...... 260...... 270...... 280...... 290......

.((((((((((..((....))...)).((..(((...... )))...)).....((((( NP_gullMD77/1-1565 GGGAGAGACCCCAAGAAAACUGGAGGACCAAUUUACAGAAGGAGAGAUGGGAAGUGGGUG 360 NP_gsGD96/1-1565 GGGAAGGACCCAAAGAAAACUGGAGGUCCAAUCUACCGAAGAAGAGACGGAAAAUGGGUG 360 NP_eqMiami63/1-1565 GGAAAAGACCCCAAGAAGACAGGAGGCCCGAUCUACAGACGGAGAGAUGGGAAAUGGGUG 360 NP_Victoria75/1-1565 GGGAAAGAUCCUAAGAAAACUGGAGGGCCCAUAUACAAGAGAGUAGAUGGAAAGUGGAUG 360 NP_swTN77/1-1565 GGGAAAGAUCCUAAGAAAACUGGAGGACCCAUAUAUAGAAGAGUAGACGGAAAAUGGAUG 360 ...... 310...... 320...... 330...... 340...... 350......

((....)))))))...... ((((.....))))...(( NP_gullMD77/1-1565 AGAGAAUUAGUUCUAUAUGACAAAGAAGAAAUAAGGAGGAUCUGGCGGCAAGCAAACAAU 420 NP_gsGD96/1-1565 AGAGAGCUGAUUCUGUAUGACAAAGAGGAGAUCAGGAGAAUUUGGCGUCAAGCGAACAAU 420 NP_eqMiami63/1-1565 AGAGAACUCAUCCUCUAUGACAAAGAAGAAAUCAUGAGGAUCUGGCGUCAGGCCAACAAU 420 NP_Victoria75/1-1565 AGGGAACUCGUCCUUUAUGACAAAGAAGAAAUAAGGCGAAUCUGGCGCCAAGCCAAUAAU 420 NP_swTN77/1-1565 AGGGAACUCAUCCUUUAUGACAAAGAAGAAAUAAGGAGAGUUUGGCGCCAAGCCAACAAU 420 ...... 370...... 380...... 390...... 400...... 410......

((((.((((((....))..))))..))).)))..((.(((.(((((((((((.....))) NP_gullMD77/1-1565 GGAGAAGAUGCAACAGCUGGUCUCACUCACUUGAUGAUCUGGCAUUCCAAUUUGAAUGAC 480 NP_gsGD96/1-1565 GGAGAAGAUGCAACUGCUGGUCUCACUCACAUGAUGAUCUGGCAUUCCAAUCUAAAUGAU 480 NP_eqMiami63/1-1565 GGAGAAGACGCUACUGCUGGUCUUACUCAUCUGAUGAUUUGGCACUCCAAUCUCAAUGAC 480 NP_Victoria75/1-1565 GGUGAUGAUGCGACAGCUGGUCUAACUCACAUGAUGAUCUGGCAUUCCAAUUUGAAUGAU 480 NP_swTN77/1-1565 GGUGAAGAUGCAACAGCCGGCCUUACCCAUAUUAUGAUUUGGCACUCCAAUCUGAAUGAU 480 ...... 430...... 440...... 450...... 460...... 470......

....(((.((((.(((((((....)))))))...)))).)))(((((((((((((((((. NP_gullMD77/1-1565 GCCACGUAUCAGAGAACAAGAGCACUAGUGCGAACAGGGAUGGAUCCCCGGAUGUGCUCC 540 NP_gsGD96/1-1565 GCCACAUACCAGAGAACAAGAGCUCUCGUGCGUACUGGGAUGGACCCUAGAAUGUGCUCU 540 NP_eqMiami63/1-1565 ACCACCUACCAAAGAACAAGGGCUCUUGUUCGGACUGGGAUGGAUCCCAGAAUGUGCUCU 540 NP_Victoria75/1-1565 ACAACAUAUCAGAGGACAAGAGCUCUUGUUCGCACCGGAAUGGAUCCCAGGAUGUGCUCU 540 NP_swTN77/1-1565 GCCACCUAUCAGAGAACAAGAGCUCUUGUUCGCACUGGGAUGGAUCCCAGAAUGUGCUCC 540 ...... 490...... 500...... 510...... 520...... 530......

.((.(((...(((((.(((((((((((...))))).(((((((...)))))))...(((( NP_gullMD77/1-1565 CUCAUGCAGGGCUCGACACUCCCUAGAAGGUCUGGAGCUGCUGGUGCAGCAGUAAAGGGA 600 NP_gsGD96/1-1565 CUGAUGCAAGGAUCAACUCUCCCGAGGAGAUCUGGAGCUGCUGGUGCGGCAGUAAAGGGA 600 NP_eqMiami63/1-1565 CUGAUGCAAGGAUCAACUCUCCCACGGAGAUCUGGAGCUGCCGGUGCUGCAGUGAAGGGU 600 NP_Victoria75/1-1565 CUGAUGCAGGGUUCGACUCUCCCUAGGAGGUCUGGAGCUGCAGGCGCUGCAGUCAAAGGA 600 NP_swTN77/1-1565 CUAAUGCAAGGUUCAACACUUCCCAGAAGGUCUGGAGCCGCAGGUGCUGCAGUAAAAGGA 600 ...... 550...... 560...... 570...... 580...... 590......

(((.((.((.((((...... (((.(((.....)))..))).....)))).)).)).)).) NP_gullMD77/1-1565 GUUGGGACAAUGGUAAUGGAACUCAUCAGGAUGAUAAAGAGAGGAGUCAAUGACCGCAAU 660 NP_gsGD96/1-1565 GUCGGAACGAUGGUGAUGGAACUAAUUCGGAUGAUAAAGCGAGGGAUUAACGAUCGGAAU 660 NP_eqMiami63/1-1565 GUUGGAACAAUGGUGAUGGAGCUCAUUCAGAUGAUCAAACGUGGGAUAAAUGAUCGAAAC 660 NP_Victoria75/1-1565 GUCGGGACAAUGGUGAUGGAGCUGAUCAGAAUGAUCAAACGUGGAAUCAACGAUCGGAAC 660 NP_swTN77/1-1565 GUUGGAACAAUAGCGAUGGAGUUAAUCAGAAUGAUCAAACGUGGGAUCAAUGACCGAAAC 660 ...... 610...... 620...... 630...... 640...... 650......

)))))))))).)))...((...... )).(((((.(((.(((..((((((...))))) NP_gullMD77/1-1565 UUUUGGAGAGGUGAAAACGGGCGAAGAACAAGAAUUGCCUAUGAAAGGAUGUGCAACAUU 720 NP_gsGD96/1-1565 UUCUGGAGAGGUGAAAAUGGGCGAAGAACAAGAAUUGCAUAUGAGAGAAUGUGCAACAUC 720 NP_eqMiami63/1-1565 UUCUGGAGAGGUGAAAAUGGUCGAAGAACCAGAAUUGCUUAUGAAAGAAUGUGCAACAUC 720 NP_Victoria75/1-1565 UUCUGGAGAGGUGAGAAUGGACGAAAAACAAGGAGUGCUUAUGAGAGAAUGUGCAACAUU 720 NP_swTN77/1-1565 UUCUGGAGGGGUGAAAAUGGACGAAGGACAAGGAUUGCAUAUGAAAGAAUGUGCAACAUU 720 ...... 670...... 680...... 690...... 700...... 710......

)))).)))..)))))...... )).)))))..))))).(((....)))(((((.(((...( NP_gullMD77/1-1565 CUCAAAGGGAAAUUUCAAACAGCAGCACAACGAGCUAUGAUGGACCAAGUGCGAGAAAGC 780 NP_gsGD96/1-1565 CUCAAAGGGAAAUUCCAAACAGCAGCACAAAGAGCAAUGAUGGAUCAGGUACGGGAAAGC 780 NP_eqMiami63/1-1565Detecting CUCAAGGGGAAAUUCCAAACAGCAGCACAACGAGCAAUGAUGGACCAGGUGCGGGAAAGC conserved structures in related RNAs 780 NP_Victoria75/1-1565 CUCAAAGGAAAAUUUCAAACAGCUGCACAAAGAGCAAUGAUGGAUCAAGUGAGAGAAAGC 780 NP_swTN77/1-1565 CUCAAAGGGAAAUUUCAGACAGCUGCCCAGAGGGCAAUGAUGGAUCAAGUGAGAGAAAGU(prediction of “consensus” structures) 780 ...... 730...... 740...... 750...... 760...... 770...... Consensus structures can be computed from sequence alignments using ((....((.(((((((.(((((((...))))))))))))))))..))))))))))))))) NP_gullMD77/1-1565information from suboptimal CGGAAUCCUGGAAAUGCUGAAAUAGAAGACCUUAUAUUUCUGGCUCGAUCUGCACUUAUC structures, base probabilities and covariation patterns 840 NP_gsGD96/1-1565 AGAAAUCCUGGGAAUGCUGAGAUUGAAGAUCUCAUAUUUCUGGCACGGUCUGCACUCAUC 840 NP_eqMiami63/1-1565 CGCAAUCCUGGAAAUGCUGAGAUUGAGGACCUCAUUUUCUUGGCGCGAUCAGCACUCAUU 840 NP_Victoria75/1-1565Input: Sequence alignment CGGAACCCAGGAAAUGCUGAGAUCGAAGAUCUCAUAUUUUCGGCACGGUCUGCACUAAUA 840 NP_swTN77/1-1565 CGGAACCCAGGAAACGCUGAAAUUGAAGAUCUCAUUUUCCUGGCACGGUCAGCACUCAUU 840 ...... 790...... 800...... 810...... 820...... 830...... Calculation: suboptimal structures/partition functions/base probabilities for individual sequences; detection of common patterns and their scoring )))...))))).((.((.(((..((..(((((((.(((((.(((...((((((.((.... NP_gullMD77/1-1565 CUGAGAGGAGCAGUAGCUCACAAAUCAUGCCUUCCUGCCUGUGUAUAUGGACUGGCUGUG 900 NP_gsGD96/1-1565 CUGAGAGGAUCAGUGGCCCACAAGUCCUGCUUGCCUGCUUGUGUGUACGGGCUUGCCGUG 900 NP_eqMiami63/1-1565Output: The “consensus” CUGAGAGGAUCAGUGGCCCAUAAAUCAUGCCUACCUGCCUGUGUUUAUGGCCUUACAGUA structure, (ideally) conserved in all sequences of the 900 NP_Victoria75/1-1565 UUGAGAGGGUCAGUUGCUCACAAAUCUUGUCUGCCUGCCUGUGUGUAUGGACCUGCCGUA 900 NP_swTN77/1-1565dataset. UUAAGAGGGUCAGUUGCACAUAAGUCUUGCCUGCCUGCUUGUGUGUAUGGGCUUGCAGUA 900 ...... 850...... 860...... 870...... 880...... 890...... For instance, a fragment of the output of RNAalifold algorithm:

)).)))((((.....(((((.((((((.....))))))..)))))....))))((((... NP_gullMD77/1-1565 GCAAGUGGGUAUGACUUUGAAAGGGAGGGAUAUUCCCUCGUUGGAAUAGAUCCUUUCCGU 960 NP_gsGD96/1-1565 GCCAGUGGAUAUGACUUUGAGAGAGAAGGGUACUCUCUGGUCGGGAUUGAUCCUUUCCGU 960 NP_eqMiami63/1-1565 GCCAGUGGGUAUGACUUCGAGAGAGAGGGAUACUCUCUGAUUGGAAUAGAUCCUUUCAAA 960 NP_Victoria75/1-1565 GCCAGUGGGUACGACUUUGAAAAAGAGGGAUAUUCUUUGGUGGGAAUUGACCCUUUCAAA 960 NP_swTN77/1-1565 GCGAGUGGGCAUGACUUUGAAAGAGAAGGAUAUUCUCUGGUCGGAAUAGACCCCUUCAAA 960 ...... 910...... 920...... 930...... 940...... 950......

(((((((...((...... )).....((((....))))((((((.....((((.((..... NP_gullMD77/1-1565 CUGCUCCAAAACAGCCAGGUAUUCAGUCUAAUCCGACCAAACGAAAAUCCAGCACAUAAG 1020 NP_gsGD96/1-1565 CUGCUGCAAAACAGCCAGGUCUUUAGUCUAAUUAGACCAAAUGAGAAUCCAGCACAUAAA 1020 NP_eqMiami63/1-1565 CUGCUCCAGAACAGCCAAGUUUUCAGUCUGAUCAGACCGAAUGAAAAUCCAGCACACAAG 1020 NP_Victoria75/1-1565 CUACUUCAAAACAGCCAAGUGUACAGCCUAAUCAGACCGAACGAGAAUCCAGCACACAAG 1020 NP_swTN77/1-1565 CUACUUCAAAACAGUCAAGUAUUCAGCCUGAUCAGACCAAAUGAAAACCCAGCUCACAAG 1020 ...... 970...... 980...... 990...... 1000...... 1010......

.))...)))).((((((((....))).)))))..))))))...... (((((..(( NP_gullMD77/1-1565 AGUCAAUUGGUGUGGAUGGCAUGCCAUUCCGCCGCAUUUGAGGAUUUGAGAGUGUCAAGU 1080 NP_gsGD96/1-1565 AGUCAAUUGGUGUGGAUGGCAUGCCAUUCUGCAGCAUUUGAAGAUCUGAGAGUCUCAAGC 1080 NP_eqMiami63/1-1565 AGUCAGCUGGUGUGGAUGGCAUGCCAUUCUGCAGCAUUUGAGGAUCUGAGAGUUUCAAAU 1080 NP_Victoria75/1-1565 AGUCAGCUGGUGUGGAUGGCAUGCCAUUCUGCUGCAUUUGAAGAUCUAAGAUUAUUAAGC 1080 NP_swTN77/1-1565 AGUCAACUGGUGUGGAUGGCAUGCCACUCUGCCGCAUUUGAGGAUUUAAGAGUAUCAGGC 1080 ...... 1030...... 1040...... 1050...... 1060...... 1070......

(((.((((.(((((....))..)))(((((((((....)))...... (((((...... NP_gullMD77/1-1565 UUCAUCCGGGGAACAAGGGUGCUACCAAGAGGACAACUGUCUACUAGGGGUGUCCAAAUU 1140 NP_gsGD96/1-1565 UUCAUCAGAGGGACAAGAGUGGCCCCAAGGGGACAACUAUCUACUAGAGGAGUUCAAAUU 1140 NP_eqMiami63/1-1565 UUCAUUAGAGGAACCAAAGUAGUCCCAAGAGGACAAUUAGCAACCAGAGGAGUGCAAAUU 1140 NP_Victoria75/1-1565 UUCAUCAGAGGGACCAAAGUAUCUCCAAGGGGGAAACUUUCAACUAGAGGAGUACAAAUU 1140 NP_swTN77/1-1565 UUCAUAAGAGGGAAGAAAGUGGUUCCAAGAGGAAAGCUUUCCACAAGAGGGGUUCAGAUU 1140 ...... 1090...... 1100...... 1110...... 1120...... 1130......

)))))....(((..(((((...)))))..))).....))))))..))))))))).))))) NP_gullMD77/1-1565 GCAUCCAAUGAAAACAUGGAAACCAUGAAUUCCAGCACUCUUGAAUUGAGAAGCAAAUAC 1200 NP_gsGD96/1-1565 GCUUCAAAUGAGAACAUGGAAACAAUGGACUCCAGCACUCUUGAACUGAGAAGCAGAUAU 1200 NP_eqMiami63/1-1565 GCUUCAAAUGAAAACAUGGAGACAAUGGAUUCCUGCACACUCGAACUGAGGAGCAGAUAU 1200 NP_Victoria75/1-1565 GCUUCAAAUGAAAACAUGGAUACUAUGGAAUCAAGUACUCUUGAACUGAGAAGCAGAUAC 1200 NP_swTN77/1-1565 GCUUCAAAUGAGAAUGUGGAAGCUAUGGACUCUAGUACCCUGGAACUAAGAAGCAGGUAC 1200 ...... 1150...... 1160...... 1170...... 1180...... 1190......

(((.((....)).))))))))))..))))..)))....))).)).)))...))))).)). NP_gullMD77/1-1565 UGGGCAAUAAGGACCAGAAGCGGAGGAAACACUAACCAACAAAGAGCAUCUGCAGGACAA 1260 NP_gsGD96/1-1565 UGGGCUAUAAGGACCAGGAGUGGAGGAAACACCAACCAGCAGAGAGCAUCUGCAGGACAA 1260 NP_eqMiami63/1-1565 UGGGCAAUAAGGACCAGGAGCGGAGGGAACACCAAUCAACAGAGAGCAUCUGCAGGACAG 1260 NP_Victoria75/1-1565 UGGGCCAUAAGGACCAGAAGUGGAGGAAACACUAAUCAACAGAGGGCCUCCGCAGGCCAA 1260 NP_swTN77/1-1565 UGGGCCAUAAGGACCAGAAGCGGGGGAAAUACCAAUCAACAGAAGGCAUCCGCAGGCCAG 1260 ...... 1210...... 1220...... 1230...... 1240...... 1250......

))..))).)).))...(((((((((((..(((....)))(((..((((...((..(((.. NP_gullMD77/1-1565 GUCAGUGUGCAACCCACUUUCUCUGUGCAGAGAAACCUCCCCUUUGAGAGGGCGACCAUC 1320 NP_gsGD96/1-1565 AUCAGUGUGCAGCCUACUUUCUCGGUACAGAGAAAUCUUCCCUUCGAAAGAGCGACCAUU 1320 NP_eqMiami63/1-1565 AUAAGCGUACAACCCACUUUCUCGGUGCAGAGAAAUCUUCCCUUUGAAAGAGCAACCAUA 1320 NP_Victoria75/1-1565 AUCAGUGUGCAACCUGCAUUUUCUGUGCAAAGAAACCUCCCAUUUGACAAAUCAACCAUC 1320 NP_swTN77/1-1565 AUCAGUGUGCAACCUACAUUCUCAGUACAAAGGAAUCUCCCUUUUGAGAGAGCGACCGUU 1320 ...... 1270...... 1280...... 1290...... 1300...... 1310......

.)))..)).))))..)))..))))))))).))..(((((...((((...... )))). NP_gullMD77/1-1565 AUGGCUGCUUUCACAGGGAAUACAGAGGGCAGGACAUCUGAUAUGAGGACUGAAAUCAUA 1380 NP_gsGD96/1-1565 AUGGCGGCAUUCACAGGGAAUACAGAGGGCAGAACAUCUGACAUGAGGACUGAAAUCAUA 1380 NP_eqMiami63/1-1565 AUGGCUGCAUUCACUGGAAACACUGAGGGGAGGACUUCCGACAUGAGAACUGAAAUCAUA 1380 NP_Victoria75/1-1565 AUGGCAGCAUUCACUGGGAAUACGGAAGGAAGAACCUCAGACAUGAGGGCAGAAAUCAUA 1380 NP_swTN77/1-1565 AUGGCAGCUUUCAUUGGGAACAAUGAGGGACGAACAUCAGAUAUGCGAACUGAAAUCAUA 1380 ...... 1330...... 1340...... 1350...... 1360...... 1370......

.))))).))))..)))).))).))...... )))))))).)).)))))).)))))).)) NP_gullMD77/1-1565 AGGAUGAUGGAAAAUUCAAGACCAGAGGAUGUGUCUUUCCAGGGGCGGGGAGUCUUCGAG 1440 NP_gsGD96/1-1565 AGGAUGAUGGAAAGCUCCAGACCAGAAGAUGUGUCUUUCCAGGGGCGGGGAGUCUUCGAG 1440 NP_eqMiami63/1-1565 AGGAUGAUGGAAAAUGCCAGAUCAGAAGAUGUGUCUUUCCAGGGGCGGGGAGUCUUCGAG 1440 NP_Victoria75/1-1565 AGGAUGAUGGAAGGUGCAAAACCAGAAGAAGUGUCCUUCCGUGGGCGGGGAGUCUUCGAG 1440 NP_swTN77/1-1565 AGGAUGAUGGAAAGUGCAAAGCCAGAAGAUUUGUCCUUCCAGGGGCGGGGAGUCUUCGAG 1440 ...... 1390...... 1400...... 1410...... 1420...... 1430......

))))))..))))..(((.((((...... )))))))((((((((.(((..((..((((((( NP_gullMD77/1-1565 CUCUCGGACGAAAAGGCCACGAACCCGAUCGUGCCUUCCUUUGACAUGAGUAAUGAGGGA 1500 NP_gsGD96/1-1565 CUCUCGGACGAAAAGGCAACGAACCCGAUCGUGCCUUCCUUUGACAUGAGUAAUGAAGGA 1500 NP_eqMiami63/1-1565 CUCUCGGACGAAAAGGCAACGAACCCGAUCGUGCCUUCCUUGGACAUGAGCAAUGAAGGG 1500 NP_Victoria75/1-1565 CUCUCGGACGAAAAGGCAACGAACCCGGUCGUGCCCUCUUUUGACAUGAGUAAUGAAGGA 1500 NP_swTN77/1-1565 CUCUCGGACGAAAAGGCAACGAACCCGAUCGUGCCUUCCUUUGACAUGAAUAAUGAGGGG 1500 ...... 1450...... 1460...... 1470...... 1480...... 1490......

...... )))))))...)).)))))))))))...... ))))).))))) NP_gullMD77/1-1565 UCUUAUUUCUUCGGAGACAAUGCUGAGGAGUUUGACAGUUAAAGAAAAAUACCCUUGUUU 1560 NP_gsGD96/1-1565 UCUUAUUUCUUCGGAGACAAUGCAGAGGAAUAUGACAAUUGAAGAAAAAUACCCUUGUUU 1560 NP_eqMiami63/1-1565 UCUUAUUUCUUCGGAGACAAUGCAGAGGAGUAUGACAAUUAAAGAAAAAUACCCUUGUUU 1560 NP_Victoria75/1-1565 UCUUAUUUCUUCGGAGACAAUGCAGAGGAGUACGACAAUUAAGGAAAAAUACCCUUGUUU 1560 NP_swTN77/1-1565 UCUUAUUUCUUCGGAGACAAUGCAGAGGAGUAUGACAAUUGAAGAAAAAUACCCUUGUUU 1560 ...... 1510...... 1520...... 1530...... 1540...... 1550......

...)) NP_gullMD77/1-1565 CUACU 1565 NP_gsGD96/1-1565 CUACU 1565 NP_eqMiami63/1-1565 CUACU 1565 NP_Victoria75/1-1565 CUACU 1565 NP_swTN77/1-1565 CUACU 1565 ..... ((...(((((((((((((..((((((...)))))).)))....((.(((((((((....( NP_gullMD77/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCCACAUCAUGGCGUCCCAAGGC 60 NP_gsGD96/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCAACAUCAUGGCGUCUCAGGGC 60 NP_eqMiami63/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCACUGAGUGACAUCAAAGCCAUGGCGUCUCAAGGC 60 NP_Victoria75/1-1565 AGCAAAAGCAGGGUUAAUAAUCACUCACUGAGUGACAUCAAAAUCAUGGCGUCCCAAGGC 60 NP_swTN77/1-1565 AGCAAAAGCAGGGUAGAUAAUCACUCAAUGAGUGACAUCGAAAUCAUGGCGUCUCAAGGC 60 ...... 10...... 20...... 30...... 40...... 50......

(((((((...((...... )).....((((....))))((((((.....((((.((..... NP_gullMD77/1-1565 CUGCUCCAAAACAGCCAGGUAUUCAGUCUAAUCCGACCAAACGAAAAUCCAGCACAUAAG 1020 NP_gsGD96/1-1565Such structure-annotated CUGCUGCAAAACAGCCAGGUCUUUAGUCUAAUUAGACCAAAUGAGAAUCCAGCACAUAAA alignments allow one to identify covariations. 1020 NP_eqMiami63/1-1565 CUGCUCCAGAACAGCCAAGUUUUCAGUCUGAUCAGACCGAAUGAAAAUCCAGCACACAAG 1020 NP_Victoria75/1-1565 CUACUUCAAAACAGCCAAGUGUACAGCCUAAUCAGACCGAACGAGAAUCCAGCACACAAG 1020 NP_swTN77/1-1565 CUACUUCAAAACAGUCAAGUAUUCAGCCUGAUCAGACCAAAUGAAAACCCAGCUCACAAG 1020 ...... 970...... 980...... 990...... 1000...... 1010......

...)) NP_gullMD77/1-1565 CUACU 1565 NP_gsGD96/1-1565 CUACU 1565 NP_eqMiami63/1-1565 CUACU 1565 NP_Victoria75/1-1565 CUACU 1565 NP_swTN77/1-1565 CUACU 1565 ..... U C G A Mutual information and alignment position entropies G U U U A - U G . U C - G Mutual information M(x,y): A - U G - C Covariation U - A x/y (?) bx - by G - C U - A A - U Using entropy values at the alignment positions: Alignment: x y positions AUGUUGACGAUGGUCAUUUUGUCACAU seq1 AUGCUGACGAUGGUCAUUUUGUCGCAU seq2 AUGCUGACGAUGGUCAUUUUGUCGCAU seq3 AUGCUGACGAUGGUCAUUUUGUCACAU seq4 where AUGUUGACGAUGGUCAUUUUGUCGCAU seq5 AUGUUGACGAUGGUCAUUUUGUCACAU seq6 ... (an entropy term, a measure of variability). AUGCUGACGAUGGUCAUUUUGUCGCAU seqN f(bx) f(by) base frequencies

The ratios of M(x,y) and entropies can reveal correlations at (biased) positions:

High values (close to 1) indicate to significant correlations.

[Gutell et al., 1992] Mutual information and numbers of covariation events

230 Dutheil Similar values of MI may reflect different evolutionary scenarios. The scenario on the right is a stronger case for coevolution hypothesis (multiple covariation events).

from Dutheil (2012)

Figure 1: Patterns and processes.The two scenarios display the same correlated patterns, with high MI.The under- Downloaded from lying evolutionary process is however different, with the scenario on the right being a stronger case for coevolution because three cosubstitutionAlignment: events are required to explain the observed pattern, which is unlikely to have occurred by chance under the null hypothesis of independence. 1 2 positions

...A...... U... seq1 http://bib.oxfordjournals.org/ ...A...... U... seq2 ...A...... U... seq3 ...A...... U... seq4 ...G...... C... seq5 ...G...... C... seq6 ...G...... C... seq7 ...G...... C... seq8 at University Library on August 14, 2015 f1() f2()

Figure 2: The observed correlation of patterns is compared to different background correlations depending on the methods. Index-based methods do not distinguish the different sources of correlation (A) while statistics-based methods assess the stochastic components of observed correlations (B). Evolutionary methods (C)furtherdistin- guish phylogenetic from functional correlations. Evolution-based or corrected statistics directly include the phylogenetic component in their correlation measure (D). ‘Cfun’: functional correlation, ‘Cstruct’: structural correlation, ‘Cinter’: interaction correlation,‘Cphylo’: phylogenetic correlation and ‘Cstoc’: correlation due to stochasticity. biologically relevant correlation, consists of a phylo- (i) ignoring the phylogenetic correlation. In this genetic correlation due to shared ancestry, and sto- simplest case, an observed correlation is only chasticity, inherent to the random mutation process. compared to a stochastic correlation, and there- Assessing the importance of stochasticity is the fore assesses the significance of functional realm of statistics. A phylogenetic correlation is phylogenetic components; þ more challenging to account for, as in most cases (ii) minimizing the phylogenetic correlation, through the phylogenetic relationships are unknown and data filtering, for instance by removing redun- have to be inferred from the data under several dant sequences [same approach as (i) but after assumptions and therefore with uncertainty. some alignment preprocessing]; Several approaches have been proposed in the lit- (iii) incorporating the phylogenetic correlation in erature (Figure 3): the null hypothesis. In this case, an observed Covariance models, RNA families and RNA descriptors

One of the core computational problems in RNomics is a so-called “sequence/structure” alignment.

For instance, a problem to align a motif <<<<.<<<<.....>>>>>>>> to a sequence: CCCCACGCGAAAACGCGGGGG

Obviously, a deletion in the sequence yields the best alignment (score):

<<<<.<<<<.....>>>>>>>> CCCCACGCG-AAAACGCGGGGG

Various algorithms are possible for the search of the optimal sequence/structure alignments (dynamic programming, BLAST-like etc.). They can be used e.g. for the alignment of a structural motif to a sequence (database of sequences), alignment of a sequence to a motif (database of motifs).

Similar ideas can be used in fold/align algorithms (simultaneously folding and aligning RNA sequences).

Multiple sequence/structure alignments lead to definitions of RNA families and descriptors. Rfam: database of RNA families

http://rfam.sanger.ac.uk/

In Rfam, the related RNAs (families) are stored as sequence/structure alignments (multiple sequence alignments + structure motifs in the Stockholm format)

... Influenza_A_virus_AN.1 UUCCAGGACAUACUAAUGAGGAUGUCAAAAAUGCAAUUGGGAUUCUCA Influenza_A_virus_Ac.8 UGCCAGGACAUUCUGCGGAGGAUGUCAAAAAUGCAAUUGGGAUCCUCA Influenza_A_virus_AL.1 UUCCAGGACAUACUGCUGAGGAUGUCAAAAAUGCAGUUGGAGUCCUCA Influenza_A_virus_Ap.1 UGCCAGGACAUUCUCAUGAGGAUGUCAAAAAUGCAAUUGGAAUCCUCA ... #=GC SS_cons .<<<<<...... AAAAAA...... >>>>>aaaaaa.

(In Stockholm format, the pseudoknots are shown with AAA...aaa; BBB...bbb etc symbols.)

Every family in Rfam is initially defined by “seed” alignments: representative sequences plus structural motif. These seed alignments define a descriptor (covariance model). The covariance model is further used to search for other family members in a sequence database. Structured RNA molecules without protein-coding function: - tRNA 242 - ribosomal RNA (rRNA) - small nucle(ol)ar RNA (snRNA, snoRNA) A - microRNA (miRNA) ~-.r 30 - long non-coding RNA (lncRNA) 8232 Nucleic Acids Research, 2013, Vol. 41, No. 17

A 2 kb Scale - etc. chr11: 65266000 65268000 65270000 65272000 65274000 MALAT1

Enredo-Pecan-Ortheus 35 way alignment (with BlastZ-net)a ne from Ensembl 62

Non-coding RNAs (ncRNAs) are usually characterizedGerperp Constrained Elements by (35 eutherian amammals) conserved on homo_sapiensmo in Ensembl 63 G-c Repeating Elements by RepeatMaskertM ke a.o EvoFold Predictions of RNA Secondaryda Structuretr structure. All conserved RNA 2D structure prpredictionstio

A C U B C 3’ C U A A A A C A A 5’ U A U A G C 3’

A Downloaded from U G U G A A U C A G A U C U G U C 5’ A U G G U G C C A A G U C A A A C G C G A U A G C G G A A A U U U G A G U C A C G U U U U C U A C U G C A A G G G C U A A U A G C A C A A U A U U U U U C C G G C A U G U A A A U G C .... ".oooo,o:, U A ~ A A A A U C A C A G U U G G C C G C A C C A A A A C A U G G http://nar.oxfordjournals.org/ G U C G U C U A G C U G ;oo~: ~,:oo~ A U A --;o~ U U ,, ~ G G G U U p~ ~o~:t~~ _...~ ~o~.?~,,.'~. ~ \ A A A U U U A A A U U U C G U C C U G U A C U U A A A A C A U C A G G A U A G A U U A A A G A A A U U A 5’U A G G U U U U U A U U A U A A G U A A U C C G A A C A C A A A U G A 3’ A U A A A G G U C G A C U G G G A A C A A G U A U U C U U C A U U U C U U G U U U U G A U A G C C C A C G A U U U A A A U A A U A G G A A G G C C C U U A C C A A G A G U A G G U A U A C A A G G A A C A by guest on November 18, 2013 U U U G G A A C U U U G U C A A G Doman3 ~-~ %":~ vso g'~ I~ ~ I' C A G A G G A C A C A G U G A G G U G U C A G A U G C A A C G C U G C U U A U U U A G C A U G C C U U A C A U U G A C C A C A U A U A C G G U C A C A C G G U G A U G G G U G G C G A C C C G C C A G C G G U G U A C A A G C A C U G C C C G C C A C G C G G G G G A A A U C C A U U C C G C C G U C A C A U G A C C G C U C C G G G A A G U G G G A G U U U G U G U G A G U G G U U U G C G G 5’ C A C G A A G G A C C A G U C A U U 5’ C A C C U G C G 3’ C G U U G C U G G U 3’ G A G C U A A C A U U C U A G G 5’ C C G G U A U G A U C G 3’ G A U G G A A A G C G G C A A C A G C U C C U A C U G C G U C U C G C G U U U U U G U A U G U U G G U C A G C C G U A U A U A U C A :~176176,~ ~'g0 ,~ ,oo ...... C C C A C U A A U G U C G C C U C G U A G U C C A C G A A C G C G U C U U G C G U G G G C A U A A U G U U U C C G U G U A A G C U U A U C G A C G A G U G C U C C C U G G U G U A G G A A C C U G U U G G C G G G G C C A C G A C U A C G A U U G G U A C A U A U U C C U U U C A U U C G C A A G C A U U U A U A A C U C G U UC G A U C G 1% 99% U U

Figure 5. Structural characterization of the long ncRNA MALAT1.(A) UCSC genome browser (hg19) screenshot of the MALAT1 locus with the o.~:.. ~ ~o o .... o, o_o _ ...... following tracks: EPO multiple genome alignment (used to emit predictions), GERP++constrained sequence element track, repeat elements, EvoFold evolutionarily conserved RNA secondary structure predictions, and the ECS predictions reported herein, with colors representing the algorithm used 0 o ~ o_o~-~ ~,~..... o c e..~'l~, ~ .oc oo~ ~GC c/~c to make the prediction (SISSIz in red, SISSIz with RIBOSUM in green, RNAz in blue). Red rectangles represent ECS predictions that are ~o..~o,~,yo~o,c/V2-12 structurally congruent with the reference sequence. (B) Human RNA secondary structures associated to predictions. Consensus RNA secondary One of the domains in human 28S rRNA structures were extracted from the associated alignments and used as a constraint for folding the human sequence with RNAfold (76). The base Fragmentscolors represent the (unconstrained) of partitionconserved function base-pairing probabilitiesstructures associated to the represented predicted structures. Gray structure in annotationshuman G/ v2.15 A ~''r176 correspond to RNAalifold consensus structures supported by conserved or compensatory mutations. The substructures outlined in black (bottom (Gorski et al., 1987). right) correspond to the previously characterized mascRNA and associated stem-loop, which are required for efﬁcient RNAse P cleavage (cleavage longsite indicated ncRNA with an arrow) (74). (Smith Structure representations et were al., created 2013) with VARNA (77). Length = 5035 nt Length ~ 7000 nt

:0r0i" V~-1~ U II]fl C CCQ ~GC / ..... 9 Ic ocoooo I%1//~ c~/ -~----v, ~;~co o 6 C c\ ~ ~c"6"oc ,,o.~.~ ~~ ~e/ll ~C C "~A,~.0~-,o0%;2~.% " ',%. , ~ v~.e :

,,~",',%g w ,.. ~,_ , o ~ o'" ~176176..-. ~zg, ~ o-o w-~ %-%

~ ~ ,~.e-~ ,-P %:% '~, ~ \ .... oca 9 c r c o c or t~m cccucc c c c ccG ~ ~ o ~c ~ ~c c.~mG ~ COeO~IOm~mm~,OmOCGGGGCm~OOOOCUmOGGCOCOOO~OGaC UC~" 0 m Ill I Ill,II l,ll,II II Illll IIIllllllll.lllllll Illll U'~Gc ~ |C c cc OUCCou CCGUC c ~u Ccvccu ec u c CcCCOU~ u Gccc cc coG~ ec c~c ou cc u c c eu

c%.~ v~-~ ~:~' c-G Domain 2

u.o // c-a ~:~ ~ ~, e-a.~

"c~'U c r

;-~ Zo%;o,c v, Domain 1 :r %' ,%co _~-~ c ~

"~ :li 28S rRNA

Fig. 2A, B. Domains 1-7 of the human 28S rRNA molecule. Helixes are consecutively labeled and correspond to those listed in Tables 1-7. Helixes composed of variable sequence tracts are denoted as belonging to a variable region (e.g., V2 is the second variable region) and by relative position. Variable sequence tracts are denoted by shading. Invariant nucleotides are boxed. Within V2 and V8, sequences conserved among the vertebrates are denoted by dashed lines, sequences specific to mammals are enclosed in solid lines gions--a structural feature unique to the variable quence tract, V2-19, a 31-nucleotide stretch capable regions (Fig. 2). Such structures have been observed of forming a conserved structural element (Fig. 2) by electron microscopic studies of mammalian large is totally conserved in all eukaryotes examined. All rRNA (Schibler et al. 1977). other conserved tracts are vertebrate specific (Figs. Our alignment identifies conserved collinear se- 1 and 2). quence tracts within the variable regions (Fig. 1). These conserved tracts differ from those in the con- Secondary Structure of the Variable Regions served regions being (1) less numerous, (2) conserved over shorter evolutionary distances, and (3) The variable-region sequence tracts are capable of interrupted by sequences of low homology that vary forming secondary structures that do not disrupt the in length during evolution. Only one conserved se- superimposable secondary structures of the con- Structured RNA molecules without protein-coding function: - tRNA - ribosomal RNA (rRNA) - small nucle(ol)ar RNA (snRNA, snoRNA) - microRNA (miRNA) - long non-coding RNA (lncRNA) - etc. Non-coding RNAs (ncRNAs) are usually characterized by conserved structure.

Identification of (non-coding) RNA transcripts and/or structured RNA regions in genomes: RNomics. Multiple databases of ncRNAs

RNAcentral database (The RNAcentral Consortium, rnacentral.org) integrates data from ncRNA resources

Different search tasks are possible:

(rnacentral.org) microRNAs (miRNAs) MicroRNAs are 21-22 nt RNA’s are derived from precursor primary miRNAs (pri-miRNAs).

Pri-miRNAs are extended stem-loops. They are enzymatically processed to yield miRNAs (below shown in colour) that can be produced from both sides of the stem-loop.

hsa-mir-122-5p c - gg c --u c 5' cuuagcag agcugu aguguga aaugguguuug gu u The hsa-mir-122 precursor: |||||||| |||||| ||||||| ||||||||||| || 3' ggaucguc ucgaua ucacacu uuaccgcaaac ca a c a aa a uau a hsa-mir-122-3p (www.mirbase.org/) microRNAs (miRNAs) MicroRNAs are 21-22 nt RNA’s are derived from precursor primary miRNAs (pri-miRNAs).

Pri-miRNAs are extended stem-loops. They are enzymatically processed to yield miRNAs (below shown in colour) that can be produced from both sides of the stem-loop.

• In animals, the main function of miRNA's is translational repression mediated by miRNA binding to mRNA 3'UTR's. • This binding is mostly determined by so-called "seed" complementary match of 7-8 base pairs between the miRNA 5'end and target. For instance:

5' ...UGCCCUGGGAGCCCUACACUCCA... target mRNA ||||||| 3' GUUUGUGGUAACAGUGUGAGGU miRNA microRNAs (miRNAs) REVIEWSMiRNAs target genes by pairing to mRNAs. Different regulation mechanisms can be used.

a Endonucleolytic cleavage b 7 Plants (frequent) AGO m Gppp 10,11 (n)AAAA miR-171 5′-UGAUUGAGCCGCGCCAAUAUC-3′ SCL6 mRNA 3′-...ACUAACUCGGCGCGGUUAUAG...-5′

(n)AAAA Animals (rare) 10,11 miR-196a 5 - -3 UUUU ′ UAGGUAGUUUCAUGUUGUUGGG ′ HoxB8 3′ UTR 3′-...AUCCGUCAAAGUACAACAACCU...-5′ XRN1 or XRN4 Exosome

c Translational repression Block to translation e initiation Plants (rare?) miR-172a/c 5′-AGAAUCUUGAUGAUGCUGCAU-3′ SCL6 mRNA 3′-...UCUUAGGACUACUACGACGUC...-5′ m7Gppp (n)AAAA ORF Animals (canonical seed match site; most frequent) 2 8 lin-4 5′-UCCCUGAGACCUCAAGUGUGA-3′ lin-14 3′ UTR 3′-...AGGGACUCAACCUAAUUUUCU...-5′ AA

A Deadenylation A Animals (G-bulge site; less frequent?) Recruitment of 2 8 -3 translation blockers miR-124 5′-UAAGG-CACGCGGUGAAUGCC ′ Mink1 3′ UTR 3′-...GUUCCGGUGAUGUAACUCUUU...-5′

(n)AAAA Animals (3′ supplementary site; less frequent?) 2 7 13 16 miR-2 5′-UAUCACAGCCAGCUUUGAGGAGC-3′ grim 3′ UTR 3′-...UUAGUGU CGAAACU d mRNA turnover Decapping and UACG AACUC...-5′ 5′-to-3′ decay Change in repressive mechanism over time? Change in repressive Animals (3′ compensatory site; rare) 2 8 let-7 5′-UGAG-GUAGUA-GGUUGUAUAGUU-3′ AA lin-41 3′ UTR 3′-...ACUCACAUCUUGCCAACAUAUUUU...-5′ A Deadenylation A (Ameres & Zamore, 2013) Figure 4 | microRNA function in plants and animals. a | Most plant microRNAs (miRNAs) and a few animal miRNAs direct endonucleolytic cleavage (slicing) of their mRNA targets. The 5ʹ-to-3ʹ exoribonucleaseNature Reviews XRN4 | Molecular in plants andCell XRN1 Biology in animals, together with the major cellular 3ʹ-to-5ʹ exonucleolytic complex, the exosome, subsequently degrade the sliced mRNA fragments. The 3ʹ end of the 5ʹ for decay. b | miRNA-directed endonucleolytic cleavage of mRNAs requires extensive complementarity between the miRNA and its target site (representative examples in plants and mammals are shown). Endonucleolytic cleavage occurs ʹ end. Although slicing seems to be the dominant mode of action for miRNAs in plants, highly complementary sites in the transcriptome of animals are rare. c | In animals, miRNAs were originally proposed to repress translation of an open reading frame (ORF). d | In many cells and tissues, miRNA-directed translational repression is indistinguishable from mRNA destruction via decapping and 5ʹ-to-3ʹ decay. This has led to the suggestion that miRNAs directly target mRNAs for decay. Another possibility is that the inhibition of translation by miRNAs (part c) triggers subsequent mRNA decay, and the temporal delay between these two effects can vary depending on the surveillance mechanisms in place in particular cellular contexts (depicted by blue arrow on the right). e SCL6) is indistinguishable from sites that trigger cleavage (part b). In contrast to plants, animal miRNAs are almost always imperfectly complementary to the target mRNAs they regulate. The seed sequence (miRNA nucleotides 2 to 8; sometimes interrupted by a specific G-bulge) is the major determinant for target binding and often suffices to trigger mRNA repression. Additional pairing ʹ supplementary sites). Only in rare cases can an imperfectly matching seed sequence be compensated for by extensive complementarity between the miRNA 3ʹ region and the target site (3ʹ compensatory sites). Binding of animal miRNAs to partially complementary target sites generally c and part d.

482 | AUGUST 2013 | VOLUME 14 www.nature.com/reviews/molcellbio

Predictions of pri-miRNAs are mostly based on finding conserved stem-loop structures encoded in related genomic sequences.

Some sequence preferences can be used in the search.

Due to weak sequence patterns, such an approach may lead to many false-positive results. http://genomebiology.com/2003/4/7/R42 Genome Biology 2003, Volume 4, Issue 7, Article R42 Lai et al. R42.3

mir-184

100% comment

75%

50% 8392567 8392887 8393207 8393527 8393847 8394167 8394487 8394807 8395127 8395447 reviews

Unstructured sequence Conserved stem-loop

Figure 1 reports miRNA genes are isolated, evolutionarily conserved genomic sequences that have the capacity to form extended stem-loop structures as RNA. Shown are VISTA plots of globally aligned sequence from D. melanogaster and D. pseudoobscura, in which the degree of conservation is represented by the height of the peak. This particular region contains a conserved sequence identified in this study that adopts a stem-loop structure characteristic of known miRNAs. Expression of this sequence was confirmed by northern analysis (Table 2), and it was subsequently determined to be the fly ortholog of mammalian mir- 184 . Most conserved sequences do not have the ability to form extended stem-loops, as evidenced by the fold adopted by the sequence in the neighboring(E.C. Lai et al., 2003) peak. deposited research deposited

for many other non-coding RNA genes, their degree of and gaps in the loop sequence but maintain perfectly con- conservation usually exceeds that of coding regions, due to served arms (Figure 2b). Thus, the terminal loop is the most their lack of third-position wobble. This suggested that evolutionary labile portion of pre-miRNA. miRNA genes might be found by folding fixed lengths of conserved sequence to identify ones that display the high degree Mutations do eventually accumulate on the non-miRNA-

of relatively continuous helicity characteristic of known pre- encoding arm, and orthologous pre-miRNAs from more refereed research miRNAs. However, pilot studies identified a very large diverged species will often preserve only the 21-24 nucleotide number of conserved stem-loops in genomic sequence, sug- mature miRNA itself. However, because of preferential diver- gesting that additional criteria were necessary to make effec- gence within the loop, orthologous miRNAs from species of tive miRNA gene predictions. an appropriate evolutionary distance (such as D. melanogaster and D. pseudoobscura) show an equal or greater We next aligned the 24 pairs of orthologous Drosophila pre- amount of change within the loop than on the non-miRNA- miRNA sequences and assessed their pattern of nucleotide encoding arm (Figure 2a, class 3). This is the case despite the divergence. There are only three pairs that have been com- fact that the loop is typically only a third to a quarter the pletely conserved (Figure 2a, class 1), indicating that most length of each arm. Of the eleven members of the reference interactions pre-miRNAs have diverged to some extent within Dro- set that show divergence on both an arm and the loop, seven sophila. Unexpectedly, we detected mild selective pressure on show more changes in the loop than on the non-miRNA- the precise sequence of the non-miRNA-encoding arm. This encoding arm and three have an equal number of changes on attribute is not self-evident. It might have been the case that the loop and non-miRNA-encoding arm (Figure 2a, class 3); point mutations along an arm would be neutral as long as the only one member of the reference set (miR-2b-1) shows a status of base-pairing was maintained; this is possible due to greater number of changes on the non-miRNA arm than the the acceptability of G-U base-pairing in RNA. Instead, we loop (Figure 2a, class 6). Finally, there are no cases where observed preferential divergence within the loop sequence. both arms have diverged (irrespective of loop status), a situa- information Ten members of the reference set have diverged exclusively tion that would imply that the miRNA sequence itself had within their loop sequence (Figure 2a, class 2), whereas there diverged (Figure 2a, class 4). are no members that have diverged exclusively along an arm (Figure 2a, class 5). This is well illustrated by the Drosophila We propose then that classes 1-3 represent the normal pat- let-7 orthologs, which have accumulated four mismatches tern of evolutionary divergence of miRNAs, and consider

Genome Biology 2003, 4:R42 RNAsnp server: predicting SNP effects on RNA folding

(http://rth.dk/resources/rnasnp/ ; Sabarinathan et al., 2013)

HOME RESEARCH RESOURCES PUBLICATIONS ABOUT PEOPLE EVENTS NEWS JOBS CONTACT

About RNAsnp Web Server: Predicting SNP effects on local RNA secondary structure Web Server Please fill out the submission form and click the Submit button given below. Input fields marked with a * are required. - Submit

(Load Example Data) - Results

Input sequence* - Template - Example Enter your input sequence here in either fasta format or linear sequence (without gaps). [?] - Help

Software

Contact

(or) Upload sequence file: Choose File no ﬁle selected

(or) Select sequence from genome database

Mammal Human hg19 genome region chr19:49468565-49469565

Mode SNP details*

Enter your SNP details in the required format [?] Select the mode of operation [?] XposY, X is the wild-type nt., Y is the mutant and pos is the Mode 1 - based on global folding (RNAfold) [?] position Mode 2 - based on local folding (RNAplfold) [?] of nt. (pos=1 for first nucleotide in a sequence) Mode 3 - to screen putative structure-disruptive SNP [?] In case of multiple SNPs, separate each SNP with hypen "-" More than one SNP to test in a single run, provide them in seperate lines Folding window

Select the size of flanking regions on either side 200 of SNP [?]

(or) Upload SNP file: Additional options Choose File no ﬁle selected Parameters associated with mode 1 [?] Measure distance Correlation coefficient Job options Minimum length of the sequence interval 50 Enter your e-mail adress if you want to receive a notification Minimum cut-off for the base pair probabilities 0.01 after the job has finished, and give your job any custom name.

E-mail Job name

Submit Reset

© Center for non-coding RNA in Technology and Health Non-canonical base pairs in RNA In addition to canonical Watson-Crick base pairs (GC and AU), non-canonical edge-to-edge interactions with other base pairs are formed in multiple structured RNAs. These interactions are mediatedDownloaded from rnajournal.cshlp.org by hydrogen on March 9, 2019 bonds - Published by (H-bonds) Cold Spring Harbor Laboratory and Press are classified according to geometries of Nucleic Acids Research, 2002, Vol. 30 No. 16 3499 RNAinteracting base pair classification edges. and nomenclature 501 Table 2. The 12 families of edge-to-edge base pairs formed by nucleic example has been found and for which no reasonable model acidTwelve bases, de®ned main by the relative families orientations of ofisosteric the glycosidic bonds of could be proposed based on current knowledge. Sugar ring the interacting bases (column 2) and the edges used in the interaction (columnbase 3) pairs: atoms are drawn for those cases where the O2¢ participates (or could potentially participate) in hydrogen bonding to the base (or ribose O2¢) of the partner nucleotide. Otherwise the entire sugar moiety is designated with a closed circle. The sugar edge of purine and pyrimidine nucleotides includes the 2¢-OH, when the glycosidic bond of the nucleotide is in the usual anti domain. Thus, when one or both bases interact with the sugar edge, hydrogen bonds can form with the 2¢-OH group(s) acting either as donor(s) or acceptor(s). In fact, in some of the cis sugar edge/sugar edge pairs, no direct base±base hydrogen bonds occur at all. Since the position of the 2¢-OH hydrogen cannot be inferred from X-ray structures of nucleic acids, the 2¢-OH is drawn as a single unit in Figures 2±14. The C-H±O hydrogen bond is well established in structural chemistry on the basis of detailed analyses of small molecule crystallography (25). Thus, we also mark interactions involving adenine H2, purine (R) H8 and pyrimidine (Y) H5 or H6 as hydrogen bonds in Figures Recently proposed(Leontis symbols et al,, for 2002) designating each base pair family in secondary structure diagrams are given in column 4. Circles designate 2±14. For hydrogen bonds not involving a C-H the maximum Watson±Crick edges, squares Hoogsteen or pyrimidine CH edges, and distance between heavy atoms is 3.4 AÊ and for hydrogen bonds FIGURE 1. Left panel: Purine (A or G, indicated by “R”) and pyrimidine (C or U, indicated by “Y”) bases provide three edges for interaction, as shown for adenosine and cytosine+ The Watson–Crick edge comprises A(N6)/G(O6), R(N1), A(C2)/ triangles sugar edges. Solid symbols indicate cis base pairs and open involving C-H bonds the maximum distance should be <3.9 AÊ . G(N2), U(O4)/C(N4), Y(N3), and Y(O2)+ The Hoogsteen edge comprises A(N6)/G(O6), R(N7), U(O4)/C(N4), and Y(C5)+ symbols trans base pairs. The local strand orientation that occurs when both The Sugar-edge comprises A(C2)/G(N2), R(N3), Y(O2), and the ribose hydroxyl group, O29+ Right panel: The cis and trans Alignment example: Bridging water molecules are integral elements of a number of Isostericorientations are defined base relative topairs a line drawn can parallel to substitute and between the base-to-base eachhydrogen bonds in the case of two bases are in the default anti conformation are in column 5; a syn orientation hydrogen bonds or, in the case of three hydrogen bonds, along the middle hydrogen bond+ for one... ofG the...... nucleotides would imply a reversalU... of seq1 orientation; for the non-Watson±Crick base pairs (26,27). Water acts as both other in RNA structure. Frequently a global... orientation,G...... the stereochemistry at theU phosphate... seq2 groups has to be hydrogen bond donor and acceptor in these structures but, conserved non-canonical pairing can be considered....A...... G... seq3 again, the actual positions of the hydrogen atoms cannot be inferred from available crystal structures so water molecules derived from covariations in alignment. ...G...... U... seq4 more... edge-to-edgeA...... hydrogen bondsG belong... seq5 to one of 12 are simply designated by W in Figures 2±14. Information orientation of the strands+ In the very rare case that TABLE 1+ The 12 main families of base pairs between nucleic acid geometric families (7). Each family is identi®ed by the edges regarding the hydrogen bonding is provided in the lower left- both bases are syn, the strand orientations revert to bases together with the local strand orientation (which assumes that all bases are in the default anti conformation; a syn orientation would involved in the interaction and the relative orientations of the hand corner of each panel in Figures 2±14. Three numbers are those given in Table 1+ Thus, the proposed system elim- imply a reversal of orientation; for the global orientation, the stereo- inates the need to speak of “flipped” bases, “reverse” chemistry at the phosphate groups has to be considered)+ glycosidic bonds of the interacting nucleotides, cis or trans given: (i) the number of observed or potential hydrogen bonds orientations, or to explicitly state the donor and accep- Glycosidic bond Local strand (Table 2). When the glycosidic bonds of the two bases assume between two nitrogen or oxygen containing groups (i.e. tor atoms+ With a mental image of the edges that each No+ orientation Interacting edges orientation base of an RNA nucleotide presents for interaction, the default anti con®guration, the relative strand orientations normal hydrogen bonds); (ii) the number of hydrogen bonds one can easily visualize and memorize the essential 1 Cis Watson–Crick/Watson–Crick Antiparallel are those given in the third column of Table 2 (24). In Figures 2 Trans Watson–Crick/Watson–Crick Parallel involving polarized C-H groups (i.e. AH2, RH8, YH5 or geometry of each interaction+ To facilitate the adoption 3 Cis Watson–Crick/Hoogsteen Parallel 2±13, representative examples are provided of observed base YH6); (iii) the number of bridging water molecules. of the proposed nomenclature, we present in Table 2 4 Trans Watson–Crick/Hoogsteen Antiparallel the correspondence between our nomenclature and the 5 Cis Watson–Crick/Sugar Edge Antiparallel pairs for each geometrical family. When the two interacting It is well known that adenine can be protonated at the N1 base-pair designations given in the web-accessible data- 6 Trans Watson–Crick/Sugar Edge Parallel edges are different (for example Watson±Crick and position and cytidine at the N3 position (6). The proton cannot base cited above, grouped according to geometric type+ 7 Cis Hoogsteen/Hoogsteen Antiparallel The canonical A-U and GϭC pairs belong to the cis 8 Trans Hoogsteen/Hoogsteen Parallel Hoogsteen), a historically based priority rule is invoked be directly observed in nucleic acid crystal structures, but a Watson–Crick/Watson–Crick (W+C+/W+C+)geometry+ The 9 Cis Hoogsteen/Sugar Edge Parallel (Watson±Crick > Hoogsteen > sugar edge) so the base number of interactions cannot be readily rationalized without 10 Trans Hoogsteen/Sugar Edge Antiparallel so-called wobble pairs also belong to this group+ Orig- 11 Cis Sugar Edge/Sugar Edge Antiparallel identi®ed with each row of a given matrix is the one assuming protonation. In some rare instances, experimental or inally, the term “wobble” designated the pairing be- 12 Trans Sugar Edge/Sugar Edge Parallel tween the noncomplementary bases G and U and pairs interacting with the higher priority edge. Thus, in Family 3, theoretical support has been obtained for protonation (28). cis Watson±Crick/Hoogsteen (Fig. 4), all the pairings in the Therefore, wherever it makes chemical sense, we have ®rst row involve adenine interacting with its Watson±Crick indicated protonated adenine and cytidine in Figures 2±13. edge while all the pairings in the ®rst column involve adenine interacting with its Hoogsteen edge. In each panel of Figures Isostericity 2±13, the higher priority base appears to the left, oriented so The three-dimensional structures of homologous RNA mole- that its Watson±Crick edge faces to the right. A list of cules change much more slowly than their sequences in the referenced NDB ®les with primary references is provided in course of evolution (as is also true for homologous proteins). Table 1. By de®nition, homologous molecules share a common For each base pair in Figures 2±14, the source (NDB biological origin and a conserved function. Random point ®lename) and resolution of the X-ray data (in AÊ ), as well as the mutations in structurally crucial parts of RNA molecules are C1¢±C1¢ distance (also in AÊ ) are provided in the lower right accommodated by natural selection when they affect the threecorner. As higher resolution examples are obtained of each dimensional structure little or when they are compensated by base pair, they may be conveniently substituted for the pair further mutations. Such co-variations, when they occur at shown. In those cases where an example of a base pair was not positions that are cis Watson±Crick paired, have been applied found in a crystal structure, the pair was modeled using known with great success to predict the occurrence of conserved structures as templates and basic principles of hydrogen double helices in homologous RNA molecules. The isosteri- bonding. The pairs used as templates for modeled pairs are city of the standard base pairs, A-U, G=C, C=G and U-A in noted in the lower right of the panel. Blank spaces in Figure 2, is the fundamental property. The C1¢±C1¢ distance in Figures 2±14 indicate base combinations for which no each of these pairs is identical (Fig. 2, lower right of each Non-canonical base pairs in RNA BB46CH22-Westhof ARI 21 March 2017 11:2 BB46CH22-Westhof ARI 21 March 2017 11:2 Non-canonical base pairs can determine RNA 3D structure. Nucleic Acids Research, 2002, Vol. 30 No. 16 3499

The kink-turn or K-turn: TableNomenclature 2. The 12 families of of edge-to-edge non-canonical base pairs formed pairs: by nucleic example has been found and for which no reasonable model ab c acid bases, de®ned by the relative orientations of the glycosidic bonds of could be proposed based on current knowledge. Sugar ring abthe interacting bases (column 2) and the edges used inc the interaction 5' 3' (column 3) (Leontis et al,, 2002) atoms are drawn for those cases where the O2¢ participates (or U G 5' 3' could potentially participate) in hydrogen bonding to the base G A U G (or ribose O2¢) of the partner nucleotide. Otherwise the entire G A G A sugar moiety is designated with a closed circle. A G G A The sugar edge of purine and pyrimidine nucleotides includes the 2 -OH, when the glycosidic bond of the U A G ¢ nucleotide is in the usual anti domain. Thus, when one or G U both bases interact with the sugar edge, hydrogen bonds can G G form with the 2¢-OH group(s) acting either as donor(s) or G C G acceptor(s). In fact, in some of the cis sugar edge/sugar edge G C G C pairs, no direct base±base hydrogen bonds occur at all. Since 3' 5' G C the position of the 2¢-OH hydrogen cannot be inferred from 3' 5' X-ray structures of nucleic acids, the 2¢-OH is drawn as a Figure 3 Superimposition of K-turns single unit in Figures 2±14. The C-H±O hydrogen bond is well The K-turn (37) as an example of an RNA module. (a)SuperimpositionofK-turnsfromHomo2D sapiensdiagram Figure 3 established in structural chemistry on the basis of detailed from three different RNA analyses of small molecule crystallography (25). Thus, we ( green), Haloarcula marismortui (cyan), and Archaeoglobus fulgidus (magenta). (b)U4K-turninteractingwiththe The K-turn (37) as an example of an RNA module. (a)SuperimpositionofK-turnsfromHomo sapiens 15.5-kD protein (102). (c) Two-dimensional annotated diagram of a classic K-turn module (48). molecules also mark interactions involving adenine H2, purine (R) H8 ( green), Haloarcula marismortui (cyan), and Archaeoglobus fulgidus (magenta). (b)U4K-turninteractingwiththe Recently proposed symbols for designating each base pair family in and pyrimidine (Y) H5 or H6 as hydrogen bonds in Figures 15.5-kD protein (102). (c) Two-dimensional annotated diagram of a classic K-turn module (48). 2±14. For hydrogen bonds not involving a C-H the maximum and function as an L7Ae-like protein binding region; structural and schematic contacts are shown From Miao & Westhof (2017) secondary structure diagrams are given in column 4. Circles designate Watson±Crick edges, squares Hoogsteen or pyrimidine CH edges, and distance between heavy atoms is 3.4 AÊ and for hydrogen bonds in Figure 3. Therefore, non-WC base pairs are crucial to RNA structure and may determine the and function as an L7Ae-like protein bindingtriangles region; sugar structural edges. Solid and symbols schematic indicate contactscis base pairs are shown and open involving C-H bonds the maximum distance should be <3.9 AÊ . RNA 3D topologies in hinge regions. This also explains why RNA 3D structures are difficult to symbols trans base pairs. The local strand orientation that occurs when both Bridging water molecules are integral elements of a number of in Figure 3. Therefore, non-WC base pairsbases are are crucial in the default to RNAanti conformation structure andare in may column determine 5; a syn orientation the build based solely on 2D structure. RNA 3D modules are recurring 3D building blocks, mainly for one of the nucleotides would imply a reversal of orientation; for the non-Watson±Crick base pairs (26,27). Water acts as both RNA 3D topologies in hinge regions. This also explains why RNA 3D structures are difficult to constituted by non-WC base pairs, of similar structure and function occurring in various functional global orientation, the stereochemistry at the phosphate groups has to be hydrogen bond donor and acceptor in these structures but, build based solely on 2D structure. RNA 3Dconsidered. modules are recurring 3D building blocks, mainly again, the actual positions of the hydrogen atoms cannot be RNA molecules. As RNA secondary structure is insufficient to describe the unique structure of constituted by non-WC base pairs, of similar structure and function occurring in various functional inferred from available crystal structures so water molecules an RNA molecule, RNA 3D modules can greatly complement the deficiency of 2D structure in more edge-to-edge hydrogen bonds belong to one of 12 are simply designated by W in Figures 2±14. Information depicting molecular details. Here, we briefly review some crucial RNA structural modules. RNA molecules. As RNA secondary structuregeometric is insufficient families (7). to Eachdescribe family the is unique identi®ed structure by the edges of regarding the hydrogen bonding is provided in the lower left- an RNA molecule, RNA 3D modules can greatlyinvolved complement in the interaction the and deficiency the relative of 2D orientations structure of in the hand corner of each panel in Figures 2±14. Three numbers are depicting molecular details. Here, we brieflyglycosidic review some bonds crucial of the RNAinteracting structural nucleotides, modules.cis or trans given: (i) the number of observed or potential hydrogen bonds EXAMPLE OF AN RNA MODULE: THE G-BULGE (Table 2). When the glycosidic bonds of the two bases assume between two nitrogen or oxygen containing groups (i.e. the default anti con®guration, the relative strand orientations normal hydrogen bonds); (ii) the number of hydrogen bonds The sarcin/ricin module was first observed in the 5S ribosomal RNA (rRNA) of eukaryotes (12, EXAMPLE OF AN RNA MODULE:are THE those given G-BULGE in the third column of Table 2 (24). In Figures involving polarized C-H groups (i.e. AH2, RH8, YH5 or 13, 45, 46, 96). The sarcin/ricin loop (SRL) includes a G-bulge module, in which a guanosine is 2±13, representative examples are provided of observed base YH6); (iii) the number of bridging water molecules. extrahelical bulged and forms a non-WC interaction with its neighboring uridine. Nine bases and The sarcin/ricin module was first observedpairs in the for 5S each ribosomal geometrical RNA family. (rRNA) When of the eukaryotes two interacting (12, It is well known that adenine can be protonated at the N1 five non-WC base pairs that are surrounded by WC base pairs form the center of the sarcin/ricin 13, 45, 46, 96). The sarcin/ricin loop (SRL)edges includes are a differentG-bulge module, (for example in which Watson±Crick a guanosine and is position and cytidine at the N3 position (6). The proton cannot module. As a recurrent structural module, the SRL has been found in different rRNAs, namely, Hoogsteen), a historically based priority rule is invoked be directly observed in nucleic acid crystal structures, but a extrahelical bulged and forms a non-WC interaction(Watson±Crick with its > neighboring Hoogsteen > uridine. sugar edge) Nine so bases the and base number of interactions cannot be readily rationalized without 5S rRNA and 23S rRNA. The SRL is functionally essential as it is targeted by cytotoxins such as five non-WC base pairs that are surroundedidenti®ed by WC base with pairs each form row the of acenter given of matrix the sarcin/ricin is the one assuming protonation. In some rare instances, experimental or α-sarcin and ricin that completely abolish translation. A recent study points out that SRL is critical module. As a recurrent structural module, theinteracting SRL has with been the higherfound priority in different edge. rRNAs, Thus, in namely, Family 3, theoretical support has been obtained for protonation (28).

Annu. Rev. Biophys. 2017.46. Downloaded from www.annualreviews.org for anchoring elongation factor G (EF-G) on the ribosome during the process of mRNA–tRNA cis Watson±Crick/Hoogsteen (Fig. 4), all the pairings in the Therefore, wherever it makes chemical sense, we have

Access provided by Erasmus Universiteit on 04/05/17. For personal use only. 5S rRNA and 23S rRNA. The SRL is functionally essential as it is targeted by cytotoxins such as translocation (91). ®rst row involve adenine interacting with its Watson±Crick indicated protonated adenine and cytidine in Figures 2±13. α-sarcin and ricin that completely abolish translation.edge while Aall recent the pairings study in points the ®rst out column that SRL involve is critical adenine The bacterial loop E is another type of module, which is found in bacterial 5S rRNAs. The loop Isostericity Annu. Rev. Biophys. 2017.46. Downloaded from www.annualreviews.org for anchoring elongation factor G (EF-G) oninteracting the ribosome with its during Hoogsteen the edge. process In each of mRNA–tRNA panel of Figures

E region speciﬁcally binds ribosomal protein L25 in the conserved Access provided by Erasmus Universiteit on 04/05/17. For personal use only. region. The loop E module translocation (91). 2±13, the higher priority base appears to the left, oriented so The three-dimensional structures of homologous RNA mole- is a symmetrical loop. According to the crystal structure of Escherichia coli 5S rRNA, the loop E that its Watson±Crick edge faces to the right. A list of cules change much more slowly than their sequences in the The bacterial loop E is another type of module, which is found in bacterial 5S rRNAs. The loop module includes seven non-WC base pairs forming a severely distorted double helix. The structural referenced NDB ®les with primary references is provided in course of evolution (as is also true for homologous proteins). Table 1. By de®nition, homologous molecules share a common consensus of this module is constituted by three key base pairs: trans Hoogsteen/sugar edge, trans E region speciﬁcally binds ribosomal protein L25 in the conserved region. The loop E module is a symmetrical loop. According to the crystalFor structure each base of pairEscherichia in Figures coli 5S 2±14, rRNA, the source the loop (NDB E biological origin and a conserved function. Random point WC/Hoogsteen edge or trans sugar/Hoogsteen edge, and cis bifurcated or trans sugar/Hoogsteen ®lename) and resolution of the X-ray data (in AÊ ), as well as the mutations in structurally crucial parts of RNA molecules are edge. In such local structures, the base edges face the grooves in different orientations than in module includes seven non-WC base pairs formingC1¢±C1¢ adistance severely (also distorted in AÊ ) double are provided helix. in The the structural lower right accommodated by natural selection when they affect the three- standard WC pairs and thus form novel interaction interfaces with protein side chains. consensus of this module is constituted by threecorner. key As base higher pairs: resolutiontrans Hoogsteen/sugar examples are obtained edge, oftrans each dimensional structure little or when they are compensated by WC/Hoogsteen edge or trans sugar/Hoogsteenbase edge,pair, they and maycis bifurcated be conveniently or trans substitutedsugar/Hoogsteen for the pair further mutations. Such co-variations, when they occur at shown. In those cases where an example of a base pair was not positions that are cis Watson±Crick paired, have been applied Pseudoknots edge. In such local structures, the base edgesfound face in a the crystal grooves structure, in different the pair was orientations modeled using than known in with great success to predict the occurrence of conserved A pseudoknot (94, 107) is a structure that contains at least a hairpin loop and a single-stranded standard WC pairs and thus form novel interactionstructures interfaces as templates with and protein basic side principles chains. of hydrogen double helices in homologous RNA molecules. The isosteri- bonding. The pairs used as templates for modeled pairs are complementary sequence outside the loop that forms WC base pairs with the hairpin. A stricter city of the standard base pairs, A-U, G=C, C=G and U-A in Pseudoknots noted in the lower right of the panel. Blank spaces in Figure 2, is the fundamental property. The C1¢±C1¢ distance in Figures 2±14 indicate base combinations for which no each of these pairs is identical (Fig. 2, lower right of each 488 Miao Westhof A pseudoknot (94, 107) is a structure that contains at least a hairpin loop and a single-stranded · complementary sequence outside the loop that forms WC base pairs with the hairpin. A stricter

488 Miao Westhof · Changes may still occur before final publication online and in print

Changes may still occur before final publication online and in print RNA 3D modeling using tertiary structure constraints

604 ComparativeF. Michel analysis and E. Westhof (in particular, covariations) can identify important tertiary contacts that constrain the 3D folding. This info can be efficiently used in modeling. sequences and length ofFor aligned instance, segments the(see firstterminal models loop distal of of ribozymes stem P6 and the wereL7.1 a termi- produced using constraints derived from Materials and Methods for derivation of subgroups). nal loop. (2) Pairings P5a, Prib and P5c (including The following secondary covariationsstructure components identified are the A-rich in alignmentsbulge on the 3’ side guided of P5a), distal by secondaryof stem structures. Later, these models not shown in Figures l(a),were (b) and/or confirmed not discussed by P5crystallographic (Collins, 1988; “c-domain”, data. see Michel & in the main text: (1) Pll, a 4 to 7-bp interaction, Cummings, 1985). This peripheral extension is typi- confined to subgroup IAl, between an internal or cal of many subgroup IC and IB introns: note.

Pl Pl’ P2 P2’ P2.1 P3 P4 1 01 Tt.LB” 02 TP.LS” 03 PP.LB”,3 ICI 04 Nc.ND4L OS P..ND4L,, 06 Pa.ND4L.2 07 Pa.ND1.I 594 F. Michel and E. Westhof 08 PC.SS” 09 P..OX1,5 IO Pa.ND3 588 11 Na.NDB,, VQhVCCUcQooGC F. Michel1 B and E. Westhf (2 P..NDS,P “0h”cc”cQ IC2 13 Pa.ND4 vhuhchuvvhh---- 14 Pa.hTP6 ChhhChUhhhhhC16 IS Pa.ND1.2 hhVhCCVVhV----- 1.5 sc.OX1.S~ ChvGAuvhhvhhC35 17 Pp.Lsu.1 lJJG~gGVhuC44 18 Pp.Lsv.2 UGAVAI 7L13VhhhCGGC (9 Pe.OX1,16 hGZ>h ______*c 20 Cp.fLeu CGGACVCV ___- __..__ **AC IC3 21 “p.rLe” CGGAWX p------vu*! 22 “f.tLe” CffihW* u------VGU~ Alignment 23 Nt.tLa” CWACUCU U------“=a 24 Zn.fLeu CQQAW~ V------“G” 25 Pa.O*1,10 h”vcwgg ------*“A IBI 2627 P4.OX1.9 CVVC~ ______--*G*J SC.OXI.4 CVUU~ A------*C*G 28 sp.ox1,za cvvuwaQ, &------CGGA 29 Sc.COB.4 %A------AhUh% 30 Nc.hTP6 JA------ACAGQ! 31 h”.OX1,1 ~hhhUhC231lhVVhh~ Distant contacts 32 P~.OX1.3 ~hhhhCBB73hGGGhh~ 33 Pa.OX1.2 2~9421”hhGhhd 34 sc.ox1.3 Ct ,094 >hhVhhG IB2 35 Pa.ND5,3 AC 1177lhhVhhA 36 PI.OX2.1 ------*Q*u 37 P..ND1,4 AGAGWVGGI UVC147G3hCGhU IVCQGVCVCWh------hhhh~”h”hV~ 38 Na.ND1 VGhQaCVhCVhhVh----VGhVVUGWCUChVV------hhhQhVChChhhV~ 39 P*.oxl.* VhUCIhAUUV~OC12003ChhhVGhGGVV~VhVUh------AhVVVCVAhWhVhUQCVOC 40 P..OX,*11 UVGAAGVVG, KGV-----chuGi ~VGQCVVChCCCVhVhQVGhVhVhGQGhC5IhVVhhVVUCQFUhUh~ 3D 41 pa.ox1.13 CWGQhVhh------hCVVWGUVChC~hh------hhhVCUUOCUhVh~ 42 P..OX1;14 OChGGQUVAhVh---hhVChCVVhGCVWCCCVUV------hhhGhCCAWhVh~ CChCQhVhVVhh--VVVhhVhhGVGVCGUGCVV------hhhhUUChWhIh~ IB3 4344 Sc.OXl,Sbsp.ox1.3 CChVQ~VhV~hC4,,hVhh”hCKiVhVCh”Ghh”------hhhV~ChCCh~h~CGG”G 45 Pa.OX1.15 CChCGhVhffiCB7B~VVhGVhhGVGVCQVGUUG------VChW~h~hIhh~GVG 46 Cn.prbh.2 VhCG~------hhhVGQGGCCQVVhGhVhGhhhVhVCVahVG---hhChV~~VGVhV~VW 47 h”.OXl.3 ChQhQWChhhh------hUhVh~UVCV------VhhCWVGCVGVh~ (b) 48 sp.ox1;2 ChGhWVChh”“------VhhVh~CC”CV------VhhCOVVOCUhVhVOCVGG 49 h”.OXI .2 MCCGGVI “‘ACAVA---Ghhhhti IQCCGQCAU------“hhhmhVAUAWGG (Michel & Westhof, 1990) p..ox1; 7 hGCC(MVhM Sv”“h”hGhhhvv IGVGCCQGGVV------vhhhUCUQWIhhhVVtj IB4 8: Nc.NDS,Z QGhhQQVVVl UW-----VAGVI GQGWWCCV------“hhhWVChWhVhUQChGG 52 Ce.LSv.5 VAACGWVAI MJVC403VAhvvn ,GGQCCGVVhh------AhhhGhVGhCChVhVGCVGG 53 P1.COB.3 GAAVQAVAAI LGVA--hhhhVhA ,hUQVChVVCV------AhhhUVVCOCQhhCVGCVGG 54 P.s.COB. 1 55 an. COB 5.5 Sc.COB,3 57 Cn.psbh. 1 !5B cP.i.sv.i 59 Ce.LSv.3 60 Ca.LSV.4 61 Crn.LSV.4 62 SC.LSi 63 Nt . LS” 64 KC.hTPP IA1 65 sc.COB.5 66 Nc.COB.2 67 hn.LSV’ bB Nc.LS” 69 Pa.LSv.2 70 P..LS”, 1 71 Pa.N”I,L( 22 P1.OX2.2 rr~CV”--& 73 T4.nrdB i------**q LhhcGVhh----GzChl IA2 7475 T4.IUIIYTb.td 76 SPol.~31 77 cm.ssv IA3 7879 Ce.LSv. 1 Ce.LSV.* 80 Cl-.LS” 8, Pa.NDS.4 82 Pc,.OXl.G 83 Pa.OXl.12 ID 84 Nc.COB,I 85 PP.COB.2 86 C.s.COB - - 67 sc.COB.2 )OAOGUhhvh-----Uhhhvhvc~chhhuh ~UUI8483hVU~hVGCGhVhVVhUUGh-Ah- Fig. Al. (bl Figure 1. Two-dimensional models of group I introns. (a) Intron in the large ribosomal RNA of Tetrahpena (cl thermophiZa (Tt.LSU). Except for Pg.0 (see Michel et al., 1989), numbering of residues and nomenclature of secondary Figure 3. (b) Stereo views of the backbone showing the organization of the core into 2 major domains, P5-P4-P6 (thin structure pairings is as given by Burke et al. (1987; the drawing is slightly modified from the standard presentation of lines) and PS-P3-P’i (double lines). Thick curves correspond to the Pl and PlO pairings (substrates of the core these authors; Pl to P9 designate double-stranded segments, Ll to L9, terminal loops, J3/4, the segment connecting ribozyme). (c) Same as (b), but the angle of view is the same as in (a). (d) Stereo views of the backbone with conserved segments P3 to P4). Tertiary interactions (see Table 3) are indicated by broken lines (boxed base-pairs have at least 1 residues (see Fig. 1) added. The angle of view is the same as in (a). Thick curves correspond to the PlLPlO substrate. base participating in a tertiary interaction). Circled residues are conserved in all known group I introns with at most 3 (e) Same as (d), but the Pl-PlO substrate is in cross section. Residues shown are the same as is in (d). exceptions (see Appendix). (b) Intron in the large ribosomal RNA of Saccharomyces cerevisiae mitochondria (Sc.LSU). Numbering of core residues is as in Tt.LSU, with positions 210, 278 and 300 missing and 312.1 added (see Appendix). Peripheral parts are numbered in continuity with neighbouring core segments. Broken lines, circled and boxed residues as in (a). (L2) tend to be. The P2 stem is most often con- GNRA terminal loops are unexpectedly frequent in tiguous to PI, and in about half of all introns that natural RNAs (see, for instance, Jaeger et al., 1989), diester bond following the terminal G residue by the step may be regarded as a reversal of 5’ cleavage possess one, it ends with a four-nucleotide terminal constituting up to about half of all tetraloops in terminal 3’ OH of the 5’ exon, does not seem to (e.g. see discussions by Inoue et aZ., 1986). The loop (a “tetraloop”), the sequence of which reads catalytic introns (Michel et al,, 1989, and Appendix). require translocation of the Pl helix. Rather, the situation we have modelled corresponds to the state GNRA (Appendix; N is any nucleotide, R a purine). Strikingly, in all 28 introns (at least 6 different attacking and leaving groups of the first step following 5’ cleavage and immediately prior to exon exchange roles, so that, mechanistically, the ligation ligation. RNA 3D modeling: Molecular Dynamics RNA 3D folding can be simulated by Molecular Dynamics (MD) approaches.

JOBNAME: RNA 14#6 2008 PAGE: 2 OUTPUT: Wednesday May 7 22:16:42 2008 The MD simulations implement the functions (potentials) for interactions (“force fields”) acting csh/RNA/152282/rna8946 on atoms and molecular groups, that force them to move.

Known or predicted 2D structure is frequently used as a constraint. Downloaded from rnajournal.cshlp.org on October 6, 2014 - Published by Cold Spring Harbor Laboratory Press A number of algorithms use simplified coarse-grained models with “pseudoatoms”. For instance, RNA can be considered as a string with beads, with each nucleotide consisting of e.g. three (phosphate, sugar,Ab initio base) RNA or folding five (phosphate, sugar, three beads for a base) beads. experimental observations (Cao and Chen 2006). However, CG ModelCG for Model Simulation for Simulation of RNA of 3-D RNA Structures 3-D Structures J. Phys.J. Chem. Phys. B, Chem. Vol. 114, B, Vol. No. 114, 42, 2010No. 42,13499 2010 13499 due to the lattice constraints and the dynamic issues associated with predefined Monte Carlo moves (Baumgartner 1987), this approach is inadequate to study the folding dynamics of RNAs. Several other computational tools were developed for RNA 3D structure prediction (for review, see Shapiro et al. 2007). These methods either use comparative modeling of RNA sequences with known structures or utilize known secondary and tertiary structural information from experiments in interactive modeling (Major et al. 1991, 1993; Shapiro et al. 2007). Therefore, novel auto- mated computational tools are required to accurately predict the tertiary structure and dynamics of RNA molecules. Recently developed knowledge-based approaches using as- sembly of trinucleotide torsion-angle libraries (Das and (F. Ding et al., 2008) (Z. Xia et al., 2010) Baker 2007) are successful in predicting RNA structures for small globular RNA fragments (z30 nucleotides [nt]). However, RNA molecules often do not adopt globular topologies, such as the L-shaped tRNA. Enhanced prediction accuracy for longer RNA molecules is attainable by using physically principled energy functions and using an FIGURE 1. Coarse-grained structural model of RNA employed in accurate sampling of RNA conformations. DMD simulations. (A) Three consecutive nucleotides,Figure indexed 1.Figure(a) i Schematic 1.1,(a) Schematic representation representation of the CG of model the CG for model RNA. for Phosphate RNA. Phosphate and sugar and are sugar represented are represented as one CG as particle. one CG The particle. bases The A, G,bases C, A, G, C, and U areand represented UÀ are represented as three CG as three particles CG forparticles each. for (b) each. The components (b) The components of each CG of each base. CG The base. base The is divided base is by divided the red by dashed the red line. dashed (c) line. (c) Here, we introduce a discrete molecular dynamics (DMD) i, i + 1, are shown. Beads in the RNA: sugar (S), phosphateSchematic (SchematicP representation), and representation of all-atom of RNA all-atom backbone. RNA backbone. base (B). (Thick lines) Covalent interactions, (dashed lines) angular (Ding and Dokholyan 2005) approach toward ab initio 3D TABLETABLE 1: Properties 1: Properties of Nine CGof Nine Particles CG Particles TABLETABLE 2: Bond 2: Stretching Bond Stretching Interaction Interaction Parameters Parameters for the for the CG ModelCG of Model RNA ofFitted RNA by Fitted the Gaussian by the Gaussian Function Function and and constraints, (dashed–dotted lines) dihedral constraints.number Additionalnumber CG particle CG name particle namemass (amu) mass (amu) bond connections bond connections RNA structure predictions and characterization of RNA steric constraints are used to model base stacking. (B) Hydrogen ObtainedObtained From Statistical From Statistical Structures Structures 11 P P 94.970 94.970 2 2 bond bond b0 b Kbond K folding dynamics using simplified structural models. In bonding in RNA base pairing. (Dashed lines) The base-pairing22 S S 97.054 97.054 3 3 0 bond contacts between bases B :B and B :B . A reaction3 algorithm3 CG CG 53.022 53.022 3 3 1-21-2 3.85 3.85 11.12 11.12 contrast to the traditional molecular dynamics simulations, i 1 j+1 i j 4 N6 42.030 2 is used (see Materials andÀ Methods) for modeling the hydrogen4 N6 42.030 2 2-32-3 3.74 3.74 9.79 9.79 which are computation-intensive and hence expensive in 55 N2 N2 54.030 54.030 2 2 2-82-8 3.61 3.61 10.89 10.89 bonding interaction between specific nucleotide base pairs.66 O6 O6 43.014 43.014 2 2 3-43-4 4.29 4.29 57.70 57.70 probing RNA folding dynamics over long time scales, the 77 O2 O2 42.006 42.006 2 2 3-53-5 5.66 5.66 51.66 51.66 88 CU CU 26.016 26.016 3 3 3-63-6 4.28 4.28 44.60 44.60 DMD algorithm provides rapid conformational sampling 99 CA CA 39.015 39.015 2 2 3-93-9 4.33 4.33 109.19 109.19 4-74-7 4.55 4.55 44.00 44.00 (Ding and Dokholyan 2005). It is demonstrated in numer- RNA sequences, whose experimentally derived structures 4-8 3.59 124.29 the logarithm of both sides of eq 4 and dropping the constant 4-8 3.59 124.29 the logarithm of both sides of eq 4 and dropping the constant 4-94-9 3.53 3.53 93.79 93.79 ous studies that the DMD method is suitable for study- are available at the Nucleic Acid Database (NDB,term, afterterm, http:// performing after performing Boltzmann Boltzmann inversion, inversion, we have we have 5-65-6 4.57 4.57 37.14 37.14 ing various properties of protein folding (Chen et al. ndbserver.rutgers.edu), and compare our predictions with 6-76-7 4.53 4.53 57.10 57.10 k T k T 6-86-8 3.55 3.55 89.85 89.85 B B 2 2 2 2 7-87-8 3.52 3.52 82.87 82.87 2008) and protein aggregation (Ding and Dokholyan experimentally derived structures and folding dynamics.Ebond ) E (b)- b0)(b)-Kbbond) ()b -K b0)(b - b )(5) (5) 2bondσ2 2 0 bond 0 2005), and for probing different biomolecular mechanisms We restrict our study to RNA molecules having a length 2σ (Ding and Dokholyan 2005; Sharma et al. 2006, 2007). Here, greater than 10 nucleotides (nt) and shorter thanwhere the 100where temperature, nt. the temperature,T, is setT to, is be set 298 to K. be 298 K. were alsowere ﬁtted also with ﬁtted the with Gaussian the Gaussian function, function, and then and the then the Boltzmann inversion was used to calculate E . we extend this methodology to the RNA folding problem. Short RNA molecules lack well-formed tertiaryAngle structures Bending.Angle Bending.The distributionsThe distributions of bond of angles bond can angles be can be Boltzmann inversion was used to calculateangle Eangle. weightedweighted by a factor by a sin( factorθ) and sin( renormalizedθ) and renormalized by a factor by aZn factor. Zn. We simplify the RNA structural model by using a ‘‘bead-on- and were excluded from this study. All simulatedThe normalizedThe RNA normalized distribution distribution is expressed is expressed as as k T B kBT 2 2 2 2 a-string’’ model polymer with three coarse-grained beads— molecules (153 in total) are listed in Supplemental Table 1. -E /k T E )E (θ)- θ )(θ)-Kθ ()θ )- Kθ )(θ - θ ) (7) (7) P(θ) ) PZ(p()θ)/sin(Z pθ() )/sin() e bend) B e-Ebend/kBT (6) (6) angle angle2 20 a0 0a 0 phosphate, sugar, and base—representing each nucleotide Notably, this set of 153 molecules spans a range of tertiary nθ ) n θ θ ) 2σ 2σ

(see Materials and Methods; Fig. 1). We include the base- structural motifs: cloverleaf-like structures,where L-shapedθwhereis theθ angleis the between angle between neighboring neighboring bonds, while bonds,P( whileθ) P(θ) pairing, base-stacking, and hydrophobic interactions, the tRNAs, hairpins, and pseudoknots. and p(θ)and arep normalized(θ) are normalized and un-normalized and un-normalized distribution distribution func- func-DihedralDihedral Rotation. Rotation.As in atomicAs in force atomic ﬁelds, force the ﬁelds, tensional the tensional tions of θtions. The of distributionsθ. The distributions of bond angles of bond between angles between CG bonds CG bondsenergy takesenergy the takes formula the formulaof of parameters of which are obtained from experiments. The For each RNA molecule, we first generate a linear con- coarse-grained nature of the model, as well as the efficiency formation using the nucleotide sequence. Starting from this of the conformational sampling algorithm, enables us to extended conformation, we perform replica exchange simu- rapidly explore the possible conformational space of RNA lations at different temperatures (see Materials and Meth- molecules. ods). The three-dimensional conformation corresponding to the lowest free energy is predicted as the putative structure of the RNA molecule, assuming that the corre- RESULTS sponding native structure is unknown. The extent of native structure formation in simulations is measured by com- Large-scale benchmark test of DMD-based ab initio puting the Q-values (akin to protein folding experiments RNA structure prediction on 153 RNA sequences [Sali et al. 1994], see Materials and Methods), defined as We test the predictive power of the DMD-based RNA the fraction of native base pairs present in a given RNA folding approach by selecting a set of intermediate-length conformation. We compute Q-values for the lowest free

www.rnajournal.org 1165 RNA 3D modeling RNA backbone can be approximated by a coarse-grained representation with virtual bonds, reducing computational complexity.

146 W.K. Dawson et al. / Methods 103 (2016) 138–156

compared to proteins, the backbone conﬁgurations alone are not sufﬁcient to describe important physical properties of the bases.

However, the P–C40 model is uniquely suited to this purpose [119]. The details of these strategies will be discussed in the philosophy sections where many of the strengths and weakness of the approaches can be compared.

All-atom (minus H atoms) 4.2. Philosophy behind TBP-like CG approaches structure: Here, we discuss the methodology, some examples, and some challenges to the TBP-like approach.

4.2.1. The force fields of TBP-like CG methods Since TBP-like CG methods derive their concepts mostly from the all-atom MD approaches, the concept of a force field and its application to beads instead of individual atoms has considerable appeal. For one thing, biomolecules typically contain a plethora of hydrogens. Why not just ignore them if we can make life easier? The difference is that the beads now represent the collective Coarse-grained backbone motion and the interaction of a cluster of atoms is built from the chemical intuition of the person developing the CG representation. repesentation: There is no obvious way to select or construct the beads and no recipe to derive or estimate the interaction potential as for AAMM [17,18]. For TBP-like CG methods, one seeks to re-parameterize the force field parameters in MD simulations to approximate the AA-MD simulations. In such a model, the binding interactions are approx- Fig. 3. Definition of angles and bonds along the backbone of RNA. Top: All atom imated from all atom simulation and possibly other auxiliary representation (minus the H atoms) showing a short ribonucleotide sequence GUG with the names of the relevant atoms and torsion angles. Bottom: The same chain experimental information, when available [17,18]. These resulting showing only the two bead per residue coarse-grained representation (composed of force fields typically have the same form as Eq. (1); however, the P–C40 and C40–P virtual bonds) of the same sequence and the pseudo-torsion angles interaction potentials represent an integration of the forces over g and h. the groups of atoms being approximated by individual beads. For example, when this involves treating a methyl group as a bead, it to these two virtual bonds [121]. This approach has been imple- is relatively easy to understand that the H atoms are ‘‘smeared mented in some approaches that will be discussed later. out”. However, when a whole residue is treated as a bead, it is Reduction of eight bonds to two pseudobonds that connect two not so obvious what to neglect in the pseudo-atoms that one beads is not the only way to cover the RNA backbone. Olson had (Dawsoncreates. et al, 2016) also shown the possibility of using a single virtual bond on the The core advantage of TBP-CG approaches over TBP-AA is the backbone that connects the phosphate atoms [122]; though this large reduction of conformational space available for the molecule. approach required different virtual bonds for 30-endo and 20-endo There are several consequences of such an operation. First, the sugar pucker (the two most common conformations). It is also pos- number of interactions is reduced, which may speed up the com- sible to increase the number of beads. Another minimalist way is to putations. Second, the potential energy function is usually use very few beads, but employ very complex potentials [21] to smoothed and, therefore, with the absence of many local minima, describe the composite interactions in a rigid body nucleotide for the conformation of the molecule can evolve faster. Third, the high both the backbone and the base, as implemented in oxRNA [50] frequency vibrations (especially these of protons) are removed and and in models from Liwo, Scheraga and coworkers; e.g., Ref. [53]. a longer time-step can be used in MD simulation; i.e., a longer real It is also possible to consider beads or blocks that represent time evolution of the system can be achieved with the same com- much larger structures than individual ribonucleotide residues. putational effort. However, switching from TBP-AA to TBP-CG usu- Examples of this approach occur early on with programs like ally leads to a more complicated analytical potential. For example, YAMMP [123,124] that could reduce the whole dsRNA helix to a in NARES-2P [53], the spherically symmetric Lennard-Jones poten- single bead, and more recent methods like RAGTOP [125] and ERN- tial is replaced by the Gay-Berne ellipsoid of revolution potential WIN [126] that reduce the RNA structure to helices, loops and junc- and charges are replaced with electric dipoles, which are also used tions and try to arrange them based on the statistics of loops and in the dipolar-bead model [52]. Also replacing many covalent junctions. The model of Jost and Everaers [127] uses a lattice model bonds with one virtual bond usually leads to larger deviations from where the stems are treated as a unit and the conformation space harmonicity of the bonding terms of potential energy function. of the loop regions is explored. This allows for efficient sampling of Nevertheless, in principle, low energy vibrational modes of RNA all the possible conformations; however, the lattice model also could be used to model the vibrational modes of the beads and like limits any precise transformation between the CG representation AA approaches and these parameters could be obtained from IR and the actual 3D structure. spectroscopy. However, such information is usually quite difficult to extract from spectroscopic data and rarely all that helpful [18]. 4.1.2. Representation of the base Using TBP-CG methods, one is able to simulate the time- If the decision is to divide the structure into monomers and dependent dynamics of RNA (only with less resolution) and the work with beads within these monomers, then the next issue is simplifications (such as the number of beads and bonds) can be how to show the base that is attached to the backbone. In general, adjusted to the scale of the interaction of interest. Thermodynamic although nucleic acids have far more regularity in the side chains potentials result from the collective sampling of conformations Downloaded from rnajournal.cshlp.org on March 9, 2019 - Published by Cold Spring Harbor Laboratory Press

Kerpedjiev et al.

graph and then fitting helices onto the all-atom model to get the 3D coordinates of the coarse-grain representation (see next section).

The helix and the 3D model At the core of the Ernwin tertiary structure prediction package is the reduced cylinder-like model of an RNA helix. The representation of the helix is defined by a line segment indicating the start and end points of the axis of the helix (as, ae) as well as two vectors pointing from the ends of the axis to the middle of the first and last base pairs, respectively (ts, te) as depicted in the schematic (Fig. 2; Supplemental FIGURE 1. The coarse-grain representation of the 2D structure of an Fig. A.10). The calculation of these parameters cannot exactly repre- RNA molecule. (A) The paired regions are shown as gray rectangles. The arcs show the path of the strand in connecting the paired regions. sent a helix insofar as RNA helices deviate from an ideal double he- The labels in black are names given to distinguish the different second- lix. While such a representation has previously been alluded to ary structure elements in the graph. The elements f1 and t1 are the 5′ and (Laederach et al. 2007; Popenda et al. 2012), the calculation of the 3′ unpaired regions, respectively. Elements starting with “s” correspond axis and twist vectors has never been explicitly defined. We tested to base-paired canonical helices. Elements starting with an “h” are hair- four different methods for fitting idealized helices to real RNA dou- pins. Interior loops and multiloops are denoted by names starting with ble helices, the details of which are documented in Supplemental “i” and “m,” respectively. The numbers in red indicate the corners of the stem. (B) The skeleton graph representation of the structure. Section A.5. The position of the twist values is illustrated in Supplemental Section A.5.5 and Supplemental Fig. A.10. RNA 3D modeling nucleotides before the first stem. This section is always connected to Journal of the American Chemical Society Article the first stem. If there are no paired regions, then the entire molecule Proposal distribution, model building, and sampling structure and conformation of nucleic acids, including, as will be a single 5 unpaired region. 37−39 ′ Molecular Dynamics simulations benchmarks, TAR complexes. “Interior loops” are double-stranded regions which link exactly The proposal distribution for new structures is based on a set of sta- Here, we build on these technical advances to re-examine the show that helical stems behave like two stems and contain no canonical base pairs, although they may tistics relating the orientation of two adjacent helices, the orienta- conformational selection hypothesis of TAR-ligand recognition (quasi-)rigid domains. by using advanced computational techniques to reexamine the be rich in noncanonical base pairs (Leontis et al. 2006). These re- tions of hairpin loops, and the 5′ and 3′ unpaired regions relative to helices. Just as the position of 1 nt relative to the previous can experimental results. We collected μs-long MD simulations gions always connect corners 2 and 3 of one stem (sj) to corners be expressed as a function of the torsion angles and sugar pucker, starting from various instances of apo and holo TAR structures 1 and 4 of the next stem (sk), where the term “next stem” implies depleted of the ligands using the parmbsc0 AMBER force the position of one coarse-grain helix relative to the previous can ﬁ 37 ﬀ that c1(sj)

Creation of the secondary structure graph The secondary structure graph is created from RNA second- FIGURE 2. An illustration of the helix model for the 53 nucleotide SMK box (SAM-III) Riboswitch RNA structure (C, PDB: 3e5c). The he- ary structure predictions. Currently, we use RNAfold from the lices are shown as green cylinders, interior loops as thinner yellow cyl- ViennaRNA v2 package (Lorenz et al. 2011). The coarse-grained inders, multiloops as collections of red cylinders, hairpin loops as thin graph can be trivially created from any secondary structure repre- blue cylinders, and the 3′ unpaired region is shown in magenta. A 5′ un- sentation or prediction algorithm (i.e., minimum-free energy fold- paired region is missing from this structure since the first nucleotide is ing, centroid structures, nonphysics based methods) which does already paired. We denote the twist parameters as orange lines protrud- not contain pseudoknots. Threading a coarse-grain model onto a ing from the axis of a helix as viewed along (A) and perpendicular (B) to the axis of a helix. Each one is perpendicular to the cylinder axis and known 3D structure requires the extraction of the secondary struc- points in the direction of the midpoint between the C1′ atoms of the first ture, for which we use the annotation produced by MC-Annotate and last base pair in the helix. In C, twist vectors are interpolated for the (Gendron et al. 2001), removing the pseudoknots (conflict elimina- base pairs in the middle of the stem with values stored only for the vec- tion method) (Smit et al. 2008), creating the secondary structure tors at the end of each stem.

1112 RNA, Vol. 21, No. 6