Published online Xxxx 2014
Nucleic Acids Research, 2014, Vol. XX, No. YY 1–37
doi:10.1093/nar/gkn000

Supplementary material: Annotating RNA motifs in sequences and alignments

Paul P. Gardner1,∗ and Hisham Eldai1,∗

1School of Biological Sciences, Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch 8140, New Zealand

Received July, 2014; Revised July, 2014; Accepted July, 2014

∗To whom correspondence should be addressed. Tel: +64 3 364 2987; Fax: +64 3 364 2590; Email: [email protected]

© 2014 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[11:22 5/11/2014 supplementary-results.tex] Page: 1 1–37 2 Nucleic Acids Research, 2014, Vol. XX, No. YY

SUMMARY

In the following document we present supplementary methods, results and figures relating to the RMfam resource:

1. Figures 1-8 illustrate secondary structure diagrams for each of the RMfam motifs. Figure 1 contains a legend detailing the color and symbol schemes used to illustrate different evolutionary constraints on the different structures.
2. Figure 9 illustrates our estimates of the accuracy of using covariance models to annotate RNA motifs on sequences and alignments.
3. Figures 10-43 contain secondary structures and the results of per-motif benchmarks.
4. Figures 44 and 45 illustrate improvements to Rfam (v11.0) alignments and consensus structures based upon RMfam annotations.
5. Figure 46 illustrates the network of the 50 highest scoring RMfam to Rfam mappings.


SECONDARY STRUCTURES

[Figure: legend and secondary structure diagrams. Base pair annotations: covarying mutations, compatible mutations, no mutations observed. Nucleotide presence and nucleotide identity are shown at conservation levels between 40% and 80%. R = A or G; Y = C or U.]

Figure 1. A legend describing the symbols used in all the secondary structure images presented in Figures 1-8. Secondary structure diagrams of: tetraloops: ANYA (1, 2, 3), CUYG (4, 5, 6, 7), GNRA (8, 9, 10, 11, 12, 13), UMAC (14, 15) and UNCG (10, 12, 13, 16) and the hairpin loops C-loop (17, 18, 19, 20), T-loop (12, 13, 21, 22, 23) and U-turn (12, 13, 24, 25).

[Figure: secondary structure diagrams.]

Figure 2. Secondary structure diagrams of the hairpin loops: C-loop (17, 18, 19, 20), T-loop (12, 13, 21, 22, 23) and U-turn (12, 13, 24, 25).

[Figure: secondary structure diagrams.]

Figure 3. Secondary structure diagrams of: internal loops: three k-turns (3, 12, 13, 18, 26, 27, 28) and two sarcin-ricin loops (12, 20, 29, 30).

[Figure: secondary structure diagrams.]

Figure 4. Secondary structure diagrams of: internal loops: the tandem-GA (20, 31), twist up (17), UAA GAN (32), docking elbow (33) and right angle 2 and 3 (34) motifs.

[Figure: secondary structure diagrams.]

Figure 5. Secondary structure diagrams of Rho-independent transcription terminators (35).

[Figure: secondary structure diagrams.]

Figure 6. Secondary structure diagrams of: interactions: the AUF1 (36), CRC (37, 38, 39), CsrA (40, 41, 42, 43, 44, 45, 46, 47), HuR (48, 49), Roquin (50) and VTS1 (51, 52, 53, 54) protein binding motifs.

[Figure: secondary structure diagrams.]

Figure 7. Secondary structure diagrams of: vapC target (55), the SRP RNA S domain (56, 57, 58) and the catalytic Domain-V (59, 60).

[Figure: three WebLogo 3.1 sequence logos showing information content (bits, 0.0 to 2.0) against distance from the start codon (nucs, -30 to 5), each with a consensus sequence:
Shine-Dalgarno sequence from Bacillus subtilis subsp. subtilis str. 168: 5´ R A A A R G G G G G R R Y Y R Y A U G A R R R A R
Shine-Dalgarno sequence from Escherichia coli str. K-12 substr. MG1655: 5´ Y Y Y Y Y Y Y Y R R R G G R R R A Y Y A U G A R A R
Shine-Dalgarno sequence from Helicobacter pylori 26695: 5´ Y Y Y Y A A G G R R Y A U G R R A R A]

Figure 8. Secondary structure diagrams of: sequence motifs: Shine-Dalgarno sequences from Bacillus subtilis, Escherichia coli and Helicobacter pylori respectively (61).


BENCHMARKING

To ensure that our approach provides accurate predictions, we carried out extensive benchmarking of the covariance models. We ran three benchmarks on all the RMfam covariance models: the RMfam sequence benchmark, the RMfam2Rfam alignment benchmark and the RMfam2Rfam sequence benchmark. These benchmarks are distinguished primarily by what is considered a true positive.

RMfam sequence benchmark

Unfortunately, most of the alignments in RMfam are composed of few sequences; the median number of sequences in the RMfam alignments is just 34.5. This means that idealised benchmarking strategies, such as cross-validation, are unlikely to provide useful results. Therefore we tested these covariance models (CMs) on the training (seed) sequences, using a large negative control. This consisted of 10 permuted sequences for each seed sequence and 10 permuted sequences for each PDB sequence (62). To control for sequence composition biases, the di-nucleotide content was preserved between the native and permuted sequences (63). In addition, to identify members of the motif family with solved structures, we ran the CMs over 11,508 nucleotide sequences extracted from the June 2014 release of PDB. We used the results of this benchmark to identify a bit score threshold for each motif; ideally, this value discriminates between the true members of the family and the negative control (permuted) sequences. In practice, a slightly lower than optimal threshold is generally selected, as false positives are generally considered preferable to false negatives. The results of these tests are illustrated in Supplementary Figures 10-43.
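The threshold selection described above can be sketched as a scan over candidate bit scores. This is an illustrative sketch only, not the code used for RMfam: the function names and the example scores are hypothetical, and it assumes that a sequence counts as a hit when its bit score is at or above the threshold.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(positive_scores, negative_scores):
    """Scan every observed score as a candidate threshold and keep the one
    that maximises the MCC (a hit is any score >= threshold)."""
    best_t, best_m = None, -1.0
    for t in sorted(set(positive_scores) | set(negative_scores)):
        tp = sum(s >= t for s in positive_scores)
        fn = len(positive_scores) - tp
        fp = sum(s >= t for s in negative_scores)
        tn = len(negative_scores) - fp
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

# Hypothetical bit scores: seed (true) sequences and permuted (negative control) sequences.
seed_bits = [22.1, 18.4, 25.0, 15.2]
shuffled_bits = [3.0, 7.5, 9.1, 12.0, 16.0, 4.2]
threshold, score = best_threshold(seed_bits, shuffled_bits)
```

A curated threshold, as described in the text, would then be set slightly below the value returned by such a scan.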

RMfam2Rfam alignment benchmark

There are many instances of RNA families (Rfam) with good evidence that they host RNA motifs, and many of these have been published in the literature. For the purposes of benchmarking we have curated a collection of motifs in Rfam, and annotated the evidence associated with each (see Supplementary Table 1). The bulk (261/446) of these are derived from Cruz and Westhof (2011) (20), 37 are from other publications (17, 20, 22, 24, 26, 33, 34, 37, 64, 65, 66, 67, 68, 69, 70, 71, 72) and 148 were curated by ourselves. These connections between RMfam and Rfam cover 238/2208 Rfam families and 21/34 RMfam motifs.

In order to automate the prediction of motifs in Rfam alignments we built a Perl wrapper (rmfam_scan.pl), which is available on GitHub: http://github.com/ppgardne/RMfam. Our approach begins by making the input Rfam (version 11.0) seed alignments non-redundant by filtering out sequences that are more than 90% similar to each other. We annotate the remaining sequences with each RMfam motif, using the score threshold determined during the “RMfam sequence benchmark”. We further filter these annotations by selecting only those that are identified in at least two sequences and in ≥10% of the sequences in each Rfam alignment.

We experimented with a number of approaches for generating negative control alignments that preserved the characteristics of sequence conservation found in the Rfam alignments, including multiperm (73), SISSIz (74), “esl-shuffle” (75) and “shuffle-aln.pl” from the RNAz package (76). We selected shuffle-aln.pl for generating our negative controls because it (A) ran on our computers and (B) did not significantly alter key characteristics of the alignments, e.g. sequence lengths and sequence identity (data not shown).

We experimented with a number of summary statistics for identifying “good” matches between our motifs and Rfam. These included the fraction of annotated sequences, a tree-weighted sum of bit scores (77) and the sum of all bit scores for each motif in each Rfam alignment (see Supplementary Figure 9). We selected the latter (sum of bit scores) as the preferred summary statistic, as it provided the maximum Matthews Correlation Coefficient (MCC) of all the measures we tested and is trivial to compute (Figure 9).
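The per-alignment filter and the sum-of-bits statistic can be illustrated with a short sketch. It is not taken from the RMfam wrapper script: the function name, parameter names and example scores are hypothetical, and it assumes that each above-threshold hit comes from a distinct sequence of the non-redundant alignment.

```python
def summarise_motif_hits(hit_bits, n_sequences, min_hits=2, min_percent=10):
    """Apply the alignment-level filter and return the sum of bit scores.

    hit_bits    : bit scores of above-threshold hits, one per annotated sequence
    n_sequences : number of sequences in the non-redundant alignment
    Returns None when the motif is found in fewer than `min_hits` sequences
    or in less than `min_percent` percent of the sequences.
    """
    n_hits = len(hit_bits)
    # Integer arithmetic avoids floating-point surprises at the 10% boundary.
    if n_hits < min_hits or n_hits * 100 < min_percent * n_sequences:
        return None
    return sum(hit_bits)

# Hypothetical examples.
kept = summarise_motif_hits([12.5, 14.0, 11.5], n_sequences=20)   # 3/20 = 15% of sequences
too_few = summarise_motif_hits([12.5], n_sequences=5)             # only one hit
too_sparse = summarise_motif_hits([12.0, 13.0], n_sequences=40)   # 2/40 = 5% of sequences
```

Because the statistic is a plain sum, it grows with the number of annotated sequences, which is the motivation for the single-sequence benchmark that follows.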

RMfam2Rfam sequence benchmark

The depth of Rfam seed alignments varies from 2 to 1,020 sequences. Consequently, measures like sum-of-bits can reflect the number of sequences in an alignment rather than the likelihood that it hosts a motif. To compensate for this we sampled up to 5 sequences from each Rfam seed alignment and annotated these sequences individually (skipping the similarity reduction and the minimum number of sequences filters used for the alignment benchmark). Ten shuffled versions of each sampled Rfam sequence were also generated and annotated.
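The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the sequence identifiers are hypothetical, sampling is uniform without replacement, and the shuffling of each sampled sequence is not shown.

```python
import random

def sample_sequences(seed_sequences, max_n=5, rng=None):
    """Return up to `max_n` sequences drawn without replacement from a list.

    Alignments with fewer than `max_n` sequences are returned in full
    (in random order), so every alignment contributes at most `max_n` sequences.
    """
    rng = rng or random.Random()
    k = min(max_n, len(seed_sequences))
    return rng.sample(seed_sequences, k)

# Hypothetical alignments at the two extremes of depth quoted in the text.
rng = random.Random(0)  # fixed seed so the example is repeatable
shallow = sample_sequences(["seq1", "seq2"], rng=rng)
deep = sample_sequences([f"seq{i}" for i in range(1020)], rng=rng)
```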

Definitions of performance measures

In the following results we display a range of performance measures for all RMfam annotations. We briefly summarize these below. Each prediction is classified as either a true positive (TP), true negative (TN), false positive (FP) or false negative (FN). The totals of these can be used to compute a range of performance statistics. These include the Sensitivity, or the fraction of true data that are correctly assigned; the Specificity, or the fraction of false data that are correctly assigned; the Positive Predictive Value (PPV), or the fraction of predicted trues that are correct; the Negative Predictive Value (NPV), or the fraction of false predictions that are correct; the False Discovery Rate (FDR), or the fraction of true predictions that are incorrect; the Accuracy (ACC), or the fraction of all predictions (true and false) that are correct; and the False Positive Rate (FPR), or the fraction of false data that are incorrectly predicted to be true. Finally, a common measure for determining the accuracy of a method is the Matthews Correlation Coefficient (MCC). This measure ranges between +1 and −1: a value of +1 indicates perfect discrimination between true and false members, a value of 0 implies no predictive power and a value of −1 indicates that every prediction is wrong.

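\[
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{FDR} = \frac{FP}{TP + FP} = 1 - \mathrm{PPV}
\]

\[
\mathrm{Specificity} = \frac{TN}{FP + TN}, \qquad
\mathrm{ACC} = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + TN + FP + FN}
\]

\[
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
\]

\[
\mathrm{NPV} = \frac{TN}{TN + FN}
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]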
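As a worked illustration of these definitions, the sketch below computes every measure from the four confusion-matrix totals. The function name and the example counts are hypothetical; undefined ratios (zero denominators) are returned as NaN, which is a choice made here rather than something specified in the text.

```python
import math

def performance_measures(tp, tn, fp, fn):
    """Compute the performance measures defined above from TP, TN, FP and FN."""
    def ratio(num, den):
        # A measure is undefined when its denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SENS": ratio(tp, tp + fn),
        "SPEC": ratio(tn, fp + tn),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "FDR": ratio(fp, tp + fp),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "FPR": ratio(fp, fp + tn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical totals: 10 true members and 90 negative controls.
m = performance_measures(tp=8, tn=85, fp=5, fn=2)
```

With these counts the sensitivity is 8/10, the accuracy is 93/100, and FDR equals 1 − PPV, as in the equations above.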

[Figure: four panels. ROC plot (Sensitivity vs Specificity); ROC-like plot (FDR vs ACC); MCC vs threshold; FDR plot (FDR vs threshold). Series: fraction seqs., sum bits, weighted sum bits, single-sequence sum bits.]

Figure 9. Testing a variety of summary statistics for identifying RMfam motifs in Rfam seed alignments. These were the fraction of sequences, the sum of bit scores, a tree-weighted sum of bit scores and a sum of bit scores for single sequences sampled from each Rfam alignment. The top left figure is a ROC plot (78), the top right shows the false discovery rate versus accuracy trajectory for each score, the bottom left shows the Matthews Correlation Coefficient as a function of the threshold and the bottom right shows the false discovery rate as a function of the threshold.


Individual motif performance

The following figures (S10 to S43) illustrate the annotation accuracy for each of the motifs in RMfam. On the far left of each figure is an illustration of the motif secondary structure and sequence conservation (see Figure 1 for a legend). In the middle is an illustration of the covariance model score distributions over sequences derived from the PDB, sequences from the RMfam seed alignments, and their shuffled PDB and RMfam counterparts. A curated “threshold” for distinguishing between true and false sequence matches is illustrated with a dashed vertical line. The right figure contains four panels. Starting from the top left and moving clockwise, these are: ROC curves for each of the three benchmarks described previously; ROC-like curves of PPV vs Specificity; a bar plot illustrating the MCC, sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), accuracy (ACC) and the false discovery rate (FDR), each computed using the threshold that maximises the MCC; and the MCC shown as a function of the covariance model bit score (or sum of bit scores in the alignment benchmark).

[Figure: ANYA. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 10. ANYA.

[Figure: AUF1_binding. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 11. AUF1 binding.

[Figure: C-loop. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 12. C-loop.

[Figure: CRC_binding. Panels: consensus sequence (5´ A A Y A A R A A C A A A R R); CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 13. CRC binding.

[Figure: CsrA_binding. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 14. CsrA binding.

[Figure: CUYG. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 15. CUYG.

[Figure: docking_elbow. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 16. docking elbow.

[Figure: Domain-V. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 17. Domain-V.

[Figure: GNRA. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 18. GNRA.

[Figure: k-turn-1. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 19. k-turn-1.

[Figure: k-turn-2. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 20. k-turn-2.

[Figure: pK-turn. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 21. pK-turn.

[Figure: RBS_B_subtilis. Panels: consensus sequence (5´ R A A A R G G G G G R R Y Y R Y A U G A R R R A R); CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 22. RBS B subtilis.

[Figure: RBS_E_coli. Panels: consensus sequence (5´ Y Y Y Y Y Y Y Y R R R G G R R R A Y Y A U G A R A R); CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 23. RBS E coli.

[Figure: RBS_H_pylori. Panels: consensus sequence (5´ Y Y Y Y A A G G R R Y A U G R R A R A); CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 24. RBS H pylori.

[Figure: HuR_binding. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 25. HuR binding.

[Figure: right_angle-2. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 26. right angle-2.

[Figure: right_angle-3. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 27. right angle-3.

[Figure: Roquin_binding. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 28. Roquin binding.

[Figure: sarcin-ricin-1. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 29. sarcin-ricin-1.

[Figure: sarcin-ricin-2. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 30. sarcin-ricin-2.

[Figure: SRP_S_domain. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 31. SRP S domain.

[Figure: tandem-GA. Panels: secondary structure; CM score (bits) histogram; ROC curve; ROC-like curve; Score vs MCC; Performance.]

Figure 32. tandem-GA.

[Figure panels: ROC curve and ROC-like curve for Terminator1 (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); Terminator1 consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 33. Terminator1.


[Figure panels: ROC curve and ROC-like curve for Terminator2 (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); Terminator2 consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 34. Terminator2.


[Figure panels: ROC curve and ROC-like curve for T-loop (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); T-loop consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 35. T-loop.

[Figure panels: ROC curve and ROC-like curve for TRIT (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); TRIT consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 36. TRIT.


[Figure panels: ROC curve and ROC-like curve for twist_up (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); twist_up consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 37. twist up.


[Figure panels: ROC curve and ROC-like curve for UAA_GAN (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); UAA_GAN consensus secondary structure (carrying the label "2-22 nt"); Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 38. UAA GAN.

[Figure panels: ROC curve and ROC-like curve for UMAC (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); UMAC consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 39. UMAC.


[Figure panels: ROC curve and ROC-like curve for UNCG (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); UNCG consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 40. UNCG.


[Figure panels: ROC curve and ROC-like curve for U-turn (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); U-turn consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 41. U-turn.

[Figure panels: ROC curve and ROC-like curve for vapC_target (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); vapC_target consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 42. vapC target.


[Figure panels: ROC curve and ROC-like curve for VTS1_binding (sensitivity against specificity and PPV, respectively; Rfam alignments (sum-bits), Rfam sequences (bits), RMfam sequences (bits)); VTS1_binding consensus secondary structure; Score vs MCC (frequency and MCC against CM score in bits; PDB, PDB-shuffle, SEED, SEED-shuffled, threshold); Performance (PPV, FDR, NPV, ACC, MCC, SENS, SPEC against CM score in bits).]

Figure 43. VTS1 binding.
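The per-motif benchmark figures report PPV, FDR, NPV, ACC, MCC, SENS and SPEC, and mark a bit-score threshold on the Score vs MCC panels. The following is a minimal sketch of how such statistics can be computed from two sets of CM bit scores, and how a threshold could be chosen by maximising MCC. It uses the standard textbook definitions; whether the thresholds shown in the figures were chosen in exactly this way is not stated in this section, and the function names and the score lists below are hypothetical and only for illustration.

```python
import math

def confusion_counts(pos_scores, neg_scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called as motif hits."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn

def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def performance(tp, fp, tn, fn):
    """Standard binary-classification statistics from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SENS": _ratio(tp, tp + fn),
        "SPEC": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "FDR": _ratio(fp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "MCC": _ratio(tp * tn - fp * fn, mcc_den),
    }

def best_threshold(pos_scores, neg_scores):
    """Scan every observed score as a cutoff; return (threshold, MCC) with the highest MCC."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        mcc = performance(*confusion_counts(pos_scores, neg_scores, t))["MCC"]
        if not math.isnan(mcc) and (best is None or mcc > best[1]):
            best = (t, mcc)
    return best

# Hypothetical CM bit scores (not data from the paper): true motif
# instances versus shuffled control sequences.
positives = [12.0, 15.5, 18.0, 22.0, 9.0]
negatives = [2.0, 4.5, 7.0, 10.0, 3.0, 1.0, 5.0, 6.0]

threshold, mcc = best_threshold(positives, negatives)
stats = performance(*confusion_counts(positives, negatives, threshold))
print(threshold, round(mcc, 3))
print({name: round(value, 3) for name, value in stats.items()})
```

Scanning only the observed scores is sufficient because the confusion-matrix counts can only change at those values. Cutoffs at which MCC is undefined (for example when every sequence is called a hit) are skipped.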


[Figure: secondary structure diagrams of the original and refined Rfam 5S rRNA models, with the twist up motif, GNRA tetraloop and sarcin-ricin loop labelled.]

Figure 44. A comparison of the Rfam 11.0 5S rRNA consensus structure and a corresponding manually corrected model, refined using RMfam annotations as a guide. The RMfam annotations identified a number of conserved motifs in the 5S rRNA model. These include the twist up motif (17), a sarcin-ricin motif (29) and a GNRA motif (8). The sarcin-ricin loop appeared to be misaligned in a number of cases in the Rfam alignment; the RMfam annotations allowed the alignment to be refined, correcting the alignment of this conserved motif.

[Figure: secondary structure diagrams of the original and refined Rfam RsmY models, with the CsrA binding motif labelled.]

Figure 45. The RsmY sRNA family in Rfam 10.1 had a malformed consensus secondary structure. RMfam annotations identified an additional CsrA binding motif, which allowed the structure to be refined to emphasise this fact. CsrA is a dimeric protein that generally binds to two motifs; the refined structure has a better fit with this model (79, 80).


[Figure: circular network diagram. Node labels include motifs such as GNRA, T-loop, U-turn, UNCG and k-turn-1, and Rfam families such as RNaseP_bact_a, RNaseP_bact_b, RNaseP_arch, tmRNA, tRNA, Intron_gpII, RsmY and SSU_rRNA_bacteria.]

Figure 46. A network of the highest scoring 100 annotations of RMfam on Rfam. The nodes on the inner circle show 8 RMfam motifs; the nodes on the outer circle show 64 Rfam families. The edges connecting the nodes indicate high-scoring predictions.

Networks

We can now gain insights into the network of RNA motifs and families. This reveals aspects of the evolutionary constraints on RNA structure, as well as convergent evolution and function. An example of an extreme evolutionary constraint that we have observed is the GNRA tetraloop in the bacterial A, bacterial B and archaeal RNase P RNA families. This loop is located on the P9 helix of RNase P and appears to have been conserved throughout the evolutionary span of bacteria and archaea (64, 81). The structurally diverse domains 1-4 of the group II intron families are also enriched with GNRA tetraloop-hosting helices near the 5′ end of the region (see Figure 46); other than this loop, there is little that is conserved between these presumably homologous sequences and structures.

A striking example of convergent evolution of analogous structures is the intrinsic bacterial transcription terminators (82) (see Figure S4). These motifs are required for the efficient termination of transcription (35). We see that these are frequently used by many bacterial small RNAs and cis-regulatory elements such as 5′ leaders (see Figure 46), a result that serves to validate the accuracy of our method as well as illustrating the plasticity of transcription terminator evolution.
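A motif-to-family network of this kind can be sketched as a bipartite graph built from the highest scoring annotations. The example below is a minimal illustration only: the function name `top_scoring_network`, the tuple layout and all of the scores are hypothetical placeholders, not values or code from RMfam.

```python
from collections import defaultdict

def top_scoring_network(hits, n_edges):
    """Keep the n_edges highest-scoring (motif, family) pairs.

    Returns a dict mapping each motif to the set of families it is
    linked to, i.e. the adjacency list of a bipartite network.
    """
    ranked = sorted(hits, key=lambda hit: hit[2], reverse=True)[:n_edges]
    network = defaultdict(set)
    for motif, family, _score in ranked:
        network[motif].add(family)
    return dict(network)

# Placeholder (motif, family, score) annotations; the scores are invented
# for illustration and are not results from the paper.
hits = [
    ("GNRA", "RNaseP_bact_a", 30.0),
    ("GNRA", "RNaseP_arch", 28.5),
    ("Terminator", "His_leader", 25.0),
    ("T-loop", "tRNA", 22.0),
    ("UNCG", "SSU_rRNA_bacteria", 5.0),
]

network = top_scoring_network(hits, n_edges=4)
for motif, families in sorted(network.items()):
    print(motif, sorted(families))
```

Motifs linked to many families in such a structure are candidates for the kind of conserved or convergently evolved elements discussed above.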


REFERENCES

1. M A Convery, S Rowsell, N J Stonehouse, A D Ellington, I Hirao, J B Murray, D S Peabody, S E Phillips, and P G Stockley. Crystal structure of an aptamer-protein complex at 2.8 Å resolution. Nat Struct Biol, 5(2):133–9, Feb 1998.
2. S Rowsell, N J Stonehouse, M A Convery, C J Adams, A D Ellington, I Hirao, D S Peabody, P G Stockley, and S E Phillips. Crystal structures of a series of RNA aptamers complexed to the same protein target. Nat Struct Biol, 5(11):970–5, Nov 1998.
3. P S Klosterman, D K Hendrix, M Tamura, S R Holbrook, and S E Brenner. Three-dimensional motifs from the SCOR, structural classification of RNA database: extruded strands, base triples, tetraloops and U-turns. Nucleic Acids Res, 32(8):2342–52, 2004.
4. C R Woese, S Winker, and R R Gutell. Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops". Proc Natl Acad Sci U S A, 87(21):8467–71, Nov 1990.
5. J Wolters. The nature of preferred hairpin structures in 16S-like rRNA variable regions. Nucleic Acids Res, 20(8):1843–50, Apr 1992.
6. F M Jucker and A Pardi. Solution structure of the CUUG hairpin loop: a novel RNA tetraloop motif. Biochemistry, 34(44):14416–27, Nov 1995.
7. V P Antao, S Y Lai, and I Tinoco. A thermodynamic study of unusually stable RNA and DNA hairpins. Nucleic Acids Res, 19(21):5901–5, Nov 1991.
8. L Jaeger, F Michel, and E Westhof. Involvement of a GNRA tetraloop in long-range RNA tertiary interactions. J Mol Biol, 236(5):1271–6, Mar 1994.
9. D L Abramovitz and A M Pyle. Remarkable morphological variability of a common RNA folding motif: the GNRA tetraloop-receptor interaction. J Mol Biol, 266(3):493–506, Feb 1997.
10. N Leulliot, V Baumruk, M Abdelkafi, P Y Turpin, A Namane, C Gouyette, T Huynh-Dinh, and M Ghomi. Unusual nucleotide conformations in GNRA and UNCG type tetraloop hairpins: evidence from Raman markers assignments. Nucleic Acids Res, 27(5):1398–404, Mar 1999.
11. F M Jucker, H A Heus, P F Yip, E H Moors, and A Pardi. A network of heterogeneous hydrogen bonds in GNRA tetraloops. J Mol Biol, 264(5):968–80, Dec 1996.
12. M Sarver, C L Zirbel, J Stombaugh, A Mokdad, and N B Leontis. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J Math Biol, 56(1-2):215–52, Jan 2008.
13. C L Zirbel, J E Sponer, J Sponer, J Stombaugh, and N B Leontis. Classification and energetics of the base-phosphate interactions in RNA. Nucleic Acids Res, 37(15):4898–918, Aug 2009.
14. Q Zhao, H C Huang, U Nagaswamy, Y Xia, X Gao, and G E Fox. UNAC tetraloops: to what extent do they mimic GNRA tetraloops? Biopolymers, 97(8):617–28, Aug 2012.
15. J J Cannone, S Subramanian, M N Schnare, J R Collett, L M D'Souza, Y Du, B Feng, N Lin, L V Madabusi, K M Müller, N Pande, Z Shang, N Yu, and R R Gutell. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3:2, 2002.
16. M Molinaro and I Tinoco. Use of ultra stable UNCG tetraloop hairpins to fold RNA structures: thermodynamic and spectroscopic applications. Nucleic Acids Res, 23(15):3056–63, Aug 1995.
17. C Zhong and S Zhang. Clustering RNA structural motifs in ribosomal RNAs using secondary structural alignment. Nucleic Acids Res, 40(3):1307–17, Feb 2012.
18. A Lescoute, N B Leontis, C Massire, and E Westhof. Recurrent structural RNA motifs, isostericity matrices and sequence alignments. Nucleic Acids Res, 33(8):2395–409, 2005.
19. C Zhong, H Tang, and S Zhang. RNAMotifScan: automatic identification of RNA structural motifs using secondary structural alignment. Nucleic Acids Res, 38(18):e176, Oct 2010.
20. J A Cruz and E Westhof. Sequence-based identification of 3D structural modules in RNA with RMDetect. Nat Methods, 8(6):513–21, Jun 2011.
21. U Nagaswamy and G E Fox. Frequent occurrence of the T-loop RNA folding motif in ribosomal RNAs. RNA, 8(9):1112–9, Sep 2002.
22. A S Krasilnikov and A Mondragón. On the occurrence of the T-loop RNA folding motif in large RNA molecules. RNA, 9(6):640–3, Jun 2003.
23. Z Zhuang, L Jaeger, and J E Shea. Probing the structural hierarchy and energy landscape of an RNA T-loop hairpin. Nucleic Acids Res, 35(20):6995–7002, 2007.
24. R R Gutell, J J Cannone, D Konings, and D Gautheret. Predicting U-turns in ribosomal RNA with comparative sequence analysis. J Mol Biol, 300(4):791–803, Jul 2000.
25. G J Quigley and A Rich. Structural domains of transfer RNA molecules. Science, 194(4267):796–806, Nov 1976.
26. K T Schroeder, S A McPhee, J Ouellet, and D M Lilley. A structural database for k-turn motifs in RNA. RNA, 16(8):1463–8, Aug 2010.
27. S Blouin, R Chinnappan, and D A Lafontaine. Folding of the lysine riboswitch: importance of peripheral elements for transcriptional regulation. Nucleic Acids Res, 39(8):3373–87, Apr 2011.
28. M Meyer, E Westhof, and B Masquida. A structural module in RNase P expands the variety of RNA kinks. RNA Biol, 9(3), Mar 2012.
29. N B Leontis and E Westhof. A common motif organizes the structure of multi-helix loops in 16 S and 23 S ribosomal RNAs. J Mol Biol, 283(3):571–83, Oct 1998.
30. C M Duarte, L M Wadley, and A M Pyle. RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res, 31(16):4755–61, Aug 2003.
31. D Gautheret, D Konings, and R R Gutell. A major family of motifs involving G.A mismatches in ribosomal RNA. J Mol Biol, 242(1):1–8, Sep 1994.
32. J C Lee, R R Gutell, and R Russell. The UAA/GAN internal loop motif: a new RNA structural element that forms a cross-strand AAA stack and long-range tertiary interactions. J Mol Biol, 360(5):978–88, Jul 2006.
33. J Lehmann, F Jossinet, and D Gautheret. A universal RNA structural motif docking the elbow of tRNA in the ribosome, RNase P and T-box leaders. Nucleic Acids Res, 41(10):5494–502, May 2013.
34. W W Grabow, Z Zhuang, Z N Swank, J E Shea, and L Jaeger. The right angle (RA) motif: a prevalent ribosomal RNA structural pattern found in group I introns. J Mol Biol, 424(1-2):54–67, Nov 2012.
35. P P Gardner, L Barquist, A Bateman, E P Nawrocki, and Z Weinberg. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic Acids Res, 39(14):5845–52, 2011.
36. K Mazan-Mamczarz, Y Kuwano, M Zhan, E J White, J L Martindale, A Lal, and M Gorospe. Identification of a signature motif in target mRNAs of RNA-binding protein AUF1. Nucleic Acids Res, 37(1):204–14, Jan 2009.
37. E Sonnleitner, L Abdou, and D Haas. Small RNA as global regulator of carbon catabolite repression in Pseudomonas aeruginosa. Proc Natl Acad Sci U S A, 106(51):21866–71, Dec 2009.
38. M J Filiatrault, P V Stodghill, J Wilson, B G Butcher, H Chen, C R Myers, and S W Cartinhour. CrcZ and CrcX regulate carbon source utilization in Pseudomonas syringae pathovar tomato strain DC3000. RNA Biol, 10(2):245–55, Feb 2013.
39. R Moreno, P Fonseca, and F Rojo. Two small RNAs, CrcY and CrcZ, act in concert to sequester the Crc global regulator in Pseudomonas putida, modulating catabolite repression. Mol Microbiol, 83(1):24–40, Jan 2012.
40. M Y Liu, G Gui, B Wei, J F Preston, L Oakford, U Yüksel, D P Giedroc, and T Romeo. The RNA molecule CsrB binds to the global regulatory protein CsrA and antagonizes its activity in Escherichia coli. J Biol Chem, 272(28):17502–10, Jul 1997.
41. Y Cui, A Chatterjee, Y Liu, C K Dumenyo, and A K Chatterjee. Identification of a global repressor gene, rsmA, of Erwinia carotovora subsp. carotovora that controls extracellular enzymes, N-(3-oxohexanoyl)-L-homoserine lactone, and pathogenicity in soft-rotting Erwinia spp. J Bacteriol, 177(17):5108–15, Sep 1995.
42. T Weilbacher, K Suzuki, A K Dubey, X Wang, S Gudapaty, I Morozov, C S Baker, D Georgellis, P Babitzke, and T Romeo. A novel sRNA component of the carbon storage regulatory system of Escherichia coli. Mol Microbiol, 48(3):657–70, May 2003.
43. L Argaman, R Hershberg, J Vogel, G Bejerano, E G Wagner, H Margalit, and S Altuvia. Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Curr Biol, 11(12):941–50, Jun 2001.
44. S Aarons, A Abbas, C Adams, A Fenton, and F O'Gara. A regulatory RNA (PrrB RNA) modulates expression of secondary metabolite genes in Pseudomonas fluorescens F113. J Bacteriol, 182(14):3913–9, Jul 2000.
45. C Valverde, S Heeb, C Keel, and D Haas. RsmY, a small regulatory RNA, is required in concert with RsmZ for GacA-dependent expression of biocontrol traits in Pseudomonas fluorescens CHA0. Mol Microbiol, 50(4):1361–79, Nov 2003.
46. S Moll, D J Schneider, P Stodghill, C R Myers, S W Cartinhour, and M J Filiatrault. Construction of an rsmX co-variance model and identification of five rsmX non-coding RNAs in Pseudomonas syringae pv. tomato DC3000. RNA Biol, 7(5):508–16, 2010.
47. P P Gardner, J Daub, J Tate, B L Moore, I H Osuch, S Griffiths-Jones, R D Finn, E P Nawrocki, D L Kolbe, S R Eddy, and A Bateman. Rfam: Wikipedia, clans and the decimal release. Nucleic Acids Res, 39(Database issue):D141–5, Jan 2011.
48. I López de Silanes, M Zhan, A Lal, X Yang, and M Gorospe. Identification of a target RNA motif for RNA-binding protein HuR. Proc Natl Acad Sci U S A, 101(9):2987–92, Mar 2004.
49. E Dassi, P Zuccotti, S Leo, A Provenzani, M Assfalg, M D'Onofrio, P Riva, and A Quattrone. Hyper conserved elements in vertebrate mRNA 3'-UTRs reveal a translational network of RNA-binding proteins controlled by HuR. Nucleic Acids Res, 41(5):3201–16, Mar 2013.
50. K Leppek, J Schott, S Reitter, F Poetz, M C Hammond, and G Stoecklin. Roquin promotes constitutive mRNA decay via a conserved class of stem-loop recognition motifs. Cell, 153(4):869–81, May 2013.
51. T Aviv, A N Amborski, X S Zhao, J J Kwan, P E Johnson, F Sicheri, and L W Donaldson. The NMR and X-ray structures of the Saccharomyces cerevisiae Vts1 SAM domain define a surface for the recognition of RNA hairpins. J Mol Biol, 356(2):274–9, Feb 2006.
52. T Aviv, Z Lin, G Ben-Ari, C A Smibert, and F Sicheri. Sequence-specific recognition of RNA hairpins by the SAM domain of Vts1p. Nat Struct Mol Biol, 13(2):168–76, Feb 2006.
53. D Ray, H Kazan, E T Chan, L Peña Castillo, S Chaudhry, S Talukder, B J Blencowe, Q Morris, and T R Hughes. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat Biotechnol, 27(7):667–70, Jul 2009.
54. F C Oberstrass, A Lee, R Stefl, M Janis, G Chanfreau, and F H Allain. Shape-specific recognition in the structure of the Vts1p SAM domain with RNA. Nat Struct Mol Biol, 13(2):160–7, Feb 2006.
55. J L McKenzie, J Robson, M Berney, T C Smith, A Ruthe, P P Gardner, V L Arcus, and G M Cook. A VapBC toxin-antitoxin module is a posttranscriptional regulator of metabolic flux in mycobacteria. J Bacteriol, 194(9):2189–204, May 2012.
56. M Regalia, M A Rosenblad, and T Samuelsson. Prediction of signal recognition particle RNA genes. Nucleic Acids Res, 30(15):3368–77, Aug 2002.
57. M A Rosenblad, J Gorodkin, B Knudsen, C Zwieb, and T Samuelsson. SRPDB: Signal recognition particle database. Nucleic Acids Res, 31(1):363–4, Jan 2003.
58. M A Rosenblad, N Larsen, T Samuelsson, and C Zwieb. Kinship in the SRP RNA family. RNA Biol, 6(5):508–16, 2009.
59. M Seetharaman, N V Eldho, R A Padgett, and K T Dayie. Structure of a self-splicing group II intron catalytic effector domain 5: parallels with spliceosomal U6 RNA. RNA, 12(2):235–47, Feb 2006.
60. S Valadkhan. Role of the snRNAs in spliceosomal active site. RNA Biol, 7(3):345–53, 2010.
61. J Shine and L Dalgarno. Determinant of cistron specificity in bacterial ribosomes. Nature, 254(5495):34–8, Mar 1975.
62. P W Rose, C Bi, W F Bluhm, C H Christie, D Dimitropoulos, S Dutta, R K Green, D S Goodsell, A Prlic, M Quesada, G B Quinn, A G Ramos, J D Westbrook, J Young, C Zardecki, H M Berman, and P E Bourne. The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res, 41(Database issue):D475–82, Jan 2013.
63. C Workman and A Krogh. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res, 27(24):4816–22, Dec 1999.
64. J W Brown, J M Nolan, E S Haas, M A Rubio, F Major, and N R Pace. Comparative analysis of ribonuclease P RNA using gene sequences from natural microbial populations reveals tertiary structural elements. Proc Natl Acad Sci U S A, 93(7):3001–6, Apr 1996.
65. D A Pomeranz Krummel and S Altman. Verification of phylogenetic predictions in vivo and the importance of the tetraloop motif in a catalytic RNA. Proc Natl Acad Sci U S A, 96(20):11200–5, Sep 1999.
66. S Nottrott, K Hartmuth, P Fabrizio, H Urlaub, I Vidovic, R Ficner, and R Lührmann. Functional interaction of a novel 15.5kD [U4/U6.U5] tri-snRNP protein with the 5' stem-loop of U4 snRNA. EMBO J, 18(21):6119–33, Nov 1999.
67. E Ennifar, A Nikulin, S Tishchenko, A Serganov, N Nevskaya, M Garber, B Ehresmann, C Ehresmann, S Nikonov, and P Dumas. The crystal structure of UUCG tetraloop. J Mol Biol, 304(1):35–42, Nov 2000.
68. S Barends, K Björk, A P Gultyaev, M H de Smit, C W Pleij, and B Kraal. Functional evidence for D- and T-loop interactions in tmRNA. FEBS Lett, 514(1):78–83, Mar 2002.
69. A S Krasilnikov, Y Xiao, T Pan, and A Mondragón. Basis for structural diversity in homologous RNAs. Science, 306(5693):104–7, Oct 2004.
70. S Nonin-Lecomte, B Felden, and F Dardel. NMR structure of the Aquifex aeolicus tmRNA pseudoknot PK1: new insights into the recoding event of the ribosomal trans-translation. Nucleic Acids Res, 34(6):1847–53, 2006.
71. Z Weinberg, J X Wang, J Bogue, J Yang, K Corbino, R H Moy, and R R Breaker. Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biol, 11(3):R31, 2010.
72. E R Lee, J L Baker, Z Weinberg, N Sudarsan, and R R Breaker. An allosteric self-splicing ribozyme triggered by a bacterial second messenger. Science, 329(5993):845–8, Aug 2010.
73. P Anandam, E Torarinsson, and W L Ruzzo. Multiperm: shuffling multiple sequence alignments while approximately preserving dinucleotide frequencies. Bioinformatics, 25(5):668–9, Mar 2009.
74. T Gesell and S Washietl. Dinucleotide controlled null models for comparative RNA gene prediction. BMC Bioinformatics, 9:248, 2008.
75. E P Nawrocki and S R Eddy. INFERNAL Users Guide. HHMI Janelia, Janelia Farm Research Campus, Ashburn, USA, version 1.1 edition, 2013.
76. S Washietl, I L Hofacker, and P F Stadler. Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A, 102(7):2454–9, Feb 2005.
77. M Gerstein, E L Sonnhammer, and C Chothia. Volume changes in protein evolution. J Mol Biol, 236(4):1067–78, Mar 1994.
78. T Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
79. O Duss, E Michel, M Yulikov, M Schubert, G Jeschke, and F H Allain. Structural basis of the non-coding RNA RsmZ acting as a protein sponge. Nature, 509(7502):588–92, May 2014.
80. M Schubert, K Lapouge, O Duss, F C Oberstrass, I Jelesarov, D Haas, and F H Allain. Molecular basis of messenger RNA recognition by the specific bacterial repressing clamp RsmA/CsrA. Nat Struct Mol Biol, 14(9):807–13, Sep 2007.
81. J K Harris, E S Haas, D Williams, D N Frank, and J W Brown. New insight into RNase P RNA structure from comparative analysis of the archaeal RNA. RNA, 7(2):220–32, Feb 2001.
82. M Naville and D Gautheret. Transcription attenuation in bacteria: theme and variations. Brief Funct Genomic Proteomic, 8(6):482–92, Nov 2009.
