Downloaded from rnajournal.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press
ES Wright 1
1 TITLE
2 RNAconTest: Comparing tools for non-coding RNA multiple sequence alignment based on
3 structural consistency
4 Running title: RNAconTest: benchmarking comparative RNA programs
5 Author: Erik S. Wright1,*
6 1 Department of Biomedical Informatics, University of Pittsburgh (Pittsburgh, PA)
7 * Corresponding author: Erik S. Wright ([email protected])
8 Keywords: Multiple sequence alignment, Secondary structure prediction, Benchmark, non-
9 coding RNA, Consensus secondary structure
11 ABSTRACT
12 The importance of non-coding RNA sequences has become increasingly clear over the past
13 decade. New RNA families are often detected and analyzed using comparative methods based on
14 multiple sequence alignments. Accordingly, a number of programs have been developed for
15 aligning and deriving secondary structures from sets of RNA sequences. Yet, the best tools for
16 these tasks remain unclear because existing benchmarks contain too few sequences belonging to
17 only a small number of RNA families. RNAconTest (RNA consistency test) is a new
18 benchmarking approach relying on the observation that secondary structure is often conserved
19 across highly divergent RNA sequences from the same family. RNAconTest scores multiple
20 sequence alignments based on the level of consistency among known secondary structures
21 belonging to reference sequences in their output alignment. Similarly, consensus secondary
22 structure predictions are scored according to their agreement with one or more known structures
23 in a family. Comparing the performance of 10 popular alignment programs using RNAconTest
24 revealed that DAFS, DECIPHER, LocARNA, and MAFFT created the most structurally
25 consistent alignments. The best consensus secondary structure predictions were generated by
26 DAFS and LocARNA (via RNAalifold). Many of the methods specific to non-coding RNAs
27 exhibited poor scalability as the number or length of input sequences increased, and several
28 programs displayed substantial declines in score as more sequences were aligned. Overall,
29 RNAconTest provides a means of testing and improving tools for comparative RNA analysis, as
30 well as highlighting the best available approaches. RNAconTest is available from the
31 DECIPHER website
33 INTRODUCTION
34 Multiple sequence alignment forms the basis of many comparative analyses of non-
35 coding RNA sequences such as the prediction of secondary structure. The alignment of RNA
36 sequences remains an unsolved challenge in computational biology (Fallmann et al. 2017). While
37 protein sequences can be accurately aligned down to 30% similarity, RNA sequences become
38 difficult to align below 60% similarity. This discrepancy can be attributed to the much smaller
39 alphabet of RNA sequences (i.e., four nucleotides versus 20 amino acids) and the fact that many
40 different primary sequences can fold into similar secondary structures through complementary
41 base pairing (Bussotti et al. 2013). For this reason, most RNA-specific tools take advantage of
42 structural conservation to improve alignment performance. Despite the wide variety of tools that
43 have been published, it remains unclear which tools are the best for RNA multiple sequence
44 alignment. However, it is generally believed that programs performing simultaneous folding and
45 alignment outperform programs that do not consider structure during alignment.
46 Several benchmarks for RNA multiple sequence alignment have been published. The
47 popular BRaliBase II (Gardner et al. 2005) and 2.1 (Wilm et al. 2006) benchmarks were
48 constructed from Rfam (Kalvari et al. 2018) seed alignments, which are often derived from
49 alignments created by automated programs. Since the ground truth is unknown in these cases, it
50 is foreseeable that accuracy is partly based on the ability to match other programs' outputs.
51 Furthermore, the BRaliBase benchmarks contain only 2 to 15 sequences per set and many test
52 sets were generated by repeated resampling of the same few RNA families. The paucity of
53 families with low levels of sequence similarity resulted in the infamous "BRaliBase dent", where
54 programs appear to perform worse at an intermediate similarity range than they do for highly
55 dissimilar sequences (Löwes et al. 2017). More recently, a collection of manually curated
56 multiple alignments was generated for RNA families with a known structure (Widmann et al.
57 2012). Although structures can be used to adjust alignments by hand, there is no guarantee that
58 the resulting alignment is optimal (Morrison 2009). Thus, existing benchmarks for RNA multiple
59 sequence alignment leave room for improvement, especially as the number of RNA families and
60 solved 3D structures has continued to increase.
61 In contrast to RNA, there is a much wider variety of benchmarks for protein multiple
62 sequence alignment. Many benchmarks rely on the superposition of multiple 3D crystal
63 structures to identify co-located residues. Structural superposition is more challenging for RNA
64 sequences because RNAs have modular 3D structures that behave as a collection of semi-
65 independent rigid bodies (Rahrig et al. 2013; Čech et al. 2015; Piatkowski et al. 2017). Even
66 though many more benchmarks are available for proteins than RNAs, there remains little
67 consensus as to which benchmarking approach is optimal (Aniba et al. 2010; Edgar 2010;
68 Iantorno et al. 2014). Recently, a new protein benchmark was developed to overcome some
69 flaws in previous approaches by indirectly comparing protein multiple sequence alignments
70 based on their utility for predicting correct secondary structure (Fox et al. 2015). This approach
71 was shown to strongly correlate with the results of 3D superposition based protein benchmarks
72 (Sievers and Higgins 2019) but has yet to be applied for benchmarking RNA alignment
73 programs.
74 The goal of this study was to develop a new benchmark, RNAconTest (RNA consistency
75 test), for assessing RNA multiple sequence alignments based on their degree of secondary
76 structure consistency. Ten different alignment programs were compared using 29 Rfam families
77 with two or more solved structures. This approach is based on the assumption that better
78 secondary structure agreement results from a better alignment, that is, the alignment adequately
79 represents a biologically conserved structure. The same concept was applied to assess consensus
80 secondary structure predictions according to their agreement with empirically determined
81 structures, although many alternative benchmarks already exist for this purpose (Puton et al.
82 2013; Miao et al. 2017). Collectively, the results reveal differences among programs for RNA
83 alignment and could assist with improving future programs.
84 RESULTS
85 Scores on the RNAconTest benchmark are based on the consistency of known secondary
86 structures across multiple sequences in an alignment or when compared to a predicted consensus
87 structure (Fig. 1). Alignment and structure scores were defined using secondary structures in a
88 manner that is analogous to weighted sum of pairs scores that are commonly used to compare
89 sequence alignments (see Materials and Methods). Each set of reference sequences was derived
90 from Rfam families with one or more solved 3D structures (Table 1). Only the unique sequences
91 in each reference set were kept because many solved structures are redundant. Since reference
92 sets contained a small number of unique sequences (1 to 93) with empirically supported
93 structures, supplemental RNA sequences (N = 10 to 5120) were added from each Rfam family's
94 full alignment. This resulted in a set of 29 different Rfam families with two or more solved
95 structures, and 22 additional Rfam families with a single solved structure (Table 1). At least two
96 reference structures are required to calculate an alignment score, whereas only one structure is
97 needed to compare with a consensus secondary structure prediction (Fig. 1).
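The idea of scoring an alignment by the agreement of known structures can be illustrated with a minimal sketch. This is not the RNAconTest scoring function, which is defined in Materials and Methods; it is a simplified, unweighted stand-in that assumes dot-bracket reference structures, treats a residue aligned to a gap as a disagreement, and ignores whether pairing partners are themselves aligned. The function names are hypothetical.

```python
from itertools import combinations

def column_states(aligned_seq, structure):
    """Label each alignment column as 'P' (paired), 'U' (unpaired),
    '-' (gap), or None (masked with '+', excluded from scoring)."""
    states = []
    pos = 0  # index into the ungapped structure string
    for ch in aligned_seq:
        if ch == "-":
            states.append("-")
            continue
        symbol = structure[pos]
        pos += 1
        if symbol == "+":
            states.append(None)
        elif symbol == ".":
            states.append("U")
        else:
            states.append("P")  # any bracket symbol counts as paired
    return states

def pair_consistency(states_a, states_b):
    """Fraction of scorable columns where two references share a state."""
    agree = total = 0
    for a, b in zip(states_a, states_b):
        if a is None or b is None:
            continue  # masked position in either reference
        if a == "-" and b == "-":
            continue  # column is empty for this pair
        total += 1
        if a == b:  # both paired or both unpaired
            agree += 1
    return agree / total if total else 0.0

def alignment_consistency(aligned, structures):
    """Unweighted mean over all pairs of reference sequences (assumption)."""
    if len(aligned) < 2:
        raise ValueError("at least two reference structures are required")
    states = [column_states(s, t) for s, t in zip(aligned, structures)]
    scores = [pair_consistency(a, b) for a, b in combinations(states, 2)]
    return sum(scores) / len(scores)
```

In this toy version, two identical hairpins aligned without gaps score 1.0, and shifting one of them by a single column lowers the score because paired positions land opposite unpaired ones.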
98 Ten different alignment programs were tested for their ability to generate structurally
99 consistent alignments. The programs vary in the way in which they handle folding and
100 alignment. Two generic alignment programs, Clustal Omega (Sievers et al. 2011) and MUSCLE
101 (Edgar 2004), do not make use of secondary structure predictions. Three programs, DECIPHER
102 (Wright 2015), MAFFT (Katoh and Standley 2013), and R-Coffee (Wilm et al. 2008), are
103 extensions of programs originally designed for protein alignment but handle RNA sequences
104 differently by incorporating secondary structure predictions. The remaining five programs were
105 specifically designed for the alignment of non-coding RNAs: DAFS (Sato et al. 2012), MASTR
106 (Lindgreen et al. 2007), MXSCARNA (Tabei et al. 2008), LocARNA and SPARSE (Will et al.
107 2015). Seven out of ten programs provide a consensus secondary structure prediction with their
108 output (i.e., all except Clustal Omega, MUSCLE, and MAFFT).
109 Alignment consistency
110 As expected, all programs generated alignments following a monotonic pattern of lower
111 consistency scores with increasing distance (Fig. 2a). Unlike previous benchmarks, there was no
112 sudden transition to lower scores beyond a specific similarity threshold, suggesting that the
113 concept of an alignment 'twilight zone' may be less applicable to RNA sequences than proteins.
114 However, part of the steady decline in scores may be due to true biological
115 variability among secondary structures as the sequences evolve. For example, consistency scores
116 varied about 10% between different reference sets with similar average sequence identities (Fig.
117 3). Therefore, it is not feasible to determine an upper-bound on consistency scores, and
118 RNAconTest benchmark results should only be considered on a relative basis.
119 On average, LocARNA (0.758) was the highest scoring program (Table 2), followed by
120 MAFFT (0.745), DECIPHER (0.744), and DAFS (0.739). However, the difference among the
121 top scoring programs was not statistically significant (p ≥ 0.02), except in single cases where
122 LocARNA had higher scores than DAFS (N = 10) and MAFFT (N = 20). For larger sets of
123 sequences, the lack of statistical significance could be attributable to the relatively few
124 alignments completed by LocARNA within the 12 hour time limit (Table 2). The statistically
125 significant scores were used to rank programs (Fig. 4), revealing that the highest scoring
126 programs all held positions at the top of the hierarchy. MASTR (0.545) was consistently the
127 lowest scoring program placed at the bottom of the hierarchy. Notably, MUSCLE (0.722)
128 performed reasonably well given that it does not take into account secondary structure. SPARSE
129 (0.662), which is an extension of LocARNA, performed considerably worse than LocARNA.
130 Furthermore, SPARSE displayed considerable score variability within families, suggesting that
131 the heuristics it employs to speed up LocARNA are sometimes detrimental to accuracy. Overall,
132 all alignment programs performed better than expected by random chance, except on the
133 reference sets with lowest average similarity (Fig. 3).
134 It is known that alignment programs tend to decrease in accuracy as more sequences are
135 incorporated into the alignment (Sievers et al. 2013). Across the range of supplemental
136 sequences added (10 to 5120), Clustal Omega, MASTR, MUSCLE, and R-Coffee exhibited
137 steady fall-offs in score as more sequences were aligned (Fig. 2b). In contrast, DAFS,
138 DECIPHER, MAFFT, and MXSCARNA all displayed relatively constant or increasing scores.
139 Consistency of secondary structure predictions
140 The scores for consensus secondary structure predictions told a somewhat different story
141 than alignment scores. Structure scores tended to decrease with greater distance among the
142 aligned sequences. However, some programs displayed a local peak in structure scores around
143 50% sequence identity (Fig. 2c), suggesting that intermediate levels of identity balance the
144 benefits of covariation information against the loss of alignment quality. The highest structure
145 scores were achieved by DAFS (0.690) and LocARNA (0.680), with neither program being
146 statistically significantly better than the other (Table 2). Interestingly, LocARNA uses
147 RNAalifold for its structure predictions, yet R-Coffee (0.524) had considerably lower structure
148 scores despite also relying on RNAalifold. DECIPHER's predictions, which are purely based on
149 mutual information rather than free energy, improved substantially (> 10% on average) by
150 increasing the number of sequences in the alignment (Fig. 2d). In contrast, MASTR,
151 MXSCARNA, R-Coffee, and SPARSE all exhibited declines in structure scores on larger
152 alignments.
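To make the notion of covariation-based prediction concrete, the following sketch computes the textbook mutual information between two alignment columns. It is only an illustration of why more sequences can help: the estimate relies on observed co-occurrence counts, which become more reliable as the alignment grows. It is not DECIPHER's implementation, and the function name and gap handling (rows with a gap in either column are skipped) are assumptions.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j are equal-length strings holding one character per
    sequence. Rows with a gap in either column are ignored (assumption).
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    freq_ij = Counter(pairs)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two columns that covary perfectly across four equally frequent base pairs reach the maximum of 2 bits, whereas a perfectly conserved pair of columns yields 0 bits, which is why purely covariation-based methods gain little from nearly identical sequences.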
153 Empirical scalability
154 Scalability was split into two components: number of sequences being aligned (n) and
155 average length of sequences (l). In terms of elapsed time, Clustal Omega was the fastest aligner
156 tested, followed by DECIPHER. Table 2 shows that both programs scaled nearly linearly with
157 the number of input sequences (n). DECIPHER was the only program to exhibit sub-linear
158 scalability with increasing length of sequences, likely due to heuristics it employs to constrain
159 the dynamic programming matrix. Most other programs displayed quadratic or worse limiting
160 behavior in n and/or l. In particular, SPARSE was similar in scalability to LocARNA, implying
161 that its faster speed is not worth its lower accuracy. Clustal Omega (259) and DECIPHER (258)
162 completed the most alignments within the 12 hour time limit per alignment, while LocARNA
163 (140) and SPARSE (142) completed the fewest.
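One common way to characterize empirical scalability is to fit a line to elapsed time versus input size on a log-log scale, where the slope approximates the exponent of the scaling relationship. The sketch below shows that calculation with hypothetical timings; the source does not state that this exact procedure was used, and the function name is an assumption.

```python
import math

def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    A slope near 1 suggests linear scaling, near 2 quadratic scaling.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical timings for a program whose run time grows quadratically
# with the number of supplemental sequences (10 to 5120).
sizes = [10 * 2 ** k for k in range(10)]
times = [1e-3 * s ** 2 for s in sizes]
exponent = scaling_exponent(sizes, times)
```

With real measurements, alignments that exceed the time limit are missing from the fit, so the estimated exponent for slow programs would rest on fewer and smaller inputs.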
164 DISCUSSION
165 The RNAconTest benchmark provides an alternative to existing benchmarks for RNA
166 sequence alignments. The approach described here only tests for consistency among known
167 secondary structures since the perfect alignment is unknown. Such a benchmarking approach
168 would have been impractical a decade ago when far fewer empirical RNA structures were
169 available. The advantages of RNAconTest are that it does not rely on manual curation,
170 encompasses a wide variety of RNA families without substantial redundancy, and more
171 realistically reflects many user scenarios for RNA alignment (e.g., thousands of input
172 sequences). Nevertheless, it relies on the assumption that greater structural consistency
173 corresponds to greater underlying alignment accuracy. This assumption was shown to be
174 reasonable for protein sequences (Sievers and Higgins 2019), but would be invalidated if
175 secondary structure diverged substantially during RNA evolution.
176 It is known that RNA structures evolve in regions outside of their conserved core. For
177 example, some tRNAs can have a different number of arms than the traditional four arm
178 cloverleaf structure (Pons et al. 2019). Extensive structural variation also exists in RNase P
179 (Brown 1999), group I introns (Jackson et al. 2002), and telomerase (Podlevsky et al. 2008)
180 sequences. This raises the question of whether it is even feasible to align such divergent
181 structures. The assumption made by most alignment software is that the input sequences are, at a
182 minimum, homologous, if not orthologous (Morrison 2006). If true, it is reasonable to assume
183 that greater structural consistency would result from a better alignment. However, the need for a
184 large number of supplemental sequences likely introduces some false positive input sequences
185 that score above Rfam's gathering cutoff (Nawrocki et al. 2015) for the full alignment but do not
186 belong in the RNA family. Such sequences would violate the assumption of homology made by
187 most aligners, although they also likely reflect a realistic user scenario. As alignments grow in
188 size, it is possible that not all input sequences are true homologs and alignment software is still
189 expected to generate a high quality alignment of any homologous sequences.
190 A variety of different measures have been used to gauge the accuracy of secondary
191 structure agreement (Gardner and Giegerich 2004; Xu et al. 2012). Here, the decision was made
192 to focus on a single alignment score representing the agreement among paired and unpaired
193 positions in known structures (see Materials and Methods). The alignment score takes into
194 account the number of correctly/incorrectly aligned positions, as well as positions that were gapped
195 that should have been aligned. The structure score incorporates correctly/incorrectly predicted
196 base pairings, as well as missing base pairings. The advantage of these scoring schemes is that
197 they provide a unified measure of accuracy. Notably, this differs from a traditional benchmark in
198 that the maximum score is unknown, that is, scores can only be compared on a relative (not
199 absolute) basis. Notwithstanding this limitation, the ranking of programs on RNAconTest was
200 generally consistent with previous benchmarks showing that LocARNA is currently the best
201 available program for non-coding RNA alignment (Löwes et al. 2017).
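The three quantities that enter the structure score (correct, incorrect, and missing base pairings) can be counted as in the sketch below. It assumes that the predicted consensus structure has already been projected onto the coordinates of the reference sequence, that both structures are in dot-bracket notation with parentheses only (no pseudoknot symbols), and that "+" marks masked reference positions. How RNAconTest combines these counts into a single score is described in Materials and Methods and is not reproduced here; the function names are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by parentheses."""
    stack = []
    pairs = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            pairs.add((stack.pop(), i))
    return pairs

def compare_structures(predicted, reference):
    """Count correct, incorrect, and missing base pairs.

    Predicted pairs touching a masked ('+') reference position are
    dropped before counting (assumption).
    """
    masked = {i for i, ch in enumerate(reference) if ch == "+"}
    pred = {p for p in base_pairs(predicted)
            if p[0] not in masked and p[1] not in masked}
    ref = base_pairs(reference)
    return {
        "correct": len(pred & ref),
        "incorrect": len(pred - ref),
        "missing": len(ref - pred),
    }
```

For a reference hairpin "(((...)))" and a prediction "(((..).))", the two outer pairs are correct, the inner predicted pair is incorrect, and the inner reference pair is missing.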
202 Taken together, the results of this study suggest that there remains considerable room for
203 improvement in the alignment of non-coding RNA sequences. This was evidenced by the fact
204 that no program substantially outperformed all others across most Rfam families (Fig. 3). Users
205 of alignment programs have many viable options for reasonably accurate RNA alignments. For a
206 small number of input sequences (n < 1000), DAFS, LocARNA, and MAFFT all produced
207 similarly accurate alignments. DECIPHER was the most scalable program that maintained high
208 accuracy on alignments of a large number of sequences. DAFS and LocARNA (via RNAalifold)
209 outperformed most other secondary structure prediction tools. Hopefully the availability of a new
210 benchmark will reinvigorate developers' efforts to improve tools for analyzing non-coding
211 RNAs.
212 MATERIALS AND METHODS
213 Construction of the RNAconTest benchmark
214 The benchmark was constructed from the set of 3016 families in Rfam (v14.1). This set
215 was narrowed to 751 with at least 100 sequences available in their full alignment. Available 3D
216 structures were downloaded from the Protein Data Bank (wwPDB consortium 2019) and used to
217 determine an empirical secondary structure with DSSR (Lu and Olson 2003). Briefly, RNA
218 chains were extracted using PyMOL (v2.4.0a0) to prevent cross-chain bonds within multi-chain
219 structures and any resulting fragments were mapped to their primary sequence. Masking symbols
220 ("+") were used to denote chain breaks in the resulting reference secondary structures that were
221 excluded from scoring. The final sets were reduced to the set of unique sequences with structures
222 containing at least five paired bases and less than 10% imbalance between left and right pairings.
223 Imbalance sometimes resulted from structures where the Rfam family only matched a small
224 subsequence of the full-length chain (e.g., RF00029).
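The filtering criteria above can be expressed as a short check. The text does not define imbalance precisely, so this sketch assumes it is the absolute difference between the counts of left (opening) and right (closing) pairing symbols, divided by the total number of paired bases. The bracket alphabet and function name are also assumptions.

```python
LEFT_SYMBOLS = "([{<"   # opening symbols, including pseudoknot brackets
RIGHT_SYMBOLS = ")]}>"  # closing symbols

def passes_structure_filter(structure, min_paired=5, max_imbalance=0.10):
    """Keep structures with at least min_paired paired bases and
    imbalance below max_imbalance (definition assumed, see lead-in)."""
    left = sum(structure.count(s) for s in LEFT_SYMBOLS)
    right = sum(structure.count(s) for s in RIGHT_SYMBOLS)
    paired = left + right
    if paired < min_paired:
        return False
    imbalance = abs(left - right) / paired
    return imbalance < max_imbalance
```

A structure such as "((((((....)))" would be rejected under this definition because half of its left pairings have no partner within the matched region, which is the situation described for families that cover only part of a chain.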
225 A set of up to 10 different input alignments were tested for each Rfam family. These
226 input alignments were generated by supplementing the reference sequences with 10 to 5120
227 (10 × 2^k for k = 0 to 9)