Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them

chapter 7 Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them karl m. kjer Rutgers University usman roshan New Jersey Institute of Technology joseph j. gillespie University of Maryland, Baltimore County; Virginia Bioinformatics Institute, Virginia Tech Identifi cation of Goals. .106 Alignment and Its Relation to Data Exclusion. .108 Differentiation of Molecules . .110 rRNA Sequences Evolve under Structural Constraints . .111 Challenges to Existing Programs . .114 Compositional Bias Presents a Severe Challenge . .114 Gaps Are Not Uniformly Distributed . .116 Nonindependence of Indels. .121 Long Inserts/Deletions . .122 Lack of Recognition of Covarying Sites (A Well-Known, Seldom-Adopted Strategy) . .123 Are Structural Inferences Justifi ed? . .126 Why Align Manually? . .127 Perceived Advantages of Algorithms . .127 105 RRosenberg08_C07.inddosenberg08_C07.indd 105105 99/30/08/30/08 55:09:18:09:18 PMPM 106 Structural Considerations for RNA MSA An Example of Accuracy and Repeatability . .129 Comparison to Protein Alignment—Programs and Benchmarks . .136 Conclusion . .137 Terminology . .139 Appendix: Instructions on Performing a Structural Alignment . .141 identification of goals What Is It You are Trying to Accomplish with an Alignment? Some of the disagreement over alignment approaches comes from differences in objectives among investigators. Are the data merely meant to distinguish target DNA from contaminants in a BLAST search? Or is there a specifi c node on a cladogram you wish to test? Are you aligning genomes or genes? Are the data protein-coding, structural RNAs or noncoding sequences? Do you consider phylogenetics to be a process of inference or estimation? Would you rather be more consistent or more accurate? Are you studying the performance of your selected programs or the relationships among your taxa? Different answers to each of these questions could likely lead to legitimate alternate alignment approaches. Morrison (2006) reviewed the many uses of alignment programs, and distinguished phylogenetic alignments as a special subset that requires attention to biological processes. Hypša (2006) reached a similar conclusion and emphasized the importance of adding complexity to multiple sequence alignments and phylogeny estimation. This chapter is devoted to discussing the alignment of structural RNAs, or ribosomal RNA (rRNA) and transfer RNA (tRNA) sequences, for phylogenetic analysis (although our thesis applies to other smaller RNAs, such as tmRNAs, RNase Ps, and group I and group II introns). We seek to have our phylogenetic hypotheses be predictive and accurate, even if accuracy is diffi cult (or impossible) to demonstrate. By accurate we mean that a hypothesis coincides with the true history of branching events. If You Knew You could Improve Your Alignment, Would You Do It? Figure 7.1 shows an example of a fragment of a computer-generated alignment of the 12S rRNA from a variety of primates with that of murine rodent outgroups. It follows an optimality criterion based on minimizing costs from a Needleman–Wunsch (1970) algorithm and a guide tree. We hope that this example, almost like the fi rst couplet of a dichot- omous key, will indicate where you stand on the issue of adjustment. RRosenberg08_C07.inddosenberg08_C07.indd 106106 99/30/08/30/08 55:09:19:09:19 PMPM Structural Considerations for RNA MSA 107 Mouse GCTACATTTTCTTA--TAAAAGAACAT-TACTATACCCTTTATGA Rat GCTACATTTTCTTTTCCCAGAGAACAT-TACGAAACCCTTTATGA Gibbon GCTACATTTTCTA--TGCC-AGAAAAC-CACGATAACCCTCATGA Baboon GCTACATTTTCTA--CTTCAGAAAACCCCACGATAGCTCTTATGA Orangutan GCTACATTTTCTA---CTTCAGAAAAC-TACGATAGCCCTCATGA Human GCTACATTTTCTA---CCCCAGAAAAC-TACGATAGCCCTTATGA Bonobo GCTACATTTTCTA--CCCC-AGAAAAT-TACGATAACCCTTATGA Chimp GCTACATTTTCTA--CCCC-AGAAAAT-TACGATAACCCTTATGA Figure 7.1. A fragment of an alignment of complete 12S rRNA, generated by ClustalX (Jeanmougin et al. 1998; Thompson et al. 1997). Notice the “CTTCAGAAAAC” in the middle of the fi gure for both the baboon and the orangutan. If these were your data, would you adjust the sequences to correct the errors made by the program, or would you leave it alone and let the program make all of the decisions about homology, even when it appears to have erred? Would you adjust the nearly identical sequences between the human and the chimps? Would you make decisions about which nucleotides to exclude? Would it bother you if the baboon grouped inside the rodents? There are no correct answers, and the choices you make have implications that relate to what you wish to discover from your data, and tell a lot about both your background and your objectives. If you adjust the alignment, even in this one instance, you have converted to a manual alignment, with all its strengths and limitations. Adjusting the alignment is an attempt to improve the accuracy of an alignment. Here we defi ne “accurate” as representing true (unknowable) homology, and also propose that accurate homology estimations will probably improve the accuracy of the phylogenetic hypothesis. But how do you know what “accurate” is, and where do you draw the line? Is manual alignment an art form subject to the whims and biases of the aligner, or can we identify a repeatable meth- odology? Similarly, if we criticize manual alignments as subjective and inconsistent, might these same criticisms apply to computer-generated alignments? If accuracy is a concern, where do current algorithms fail? Many workers could legitimately state that we cannot objectively defi ne errors made by the computer, and, in fact, the whole concept would be counter to an optimality-based study. If you are looking for the shortest tree, you should favor an alignment that reduces the number of steps. Others might assume that a few errors, even if they RRosenberg08_C07.inddosenberg08_C07.indd 107107 99/30/08/30/08 55:09:19:09:19 PMPM 108 Structural Considerations for RNA MSA Mouse GCTACATT(TTCT TATA--AA AGAA)CAT--TACTATACCCTTTATGA Rat GCTACATT(TTCT TTTCCCAG AGAA)CAT--TACGAAACCCTTTATGA Gibbon GCTACATT(TTCT -ATGCC-- AGAA)AAC--CACGATAACCCTCATGA Baboon GCTACATT(TTCT -ACTTC-- AGAA)AACCCCACGATAGCTCTTATGA Orangutan GCTACATT(TTCT -ACTTC-- AGAA)AAC--TACGATAGCCCTCATGA Human GCTACATT(TTCT -ACCCC-- AGAA)AAC--TACGATAGCCCTTATGA Bonobo GCTACATT(TTCT -ACCCC-- AGAA)AAT--TACGATAACCCTTATGA Chimp GCTACATT(TTCT -ACCCC-- AGAA)AAT--TACGATAACCCTTATGA Figure 7.2. A structurally adjusted alignment of the same data as shown in Figure 7.1. Parentheses indicate the bounds of a hairpin-stem loop, with hydrogen-bonded nucleotides indicated with underlines (Kjer et al. 1994). An unaligned region (the “loop” portion of the hairpin-stem) is delimited with spaces. could be defi ned, would be better left alone, because the mass of the data should counterbalance a few random errors. But what if the errors are not random, creating a potential for linking together unrelated groups that share the same systematic biases? What if there were some higher order of conservation that we could examine in making decisions about homology that does not necessarily result in shorter trees? Figure 7.2 shows the same region of rRNA as in Figure 7.1, but has been adjusted to minimize secondary structural changes predicted for this region of the molecule. Minimizing structural change is also an optimality criterion for homology assessment that we support and will explore in this chapter. Structural homology is based on position and connection, and assumes that the same structure existed in a common ancestor. Strict adherence to nucleotide homology may require that the same nucleotide state exists in a common ancestor as in its descendants, and, by this defi nition, structural homology and nucleotide homology may support different alignments. alignment and its relation to data exclusion One of the things that are not clear from the above comparisons (Figures 7.1 and 7.2) is what we should do with the nucleotides in the “loop” portion of the hairpin-stem loop, between the “TTCT” and the “AGAA.” This is an extremely important issue, but somewhat out- side the debate about alignment. Frequently, these regions are excluded from the analysis on the grounds that they are too variable to align. Some systematists fi nd any form of data exclusion to be unacceptable, RRosenberg08_C07.inddosenberg08_C07.indd 108108 99/30/08/30/08 55:09:19:09:19 PMPM Structural Considerations for RNA MSA 109 A B C REC RAA REC' Mouse ????? [TATA--AA] 1 [T-] 1 [ATAA-] 1 [-A] 1 Rat ????? [TTTCCCAG] 2 [TT] 2 [TCCC-] 2 [AG] 2 Gibbon ATGCC [-ATGCC--] 3 [--] 3 [ATGCC] 3 [--] 3 Baboon ACTTC [-ACTTC--] 4 [--] 3 [ACTTC] 4 [--] 3 Orangutan ACTTC [-ACTTC--] 4 [--] 3 [ACTTC] 4 [--] 3 Human ACCCC [-ACCCC--] 5 [--] 3 [ACCCC] 5 [--] 3 Bonobo ACCCC [-ACCCC--] 5 [--] 3 [ACCCC] 5 [--] 3 Chimp ACCCC [-ACCCC--] 5 [--] 3 [ACCCC] 5 [--] 3 Figure 7.3. Three suggestions for dealing with the unaligned loop from Figure 7.2. (A) If the in-group is alignable, but the out-group cannot be aligned to the in-group, consider the out-group data as missing. (B) Eliminate the entire region, but recode it as multistate characters. Taxa sharing identical sequences are coded with the same state. (C) Gillespie’s (2004) method of fi nding the unpaired middle, to break up the region into the ambiguously aligned loop, and fl anking regions of slippage. and those are typically the same researchers who would not adjust the alignment given in Figure 7.1. It is possible that these regions con- tain some potentially informative characters, yet in our observations the ability to objectively retrieve signal from these regions quickly becomes confounded with increased sequence divergence across an alignment. There is a wide variety of treatments for these data. One option would be to transform the out-group states to “missing data” (Kjer et al. 2001), with the idea that if you do not have any reasonable confi dence of homology between the rodents and the primates, the data really are “missing,” and since you are interested in the relationships among the in-group taxa, and in-group motifs are of uniform length, they could be treated as in Figure 7.3A.

Structural and Evolutionary Considerations for Multiple Sequence Alignment of RNA, and the Challenges for Algorithms That Ignore Them

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support