DEVELOPING TOOLS FOR RNA STRUCTURAL ALIGNMENT

Ali G. Mokdad

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

May 2006

Committee:

Neocles B. Leontis, Advisor

Danny Myers, Graduate Faculty Representative

Scott O. Rogers

Craig Zirbel

Alexei Fedorov

© 2006

Ali Mokdad

All Rights Reserved

ABSTRACT

Neocles B. Leontis, Advisor

This work addresses current problems of RNA alignment and describes different tools for solving them. RNA molecules form basepairs that fold the molecule into its secondary and tertiary structures. These structures are more conserved than primary sequence because they directly affect the function of the molecule. Thus, sequence alignment of RNA molecules, unlike that of other biological molecules, must proceed by aligning homologous pairs of positions to each other. The state-of-the-art methods used today for aligning RNA are based on Stochastic Context Free Grammars (SCFG). These methods are able to characterize and thus align nested RNA basepairs, but are incapable of dealing with crossing basepair patterns. In addition, the current application of this algorithm ignores 3D structure and thus deals best with only one type of basepair. Although this type (the cis Watson-Crick/Watson-Crick or cWW type) is the most common in RNA, there are at least eleven other families of basepairs that account for about one third of RNA interactions. Each of these families has its own structural dimensions, and therefore its own patterns of accepted isosteric substitutions in sequence alignments. Here, an RNA alignment analysis and evaluation tool that takes into consideration 3D structure with all types of basepairs is described. This tool is used to structurally evaluate alignments and locate errors in them. A discussion and classification of tertiary interactions of the G/U wobble basepair is then presented. Novel conserved interactions are discovered, and their sequence signatures are used to further enhance sequence alignments. Finally, a better SCFG approach for automatic RNA alignment is suggested and tested. This approach takes into consideration the 3D structure of all families of basepairs. It is also coupled with another theory, Markov Random Field (MRF), to align areas where crossing basepairs occur.

To My Mother and Father

Thank You!


ACKNOWLEDGEMENTS

I would like to thank my advisor and committee members for their support and help at different stages of my studies. I would also like to thank Stan Smith, John Graham, and Arthur Brecher for their kindness, trust, and valuable advice throughout my training. Finally, this work would not have been possible without the constant support and encouragement of my wife.

TABLE OF CONTENTS

Page

CHAPTER I: BACKGROUND AND SIGNIFICANCE ...... 1

General Characteristics of RNA ...... 1

RNA Structure ...... 3

Basepairing, RNA Evolution, and Isostericity Matrices...... 7

RNA Sequence Alignment: A Statement about Evolution...... 14

Building Sequence Alignments...... 14

Problems with Sequence Alignment Representations ...... 18

Models: Manageable Approximations of Realities...... 20

Importance and Use of Good RNA Alignments...... 21

Dissertation Structure...... 22

CHAPTER II: RIBOSTRAL – AN RNA 3D STRUCTURAL ALIGNMENT ANALYZER,

EVALUATOR, AND VIEWER BASED ON BASEPAIR ISOSTERICITIES...... 25

Introduction…...... 25

Supported Platforms and Deployment Process...... 27

Executing Ribostral...... 28

Loading an Alignment File...... 29

Dividing Sequences into Subgroups ...... 31

Preparations for the Analysis of a List of Nucleotides ...... 34

Analyzing a List of Nucleotides ...... 38

Types of Output Files and their Interpretations ...... 39

Output Files for a Non-basepair List ...... 39

Output Files for a Basepair List...... 41

Ribostral Alignment Viewer...... 47

Additional Output Formats ...... 52

Interactive Analysis of Nucleotides...... 53

Interactive Analysis of a Family of Basepairs ...... 54

Interactive Analysis of Individual Basepairs or Motifs ...... 57

Ribostral Preferences ...... 60

Supporting Tools...... 62

Tool 1: Generate BP List from PDB...... 63

Tool 2: Align Sequences...... 64

Tool 3: Extract Parts of a FASTA File ...... 64

Tool 4: Merge & Remove Repeats from FASTA Files ...... 66

Tool 5: Create .fasta from .mat ...... 66

Help Menu…...... 67

Practical Uses of Ribostral...... 67

CHAPTER III: STRUCTURAL AND EVOLUTIONARY CLASSIFICATION OF G/U

WOBBLE BASEPAIRS IN THE RIBOSOME ...... 68

Introduction…...... 68

Materials and Methods...... 73

Results……… ...... 76

Types of Observed Tertiary Interactions and Their Sequence Patterns...... 85

P-interaction...... 89

Molecular Dynamics Analysis of the P-interaction...... 95

Other Shallow Groove Pocket Interactions...... 98

Discussion……...... 102

Conclusion…...... 109

CHAPTER IV: SEQUENCE PARSING AND AUTOMATIC RNA ALIGNMENT ...... 111

Algorithms for Aligning RNA Sequences ...... 111

SCFG Profiles of RNA Molecules...... 114

Alignment of an RNA Motif Using the Hybrid SCFG/MRF Method...... 117

Alignment Results and Discussions...... 121

SCFG/MRF Discrimination Power...... 125

Conclusion and Summary...... 127

REFERENCES……...... 128


LIST OF FIGURES

Figure Page

1 Identification of the parts of an RNA ...... 3

2 Secondary structure of the 5S rRNA from Haloarcula marismortui ...... 5

3 Tertiary structure of 5S rRNA from Haloarcula marismortui ...... 6

4 Cis versus trans orientations of glycosidic bonds...... 8

5 Basepair families and their isosteric subfamilies...... 12

6 Ribostral default installation subdirectories...... 28

7 Ribostral’s main GUI...... 29

8 File menu options...... 30

9 Browsing for an alignment file ...... 30

10 A snapshot of the “KnownFASTAFilenames.xls” file...... 32

11 GUI changes when an alignment file is successfully loaded...... 33

12 Browsing for an Excel nucleotide list...... 34

13 Nucleotide list for analysis...... 35

14 Basepair list for analysis ...... 35

15 The main GUI after successful loading of a nucleotide list file...... 38

16 Output files produced depending on input files analyzed...... 39

17 Percent output text file created upon sequence analysis of an NT list...... 40

18 Counts output HTML file created upon sequence analysis of a basepair list...... 42

19 Matching sequence values and names ...... 43

20 Score plots for the archaeal 5S rRNA alignment...... 47

21 Ribostral HTML Alignment Viewer...... 49

22 Schematic explanation of structural masks...... 51

23 Initial screenshot of the interactive analysis GUI...... 54

24 Analysis of all occurrences of a basepair family ...... 55

25 Getting sequence names with certain substitution patterns ...... 56

26 Basepair interactive analysis...... 58

27 Motif interactive analysis...... 59

28 Ribostral default preferences ...... 60

29 Substitution counts with expected values ...... 61

30 Ribostral tools...... 63

31 GUI for extracting parts of a FASTA file...... 64

32 Shallow groove pocket interactions formed by G/U basepairs...... 71

33 Stereo image of G/U superimposed on U/G ...... 88

34 GNRA substitutions of G/U’s making P- and phosphate-in-pocket interactions ...... 94

35 P-interaction between two helices ...... 95

36 Molecular Dynamics of the eight P-interaction systems...... 97

37 Electrostatic maps of G/U and A+/C basepairs involved in P-interaction ...... 98

38 Shallow groove pocket interactions marked in 16S rRNA...... 100

39 Shallow groove pocket interactions in 5S rRNA, 23S rRNA, and tRNA…...... 101

40 The cis Watson-Crick G/U and A+/C basepairs...... 102

41 Venn diagram representing locations, types, and sequence conservations of G/U wobble basepairs in the ribosome...... 105

42 A part of Chomsky hierarchy of transformational grammars...... 113

43 A hypothetical RNA structure and its SCFG profile ...... 115

44 Representations of Hm 23S rRNA Helix 95 containing the sarcin/ricin motif ...... 117

45 Representation of the “tWH-IL-tHS” motif on which our hybrid SCFG/MRF alignment algorithm is tested…...... 118

46 The eleven “tWH-IL-tHS” motifs found in all structures ...... 119

47 Manual structural alignment of the eleven “tWH-IL-tHS”s motifs...... 120

48 MATLAB script representing the seed alignment of “tWH-IL-tHS” motif...... 121

49 Validation of results with Ribostral Alignment Viewer ...... 122

50 “tWH-IL-tHS” motif homologues aligned with the hybrid algorithm ...... 124

51 Representations of “tWH-IL-tHS” motif and kink turn motif...... 125

52 Scatter plot of sequences scored according to two motif models ...... 126


LIST OF TABLES

Table Page

1 Examples of RNA catalysis ...... 3

2 The twelve families of RNA basepairs and the symbols for their designation...... 9

3 Basepair codes used by Ribostral ...... 37

4 G/U’s reported in secondary structures but not observed in 3D structures ...... 75

5 Interactions of the cis WC G/U basepairs in 16S rRNA...... 77

6 Interactions of the cis WC G/U basepairs in 23S and 5S rRNA...... 79

7 Sequence analysis of cis WC G/U basepairs in 16S, 23S, and 5S rRNA...... 86

8 Sequence analysis of all P-interactions in 16S, 23S, 5S rRNA, and tRNA...... 90


CHAPTER I: BACKGROUND AND SIGNIFICANCE

General Characteristics of RNA

RNA stands for ribonucleic acid. In some respects, this molecule is similar to deoxyribonucleic acid, or DNA, and in others it is very different. DNA and RNA molecules are made of long strands of similar building blocks, and both can store and transmit genetic information. RNA, however, is much more diverse in structure and function. Besides messenger RNA (mRNA), which acts as an intermediary between DNA and proteins, many other non-coding RNAs that play diverse and central roles in cellular processes are also known. Many non-coding RNAs resemble proteins more than DNA in structure and function: they can fold into specific compact 3D structures that allow them to carry out complex biological functions, including chemical catalysis (Guerrier-Takada et al. 1983; Altman 1984; Bass et al. 1984; Guerrier-Takada et al. 1984; Cech et al. 1986; DeRose 2002; Kirsebom 2002; Lilley 2003; Bevilacqua et al. 2005). When the ability of RNA to catalyze reactions was discovered in the 1980s, it was a paradigm-changing discovery that altered the way scientists think about RNA and earned its discoverers, Sidney Altman and Thomas Cech, the Nobel Prize in Chemistry in 1989. As a result of their work, RNA was no longer thought of as a passive molecule that merely transmits genetic codes relayed to it by DNA, or, in the case of ribosomal RNA (rRNA), as a passive scaffold for proteins. We now view RNA molecules as dynamic structures capable of catalyzing complex biological reactions, binding and organizing substrates, and carrying out complex informational processes. Before that time, only proteins were believed to be able to perform those functions. RNA eventually emerged as a molecule capable of performing the two functions that are essential for life: storing and passing on genetic information, and at the same time carrying out biological reactions and informational transactions, such as self-replication. This fed speculation that RNA could have been the first molecule with which life on planet Earth originated, and gave rise to the “RNA world” hypothesis (Gilbert 1986).

RNA catalysis occurs in all organisms, which depend on it for their existence (Table 1). Some examples of essential reactions that are catalyzed partially or completely by RNA are the peptidyl transferase reaction carried out by rRNA (Noller et al. 1992), splicing introns out of eukaryotic pre-mRNA to form mature mRNA (Tarn et al. 1997), signal recognition for translocating some proteins across cell membranes (Larsen et al. 1993), and many types of post-transcriptional modification or genetic regulation in higher eukaryotes (McKeown 1992; Peltz et al. 1992; Melefors et al. 1993; Maxwell et al. 1995). This may provide an explanation for the huge phenotypic differences between higher eukaryotic organisms and lower organisms, despite the similarity between their coding DNAs and number of genes (Mattick 2003; Mattick 2004; Mattick 2005; Ravasi et al. 2006). All of these functions are possible through the specific 3D configuration that the RNA molecules involved adopt. Thus, in order to better understand the function and evolution of RNA, it is necessary to study in detail all aspects of its structure.

Ribozyme                                      Reaction
Ribosome (peptidyl transferase center)        Peptidyl transfer in protein synthesis
Ribonuclease P                                Cleavage of 5' end of precursor tRNAs
Signal recognition particle                   Translocation of proteins across membranes
Small nucleolar RNAs                          RNA processing and modification
5' and 3' untranslated regions of RNA         Post-transcriptional regulation
Spliceosome (contains small nuclear RNAs)     RNA splicing (removal of introns from pre-mRNA)
Group I introns                               Self splicing of introns in rRNA genes
Group II introns                              Self splicing of introns in mRNA genes
Viroids and virusoids                         Self splicing and cutting of RNA repeats

Table 1. Examples of RNA catalysis.


RNA Structure

RNA is hierarchically organized, and its 3D structure can be described at different levels. The

primary structure specifies the linear covalent sequence of nucleotides that form the strand of the

RNA molecule. Each nucleotide is composed of a phosphate, a five-membered ribose sugar, and

a base (Figure 1). The base can either be a purine (R) or a pyrimidine (Y). Purines can be

adenines (A) or guanines (G), and pyrimidines can be cytosines (C) or uracils (U). In addition to

these there are some other modified bases (Kruger et al. 1998; Iseni et al. 2002; Ziesche et al.

2004; Sumita et al. 2005).

Figure 1. Identification of the parts of an RNA nucleotide: the phosphate, the five-membered ribose sugar, and the base. The left panel shows a purine, and the right panel shows a pyrimidine. The three edges of each base are also indicated (discussed later in text). The “C-H” edge of pyrimidines is frequently referred to as the Hoogsteen edge as well.

These are the building blocks that are shared between RNA and DNA, except that DNA has thymine (T) instead of uracil, and also has a different form of sugar, deoxyribose (which lacks the 2’-hydroxyl group), instead of the ribose sugar of RNA. This simple difference between DNA and RNA structures has important consequences. The lack of the 2’-hydroxyl makes DNA much more resistant to hydrolysis, and thus more stable and safer for storing large quantities of genetic information. Furthermore, because of the different sugar puckers that each of the molecules assumes, helices formed from double strands of DNA typically have a different fundamental shape than that of RNA (B versus A helices, respectively). The resulting double-stranded DNA structures (those that mostly exist in nature) are more rigid and allow for less structural variety than RNA structures. Single stranded DNA (ssDNA) molecules are much less common than single stranded RNA, but when they occur or when they are chosen by in vitro selection (SELEX), they too can carry out diverse functions. Examples of these are DNAzymes (catalytic DNAs) and aptamers (Ulrich et al. 2004; Peracchi 2005). DNA can also form other structures, such as triple stranded molecules and other artificial nano-molecular structures (Hirao et al. 1995; Seeman 1999; Seeman 2003; Seeman 2005). RNA, however, can make other kinds of minor groove interactions by using its 2’-OH group. This important RNA characteristic is covered in detail in Chapter III.

Instead of the long helices formed by two perfectly complementary strands of DNA, an RNA chain folds back on itself (often imperfectly) to form short stretches of helical regions interrupted by bulges, internal loops, hairpin loops, or multi-way junctions. This second level of RNA hierarchical structure is known as the secondary structure of RNA (Figure 2).

Figure 2. Secondary structure of the 5S rRNA from Haloarcula marismortui. Secondary structure elements shown include several helices, a three-way junction (3WJ), two internal loops (IL), two hairpin loops (HP), and several bulges (B).

Basepairing and base stacking are the two fundamental types of interactions responsible for forming RNA secondary (and tertiary) structures. Figure 2 is a planar representation of the cis Watson-Crick (W.C.) type of basepairing, and some base stacking within helices. Basepairing is the edge-to-edge interaction between two roughly coplanar bases to form a larger structural unit, the basepair. W.C. basepairs (A–U, U–A, C=G, and G=C) are mostly represented by solid lines between nucleotides in the figure. Basepairing is very specific: it occurs if two or three atoms from the first base are able to simultaneously form hydrogen bonds with two or three atoms from the second base. This H-bonding can only exist between atoms with opposite partial charges, where the positively charged atom is a hydrogen covalently bonded to an oxygen or nitrogen (the donor), and the negatively charged atom is an oxygen or nitrogen with non-bonding electrons (the acceptor). Some atoms can act as both H-bond donor and acceptor.

Other molecular forces, such as van der Waals and dipole-dipole interactions, also play important roles in shaping RNA, but they are much less specific or directional than H-bonding.

Consecutive basepairs in helical regions and sometimes individual bases in loop regions are

stacked on top of each other due to these less specific dipole-dipole and dispersion forces

(Sponer et al. 2001; Takasu et al. 2002). Stacking is less well structurally characterized than

basepairing. It is generally possible between any two types of bases, as long as there are no

electrostatic or steric clashes between their atoms (Dima et al. 2005). Stacking significantly lowers

the free energy and stabilizes the whole molecule (Zhu et al. 1997; Burkard et al. 1999; Dima et

al. 2005). Although secondary structure representations of RNA molecules provide an easy way

to visualize and understand different aspects of the molecule, they do not completely describe the

3D structure of the molecule (Figure 3). S2S is a program designed to annotate the secondary structure in a way that reflects all 3D interactions, based on the Leontis and Westhof nomenclature

(Leontis et al. 2001; Jossinet et al. 2005).

Figure 3. Tertiary structure of 5S rRNA from Haloarcula marismortui, the same molecule whose secondary structure is displayed in Figure 2. Notice how helix 1 (yellow) folds and interacts with helix 2 (green). This type of tertiary interaction cannot be easily represented on secondary structures.

This tertiary or 3D structure is the third level of RNA structure. It was demonstrated that RNA folds in a hierarchical manner, with local or nearby elements of RNA folding into their secondary structures (such as helices) first, and only then the whole secondary structure folds

compactly on itself to form the functional tertiary structure (Wu et al. 1998; Kent et al. 2000;

Thirumalai et al. 2001; Onoa et al. 2004; Westhof et al. 2004; Woodson 2005; Li et al. 2006).

Tertiary structures are formed through the same basic interactions that are responsible for the

formation of secondary structures. The main difference is the range of those interactions.

Tertiary interactions bring together distant (in 2D structure) helices or loops to form 3D motifs with higher complexity and compactness. Base triples or higher combinations of bases also occur, and they can be considered as two or more combinations of basepairs occurring

simultaneously through different edges. Finally, there is a fourth and final level of structure called quaternary structure. This involves interactions between different RNA molecules or

between RNA and proteins to form the active biological complex. An example of such a

complex is the ribosome, which is made of two different RNA subunits (5S-like and 23S-like

rRNAs form the large subunit or LSU, and 16S-like rRNA forms the small subunit or SSU) and up to 50 ribosomal proteins (Schuwirth et al. 2005). In structural RNAs, as is the case for other structural molecules, quaternary and tertiary interactions are more conserved than secondary structure, which is more conserved than primary sequence (Westhof et al. 2004).

Basepairing, RNA Evolution, and Isostericity Matrices

The basepairs represented in the secondary structures are W.C. A/U and G/C basepairs and G/U

“wobble” basepairs. These are the ones represented by solid lines (or black dots for G/U), and found mostly in helices. A majority of bases in the so-called “loops” also pair to form non-Watson-Crick (non-W.C.) basepairs. In addition, bases forming W.C. basepairs can at the same time form other non-W.C. basepairs with bases in loops. The different types of basepairing are possible because each base has three distinct edges (Figure 1), and each basepair combination

can take place through one of these edges from one base and another from the other base. Also,

each basepair can have one of two possible orientations around the glycosidic bond that connects

the base (glycosidic nitrogen) to the sugar (C1’) as in Figure 4. The combination of base edges and glycosidic bond orientations gives a total of twelve main types or families of basepairs

(Leontis et al. 2001). Table 2 shows a summary of these types. In addition to the twelve main families of basepairs, there are other known intermediary types (Auffinger et al. 1999; Lemieux

et al. 2002). One of these is the cis Watson-Crick bifurcated family, which is an intermediate

between the cis Watson-Crick/Watson-Crick and the trans Watson-Crick/Hoogsteen families.

Figure 4. Cis (panel A) versus trans (panel B) orientations of glycosidic bonds. These nucleotide orientations are defined relative to the line drawn along the center of all hydrogen bonds forming the basepair.


No.   Glycosidic Bond Orientation   Interacting Edges (Abbreviation)
1     Cis                           Watson-Crick/Watson-Crick (WC/WC)
2     Trans                         Watson-Crick/Watson-Crick (WC/WC)
3     Cis                           Watson-Crick/Hoogsteen (WC/H)
4     Trans                         Watson-Crick/Hoogsteen (WC/H)
5     Cis                           Watson-Crick/Sugar edge (WC/SE)
6     Trans                         Watson-Crick/Sugar edge (WC/SE)
7     Cis                           Hoogsteen/Hoogsteen (H/H)
8     Trans                         Hoogsteen/Hoogsteen (H/H)
9     Cis                           Hoogsteen/Sugar edge (H/SE)
10    Trans                         Hoogsteen/Sugar edge (H/SE)
11    Cis                           Sugar edge/Sugar edge (SE/SE)
12    Trans                         Sugar edge/Sugar edge (SE/SE)

Table 2. The twelve families of RNA basepairs and the symbols for their designation (Leontis et al. 2001).
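The count of twelve families follows from simple combinatorics: three base edges give six unordered edge pairings, and each pairing can occur in a cis or a trans orientation. The following Python sketch is only an illustration of that counting argument; the names EDGES, ORIENTATIONS, and basepair_families are introduced here for the example and do not come from Ribostral or from the cited nomenclature paper.

```python
from itertools import combinations_with_replacement

# The three interacting edges of a base and the two glycosidic bond
# orientations, as listed in Table 2. The variable names are illustrative.
EDGES = ["Watson-Crick", "Hoogsteen", "Sugar edge"]
ORIENTATIONS = ["Cis", "Trans"]


def basepair_families():
    """Enumerate (orientation, edge, edge) for every unordered edge pairing."""
    families = []
    for edge_a, edge_b in combinations_with_replacement(EDGES, 2):
        for orientation in ORIENTATIONS:
            families.append((orientation, edge_a, edge_b))
    return families


families = basepair_families()
for number, (orientation, edge_a, edge_b) in enumerate(families, start=1):
    print(number, orientation, f"{edge_a}/{edge_b}")
```

With the edges listed in this order, the enumeration reproduces the row order of Table 2.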

During evolution, DNA undergoes random mutations due mostly to mistakes in its replication,

virus interference, or other mechanisms (Higgs 2000). This can happen to both coding and non-

coding regions of DNA. These mutations are then passed on to coding and non-coding RNAs.

Coding RNAs (mRNAs) code the sequences of proteins. If the mutations cause significant

change in the protein sequence and structure, they may lead to metabolic disorders and

sometimes death of the organism. Mutations in non-coding RNAs that have structural roles are

screened by natural selection in a similar way. Similar to proteins, the main element that has to be conserved here is the resulting 3D structure of the RNA molecule. Since basepairing is the basis of 3D structure, any mutations that cause significant change in basepairing are most probably deleterious, and organisms having them are eliminated from the gene pool, unless compensating mutations can restore the basepair to its original structure. This leads to an important idea which defines another fundamental difference between proteins and coding DNAs on one side and structural RNAs on the other: the basepair, and not the individual base, is the evolutionary unit of structural RNAs. In other words, it does not matter if a certain G/C basepair is changed into A/U for example, as long as the G/C and A/U are equivalent, or isosteric, to each other. The word “isosteric” means having the same structure. To be isosteric, two basepairs must

satisfy two criteria: first, they must have the same size, defined by the distance between C1’ of

the first base and C1’ of the second base, and second, their bases must have the same orientations

with respect to the bond axis, so that the backbones do not get distorted by the substitution. In

some particular situations, the identity of individual bases may also be important in determining

allowed substitutions. Specific bases may be needed in catalytic centers for example because of

their specific distribution of partial charges or functional groups. This may play essential roles in

mediating bond formation or breaking during the propagation of the reaction being catalyzed.

But generally speaking, basepairing is the most important basic structural element that needs to

be conserved in order for the overall structure and function of the molecule to stay the same.

Each family of basepairs has its own distribution of isosteric subfamilies. These are called the

basepair isostericity matrices (Leontis et al. 2002). Figure 5 shows all basepair families with their isosteric subfamilies. Boxes with the same color are isosteric within the same family. The

colors shown in this figure are also used by Ribostral (Chapter II), and are chosen carefully so

that nearly-isosteric groups are assigned “similar” colors. There are five such “similar” color

groups, as indicated by the color scale at the bottom right corner of Figure 5. The first table

(cWW) is for the cis Watson-Crick/Watson-Crick family. Its color patterns indicate that the classical W.C. basepairs (A/U, C/G, G/C, and U/A) are isosteric, all having the same color, red.

A/C and G/U form another isosteric subfamily, colored in yellow. These two isosteric subfamilies are nearly-isosteric to each other (red and yellow belong to the same color group).

C/A and U/G boxes are colored in pink, as these basepairs form yet another isosteric subfamily,

which is also nearly-isosteric to the classical W.C. subfamily (red and pink belong to another

color group), but not nearly-isosteric to A/C and G/U, despite the fact that they have the same

C1’ – C1’ distance (yellow and pink belong to different color groups). The reason is that these

basepairs form wobbles and are not symmetric (see Chapter III). There are a few more colors in the first table, each of which indicates a separate isosteric subfamily. The only color which denotes something else is gray: this indicates a forbidden interaction. G/G in the cWW table is

colored in gray because it cannot form a basepair due to steric clashes. As seen in the figure,

each family has a different pattern of isosteric subfamilies. For instance, G/C’s and A/U’s are not

always isosteric. It all depends on the basepair family they belong to.

Figure 5. Basepair families and their isosteric subfamilies. Gray colors indicate forbidden combinations of nucleotides, i.e., combinations that cannot form basepairs due to steric clashes or incompatible distribution of H-bond donor or acceptor atoms. In each family, isosteric subfamilies have the same color, and nearly-isosteric subfamilies have similar colors (the five “similar” color groups are shown in the bottom right corner). Letters correspond to the original reference (Leontis et al. 2002). Asterisk (*), corrected from the original reference.
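The cWW relationships described in the text can be written as a small lookup. The Python sketch below is deliberately partial: it encodes only the three cWW subfamilies and the single forbidden pair named in the text, not the full matrix of Figure 5, and it returns "not covered" for anything else. All identifiers (CWW_SUBFAMILIES, classify_cww_substitution, and so on) are hypothetical names chosen for this illustration; this is not Ribostral's implementation.

```python
# Partial cWW isostericity lookup, transcribed only from the subfamilies
# named in the text. The remaining cWW subfamilies are intentionally omitted.
CWW_SUBFAMILIES = {
    "classical": {"AU", "CG", "GC", "UA"},  # red in Figure 5
    "AC_GU": {"AC", "GU"},                  # yellow
    "CA_UG": {"CA", "UG"},                  # pink
}
CWW_FORBIDDEN = {"GG"}                      # gray: cannot form a cWW basepair

# Pairs of subfamilies that the text describes as nearly isosteric.
CWW_NEARLY_ISOSTERIC = {
    frozenset({"classical", "AC_GU"}),
    frozenset({"classical", "CA_UG"}),
}


def cww_subfamily(pair):
    """Return the name of the subfamily containing `pair`, or None."""
    for name, members in CWW_SUBFAMILIES.items():
        if pair in members:
            return name
    return None


def classify_cww_substitution(old_pair, new_pair):
    """Classify a cWW basepair substitution using the partial table above."""
    if old_pair in CWW_FORBIDDEN or new_pair in CWW_FORBIDDEN:
        return "forbidden"
    old_family = cww_subfamily(old_pair)
    new_family = cww_subfamily(new_pair)
    if old_family is None or new_family is None:
        return "not covered"
    if old_family == new_family:
        return "isosteric"
    if frozenset({old_family, new_family}) in CWW_NEARLY_ISOSTERIC:
        return "nearly isosteric"
    return "not nearly isosteric"


examples = [("GC", "AU"), ("GC", "GU"), ("GU", "UG"), ("GC", "GG"), ("GC", "AA")]
for old_pair, new_pair in examples:
    print(old_pair, "->", new_pair, ":", classify_cww_substitution(old_pair, new_pair))
```

The order of the two letters matters: G/U and U/G fall into different subfamilies, which is why the lookup treats each pair as an ordered string.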

Before proceeding further one critical point needs clarification. Thinking in evolutionary terms, one expects that mutations simultaneously affecting both bases in a basepair in one particular organism should be exceedingly rare because of the low probability of mutations. However,

RNA sequences represent not individual organisms, but the most common sequence in a large population that constitutes a particular species. We can therefore observe what seems to be small evolutionary “jumps” from one basepair like G/C to another like C/G, A/U, or U/A in closely related species (Higgs 2000). There might be reasons to prefer one of these basepairs to the others, so genetic drift may occur towards the advantageous basepair, which may become fixed in the population (Byles 1972). We rarely observe the less stable intermediate steps between these evolutionary jumps, such as the G/U which is the intermediate between G/C and A/U, or the G/A which is the intermediate between G/C and U/A, except if there is another reason that gives these intermediate basepairs some extra stability and makes them appear significantly in the population. This occurs for example when such basepairs participate in specific tertiary contacts. The same principle applies when dealing with mutations involving other basepairs as well. A structurally or otherwise crucial G/U basepair in some ancient organism may be observed as G/U or A/C, which are isosteric to each other, in descendent species without seeing very many of the intermediate steps like G/C or A/U. If some tertiary interaction that stabilizes the 3D structure can take place in case of the G/U but not the A/C, then evolutionary pressure would give G/U the advantage. Chapter III goes more in depth into this matter, showing examples of basepairs that are highly conserved G/U’s throughout evolution because of specific tertiary interactions they mediate.


RNA Sequence Alignment: A Statement about Evolution

The basepair families and their isosteric subfamilies are important for understanding evolution and evaluating resulting sequence differences between organisms. By knowing what types of basepairs are present in a structural RNA molecule, it is possible to deduce the isosteric substitutions at each basepair position. Such substitutions are structurally neutral and most probably would not affect function. Organisms having these substitutions are more likely to survive and be observed today than organisms having non-isosteric substitutions. Before going into how this can be applied for building and improving sequence alignments, which is our overall goal, a closer look is needed at sequence alignments and what they mean and represent.

Building Sequence Alignments

When sequences are aligned, two or more sequences of homologous molecules are placed one above the other. Generally, “similar” positions are aligned to each other, meaning that they are put in the same column in the alignment. If, due to mutations, some sequences have insertions or deletions compared to others, gap symbols such as dashes (“–”) are used to replace the deleted nucleotides and keep all similar positions aligned vertically. In this sense, sequence alignments are detailed descriptions of the structure and evolution of molecules or organisms they represent.

They are an attempt to accurately display evolutionary events, such as point mutations and motif deletions or insertions, without losing track of the parts of the molecules that correspond to each other. This is a complicated task. In proteins for example, positions are compared to each other one at a time. By using information from empirical amino-acid substitution matrices, such as the PAM or BLOSUM matrices (Dayhoff et al. 1978; Jones et al. 1992; Henikoff et al.

1993), the researcher can make inferences regarding likely and unlikely amino acid substitutions.

These matrices generally show that amino acids having similar chemical properties, such as

acidity, basicity, or hydrophobicity, tend to replace each other more often without distorting the

molecule. This method largely ignores the context, as a small number of slightly different

matrices are used for all positions. PSI-BLAST is a similar method that constructs context-specific or position-specific substitution matrices from available sequence alignments. Hidden

Markov Models (HMM) are usually used to construct these matrices. An alignment “profile” is

built and new sequences can be added more precisely to the growing alignment. This process is

iterated and enhanced as the alignment grows (Altschul et al. 1997; Neuwald et al. 2000; Jones

et al. 2002).

RNA alignments are built in a different way. Basepairs instead of bases are compared to each other, because they are the basic structural unit for recurrent RNA motifs. This means that for RNA, secondary structures (which comprise W.C. basepairs), and not simply primary sequences, are aligned to each other. In fact, both sequence alignments and secondary structures are determined from one another in an iterative way. To make this possible, the Comparative Sequence Analysis (CSA) method has been developed (Gutell et al. 1985; Chiu et al. 1991; Gutell et al. 1992; Eddy et al. 1994; Michel et al. 2000). This method was created before most crystal structures of RNA molecules were known. It provides a measure of the degree to which nucleotides at two positions in an alignment are not independent of each other (this is called mutual information, or covariance). The higher this measure, the higher the probability that the two positions interact to form a basepair. This method is generally applied for detecting W.C. basepairs, by optimizing sequence alignments to maximize covariations between A/U, C/G, U/A, and G/C at aligned positions. It also considers most G/U and U/G wobble basepairs by optimizing their covariations with W.C. basepairs. Other base juxtapositions are considered “mismatches” even though other families of basepairs have been known since the discovery of the trans Watson-Crick/Hoogsteen U/A basepair by X-ray crystallography (Hoogsteen 1956).

One of the reasons for largely ignoring these basepairs may have been their scarcity and the inability to determine their accepted mutations from sequence analysis alone (Frank et al. 2000; Michel et al. 2000). However, some investigators were successful in predicting tertiary interactions this way (Yaniv et al. 1969; Michel et al. 1990; Michel et al. 2000). Alignment of RNA sequences using the CSA method to optimize basepairing is carried out using dynamic programming algorithms (Nussinov et al. 1980; Van de Peer et al. 2000; Wuyts et al. 2004), or by hand, where additional information from structure can also be considered if available (Mears et al. 2002; Pitulle et al. 2002; Lescoute et al. 2005).
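The covariation measure described above can be illustrated with a short calculation. The following is only a minimal sketch of the standard mutual information formula applied to two alignment columns; it is not the code of any CSA program cited here. The function name, the use of an ASCII hyphen as the gap symbol, and the choice to skip sequences with a gap in either column are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j are equal-length strings that hold the characters
    found in two columns of an alignment, one character per sequence.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must cover the same sequences")
    # Assumption of this sketch: a sequence with a gap in either column is skipped.
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    freq_i = Counter(a for a, _ in pairs)   # single-column counts
    freq_j = Counter(b for _, b in pairs)
    freq_ij = Counter(pairs)                # joint counts
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Two columns that covary perfectly as W.C. pairs (G/C, C/G, A/U, U/A).
covarying = mutual_information("GGCCAAUU", "CCGGUUAA")
# Two columns whose contents vary independently of each other.
independent = mutual_information("GGCCGGCC", "AUAUAUAU")
```

In this toy example the perfectly covarying columns score higher than the independent ones, which is the behavior CSA relies on when it proposes a basepair between two positions.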

In addition to the CSA method, some other methods are available to determine the secondary structure of RNA molecules. Free Energy Minimization is a prominent method capable of determining probable secondary structures of single molecules without having any homologous sequences available (which is a prerequisite for the CSA method). Folding free energies for all possible secondary structures are calculated based on Turner energy parameters, which are measured experimentally for many possible combinations of basepairs, basepair stacks, and for some hairpin loops (Mathews et al. 1999; Mathews et al. 2002; Mathews et al. 2004). These elements are assumed to be independent and additive, so for each secondary structure the total free energy can be calculated. The structures with the lowest free energies are the most stable, and one of them most probably corresponds to the actual secondary structure of the molecule. One problem with this method is that not all combinations of families of basepairs, stacks, and loops are measured, simply because there are too many of them. The other problem is that the secondary structure with the lowest calculated free energy does not necessarily correspond to the actual secondary structure because energy is not the sole factor that influences RNA folding. The dynamic programming algorithm of mFold (Zuker 2003) is based on this method, and was recently modified to return, in addition to the lowest free energy structure, other secondary structures within a range of energy specified by the user. A similar but statistically more rigorous approach is used by sFold (Ding et al. 2003; Chan et al. 2005), where in addition to minimum free energies, the Boltzmann distribution of all possible secondary structures is evaluated. The secondary structures that can be obtained in several ways are considered more probable to be formed than others with only specific or limited ways with which they can be formed. The results present the probability for each possible basepair to occur. By combining the most probable pairing for each base, the most probable structure can be determined. Dynalign is a program that uses both free energy minimization and CSA to determine the secondary structure common to two RNA molecules (Mathews et al. 2002; Mathews 2005).
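To give a feel for the kind of dynamic programming these folding programs build on, the sketch below implements the simple basepair-maximization recursion in the style of Nussinov et al. (1980). It is a deliberately simplified stand-in and not the mFold, sFold, or Dynalign algorithm: every allowed pair scores 1 instead of a Turner free energy, only nested pairs are considered, and the minimum hairpin loop size of three unpaired nucleotides is an assumption of this illustration.

```python
def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested W.C. and wobble basepairs in seq.

    Simplification: each pair counts 1; no stacking or loop energies are used.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with k, enclosing at least min_loop bases.
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


hairpin_pairs = max_nested_pairs("GGGAAAUCCC")
```

Replacing the unit score with experimentally measured stacking and loop energies, and minimizing instead of maximizing, leads toward the energy-based recursions used by the programs discussed above.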

Finally, there is the Stochastic Context Free Grammar (SCFG) method, which is currently considered the best method for determining secondary structure profiles as well as building alignments of RNA molecules. This method is explained in detail in Chapter IV, where we also suggest a significant enhancement to the way in which it is currently applied.

Although successful in determining most of the W.C. and wobble basepairs correctly, these methods often miss the non-W.C. basepairs that comprise about one third of all interactions. Since the W.C. and wobble basepairs occur mostly in helices, and other types of basepairs occur mostly in loops and junctions, these methods show good results in helical areas and make most mistakes in the loop and junction areas. These loop and junction areas are important for the function of the molecule because they often carry out most tertiary intra- and inter-molecular interactions. Without good crystal structures of RNA molecules, it is hard for the investigator to determine where the basepairs occur or what types they are. This problem has been successfully and incrementally solved through the years, with the determination of an ever increasing number of high resolution crystal structures. One thing that is still lacking is the application of the knowledge we have accumulated about the different families of basepairs and their isostericity matrices to improve sequence alignments.

Problems with Sequence Alignment Representations

As stated earlier, sequence alignments are usually represented in a 2D matrix with each row corresponding to a sequence, and each column to a “similar” position. This representation has some major limitations. There are instances when certain motifs or parts of alignments change into different forms in different organisms (motif swaps). In 5S rRNA for example, there are two types of loop E motif: one that is present mainly in archaea (Ban et al. 2000), and another present mainly in bacteria and plasmids (Harms et al. 2001; Schuwirth et al. 2005). These motifs occur at the same location in 5S rRNA, but they are not “alignable” with each other because they have different 3D structures and comprise different non-W.C. basepairs. A similar example is the ribozyme Ribonuclease P (RNase P). This molecule has a highly conserved core structure common to all organisms, with some major secondary structural alterations in some groups (Brown et al. 1996; Harris et al. 2001; Westhof et al. 2004; Gillespie et al. 2005; Marquez et al. 2005). There is a central question about whether these regions of sequences should be aligned with each other at all, or whether a different way should be designed to represent such motif swaps. If all matching motifs are aligned to each other, and all alternative motifs aligned to “gaps” in sequences where they are absent, the sequence alignments will contain a significant number of gaps. As more sequences are added to the alignment, more gaps need to be inserted to accommodate individual sequence idiosyncrasies and keep the equivalent areas aligned. This leads to an artificial horizontal expansion of the alignment (Thompson et al. 2005). The gaps causing the expansion are not biologically relevant; they have no real meaning except that they indicate non-alignability of motifs at these positions. This is artificial because most often alternative motifs occur at the same physical locations and have the same function in their alternative forms. So logically, they belong in the same vertical areas of the alignments. This is then a dilemma: rationally, alternative motifs should be put on top of each other in an alignment because they represent the same physical and functional areas of the molecule. However, in the way sequence alignments are currently presented, they cannot be put on top of each other because that would imply their position-by-position equivalence at each column, which is not true. Therefore, new ways for presenting sequence alignments are needed. One way would be to indicate alternative motifs in an alignment with different colors or symbols (such as brackets of different shapes), while keeping them aligned. The different colors or symbols indicate that these areas do not follow the exact basepairing patterns, but they do occur at the same locations in the 3D structure. These types of representations are not possible in the simplest and most common FASTA alignment format, and some details about specific motifs cannot be directly included in the more sophisticated GenBank alignment format. Other available formats, such as the RNAml format (Waugh et al. 2002) used by the program S2S (Jossinet et al. 2005), have the capabilities to better deal with these ontological issues (Thompson et al. 2005).

In addition, new ways have also been designed to describe the structural elements of the molecule that is being aligned. Such descriptions may include the names or numbers of helices or loops in the molecule, the corresponding nucleotides in basepairs, and even the types of basepair families observed at each position. Usually, extra lines that include this information are added to sequence alignments. A novel yet very simple way to describe secondary and tertiary basepairing patterns and types from all basepair families is suggested in Chapter II.

Models: Manageable Approximations of Realities

Science is the skill of designing models that describe reality in a simple yet accurate way. Often we do not see the full realities of things, and the realities that we do see may be too complicated to describe mathematically. This is why scientists do the best they can to capture essential features using simple mathematically manageable models that provide an adequate approximation of reality. This applies to all empirical sciences, including molecular biology.

Besides their manageability and approximation of observed reality, good models can be adapted to incorporate new details as these are discovered. Our aim is to design the best model for approximating observed RNA structure and phylogenetic behavior. Thanks to all the new structural data available today, we know much more now than a few years ago about possible types of basepairs and their isosteric subfamilies. We now have access to atomic-resolution crystal structures of large RNA molecules, including the small and large ribosomal subunits from several different organisms belonging to different phylogenetic domains. These structures contain valuable information for developing new alignment methods for many reasons.

Ribosomal RNA (rRNA) is universally essential for the life of every cell in every living organism, since it is the molecule that carries out the translation of mRNA genetic codes into functional proteins. It follows that the 3D structure of this ribozyme is highly conserved. Many homologous rRNA sequences are available for each of the ribosomal subunits, so sequence and phylogenetic studies are possible. These are also huge RNA structures containing thousands of basepairs of all types. Instead of speculating about the positions of basepairs, we can know that information for sure from crystal structures. We can also know exactly what types of basepairs are formed, and thus deduce what alternative substitutions would be isosteric or nearly-isosteric.

Our hypothesis is that by applying all of this “structural” knowledge, we can substantially improve sequence alignment methods, and come up with better RNA automatic aligning algorithms.

Importance and Use of Good RNA Alignments

After all this discussion, a fundamentally simple question may come to mind: what is the use of having good RNA alignments? There are in fact many answers to this question. Here, I will only mention two major uses for good sequence alignments. First, good sequence alignments are good mirrors of phylogenetic events; they help us understand the evolution of organisms and of individual molecules and build accurate phylogenetic trees. Building phylogenetic trees and building sequence alignments are linked together in a self-improving iterative process. Every time a phylogenetic tree is built based on a sequence alignment, it can be used to improve the alignment and to accurately add new sequences to it. The process is repeated as many times as needed to cope with the available phylogenetic and structural information.

The second use for good sequence alignments is homology modeling. This is building 3D models of molecules whose atomic-resolution structures have not yet been determined, based on comparing their sequences to one or many other homologous molecules with known structure. A good sequence alignment between the homologous molecules has to be constructed first, and only then can the nucleotides of the model be threaded to fit into the homologous structure.

Another way to determine the structure of RNA molecules is based on the energy minimization methods discussed earlier, but these have many limitations and do not provide good results for large and complicated structures (Nussinov et al. 1980; Mathews et al. 1999; Rivas et al. 1999; Mathews et al. 2002). Homology modeling is a useful tool in medicine and molecular biology. Atomic-resolution structures give accurate information about the structure of molecules, but they require years and sometimes decades to solve. Homology modeling is the next best solution for this problem, providing a preliminary structure in a fraction of that time. Drugs designed to attack malicious organisms that have a high mutation rate, or those being over-used or misused causing adaptation in malicious organisms, can lose their effectiveness with time. By sequencing the new forms of organisms and modeling their new structures, drugs can be quickly modified or redesigned to match the mutations without too much loss of time and human life. This procedure can also shed light on the mechanisms of such mutations and advance our knowledge in that field as well. The only disadvantage is that current molecular models may contain mistakes. By building better 3D structural alignments of molecules, these mistakes in models can be minimized.

Dissertation Structure

This dissertation is divided into three main parts, each of which will be presented in a separate chapter. These chapters address different sides of the RNA structural alignment problem, and present new and valuable tools that help solve it.

Chapter II describes a new suite of user-friendly programs, collectively named Ribostral, which allow the investigator to analyze and evaluate existing sequence alignments in light of isostericity patterns of different families of basepairs. There is no other tool that allows for such analysis or evaluation. This is useful for many applications, including determining the locations of potential mistakes in sequence alignments and suggesting ways for their hand curation. This tool was essential for example in the hand improvement of the 5S RNA alignments obtained from the RFAM database, which is widely accepted as the most accurate RNA alignment database at this time (Griffiths-Jones et al. 2003; Griffiths-Jones et al. 2005). This tool is also useful for obtaining specialized substitution data from sequence alignments, such as basepair and motif substitution patterns. It also includes an HTML alignment viewer that is the only one of its kind capable of evaluating basepairing patterns according to their compliance with the isostericity rules.

Chapter III includes the study of the G/U wobble basepair, which is the most common type after the classical basepairs. All tertiary interactions involving this basepair are studied and classified in detail using Ribostral and other tools. A few novel interactions are discovered and studied phylogenetically, and their sequence signatures are described. This information has been useful for improving sequence alignments of motifs involving such interactions. This study also shows the depth and complexity of RNA structure, which should be taken into consideration when building RNA models.

Chapter IV describes a new approach to automatically align RNA sequences based on atomic-resolution structures. The best current method for RNA alignment is based on Stochastic Context Free Grammars (SCFGs) that are capable of describing nested inter-dependencies of some RNA nucleotides (basepairs) without using information from 3D structure. Our method is a combined approach that uses all the information seen in 3D structure and applies a SCFG algorithm for describing nested interactions as well as a Markov Random Field (MRF) algorithm for describing non-nested or crossing interactions. This hybrid SCFG/MRF method is highly successful in aligning internal loop and hairpin loop areas, which constitute the most problematic positions in current sequence alignments. Our method in addition recognizes all families of basepairs and differentiates between all of their isosteric subfamilies. This is in fact its most powerful enhancement to the method currently in use. Examples of motifs automatically aligned with this new method are presented in this chapter and evaluated by Ribostral.

CHAPTER II: RIBOSTRAL – AN RNA 3D STRUCTURAL ALIGNMENT ANALYZER, EVALUATOR, AND VIEWER BASED ON BASEPAIR ISOSTERICITIES

Introduction

Ribostral (Ribonucleic Structural Aligner) is a suite of programs designed to integrate known structural data with homologous sequence alignments, with the purpose of evaluating the quality of the alignments and guiding efforts to improve them. The main GUI (Graphical User Interface) of this program provides an expandable user-friendly platform through which other related programs can be run. The related programs gather and analyze atomic resolution structure data, parse and automatically align sequences, and perform other manipulations on sequence alignments including extracting sub-alignments corresponding to individual motifs or domains, removing repeated sequences from an alignment to build a “unique” alignment with higher phylogenetic diversity, and creating a .fasta alignment file from .mat, which is another alignment format used by Ribostral. These tools are covered in detail at the end of this chapter. The main functions of the program, namely analyzing, evaluating, and viewing RNA sequence alignments, are discussed first.

We start with a short review of existing RNA sequence analysis programs. One of the most widely used editors for manual alignment of RNA sequences is the program BioEdit, which runs under Windows platforms (Hall 1999). BioEdit reads a sequence alignment file and allows the user to choose pairs of nucleotides one pair at a time and display substitution (covariation) patterns for them (in BioEdit, this is called Mutual Information Examination). The resulting substitution table, however, provides no description of what the observed substitution patterns mean in terms of structure. BioEdit does provide one kind of link between sequence alignments and structure: if the sequence alignment contains a “mask” describing cis Watson-Crick nested basepairs, BioEdit will color nucleotides in the alignment in terms of how well they conform to the basepair family. A mask is a representation of the locations of basepairs occurring between nucleotides represented in the sequence alignment, with paired characters such as "(" and ")" indicating the positions of basepairs in the alignment. BioEdit however does not provide such information for any of the other eleven or so families of edge-to-edge interactions that are possible between bases, comprising about one third of all interactions (Leontis et al. 2001). BioEdit is designed to be primarily an alignment editor and viewer, and not a tool for structural alignment of RNA the way Ribostral is. Coseq, a program that runs under UNIX platforms, measures substitution patterns of basepairs without using any structural information (C. Massire and E. Westhof, unpublished). The user then needs to analyze the structure manually to see if sequence substitution patterns agree with it. Finally, S2S is another more recent UNIX program that dynamically displays parts of structure, creates full 2D annotations of them, and shows corresponding positions in sequence alignments in an alignment editor (Jossinet et al. 2005). However, it too does not evaluate alignments based on isostericities of basepairs formed in structure the way Ribostral does.

Ribostral, which runs under multiple platforms, can either be used like other sequence analysis programs, i.e., to simply provide substitution patterns of basepairs in a sequence alignment without any relation to structure, or can be used as the much more powerful tool that it is designed to be: to provide substitution patterns, and at the same time superimpose structural information on top of the substitution patterns to make sense of them. Ribostral does that by coloring substitution patterns of each basepair in a way that reflects the edge-to-edge family and the isosteric subfamily it belongs to (Leontis et al. 2002). Even in its simpler usage as a program that provides sequence substitution analysis without the use of 3D structural information, Ribostral is more convenient than other programs, because it allows for simultaneous analysis of lists of basepairs and produces a single and easily portable HTML output, without having to input nucleotide numbers of interest one position at a time. Interactive position-by-position analysis is also possible, where in addition to what has been described above, an integrated structure viewer is also available. In addition to providing substitution information for basepairs, Ribostral is also capable of analyzing substitution patterns for more than two nucleotides at once, such as base triples, quadruples, and so on. The sequences of whole motifs can be analyzed this way, as will be shown below.

Supported Platforms and Deployment Process

Ribostral is designed and fully tested under Windows XP. It is free and can be obtained from http://rna.bgsu.edu/mokdad/Ribostral. The program is distributed in two forms: MATLAB source files capable of running on any platform with MATLAB version 7 SP3 or higher (with loss of some non-essential options on non-PC platforms), or a stand-alone program capable of running under Windows or Linux platforms, after installation of a free compiler provided by Mathworks (http://www.mathworks.com). Figures used in this text are based on the Windows XP version with system appearance set to Windows Classic style. Upon downloading the MATLAB source files or the stand-alone version, the user will end up with the specific hierarchy of folders shown in Figure 6.

Figure 6. Ribostral default installation subdirectories. Using folders marked with an asterisk (*) is optional; on Windows platforms, these are the locations where Ribostral starts browsing for the corresponding input files.

Executing Ribostral

The startup window of Ribostral is a blank GUI with three menu options: File, Tools, and Help (Figure 7). The other GUIs of the program are activated from this window.

Figure 7. Ribostral’s main GUI.

To avoid ambiguity, the GUI only displays options that are allowed at the specific stage of the analysis. For example, before doing any analysis the user needs to load a sequence alignment file on which the analysis will be performed. So initially, the GUI does not display the buttons that start the analysis. These buttons will appear only after the sequence alignment file is loaded into the program. To provide additional help for the user, the upper right-hand corner of the main GUI is reserved for messages describing the status of the program, any errors in execution, or hints on how to proceed further.

Loading an Alignment File

To start the analysis, the user needs to follow the hint that appears in the upper right-hand corner of the main GUI: “Hint: start from FILE”. By clicking on “File” from the menu bar, the user sees the options: Open FASTA alignment, Open NT list, and Preferences (Figure 8).

Figure 8. File menu options.

The first thing to do is to open an alignment file. In its current version, Ribostral reads only alignments in FASTA format, the simplest and most common sequence alignment format available. In this format, each sequence is represented by a header line preceded by the symbol “>” and one or more sequence lines. Upon choosing “Open FASTA alignment” from the File menu, the user can browse the local drive for alignment files (Figure 9).

Figure 9. Browsing for an alignment file.

On PC machines, the browser window initially starts looking for alignment files in the folder:

\FASTA_alignments” (refer to Figure 6). Here, there are two main options: reading a raw alignment file in its text form (with extensions such as .fasta or .txt), or reading a MATLAB data file (.mat) which is derived from the raw alignment file. This last option is processed faster by Ribostral, especially if the alignment of interest contains thousands of sequences, like the 16S rRNA alignments. A .mat file is created after a FASTA alignment is read for the first time. It is saved in the same directory and under the same name except for the extension. Time can then be saved by reading the data file instead of the text file the next time the same alignment is used. Note that a new data file is only created if no data file with the same name is already present in the directory. Therefore, if the raw FASTA alignment is deliberately modified in any way, the old .mat file referring to it needs to be deleted to allow for its re-creation. Notice that when a FASTA format file is read, all characters indicating unknown nucleotides (“N”, “n”, “O”, and “o”) are transformed into “o”, and all other characters are capitalized. When analyzing basepairs, Ribostral only recognizes dashes (“–”) which represent gaps or deletions, o’s, and the four RNA nucleotide letters A, C, G, and U as valid characters in sequences. When analyzing longer motifs, the program recognizes all characters.
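The character handling described above can be sketched as follows. This is a minimal Python illustration and not Ribostral's own MATLAB code. The function names are hypothetical, and the ASCII hyphen used as the gap symbol in the example is an assumption.

```python
UNKNOWN = set("NnOo")  # characters that the text says are mapped to "o"


def normalize(sequence):
    """Map unknown-nucleotide characters to "o" and capitalize the rest."""
    return "".join("o" if ch in UNKNOWN else ch.upper() for ch in sequence)


def read_fasta(text):
    """Parse FASTA text into a list of (header, normalized sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, normalize("".join(chunks))))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # A sequence may span one or more lines.
            chunks.append(line)
    if header is not None:
        records.append((header, normalize("".join(chunks))))
    return records


example = ">Eu_halomari\nggau-Nc\nao\n"
records = read_fasta(example)
```

Because unknown positions all become the lowercase letter “o” while every other character is capitalized, an unknown nucleotide can never be confused with a valid base in later steps.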

Dividing Sequences into Subgroups

After the alignment file of interest is chosen, Ribostral looks in the same directory for an Excel file called “KnownFASTAFilenames.xls”. This is an optional user-created Excel sheet that gives additional details about the FASTA file being read, such as the names of different groups (or domains) of organisms it represents, and the boundaries of these groups. If this Excel file is not present in the same folder as the FASTA file being read, or if the exact name of the FASTA file being read cannot be found in the Excel file, this information will be ignored and all FASTA sequences in the alignment will be considered as one group. The analysis of specific subgroups (such as phylogenetic domains) will not be possible later. Figure 10 shows the format of the “KnownFASTAFilenames.xls” file and the information it contains.

Figure 10. A snapshot of the “KnownFASTAFilenames.xls” file. The highlighted entry (6) is the FASTA file analyzed in this chapter.

The Excel file is organized in the following way: The first column contains the exact names of “known” FASTA alignments; the second column contains the names of the subgroups the FASTA alignment contains, separated by empty spaces; and the third column defines the limits of these subgroups in the order of their names, separated by empty spaces. The highlighted file (entry 6, 5S_ABE_2004_UNIQUE.fasta) is the one used for the sample study in this chapter. It contains sequences from the phylogenetic subgroups: archaea (sequences 0+1 to 39), bacteria (sequences 39+1 to 390), and eukarya (sequences 390+1 to 667). To ignore the first sequence in the alignment (if it is a structure mask for example, or if it is a reference sequence that does not belong to the subgroup), “1” can be placed instead of “0” in the third column. In that case, the program still reads the first sequence but does not include it in the analysis.
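The way the subgroup limits translate into sequence ranges can be sketched in a few lines. This is an illustration only. The cell contents used below (“archaea bacteria eukarya” and “0 39 390 667”) are reconstructed from the description above and not copied from Figure 10, and the function name is hypothetical.

```python
def split_subgroups(names_field, limits_field):
    """Turn the second and third spreadsheet columns into sequence ranges.

    Returns a dict mapping each subgroup name to a (first, last) tuple of
    1-based, inclusive sequence numbers.
    """
    names = names_field.split()
    limits = [int(value) for value in limits_field.split()]
    # Assumption: n subgroups are delimited by n + 1 boundary values.
    if len(limits) != len(names) + 1:
        raise ValueError("expected one more limit than subgroup names")
    groups = {}
    for name, start, end in zip(names, limits, limits[1:]):
        # A subgroup runs from the sequence after `start` up to `end`.
        groups[name] = (start + 1, end)
    return groups


groups = split_subgroups("archaea bacteria eukarya", "0 39 390 667")
# Starting the limits at 1 instead of 0 leaves the first sequence out.
skip_first = split_subgroups("archaea bacteria eukarya", "1 39 390 667")
```

Each limit serves both as the end of one subgroup and as the offset of the next, which is why the text writes the ranges as “0+1 to 39”, “39+1 to 390”, and “390+1 to 667”.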

The main GUI then displays any known details associated with the FASTA file and also unlocks some buttons or check boxes that permit the user to analyze the alignment (Figure 11). Note that if this step (loading a new FASTA file) is repeated at any stage during the analysis, all previous data that corresponds to previously read alignments will be erased from memory and the GUI will be reinitialized.

Figure 11. GUI changes when an alignment file is successfully loaded. The text box in the middle displays the name of the sequence file in memory.

At this stage, two options are possible for carrying out sequence (and optionally structure) analysis: analyzing a list of nucleotides (with or without structure data) for the complete sequence analysis of all positions listed in it, or interactively and directly analyzing individual positions on screen. The first option is covered first.

Preparations for the Analysis of a List of Nucleotides

After loading a FASTA file, the user can load an Excel file containing a list of the nucleotides of interest. This is done by activating the option “Open NT list” from the File menu. A browser opens, and on PC platforms the search for Excel files starts in the local directory “

directory>\NT_lists” (Figure 12).

Figure 12. Browsing for an Excel nucleotide list.

The NT list Excel file must contain names of the source organisms and nucleotide numbers of interest. The first row is a header row and anything in it will not be read. Figures 13 and 14 show what such a file should look like.

Figure 13. Nucleotide list for analysis. This Excel list from 23S rRNA allows the sequence analysis of four nucleotide positions simultaneously.

Figure 14. Basepair list for analysis. This Excel list from 5S rRNA includes structural data about some of the basepairs (to show that structural data is optional).

As seen in Figures 13 and 14, the first column of the Excel sheet contains the name of the reference organism for that row, and subsequent columns contain the nucleotide numbers to be analyzed. The name must be the full name or part of a name (case insensitive) present in the loaded FASTA alignment file. It does not necessarily need to be the beginning of the name (e.g. “Halomari” can be used if the actual name in the FASTA alignment is “Eu_halomari”). If several FASTA comment lines share this name, the first of the occurrences will be considered as the reference organism (e.g. if “halo” is used, and it is present three times in the sequence alignment, first as “EU_HALOJAPO”, then as “Eu_halomari” and then as “Eu_halomedi”, the first will be used). It is advisable for the user to manually check sequence names in the FASTA file to prevent inadvertent reference to unintended organisms. The most common error is caused by incorrect spelling of the source organism in the NT list Excel file; if such a spelling cannot be found in the FASTA file, an error message will appear and the program will stop. If universal numbers for the alignment are desired, the word “universal” can be used instead of an organism name (similar to entry 8 in Figure 14).
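The name matching rule just described (case-insensitive, substring, first occurrence wins) can be sketched as follows. The function name, the use of `None` to signal universal numbering, and the exception type are assumptions of this illustration and not a description of Ribostral's internals.

```python
def find_reference(headers, query):
    """Return the index of the first header containing `query`.

    The match ignores case and may occur anywhere in the header.
    Returns None when universal alignment numbering is requested.
    """
    wanted = query.strip().lower()
    if wanted == "universal":
        return None
    for index, header in enumerate(headers):
        if wanted in header.lower():
            return index  # the first occurrence is the reference organism
    # Mirrors the described behavior: a misspelled name stops the analysis.
    raise LookupError(f"no sequence name contains {query!r}")


headers = ["EU_HALOJAPO", "Eu_halomari", "Eu_halomedi"]
first_halo = find_reference(headers, "halo")
exact_enough = find_reference(headers, "Halomari")
```

The example shows why a short query such as “halo” can silently select an unintended organism, which is the reason for the advice above to check sequence names by hand.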

Structural information (i.e. basepair type) is allowed only for basepair lists. These are lists comprising two columns of nucleotide numbers, like the one shown in Figure 14. As explained in Chapter 1, basepair types are named by reference to the interacting edges of the nucleotides.

These are the Watson-Crick edge (W), Hoogsteen edge (H), and sugar edge (S) (Figure 1). Edge-to-edge interactions may in addition be cis (c) or trans (t) with respect to the glycosidic bond (Figure 4). This gives rise to twelve main families of basepairs (Leontis et al. 2001; Leontis et al. 2002), and some intermediate families (Auffinger et al. 1999; Lemieux et al. 2002), only one of which currently has a characterized isostericity matrix (the bifurcated cWW/tWH family). These thirteen basepair types are coded as follows in order for Ribostral to understand them (case insensitive for all except tSs, because it is a directional interaction with an asymmetric isostericity matrix): cWW, tWW, cWH, tWH, cWS, tWS, cHH, tHH, cHS, tHS, cSS, tSs, and bif.

Asymmetric codes (such as cWS) can be reversed, so in the table if nucleotides 22 and 26 in this order are coded as tSW, this is the same as 26 and 22 being tWS (see Table 3 for a definition of all the basepair codes used by Ribostral). Ribostral always presents data about basepairs as one of the thirteen codes listed above (and not the reverse form).
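The reversal of asymmetric codes can be illustrated with a short Python sketch. It is not Ribostral's own code: the function name `normalize_basepair` is hypothetical, and the choice to accept the directional tSs code only when typed exactly as "tSs" is an assumption, since the handling of a reversed tSs entry is not described here.

```python
EDGE_PRIORITY = {"W": 0, "H": 1, "S": 2}  # W before H before S


def normalize_basepair(nt1, nt2, code):
    """Return (nt1, nt2, code) with the code in one of the thirteen listed forms.

    A reversed code such as tSW is turned into tWS and the two
    nucleotide numbers are swapped accordingly.
    """
    code = code.strip()
    if code == "tSs":                      # directional, case sensitive
        return nt1, nt2, "tSs"
    if code.lower() == "bif":
        return nt1, nt2, "bif"
    if len(code) != 3 or code[0].lower() not in ("c", "t"):
        raise ValueError(f"Unrecognized basepair code: {code!r}")
    orientation = code[0].lower()
    edge1, edge2 = code[1].upper(), code[2].upper()
    if edge1 not in EDGE_PRIORITY or edge2 not in EDGE_PRIORITY:
        raise ValueError(f"Unrecognized basepair code: {code!r}")
    if orientation == "t" and edge1 == edge2 == "S":
        # Assumption of this sketch: only the exact spelling is accepted
        raise ValueError("tSs is directional; write it exactly as 'tSs'")
    if EDGE_PRIORITY[edge1] > EDGE_PRIORITY[edge2]:
        nt1, nt2 = nt2, nt1                # reverse the pair
        edge1, edge2 = edge2, edge1
    return nt1, nt2, orientation + edge1 + edge2


print(normalize_basepair(22, 26, "tSW"))
```

For the example in the text, nucleotides 22 and 26 coded as tSW come back as 26 and 22 coded as tWS.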

Table 3. Basepair codes used by Ribostral. The last column shows the codes used for constructing the structural mask in Ribostral Alignment Viewer, discussed later in text.

If the type of basepairing is known for a pair of bases, the output of the sequence analysis will be colored to indicate its isosteric subfamilies, so the investigator can determine whether the aligned

nucleotides for each sequence represent isosteric or near-isosteric substitutions. If no structural information is available, the table output will only display sequence substitution data observed in the corresponding columns of the alignment.

After successfully reading a nucleotide list file, Ribostral displays a new button allowing for the analysis of the whole list at once (Figure 15).

Figure 15. The main GUI after successful loading of a nucleotide list file. The status bar (upper right hand corner) displays “NT list loaded” and the new button “NT list analysis” becomes available.

Analyzing a List of Nucleotides

Upon activation of the button “NT list analysis” the program takes each row of the Excel NT list file and counts the number of each substitution of its nucleotides in corresponding columns of the alignment. This process may take a few minutes for long lists of nucleotides or for large alignments. The MATLAB command prompt (or the DOS prompt if the compiled stand-alone version of Ribostral is executed) displays different messages and counters indicating the status and progress of the analysis. Upon successful execution, output files are created in the folder

“\Output” (refer to Figure 6), with descriptive names indicating the NT list file and the alignment file they represent. At the same time a new button providing a quick link to the main output file will also appear on the main GUI.


Types of Output Files and their Interpretations

Depending on the type of input NT list, several types of output files are possible (Figure 16). The

following sections discuss each type of these output files in detail.

Figure 16. Output files produced depending on input files analyzed. The first output file in each case is the “main” file accessed directly by clicking the button “Display list results” on the main GUI. Outputs indicated by asterisks (*) are produced optionally by changing the program preferences.

Output Files for a Non-basepair List

If the input NT list is not a list of basepairs but a list of nucleotides forming distinct motifs (such as the one shown in Figure 13), two text output files will be created, one giving counts and the other percentages of sequences that share a common pattern (Figure 17). The patterns are listed in alphabetical order. To clarify this, consider the following example from Figure 17. The first entry in the Figure (highlighted) shows that in the archaeal part of the alignment, nucleotides corresponding to Haloarcula_marismortui local numbers 545, 611, 529, and 14 (in this order) are 88% GUGC and 12% GUGo. From this one can deduce that nucleotides 545, 611, and 529 are GUG in 100% of the cases, and so on. The values in the percent output file are rounded according to the preferences of the user (Ribostral preferences will be discussed in detail later).
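The pattern tally described above can be sketched in Python as follows. This is an illustration on a toy alignment rather than Ribostral's MATLAB implementation; the helper names `local_to_column` and `pattern_percents`, and the use of "-" as the gap character, are assumptions.

```python
from collections import Counter


def local_to_column(ref_seq, gap_chars="-"):
    """Map 1-based local numbers of the reference to 0-based alignment columns."""
    mapping = {}
    local = 0
    for column, char in enumerate(ref_seq):
        if char not in gap_chars:
            local += 1
            mapping[local] = column
    return mapping


def pattern_percents(sequences, ref_seq, local_numbers, decimals=0):
    """Percent of sequences showing each pattern at the requested positions.

    Letters are concatenated in the order the local numbers are given,
    and patterns are returned in alphabetical (code point) order.
    """
    mapping = local_to_column(ref_seq)
    columns = [mapping[n] for n in local_numbers]
    counts = Counter("".join(seq[c] for c in columns) for seq in sequences)
    total = sum(counts.values())
    return {p: round(100.0 * counts[p] / total, decimals) for p in sorted(counts)}


# Toy alignment (not real data): the reference has a gap in column 2
reference = "G-UGC"
alignment = ["G-UGC", "GAUGC", "G-UGo", "G-UGC"]
print(pattern_percents(alignment, reference, [1, 2, 3, 4]))
```

In this toy case three of four sequences read GUGC and one reads GUGo, so the first three positions are GUG in every sequence, the same kind of deduction made for Figure 17.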

Figure 17. Percent output text file created upon sequence analysis of an NT list. Sequence patterns are listed in alphabetical order. A similar output file with counts instead of percents is also created. This analysis is for the NT list shown in Figure 13.


Output Files for a Basepair List

In case the Excel input NT file gives a list of basepairs from an atomic resolution structure,

Ribostral generates output files that are better tuned and more informative for this kind of input.

These include colored HTML outputs in the form of tables with sequence substitution values.

The tables use different background colors to indicate for each basepair whether observed substitutions are isosteric, near-isosteric, heterosteric, or forbidden as compared to what is observed in the structure. By default, three HTML output files (and a fourth text output in case the input Excel list is a basepair list with structure information) are produced upon the analysis of a basepair list (see Figure 16). The main output file accessed directly by the “Display list results” button on Ribostral is the percent covariation file whose name ends with “_COV_Percents.html”.

A similar “_COV_Counts.html” file shows the counts instead of percents. Figure 18 displays a snapshot of such an output. A third file (ending with “_SEQ.html”) displays the names of organisms giving rise to the values in the first two output files. The positions of the names correspond to the values observed, as clarified in Figure 19. Note that the basepair identity

observed in structure is indicated by bold font in the corresponding cell.


Figure 18. Counts output HTML file created upon sequence analysis of a basepair list. A similar output file with percents instead of counts is also created. This analysis is for the NT list presented in Figure 14. Note that the basepair identity observed in structure is indicated by bold font in the corresponding cell. This happened to be CG for both basepairs shown here.

Figure 19. Matching sequence values and names. The “_COV_Counts.html” output file (left panel) displays the sequence counts, and the corresponding “_SEQ.html” (right panel) displays the names of sequences. Note the value indicated by the black arrow in both outputs: The left panel shows that there are three bacterial counts for AA occupying the position A3/G21 (halomari local numbers). The right panel shows that these three counts come from the sequences of G_complan, P_palmata, and S_vulgare.

The output files have two purposes: First, to display substitution patterns corresponding to each

BP position analyzed, and second, to provide information about how well each of these positions in alignments agrees with 3D structure. Based on this, if there are any potential mistakes in the alignment they are easily pinpointed for their manual correction with available sequence editors.

All three HTML output files mentioned start with a total score assigned for each phylogenetic subgroup in the alignment (in this case the subgroups are the three domains: archaea, bacteria, and eukarya). Each subgroup will have its own results printed on a separate line in the substitution tables. The areas in Figure 19 indicated by the black arrows, for example, represent the substitutions in eukaryal sequences of the basepair A3/G21 (using halomari local numbers).

On top of each table that represents a basepair position the following information is provided:

The names of the subgroups in the order they are analyzed, the numbers of sequences in each subgroup, the count of forbidden substitutions present in the table (in red color, with the letter F following the value), the count of gaps present including gaps on both sides or just on one side

(in gray, with the letter G following the value), the count of isosteric substitutions (in blue, with the letter I following the value), nearly-isosteric substitutions (in cyan, with the letters NI following the value), non-isosteric but not forbidden substitutions that we refer to as heterosteric substitutions (in pink, with the letter H following the value), and finally individual scores of each subgroup at that basepair position.

The score is a value describing how structurally “acceptable” the sequence substitutions at analyzed basepair positions are. This is based on the measure of how structurally compatible these substitutions are with the basepair in the reference organism, which is typically the organism with known structure. The individual basepair scores are currently calculated based on an ad hoc formula derived from experience with structures and knowledge of the patterns of allowed substitutions for each type of basepair. This formula can be easily modified by the user, by manually entering desired weight parameters in the file “\Ribostral\SCORES.txt”. The formula used throughout this text is:

Individual BP score = c × (3I + 2NI – H – 2F – 2G1 – 3G2) / number of sequences


Where c is the correction coefficient: c = 100 / (Highest positive weight), in this case c =100/3.

I, NI, H, F, G1 and G2 are the counts of sequences having substitutions that are isosteric, nearly-

isosteric, heterosteric, forbidden, gap on one side, or gaps on both sides respectively. The

Highest positive weight is 3 in this case. The correction coefficient c ensures that the maximum

score is +100 (in this case, the minimum score when all substitutions are gaps on both sides is

–100, but that is not always the case depending on the formula used). The formula used here is asymmetric: unfavorable terms that contribute to its reduction are more numerous and weigh slightly more than favorable terms that contribute to its increase. But for our purposes, this is not a critical point. What is important is to easily identify low-scoring spots in the alignment to guide manual realignment efforts. To better locate these trouble spots in the alignment, the score is printed in red if it is worse than a certain threshold (i.e. if it is below zero, which is the score in case no structural data is available) and in black otherwise. The total score printed at the top of each HTML output file is an adjusted sum (sum divided by number of basepairs studied) of the individual BP scores in the study (so it is actually an average of the individual scores). This means that its maximum is also +100 and its minimum, in case the formula presented here is used, is –100.
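As a worked illustration of the formula above, the following Python sketch computes the individual and total scores with the weights used in this text. It is not the Ribostral source; the function names and the dictionary layout are assumptions, and it assumes every sequence falls into exactly one of the six classes.

```python
# Weights from the formula in the text: 3I + 2NI - H - 2F - 2G1 - 3G2
DEFAULT_WEIGHTS = {"I": 3, "NI": 2, "H": -1, "F": -2, "G1": -2, "G2": -3}


def individual_bp_score(counts, weights=DEFAULT_WEIGHTS):
    """Score one basepair position from its class counts.

    counts maps class labels (I, NI, H, F, G1, G2) to numbers of
    sequences; missing labels count as zero.
    """
    n_sequences = sum(counts.get(label, 0) for label in weights)
    if n_sequences == 0:
        raise ValueError("No sequences were counted at this position")
    highest = max(w for w in weights.values() if w > 0)
    weighted = sum(w * counts.get(label, 0) for label, w in weights.items())
    # Equivalent to c * weighted / n_sequences with c = 100 / highest
    return 100.0 * weighted / (highest * n_sequences)


def total_score(individual_scores):
    """Average of the individual basepair scores."""
    return sum(individual_scores) / len(individual_scores)


print(individual_bp_score({"I": 39}))           # all isosteric
print(individual_bp_score({"G2": 39}))          # all double gaps
print(individual_bp_score({"I": 30, "F": 9}))   # mixed position
```

Because the weights are passed in as a parameter, changing them plays the same role in this sketch as editing the weight parameters in “\Ribostral\SCORES.txt”.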

Note that the formula used for RNA automatic alignment discussed in Chapter IV is:

Individual BP score = log (2I + NI – 3F) / number of sequences

Both formulas are ad hoc and the difference between their weights is not essential. The log operation used in the latter case is to reflect the probabilities of having the different isosteric classes. The ease of reading and understanding normalized values between +100 and –100 or –150 is the reason for using a correction coefficient in the earlier formula.

The most obvious and important element of the HTML output files discussed thus far is the

background color patterns that appear in their tables. These colors reflect the isostericity matrices of the thirteen characterized families of basepairs. Each family has its own pattern of isosteric,

nearly-isosteric, heterosteric, and forbidden substitutions. Boxes with the same colors are isosteric. Boxes that are nearly-isosteric with each other have “similar” colors. There are five

groups of these “similar” colors: pink/red/orange, red/orange/yellow (so substitutions with pink

and yellow backgrounds are not nearly isosteric to each other), blue/cyan, dark green/light green,

and brown (Figure 5). Forbidden boxes are colored in dark gray, and gaps in light gray. Boxes

with non-similar colors indicate that they are heterosteric to each other. If the basepair family for

a certain basepair is not specified, all nucleotide boxes in its table will be colored in neutral beige

(like the second BP listed in Figure 18).

One more output file is produced by default in case a basepair input list with structure information is analyzed. This is the “_Statistics.out” file, which is best viewed with Microsoft

Excel or WordPad. This output summarizes some of the results by stating the percentage of basepairs analyzed that have mostly allowed substitutions (containing <10% forbidden or gaps), the percentage of basepairs that have >10% forbidden substitutions, and the percentage of basepairs having >10% of sequences with gaps at their positions. It also shows the percentage of basepairs among the ones mostly allowed that have only isosteric substitutions. Unlike the total score, which is just one value that describes the quality of the alignment studied, this output file gives a more detailed quantitative measure of the alignment quality.

When a Basepair list is analyzed, a button labeled “Plot scores” also appears on the main GUI.

When activated, this allows the user to quickly analyze the scores of each basepair and define places of misalignment or motif swaps (Figure 20).

Figure 20. Score plots for the archaeal 5S rRNA alignment. The red curve describes the scores of corresponding basepairs, and the blue line represents the average score (or total score as defined in this text) for the whole alignment based on the basepair list provided. By modifying the ad hoc formula according to which scores are calculated (this can be done by changing parameters in the “SCORES.txt” file), the user can plot any combination of one or more aspects of the alignment, such as percent isosteric substitutions, percent isosteric and near-isosteric substitutions, and so on.

Ribostral Alignment Viewer

There are additional output files not produced by default that can be obtained by changing the

Preferences under the File menu options. These files are not produced by default, so as not to overwhelm the casual user with too many output files and not to extend the execution time unnecessarily.

If the “AlignViewer” option is selected in Preferences, an HTML-format alignment viewer is created where the sequence alignment is shown in colors indicating substitutions that are compatible with the 3D structure and those that are not. This tool is the first of its kind taking into consideration all thirteen families of basepairs instead of just the classical cis Watson-Crick family. The sequences in the Alignment Viewer are colored in a way to describe how well each column (more precisely, each pair of columns) in the alignment agrees with structure (Figure 21). Basepairs from each sequence are colored individually based on their isosteric agreement with the homologous basepair from the reference sequence, which is usually the sequence with known 3D structure. If the substitution pattern is isosteric to the one in the reference sequence it is colored green, nearly-isosteric is blue, heterosteric is pink, forbidden is red, gap on one side is dark gray, and gap on both sides is light gray. If no basepair information is available the nucleotides are printed in black (these color assignments can be modified by changing the MATLAB script file “mColorCode.m”). For easy interpretation, the color scale is printed at the top of the Alignment Viewer HTML file. Thanks to this tool, it is now possible to look at the sequence alignment and directly get a good idea about whether the alignment is structurally valid or not, and where the major areas for local alignment mistakes (or motif swaps) are located. Note that if base triplets are present (i.e., two basepairs with one nucleotide in common), the basepair listed earlier in the Excel BP list file takes priority in coloring. If the user prefers to give priority to cWW interactions, for example, then these should be listed first in the BP list.

Figure 21. Ribostral HTML Alignment Viewer. The visible part of the 5S alignment describes how well sequences agree with structure, represented here by Eu_Halomari, or sequence number 11. The color code is printed at the top. Note that the whole content including colors can be copy/pasted into Excel or other editors for manipulation.

Ribostral Alignment Viewer starts with several title lines containing the names of the analyzed files as well as the legend of all codes and colors used. Following this is the listing of all organism names and their sequences, exactly as they appear in the original FASTA alignment.

Organism names change their color gradually as they approach the limit of the subgroup or domain they belong to. This is how it is made clear that sequence 39 for example is the end of the archaea domain. The first five sequences (those on black background) represent the universal numbers, local numbers, and three types of structural masks that describe basepairing patterns.

Universal numbers are assigned to each character seen in the sequence, including all indels (“–”).

Local numbers are the numbers that correspond to the reference sequence; unlike universal

numbers, local numbers are not assigned to indels in this sequence. In the Alignment Viewer

only decimal representatives of local numbers are listed, i.e. the first “1” stands for 10, the first

“2” stands for 20, and so on until the first “0” is seen, which stands for 100. After that the numbering cycle is repeated, so the second “1” stands for 110, the second “2” stands for 120, and so on.
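The two numbering schemes can be illustrated with a short Python sketch. It is not taken from Ribostral; the function name `numbering_rows` is hypothetical, "-" is assumed as the indel character, and printing a blank in every column that does not carry a multiple of ten is an assumption about the layout.

```python
def numbering_rows(ref_seq, gap_chars="-"):
    """Return (universal numbers, local-number ruler) for a reference row.

    Universal numbers count every column, indels included. The ruler
    shows a single digit where the local number is a multiple of ten:
    "1" for 10, "2" for 20, ..., "0" for 100, then "1" again for 110.
    """
    universal = list(range(1, len(ref_seq) + 1))
    ruler = []
    local = 0
    for char in ref_seq:
        if char in gap_chars:
            ruler.append(" ")      # indels receive no local number
            continue
        local += 1
        if local % 10 == 0:
            ruler.append(str((local // 10) % 10))
        else:
            ruler.append(" ")
    return universal, "".join(ruler)


# Toy reference row: nine bases, one indel, then 95 more bases
toy_reference = "A" * 9 + "-" + "A" * 95
universal, ruler = numbering_rows(toy_reference)
print(ruler)
```

In the toy row the indel shifts every later local number one column to the right relative to a gapless sequence, which is exactly the difference between local and universal numbers.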

The structural masks describe basepairing patterns reported in the Excel BP list directly on top of the sequences. Figure 22 is a schematic description of the information represented by structural masks.

Figure 22. Schematic explanation of structural masks. These are one dimensional representations of structure. The example shown here is from Helix 95 containing the sarcin/ricin motif in the large ribosomal subunit of H. marismortui (pdb file 1S72). Color codes correspond to basepairs. Note that the 13D and 3D masks may have several symbols overwriting each other; only one of the cyan RG and ][ symbols displayed in this Figure on top of each other is displayed in the 13D and 3D masks produced by Ribostral.

The 2D mask describes only nested basepairings, irrespective of their geometric families (most

of these would be helical cWW basepairs). Each nested basepair (one which does not cross in 2D

with other basepairs) is represented by the two parentheses symbols “(” for its opening or 5’

nucleotide and “)” for its closing or 3’ nucleotide. The 3D mask is the same as the 2D mask, but

in addition to representing nested basepairs it also represents other non-nested basepairs using

the bracket symbols “[” and “]”. When one nucleotide forms more than one basepair by using more than one of its edges at once (e.g. in base triplets), the number of bracket symbols of the opening and closing type may not be equal. Therefore, it is not possible from this mask alone to know which opening brackets correspond to which closing brackets. Another limitation of both the 2D and 3D masks is that they do not tell anything about the type of basepairs formed. The 13D mask is designed to solve this problem. Here, instead of just one or two pairs of symbols representing opening and closing of basepairs, thirteen different pairs of symbols are used to represent all known families. These codes are AB for the cWW family, CD for the tWW family, and so on (see Table 3). Since some families are not symmetric (C/G cWS has a different meaning than C/G cSW), uppercase symbols and lowercase symbols are used to indicate the direction of the interaction. The symbols must be read from uppercase to lowercase direction whenever the interaction is asymmetric, with the uppercase symbol referring to the base edge of higher priority (edge priority is assigned in the order W, H, S). Thus, if an interaction exists between columns (universal numbers) 38/42 and is represented by kL, this means that nucleotides occupying these positions form a tSW interaction in the source organism (or structure). This is the same as saying that 42/38 is a tWS (or Lk) interaction. The 13D mask has the same limitation as the 3D mask, in that finding matching opening and closing symbols is not always straightforward. Using the 2D mask together with the other two masks makes it easier to at least find nested pairs.
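The construction of the 2D and 3D masks can be sketched as follows in Python. This is an illustrative sketch under stated assumptions, not Ribostral's code: the function names are hypothetical, "." is used as an assumed filler for unpaired columns, and when a nucleotide takes part in several basepairs the pair listed later simply overwrites the earlier symbol.

```python
def crosses(pair_a, pair_b):
    """True if two basepairs cross each other in a 2D drawing."""
    (i, j), (k, l) = sorted(pair_a), sorted(pair_b)
    return i < k < j < l or k < i < l < j


def structural_masks(length, pairs):
    """Build the 2D and 3D masks for 1-based basepair positions.

    Nested pairs (crossing no other pair) get "(" and ")" in both masks;
    the remaining pairs get "[" and "]" in the 3D mask only.
    """
    mask_2d = ["."] * length
    mask_3d = ["."] * length
    for a, pair in enumerate(pairs):
        five_prime, three_prime = sorted(pair)
        nested = not any(
            crosses(pair, other) for b, other in enumerate(pairs) if b != a
        )
        if nested:
            mask_2d[five_prime - 1], mask_2d[three_prime - 1] = "(", ")"
            mask_3d[five_prime - 1], mask_3d[three_prime - 1] = "(", ")"
        else:
            mask_3d[five_prime - 1], mask_3d[three_prime - 1] = "[", "]"
    return "".join(mask_2d), "".join(mask_3d)


# Toy example: two nested pairs and two pairs that cross each other
toy_pairs = [(1, 10), (2, 9), (3, 7), (5, 8)]
mask_2d, mask_3d = structural_masks(10, toy_pairs)
print(mask_2d)
print(mask_3d)
```

In the toy example the pairs 3/7 and 5/8 cross, so they appear only in the 3D mask, and the two opening brackets cannot be matched to their closing brackets from the mask alone.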

Additional Output Formats

Ribostral can also create other output files upon request (by checking the option “GU special” in Preferences). These files represent another way of classifying some similar isosteric groups together, the way the G/U wobble basepairs are analyzed in Chapter III. Ribostral scripts can easily be modified to produce other similar outputs.

Interactive Analysis of Nucleotides

Besides analyzing lists of nucleotides or basepairs at once and producing dedicated output files that describe their results, Ribostral is also capable of analyzing nucleotides of interest one by one and directly displaying their results on screen. This can be done by activating the button

“Interactive analysis” from the main GUI, which starts a new GUI window (Figure 23). This

option becomes possible only after successful loading of a FASTA alignment file. If a BP list file

is also loaded before the interactive analysis is activated, the new GUI will have more options and information extracted from that file. In the following text we will discuss the options

available in case a BP list is loaded before activating the interactive analysis GUI.


Figure 23. Initial screenshot of the interactive analysis GUI. Here, both a FASTA file and an NT list file were loaded from Ribostral main GUI before starting this GUI. If only a FASTA file was loaded, some of the options or buttons seen here would not have been made visible.

The interactive analysis tool provides a further array of functions not offered by the list analysis tools discussed in previous sections. After specifying the right choices (such as source organism name, domain of interest, etc.), the user can either analyze a specific family of basepairs, or analyze a particular position.

Interactive Analysis of a Family of Basepairs

The right-hand side of the interactive analysis GUI contains the options and buttons capable of gathering statistics about a whole family of basepairs at once. This option is only possible if the user has loaded a basepair list before opening this GUI. An example of what can be done here is the analysis of all the occurrences of the tHS family in the archaeal part of the 5S rRNA

alignment. The result of such an analysis is shown in Figure 24.

Figure 24. Analysis of all occurrences of a basepair family. All basepairs of family 10 (tHS) are analyzed here in the archaeal 5S rRNA alignment.

The GUI shown in Figure 24 states on the top right the number of sequences in the chosen domain (in this case the domain archaea has 39 sequences). Below that, the number of basepairs forming this particular interaction in the source organism is shown. In this case six such tHS basepairs are found. The isosterically colored buttons that appear in the middle of the GUI display the substitutions in the sequence positions corresponding to all six of these basepairs at once. Notice how most of them are clumped in the yellow isosteric subfamily (the colors are the same as defined previously for the HTML output files). Upon clicking on any of these colored buttons the names of sequences giving rise to them (preceded by the corresponding nucleotide numbers in the source organism) will be displayed in the lower part of the GUI (Figure 25). This allows for easy identification of organisms with potential mistakes in their alignments, so that the investigator can realign them by hand.

Figure 25. Getting sequence names with specific substitution patterns. Upon clicking any button in the substitution matrix (button GU is clicked here), the names of sequences giving rise to the substitutions become displayed in the lower part of the GUI. Sequence names are preceded by the corresponding nucleotide numbers from the source organism.


Interactive Analysis of Individual Basepairs or Motifs

In addition to analyzing a family of basepairs, it is possible through this GUI to analyze a

particular position by entering individual numbers separated by commas “,” or dashes “-” into the editable box in the middle of the GUI. But first, the source of local nucleotide numbers has to be specified. If this GUI is activated after a BP or NT list is loaded, then the sequence name of the first entry will be initially displayed in the white editable box and gray drop-menu box that define the source sequence for the analysis (upper left corner of the GUI). Both of these boxes can be used to specify the source organism, but in case the sequence names in these two boxes are different, then the white editable box takes priority. If universal numbers are desired the word

“universal” should be typed in the white editable box and carriage return pressed. If only two nucleotide numbers separated by “,” are entered (such as “22, 26”), the sequence substitutions for these two positions are displayed in the same output format seen for the basepair family analysis described in the previous section. The value in the box that corresponds to the source organism will be printed in bold. If the nucleotide numbers entered are present in the input Excel

BP list and the interaction they make is known, the buttons will be colored according to their BP family isostericity matrix. The family name also will be printed on the upper left corner. If any of the buttons is clicked the names of sequences giving rise to the value in it are displayed in the lower part of the GUI (Figure 26). (Note: it is possible to scroll between the basepairs from the input BP list one by one by clicking the green buttons labeled “<<” or “>>” that appear in the middle of the GUI.)

Figure 26. Basepair interactive analysis. Notice that if the query is “22, 26” instead of “26, 22” the same result is obtained, since this is a known BP present in the input BP list and it is always colored in the same way. The BP identity present in the source organism (also the crystal structure) is CG (printed in bold font).

If instead of a basepair position (such as “22, 26”), the user enters only one, or more than two numbers separated by commas, or if the user enters two numbers separated by a dash, or any logical combination of comma-separated and dash-separated numbers, then a motif sequence analysis will be performed and results will be produced in a different format, where the motif pattern and its number of occurrences will be printed in text instead of the colored table. The pattern present in the source sequence will also be indicated by the “<<<” sign. Once again, if the user clicks on any one of the nucleotide patterns, the names of sequences giving rise to them will be displayed in the lower portion of the GUI (Figure 27).

Figure 27. Motif interactive analysis. Individual nucleotides in the motif are arranged in the order of the positions entered. Since the query here is “26-22, 37”, the letters that correspond to local number 26 of the source organism are displayed first, then those that correspond to 25, then 24, 23, 22, and finally 37. If the query were “22-26, 37” instead, the result would be displayed differently.
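The way a query string expands into an ordered list of positions can be sketched in Python as follows. The sketch only illustrates the behavior described above; the function names `parse_positions` and `is_basepair_query` are hypothetical and the parsing details are assumptions.

```python
def parse_positions(query):
    """Expand a query such as "26-22, 37" into an ordered list of positions.

    Commas separate entries; a dash denotes a range that is walked in
    the direction it was typed, so "26-22" runs downward.
    """
    positions = []
    for token in query.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start, end = (int(part) for part in token.split("-"))
            step = 1 if end >= start else -1
            positions.extend(range(start, end + step, step))
        else:
            positions.append(int(token))
    return positions


def is_basepair_query(query):
    """True for exactly two comma-separated numbers, e.g. "22, 26"."""
    tokens = [t.strip() for t in query.split(",") if t.strip()]
    return len(tokens) == 2 and all("-" not in t for t in tokens)


print(parse_positions("26-22, 37"))
print(parse_positions("22-26, 37"))
print(is_basepair_query("22, 26"))
```

The two example queries cover the same nucleotides but yield them in a different order, which is why the motif letters in Figure 27 would be displayed differently for each.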

Finally, the interactive GUI provides a quick link to a basic structure viewer through the activation of the button “Display 3D”. When this button is clicked the first time it prompts the

user to choose the PDB file corresponding to the sequence studied, and atomic coordinates are

read (the bigger the PDB file the slower this process). The structure of the input nucleotides is

then shown. Subsequent activations of the button within the same session display the structure

immediately.


Ribostral Preferences

Ribostral is a versatile program capable of dealing with data in a variety of ways. Through its preferences, users can tailor the program to the investigation at hand. Ribostral preferences can be accessed from the File menu of the main GUI. They are then saved as a data file on the user's disk, for easy passing between all subordinate Ribostral GUIs. The last positions of GUIs before exiting them are also saved in this file, so that GUIs always open in the position in which they were last closed. Figure 28 shows the default preferences of the program.

Figure 28. Ribostral default preferences.

The preferences are divided into several categories. Round radio buttons belonging to the same

category are mutually exclusive. The first two categories, “Display Totals” and “Display

Expected”, affect only the list output files and do not affect the interactive analysis GUI. If

“Display Totals” is enabled, the totals of each row and column of the BP list output are printed

with the output (as in Figure 18). Otherwise, the totals will be left out. If “Display Expected” is enabled, the expected value for each substitution is printed in parentheses after the observed substitution value, as in Figure 29.

Figure 29. Substitution counts with expected values (in parentheses).

The expected value for each box in the table is obtained by applying the formula:

Expected (of box in row a, column b) = (Sum of boxes in row a × Sum of boxes in column b) / Table total

This value is a measure of the likelihood of observing substitutions in the specific box, based on

the values observed in the neighboring boxes.
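The expected-value formula can be illustrated with a small Python sketch on a toy table. It is not Ribostral's code; the function name `expected_counts` and the representation of the substitution table as a list of rows are assumptions.

```python
def expected_counts(table):
    """Expected value of each box: row sum x column sum / table total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(column) for column in zip(*table)]
    total = sum(row_sums)
    return [[rs * cs / total for cs in col_sums] for rs in row_sums]


# Toy 2x2 substitution table (not real data)
observed = [[10, 0],
            [0, 30]]
for row in expected_counts(observed):
    print(row)
```

In the toy table all observations sit on the diagonal, whereas the expected values spread them over every box, so the difference between observed and expected values is large.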

Enabling the “Audio” option allows the program to play specific musical chimes every time an

operation is completed or an error occurs. This is useful in case a large file is being read or

analyzed. In such cases the user will be notified by the sound to carry on to the next step.

Enabling “Consider Gaps” means that the program will not ignore insertion characters (“–”) seen in the alignment. Disabling this option not only stops the program from showing gaps in the output, it also excludes gaps from all other calculations, such as score or percent substitution calculations. The “consider Ns” option does the same thing but with characters representing undetermined nucleotides (usually symbolized by “o”, “O”, “n”, and “N”, which as stated earlier are all converted to “o” when the FASTA alignment file is originally read). These two options do not affect analysis of non-basepairs, such as base triples or longer motifs.

Several rounding options are available through a dropdown menu. The choice here affects the presentation of all decimal calculations done by Ribostral. Activating any of the “Additional

Output” options results in the production of one or two more output formats that Ribostral does not produce by default (discussed earlier). Preferences are saved only if the “SAVE” button is activated.

Finally, upon activating the button “Modify Score Parameters”, the file containing score parameters (“\Ribostral\SCORES.txt”) is opened for editing. The parameters entered in this file affect all score calculations. The resulting formula is shown on screen and is also printed out in some output files every time sequence analysis is carried out.

Supporting Tools

Ribostral provides an expandable sequence and structure analysis platform. Additional tools can be easily and smoothly integrated into the main program. There are currently five additional tools that allow Ribostral to do operations beyond structurally analyzing sequence alignments (Figure 30). Each of these tools is discussed separately below.

Figure 30. Ribostral tools.

Tool 1: Generate BP List from PDB

This tool currently works only on PC platforms. PDB files are text files containing all the atomic coordinates and related information about a structure. By analyzing these data, interacting bases can be identified and classified into geometric families (Sarver, M., Zirbel, C., Stombaugh, J., Mokdad, A., and Leontis, N.B., submitted). Upon activation of this tool, the user first chooses a PDB file (browsing starts from the default PDB directory associated with Ribostral, “\PDB_structures”). Then, the user is prompted to enter the name of the sequence in the alignment that corresponds to this 3D structure. Ribostral then creates an Excel BP list for each of the basepair families found in the crystal structure. It also creates a separate Excel BP list containing all of them, and another Excel BP list containing all the non-cWW basepairs together. These lists are created in the proper format to be read directly by Ribostral and are saved in the default “\NT_lists” folder. Note that the automatic classification of basepairs is not 100% accurate, so the resulting automatic lists should be visually compared to the 3D structure and corrected where needed.

Tool 2: Align Sequences

This tool opens a GUI that interfaces with automatic motif alignment programs based on a hybrid Stochastic Context Free Grammars/Markov Random Field (SCFG/MRF) model. The GUI facilitates some of the steps in running these programs, and allows the user to parse and align motifs that have their SCFG/MRF nodes already characterized based on known 3D structure (see

Chapter IV for the details of this process). Currently, MATLAB script files still need to be modified in order to parse and align new motifs. This tool mainly provides an easily expandable user-friendly platform for applying automatic sequence alignment.

Tool 3: Extract Parts of a FASTA File

This tool, shown in Figure 31, requires loading a FASTA file from the Ribostral main GUI before it can be activated.

Figure 31. GUI for extracting parts of a FASTA file.

The user then simply needs to indicate the motif of interest by entering its nucleotide numbers in

the editable box. All positions homologous to these numbers will be extracted from the original

alignment and saved as a new sub-alignment in FASTA format. If there are one or more commas

separating two or more groups of numbers, the sub-alignment will represent these breaks in

continuity by four consecutive dots (....). This continuity separation code is understood by

other Ribostral tools, such as the alignment parser tool discussed in the previous section.

Naturally, nucleotide numbers refer to the source or reference organism specified in the GUI.

Similarly to the interactive analysis GUI, in case there is disagreement between the sequence

name in the white editable box and that in the gray drop-menu box, the white editable box takes

precedence. The user can choose to ignore up to five organisms from the top of the sequence

alignment (for instance, if these are structure masks or sequences that do not belong to the same

phylogenetic group), and can choose to extract one domain or all domains at once. The extracted

sub-alignment is saved in the same directory as the original source alignment file.
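The extraction step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Ribostral's MATLAB code: the names parse_ranges and extract_motif are hypothetical, nucleotide numbers are assumed to count the non-gap characters of the reference sequence starting from 1, and only columns occupied in the reference are extracted.

```python
def parse_ranges(spec):
    """Turn '53-56,356-358' into [[53, 54, 55, 56], [356, 357, 358]]."""
    groups = []
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        groups.append(list(range(int(low), int(high or low) + 1)))
    return groups


def extract_motif(alignment, reference, spec, gap="-", breaker="...."):
    """Extract the columns homologous to reference nucleotide numbers.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    reference: name of the sequence whose numbering is used (hypothetical API).
    """
    # Map each reference nucleotide number to its alignment column.
    column_of = {}
    number = 0
    for column, symbol in enumerate(alignment[reference]):
        if symbol != gap:
            number += 1
            column_of[number] = column

    column_groups = [[column_of[n] for n in group] for group in parse_ranges(spec)]
    # Comma-separated groups are joined by the four-dot continuity break.
    return {
        name: breaker.join("".join(seq[c] for c in cols) for cols in column_groups)
        for name, seq in alignment.items()
    }
```

For example, with the reference "AC-GU" and the specification "1-2,4", every sequence contributes its first two columns, then four dots, then its last column.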

If there is more than one motif to be extracted, the user can create a tab-delimited input text file

containing the names of all of them and their nucleotide numbers for batch processing. This input text file would look like this:

h5IL	53-56,356-358
h6IL	64-69,99-103
h7IL1	128-131,231-233
h7IL2	133-136,227-229

The numbers in such a file are then extracted from the alignment the same way as if they were

entered one by one in the GUI editable box. An advantage here is that the user can assign

specific names to the sub-alignments containing these motifs, instead of the otherwise generic

names derived from nucleotide numbers extracted.
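Reading such a batch file can be sketched in Python. The layout assumed here (one motif per line, with the name and the nucleotide numbers separated by a tab) is inferred from the description above, and the function name parse_motif_file is hypothetical; Ribostral's own MATLAB reader is not shown.

```python
def parse_motif_file(text):
    """Parse tab-delimited motif definitions into {name: [(start, end), ...]}.

    Assumed layout: '<motif name><TAB><ranges>', ranges separated by commas.
    """
    motifs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, numbers = line.split("\t", 1)
        ranges = []
        for part in numbers.strip().split(","):
            start, _, end = part.strip().partition("-")
            ranges.append((int(start), int(end or start)))
        motifs[name.strip()] = ranges
    return motifs
```

Each motif name can then be used to label the corresponding sub-alignment file.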

Tool 4: Merge & Remove Repeats from FASTA Files

This tool opens a new GUI from which the user can browse for a minimum of one and a maximum of three input FASTA files. Each file read is checked for individual sequence lengths and for repeated sequence names. If some sequences are shorter than others, gap characters (“–”) are added to the ends of the short ones. This prevents some errors when the alignments are analyzed. If name repeats are found (the name meant here is the first separate word after the “>” sign in a FASTA header line), only the sequence with the most valid nucleotide letters (A, C, G, and U) is kept (it is possible to change this default criterion and instead choose the sequences with the shortest internal gaps; this can be done by changing the value of the variable “Criteria” in the script file mFastaPrinter.m). The resulting “unique” alignment represents a more divergent set of phylogenetic taxa, which may be more meaningful to analyze. The “unique” FASTA files are saved in the same directory and under the same names (with descriptive suffixes) as the original FASTA files they are created from. If more than one original file was supplied to this tool, a merged file containing all of them is also created.
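The padding and repeat-removal rules can be sketched in Python. This is a hedged illustration of the default criterion only; the function names are hypothetical, ties are resolved in favor of the first sequence encountered (an assumption), and the actual logic lives in the MATLAB script mFastaPrinter.m, which is not reproduced here.

```python
def pad_to_equal_length(records, gap="-"):
    """Append gap characters so that all sequences have the same length.

    records: list of (header, sequence) tuples.
    """
    width = max(len(seq) for _, seq in records)
    return [(header, seq.ljust(width, gap)) for header, seq in records]


def remove_repeats(records, valid="ACGU"):
    """Keep, for each name, the sequence with the most valid nucleotide letters.

    The name is the first whitespace-delimited word of the header
    (a leading '>' is ignored).
    """
    best = {}
    for header, seq in records:
        name = header.lstrip(">").split()[0]
        score = sum(seq.upper().count(letter) for letter in valid)
        if name not in best or score > best[name][0]:
            best[name] = (score, header, seq)
    # Dictionaries preserve insertion order, so first-seen names come first.
    return [(header, seq) for _, header, seq in best.values()]
```

Padding is applied before repeat removal here so that the surviving sequences already share one length; the order of the two steps in Ribostral is not stated in the text.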

Tool 5: Create .fasta from .mat

This is a simple tool that allows the user to open a “*.mat” file representing a sequence alignment in MATLAB data format and recreate the “*.fasta” file it was originally created from.

If the file contains more than one subgroup (or phylogenetic domain), separate FASTA files containing the sequence of each of them, in addition to one containing them all together, are created and saved in the same directory as the input “*.mat” file.


Help Menu

Under the help menu, the manual, website, and version information of Ribostral can be found.

To check whether an updated version of the program is available, the user can compare the local version number to the version number listed on the Ribostral website.

Practical Uses of Ribostral

Ribostral is not just a user-friendly program that does some sequence analysis and reflects some information about the structure of RNA molecules. It is a powerful tool specifically designed and used successfully to improve RNA sequence alignments. It has been designed and improved throughout different stages of our research to become the useful and dynamic tool it is today. We used Ribostral to locate major areas of misalignment in the RFAM 5S and 23S rRNA alignments (Griffiths-Jones et al. 2003; Griffiths-Jones et al. 2005). Although this database of RNA alignments is currently considered the most accurate (the method used by RFAM is discussed in Chapter IV), mistakes were found by Ribostral and later corrected manually. An automatic aligner has subsequently been developed and is discussed in detail in Chapter IV, but Ribostral remains the main method for structurally evaluating all these alignments and, accordingly, for improving the algorithms that produce them. Ribostral was also used for evaluating and helping in the realignment of archaeal and bacterial loop E motifs, and of other internal loop and hairpin loop motifs present in 16S rRNA. Finally, Ribostral was the main tool used for the sequence analysis of G/U wobble basepairs, discussed in detail in Chapter III (Mokdad et al. 2006).

CHAPTER III: STRUCTURAL AND EVOLUTIONARY CLASSIFICATION OF G/U

WOBBLE BASEPAIRS IN THE RIBOSOME

(This work has been published in Nucleic Acids Research and was awarded the cover of the issue (Mokdad et al. 2006)).

Introduction

Comparisons of basepair frequencies and positions in conserved RNA sequences, such as rRNA, can lead to important predictive models for other RNAs. Since the sequencing of tRNA in the 1960s, comparative sequence analysis has been used extensively to infer the common secondary structures of homologous RNAs. More recently it has been applied to infer the locations of tertiary interactions that stabilize folded RNA 3D structures. In several cases the inferred tertiary interactions have been applied to construct 3D models for biologically active RNAs, which subsequently were verified by X-ray crystallography. Recent examples include RNase P and the Group I intron (Cate et al. 1996; Massire et al. 1997; Tsai et al. 2003; Adams et al. 2004). In this work we explore the potential of tertiary interactions made by G/U wobbles to be used in similar ways.

The cis Watson-Crick (WC) G/U basepair is the most common non-classical basepair present in

RNA. It was first recognized by Crick in 1966 in the context of the tRNA – mRNA anticodon – codon interaction (Crick 1966). Subsequently, G/U’s at specific positions have been shown to be essential for the function of various RNAs (Varani et al. 2000). Critical functional roles also have been inferred for highly conserved G/U basepairs found near the active sites of certain ribozymes, such as the Group I introns (Cate et al. 1996). X-ray crystal structures subsequently confirmed the role of the conserved G/U in binding substrate residues in activated conformations using the shallow (minor) groove pocket formed by U(O2), U(O2’) and G(N2) of the conserved G/U basepair.

The first systematic study of the cis WC G/U basepairs was done by Gutell et al. who

investigated G/U variations in rRNA among broadly divergent phylogenetic taxa and classified

them into several types according to their sequence conservation (Gutell et al. 1994). The

“invariant type” is composed of conserved G/U’s in all phylogenetic domains. The “dominant

type” is composed predominantly of G/U’s that show mainly U/G replacements with some

classical WC replacements as well. Finally, the “non-typical type” changes infrequently and almost always becomes an A/C. In a similar way, Gautheret et al. classified the G/U basepairs according to their phylogenetic patterns combined with their positions in the secondary structures. These authors recognized that at least 50% of WC-paired positions in 16S and 23S rRNAs contain more than 1% G/U in sequence alignments, and about 10% of all paired positions display 50% or more G/U substitutions. They also noted that most of these positions are often substituted by various classical WC pairs (A/U, U/A, G/C, or C/G) but some highly conserved

G/U’s have specific variation patterns – such as G/U to U/G or G/U to A/C (Gautheret et al.

1995).

The studies by Gutell and Gautheret and their coworkers were carried out before the X-ray

structures of the ribosome were available, and thus did not consider the role of tertiary

interactions in constraining the sequences at these positions. In the present paper, we analyze

phylogenetic substitution patterns of the G/U basepairs in different structural contexts. The

crystal structures of the ribosome have revealed at least two types of highly conserved tertiary interactions that involve the G/U shallow groove pocket: the shallow groove Packing interaction (P-interaction) reported earlier by Gagnon and Steinberg (Gagnon et al. 2002), and the Phosphate in shallow-groove-pocket interaction (phosphate-in-pocket interaction).

The P-interaction is an interaction between two basepairs. Most commonly, one of these basepairs is a cis WC G/U, and the other a cis WC C/G. It somewhat resembles a “type 0” A-minor motif (Nissen et al. 2001), but with better and deeper packing of the interacting helices along the G/U shallow groove pocket (Figure 32a). Both the P-interaction and the “type 0” A-minor motif are variants of ribose zippers, initially described in the P4-P6 domain of group I introns (Cate et al. 1996). Our analysis of sequence alignments shows that P-interactions have the highest G/U conservation among all interactions involving G/U. In this context, we identified a novel quadruple compensatory mutation (a compensatory mutation involving not two but four nucleotides at once) between a G/U and another basepair, apparently reinstating the G/U and the other basepair in the optimal orientation for this interaction to occur. Molecular dynamics (MD) simulations were carried out on eight motifs containing P-interactions in order to assess their relative stabilities and the optimal orientation of the participating basepairs.

Figure 32. Shallow groove pocket interactions formed by G/U basepairs. a. P-interaction between the four nucleotides represented by A, B, C, and D. This interaction is optimal when one of the basepairs is cis WC G/U, resulting in the five H-bonds indicated (I1-I5). b. Phosphate-in-pocket interaction. c. O2’-in-pocket interaction. Panels a and b are from PDB file 1S72, panel c is from PDB file 1J5E. Produced by DeepView (Guex et al. 1997).

The other interaction leading to high G/U conservation is the phosphate-in-pocket interaction.

Here, a phosphate is embedded deep in the G/U shallow groove pocket (Figure 32b), in analogy with the sulfate ion interaction in the same pocket (Masquida et al. 1999). This kind of interaction can help in the folding process of complex RNA molecules. What is intriguing about the P-interaction and the phosphate-in-pocket interaction is that there is evidence from the structures of the large ribosomal subunit that the contacts they make are highly conserved, even when the local motifs forming them change. One G/U that forms a P-interaction in E. coli and D. radiodurans is replaced by a GNRA loop in H. marismortui. However, the tertiary contacts are still present between the equivalent areas of the structures. Another G/U making a phosphate-in-pocket interaction in E. coli is a GNRA loop in both H. marismortui and D. radiodurans. Similar tertiary contacts are retained that prevent any major differences in the overall folds of the structures. These complex patterns of motif swaps prove the precedence of tertiary over secondary structure in their particular contexts. Figure 32c shows a third type of shallow groove pocket interaction which also is associated with highly conserved G/U’s. This is the O2’-in-pocket interaction resulting from a sugar O2’ coming from outside the plane of G/U and binding at the G/U shallow groove pocket.

There is sequence evidence that some shallow groove pocket interactions may be transient, occurring only at particular stages of the dynamic conformational changes that the ribosome passes through during protein synthesis. This is possible because of the relative similarity between the P-interaction and “type 0” A-minor motif, or the phosphate-in-pocket interaction and other phosphate interactions occurring near the G/U pocket. The P-interaction in particular can be used for improving sequence alignments because of its specific GU/CG signature that is relatively easy to locate within sequence. Using this information, we were able to refine the sequence alignments at several positions where a P-interaction takes place. The aims of this work were: 1. to structurally and phylogenetically characterize all G/U basepairs in rRNA and classify their tertiary interactions; 2. to study in detail the interactions that are highly specific for G/U and describe their sequence signatures; and, 3. to use this structure information for improving available sequence alignments at specific positions where tertiary interactions take place.

Materials and Methods

X-ray crystal structures were obtained from the Nucleic Acid Database (NDB) and the Protein

Data Bank (PDB) (Berman et al. 1998; Berman et al. 2000) and were visualized in 3D with the

DeepView (Swiss-PdbViewer) program (Guex et al. 1997). We studied all cis WC G/U

basepairs in the 16S rRNA of the bacterium Thermus thermophilus (Tt) – PDB files 1IBM,

3.31Å resolution, and 1J5E, 3.05Å (Wimberly et al. 2000), and in the 23S/5S rRNA structures of the archaeon Haloarcula marismortui (Hm) – PDB files 1JJ2 and 1S72, both at 2.4Å (Ban et al.

2000) and the bacterium Deinococcus radiodurans (Dr) – PDB files 1KPJ and 1LNR, both at

3.10Å (Harms et al. 2001). In addition, all Escherichia coli (Ec) cis WC G/U basepairs were studied in recent 70S structures – PDB files 2AVY, 2AW4, 2AW7, and 2AWB, solved at 3.5Å

(Schuwirth et al. 2005). Structures with bound mRNA and tRNA or substrate analogues also were examined (Nissen et al. 2000; Ogle et al. 2001; Hansen et al. 2002; Ogle et al. 2002;

Schmeing et al. 2002; Schmeing et al. 2003; Murphy et al. 2004; Murphy et al. 2004).

The alignments for the 16S-like and the 23S-like rRNA were obtained from the European ribosomal RNA database (Wuyts et al. 2001; Wuyts et al. 2002). The 16S-like rRNA alignments included 220 archaeal, 4475 bacterial, and 5248 eukaryal unique sequences (sequences with only one representative from each species). The 23S-like rRNA alignments included 24 archaeal, 184 bacterial, and 137 eukaryal unique sequences. The 5S-like seed RFAM rRNA alignment was

used in our study (Griffiths-Jones et al. 2003; Griffiths-Jones et al. 2005). It contains 37 archaeal, 336 bacterial, and 222 eukaryal unique sequences. Genomic tRNA alignments were obtained from the Compilation of tRNA sequences and sequences of tRNA genes (Sprinzl et al.

1998), and they are composed of 317 archaeal, 1768 bacterial and 141 eukaryal unique

sequences. The sequence analysis was carried out with the program Ribostral (Chapter II).

Our study included all cis WC G/U basepairs seen in the high resolution crystal structures of the

ribosome and its subunits. Further, other basepairs were considered if they had >50% G/U

substitutions in sequence alignments of archaeal or bacterial sequences (as these proved to be

more trustworthy than eukaryal alignments). Secondary structures from the Comparative RNA

Web Site (Cannone et al. 2002) were first used to identify all of these pairs, but after examining

each of them in crystal structures, it was determined that some were not cis WC-paired or were

not paired at all, so they were excluded from this study (see Table 4 for a list of basepairs

omitted in this way). This process produced a final list of 80 basepairs from 16S, 186 basepairs

from 23S, and 7 basepairs from 5S rRNAs.
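The >50% G/U inclusion criterion can be illustrated with a short Python sketch. The function name percent_gu is hypothetical, and the exclusion of pairs containing a gap is an assumption made for the example; the actual calculation was done with Ribostral (Chapter II).

```python
def percent_gu(sequences, col1, col2, gap="-"):
    """Percent of G/U plus U/G pairs at two alignment columns (0-based).

    Pairs in which either position is a gap are excluded (an assumption).
    Returns None when no pair can be counted.
    """
    pairs = [(seq[col1].upper(), seq[col2].upper()) for seq in sequences]
    counted = [pair for pair in pairs if gap not in pair]
    if not counted:
        return None
    hits = sum(1 for pair in counted if pair in {("G", "U"), ("U", "G")})
    return 100.0 * hits / len(counted)


def passes_gu_threshold(sequences, col1, col2, threshold=50.0):
    """True when the G/U plus U/G content exceeds the threshold."""
    value = percent_gu(sequences, col1, col2)
    return value is not None and value > threshold
```

A pair of columns would be added to the study under this sketch when passes_gu_threshold returns True for the archaeal or the bacterial alignment.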

16S rRNA
E. coli (BP NT1 NT2) | Status in 3D structure
GU 21 13 | Not basepaired
GU 388 375 | Not basepaired
GU 690 697 | Not basepaired
GU 606 632 | Not basepaired
AU 1036 1025 | Not basepaired
GU 1053 1205 | Not basepaired
GU 1177 1159 | Not basepaired
GU 1529 1506 | Not basepaired

23S rRNA
H. marismortui (BP NT1 NT2) | D. radiodurans (BP NT1 NT2) | E. coli (BP NT1 NT2) | Status in 3D structure
GU 246 265 | Motif change | Motif change | tSH
GC 1592 1602 | GU 1504 1515 | CG 1489 1500 | Not basepaired
GU 1745 2034 | GU 1684 1976 | GU 1667 1993 | Not basepaired
GC 1855 1876 | GU 1790 1812 | GU 1799 1820 | Not basepaired
GU 1970 1966 | GC 1912 1908 | GC 1929 1925 | tWW
GU 1976 2003 | GC 1918 1945 | GC 1935 1962 | Not basepaired
AA 1991 1997 | GU 1933 1939 | GU 1950 1956 | Not basepaired
GC 2192* 2187* | GU 2131* 2126* | GC 2148 2143 | Not basepaired
GU 2288 2282 | GU 2234 2228 | GU 2255 2249 | Not basepaired
GU 2525 2495 | GU 2469 2439 | GU 2490 2460 | Not basepaired

Table 4. G/U’s reported in secondary structures but not observed in 3D structures. Asterisk (*), absent (not resolved) in the corresponding crystal structure.

MD analysis was carried out on eight systems each composed of two 24 nucleotide-long helices coming together and forming a P-interaction in their center. The first of these systems, extracted from PDB file 1S72, contains a G/U packed with C/G (P-interaction, abbreviated as GU P CG).

This central P-interaction was then modified using InsightII to obtain the following eight

different combinations: GU P CG (original), AC P CG, GU P GC, AC P GC, GC P CG,

GC P GC, AU P CG, and AU P GC. The explicit solvent MD simulations were carried out using

the AMBER7.0 program package (Case et al. 2002) with the parm99 Cornell et al. force field

(Cornell et al. 1995; Cheatham et al. 1999; Wang et al. 2000). The RNA was solvated in a

rectangular box of TIP3P waters (Jorgensen et al. 1983) and neutralized by a minimal number of sodium cations (Aqvist 1990) initially placed by the LeaP module at points of favorable electrostatic potential close to the RNA. The standard protocols (Razga et al. 2005) were used

for the equilibration and production simulations, performed by the Sander module of

AMBER7.0. The production runs were carried out at 300 K with constant-pressure periodic

boundary conditions and the particle mesh Ewald method (Essmann et al. 1995) applied. The

MD trajectories were then analyzed using the ptraj and carnal modules of the AMBER7.0

package and our own scripts and visualized by the program VMD (Humphrey et al. 1996).

Results

Tables 5 and 6 are compilations of the secondary structure features and tertiary interactions of all

the G/U basepairs that occur in crystal structures of 16S rRNA and 23S/5S rRNA, respectively.

Interactions are classified according to their location, i.e. in shallow groove or in deep groove.

Shallow groove interactions are subdivided further into those that involve the G/U pocket

(Shallow Groove Pocket, SGP) and those that do not (Shallow Groove not in Pocket, SGNP).

Deep groove interactions are subdivided into those that involve the deep groove pocket, forming

H-bonds with GO6 and UO4 simultaneously (DGP), and those that do not (Deep Groove not in

Pocket, DGNP). The best structure to observe each interaction is also indicated. In most cases,

the Tt and Hm structures prove to be the best for observing 16S and 23S/5S interactions,

respectively. The 23S Dr and 70S Ec structures are at slightly lower resolutions and are mainly

used for comparison with the main structures and for inferring motions or motif swaps, as well as

observing basepairs that are not G/U’s in Tt or Hm. Table 5 also displays the conservation of the

G/U in sequence alignments at each of the positions. The main goal for our classification was to

test whether the interactions in the G/U shallow groove pocket are more specific for G/U than others occurring at the periphery of this basepair.
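The four interaction classes defined above can be written as a small decision rule. The following Python sketch is illustrative only: the function names and the atom labels "G:O6" and "U:O4" are hypothetical conventions, and in this study the assignments were made by inspecting the crystal structures rather than by a program.

```python
# Atoms that must both be H-bonded for a deep groove pocket (DGP) contact.
DEEP_POCKET_ATOMS = {"G:O6", "U:O4"}


def contacts_deep_pocket(bonded_atoms):
    """True when the partner H-bonds to G(O6) and U(O4) simultaneously."""
    return DEEP_POCKET_ATOMS <= set(bonded_atoms)


def classify_interaction(groove, in_pocket):
    """Return SGP, SGNP, DGP, or DGNP for a G/U tertiary interaction.

    groove: "shallow" or "deep"; in_pocket: whether the pocket is involved.
    """
    if groove not in ("shallow", "deep"):
        raise ValueError("groove must be 'shallow' or 'deep'")
    prefix = "SG" if groove == "shallow" else "DG"
    return prefix + ("P" if in_pocket else "NP")
```

For instance, a deep groove contact that reaches only U(O4) would be labeled DGNP, because the pocket definition requires both acceptor atoms.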

16S rRNA (E. coli numbering); A, B, E = Seq %GU+UG
# | BP NT1 NT2 | Secondary structure | Interactions | A | B | E
1 | GU 22 12 | h1 | SGNP: A913 type1 A-minor; Mg in DGP (Tt) | 7 | 99 | 1
2 | GU 15 920 | Part of | SGNP: UO3'-G1081O2' (Tt) | 99 | 99 | 98
3 | CG 396 45 | h4 | None | 60 | 28 | 1
4 | CG 52 359 | h5, flanking IL | None | 0 | 1 | 0
5/c | UG 105 62 | h6 | SGP: U-pack-C379 (Tt) | 100 | 99 | 96
6 | UA 70 98 | h6 | None | 18 | 13 | 13
7 | GU 76 93 | h6 | None | 0 | 20 | 4
8 | GC 79 90 | h6 | None | - | 10 | 18
9 | GU 122 239 | h7, flanking 4WJ | None | 0 | 46 | 1
10 | AU 236 125 | h7 | SGNP: G-C121 tHW (Tt) | 90 | 75 | 0
11 | GA 232 129 | h7, flanking bulge | SGNP: G-A263 tSW (Tt) | 0 | 2 | 3
12 | GU 226 137 | h7 | None | 0 | 3 | 3
13 | UG 157 164 | h8 | Mg in DGP (Tt) | 100 | 59 | 54
14 | Shorter helix | h9, flanking 4WJ | None | 4 | 20 | 0
15 | GU 198 219 | h10, flanking 4WJ | SGNP: G-U173 cSH (Ec) | 5 | 13 | 2
16 | GU 201 216 | h10, flanking hairpin | None | 8 | 73 | 7
17 | GU 275 249 | h11, flanking bulge | SGP: U252O2' (Tt) | 93 | 99 | 97
18 | GU 258 268 | h11 | None | 0 | 7 | 1
19 | GU 293 304 | h12, flanking bulge | None | 0 | 56 | 4
20/a | GU 301 296 | h12, flanking hairpin | SGP: U-pack-C556 (Tt) | 100 | 99 | 99
21 | GU 376 387 | h15, flanking IL | Mg in DGP (Tt) | 27 | 99 | 3
22 | GU 433 409 | h16, flanking IL | None | 32 | 42 | 1
23 | GU 416 427 | h16, flanking IL | SGNP: UO2'-G541 Phosphate, near pocket (Tt) | - | 77 | 67
24 | GU 417 426 | h16 | SGNP: Residues 36-45 of S4 prot; GNH2-G540O2', near pocket (Ec) | - | 93 | 2
25 | GU 454 479 | h17, inside IL | None | - | 19 | 40
26 | GU 474 458 | h17 | None in Ec, shorter helix in Tt | - | 19 | 0
27 | GC 761 580 | h20, flanking IL | None | 2 | 5 | 0
28/d | GU 584 757 | h20, flanking IL | SGP: U-pack-C879 (Tt) | 98 | 100 | 18
29 | GU 650 589 | h21 | None | 5 | 19 | 26
30 | UG 593 646 | h21 | None | 11 | 63 | 11
31 | GU 645 594 | h21, flanking bulge | None | 47 | 56 | 10
(32) | GC 601 637 | h21 | None | 13 | 75 | 17
33 | AU 635 603 | h21 | None | 5 | 7 | 2
34 | GU 633 605 | h21 | SGP: G126 Phosphate (Tt) | 61 | 91 | 25
35 | GU 615 625 | h21 | None | 6 | 31 | 8
36 | UA 662 743 | h22 | None | 0 | 26 | 4
37 | GU 666 740 | h22, flanking bulge | SGNP: Ser52 of S15 prot (Tt) | 48 | 96 | 3
38 | GU 734 672 | h22, flanking 3WJ | None | 0 | 53 | 2
39 | GU 713 677 | h23, flanking IL | RNA-protein bridge (B7b); SGNP: A777 type0 A-minor (Tt) | 5 | 74 | 94
40 | GU 683 707 | h23 | SGNP: Gly37 of S11 prot (Tt) | 1 | 97 | 6
41 | GU 778 804 | h24, flanking bulge | SGNP: GO2'-Arg120O1 of S11 prot (Tt) | 65 | 96 | 2
42 | AU 855 831 | h26 | None | 100 | 63 | 3
43 | GU 832 854 | h26 | SGNP: GO2'-G724 Phosphate, near pocket (Tt) | 94 | 87 | 1
44 | CG 853 833 | h26 | SGNP: GNH2-G725 Phosphate, near pocket (Tt) | 96 | 89 | 97
45 | GU 852 834 | h26 | None | 28 | 18 | 1
46 | GU 851 835 | h26 | SGNP: GNH2-C744O2' (Tt) | 1 | 20 | 1
47 | GU 849 837 | h26 | SGNP: UO4'-G745O2' (Ec) | 17 | 32 | 4
48 | GU 836 850 | h26 | SGNP: GNH2-C745O3' (Tt) | 4 | 86 | 41
49 | GU 886 911 | h27 | DGNP: UO2P-Arg97NH2 of S12 prot; G1489O2' near SGP; Mg in DGP (Tt) | 98 | 96 | 100
50 | GU 894 905 | h27, flanking IL | SGNP: UO2'-U244 Phosphate (Tt) | 98 | 99 | 96
51 | GU 895 904 | h27 | None | 0 | 39 | 3
52 | GU 925 1391 | h28, flanking bulge | SGNP: GNH2-A1503 Phosphate, near pocket; Mg in DGP (Tt) | 100 | 100 | 99
53 | GU 927 1390 | h28, flanking bulge | SGNP: GNH2-U1532 Phosphate, near pocket; Mg in DGP (Tt) | 100 | 100 | 1
54 | GU 942 1341 | h29 | SGNP: GNH2-Gln124OE1 of S9 prot (Tt) | 100 | 100 | 99
55 | GU 1231 950 | h30 | SGNP: GNH2-G971 Phosphate, near pocket; DGNP: UO4-Thr105OG1 of S13 prot (Tt) | 100 | 80 | 99
56 | GU 1006 1023 | h33, flanking 3WJ | None | - | 14 | 17
57 | UG 1009 1020 | h33 | None | - | 20 | 25
58 | UA 1017 1012 | h33, flanking hairpin | None | - | 7 | 25
59 | GU 1206 1052 | h34 | SGNP: UO2'-A1055 Phosphate, near pocket; residues 190-194 of S3 prot (Tt) | 100 | 100 | 100
60 | GU 1058 1199 | h34, flanking bulge | SGNP: G-G1202 tSH; Mg in DGP (Tt) | 99 | 99 | 99
61 | GU 1074 1083 | h36, flanking 3WJ | SGNP: A1101 type1 A-minor (Tt) | 24 | 100 | 99
62 | GU 1099 1086 | h37, flanking 3WJ | None | 0 | 95 | 2
63 | GU 1185 1115 | h38 | None | 0 | 73 | 38
64 | GU 1184 1116 | h38, flanking 3WJ | None | 13 | 72 | 6
65 | GU 1242 1295 | h41 | SGNP: G-U1302 tSW (Tt) | 47 | 90 | 2
66 | GU 1290 1247 | h41, flanking IL | None | 1 | 63 | 7
67 | GU 1371 1351 | h43 | SGNP: GO3'-Gly69CA of S9 prot; DGNP: UO4-Lys118CE of S9 prot (Tt) | 66 | 99 | 70
68 | GU 1486 1414 | h44 | RNA-RNA bridge (B3) | 0 | 60 | 1
69 | GU 1415 1485 | h44 | RNA-RNA bridge (B3) | 0 | 96 | 0
70 | GU 1419 1481 | h44, flanking IL | RNA-RNA bridge (B5) | 15 | 64 | 1
71 | GU 1422 1478 | h44 | None | 6 | 50 | 2
72 | GU 1423 1477 | h44 | None | 76 | 47 | 2
73 | GU 1475 1425 | h44 | RNA-RNA bridge (B5 & B6) | 60 | 38 | 61
74 | GU 1426 1474 | h44 | None | 30 | 33 | 4
75 | GU 1438 1463 | h44 | None | 8 | 31 | 3
76 | GU 1461 1440 | h44 | SGNP: G-A1441 tSH (Tt) | 7 | 85 | 1
77 | GU 1458 1444 | h44 | None in Ec, shorter helix in Tt | 0 | 13 | 0
78 | GU 1457 1445 | h44 | None in Ec, shorter helix in Tt | 0 | 67 | 33
79 | GC 1525 1510 | h45 | None | 0 | 2 | 0
80/e | GU 1523 1512 | h45 | SGP: U-pack-A768 (Tt) | 91 | 97 | 98

Table 5. (Legend follows.)

Table 5 legend. Interactions of the cis WC G/U basepairs in 16S rRNA. Data based on the crystal structures of T. thermophilus and E. coli (initials of the structure best for observing each interaction appear in parentheses in the interactions column). Basepairs with interactions are shaded. The letters (a, b,…) in the first column that mark basepairs forming P-interactions (indicated by -pack-) are also used in Table 8 and Figure 38. Basepair number (32) is not G/U in any crystal structure, but is included in the study for having > 50% GU content in bacterial sequences. The last three columns indicate GU+UG (%) content at those positions in sequence alignments of archaea (A), bacteria (B), and eukarya (E). A hyphen (-) in these columns means that sequence alignments are all gaps (insertions) at the corresponding location. Abbreviations: h, helix; IL, internal loop; WJ, way-junction; SGP, shallow groove pocket; SGNP, shallow groove not in pocket; DGP, deep groove pocket; and DGNP, deep groove not in pocket. See Supplementary Table S3 for the 23S/5S rRNA analysis.

23S H. marismortui D. radiodurans E.coli Seq %GU+UG # BP NT1 NT2 BP NT1 NT2 BP NT1 NT2 Secondary structure Interactions A B E 1 UG 12 531 GU 15 525 GU 15 525 H2 None in Ec; not basepaired in Hm or Dr 43 97 7 2 GC 13 530 GU 16 534 CG 16 524 H2 SGNP: CO2'-U612O3' (Hm) 33 11 10 3 GU 17 526 CG 20 530 CG 20 520 H2 Water in both pockets 19 2 0 4 GU 21 522 GC 24 526 GC 24 516 H2, flanking IL Water in both pockets 29 0 1 5 GU 64 55 GC 67 58 GU 68 59 H6, flanking hairpin SGP: A70 Phosphate (Hm) 78 64 18 6 GU 71 107 GU 74 110 GU 75 112 Inside 3WJ SGP: A67 Phosphate (Hm) 100 92 11 7 UA 103 74 GC 106 77 GU 108 78 H7 None 4 37 27 8 Absent helix Shorter helix GU 168 158 H10, flanking hairpin None 0 95 60 9 GU 206 233 UA 212 239 UA 235 262 H13, flanking junction SGNP: UO2'-A437O2'; water in both pockets (Hm) 18 0 1 10 GU 229 210 CU 235 216 GC 258 239 H13, flanking IL SGNP: G683 type0 A-minor; water in both pockets (Hm) 77 43 2 11 GU 275 374 Motif change Motif change H16, flanking bulge Water in DGP 32 13 3 12 GC 362 290 UA 365 297 GU 356 284 H18, flanking IL None 23 5 13 13 Motif Change CG 298 364 GU 285 355 H18 None 4 26 18 14 Motif Change GC 351 301 GU 350 290 H18, flanking bulge None 0 22 20 15 Motif Change UA 302 360 GU 291 349 H18 None 22 25 6 16 UA 310 321 GU 314 325 GC 303 314 H19 None 4 3 2 17 GC 320 311 CG 324 315 GU 313 304 H19 None 4 22 4 18 GC 384 405 GU 388 412 GU 375 399 H21, flanking 4WJ None 17 62 12 19 GC 386 403 UA 390 410 GU 377 397 H21 None 0 67 3 20 GU 387 402 CG 391 409 CG 378 396 H21 SGNP: GNH2-A2264O2'; water in SGP; Na in DGP (Hm) 79 1 82 21 GC 388 401 GU 392 408 GU 379 395 H21 SGNP: UO2-Thr92OG1 of L15E prot (Hm) 0 58 1 22 CG 399 390 GU 406 394 CG 393 381 H21 None 0 0 63 23 CG 414 426 GC 421 432 GU 408 419 H22, flanking bulge None 54 58 1 24 GU 422 2445 AU 428 2387 AA 415 2407 - None 29 0 53 25 GU 499 493 GC 503 498 GC 493 487 H24, flanking hairpin SGNP: GNH2-Asn94 of L22 prot (Hm) 4 0 0 26 GU 537 2059 CG 541 2001 CG 531 2018 - Water in both pockets 25 2 
0 27 UA 616 540 GU 568 544 GU 559 534 H25 None 8 97 9 28 CG 541 615 CG 545 567 GU 535 558 H25 None 8 17 12 29 GU 544 612 GU 548 564 AG 538 555 H25 Water in DGP 29 1 31 30/A GU 545 611 GU 549 563 GU 539 554 H25 SGP: U-pack-G529 (Hm) 100 97 9 31 GC 755 650 GC 677 603 UG 664 593 H27 SGNP: GO2'-G1039 Phosphate (Hm) 4 24 1 32 GU 754 651 GU 676 604 GU 663 594 H27 SGNP: GNH2-G1038O3' (Hm) 29 90 1 33 GU 652 753 GC 605 675 CG 595 662 H27 Water in both pockets 13 14 0 34 AU 750 655 CG 672 608 GU 659 598 H27 None 4 35 38 35/C GC 657 748 GU 610 670 GU 600 657 H27 SGP: C-pack-U662 (Hm; covaries with next) 71 99 7 36/C GU 684 662 GC 634 615 CG 623 605 H28 SGP: U-pack-C748 (Hm; covaries with previous) 21 0 87 37 GC 683 663 GU 633 616 GU 622 606 H28 SGNP: GO2'-U210O2' (Hm) 38 98 2 38/D GU 740 731 GU 660 650 GU 649 639 H31 SGP: U-pack-G690; water in DGP (Hm) 100 100 61 39 GC 739 732 GC 658 652 GU 647 641 H31 SGNP: UO2'-C2350 Phosphate , near pocket (Ec) 0 85 2 40 GU 738 733 AG 657 653 UU 646 642 H31, flanking hairpin SGNP: GNH2,N3-U2384O3',O2'; water in both pockets (Hm) 73 0 0 41 GU 782 864 GU 704 784 CG 691 771 H33 Water in SGP 42 1 85 42 AU 861 785 GU 781 707 GU 768 694 H33 SGNP: A-U1488 tSW (Hm) 21 99 80 43 GU 786 860 GU 708 780 GU 695 767 H33 Water in both pockets 50 98 95 44 GC 787 859 AU 709 779 GU 696 766 H33 SGNP: UO2'-C736O2' (Ec) 0 20 4 45 GU 820 794 GU 742 716 GU 728 703 H34, flanking IL None in Ec; not basepaired in Hm or Dr 67 98 26 46 GU 798 815 AC 720 737 GU 707 724 H34, flanking IL SGNP: GNH2-A1598O2'; water in SGP; Na in DGP (Hm, not seen in Dr or Ec) 100 88 98 47 GC 802 811 CG 724 733 GU 711 720 H34 None 21 11 5 48 GU 878 872 GU 798 792 GU 785 779 H35a, flanking hairpin SGNP: A780 type0 A-minor; DGP: Arg2NH2 of L2 prot (Hm) 58 77 2 49 GU 924 919 GU 844 839 GU 831 826 H37, flanking hairpin SGNP: G-A166 tSW; water in SGP; Na in DGP (Hm) 71 100 29 50/F GU 1038 932 GU 950 852 GU 939 839 H38 SGP: U-pack-A1296; water in DGP (Hm) 65 98 10 51 CC 1033 937 GU 945 857 UA 
934 844 H38 Water in SGP 30 29 9

52 Motif change GU 861 942 CG 848 930 H38 None 4 16 2 53 GU 1024 942 GC 940 863 AU 928 850 H38 Na in DGP 57 3 26 54 AU 1018 949 GU 934 868 CG 922 855 H38 SGNP: His40, Thr60 of L21E prot (Hm) 4 3 90 55 GU 950 1017 CG 869 933 GC 856 921 H38 SGNP: GNH2-Gly580O of L21E prot; water in SGP (Hm) 54 0 20 56 CG 1015 952 GU 931 871 UG 919 858 H38, flanking IL SGNP: GNH2-A2303 Phosphate, near pocket (Hm, phosphate in pocket in Dr & Ec) 8 99 93 57 GU 1002 966 UA 919 883 GU 907 870 H38 SGNP: Phe88 of L10E prot; water in DGP (Hm) 96 31 1 58 Motif change GU 894* 907* GU 881* 895* H38, flanking bulge tRNA bridge (A site - g) 21 99 95 59 Motif change GU 895* 906* GU 882* 894* H38 tRNA bridge (A site - g) 0 79 17 60 GU 1048 1066 UA 960 979 GC 949 968 H39 Water in SGP 35 7 3 61 CG 1049 1065 GU 961 978 UA 950 967 H39 None 15 76 0 62 GU 1050 1064 CG 962 977 CG 951 966 H39 Water in SGP 40 4 0 63 GU 1052 1062 AC 964 975 GC 953 964 H39 SGNP: U-A2307 cSS (Hm) 61 17 86 64 GC 1053 1061 GU 965 974 GU 954 963 H39 SGNP: G-U2308 tSW (Hm) 52 96 5 65 CG 1060 1054 UA 973 966 GU 962 955 H39 None 0 47 1 66 GU 1265 1091 UA 1172 1004 CG 1161 993 H40 Water in both pockets 4 0 9 67 AU 1114 1249 GU 1028 1156 GC 1017 1145 H41 None 4 3 3 68 GU 1226 1136 GA 1133 1043 GA 1122 1032 H42, flanking bulge Water in both pockets 29 3 1 69 GU 1224 1139 GU 1131 1046 GU 1120 1035 H42 SGP: U1116 Phosphate (Hm) 100 99 97 70 CG 1140 1223 GU 1047 1130 GU 1036 1119 H42 None 0 80 4 71 GU 1143 1220 GC 1050 1127 AG 1039 1116 H42 Water in DGP 9 18 1 72 GU 1145 1218 CG 1052 1125 GC 1041 1114 H42 None 57 55 1 73 CG 1146 1217 GU 1053 1124 GU 1042 1113 H42 None 0 55 2 74 GC 1155 1212 GU 1062 1119 GU 1051 1108 H42, flanking IL None 22 38 0 75 UA 1270 1286 UU 1177 1197 GU 1166 1183 H45 SGNP: GNH2-U931N3 (Ec) 14 5 0 76 CG 1272 1284 AU 1179 1195 GU 1168 1181 H45 None 9 13 1 77 GU 1344 1310 CC 1253 1219 UG 1240 1206 H46, flanking IL DGP: Lys173NZ of L4 prot; water in SGP (Hm) 21 61 0 78 GU 1319 1338 GU 1228 1247 GU 1215 1234 H46, 
flanking IL SGP: G518 Phosphate (Hm) 100 99 79 79 UA 1336 1321 GC 1245 1230 GU 1232 1217 H46 None 13 50 4 80 GC 1322 1335 AU 1231 1244 GU 1218 1231 H46 None 4 39 50 81 CG 1334 1323 GU 1243 1232 AU 1230 1219 H46 SGNP: A552 type1 A-minor (Hm) 0 20 17 82 GU 1324 1333 AA 1233 1242 GC 1220 1229 H46 SGNP: GNH2-G553 Phosphate; water in both pockets (Hm) 96 2 5 83 AU 1331 1326 GC 1240 1235 GU 1227 1222 H46, flanking hairpin SGNP: GO2'-C1196O2' (Ec) 8 13 2 84 GU 1373 2052 AU 1282 1994 AU 1269 2011 - SGNP: C1431 type1 A-minor; water in both pockets (Hm) 33 27 2 85 GU 1697 1412 GC 1638 1319 GC 1622 1306 H49, flanking 4WJ SGNP: A1487 type1 A-minor; water in SGP (Hm) 4 7 1 86 UA 1440 1424 CG 1347 1331 GU 1334 1318 H50 None 0 11 1 87 UA 1461 1482 GC 1369 1388 GU 1356 1375 H52 SGNP: AO2'-G118C5'; water in DGP (Hm) 13 45 6 88 GC 1481 1462 GU 1387 1370 UG 1374 1357 H52, flanking bulge None 4 11 22 89 AU 1515 1671 GU 1419 1612 UA 1406 1596 H54 None 13 12 1 90 UA 1517 1669 UA 1421 1610 GU 1408 1594 H54 None 0 8 4 91 Motif Change AG 1603 1428 GU 1587 1415 H54, flanking IL SGNP: UO2'-C1417 Phosphate, near pocket (Ec) - 44 0 92 Motif change GU 1436 1592 GU 1422 1576 H55 SGNP: GNH2-C1498O2 (Ec) 0 87 5 93 GU 1660 1531 GG 1589 1439 GG 1573 1425 H55, flanking IL Water in both pockets 63 24 0 94 CG 1536 1649 CG 1444 1580 GU 1430 1563 H56 None 4 38 2 95 GC 1647 1538 GU 1577 1446 CG 1561 1432 H56 None 96 22 8 96 GU 1646 1539 GU 1576 1447 GA 1560 1433 H56 SGNP: GNH2-A1597N6; water in DGP (Hm) 8 11 18 97 GU 1540 1645 AA 1448 1574 AC 1434 1558 H56 SGNP: A1581 type0 A-minor; water in DGP (Hm) 71 0 3 98 GU 1546 1639 UA 1454 1567 UA 1440 1551 H56, flanking IL Water in DGP 38 45 11 99 GU 1552 1569 GU 1460 1481 GU 1445 1466 H57, flanking 4WJ SGNP: UO2',O3'-C1633O2', water in DGP (Hm) 52 65 92 100 CG 1553 1568 GU 1460 1481 CG 1446 1465 H57 SGNP: A1632 type1 A-minor; water in DGP (Hm) 0 3 2 101 AU 2739 1561 AU 2681 1473 GU 2702 1457 - SGNP: GNH2-C1462O4'; DGP: G1452O2' (Ec, not basepaired in Dr) 100 25 
0 102 Motif Change UG 1539 1484 GU 1524 1468 H58 None 4 33 0 103 Motif change Motif change GU 1471 1520 H58, flanking IL SGNP: GO2'-A1403O3' (Hm) 42 68 4 104 GC 1618 1577 AU 1534 1490 GU 1517 1474 H58, flanking bulge None 38 21 7

105 Motif Change GU 1494 1530 GU 1478 1513 H58 SGNP: GNH2-C1558 Phosphate, near pocket; Na in DGP (Ec, not basepaired in Dr) 0 48 25 106 Motif Change Motif Change GU 1510 1481 H58, flanking bulge None 4 36 13 107 GU 1608 1587 GU 1520 1500 GU 1483 1506 H58 Water in DGP 50 0 5 108 Shorter helix GC 1545 1558 GU 1529 1542 H59 None 70 21 2 109 GU 1760 1784 AU 1699 1723 GC 1682 1706 H62, flanking 4WJ SGNP: Arg80, Ala77 of L19E prot; water in DGP (Hm) 92 7 99 110 GU 1785 1807 CG 1724 1742 GU 1707 1751 H63, flanking 4WJ Water in DGP 92 29 96 111 Shorter helix Shorter helix GU 1718 1742 H63 None 50 19 7 112 GNRA replacement GNRA replacement GU 1740 1720 H63 SGP: C1550 Phosphate (Ec, replaced by GNRA interaction in Hm & Dr) 33 34 20 113 Shorter helix Shorter helix GU 1724 1736 H63 None - 20 2 114 AU 2024 1825 CG 1966 1760 GU 1983 1769 H64 SGNP: UO2'-C1958 Phosphate, near pocket, GO2'-C2606C5' (Ec) 4 26 1 115 CG 1826 2023 GU 1761 1965 GU 1770 1982 H64, flanking bulge Water in SGP 0 52 0 116 GU 1848 1883 GU 1783 1819 GU 1792 1827 H66, flanking 4WJ SGNP: U-A2011 cSS; water in DGP (Hm) 75 97 99 (117) AU 1881 1850 UA 1817 1785 UA 1825 1794 H66 SGNP: UO2'-A1941O2', Lys213 of L2 prot; water in DGP (Hm) 71 44 1 118 UA 1879 1852 GU 1815 1787 GU 1823 1796 H66 None 0 84 4 119 GU 1898 1939 GU 1834 1881 GU 1842 1898 H68, flanking IL SGNP: Ser214OG & Thr233N of L2 prot; water in SGP; Na in DGP (Hm) 88 100 99 120/E GU 1932 1907 GU 1874 1843 GU 1891 1851 H68, flanking IL SGP: U-pack-G71 of E-site tRNA (Ec) 100 100 100 121 Motif change GU 1847 1870 GU 1855 1887 H68, flanking IL None 17 38 2 122 Motif Change GC 1852 1865 GU 1860 1882 H68 None 0 49 - 123 Motif change GU 1854 1863 GU 1862 1880 H68 None 4 66 0 124/B GNRA replacement GU 1861 1856 GU 1878 1864 H68, flanking hairpin SGP: U-pack-C414 (Ec, replaced by GNRA loop interaction in Hm) 0 99 0 125/P GU 1948 1964 GU 1890 1906 GU 1907 1923 H69 SGP: U-pack-U12 of P-site tRNA (Ec) 100 99 100 126 GU 1974 2008 GC 1916 1950 GC 1933 1967 Inside 4WJ 
SGNP: GNH2-G2014O2' & UO2'-U1980N3; water in DGP (Hm) 100 8 0 127 CG 2065 2080 GC 2007 2022 GU 2024 2039 H72 None 13 22 0 128 UA 2078 2067 GU 2020 2009 AU 2037 2026 H72 SGNP: U-A1078 cSS (Hm) 0 1 0 129 CG 2087 2657 GC 2029 2601 GU 2046 2622 H73 SGNP: G-A2841 tSW; water in DGP (Hm) 0 67 87 130 GU 2093 2652 GC 2035 2596 AU 2052 2617 H73, flanking bulge SGNP: UO2'-U2064O3' & GNH2-Ser245O of L3 prot; DGNP: G-G2092 cHS; water in SGP (Hm) 63 4 0 131 GC 2272 2122 GU 2218 2064 GU 2239 2081 H75, flanking bulge None 0 67 93 132 GC 2134 2269 GC 2066 2215 GU 2083 2236 H75 None 75 66 75 133 CG 2268 2125 GU 2214 2067 GU 2235 2084 H75 tRNA bridge (E site - k) 0 1 85 134 GC 2267 2126 GC 2213 2068 GU 2234 2085 H75 SGNP: A1931 type1 A-minor; water in DGP (Hm) 0 84 13 135 GU 2128 2265 GU 2070 2211 GC 2087 2232 H75 SGNP: A1910 type1 A-minor; water in SGP (Hm) 54 1 1 136 AU 2135 2240 GU 2077 2178 AU 2094 2195 H76 None 8 46 5 137 GC 2136 2239 GU 2078 2177 AU 2095 2194 H76 None 8 14 76 138 CG 2236* 2139* GU 2174 2081 AU 2191 2098 H76 None 0 46 14 139 GU 2235* 2140* GC 2173 2082 GU 2190 2099 H76 None 17 47 13 140 GU 2141* 2234* GU 2083 2172 GU 2100 2189 H76 None 42 89 68 141 GC 2142* 2233* GU 2084 2171 AU 2101 2188 H76 None 29 32 14 142 UA 2143* 2232* GC 2085 2170 GU 2102 2187 H76 None 0 20 - 143 GU 2145* 2230* UA 2087 2168 CU 2104 2185 H76 None 33 10 4 144 GC 2148* 2227* UA 2090 2165 GU 2107 2182 H76 None 25 18 28 145 GU 2225* 2150* UU 9163 2092 UU 2180 2109 H76, flanking IL None 17 0 1 146 GC 2179* 2201* AU 2119 2138 GU 2136 2155 H78, flanking IL None 42 52 0 147 CA 2198* 2182* GU 2136 2121 CG 2153 2138 H78 None 0 14 25 148 GC 2196* 2183* CG 2135 2122 GU 2152 2139 H78 None 0 31 0 149 CG 2184* 2195* GU 2123 2134 GU 2140 2151 H78 None 17 48 0 150 GC 2254 2247 GU 2200 2185 GU 2221 2202 H79 None 4 16 2 151 Motif Change GC 2186 2199 GU 2204 2220 H79, flanking hairpin None 4 12 14 152 GU 2323 2377 GU 2268 2322 GU 2289 2343 H83 SGNP: GNH2-A2380 Phosphate; water in SGP (Hm, not basepaired in 
Dr) 13 76 97 153/I GC 2375 2325 GU 2320 2270 GU 2341 2291 H83 SGP: C-pack-C2411 (Hm) 88 98 36 154 GC 2338 2346 GU 2283 2291 GU 2304 2312 H84, flanking hairpin SGNP: Asp129 & His24 of L5 prot (Hm) 75 92 1 155 GC 2357 2366 GU 2302 2311 GC 2323 2332 H85, flanking 3WJ SGNP: C-A2369 cSS (Hm) 0 19 15 156 GU 2365 2358 GC 2310 2303 GU 2331 2324 H85 SGNP: A2370 type1 A-minor; water in both pockets (Hm) 75 89 10 157 GU 2404 2384 GC 2346 2329 GC 2367 2350 H86 SGNP: A738 type0 A-minor; DGNP: G-U2419 tHW; water in DGP (Hm) 33 14 17

158 GU 2399 2389 GC 2341 2334 CG 2362 2355 H86 Water in both pockets 50 1 87 159 AU 2398 2390 CU 2340 2335 GU 2361 2356 H86, flanking hairpin None in Ec; not basepaired in Hm or Dr 21 12 2 160 AU 2434 2457 GU 2376 2398 GU 2397 2419 H88 None 42 1 93 161 GC 2453 2439 GU 2394 2380 GU 2415 2401 H88, flanking bulge SGNP: G-A692 cSS (Hm) 13 - 9 162 GU 2451 2441 GC 2392 2383 GU 2413 2404 H88, flanking hairpin Water in both pockets 42 1 0 163 GU 2529 2492 GU 2473 2436 GUp 2494 2457 H89 tRNA bridge (A site - j); SGNP: GNH2-U1056O4; water in SGP; Na in DGP (Hm) 100 99 99 164 GU 2494 2528 AU 2438 2472 AU 2459 2493 H89, flanking bulge DGNP: G-C2693 cHS, water in SGP (Hm) 100 0 99 165 GU 2618 2541 GU 2562 2485 GU 2583 2506 H90, flanking 5WJ (pep transf) SGNP: A76 of A-site tRNA type1 A-minor; DGNP: U-G2611 tHH; water in both pockets (Hm) 100 100 100 166 GU 2543 2615 GU 2487 2559 GUp 2508 2580 H90, flanking bulge SGP: C1750 Phosphate (water mediated); Na in DGP (Hm, phosphate in pocket in Dr, absent in Ec) 100 100 100 167 GU 2613 2545 GC 2557 2489 GC 2578 2510 H90 Water in SGP 38 0 0 168 GC 2605 2549 GU 2549 2493 GU 2570 2514 H90 SGNP: GNH2-U1130 Phosphate, near pocket; water in SGP (Ec) 71 97 98 169 GU 2578 2557 GU 2522 2501 GU 2543 2522 H91, flanking bulge SGP: A2684 Phosphate; water in DGP (Hm) 96 99 98 170 CG 2561 2572 GU 2505 2516 GU 2526 2537 H91 SGNP: GO2'-U2514 Phosphate (Hm) 4 100 2 171 GU 2570 2563 GU 2514 2507 GU 2535 2528 H91, flanking hairpin SGP: C2565 Phosphate; water in DGP (Hm) 100 99 5 172 GU 2592 2586 GC 2536 2530 GC 2557 2551 H92, flanking hairpin motif (A-site) SGNP: GNH2-U1985O4; water in SGP; Na in DGP (Hm) 33 1 0 173 GU 2823 2666* CG 2768 2609 CG 2788 2630 H94, flanking junction SGNP: G-A2866 cSS (Dr, Ec displays similar interaction, U2666 absent from Hm) 13 0 31 174 GC 2817 2672 GU 2761 2615 GC 2782 2636 H94 None 8 89 4 175 GU 2683 2711 UG 2626 2652 UG 2647 2673 H95 Water in both pockets 29 84 2 176 AU 2684 2710 GU 2627 2651 GU 2648 2672 H95 None 38 - - 
177 GC 2722 2760 GU 2664 2704 GU 2685 2724 H96 None 42 89 92 178/H GU 2758 2724 GU 2702 2666 GU 2722 2687 H96, flanking IL SGP: U-pack-C2040; water in DGP (Hm) 100 100 89 179 GU 2750 2732 GC 2694 2674 GU 2714 2695 H96, flanking bulge SGP: G1814O2' (water mediated); water in DGP (Hm) 96 98 75 180/G GU 2744 2735 GU 2688 2677 GU 2709 2698 H96 SGP: U-pack-C1714; water in DGP (Hm) 100 100 10 181 GC 2770 2804 CA 2715 2749 GU 2735 2769 H97, flanking 4WJ None 0 17 21 182 GC 2845 2835 CG 2803 2793 GU 2828 2818 H99a None 0 30 1 183 AC 2856 2901 GU 2814 2853 GU 2839 2878 H100, flanking bulge None 60 96 6 184/J GU 2892 2864 GU 2844 2822 GU 2869 2847 H100, flanking IL SGP: U-pack-C2729; water in DGP (Hm) 100 100 82 185 GU 2869 2888 GU 2827 2840 GU 2852 2865 H101, flanking IL None 80 97 14 186 AU 2885 2872 GU 2837 2830 GC 2862 2855 H101 None 5 3 1 5S 1 CG 116 7 GU 115 10 GC 112 8 H1 SGNP: Asn53 of L18 prot (Hm) 46 31 71 5S 2 GC 8 115 GC 11 114 GU 9 111 H1 None 3 6 7 5S 3 GC 16 64 AC 20 67 GU 18 65 H2, flanking bulge None 5 18 1 5S 4 CG 60 20 AU 63 24 GU 61 22 H2, flanking bulge Water in both pockets 0 26 7 5S 5 CG 108 73 GU 105 76 GU 102 74 H3 Water in DGP 10 90 6 5S 6 GU 100 82 GU 99 82 GU 96 80 H3 SGP: U-pack-23S A1014; water in DGP (Hm) 100 94 79 5S 7 GU 83 99 CC 83 98 GU 81 95 H3 SGNP: Asn49 of L30 prot; water in both pockets (Hm) 51 61 6 Table 6. Interactions of the cis WC G/U basepairs in 23S and 5S rRNA. Data based on the crystal structures of H. marismortui, D. radiodurans, and E. coli (initials of the structure best for observing each interaction appear in parentheses in the interactions column). Basepairs with interactions are shaded. The letters (A, B,…) in the first column that designate basepairs forming P-interactions (indicated by -pack-) are also used in Table 8 and Figure 39. Basepair number (117) is not G/U in any crystal structure, but is included in the study for having > 50% GU content in archaeal sequences. 
The last three columns indicate GU+UG (%) content at those positions in sequence alignments of archaea (A), bacteria (B), and eukarya (E). A hyphen (-) in these columns means that sequence alignments are all gaps (insertions) at the corresponding location. Abbreviations: Asterisk (*), nucleotide absent from crystal structure; H, helix; IL, internal loop; WJ, way-junction; SGP, shallow groove pocket; SGNP, shallow groove not in pocket; DGP, deep groove pocket; and DGNP, deep groove not in pocket.


Among the 80 G/U basepairs that occur in the 16S rRNA, 48 (60%) are inside helices, and the remaining 32 (40%) are at the ends of helices or within other motifs. Among the 193 G/U basepairs of 23S and 5S rRNA, 122 (63%) are inside helices, and the remaining 71 (37%) are elsewhere. This agrees with the previous counts of 56% intra-helical and 44% at the ends of helices (Gautheret et al. 1995), although our criteria for finding the G/U pairs are slightly different. These counts indicate a slightly greater tendency for cis WC G/U basepairs to be found within helices than, for example, cis WC A/G basepairs, which have a significantly larger C1’–C1’ distance and are thus found almost exclusively at the ends of helices (Sponer et al. 2003).

Of the 80 16S basepairs, 28 (35%) form SGNP interactions, six (8%) form interactions in SGP, and two (3%) form DGNP interactions. Five basepairs form intermolecular bridges with ribosomal proteins or the large ribosomal subunit (Yusupov et al. 2001). No interactions are observed in the DGP. Instead, this area is occupied by a cation in several cases, consistent with the high electronegativity of this site (Allain et al. 1995; McDowell et al. 1996). About one half of the 16S G/U basepairs (42 out of 80) show no evidence of any tertiary interaction in the two available crystal structures (interactions with ions are not considered tertiary interactions). Similarly, of the 193 23S/5S G/U basepairs, 62 (32%) form SGNP interactions, 21 (11%) form interactions in SGP, four (2%) form DGNP interactions, and three form interactions in DGP. Seven basepairs form intermolecular bridges with tRNA (Yusupov et al. 2001). A cation binds simultaneously to GO6 and UO4 in several cases, and a water molecule occupies the G/U shallow groove pocket in about one third of the cases (only the Hm crystal structure contains water molecules). Again, about one half of the 23S/5S G/U basepairs (103 out of 193) do not form any tertiary interactions.
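The percentages quoted in the two paragraphs above can be rechecked from the raw counts. The following is a minimal sketch in Python; the helper name `percent` and the dictionary layout are illustrative assumptions, and halves are rounded up because that is the convention that matches the quoted values (for example, 2 out of 80 reported as 3%).

```python
def percent(count: int, total: int) -> int:
    """Whole-number percentage of count/total, rounding halves up (2.5 -> 3)."""
    return (200 * count + total) // (2 * total)


# Counts quoted in the text; the dictionary layout is only for illustration.
TALLIES = {
    "16S": (80, {"inside helices": 48, "SGNP": 28, "SGP": 6, "DGNP": 2,
                 "no tertiary interaction": 42}),
    "23S/5S": (193, {"inside helices": 122, "SGNP": 62, "SGP": 21, "DGNP": 4,
                     "no tertiary interaction": 103}),
}

for rrna, (total, counts) in TALLIES.items():
    for label, count in counts.items():
        print(f"{rrna}: {label}: {count}/{total} = {percent(count, total)}%")
```

Integer arithmetic is used so that exact halves such as 7.5% are not subject to floating-point or round-half-to-even effects.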

These data demonstrate almost total agreement between the statistics of the two ribosomal subunits, showing that most tertiary interactions involving G/U occur in the shallow groove. Almost all G/U tertiary and quaternary interactions in 16S Tt and Ec and in 23S/5S Hm, Dr, and Ec are seen at equivalent positions. Exceptions are those interactions that occur in a variable region that is either absent in one or two of the structures, or significantly different between them. Another possible reason for seeing interactions in some of the crystal structures but not in all of them is the presence of a proximal kink-turn (Klein et al. 2001). When the intrinsically flexible kink-turns change between open and closed conformations, different sets of tertiary contacts become possible. Kink-turns are like hinges that allow the motion of attached helices with respect to the body of the molecule. The further an interaction is from the fulcrum of the kink, the more dramatic the change in its conformation upon opening or closing of the kink (Razga et al. 2005). Since each crystal structure shows only a snapshot of such positions, it is always possible that the open conformation is observed in one structure and the closed conformation in another. An example of this is the position corresponding to Hm G798/U815 in H34 (H and h are the symbols used throughout this text to denote helices in the large and small subunit, respectively). Here, a tertiary interaction is seen in the Hm structure but not in the Dr or Ec structures. Another example is the G/U basepair corresponding to Hm G1646/U1539 in H56. A tertiary interaction is seen in the Hm structure between G1646(NH2) and A1597(NH2). This interaction is possible because GNH2 can assume a pyramidal configuration owing to the partial sp3 hybridization of the amino group (Sponer et al. 2003). No equivalent interactions are seen in Dr or Ec. Note that the limited resolution of the X-ray structures may affect the appearance of some of the interactions.
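The hinge argument above can be illustrated with simple geometry. The sketch below idealizes the helix attached to a kink-turn as a rigid arm rotating about the hinge, which is an assumption made here for illustration and not a model taken from the cited work. Under that idealization, a point at distance r from the hinge moves by the chord length 2 r sin(Δθ/2) when the hinge angle changes by Δθ. The function name and all numerical values are hypothetical.

```python
import math


def tip_displacement(distance_from_hinge: float, hinge_change_deg: float) -> float:
    """Chord length swept by a point on a rigid arm when the hinge angle changes."""
    half_angle = math.radians(hinge_change_deg) / 2.0
    return 2.0 * distance_from_hinge * math.sin(half_angle)


# Hypothetical values, chosen only to show the linear scaling with distance.
for distance in (10.0, 20.0, 40.0):  # distance from the hinge, in angstroms
    shift = tip_displacement(distance, 30.0)
    print(f"{distance:5.1f} A from hinge -> {shift:5.1f} A displacement")
```

For a fixed change in hinge angle, the displacement grows in proportion to the distance from the hinge, which is why distant contacts are the ones most likely to differ between crystal structures.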


Types of Observed Tertiary Interactions and Their Sequence Patterns

Three distinct types of shallow groove interactions involve the G/U pocket (SGP): P-interactions, phosphate-in-pocket interactions, and ribose O2’-in-pocket interactions (Figure 32). Aside from the G/U pocket, several other types of interactions occur in the shallow groove. These include phosphate single H-bond interactions, “type 0” A-minor motifs, “type I” A-minor motifs (also known as trans sugar edge/sugar edge or tSS basepairs), H-bonding to the GNH2 of a G/U or to the ribose sugars of a G/U, and other edge-to-edge interactions, the most common of which are cSS (cis sugar edge/sugar edge, also known as “type II” A-minor motif) and tWS (trans Watson-Crick/sugar edge) interactions. There are also more than twenty non-specific interactions with proteins. A few G/U basepairs participate in more than one type of tertiary/quaternary interaction at once. Some others participate in interactions that look like SGNP interactions in the crystal structure, but could easily fall into the SGP category after a subtle geometrical rearrangement (“potential” SGP interactions).

Table 7 presents the sequence analysis of G/U basepairs classified into four major categories: basepairs with SGP interactions, those with potential SGP interactions (see above), basepairs forming other types of interactions, and finally basepairs with no tertiary/quaternary interactions. The percentage of each basepair type at each position is measured after ignoring all sequences with gaps (deletions); that is, these are not percentages over all sequences. For each position, however, we also report the percentage of sequences with gaps over all sequences (shaded). If no motif swaps take place between domains or classes at these locations, then this percentage of gaps can be considered a measure of the quality of the alignments at the corresponding positions.

16S rRNA
                                    Archaea (220 sequences)               Bacteria (4475 sequences)             Eukarya (5248 sequences)
Category                             GU   AC   WC   UG   CA  NTs Gaps      GU   AC   WC   UG   CA  NTs Gaps      GU   AC   WC   UG   CA  NTs Gaps
6 interactions in SGP                84    1    4    7    0    5   10      98    0    1    0    0    2   12      68    0   14    4    3   10   34
10 potential interactions in SGP     98    0    1    0    0    0   23      92    0    7    0    0    1    2      66   11   21    0    0    2    3
22 Other interactions                34    4   55    1    0    6    8      72    0   19    2    0    7   10      25    1   41    7    0   26   10
No interactions                      13    2   67    4    2   12   27      32    1   54    5    0    7   12       6    4   58    3    2   27   38

23S & 5S rRNA
                                    Archaea (24 & 37 sequences)           Bacteria (184 & 336 sequences)        Eukarya (137 & 222 sequences)
Category                             GU   AC   WC   UG   CA  NTs Gaps      GU   AC   WC   UG   CA  NTs Gaps      GU   AC   WC   UG   CA  NTs Gaps
21 interactions in SGP               92    0    4    0    0    3    8      93    0    3    0    0    4    7      57    2   32    2    0    6   18
6 potential interactions in SGP      15    0   66    2    0   18   17      48    1   22   18    2    9    7      18    3   52   19    1    8   35
62 Other interactions                37    1   55    4    0    2    3      39    1   48    2    0    7    7      23    2   56    7    1   13   11
No interactions                      18    2   68    3    1    7    5      30    1   53    3    1   11    7      12    3   55    5    2   22   23

Table 7. Sequence analysis of cis WC G/U basepairs in 16S, 23S, and 5S rRNA. Percentages are measured after ignoring sequences with gaps (deletions). The percent of gaps is separately given in the last column (shaded) as a measure of the quality of alignments at the corresponding positions, assuming that no motif swaps take place. Abbreviations: SGP, shallow groove pocket; WC, any combination of the four classical Watson-Crick basepairs (G/C, C/G, A/U, and U/A); and NTs, any basepair other than GU, AC, WC, UG, or CA.
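The way the percentages in Table 7 are defined (basepair categories counted over ungapped sequences only, gaps counted over all sequences) can be sketched for a single pair of alignment columns as follows. This is a minimal illustration and not the program used in this work: the function names, the gap characters, and the toy alignment are assumptions.

```python
from collections import Counter

WC_PAIRS = {"GC", "CG", "AU", "UA"}   # the four classical Watson-Crick pairs
NAMED_PAIRS = {"GU", "AC", "UG", "CA"}
CATEGORIES = ("GU", "AC", "WC", "UG", "CA", "NTs")
GAP_CHARS = {"-", "."}                # assumed gap symbols


def classify_pair(a: str, b: str) -> str:
    """Map two nucleotides to one of the Table 7 column labels."""
    pair = (a + b).upper().replace("T", "U")
    if pair in NAMED_PAIRS:
        return pair
    if pair in WC_PAIRS:
        return "WC"
    return "NTs"


def column_pair_stats(sequences, i, j):
    """Return (category percentages over ungapped sequences, gap percentage
    over all sequences) for alignment columns i and j (0-based)."""
    counts = Counter()
    gapped = 0
    for seq in sequences:
        a, b = seq[i], seq[j]
        if a in GAP_CHARS or b in GAP_CHARS:
            gapped += 1
        else:
            counts[classify_pair(a, b)] += 1
    ungapped = sum(counts.values())
    percentages = {c: (100.0 * counts[c] / ungapped if ungapped else 0.0)
                   for c in CATEGORIES}
    return percentages, 100.0 * gapped / len(sequences)


# Toy alignment (hypothetical): columns 0 and 3 form the basepair of interest.
toy = ["GAAU", "GAAU", "AAAC", "-AAU"]
pct, gaps = column_pair_stats(toy, 0, 3)
print(pct, gaps)
```

In the toy alignment, three sequences are ungapped at the two columns (two G/U and one A/C), so the category percentages are taken over three sequences, while the gap percentage is taken over all four.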

In general, we observe that when G/U basepairs are not conserved they are replaced primarily by A/C and by classical WC basepairs (Table 7). Much less common are substitutions by U/G, C/A, or other non-classical basepairs, represented by “NTs” (other nucleotides). Although A/C and C/A basepairs are relatively scarce, and thus a trend is difficult to measure, G/U’s are more commonly substituted by A/C’s than by C/A’s. As mentioned above, this can be explained by isostericity (Leontis et al. 2002). This type of substitution pattern is more apparent in the category of non-interacting G/U’s, whereas G/U’s that make SGP interactions, and in some cases those that make potential SGP interactions, are much more conserved (Table 7). This indicates that these interactions are specific for G/U basepairs. G/U mutations at such locations can disturb the 3D structure, which might reduce the fitness of the ribosome and thus of the organism. One exception appears to occur in 16S rRNA, where the G/U’s with SGP interactions have 84% GU and 7% UG content in archaeal sequence alignments. This seems to violate the isostericity principle, but in fact has a simple explanation: there are only six basepairs in this category in 16S, and the 7% UG observed is due to a single position, G633/U605, which is occupied by G/U in 22% and by U/G in 39% of the sequences. This basepair makes a phosphate-in-pocket interaction, which can still take place if G/U flips to U/G, because the GNH2 that makes one of the two H-bonds in this interaction remains at exactly the same position (Figure 33). This means that a specific orientation of the G/U is not required for the phosphate-in-pocket interaction. The high gap count in some highly conserved regions, like those with specific SGP interactions, may suggest mistakes in the available alignments or motif swaps. This is more apparent in the eukaryal alignments of both subunits. For this reason, we consider the eukaryal alignments to be of lower quality than the archaeal and bacterial alignments.

Figure 33. Stereo image of G/U superimposed on U/G. The superposition is along the C1’–C1’ axis without distorting the backbone, showing that the position of the GNH2 is invariant with basepair reversal. Phosphates are also shown in the G/U pockets, connected by an arrow that shows how close they are.
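The explanation given above for the 7% UG value in the archaeal 16S SGP category can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that each category value in Table 7 is an unweighted mean of per-position percentages, and that the other five positions in the category contain essentially no U/G. The variable names are illustrative.

```python
POSITIONS_IN_CATEGORY = 6   # 16S basepairs with SGP interactions (from the text)
UG_AT_G633_U605 = 39.0      # % U/G at G633/U605 in archaea (from the text)
GU_AT_G633_U605 = 22.0      # % G/U at G633/U605 in archaea (from the text)
CATEGORY_GU = 84.0          # % G/U reported for the whole category (Table 7)

# Assumption: the other five positions contribute 0% U/G to the mean.
mean_ug = UG_AT_G633_U605 / POSITIONS_IN_CATEGORY

# G/U content that the other five positions must average for the category
# mean to equal the reported value, under the same unweighted-mean assumption.
other_gu = ((CATEGORY_GU * POSITIONS_IN_CATEGORY - GU_AT_G633_U605)
            / (POSITIONS_IN_CATEGORY - 1))

print(f"mean U/G over the category: {mean_ug:.1f}%")
print(f"implied mean G/U at the other five positions: {other_gu:.1f}%")
```

Under these assumptions the single position accounts for a category mean of 6.5% U/G, in line with the reported 7%, and the remaining five positions would average about 96% G/U, that is, they would be highly conserved.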

Contemporary sequence alignment methods depend almost exclusively on the local basepairing and covariation patterns of classical WC basepairs and some G/U basepairs, and do not take into consideration tertiary interactions that impose unique constraints. Including the tertiary interaction data in sequence alignment methods would greatly enhance the quality of those alignments, and is simple to accomplish for some particular interactions with strong signatures. An example of this is 23S Hm U731/G740, which in crystal structures superposes with Dr U650/G660, both making the same P-interaction. Sequence alignment protocols, however, do not properly align these positions. In this case, we were able to use the structure data to adjust the alignments manually.

Finally, the data in Table 7 suggest one important prediction: as indicated by their substitution patterns, we expect most 16S basepairs in the category “potential interactions in SGP” to actually form SGP interactions because of their high G/U conservation. In contrast, most 23S basepairs in the same category are expected not to form them because of their low G/U conservation.


P-interaction

The most conserved interaction that G/U basepairs make is the P-interaction. It takes place when the ribose O2’ of a nucleotide from a basepair in one helix packs deep into the G/U shallow groove pocket of a second helix, forming up to five H-bonds along the interface. One nucleotide from the first basepair makes most of the contact with one nucleotide from the second basepair. These two nucleotides are known together as the internal pair. For simplicity, we symbolize this interaction as AB P CD, where AB and CD are the basepairs from the two helices, with B and C forming the internal pair (as seen in Figure 32a). A previous study noted that the P-interaction plays important roles in tRNA binding to the 50S subunit and in translocation, and that it is phylogenetically conserved in all three phylogenetic domains of the large subunit (Gagnon et al. 2002). No prior phylogenetic work on the small subunit has been reported, and no attempt was made to divide the results according to the three domains. We exhaustively searched the 3D structures for this type of interaction using several methods, including FR3D, a structural motif search program of our design (Sarver, M., Zirbel, C., Stombaugh, J., Mokdad, A., and Leontis, N.B., submitted). Five such interactions are found in the small ribosomal subunit and thirteen in the large subunit, as reported in Table 8.

SSU  Escherichia coli             Archaea (220 sequences)               Bacteria (4475)                       Eukarya (5248 sequences)
#  A      B      C      D        GUWC GUNT WCUG NTUG WCWC NTNT Gaps    GUWC GUNT WCUG NTUG WCWC NTNT Gaps    GUWC GUNT WCUG NTUG WCWC NTNT Gaps
a  G301   U296   C556   G27        75   24    0    0    0    0    0      58   41    0    0    0    0    0      56   43    0    0    1    0    0
b  G544   C501   C549   G35         0    0   78   17    3    2    1       2    0    0    0   95    2    0       0    0    5    0   93    1    0
c  G105   U62    C379   G384       96    4    0    0    0    0    0      97    2    0    0    0    1    0       2   96    0    0    1    1    8
d  G584   U757   C879   G821       98    0    2    0    0    0    0      99    1    0    0    0    0    0       0    0    0  100    0    0   13
e  G1523  U1512  A768   C811       54   36    0    0    0   10    1      15   82    0    0    0    3    0      10   88    0    0    0    2    0

LSU, nucleotide positions in the three crystal structures
         Haloarcula marismortui          Deinococcus radiodurans         Escherichia coli
#        A       B       C       D       A       B       C       D       A       B       C       D
A        G545    U611    G529    C14     G549    U563    C533    G17     G539    U554    C523    G17
B        Replaced by GNRA inter.         G1861   U1856   C427    G2388   G1878   U1864   C414    G2409
C        G684    U662    C748    G657    G634    C615    U670    G610    C623    G605    U657    G600
D        G740    U731    G690    C695    G660    U650    C640    G645    G649    U639    G629    C634
F        G1038   U932    A1296   U909    G950    U852    G1205   C829    G939    U839    G1191   C816
G        G2744   U2735   C1714   G1704   G2688   U2677   C1655   G1644   G2709   U2698   C1638   G1628
H        G2758   U2724   C2040   G1739   G2702   U2666   C1982   G1678   G2722   U2687   C1999   G1661
E-tRNA   G1932   U1907   tC70    tG2     G1874   U1843   tG71    tC2     G1891   U1851   tG71    tC2
P-tRNA   G1948   U1964   tU12    tA22    G1890   U1906   tU12    tA23    G1907   U1923   tU12    tA23
I        G2312   C2296   A2362   U2424   A2257   U2241   A2307   U2366   A2278   U2262   A2328   U2387
J        G2375   C2325   C2411   G2416   G2320   U2270   G2353   C2358   G2341   U2291   C2374   G2379
K        G2892   U2864   C2729   G2753   G2844   U2822   C2671   G2698   G2869   U2847   G2692   C2717
5SrRNA   5sG100  5sU82   A1014   A2302   5sC97   5sG84   A930    A2247   5sA96   5sU80   A918    A2278

LSU, sequence alignments: Archaea (24 seq in 23S, 44 in 5S, 317 in tRNA), Bacteria (184 in 23S, 336 in 5S, 1768 in tRNA), Eukarya (137 in 23S, 222 in 5S, 141 in tRNA)
         Archaea                               Bacteria                              Eukarya
#        GUWC GUNT WCUG NTUG WCWC NTNT Gaps    GUWC GUNT WCUG NTUG WCWC NTNT Gaps    GUWC GUNT WCUG NTUG WCWC NTNT Gaps
A          88   13    0    0    0    0    0      80   17    3    0    1    0    0       7    6    0    0   70   17    7
B           0    0    4    0    0    0   96      98    2    0    0    0    1    0       0    0    0    1    0    0   99
C          21    0   71    0    8    0    0       0    0   99    1    0    1    0      85    0    5    2    8    0    3
D         100    0    0    0    0    0    0      94    3    3    0    0    0    0       0    0   44   17   21   17    7
F          65    0    0    0   30    4    0      96    2    0    0    2    0    1       0    0    0    0   88   12    1
G         100    0    0    0    0    0    0      95    2    0    2    0    1    2      10    0    0    1   89    1    2
H          96    4    0    0    0    0    0      98    1    0    1    0    0    0      48   30    6   14    2    2    1
E-tRNA     99    0    0    0    0    0    0      97    3    0    0    0    0    0      86   12    0    0    0    0    0
P-tRNA     99    1    0    0    0    0    0      98    1    0    0    1    0    0      94    5    0    0    0    0    0
I           0    0    0    0  100    0    0     100    0    0    0    0    0    0       0    0    0    0   94    1    4
J          88    0    0    0   13    0    0      98    0    0    0    1    1    0      21   16   48    0   13    2    2
K         100    0    0    0    0    0    0     100    0    0    0    0    0    0      66    8    0   20    0    6    3
5SrRNA      0  100    0    0    0    0    0       0   93    0    0    0    5    2       0   77    0    0    0   20    3

Table 8. Sequence analysis of all P-interactions in 16S, 23S, 5S rRNA, and tRNA. The first column states the letters used in Tables 5 and 6 and Figures 38 and 39 to refer to these interactions. All four nucleotides participating in each interaction are compared. These interactions occur mostly between one helical element with a G/U and another helical element with a regular WC basepair (the G/U element is shaded). However, some occur between two helical elements without any G/U (b, I, and J), and some even occur between one helical element and one non-helical element (16S A768/G811 is trans Hoogsteen/sugar edge, and 23S A1014/A2302 is trans Hoogsteen/Hoogsteen basepair). Abbreviations: SSU, small ribosomal subunit; LSU, large subunit; WC, any combination of the four classical Watson-Crick basepairs (G/C, C/G, A/U, and U/A); and NT, any basepair other than GU, AC, UG, CA or WC.
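The column labels of Table 8 (GUWC, GUNT, WCUG, NTUG, WCWC, NTNT) describe the state of both basepairs of a P-interaction at once. A minimal sketch of such a four-position classification is given below. The precedence rules are assumptions made for illustration, since the table does not spell them out: a G/U in the first helix is tested first, then a U/G in the second helix, and every remaining combination that is not WC with WC is placed in NTNT. The function names are hypothetical.

```python
import re

WC_PAIRS = {"GC", "CG", "AU", "UA"}


def base_of(label: str) -> str:
    """Extract the base letter from a label such as 'G684', 'tC70' or '5sU82'."""
    return re.search(r"[ACGU]", label).group()


def classify_four(a: str, b: str, c: str, d: str) -> str:
    """Assign the four nucleotides of AB P CD to a Table 8 style column.

    Assumed precedence: G/U in the first helix, then U/G in the second helix,
    then WC with WC, and everything else as NTNT.
    """
    first = base_of(a) + base_of(b)
    second = base_of(c) + base_of(d)
    if first == "GU":
        return "GUWC" if second in WC_PAIRS else "GUNT"
    if second == "UG":
        return "WCUG" if first in WC_PAIRS else "NTUG"
    if first in WC_PAIRS and second in WC_PAIRS:
        return "WCWC"
    return "NTNT"


# Interaction C of Table 8 in the three crystal structures.
print(classify_four("G684", "U662", "C748", "G657"))  # Hm
print(classify_four("G634", "C615", "U670", "G610"))  # Dr
print(classify_four("C623", "G605", "U657", "G600"))  # Ec
```

With these rules, interaction C is classified as GUWC in Hm and as WCUG in Dr and Ec, which matches the GU P CG, GC P UG, and CG P UG forms observed in the respective structures.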


Table 8 lists the locations and the substitution patterns at all four positions participating in each interaction simultaneously. Among the thirteen P-interactions of the large subunit there is one between 23S and 5S, one between 23S and E-site tRNA, and one between 23S and P-site tRNA. U P C is the most commonly seen internal pair, but there is no apparent preference for Y P Y over Y P R interactions, since U P A, U P G, and other Y P R pairs are also seen. There is, however, a clear preference for these over R P R interactions (none of which is observed). G/U, on the other hand, is the most commonly seen basepair making this interaction, as only three of the eighteen P-interactions do not involve a G/U (b, I, and J in Table 8). In addition, two such interactions involve a non-WC basepair (e and 5S). A total of four P-interactions were not reported before (e, B, I, and 5S). Thus, the most commonly seen P-interaction is in the form GU P CG, which is considered the signature of this interaction.
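The AB P CD notation and the Y P Y, Y P R, and R P R classes used above can be made concrete with a short sketch. The function names are illustrative assumptions; the only convention taken from the text is that B and C (the second nucleotide of the first basepair and the first nucleotide of the second basepair) form the internal pair.

```python
PURINES = {"A", "G"}       # R
PYRIMIDINES = {"C", "U"}   # Y


def internal_pair(notation: str) -> tuple:
    """Return (B, C) from a string in the form 'AB P CD', e.g. 'GU P CG'."""
    first, symbol, second = notation.split()
    if symbol != "P" or len(first) != 2 or len(second) != 2:
        raise ValueError(f"not in AB P CD form: {notation!r}")
    return first[1], second[0]


def internal_pair_class(notation: str) -> str:
    """Classify the internal pair as 'Y P Y', 'Y P R', 'R P Y' or 'R P R'."""
    def kind(base: str) -> str:
        if base in PURINES:
            return "R"
        if base in PYRIMIDINES:
            return "Y"
        raise ValueError(f"unexpected base: {base!r}")

    b, c = internal_pair(notation)
    return f"{kind(b)} P {kind(c)}"


for motif in ("GU P CG", "GU P GC", "GC P UG"):
    print(motif, "->", internal_pair(motif), internal_pair_class(motif))
```

For the signature form GU P CG the internal pair is U and C, so the interaction falls in the Y P Y class, whereas GU P GC has the internal pair U and G and falls in the Y P R class.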

In addition to the structural characteristics mentioned above, Table 8 also shows some important sequence characteristics of the P-interaction. Most notable is the high conservation of the G/U’s participating in such interactions. Even in the cases where the crystal structure did not contain a G/U basepair (b, I, and J in Table 8), alignments of some domains displayed a very high G/U content in the first helix or U/G content in the second helix (95% U/G in the second helix in archaeal alignments for b, 100% G/U in the first helix in bacterial sequences for I, and 88% and 98% G/U in the first helix in archaeal and bacterial sequences for J). Even more striking is the evidence that in the rare cases when the G/U in a P-interaction is lost, there is a strong tendency for all four nucleotides forming the interaction to mutate so as to recreate a G/U in the right orientation in one of the two helices involved. This is best demonstrated by the P-interaction corresponding to Hm G684/U662 P C748/G657 (C in Table 8, between H28 and H27). It seems that a quadruple compensatory mutation between archaea and bacteria has occurred. The GU P CG in Hm is replaced by a GC P UG in Dr, and by CG P UG in Ec. In archaeal sequence alignments, these positions are 21% GU–WC, 71% WC–UG, and 8% WC–WC. The latter number could correspond to organisms that survived with the intermediary mutation. Almost all (99%) of bacterial sequences are WC–UG, and most eukaryal sequences (85%) are GU–WC.

The most parsimonious explanation is that originally there was a WC–UG variant, which is still present in bacteria today. Sometime after the archaeal/eukaryal line diverged from bacteria a mutation occurred, causing some archaeal organisms to have WC–WC at those positions. Some of these organisms survive today. However, most archaea had other compensatory mutations following the first one, causing either their return to the original WC–UG bacterial variant, or their “flipping” into the opposite GU–WC variant. Some archaea also may have retained the original bacterial variant without ever mutating because they may have originated before the first mutation ever occurred. Eukarya most probably separated from archaea before the first mutation. These observed patterns of substitution strongly suggest that the P-interaction was conserved in most, if not all, organisms, even when the identity of the individual nucleotides making it had changed. It also is clear that the G/U basepair with U at the internal position was favored. This is the first time that such a covariation involving two basepairs, rather than just two nucleotides, has been reported.

There is, on the other hand, one case where a P-interaction in bacteria (corresponding to Ec numbers G1878/U1864 P C414/G2409) is completely lost and replaced by a GNRA loop interaction in archaea (and probably in eukarya as well, as suggested by their sequence alignments). The GNRA loop substitutes for one of the two helices (H68, flanked by the G/U) without much distortion of the structure, because it makes a 3D contact similar to that of the P-interaction (see Figure 34 for a representation of this substitution). This indicates that the presence and orientation of an interaction (which define the 3D folding) are far more important than the actual identity or type of that interaction. Table 8 also shows that no G/U was substituted by U/G (in the same helix), which can be explained first by the fact that cis WC U/G is not isosteric to G/U, and second by the fact that the P-interaction, unlike the phosphate-in-pocket interaction, is structurally directional. It cannot occur in the form UG P CG because of the specific asymmetric orientation of the G/U basepair, which is more open to shallow groove pocket interactions coming from the side of the U than from the side of the G.

Figure 34. GNRA substitutions of G/U’s making P-interaction (a & b) and phosphate-in-pocket interaction (c & d). Panel a shows Ec P-interaction between G1878/U1864 flanking H68 and C414 at the end of H22 (C414 is also basepaired with G2409), and panel b shows the equivalent region in Hm, with G/U replaced by a GNRA loop. Panel c shows Ec phosphate-in-pocket interaction between G1740/U1720 of H63 and C1550 of H56, and panel d shows the equivalent region in Hm, again with the G/U replaced by a GNRA loop. In both cases, tertiary interactions take place between the same areas of the ribosome, preventing any drastic change in 3D folding.

Molecular Dynamics Analysis of the P-interaction

Most of the P-interactions observed in crystal structures are between a G/U pair and a C/G pair,

with U and C being at the internal positions (GU P CG). The purpose of MD analysis was to

explain this behavior, as well as reveal why the G/U’s forming P-interactions almost never are

substituted by their isosteric A/C basepairs throughout evolution.

MD simulations were carried out to compare the behavior of eight different combinations of P-

interactions, namely GU P CG, AC P CG, GU P GC, AC P GC, GC P CG, GC P GC,

AU P CG, and AU P GC. The studied motifs were embedded in two A-type helices, each with

24 residues (Figure 35).

Figure 35. P-interaction between two helices. The figure shows the whole system used for MD analysis; the nucleotides forming the P-interaction are highlighted in red. Produced by VMD (Humphrey et al. 1996).

The helices were identical in all simulations, except for their centrally positioned P-interactions.

The above combinations were chosen to represent Y P Y and R P Y types of interaction, as well

as G/U and other isosteric (A/C) or nearly isosteric (A/U and G/C) basepairs embedded in one

helix, combined with G/C and C/G basepairs embedded in the other helix. The simulations were

extended to ca. 10 ns each, which appears to be sufficient for the purpose of this study. Figure 36

shows time development of the five H-bonds constituting the P-interaction (as defined in Figure

32a) in each system. The eight structures were ranked in order of decreasing stability, which

was indirectly estimated from the number of H-bonds and their dynamic behavior in the simulations. (Note that such a classification is more relevant for judging the stability of RNA interactions than, for example, an evaluation of direct base-base interaction energies, which would neglect the interplay between the base-base interaction and all other effectors such as adjacent basepairs and solvent screening.) The suggested stability order is as follows: GU P CG

= GU P GC (these complexes maintained all five H-bonds throughout the simulations) >>

GC P GC = AU P GC = AU P CG = GC P CG (these lost the first two H-bonds, but generally

maintained the others) > AC P CG = AC P GC (most H-bonds lost early in the simulations).

Thus, a P-interaction involving A/C is the weakest among the tested systems. This is related to the fact that the A+/C shallow groove pocket is less electronegative than the G/U pocket, and it has a much lower H-bonding potential due to the loss of the NH2 group. Hence, it cannot accommodate the ribose O2’ the same way as G/U does (see Supplementary Figure 37 for a comparison of electrostatic potential between A+/C and G/U).
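The indirect stability estimate described above (counting how many H-bonds persist and for how long) can be illustrated with a minimal sketch. The 3.5 Å donor-acceptor cutoff, the function names, and the toy distance series are assumptions made for illustration only; they are not the criteria or data used in the simulations reported here.

```python
def hbond_occupancy(distances, cutoff=3.5):
    """Fraction of frames in which a donor-acceptor distance (in Angstrom)
    is within the cutoff. The 3.5 A cutoff is an assumed, commonly used value."""
    if not distances:
        raise ValueError("empty distance series")
    return sum(d <= cutoff for d in distances) / len(distances)


def rank_systems(trajectories, cutoff=3.5):
    """Rank systems by mean occupancy over their monitored H-bonds,
    most persistent first. `trajectories` maps a system name to a list
    of per-H-bond distance series."""
    scores = {}
    for name, bonds in trajectories.items():
        scores[name] = sum(hbond_occupancy(b, cutoff) for b in bonds) / len(bonds)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up distance series (Angstrom); NOT data from the simulations.
toy = {
    "stable":   [[2.9, 3.0, 3.1, 2.8], [3.0, 3.2, 2.9, 3.1]],
    "unstable": [[2.9, 4.5, 5.2, 6.0], [3.1, 3.3, 4.8, 5.5]],
}
ranking = rank_systems(toy)
```

A ranking of this kind only summarizes persistence; it does not replace inspection of the time development of each H-bond as shown in Figure 36.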

Figure 36. Molecular Dynamics of the eight P-interaction systems. All five H-bonds involved in the generic P-interaction (defined in Figure 32a) are monitored. The red horizontal lines mark H-bond lengths in starting structures (average over first 50 ps); the red curves describe the probability distribution of H-bond lengths in the simulation, and the blue curves display the actual time development of the H-bond lengths along the trajectory.

Figure 37: Electrostatic maps of G/U (panel A) and A+/C (panel B) basepairs involved in P-interaction (view from the shallow groove). Electrostatic potential (ESP) was calculated for a selected region (the G/U or A+/C pair and two adjacent base pairs above and below) and is represented as the VDW surface of key base pairs with a color scale from -10 to 10 kT. The Figure clearly shows that the G/U to A+/C replacement leads to a major change of the electrostatic potential in the shallow groove area where, upon formation of the P-interaction, the opposite guanine binds via its NH2 group. In fact, protonated adenine creates a positive potential region which is not suitable for guanine binding, as seen in simulations. The ESP was calculated using the DELPHI code by solving the non-linear Poisson-Boltzmann equation, assuming a continuum solvent model.

Other Shallow Groove Pocket Interactions

There are two additional SGP interactions in the ribosome, phosphate-in-pocket interaction, and

O2’-in-pocket interaction (Figure 32). The latter means that O2’ falls in the SGP of G/U, but it cannot be classified as P-interaction or “type 0” A-minor interaction. Both of these SGP interactions lead to high conservation of the G/U’s. The phosphate-in-pocket interaction is less directional than the P-interaction, having no preference for a specific G/U orientation. This G/U to U/G covariation is also the case for some other interactions in the shallow groove, especially some single H-bond interactions. Thus, it is not unique to phosphate-in-pocket interactions but at least it is one of its indicators. Similar to the case where a G/U forming a P-interaction is replaced by a GNRA loop, there also is a case where a G/U forming a phosphate-in-pocket interaction is replaced by a GNRA loop, with no substantial distortion of the 3D folding. This occurs at the positions corresponding to Ec G1740/U1720 – C1550P, bringing together H63 and

H56. Here, both Hm and Dr have a shorter H63 than Ec, so the GNRA loop that normally is at the end of H63 is brought to a position homologous to the G/U position in Ec (see

Supplementary Figure 34 for a representation of this substitution).

Figures 38 and 39 show the positions of all SGP interactions on the secondary structures of the small and large ribosomal subunits, respectively.

Figure 38. Shallow groove pocket interactions in 16S rRNA. Labels of P-interactions correspond to Tables 5 and 8. Color code displayed at the bottom right.

Figure 39. Shallow groove pocket interactions in 5S rRNA, 23S rRNA, and tRNA. Labels of P-interactions correspond to Tables 6 and 8. Color code displayed at the bottom right.

Discussion

The C1’ – C1’ distance of the cis WC G/U basepair is about 10.2 Å, which is very close to the

distances in classical WC basepairs (10.3 Å). The U is shifted towards the deep (major) groove

to allow for hydrogen bonding between G(O6)…U(N3) and G(N1)…U(O2). The angle between

the C1’ – C1’ axis and the glycosidic bonds is approximately 40° for G and 65° for U, instead of

the symmetric 54° angle in the canonical basepairs (Figure 40). This causes helices containing

G/U’s to locally overtwist or undertwist depending on the orientation of the wobble basepair and

its neighbors (Varani et al. 2000). Therefore, the only other basepair that is completely isosteric to the cis WC G/U basepair is the cis WC A+/C basepair, which is rare due to the requirement

that the A(N1) be protonated (Doudna et al. 1989; Leontis et al. 2002).
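The two geometric descriptors used above (the C1’ – C1’ distance and the angles between the C1’ – C1’ axis and the glycosidic bonds) can be computed from atomic coordinates as in the following sketch. The coordinates are synthetic, placed in a plane so that they reproduce the approximate values quoted in the text for G/U; the 1.47 Å glycosidic bond length and all names are assumptions for illustration, not measurements from a crystal structure.

```python
import math


def angle_deg(a, b, c):
    """Angle (degrees) at vertex b formed by the points a-b-c."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))  # guard against rounding
    return math.degrees(math.acos(cos))


def basepair_geometry(c1_a, n_a, c1_b, n_b):
    """Return (C1'-C1' distance, angle at base A, angle at base B).
    n_a and n_b are the glycosidic nitrogens (N9 for purines, N1 for
    pyrimidines) bonded to the respective C1' atoms."""
    return (math.dist(c1_a, c1_b),
            angle_deg(n_a, c1_a, c1_b),
            angle_deg(n_b, c1_b, c1_a))


# Synthetic planar coordinates (Angstrom), built to give 10.2 A, 40 and 65 degrees.
BOND = 1.47  # assumed glycosidic bond length
c1_g = (0.0, 0.0, 0.0)
c1_u = (10.2, 0.0, 0.0)
n9_g = (BOND * math.cos(math.radians(40)), BOND * math.sin(math.radians(40)), 0.0)
n1_u = (10.2 - BOND * math.cos(math.radians(65)), BOND * math.sin(math.radians(65)), 0.0)

d, ang_g, ang_u = basepair_geometry(c1_g, n9_g, c1_u, n1_u)
```

Two basepairs with closely matching values of all three descriptors would be candidates for isostericity in the sense used here.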

Figure 40. The cis Watson-Crick G/U and A+/C basepairs. The upper panel represents a G/U wobble basepair with a water molecule (W) in its shallow and deep groove pocket. The angles formed between the C1’ – C1’ axis and the glycosidic bonds show the asymmetry of this basepair compared to the classical WC basepairs. The lower panel shows the isosteric A+/C basepair. Produced by ChemDraw (CambridgeSoft Corporation).

The canonical WC basepairs are nearly isosteric to G/U or U/G (Varani et al. 2000) while, due to

the asymmetry, G/U and A+/C are not isosteric to U/G and C/A+. The pocket created by G(N2),

U(O2), and U(O2’) in the shallow groove usually coordinates an integral water molecule by

forming two H-bonds. Another feature of the cis WC G/U basepair is the unique H-bond donor

and acceptor distribution around its grooves. The hydrogens of the amino group GNH2 are free

to interact in the shallow groove, and the electronegative G(N7), G(O6), and U(O4) atoms in the

deep groove create a strong electronegative surface (Allain et al. 1995; McDowell et al. 1996)

that binds metal ions. This may play important roles in RNA folding or ribozymatic activity

(Konforti et al. 1998; Draper et al. 2005). In addition, the cis WC G/U basepair is approximately

as thermodynamically stable as the classical cis WC A/U basepair (Sugimoto et al. 1986; Jaeger

et al. 1989; McDowell et al. 1996; McDowell et al. 1997; Mathews et al. 1999; Strazewski et al.

1999; Schroeder et al. 2001). Finally, the cis WC G/U basepair possesses a unique conformational flexibility (Varani et al. 2000) that allows it to respond to sequence contexts and crystal packing much more easily than classical basepairs (Westhof et al. 1985), hence allowing for recognition of interacting proteins or other RNAs by induced fit (Ramos et al. 1997; Chang et

al. 1999). These characteristics of G/U basepairs play major roles in determining the types of

RNA-RNA, RNA-protein, or RNA-metal ion interactions in which they participate, and in their

preferred substitution patterns throughout evolution.
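The isostericity relations stated above for the cis WC family can be encoded as a small lookup, shown here as a sketch. Only relations named in this chapter are used: the four canonical pairs are mutually isosteric, G/U is isosteric to A+/C, U/G to C/A+, and the canonical pairs are nearly isosteric to the wobble pairs. Extending "nearly isosteric" from G/U and U/G to A+/C and C/A+ is an assumption made by analogy, and the function and constant names are hypothetical.

```python
# Pairs are written 5'->3' as two letters; "AC" stands for A+/C and "CA" for C/A+.
CANONICAL = {"AU", "UA", "GC", "CG"}
WOBBLE_GROUPS = [{"GU", "AC"}, {"UG", "CA"}]
WOBBLE = set().union(*WOBBLE_GROUPS)


def cww_relation(p, q):
    """Classify the substitution of cis WC basepair p by q."""
    if p == q:
        return "identical"
    if p in CANONICAL and q in CANONICAL:
        return "isosteric"
    if any(p in g and q in g for g in WOBBLE_GROUPS):
        return "isosteric"
    if (p in CANONICAL and q in WOBBLE) or (p in WOBBLE and q in CANONICAL):
        return "nearly isosteric"  # assumption for AC and CA, see lead-in
    if p in WOBBLE and q in WOBBLE:
        return "not isosteric"     # e.g. G/U versus U/G
    return "unclassified"          # pairs not discussed in this chapter
```

For example, `cww_relation("GU", "UG")` returns "not isosteric", which is the asymmetry that makes the P-interaction directional.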

There are 273 positions in the ribosome occupied by cis WC G/U basepairs in one or more of the available crystal structures. This corresponds to roughly 15% of all ribosomal basepairs. About half of these are involved in tertiary or quaternary interactions. Therefore, they have important roles in the assembly and 3D folding of the ribosome and its subunits. Figure 41 is a Venn

diagram representing the sequence data from archaeal and bacterial alignments mapped to the

structural data of each of the positions. This provides valuable insights into the possible

functions and mechanisms for some specific families of interactions. The diagram clearly shows

that G/U’s forming specific shallow groove pocket interactions (SGP) are highly conserved

among broadly divergent taxa. These include P-interactions, phosphate-in-pocket interactions,

and ribose O2’-in-pocket interactions (Figure 32). Other tertiary interactions also are included in the Figure, showing less G/U conservation in most cases. Protein interactions were discussed in detail in (Klein et al. 2004), so they will not be fully addressed here.

Figure 41 legend. Venn diagram representing locations, types, and sequence conservations of G/U wobble basepairs in the ribosome. Nucleotide numbers are taken from the main crystal structure in each category (Tt in 16S, Hm in 23S & 5S), unless otherwise specified (by Dr or Ec preceding the numbers). Each basepair is colored according to its average % GU+UG content in sequence alignments of archaea and bacteria. Water mediated interactions are indicated by (w); white boxes indicate intermolecular bridges between ribosomal subunits. The following abbreviations are used: c, cis; t, trans; W, Watson-Crick; H, Hoogsteen; S, sugar edge (Leontis et al. 2001); SGP, shallow groove pocket; SGNP, shallow groove not in pocket; DGP, deep groove pocket and DGNP, deep groove not in pocket; additional interactions are stated in parentheses after + (such as +SGNP prot or +DGP sugar). The two positions connected by curved lines in P-interactions (657-748 and 662-684) represent the P-interaction with quadruple compensatory mutation, discussed in detail in the text. The two basepairs in parentheses (16S 601-637 and 23S 1850-1881) are not GU in any crystal structure, but are included in the study for having > 50% GU content in bacterial and archaeal sequences, respectively.

Potential SGP interactions (mentioned in Table 7) are those that could form P-interactions,

phosphate-in-pocket interactions, or O2’-in-pocket interactions after a modest structural rearrangement. This could indicate transient formation of SGP interactions at these positions. It is clear that most of the 16S basepairs belonging to this group are highly conserved G/U’s in sequences, meaning that phylogenetically they behave as actual SGP interactions. Interestingly, and as mentioned above, those from 23S (except for Ec 2514-2570) are less conserved and behave as if they never form SGP interactions, even transiently. The non-interacting G/U’s, on the other hand, display low G/U conservation at their positions. From such sequence information one can infer the locations and types of interactions in cases where atomic resolution structures are not available.
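The G/U conservation referred to above is the percentage of sequences carrying G/U or U/G at a pair of alignment columns. A minimal sketch of this count is given below; the function name, the 0-based column convention, and the choice to keep gapped sequences in the denominator are assumptions, and the toy alignment is invented.

```python
def gu_content(alignment, i, j):
    """Percent of sequences with G/U or U/G at 0-based columns i and j.
    Sequences with gaps or other bases at these columns count as non-GU."""
    if not alignment:
        raise ValueError("empty alignment")
    wobble = {("G", "U"), ("U", "G")}
    hits = sum((s[i].upper(), s[j].upper()) in wobble for s in alignment)
    return 100.0 * hits / len(alignment)


# Invented four-sequence alignment; columns 0 and 3 form the basepair.
toy_alignment = ["GAAU", "GCAU", "UAAG", "AAAU"]
percent = gu_content(toy_alignment, 0, 3)
```

Here three of the four sequences carry G/U or U/G at the chosen columns, so the value is 75%.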

The ribosome is a large molecular machine with many moving parts, the movement of which must be carefully coordinated for correct and efficient functioning. Transient interactions, like the ones with tRNA or the ones between the ribosomal subunits, have to be formed and broken easily, and therefore must be relatively weak while still stereochemically precise. This may be possible through the association and dissociation of multiple weak tertiary and quaternary interactions or the transformation between different types of somewhat similar interactions. The similarity between P-interactions, “type 0” A-minor motifs, and O2’-in-pocket interactions, and some potential SGP interactions (all having an O2’ buried at different depths in the G/U shallow groove pocket) makes it possible for some of them to switch to others. A similar scenario was suggested for tertiary interactions comprising two or more A-minor motifs (Noller 2005). Such dynamic behavior could explain why some G/U basepairs without apparent interactions are conserved. A possible example of this kind is 16S G886/U911 interacting with G1489/C1411, which is observed as a potential O2’-in-pocket interaction. However, sequence analysis shows

that this position is >98% GU-WC (mainly GU-GC and GU-CG) in all domains, making it

similar in its evolutionary conservation to a P-interaction. We suggest that if a P-interaction relaxes or starts to dissociate, it may convert to an O2’-in-pocket interaction or a “type 0” A-minor interaction because of the orientational similarity between them.

A few other basepairs are highly conserved in archaeal and bacterial alignments, but are not taking part in specific or potential SGP interactions, or are not interacting at all (Figure 41).

Some of these may be involved in transient interactions that are in their dissociated states in crystal structures. Others, especially those flanking internal loops or junctions which make up more than one third of all G/U’s, may be highly conserved because of the important roles they play in providing specific stacking interactions that stabilize these nearby motifs. Examples of these are 23S 1848-1883 flanking a four-way-junction at the base of H66, 23S 1898-1939 flanking the internal loop of H68, which is a conserved motif similar to C-loop, and 23S 2869-

2888 flanking the internal loop of H101. Other interesting cases include 23S 2541-2618, which is a specific basepair forming “type I” A-minor motif with the terminal A76 of the tRNA CCA acceptor stem. The G/U here permits the interaction to become more compact by allowing the unstacked A76 to insert deeper in its pocket, making the G/U the preferred basepair at this location. The 23S 798-815 basepair flanking the internal loop of H34 is another interesting case where GNH2 forms an H-bond with A1598O2’, which belongs to the hairpin of H58. This helix is joined to the rest of 23S subunit by a kink turn, and the G/U here may be involved in stabilizing conformational changes at this flexible region located at the interface with the small subunit. Another conserved G/U is 23S 2492-2529 which is a part of a complex motif forming a receptor for a GNRA loop.

Conclusion

Complex functional RNA molecules, like the ribosomal RNAs, are compactly folded and form

complex tertiary and quaternary RNA-RNA and RNA-protein contacts mediated by several types of recurrent motifs and interactions. Some of these interactions, especially those reaching deep in shallow grooves of helices, like the P-interactions, phosphate-in-pocket interactions, and O2’-in-pocket interactions, are specific for G/U’s. Therefore, these are regions of high conservation of

G/U basepairs. Most of these interactions are long range interactions bringing together different

RNA molecules or distant helices in the same RNA molecule. Some of them, however, are short range interactions that are nonetheless highly conserved, such as the P-interactions b, C, D, and

J, the phosphate-in-pocket interactions occurring in h34, between H6 and H7, and in H91, and the O2’-in-pocket interaction occurring in h11 (refer to Figures 38 and 39). Interestingly, none of these interactions takes place between the “head” of the 30S subunit corresponding to domain III in 16S rRNA and the rest of the small subunit. This agrees with previous studies showing that only one A-minor motif interaction occurs between domain III and the rest of the small subunit

(Noller 2005). It also supports the observation in the 70S Ec that the “head” is subject to large-scale rigid body motions relative to the rest of the ribosome during the protein synthesis cycle, especially the ratchet-like motion that accompanies translocation (Frank et al. 2000; Moore

2005; Schuwirth et al. 2005). The evidence that a few G/U’s participating in P-interactions and phosphate-in-pocket interactions – which are some of the most conserved interactions in RNA – can be replaced by other elements (like GNRA loops) that are able to form similar tertiary contacts proves that these tertiary contacts are essential for the survival of the organisms. This does not challenge the importance of the G/U basepair itself, which is still the preferred moderator of such interactions. This knowledge can be helpful in refining sequence alignments by looking

for the signatures of some of these interactions, such as the GU P CG signature of the P-interaction, and the G/U covariation with U/G which is one of the indications of a phosphate-in-pocket interaction.

CHAPTER IV: SEQUENCE PARSING AND AUTOMATIC RNA ALIGNMENT

Algorithms for Aligning RNA Sequences

As mentioned in Chapter I, the problem of aligning structural RNA sequences is quite different from that of aligning other biological molecules, such as proteins. Proteins are aligned by comparing their sequences one amino acid at a time, and deciding whether those amino acids are “similar” to each other or not. The simplest method to do this is by scoring the similarity of amino acids based on various scoring matrices, as discussed in Chapter I (Dayhoff et al. 1978; Jones et al.

1992; Henikoff et al. 1993). The alignment is automated through dynamic programming algorithms, such as the Needleman-Wunsch algorithm used for building global alignments

(Needleman et al. 1970; Gotoh 1982), or the Smith-Waterman algorithm for building local alignments (Smith et al. 1981). These algorithms, however, have serious limitations: they cannot treat distant interactions, and they become too complicated when aligning more than two sequences. A more advanced method for aligning proteins is a probabilistic method used by the

PFAM database (Sonnhammer et al. 1998; Bateman et al. 2004). The first step in this method is to carefully build a structural seed alignment using a smaller set of sequences known to form a homologous family. The second step is to build a fully probabilistic Hidden Markov Model

(HMM) profile for the family of proteins represented by the seed alignment. The profile includes the probabilities for amino acid substitution or deletion at every position in the alignment, based on the frequencies observed at each position. The HMM profile can then be used to score and align new protein sequences that belong to the same group, making a larger high quality alignment. The HMM probability parameters can be recalculated based on the new sequences added. The calculation of parameters from an alignment is called “training” of the HMM model.

Because HMMs include information from many sequences that belong to a certain protein family, they can be used to find distant homologues that are not detected by pairwise search against any one sequence in the alignment.
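The global dynamic programming alignment mentioned above can be sketched as follows. This is a simplified illustration: it uses a flat match/mismatch score instead of an amino acid scoring matrix and a linear gap penalty instead of the affine gaps of Gotoh (1982). The parameter values and the function name are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GAUUACA", "GAUACA")
```

Each column of the result is scored independently of every other column, which is why this kind of algorithm cannot capture the distant pairwise dependencies discussed next for RNA.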

RNA is different from protein in that it folds into stable secondary structures due to the interactions between its nucleotides. Hence, it is not enough to compare nucleotides to each other in the linear way done for proteins to decide whether they are similar or not. Instead, base pairs need to be compared to each other because they are the units that form secondary structures.

Isosteric or nearly-isosteric basepairs that can replace each other without causing distortions in the structure can occur at homologous positions. Also here, dynamic programming algorithms for optimizing RNA secondary structure are available (Nussinov et al. 1980; Zuker et al. 1981;

Zuker 1989; Zuker 2003), but they too have the same limitations seen with protein alignment dynamic programming algorithms. HMM models do not work in this case either because they cannot describe the behavior of two positions (the two nucleotides in a basepair) at once. A more complex algorithm, Stochastic Context Free Grammar or SCFG is used instead to describe RNA.
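The simplest of the secondary structure dynamic programming algorithms cited above (Nussinov et al. 1980) maximizes the number of nested basepairs. The sketch below illustrates that recurrence only; the set of allowed pairs (WC and G/U) and the minimum hairpin loop length of three are conventional assumptions, and energy-based methods such as those of Zuker are not represented.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested basepairs in seq, requiring at least
    min_loop unpaired bases between the two partners of any pair."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                    # case: j is unpaired
            for k in range(i, j - min_loop):       # case: j pairs with k
                if (seq[k], seq[j]) in allowed:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


hairpin_pairs = nussinov("GGGAAAUCC")
```

Because the recurrence splits the sequence into an inside and an outside of each pair, it can only produce nested structures, the same restriction that applies to the grammar-based methods described below.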

SCFG is a part of the theory known as the “Chomsky hierarchy of transformational grammars” first developed by computational linguists in an attempt to understand natural and computer languages (Chomsky 1956; Chomsky 1959). Figure 42 shows a summary of transformational

grammars and their production rules.

Figure 42. A part of Chomsky hierarchy of transformational grammars. Two higher levels of grammars are not shown here: context sensitive grammars and unrestricted grammars. HMM, Hidden Markov Models; SCFG, Stochastic Context Free Grammar; a and b, observable terminal states; S, abstract non-terminal state that gives rise to the observable terminal states.

Figure 42 describes only two of the four types of grammars in this theory. The two that are not shown are context sensitive and unrestricted grammars. All of them are called generative grammars because a set of characters in a specific order can be generated from their rules. Non-terminal states, such as the “S” shown in the figure, are abstract states that are not observed directly, and that give rise to observed terminal states, such as “a” and “b” in the figure. Context free and SCF grammars describe mainly nested dependencies of characters (letters, or nucleotides) in a string of variables (words, or RNA strands). This has limited use in linguistics, because it can mainly describe or generate palindromic (or almost-palindromic, as letters are not necessarily generated at both sides) words or sentences, which do not occur often in natural languages. It has, however, some uses for parsing and compiling computer languages and for grammatical checks in some natural languages, such as subject-verb agreement or endings of adjectives in French (Jurafsky et al. 1995). On the other hand, SCFGs are more useful for describing RNA structures because a great deal of RNA basepairs occur in a nested manner

(Durbin et al. 2000).
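The way a context free grammar generates nested dependencies can be illustrated by applying production rules from the outside of a string inward. In the sketch below, the rule names ("pair", "left", "right", "end") and the tuple encoding are hypothetical conventions chosen for illustration; they correspond to productions of the form S → x S y, S → x S, S → S y, and S → ε.

```python
def derive(productions):
    """Build a string by applying productions outside-in.
    Each production is ('pair', x, y), ('left', x), ('right', y) or ('end',)."""
    left, right = [], []
    for rule in productions:
        kind = rule[0]
        if kind == "pair":        # S -> x S y : emits both partners at once
            left.append(rule[1])
            right.append(rule[2])
        elif kind == "left":      # S -> x S   : unpaired letter on the left
            left.append(rule[1])
        elif kind == "right":     # S -> S y   : unpaired letter on the right
            right.append(rule[1])
        elif kind == "end":       # S -> epsilon
            break
        else:
            raise ValueError(f"unknown production: {kind}")
    # Right-hand emissions were collected outside-in, so reverse them.
    return "".join(left) + "".join(reversed(right))


# Two nested basepairs (G-C, C-G) closing a GAAA loop.
hairpin = derive([("pair", "G", "C"), ("pair", "C", "G"),
                  ("left", "G"), ("left", "A"), ("left", "A"), ("left", "A"),
                  ("end",)])
```

The two letters of each "pair" production are emitted together, which is how the grammar ties the identity of one nucleotide to that of its partner; a regular grammar or HMM emits only one letter per step and cannot express this dependency.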

SCFG Profiles of RNA Molecules

Similarly to building an HMM model for a protein family, building a SCFG model for an RNA

family starts by manually constructing an accurate structural seed alignment of homologous

RNA sequences. The positions that form W.C. and G/U wobble basepairs are identified, and the

probabilities of their substitutions and deletions are determined. This profile is used to find more

sequences that satisfy its requirements and to add them to the seed alignment, making a larger

and larger alignment at every iteration. The SCFG profile is recalculated and enhanced (trained)

after each addition. Figure 43 shows a graphical representation of the SCFG profile of a

hypothetical RNA molecule. This algorithm is used by the program “Infernal” to construct the

RFAM database of RNA sequence alignments (Griffiths-Jones et al. 2003; Griffiths-Jones et al.

2005), and by the Pfold server (Knudsen et al. 2003) to determine RNA secondary structures.

Methods based on SCFG are currently considered the most accurate RNA automatic alignment methods.
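The step of determining substitution probabilities for a basepair from the seed alignment can be sketched as a frequency count over two columns. The add-one pseudocount used here is an assumption chosen for simplicity and is not the prior used by Infernal or Pfold; the function name and the toy seed are also illustrative.

```python
from collections import Counter
from itertools import product


def pair_probabilities(seed, i, j, pseudocount=1.0):
    """Estimate the probability of each of the 16 basepair identities at
    0-based columns i and j of a seed alignment. Sequences with gaps or
    non-ACGU characters at either column are skipped."""
    bases = "ACGU"
    counts = Counter((s[i], s[j]) for s in seed
                     if s[i] in bases and s[j] in bases)
    total = sum(counts.values()) + 16 * pseudocount
    return {(x, y): (counts[(x, y)] + pseudocount) / total
            for x, y in product(bases, repeat=2)}


# Invented seed alignment; columns 0 and 4 form the basepair.
toy_seed = ["GAAAC", "GAAAC", "AAAAU", "GAAAU"]
probs = pair_probabilities(toy_seed, 0, 4)
```

With only four sequences, the pseudocounts dominate the estimate. This is the small-sample problem noted below for rare basepair types, whose observed frequencies are not sufficient to build reliable profiles.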

Figure 43. A hypothetical RNA structure (panel A) and its SCFG profile (panel B). The two panels are color-coded to indicate corresponding secondary structure elements. Node codes used: S = start; B = bifurcation; P = basepair; R and L = right and left unpaired bases (“bulges”), respectively. The starting unpaired U’s, the hairpin loops, and the C between the two hairpins are arbitrarily considered left bulges. Figure adapted from (Durbin et al. 2000) with modifications.

As shown in Figure 43, the SCFG profile of an RNA molecule is a general way of representing

its secondary structure, including basepairs, bulges, hairpins, internal loops, and junction loops, without specifying the identities of nucleotides forming them. The algorithm employs dynamic programming to consider and score all possible assignments of bases in a sequence to the model and obtains the best alignment given the scoring method. There are two serious limitations to this method: First, SCFG can only describe nested basepairs, and cannot deal with crossing interactions or pseudoknots. There have been limited attempts to extend SCFG algorithms to treat pseudoknots, but with a costly sacrifice in computational time and memory usage (Rivas et al. 1999; Rivas et al. 2000). The second and more important limitation of the SCFG algorithm as it is currently applied is that it does not take actual 3D structure into consideration, so different types of basepairs are not differentiated. The algorithm assigns substitution rates to all positions based on their frequencies seen in the seed alignment (which defines the consensus secondary structure). This approach is successful with the W.C. and G/U basepairs because they are common in helical areas and their frequencies are thus informative. Other basepair types are less common and their observed frequencies are not enough to build their profiles.
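The distinction between nested and crossing basepairs that underlies the first limitation can be stated precisely: two basepairs (i, j) and (k, l) cross when i < k < j < l. The sketch below lists such conflicts in a set of basepairs; the function name is hypothetical, and base triples, in which one nucleotide takes part in two pairs, are deliberately not flagged by this strict test.

```python
def crossing_pairs(pairs):
    """Return all pairs of basepairs ((i, j), (k, l)) with i < k < j < l.
    Positions are integer indices along the strand."""
    norm = sorted(tuple(sorted(p)) for p in pairs)
    conflicts = []
    for a in range(len(norm)):
        i, j = norm[a]
        for b in range(a + 1, len(norm)):
            k, l = norm[b]
            if i < k < j < l:  # k opens inside (i, j) but closes outside it
                conflicts.append((norm[a], norm[b]))
    return conflicts


nested = crossing_pairs([(0, 9), (1, 8), (2, 7)])   # a simple helix
knotted = crossing_pairs([(0, 5), (3, 8)])          # a minimal pseudoknot
```

A structure for which this list is empty can be generated by a context free grammar; any non-empty result marks interactions that a pure SCFG profile has to ignore.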

Our approach solves these two problems by programming all families of basepairs and their isosteric matrices into the SCFG, and by coupling this with a Markov Random Field (MRF) algorithm for covering local crossing interactions. We refer to this as the hybrid SCFG/MRF method. We also directly use the 3D structure to determine all types of interactions taking place in the motif of interest, rather than only using its consensus secondary structure. Figure 44 shows how a real

RNA motif is addressed by both methods. Although helical regions are well represented by both, the Figure shows that large parts of the internal and hairpin loops are ignored when using the

SCFG model alone.

Figure 44. Representations of Hm 23S rRNA Helix 95 containing the sarcin/ricin motif (pdb file 1S72). Panel A represents all interactions seen in 3D. Non-classical basepairs are presented in red, and the red box indicates a crossing interaction (G/U/A triplet). The hybrid SCFG/MRF model is capable of accurately representing all these interactions. The pure SCFG however can only represent some of the interactions, since it calculates the probabilities for positions based on observed base substitutions in the seed alignment. Panel B is a simplification of how pure SCFG “sees” this motif.

Alignment of an RNA Motif Using the Hybrid SCFG/MRF Method

Our method was applied on a recurrent internal loop motif that occurs in many RNA structures, including several occurrences in the large ribosomal subunit alone. The motif is referred to as

“tWH-IL-tHS” motif, which stands for tWH basepair, stacked on a left insertion, stacked on a

tHS basepair (Figure 45). This motif is somewhat similar to the well known sarcin/ricin motif

shown in Figure 44, but with only two non-WC basepairs separated by a non-paired bulge

(insertion). Closing cis WC basepairs are present on both sides of the internal loop.

Figure 45. Representation of the “tWH-IL-tHS” motif on which our hybrid SCFG/MRF alignment algorithm is tested. The motif shown here occurs in internal loop 40a of H. marismortui 23S rRNA (PDB file 1S72).

All occurrences of this motif in high resolution crystal structures were found with the help of the motif search program “FR3D” (Sarver, M., Zirbel, C., Stombaugh, J., Mokdad, A., and Leontis,

N.B., submitted). FR3D finds and scores all 3D motifs with nucleotides within a user-defined geometric discrepancy from the query motif. It can also carry out symbolic searches for motifs containing specific pre-defined interactions. 842 candidates from FR3D were examined by eye and eleven were judged as good “tWH-IL-tHS” motifs (Figure 46). Nine of the eleven motifs were among FR3D’s highest ranking hits, and two had lower scores. These two did not completely resemble the query, but were close enough to be included as one group with it.

Figure 46. The eleven “tWH-IL-tHS” motifs found in all structures. Most of the hits are from 23S rRNA H. marismortui (Hm) and E. coli (Ec) internal loops (IL). Motifs occurring at homologous positions between these two structures are placed side by side with a small horizontal line connecting them. The top left is the query motif, with nucleotides used for the FR3D search outlined. Numbers in parentheses are the scores assigned by FR3D. The last two motifs do not entirely resemble the query: Hm 23S IL96 has a tSH interaction (in red) instead of the tWH present in the query, but these two families are similar structurally. The last motif has a distorted tWH C/C in the crystal structure. Based on PDB structures 1S72, 2AWB, and 1X8W.

The motifs shown in Figure 46 are geometrically similar recurrent motifs. Many of them occur in different regions of the same structure. Only the ones occurring at equivalent positions in Hm and Ec 23S rRNA may be homologues. The rest of them are examples of where, because of the usefulness of this motif in mediating tertiary interactions, there was evolutionary pressure to produce and conserve it in different places. These motifs are said to be

“analogous” to each other. Thus, we can make a structural seed alignment for them by hand and use it to train the hybrid SCFG/MRF algorithm. Figure 47 shows the resulting seed alignment.

Figure 47. Manual structural alignment of the eleven “tWH-IL-tHS” motifs. Color coding is used to indicate corresponding interactions in tertiary structure, secondary structure, and sequence alignment. The four dots ( . . . . ) in the middle of the alignment represent the separation between the two strands of the internal loop.

The seed alignment displayed in Figure 47 is then used to make a SCFG/MRF profile of the

“tWH-IL-tHS” family of motifs. This profile includes all details about the locations and types of basepairs present, the specific isosteric subfamily they form in 3D structure, and the locations and probabilities of insertions (calculated from their frequencies in the seed alignment). The

MATLAB script representing all these details for this motif is presented in Figure 48.

Figure 48. MATLAB script representing the seed alignment of “tWH-IL-tHS” motif. The code is colored to correspond to structural elements shown in Figure 47. Yellow ovals show the node, and yellow boxes contain explanations of the codes.

Alignment Results and Discussions

The next step is to apply the alignment algorithm using nodes as defined in Figure 48 to the seed sequences themselves. The alignment program parses the sequences according to the grammar represented by the nodes. This produces an automated alignment, which should ideally be identical to the hand seed alignment. Figure 49 shows both the hand alignment as well as the automatic alignment evaluated by Ribostral Alignment Viewer.

Figure 49. Validation of results with Ribostral Alignment Viewer. The Alignment Viewer is used to evaluate the hand alignment based on 3D structure and the automatic alignment of “tWH-IL-tHS” motif. Nucleotides in green colors indicate isosteric substitutions, those in blue indicate nearly-isosteric substitutions, those in purple indicate heterosteric substitutions, and those in red indicate forbidden substitutions.

Figure 49 shows that, for all except the tenth sequence, the hybrid SCFG/MRF algorithm produced an alignment almost identical to the hand structural alignment. The only difference was the right justification of insertions in the right strand. Both versions are equally acceptable without further information from the structure. The tenth sequence, however, was aligned differently. The G/A shown in red in the hand alignment is a tSH interaction (as shown in Figure

46). All other nucleotides in these two columns belong to the tWH family. Although the two families look similar, the G/A is not an allowed substitution in the tWH family. This is why the algorithm considers this combination unfavorable and instead of aligning the G/A to the tWH node, it found that it is more favorable to shift the A and treat it as an insertion and align the nearby U in its place. The resulting G/U is heterosteric with the tWH structure version 123

programmed into the algorithm. Obviously, the tenth sequence does not belong to this group and

a better result would have been obtained if it was removed.
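The color coding of Figure 49 and the two tWH cases discussed above can be written down as a small lookup. This is only an illustrative Python sketch, not the Ribostral implementation. The names `CLASS_COLOR`, `TWH_CLASS`, and `color_for` are hypothetical, and the substitution table is deliberately incomplete: it holds only the two entries stated in the text (G/A forbidden and G/U heterosteric, relative to the tWH version programmed into the algorithm).

```python
# Colors listed in the Figure 49 caption for each substitution class.
CLASS_COLOR = {
    "isosteric": "green",
    "nearly-isosteric": "blue",
    "heterosteric": "purple",
    "forbidden": "red",
}

# Hypothetical, incomplete table for a tWH column pair. Only the two
# cases stated in the text are filled in; a real table would cover
# all sixteen base combinations.
TWH_CLASS = {
    ("G", "A"): "forbidden",
    ("G", "U"): "heterosteric",
}


def color_for(pair, table=TWH_CLASS):
    """Return the display color for a base combination, or None if unknown."""
    substitution_class = table.get(pair)
    if substitution_class is None:
        return None
    return CLASS_COLOR[substitution_class]
```

Under these assumptions, the G/A of the tenth sequence maps to red in the hand alignment, while the G/U chosen by the algorithm maps to purple, which matches the description above.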

The seed alignment used sequences from atomic resolution structures. A much larger library of sequences can be formed by gathering their homologues from available sequence alignments. Archaeal homologues (identified by the helix/internal loop numbers indicated in Figure 46) of the “tWH-IL-tHS” motif were extracted from the RFAM 23S seed alignment (Griffiths-Jones et al. 2003; Griffiths-Jones et al. 2005); gaps were then removed and the sequences automatically re-aligned. Figure 50 shows the sequences before and after automatic re-alignment, demonstrating the promising success of this method (as indicated by Ribostral Sequence Viewer colors).

Header rows of the figure, as extracted (Hm 23S (IL40a)):

Universal Numbers UN 1...|....10...|....20 1...|....10...|....20
Local Numbers LN ...-...... 1...... 1--..-..
13D Directional Mask M13 AS.-GA....b.htb.. AS.GA....b--.h-tb
3D Mask [brackets not nested] M3 ((.-((....).))).. ((.((....)--.)-))
2D Mask (nested) M2 ((.-((....).))).. ((.((....)--.)-))

Hm IL40a (1099-1095,1261-1257), sequences 1-23

No.  Before alignment   After alignment    Organism
  1  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Acidianus_brierleyi
  2  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Acidianus_infernus
  3  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Aeropyrum_pernix
  4  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Desulfurococcus_mobilis
  5  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Stygiolobus_azoricus
  6  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Sulfolobus_acidocaldarius
  7  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Sulfolobus_shibatae
  8  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Sulfolobus_solfataricus
  9  GAA-UU....AGAGC    GAAUU....A--GA-GC  Crenarchaeot_Thermofilum_pendens
 10  GAA-UU....AUAGC    GAAUU....A--UA-GC  Euryarchaeot_Archaeoglobus_fulgidus
 11  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Haloarcula_marismortui (HAND CHECKED AM)
 12  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Halobacterium_halobium
 13  GAG-UU....AUAGC    GAGUU....A--UA-GC  Euryarchaeot_Halococcus_morrhuae
 14  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Haloferax_mediterranei
 15  GAA-UU....AUAGC    GAAUU....A--UA-GC  Euryarchaeot_Methanobacterium_thermoautotrop
 16  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Methanococcus_jannaschii
 17  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Methanococcus_vannielii
 18  UAA-UU....AUAGA    UAAUU....A--UA-GA  Euryarchaeot_Methanospirillum_hungatei
 19  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Natronobacterium_magadii
 20  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Pyrococcus_abyssi
 21  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Pyrococcus_horikoshii
 22  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Thermococcus_celer
 23  GAA-UU....AGAGC    GAAUU....A--GA-GC  Euryarchaeot_Thermoplasma_acidophilum

Hm IL28 (667-663,683-679), sequences 24-47

No.  Before alignment   After alignment    Organism
 24  GAAUC....GAAGC     GAAUC....G--AA-GC  Crenarchaeot_Acidianus_brierleyi
 25  GAAUC....GAAGC     GAAUC....G--AA-GC  Crenarchaeot_Acidianus_infernus
 26  GAAUU....GAUGC     GAAUU....G---AUGC  Crenarchaeot_Aeropyrum_pernix
 27  GAAUC....GAGGC     GAAUC....G---AGGC  Crenarchaeot_Desulfurococcus_mobilis
 28  GAAUC....GAGGC     GAAUC....G---AGGC  Crenarchaeot_Pyrobaculum_islandicum
 29  GAAUC....GAAGC     GAAUC....G--AA-GC  Crenarchaeot_Stygiolobus_azoricus
 30  GAAUU....GAAGC     GAAUU....G--AA-GC  Crenarchaeot_Sulfolobus_acidocaldarius
 31  GAAUC....GAAGC     GAAUC....G--AA-GC  Crenarchaeot_Sulfolobus_shibatae
 32  GAAUU....GAAGC     GAAUU....G--AA-GC  Crenarchaeot_Sulfolobus_solfataricus
 33  GAAUC....GAAGC     GAAUC....G--AA-GC  Crenarchaeot_Thermofilum_pendens
 34  GAAUC....GAAGC     GAAUC....G--AA-GC  Euryarchaeot_Archaeoglobus_fulgidus
 35  CAAUC....GAGGG     CAAUC....G---AGGG  Euryarchaeot_Haloarcula_marismortui (HAND CHECKED AM)
 36  CAAUC....GAAGG     CAAUC....G--AA-GG  Euryarchaeot_Halobacterium_halobium
 37  -AAUC....GAAGG     AA-UC....G--AA-GG  Euryarchaeot_Halococcus_morrhuae
 38  GAAUC....GAGGC     GAAUC....G---AGGC  Euryarchaeot_Haloferax_mediterranei
 39  GAAUC....GAAGU     GAAUC....G--AA-GU  Euryarchaeot_Methanobacterium_thermoautotrop
 40  GAAUC....GAGGC     GAAUC....G---AGGC  Euryarchaeot_Methanococcus_jannaschii
 41  GAAUU....GAAGC     GAAUU....G--AA-GC  Euryarchaeot_Methanococcus_vannielii
 42  UAAUU....GAAGA     UAAUU....G--AA-GA  Euryarchaeot_Methanospirillum_hungatei
 43  CAAUC....GAAGG     CAAUC....G--AA-GG  Euryarchaeot_Natronobacterium_magadii
 44  GAAUU....GAUGC     GAAUU....G---AUGC  Euryarchaeot_Pyrococcus_abyssi
 45  GAAUU....GAUGC     GAAUU....G---AUGC  Euryarchaeot_Pyrococcus_horikoshii
 46  GAAUU....GAAGC     GAAUU....G--AA-GC  Euryarchaeot_Thermococcus_celer
 47  UAAUU....GAUGA     UAAUU....G---AUGA  Euryarchaeot_Thermoplasma_acidophilum

Hm IL52a (1460-1456,1489-1483), sequences 48-71

No.  Before alignment   After alignment    Organism
 48  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Acidianus_brierleyi
 49  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Acidianus_infernus
 50  CAAUC....G-GAAG-G  CAAUC....G-GAA-GG  Crenarchaeot_Aeropyrum_pernix
 51  CAAUC....G-UAAG-G  CAAUC....G-UAA-GG  Crenarchaeot_Desulfurococcus_mobilis
 52  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Pyrobaculum_islandicum
 53  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Stygiolobus_azoricus
 54  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Sulfolobus_acidocaldarius
 55  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Sulfolobus_shibatae
 56  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Sulfolobus_solfataricus
 57  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Crenarchaeot_Thermofilum_pendens
 58  GAAUC....GAAAAG-C  GAAUC....GAAAA-GC  Euryarchaeot_Archaeoglobus_fulgidus
 59  GAAUC....GUAAAG-C  GAAUC....GUAAA-GC  Euryarchaeot_Haloarcula_marismortui (HAND CHECKED AM)
 60  GAAUC....G-AAAA-C  GAAUC....G-AAA-AC  Euryarchaeot_Halobacterium_halobium
 61  GACUC....GCUAAG-C  GACUC....GCUAA-GC  Euryarchaeot_Halococcus_morrhuae
 62  GAAUC....GA-AAG-U  GAAUC....G-AAA-GU  Euryarchaeot_Haloferax_mediterranei
 63  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Euryarchaeot_Methanobacterium_thermoautotrop
 64  CAAUC....G-AAAG-G  CAAUC....G-AAA-GG  Euryarchaeot_Methanococcus_jannaschii
 65  CAAUC....G-AAAG-G  CAAUC....G-AAA-GG  Euryarchaeot_Methanococcus_vannielii
 66  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Euryarchaeot_Methanospirillum_hungatei
 67  GAAUC....G-ACAA-C  GAAUC....G-ACA-AC  Euryarchaeot_Natronobacterium_magadii
 68  CAAUC....G-AAAG-G  CAAUC....G-AAA-GG  Euryarchaeot_Pyrococcus_abyssi
 69  CAAUC....G-AAAG-G  CAAUC....G-AAA-GG  Euryarchaeot_Pyrococcus_horikoshii
 70  GAAUC....G-AAAG-C  GAAUC....G-AAA-GC  Euryarchaeot_Thermococcus_celer
 71  GAAUC....G-GAAG-C  GAAUC....G-GAA-GC  Euryarchaeot_Thermoplasma_acidophilum

Hm IL97 (2777-2773,2801-2797), sequences 72-95

No.  Before alignment   After alignment    Organism
 72  UAAUG....CA-AGA    UAAUG....C--AA-GA  Crenarchaeot_Acidianus_brierleyi
 73  UAAUG....CA-AGA    UAAUG....C--AA-GA  Crenarchaeot_Acidianus_infernus
 74  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Aeropyrum_pernix
 75  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Desulfurococcus_mobilis
 76  CAAUA....GA-AGG    CAAUA....G--AA-GG  Crenarchaeot_Pyrobaculum_islandicum
 77  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Stygiolobus_azoricus
 78  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Sulfolobus_acidocaldarius
 79  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Sulfolobus_shibatae
 80  CAAUG....CA-AGG    CAAUG....C--AA-GG  Crenarchaeot_Sulfolobus_solfataricus
 81  CAAUG....AA-AGG    CAAUG....A--AA-GG  Crenarchaeot_Thermofilum_pendens
 82  AAAUU....AA-AGU    AAAUU....A--AA-GU  Euryarchaeot_Archaeoglobus_fulgidus
 83  GAAUG....AA-AGC    GAAUG....A--AA-GC  Euryarchaeot_Haloarcula_marismortui (HAND CHECKED AM)
 84  GAAUG....AA-AGC    GAAUG....A--AA-GC  Euryarchaeot_Halobacterium_halobium
 85  GAAUG....AA-AGC    GAAUG....A--AA-GC  Euryarchaeot_Halococcus_morrhuae
 86  GAAUG....AA-AGC    GAAUG....A--AA-GC  Euryarchaeot_Haloferax_mediterranei
 87  AAAUA....GG-AGU    AAAUA....G--GA-GU  Euryarchaeot_Methanobacterium_thermoautotrop
 88  GAAUA....AA-AGC    GAAUA....A--AA-GC  Euryarchaeot_Methanococcus_jannaschii
 89  GAAUA....AA-AGC    GAAUA....A--AA-GC  Euryarchaeot_Methanococcus_vannielii
 90  AAAUA....AA-AGU    AAAUA....A--AA-GU  Euryarchaeot_Methanospirillum_hungatei
 91  GAAUG....AA-AGC    GAAUG....A--AA-GC  Euryarchaeot_Natronobacterium_magadii
 92  GAAUA....GA-AGC    GAAUA....G--AA-GC  Euryarchaeot_Pyrococcus_abyssi
 93  GAAUA....GA-AGC    GAAUA....G--AA-GC  Euryarchaeot_Pyrococcus_horikoshii
 94  GAAUA....GA-AGC    GAAUA....G--AA-GC  Euryarchaeot_Thermococcus_celer
 95  GAAUA....GA-AGC    GAAUA....G--AA-GC  Euryarchaeot_Thermoplasma_acidophilum

Figure 50. “tWH-IL-tHS” motif homologues aligned with the hybrid algorithm (right). The left side shows the original sequences. Color codes are the same as the previous figure.
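The gap-removal step that precedes automatic re-alignment can be sketched in a few lines of Python. This is an illustration rather than the program used in this work. It assumes that "-" is the gap character and that the run of dots in each row of Figure 50 separates the two strands of the internal loop and is therefore kept. The helper names `strip_gaps` and `same_residues` are hypothetical.

```python
def strip_gaps(aligned, gap="-"):
    """Remove gap characters from one aligned row, keeping everything else."""
    return aligned.replace(gap, "")


def same_residues(before, after, gap="-"):
    """True if two aligned rows contain the same ungapped sequence."""
    return strip_gaps(before, gap) == strip_gaps(after, gap)


# Row 1 of Figure 50, before and after automatic re-alignment.
before_row = "GAA-UU....AGAGC"
after_row = "GAAUU....A--GA-GC"
ungapped = strip_gaps(before_row)
unchanged = same_residues(before_row, after_row)
```

A check such as `same_residues` is a simple sanity test for any re-alignment: the procedure should move gaps, not nucleotides.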

SCFG/MRF Discrimination Power

Similarly, a SCFG/MRF profile was built for the kink turn motif present in Ec 16S H23b (KT23b). Homologous sequences of the “tWH-IL-tHS” motif (from Figure 50) and homologous sequences of two such kink turns, one from 16S rRNA (KT23b, Ec numbers 683-688, 699-707) and one from 23S rRNA (23S KT7, Hm numbers 76-81, 93-101), were scored according to both SCFG/MRF models. These two types of motifs are structurally different, and their sequences have different lengths (Figure 51).

Figure 51. Representations of “tWH-IL-tHS” motif and kink turn motif. A SCFG/MRF model was built for each of these motifs, then their sequences were scored according to both models. The results are shown in Figure 52.

Figure 52 is a scatter plot of the log probabilities (scores assigned by the SCFG/MRF algorithm) of each group of sequences according to each model. These scores describe how closely the sequences fit the model upon their best alignment to it. Gaps and forbidden substitutions decrease the score, and isosteric and nearly-isosteric substitutions increase it, similar to the score formula discussed in Chapter II (page 45).
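As a rough illustration of how a log-probability score responds to unfavorable columns, the Python sketch below sums the logarithms of per-column probabilities. It is not the SCFG/MRF scoring procedure or the Chapter II formula. The function name `log_score` and all probability values are hypothetical, and the columns are treated as independent purely for simplicity.

```python
import math


def log_score(column_probs):
    """Sum of natural logs of per-column probabilities (log of their product)."""
    if any(p <= 0.0 or p > 1.0 for p in column_probs):
        raise ValueError("each probability must be in (0, 1]")
    return sum(math.log(p) for p in column_probs)


# Hypothetical per-column probabilities for one sequence under one model.
well_fitting = [0.9, 0.8, 0.9]     # e.g. favorable substitutions throughout
poorly_fitting = [0.9, 0.05, 0.9]  # e.g. one gap or forbidden substitution

good_score = log_score(well_fitting)
bad_score = log_score(poorly_fitting)
```

Under these assumed values, a single low-probability column is enough to pull the total well below that of the well-fitting sequence, which is the qualitative behavior described above.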

Figure 52. Scatter plot of sequences scored according to two motif models. Some of the sequences are known (from structure) to be “tWH-IL-tHS” motifs, and others are known to be kink turns. The SCFG/MRF parser scored the sequences by how well they can be aligned according to either model. Clearly, the scores were able to discriminate well between the two groups that clustered into two different areas of the plot. Notice that the number of points is less than the number of sequences (95 seqs for “tWH-IL-tHS” motif and 104 seqs for kink-turn motif) because some sequences scored equally and were plotted on top of each other.

Figure 52 shows that the sequences belonging to the two motifs clustered into two distinct groups when they were scored by the two SCFG/MRF models. This indicates that the method has strong discriminating power: it can be used not only to align sequences that are known to belong to a certain structural family, but also to identify sequences that fold into different structures by scoring them against different known structural motifs.
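One simple way to turn such scores into a motif assignment is to pick the model under which a sequence scores highest. The Python sketch below shows this decision rule only. The function name `assign_motif` and the score values are hypothetical, and comparing raw scores across two models of different length is itself an assumption; the discrimination reported here rests on the two-dimensional clustering of Figure 52, not on this rule.

```python
def assign_motif(scores_by_model):
    """Return the name of the model with the highest log-probability score."""
    if not scores_by_model:
        raise ValueError("no scores given")
    return max(scores_by_model, key=scores_by_model.get)


# Hypothetical log-probability scores of two sequences under both models.
seq_a = {"tWH-IL-tHS": -6.2, "kink-turn": -19.5}
seq_b = {"tWH-IL-tHS": -22.1, "kink-turn": -8.4}

label_a = assign_motif(seq_a)
label_b = assign_motif(seq_b)
```

With the assumed scores, the first sequence is assigned to the “tWH-IL-tHS” model and the second to the kink-turn model.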

Conclusion and Summary

In this work, several aspects of building structural RNA alignments were discussed. First, new tools that superimpose 3D structural data onto sequence alignments were designed and tested. These tools, collectively called Ribostral, provide an objective measure of the quality of an alignment and facilitate its manual improvement based on structure. Second, specific tertiary interactions of the G/U wobble basepair were studied, and the knowledge gained was used to improve available sequence alignments. Beyond the information it provided, this study also established a new and successful methodology for studying and understanding structural interactions. We demonstrated that, by complementing structural data with phylogenetic and energetic data, structural evolution can be analyzed and understood more fully and accurately. Finally, we established a new and promising method for automatically building structural RNA alignments. This SCFG/MRF method is capable of dealing with crossing tertiary interactions without sacrificing execution time or memory, and is also capable of addressing all types of basepairs, taking into consideration their different patterns of isosteric substitutions. It proved successful in improving current sequence alignments, and showed the capability of differentiating between sequences based on their structural profiles.

REFERENCES

Adams, P. L., M. R. Stahley, A. B. Kosek, J. Wang and S. A. Strobel (2004). "Crystal structure

of a self-splicing group I intron with both exons." Nature 430(6995): 45-50.

Allain, F. H. and G. Varani (1995). "Divalent metal ion binding to a conserved wobble pair

defining the upstream site of cleavage of group I self-splicing introns." Nucleic Acids Res

23(3): 341-50.

Altman, S. (1984). "Aspects of biochemical catalysis." Cell 36(2): 237-9.

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman

(1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs." Nucleic Acids Res 25(17): 3389-402.

Aqvist, J. (1990). "Ion water interaction potentials derived from free-energy perturbation

simulations." Journal of Physical Chemistry 94(21): 8021-8024.

Auffinger, P. and E. Westhof (1999). "Singly and bifurcated hydrogen-bonded base-pairs in

tRNA anticodon hairpins and ribozymes." J Mol Biol 292(3): 467-83.

Ban, N., P. Nissen, J. Hansen, P. B. Moore and T. A. Steitz (2000). "The complete atomic

structure of the large ribosomal subunit at 2.4 Å resolution." Science 289(5481): 905-20.

Bass, B. L. and T. R. Cech (1984). "Specific interaction between the self-splicing RNA of

Tetrahymena and its guanosine substrate: implications for biological catalysis by RNA."

Nature 308(5962): 820-6.

Bateman, A., L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M.

Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats and S. R. Eddy

(2004). "The Pfam protein families database." Nucleic Acids Res 32(Database issue):

D138-41.

Berman, H. M., J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov

and P. E. Bourne (2000). "The Protein Data Bank." Nucleic Acids Res 28(1): 235-42.

Berman, H. M., C. Zardecki and J. Westbrook (1998). "The Nucleic Acid Database: A resource

for nucleic acid science." Acta Crystallogr D Biol Crystallogr 54(Pt 6 Pt 1): 1095-104.

Bevilacqua, P. C., T. S. Brown, D. Chadalavada, J. Lecomte, E. Moody and S. I. Nakano (2005).

"Linkage between proton binding and folding in RNA: implications for RNA catalysis."

Biochem Soc Trans 33(Pt 3): 466-70.

Brown, J. W., J. M. Nolan, E. S. Haas, M. A. Rubio, F. Major and N. R. Pace (1996).

"Comparative analysis of ribonuclease P RNA using gene sequences from natural

microbial populations reveals tertiary structural elements." Proc Natl Acad Sci U S A

93(7): 3001-6.

Burkard, M. E., R. Kierzek and D. H. Turner (1999). "Thermodynamics of unpaired terminal

nucleotides on short RNA helixes correlates with stacking at helix termini in larger

RNAs." J Mol Biol 290(5): 967-82.

Byles, R. H. (1972). "Limiting conditions for the operation of the probable mutation effect." Soc

Biol 19(1): 29-34.

Cannone, J. J., S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D'Souza, Y. Du, B. Feng, N.

Lin, L. V. Madabusi, K. M. Muller, N. Pande, Z. Shang, N. Yu and R. R. Gutell (2002).

"The comparative RNA web (CRW) site: an online database of comparative sequence

and structure information for ribosomal, intron, and other RNAs." BMC Bioinformatics

3(1): 2.

Case, D. A., D. A. Pearlman, J. W. Caldwell, T. E. Cheatham, III, J. Wang, W. S. Ross, C. L.

Simmerling, T. A. Darden, K. M. Merz, R. V. Stanton, A. L. Cheng, J. J. Vincent, M.

Crowley, V. Tsui, H. Gohlke, R. J. Radmer, Y. Duan, J. Pitera, I. Massova, G. L. Seibel,

U. C. Singh, P. K. Weiner and P. A. Kollman (2002). AMBER 7. San Francisco,

University of California San Francisco.

Cate, J. H., A. R. Gooding, E. Podell, K. Zhou, B. L. Golden, C. E. Kundrot, T. R. Cech and J.

A. Doudna (1996). "Crystal structure of a group I ribozyme domain: principles of RNA

packing." Science 273(5282): 1678-85.

Cech, T. R. and B. L. Bass (1986). "Biological catalysis by RNA." Annu Rev Biochem 55: 599-

629.

Chan, C. Y., C. E. Lawrence and Y. Ding (2005). "Structure clustering features on the Sfold

Web server." Bioinformatics 21(20): 3926-8.

Chang, K. Y., G. Varani, S. Bhattacharya, H. Choi and W. H. McClain (1999). "Correlation of

deformability at a tRNA recognition site and aminoacylation specificity." Proc Natl Acad

Sci U S A 96(21): 11764-9.

Cheatham, T. E., P. Cieplak and P. A. Kollman (1999). "A modified version of the Cornell et al.

force field with improved sugar pucker phases and helical repeat." Journal of

Biomolecular Structure & Dynamics 16(4): 845-862.

Chiu, D. K. and T. Kolodziejczak (1991). "Inferring consensus structure from nucleic acid

sequences." Comput Appl Biosci 7(3): 347-52.

Chomsky, N. (1956). "Three models for the description of language." IRE Transactions on

Information Theory 2: 113-124.

Chomsky, N. (1959). "On certain formal properties of grammars." Information and Control

2: 137-167.

Cornell, W. D., P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C.

Spellmeyer, T. Fox, J. W. Caldwell and P. A. Kollman (1995). "A 2nd generation force-

field for the simulation of proteins, nucleic-acids, and organic-molecules." Journal of the

American Chemical Society 117(19): 5179-5197.

Crick, F. H. (1966). "Codon--anticodon pairing: the wobble hypothesis." J Mol Biol 19(2): 548-

55.

Dayhoff, M. O., R. M. Schwartz and B. C. Orcutt (1978). A model of evolutionary change in

proteins. Atlas of Protein Sequence and Structure. M. O. Dayhoff, National Biomedical

Research Foundation. 5: 345-352.

DeRose, V. J. (2002). "Two decades of RNA catalysis." Chem Biol 9(9): 961-9.

Dima, R. I., C. Hyeon and D. Thirumalai (2005). "Extracting stacking interaction parameters for

RNA from the data set of native structures." J Mol Biol 347(1): 53-69.

Ding, Y. and C. E. Lawrence (2003). "A statistical sampling algorithm for RNA secondary

structure prediction." Nucleic Acids Res 31(24): 7280-301.

Doudna, J. A., B. P. Cormack and J. W. Szostak (1989). "RNA structure, not sequence,

determines the 5' splice-site specificity of a group I intron." Proc Natl Acad Sci U S A

86(19): 7402-6.

Draper, D. E., D. Grilley and A. M. Soto (2005). "Ions and RNA folding." Annu Rev Biophys

Biomol Struct 34: 221-43.

Durbin, R., S. Eddy, A. Krogh and G. Mitchison (2000). Biological Sequence Analysis,

Cambridge University Press.

Eddy, S. R. and R. Durbin (1994). "RNA sequence analysis using covariance models." Nucleic

Acids Res 22(11): 2079-88.

Essmann, U., L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen (1995). "A

Smooth Particle Mesh Ewald Method." Journal of Chemical Physics 103(19): 8577-

8593.

Frank, D. N., C. Adamidi, M. A. Ehringer, C. Pitulle and N. R. Pace (2000). "Phylogenetic-

comparative analysis of the eukaryal ribonuclease P RNA." RNA 6(12): 1895-904.

Frank, J. and R. K. Agrawal (2000). "A ratchet-like inter-subunit reorganization of the ribosome

during translocation." Nature 406(6793): 318-22.

Gagnon, M. G. and S. V. Steinberg (2002). "GU receptors of double helices mediate tRNA

movement in the ribosome." RNA 8(7): 873-7.

Gautheret, D., D. Konings and R. R. Gutell (1995). "G.U base pairing motifs in ribosomal

RNA." RNA 1(8): 807-14.

Gilbert, W. (1986). "The RNA world." Nature 319: 618.

Gillespie, J. J., C. H. McKenna, M. J. Yoder, R. R. Gutell, J. S. Johnston, J. Kathirithamby and

A. I. Cognato (2005). "Assessing the odd secondary structural properties of nuclear small

subunit ribosomal RNA sequences (18S) of the twisted-wing parasites (Insecta:

Strepsiptera)." Insect Mol Biol 14(6): 625-43.

Gotoh, O. (1982). "An improved algorithm for matching biological sequences." J Mol Biol

162(3): 705-8.

Griffiths-Jones, S., A. Bateman, M. Marshall, A. Khanna and S. R. Eddy (2003). "Rfam: an

RNA family database." Nucleic Acids Res 31(1): 439-41.

Griffiths-Jones, S., S. Moxon, M. Marshall, A. Khanna, S. R. Eddy and A. Bateman (2005).

"Rfam: annotating non-coding RNAs in complete genomes." Nucleic Acids Res 33: 121-

4.

Guerrier-Takada, C. and S. Altman (1984). "Catalytic activity of an RNA molecule prepared by

transcription in vitro." Science 223(4633): 285-6.

Guerrier-Takada, C., K. Gardiner, T. Marsh, N. Pace and S. Altman (1983). "The RNA moiety of

ribonuclease P is the catalytic subunit of the enzyme." Cell 35(3 Pt 2): 849-57.

Guex, N. and M. C. Peitsch (1997). "SWISS-MODEL and the Swiss-PdbViewer: an

environment for comparative protein modeling." Electrophoresis 18(15): 2714-23.

Gutell, R. R., N. Larsen and C. R. Woese (1994). "Lessons from an evolving rRNA: 16S and 23S

rRNA structures from a comparative perspective." Microbiol Rev 58(1): 10-26.

Gutell, R. R., A. Power, G. Z. Hertz, E. J. Putz and G. D. Stormo (1992). "Identifying constraints

on the higher-order structure of RNA: continued development and application of

comparative sequence analysis methods." Nucleic Acids Res 20(21): 5785-95.

Gutell, R. R., B. Weiser, C. R. Woese and H. F. Noller (1985). "Comparative anatomy of 16-S-

like ribosomal RNA." Prog Nucleic Acid Res Mol Biol 32: 155-216.

Hall, T. A. (1999). "BioEdit: a user-friendly biological sequence alignment editor and analysis

program for Windows 95/98/NT." Nucl. Acids. Symp. Ser. 41: 95-98.

Hansen, J. L., T. M. Schmeing, P. B. Moore and T. A. Steitz (2002). "Structural insights into

peptide bond formation." Proc Natl Acad Sci U S A 99(18): 11670-5.

Harms, J., F. Schluenzen, R. Zarivach, A. Bashan, S. Gat, I. Agmon, H. Bartels, F. Franceschi

and A. Yonath (2001). "High resolution structure of the large ribosomal subunit from a

mesophilic eubacterium." Cell 107(5): 679-88.

Harris, J. K., E. S. Haas, D. Williams, D. N. Frank and J. W. Brown (2001). "New insight into

RNase P RNA structure from comparative analysis of the archaeal RNA." RNA 7(2): 220-

32.

Henikoff, S. and J. G. Henikoff (1993). "Performance evaluation of amino acid substitution

matrices." Proteins 17(1): 49-61.

Higgs, P. G. (2000). "RNA secondary structure: physical and computational aspects." Q Rev

Biophys 33(3): 199-253.

Hirao, I. and K. Miura (1995). "[Unusual single-stranded DNA structures: extraordinarily

thermo-stable mini-hairpin]." Tanpakushitsu Kakusan Koso 40(10): 1583-91.

Hoogsteen, K. (1956). "Unit-cell dimensions and space group of 6-mercaptopurine

monohydrate." Nature 178(4529): 379.

Humphrey, W., A. Dalke and K. Schulten (1996). "VMD: Visual molecular dynamics." Journal

of Molecular Graphics 14(1): 33-38.

Iseni, F., F. Baudin, D. Garcin, J. B. Marq, R. W. Ruigrok and D. Kolakofsky (2002). "Chemical

modification of nucleotide bases and mRNA editing depend on hexamer or nucleoprotein

phase in Sendai virus nucleocapsids." RNA 8(8): 1056-67.

Jaeger, J. A., D. H. Turner and M. Zuker (1989). "Improved predictions of secondary structures

for RNA." Proc Natl Acad Sci U S A 86(20): 7706-10.

Jones, D. T. and M. B. Swindells (2002). "Getting the most from PSI-BLAST." Trends Biochem

Sci 27(3): 161-4.

Jones, D. T., W. R. Taylor and J. M. Thornton (1992). "The rapid generation of mutation data

matrices from protein sequences." Comput Appl Biosci 8(3): 275-82.

Jorgensen, W. L., J. Chandrasekhar, J. D. Madura, R. W. Impey and M. L. Klein (1983).

"Comparison of simple potential functions for simulating liquid water." Journal of

Chemical Physics 79(2): 926-935.

Jossinet, F. and E. Westhof (2005). "Sequence to Structure (S2S): display, manipulate and

interconnect RNA data from sequence to structure." Bioinformatics 21(15): 3320-1.

Jurafsky, D., C. Wooters, J. Segal, A. Stolcke, E. Fosler, G. Tajchman and N. Morgan (1995).

Using A Stochastic Context-Free Grammar As A Language Model For Speech

Recognition. Proc. ICASSP.

Kent, O., S. G. Chaulk and A. M. MacMillan (2000). "Kinetic analysis of the M1 RNA folding

pathway." J Mol Biol 304(5): 699-705.

Kirsebom, L. A. (2002). "RNase P RNA-mediated catalysis." Biochem Soc Trans 30(Pt 6): 1153-

8.

Klein, D. J., P. B. Moore and T. A. Steitz (2004). "The roles of ribosomal proteins in the

structure, assembly, and evolution of the large ribosomal subunit." J Mol Biol 340(1):

141-77.

Klein, D. J., T. M. Schmeing, P. B. Moore and T. A. Steitz (2001). "The kink-turn: a new RNA

secondary structure motif." EMBO J 20(15): 4214-21.

Knudsen, B. and J. Hein (2003). "Pfold: RNA secondary structure prediction using stochastic

context-free grammars." Nucleic Acids Res 31(13): 3423-8.

Konforti, B. B., D. L. Abramovitz, C. M. Duarte, A. Karpeisky, L. Beigelman and A. M. Pyle

(1998). "Ribozyme catalysis from the major groove of group II intron domain 5." Mol

Cell 1(3): 433-41.

Kruger, M. K., S. Pedersen, T. G. Hagervall and M. A. Sorensen (1998). "The modification of

the wobble base of tRNAGlu modulates the translation rate of glutamic acid codons in

vivo." J Mol Biol 284(3): 621-31.

Larsen, N. and C. Zwieb (1993). "The signal recognition particle database (SRPDB)." Nucleic

Acids Res 21(13): 3019-20.

Lemieux, S. and F. Major (2002). "RNA canonical and non-canonical base pairing types: a

recognition method and complete repertoire." Nucleic Acids Res 30(19): 4250-63.

Leontis, N. B., J. Stombaugh and E. Westhof (2002). "The non-Watson-Crick base pairs and

their associated isostericity matrices." Nucleic Acids Res 30(16): 3497-531.

Leontis, N. B. and E. Westhof (2001). "Geometric nomenclature and classification of RNA base

pairs." RNA 7(4): 499-512.

Lescoute, A., N. B. Leontis, C. Massire and E. Westhof (2005). "Recurrent structural RNA

motifs, Isostericity Matrices and sequence alignments." Nucleic Acids Res 33(8): 2395-

409.

Li, P. T., D. Collin, S. B. Smith, C. Bustamante and I. Tinoco, Jr. (2006). "Probing the

mechanical folding kinetics of TAR RNA by hopping, force-jump, and force-ramp

methods." Biophys J 90(1): 250-60.

Lilley, D. M. (2003). "The origins of RNA catalysis in ribozymes." Trends Biochem Sci 28(9):

495-501.

Marquez, S. M., J. K. Harris, S. T. Kelley, J. W. Brown, S. C. Dawson, E. C. Roberts and N. R.

Pace (2005). "Structural implications of novel diversity in eucaryal RNase P RNA." RNA

11(5): 739-51.

Masquida, B., C. Sauter and E. Westhof (1999). "A sulfate pocket formed by three GoU pairs in

the 0.97 Å resolution X-ray structure of a nonameric RNA." RNA 5(10): 1384-95.

Massire, C., L. Jaeger and E. Westhof (1997). "Phylogenetic evidence for a new tertiary

interaction in bacterial RNase P RNAs." RNA 3(6): 553-6.

Mathews, D. H. (2005). "Predicting a set of minimal free energy RNA secondary structures

common to two sequences." Bioinformatics 21(10): 2246-53.

Mathews, D. H., M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker and D. H. Turner (2004).

"Incorporating chemical modification constraints into a dynamic programming algorithm

for prediction of RNA secondary structure." Proc Natl Acad Sci U S A 101(19): 7287-92.

Mathews, D. H., J. Sabina, M. Zuker and D. H. Turner (1999). "Expanded sequence dependence

of thermodynamic parameters improves prediction of RNA secondary structure." J Mol

Biol 288(5): 911-40.

Mathews, D. H. and D. H. Turner (2002). "Dynalign: an algorithm for finding the secondary

structure common to two RNA sequences." J Mol Biol 317(2): 191-203.

Mathews, D. H. and D. H. Turner (2002). "Experimentally derived nearest-neighbor parameters

for the stability of RNA three- and four-way multibranch loops." Biochemistry 41(3):

869-80.

Mattick, J. S. (2003). "Challenging the dogma: the hidden layer of non-protein-coding RNAs in

complex organisms." Bioessays 25(10): 930-9.

Mattick, J. S. (2004). "The hidden genetic program of complex organisms." Sci Am 291(4): 60-7.

Mattick, J. S. (2005). "The functional genomics of noncoding RNA." Science 309(5740): 1527-8.

Maxwell, E. S. and M. J. Fournier (1995). "The small nucleolar RNAs." Annu Rev Biochem 64:

897-934.

McDowell, J. A., L. He, X. Chen and D. H. Turner (1997). "Investigation of the structural basis

for thermodynamic stabilities of tandem GU wobble pairs: NMR structures of

(rGGAGUUCC)2 and (rGGAUGUCC)2." Biochemistry 36(26): 8030-8.

McDowell, J. A. and D. H. Turner (1996). "Investigation of the structural basis for

thermodynamic stabilities of tandem GU mismatches: solution structure of

(rGAGGUCUC)2 by two- dimensional NMR and simulated annealing." Biochemistry

35(45): 14077-89.

McKeown, M. (1992). "Alternative mRNA splicing." Annu Rev Cell Biol 8: 133-55.

Mears, J. A., J. J. Cannone, S. M. Stagg, R. R. Gutell, R. K. Agrawal and S. C. Harvey (2002).

"Modeling a minimal ribosome based on comparative sequence analysis." J Mol Biol

321(2): 215-34.

Melefors, O. and M. W. Hentze (1993). "Translational regulation by mRNA/protein interactions

in eukaryotic cells: ferritin and beyond." Bioessays 15(2): 85-90.

Michel, F., M. Costa, C. Massire and E. Westhof (2000). "Modeling RNA tertiary structure from

patterns of sequence variation." Methods Enzymol 317: 491-510.

Michel, F. and E. Westhof (1990). "Modelling of the three-dimensional architecture of group I

catalytic introns based on comparative sequence analysis." J Mol Biol 216(3): 585-610.

Mokdad, A., M. V. Krasovska, J. Sponer and N. B. Leontis (2006). "Structural and evolutionary

classification of G/U wobble basepairs in the ribosome." Nucleic Acids Res 34(5): 1326-

41.

Moore, P. B. (2005). "Structural biology. A ribosomal coup: E. coli at last!" Science 310(5749):

793-5.

Murphy, F. V. t. and V. Ramakrishnan (2004). "Structure of a purine-purine wobble base pair in

the decoding center of the ribosome." Nat Struct Mol Biol 11(12): 1251-2.

Murphy, F. V. t., V. Ramakrishnan, A. Malkiewicz and P. F. Agris (2004). "The role of

modifications in codon discrimination by tRNA(Lys)UUU." Nat Struct Mol Biol 11(12):

1186-91.

Needleman, S. B. and C. D. Wunsch (1970). "A general method applicable to the search for

similarities in the amino acid sequence of two proteins." J Mol Biol 48(3): 443-53.

Neuwald, A. F. and A. Poleksic (2000). "PSI-BLAST searches using hidden markov models of

structural repeats: prediction of an unusual sliding DNA clamp and of beta-propellers in

UV-damaged DNA-binding protein." Nucleic Acids Res 28(18): 3570-80.

Nissen, P., J. Hansen, N. Ban, P. B. Moore and T. A. Steitz (2000). "The structural basis of

ribosome activity in peptide bond synthesis." Science 289(5481): 920-30.

Nissen, P., J. A. Ippolito, N. Ban, P. B. Moore and T. A. Steitz (2001). "RNA tertiary

interactions in the large ribosomal subunit: the A-minor motif." Proc Natl Acad Sci U S A

98(9): 4899-903.

Noller, H. F. (2005). "RNA structure: reading the ribosome." Science 309(5740): 1508-14.

Noller, H. F., V. Hoffarth and L. Zimniak (1992). "Unusual resistance of peptidyl transferase to

protein extraction procedures." Science 256(5062): 1416-9.

Nussinov, R. and A. B. Jacobson (1980). "Fast algorithm for predicting the secondary structure

of single-stranded RNA." Proc Natl Acad Sci U S A 77(11): 6309-13.

Ogle, J. M., D. E. Brodersen, W. M. Clemons, Jr., M. J. Tarry, A. P. Carter and V. Ramakrishnan

(2001). "Recognition of cognate transfer RNA by the 30S ribosomal subunit." Science

292(5518): 897-902.

Ogle, J. M., F. V. Murphy, M. J. Tarry and V. Ramakrishnan (2002). "Selection of tRNA by the

ribosome requires a transition from an open to a closed form." Cell 111(5): 721-32.

Onoa, B. and I. Tinoco, Jr. (2004). "RNA folding and unfolding." Curr Opin Struct Biol 14(3):

374-9.

Peltz, S. W. and A. Jacobson (1992). "mRNA stability: in trans-it." Curr Opin Cell Biol 4(6):

979-83.

Peracchi, A. (2005). "DNA catalysis: potential, limitations, open questions." Chembiochem 6(8):

1316-22.

Pitulle, C., C. Strehse, J. W. Brown and E. B. Breitschwerdt (2002). "Investigation of the

phylogenetic relationships within the genus Bartonella based on comparative sequence

analysis of the rnpB gene, 16S rDNA and 23S rDNA." Int J Syst Evol Microbiol 52(Pt 6):

2075-80.

Ramos, A. and G. Varani (1997). "Structure of the acceptor stem of Escherichia coli tRNA Ala:

role of the G3.U70 base pair in synthetase recognition." Nucleic Acids Res 25(11): 2083-90.

Ravasi, T., H. Suzuki, K. C. Pang, S. Katayama, M. Furuno, R. Okunishi, S. Fukuda, K. Ru, M.

C. Frith, M. M. Gongora, S. M. Grimmond, D. A. Hume, Y. Hayashizaki and J. S.

Mattick (2006). "Experimental validation of the regulated expression of large numbers of

non-coding RNAs from the mouse genome." Genome Res 16(1): 11-9.

Razga, F., J. Koca, J. Sponer and N. B. Leontis (2005). "Hinge-like motions in RNA kink-turns:

the role of the second A-minor motif and nominally unpaired bases." Biophys J 88(5):

3466-85.

Rivas, E. and S. R. Eddy (1999). "A dynamic programming algorithm for RNA structure

prediction including pseudoknots." J Mol Biol 285(5): 2053-68.

Rivas, E. and S. R. Eddy (2000). "The language of RNA: a formal grammar that includes

pseudoknots." Bioinformatics 16(4): 334-40.

Schmeing, T. M., P. B. Moore and T. A. Steitz (2003). "Structures of deacylated tRNA mimics

bound to the E site of the large ribosomal subunit." RNA 9(11): 1345-52.

Schmeing, T. M., A. C. Seila, J. L. Hansen, B. Freeborn, J. K. Soukup, S. A. Scaringe, S. A.

Strobel, P. B. Moore and T. A. Steitz (2002). "A pre-translocational intermediate in

protein synthesis observed in crystals of enzymatically active 50S subunits." Nat Struct

Biol 9(3): 225-30.

Schroeder, S. J. and D. H. Turner (2001). "Thermodynamic stabilities of internal loops with GU

closing pairs in RNA." Biochemistry 40(38): 11509-17.

Schuwirth, B. S., M. A. Borovinskaya, C. W. Hau, W. Zhang, A. Vila-Sanjurjo, J. M. Holton and

J. H. Cate (2005). "Structures of the bacterial ribosome at 3.5 Å resolution." Science

310(5749): 827-34.

Seeman, N. C. (1999). "DNA engineering and its application to nanotechnology." Trends

Biotechnol 17(11): 437-43.

Seeman, N. C. (2003). "DNA in a material world." Nature 421(6921): 427-31.

Seeman, N. C. (2005). "Structural DNA nanotechnology: an overview." Methods Mol Biol 303:

143-66.

Smith, T. F. and M. S. Waterman (1981). "Identification of common molecular subsequences." J

Mol Biol 147(1): 195-7.

Sonnhammer, E. L., S. R. Eddy, E. Birney, A. Bateman and R. Durbin (1998). "Pfam: multiple

sequence alignments and HMM-profiles of protein domains." Nucleic Acids Res 26(1):

320-2.

Sponer, J., J. Leszczynski and P. Hobza (2001). "Electronic properties, hydrogen bonding,

stacking, and cation binding of DNA and RNA bases." Biopolymers 61(1): 3-31.

Sponer, J., A. Mokdad, J. E. Sponer, N. Spackova, J. Leszczynski and N. B. Leontis (2003).

"Unique tertiary and neighbor interactions determine conservation patterns of Cis

Watson-Crick A/G base-pairs." J Mol Biol 330(5): 967-78.

Sprinzl, M., C. Horn, M. Brown, A. Ioudovitch and S. Steinberg (1998). "Compilation of tRNA

sequences and sequences of tRNA genes." Nucleic Acids Res 26(1): 148-53.

Strazewski, P., E. Biala, K. Gabriel and W. H. McClain (1999). "The relationship of

thermodynamic stability at a G x U recognition site to tRNA aminoacylation specificity."

RNA 5(11): 1490-4.

Sugimoto, N., R. Kierzek, S. M. Freier and D. H. Turner (1986). "Energetics of internal GU

mismatches in ribooligonucleotide helixes." Biochemistry 25(19): 5755-9.

Sumita, M., J. P. Desaulniers, Y. C. Chang, H. M. Chui, L. Clos, 2nd and C. S. Chow (2005).

"Effects of nucleotide substitution and modification on the stability and structure of helix

69 from 28S rRNA." RNA 11(9): 1420-9.

Takasu, A., K. Watanabe and G. Kawai (2002). "Classification of RNA structures based on

hydrogen bond and base-base stacking patterns: application for NMR structures." J

Biochem (Tokyo) 132(2): 211-5.

Tarn, W. Y. and J. A. Steitz (1997). "Pre-mRNA splicing: the discovery of a new spliceosome

doubles the challenge." Trends Biochem Sci 22(4): 132-7.

Thirumalai, D., N. Lee, S. A. Woodson and D. Klimov (2001). "Early events in RNA folding."

Annu Rev Phys Chem 52: 751-62.

Thompson, J. D., S. R. Holbrook, K. Katoh, P. Koehl, D. Moras, E. Westhof and O. Poch (2005).

"MAO: a Multiple Alignment Ontology for nucleic acid and protein sequences." Nucleic

Acids Res 33(13): 4164-71.

Tsai, H. Y., B. Masquida, R. Biswas, E. Westhof and V. Gopalan (2003). "Molecular modeling

of the three-dimensional structure of the bacterial RNase P holoenzyme." J Mol Biol

325(4): 661-75.

Ulrich, H., A. H. Martins and J. B. Pesquero (2004). "RNA and DNA aptamers in cytomics

analysis." Cytometry A 59(2): 220-31.

Van de Peer, Y., P. De Rijk, J. Wuyts, T. Winkelmans and R. De Wachter (2000). "The

European small subunit ribosomal RNA database." Nucleic Acids Res 28(1): 175-6.

Varani, G. and W. H. McClain (2000). "The G x U wobble base pair. A fundamental building

block of RNA structure crucial to RNA function in diverse biological systems." EMBO

Rep 1(1): 18-23.

Wang, J. M., P. Cieplak and P. A. Kollman (2000). "How well does a restrained electrostatic

potential (RESP) model perform in calculating conformational energies of organic and

biological molecules?" Journal of Computational Chemistry 21(12): 1049-1074.

Waugh, A., P. Gendron, R. Altman, J. W. Brown, D. Case, D. Gautheret, S. C. Harvey, N.

Leontis, J. Westbrook, E. Westhof, M. Zuker and F. Major (2002). "RNAML: a standard

syntax for exchanging RNA information." RNA 8(6): 707-17.

Westhof, E., P. Dumas and D. Moras (1985). "Crystallographic refinement of yeast aspartic acid

transfer RNA." J Mol Biol 184(1): 119-45.

Westhof, E. and C. Massire (2004). "Structural biology. Evolution of RNA architecture." Science

306(5693): 62-3.

Wimberly, B. T., D. E. Brodersen, W. M. Clemons, Jr., R. J. Morgan-Warren, A. P. Carter, C.

Vonrhein, T. Hartsch and V. Ramakrishnan (2000). "Structure of the 30S ribosomal

subunit." Nature 407(6802): 327-39.

Woodson, S. A. (2005). "Metal ions and RNA folding: a highly charged topic with a dynamic

future." Curr Opin Chem Biol 9(2): 104-9.

Wu, M. and I. Tinoco, Jr. (1998). "RNA folding causes secondary structure rearrangement."

Proc Natl Acad Sci U S A 95(20): 11555-60.

Wuyts, J., P. De Rijk, Y. Van de Peer, T. Winkelmans and R. De Wachter (2001). "The

European large subunit ribosomal RNA database." Nucleic Acids Res 29(1): 175-7.

Wuyts, J., G. Perriere and Y. Van De Peer (2004). "The European ribosomal RNA database."

Nucleic Acids Res 32(Database issue): D101-3.

Wuyts, J., Y. Van de Peer, T. Winkelmans and R. De Wachter (2002). "The European database

on small subunit ribosomal RNA." Nucleic Acids Res 30(1): 183-5.

Yaniv, M., A. Favre and B. G. Barrell (1969). "Structure of transfer RNA. Evidence for

interaction between two non-adjacent nucleotide residues in tRNA from Escherichia

coli." Nature 223(5213): 1331-3.

Yusupov, M. M., G. Z. Yusupova, A. Baucom, K. Lieberman, T. N. Earnest, J. H. Cate and H. F.

Noller (2001). "Crystal structure of the ribosome at 5.5 Å resolution." Science 292(5518):

883-96.

Zhu, J. and R. M. Wartell (1997). "The relative stabilities of base pair stacking interactions and

single mismatches in long RNA measured by temperature gradient gel electrophoresis."

Biochemistry 36(49): 15326-35.

Ziesche, S. M., A. D. Omer and P. P. Dennis (2004). "RNA-guided nucleotide modification of

ribosomal and non-ribosomal RNAs in Archaea." Mol Microbiol 54(4): 980-93.

Zuker, M. (1989). "Computer prediction of RNA structure." Methods Enzymol 180: 262-88.

Zuker, M. (2003). "Mfold web server for nucleic acid folding and hybridization prediction."

Nucleic Acids Res 31(13): 3406-15.

Zuker, M. and P. Stiegler (1981). "Optimal computer folding of large RNA sequences using

thermodynamics and auxiliary information." Nucleic Acids Res 9(1): 133-48.