
TOWARDS AUTOMATING STRUCTURAL ANALYSIS OF COMPLEX RNA MOLECULES AND SOME APPLICATIONS IN NANOTECHNOLOGY

Lorena G. Parlea

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

May 2015

Committee:

Neocles B. Leontis, Advisor

R. Marshall Wilson Graduate Faculty Representative

Craig L. Zirbel

Carol A. Heckman

George S. Bullerjahn

© 2015

Lorena G. Parlea

All Rights Reserved

ABSTRACT

Neocles B. Leontis, Advisor

RNA has emerged as a versatile and multi-faceted player in gene expression and the informational of living cells. RNA molecules can function by virtue of their sequences, storing or transmitting genetic information, as well as by forming complex three-dimensional (3D) structures that can bind specifically to proteins, small molecules or other RNA or DNA molecules to carry out diverse recognition functions, including chemical catalysis. As a result of revolutions in RNA 3D structure determination and high-throughput DNA and RNA sequencing, on-line databases are brimming with new structure and sequence data. The large amounts of new data are creating new challenges in data management, curation, search, visualization, and access. For example, ribosomes have been solved in many different functional states, with tRNAs variously bound to the A-, P-, or E-sites, or associated with different translation factors (i.e., initiation, elongation, termination or recycling factors) or antibiotics.

Detailed and accurate functional annotations are needed to enable focused database searches for specific states and bound ligands and to uncover new relationships regarding structure, function, and evolution of RNA molecules and their complexes. As large numbers of new structures are accumulating in databases faster than they can be manually annotated, automated annotation procedures need to be developed and deployed by databases such as the Nucleic Acid Database (NDB). In addition to annotation of individual structures, related structures must be identified, compared and clustered, and representative structures chosen for detailed analysis. To facilitate research, clustering should be dynamic and user driven.

The unifying theme of this dissertation is to develop new conceptual frameworks and analytical approaches to assess and improve the automated annotation and analysis of RNA 3D structures, and to connect these data to structural changes in functional state and evolutionary changes. More specifically, the dissertation focuses on analysis of 3D structures of ribosomal RNA (rRNA) from the large (LSU) and small (SSU) subunits of ribosomes, and the hairpin and internal loop motifs extracted from them. A manual analysis was carried out of all atomic-resolution 3D structures of the SSU of the bacterium T. thermophilus found in the NDB as of April 2014, structure equivalence class NR_4.0_81883.24, release 1.56 of ribosome structures posted in the NDB (see http://rna.bgsu.edu/rna3dhub/nrlist). Each structure was manually examined to determine the functional annotations for the state of the ribosome with all bound tRNAs, mRNAs and other factors. These data were combined in a single spreadsheet found at this link: http://tinyurl.com/16S-T-Thermophilus-summary.

NDB maintains a Non-redundant (NR) set of structures that is updated each week and used to construct the 3D Motif Atlas of RNA hairpin and internal loop motifs (http://rna.bgsu.edu/rna3dhub/motifs). To assess the quality of motif clustering in the Motif Atlas, links and meta-data were downloaded and analyzed for all motif instances in release 1.14 (http://rna.bgsu.edu/rna3dhub/motifs/release/IL/1.14). Overall, motifs found in conserved regions of the rRNAs were placed in the same motif groups by the automated clustering.

Information must flow between different parts of the ribosome to report on the state of one part that is relevant to another part. One hypothesis is that networks of RNA tertiary and quaternary interactions play a central role in the ribosome. As a first step to automatically detecting and comparing networks of RNA-RNA interactions in the ribosome, a new module was created for the FR3D (“Find RNA 3D”) suite of RNA analysis tools. This module can perform the analysis of large structures automatically.

To my Mother:

“Multumesc ca esti mama mea.” (“Thank you for being my mother.”)

ACKNOWLEDGMENTS

I want to thank my advisor, Dr. Neocles Leontis, for his mentorship, support and encouragement throughout my graduate studies. Thank you for believing in me. I also wish to thank all my committee members for all their expertise and time, and for agreeing to be part of my journey. I want to thank Dr. Zirbel for always being patient, for helping with writing the programs and sorting out the bugs. My thanks also go to Dr. Heckman for being a teacher, a mentor, a great boss and a friend. I would like to thank Dr. Bullerjahn as well for the time and the input given on my research.

I would like to acknowledge my colleagues Jesse Stombaugh for the advice and input in my research, Blake Sweeney for providing the meta-data for two of my projects, and Kirill Afonin for giving me the motivation to finish my dissertation.

I would like to thank my family for their unconditional love and support, even from a distance. Special thanks go to my mother and my sister, for everything they did to see me through graduate school, for being my cheerleaders, my critics, my friends, and my support system.

I would like to express my love and gratitude to Elena and Dmitry Khon, who became my family far away from home. Without them, I would not have been able to reach the finish line. Special thanks go to my friend Mina Coman, who was my avid supporter towards the end of my studies. Finally, I want to thank all the friends I made here, too many to name, who encouraged and helped me throughout. You brought color to my life and made this journey thrilling.

TABLE OF CONTENTS

I. INTRODUCTION

I.1. RNA history and its importance as a biological molecule

I.1.1. Historical overview of RNA research

I.1.2. The roles of RNA and RNA in biological processes

I.1.3. Methods for scrutinizing structure and function

I.1.4. RNA structural organization

I.1.5. Types of RNA

I.1.6. Types of RNA and the biological processes in which they are involved

I.1.6.1. RNA synthesis (Transcription)

I.1.6.2. mRNA processing

I.1.6.3. The RNAs involved in translation

I.1.6.4. The bacterial ribosome and rRNA

I.1.6.5. The bacterial ribosome and tRNA

I.1.6.6. Protein synthesis (Translation)

I.1.7. The structure of the ribosome

I.2. RNA Structural Motifs

I.2.1. Defining Motifs at Different Levels of Structure

I.2.1.1. Hierarchical architectures and folding of structured RNA molecules

I.2.1.2. Defining the modular units of RNA Structure

I.2.1.3. Modular and recurrent 3D motifs

I.2.1.4. Neutral substitutions in helices

I.2.2. Identifying, classifying and annotating interactions that stabilize RNA 3D motifs

I.2.2.1. Reduced representations of RNA 3D structure

I.2.2.2. Classification and Annotation of Base-pairing Interactions in RNA Structures

I.2.2.3. Annotation of secondary structures

I.2.2.4. Structure-neutral mutations in recurrent RNA 3D motifs

I.2.3. Defining recurrent 3D motifs and identifying them in structures

I.2.3.1. Classification of “loop” motifs

I.2.3.2. Defining and naming 3D motifs

I.2.3.3. Defining tertiary interaction motifs

I.2.4. Classification of motifs according to function

I.2.5. Conclusions

II. METHODS AND MATERIALS

III. RESULTS AND DISCUSSION

III.1. Structural elements in the bacterial ribosome

III.1.1. Motivation and overview

III.1.2. Annotations of long-range tertiary interactions

III.1.3. Helical elements vs. stacking structural elements

III.1.4. Defining the hierarchical structure of an RNA molecule

III.1.5. Structural elements in the ribosome

III.1.6. Assigning nucleotides to each level of hierarchical organization

III.1.7. Programs for calculating the interaction network

III.1.8. Visualization of results for 2J00 and 2J01

III.1.9. Conclusions

III.2. Towards automated extraction of key information of ribosome structures in the Nucleic Acids Databank (NDB)

III.3. Evaluation of classification of recurrent motifs in the ribosome structures by the RNA 3D Motif Atlas

REFERENCES

LIST OF FIGURES

I.1.4. Motifs in secondary structure

I.2.1. Base edges and Base-pair geometric isomerism

I.2.2. Schematic representations of geometric families and symbols for annotating structures

I.2.3. Triangle abstraction for RNA bases

I.2.4. Isosteric relationships between basepairs

I.2.5. Structurally similar hairpin loop motifs

I.2.6. Structural Definition of “GNRA” hairpin loop motif

I.2.7. Hairpin loops mediating tertiary interactions

I.2.8. Local vs. Composite motifs

I.2.9. Ribose zippers are tertiary interaction motifs composed of two Sugar-edge basepairs

I.2.10. Conserved tertiary interaction in 23S rRNA mediated by different motifs

II.1. Algorithm for searching tertiary interactions between structural elements in structured RNA molecules

III.1.2.1. Annotation of hairpin loops and their long range interactions in SSU 16S rRNA from E. coli, based on the structure 2AVY.pdb

III.1.2.2. Examples of annotated hairpin loops in of E. coli 23S rRNA

III.1.3. Helical elements vs. stacking structural elements in 5S rRNA

III.1.4. Levels of hierarchies in large RNA structures

III.1.5.1. Definition of helical stacking structural elements for 16S rRNA

III.1.5.2. Enlarged picture of helical stacking structural elements for T. thermophilus 16S rRNA

III.1.5.3. 16S rRNA of T. thermophilus (PDB file 2J00) colored by stacked structural elements

III.1.5.4. A snapshot of the table with structural elements summary of 16S rRNA T. thermophilus

III.1.5.5. Partition of nucleotides in T. thermophilus rRNA structures into structural elements

III.1.5.6. 23S T. thermophilus colored by stacked structural elements

III.1.5.7.a. A snapshot of the table with structural elements summary of 23S rRNA T. thermophilus (First half)

III.1.5.7.b. A snapshot of the table with structural elements summary of 23S rRNA T. thermophilus (Second half)

III.1.8.1. A snapshot of the interaction matrix in 70S T. thermophilus, PDB files 2J00 - 2J01

III.1.8.2. A snapshot of the interaction list in 16S T. thermophilus colored by interaction type

III.1.8.3. A picture of the interacting element e_1_2_3_28_29 with e_4_15 in 16S T. thermophilus (PDB file 1J00) colored by elements

III.1.8.4. tRNAs interactions with 16S rRNA in bacterial ribosome

III.1.8.5. tRNAs interactions with 23S rRNA in bacterial ribosome

III.1.8.6. Distant helical elements connected by long range interactions in 23S rRNA

III.1.8.7. Basepair annotations of long range interaction between distant helical elements

III.3.1. A snapshot of the Motif Atlas validation table

III.3.2. Comparative analysis of Motif Atlas classification of hairpin loops in the small subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

III.3.3. Comparative analysis of Motif Atlas classification of hairpin loops in the 5’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

III.3.4. Comparative analysis of Motif Atlas classification of hairpin loops in the 3’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

III.3.5. Comparative analysis of Motif Atlas classification of internal loops in the small subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

III.3.6. Comparative analysis of Motif Atlas classification of internal loops in the 5’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

III.3.7. Comparative analysis of Motif Atlas classification of internal loops in the 5’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae

LIST OF TABLES

I.6.1. The main roles of Translational factors in the prokaryotic ribosome

I.2.1. The Twelve Geometric Base-pair families

I.2.2. Spatial relationships between the geometric base-pair families

III.1.6. Nucleotides assignments to each level in 23S rRNA T. thermophilus

III.2.1. A snapshot of the Motif Atlas validation table posted electronically that contains functional annotations for all the NDB structures in the Equivalence class for the 30S ribosome subunit of T. thermophilus

III.2.2. Sources of data collected to functionally annotate structures of the T. thermophilus 30S ribosome

III.2.3. 16S PDB files chosen for analysis of tRNA-rRNA and tRNA-mRNA interactions

III.2.4. Summary of interactions between A-site tRNA and 16S rRNA in chosen files from 30S subunit

III.2.5. Summary of interactions between P-site tRNA and 16S rRNA in chosen files from 30S subunit

I. INTRODUCTION

I.1. RNA history and its importance as a biological molecule

I.1.1. Historical overview of RNA research

The saga of RNA began around 1866 in Germany, when Ernst Haeckel hypothesized that the nuclei might be the cellular particles responsible for transmitting heritable traits from parents to offspring. Haeckel was a scientist with a variety of biological interests, especially in zoology, and he described, illustrated and even named thousands of species (E. Haeckel, 1868). His scientific views were greatly influenced by Darwinian concepts after he read “On the Origin of Species” in 1864 (Darwin, 1859). Haeckel’s primary interest proved to be the embryology of vertebrates, which he used as evidence of inherited relationships based on identical of embryos from different species. Haeckel went as far as coining evolutionary terms like “stem cell”, “anthropogeny”, “ecology”, “ontogeny”, “phylum” and “phylogeny”, and postulating the recapitulation theory, “ontogeny recapitulates phylogeny” (E. Haeckel, 1947), while nevertheless retaining the Darwinian principles. Even though his theories were later shown to be not quite accurate, they brought the nucleus to the attention of the scientists of that period.

Friedrich Miescher, a Swiss biologist, first had the idea that one molecule from the nucleus might be responsible for the transmission of genetic information, and he developed surprising new theories about how this compound can encode genetic information (Akhtar, Schonthaler, Bron, & Dahm, 2008). Some of these ideas were also proven incorrect later on, but the concepts he proposed are surprisingly close to present-day concepts. His most fundamental discovery was of a substance in the nucleus of white blood cells which he called “nuclein”. His experimental methods were innovative for that time. He used different chemical approaches than his colleagues did to purify nuclei from leukocytes, and he analyzed their chemical composition. Miescher washed blood cells from bandages with salt solutions, including a solution of sodium sulfate. He filtered the cells and allowed the product to settle at the bottom of beakers, thus isolating purified nuclei from the cytoplasm. The nuclei were subjected to alkaline extraction and afterwards to acidification. The precipitate formed was the substance that Miescher called “nuclein”, in reference to its origin, the nucleus. The chemical analysis of this substance revealed that it contains phosphorus and nitrogen but not sulfur. Therefore he concluded that it was different from proteins (Dahm, 2010). Nuclein dissolved at basic pH and precipitated at acidic pH, which indicated that this newly discovered substance is acidic in nature, unlike proteins.

Thus the first step towards analyzing nucleic acids, and characterizing an organelle by its function and not by its appearance, was taken in 1869 by Friedrich Miescher. He proved that “nuclein” is different from the other biological molecules studied thus far, namely proteins, because of nuclein’s high phosphorus content.

This discovery was confirmed by his former mentor Felix Hoppe-Seyler (Hoppe-Seyler & Schummelfeder, 1946), who repeated all these experiments in order to be convinced of the new discovery. Miescher realized that nuclein is the substance that defines the properties of the nucleus, dictates its physiological activity, and “merit to be considered equal to proteins”. He even suggested the “possibility that [nuclein can be] distributed in the protoplasm which could be the precursor for some of the de novo formation of nuclei” (Miescher, 1957). Thus Miescher proved that “Nuclein really specifically belongs to the life of the nucleus” (Miescher, 1960). He proved that nuclein appears in different types of cells by analyzing the sperm of different species. His experiments showed that sperm cells contain very few molecules (DNA and the proteins associated with DNA). Miescher thought, however, that the yolk of chicken eggs contains a multitude of small nuclei and therefore a lot of nuclein, whereas a spermatozoon should contain very little nuclein, because spermatozoa are single cells. This diverted him from the conclusion that one substance, nuclein, can be solely responsible for transmitting genetic traits and for producing the differences between individuals, races and species. Miescher also thought that mistakes (mutations) accumulated during a lifetime can be corrected through sexual reproduction. He hypothesized that genetic traits are transmitted with the help of one or very few complex molecules and not through many molecules for each trait, as was thought at that time. He predicted that the diversity of heritable traits among individuals and species is driven by a large number of arrangements of a simple code, rather than a large number of codes. But the model he proposed was based on the diversity of amino acids not of the . He presumed that for a protein with 40 asymmetric carbons (aka amino acids) there are 2^40 possible arrangements and thus as many isomers. This wrong assumption brought him to the conclusion “that not in a specific substance can the mystery of fertilization be concealed” (Dahm, 2005).
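As a rough worked check of the isomer count in the preceding paragraph, and assuming it is meant as two possible configurations for each of the 40 asymmetric carbons (an assumption; the formula is not spelled out in the text), the number of stereoisomers N would be:

```latex
N = 2^{40} = 1\,099\,511\,627\,776 \approx 1.1 \times 10^{12}
```

Under that reading, a single protein of this size would already offer on the order of a trillion variants, which helps explain why proteins looked like the more plausible carriers of hereditary diversity at the time.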

Leopold Auerbach, a German anatomist, proved in 1874 (Auerbach, 1870) that embryos developed from the nuclei of fertilized eggs fused with each other. Oscar Hertwig, a student of Miescher, confirmed in 1876 that the nuclei from the spermatozoon and the oocyte have to merge before the embryo can develop. The fact that the embryo’s nucleus comes from the nuclei of the gametes established the continuity of genetic information from parents to offspring. It was therefore concluded that the nucleus must be the organelle with the ability to transfer genetic information, making nuclein the crucial component in the transmission of heritable traits.

In 1881, Eduard Zacharias made the connection between chromatin and nuclein by observing that chromatin dissolved in basic pH solutions. Shortly after, in 1882, Walter Flemming concluded that nuclein and chromatin are the same substance (Hardy & Zacharias, 2009). However, it was not believed that a single type of molecule could be responsible for genetic inheritance, since proteins were favored as carriers of this role.

Albrecht Kossel became Hoppe-Seyler’s student in 1872, but he finished his medical degree at another German institute. In 1877 he returned to work in Hoppe-Seyler’s department as a research assistant. Kossel shared Hoppe-Seyler’s interest in nuclein, an interest also held by Hoppe-Seyler’s former mentee and collaborator Friedrich Miescher. Kossel carried out more detailed research on the chemical composition of nuclein extracted from the thymus of animals.

He was the first to demonstrate that nuclein has a protein fraction and a non-protein fraction. He used hydrolysis and a few other chemical means to characterize these nucleins (nucleoproteins), and he separated the proteins from the non-protein part. He found that the non-protein component contains bases (adenine, guanine, cytosine, and thymine), carbohydrates, and phosphorus, and he reaffirmed the acidity of this component. The nuclein in yeast also contained bases (adenine, guanine, cytosine, and uracil), carbohydrates, and phosphorus. He studied the composition of proteins as well. He showed the peptide and peptone composition of proteins, predicting that proteins are polymers.

Kossel discovered histidine and studied protamine and the “hexone bases” (arginine, histidine, and lysine), and he developed a classical method to separate the hexone bases quantitatively. He also worked on the enzyme arginase, which hydrolyzes arginine into urea and ornithine. As a result of his biochemical studies of nuclein, Kossel concluded that this substance is made of protein and a part that is different from proteins. From the non-protein part, he isolated and described the organic compounds that made up nuclein (adenine, cytosine, guanine, and thymine).

In 1885 Kossel named “Adenine” from the Greek “aden”, meaning gland, according to the source of his samples, the pancreatic gland. “Cytosine” was likewise named after the source of the samples in which it was discovered, calf thymus (Kossel 1894). Kossel and co-workers proposed the structure of cytosine in 1903, and they solved its structure within the same year (Kossel 1903).

Guanine was first isolated in 1844 from sea bird excrement (Hitchings, 1944). Emil Fischer, a German synthetic chemist, worked on naturally occurring compounds. He characterized and named “Guanine” after the source of its discovery, “guano”, the excrement of sea birds (Fisher 1884). Fischer showed that adenine, guanine, uric acid, and xanthine are the same type of compound, derived from the same parent molecule, which he called “purine”. Fischer was able to synthesize purine and many artificial derivatives, and thus prove his theory, in 1898. His work on purines, as well as sugars, brought him a Nobel Prize in 1902. Uracil was isolated by hydrolysis a bit later, in 1900, from yeast.

Around that time, a more concrete definition of acids and bases, and of their association/dissociation equations, was provided by the Swedish physicist Svante Arrhenius (Arrhenius, 1908). Hence, the scientists of the 19th century assumed that nuclein was an acid which combined with those bases. Moreover, nuclein proved to be a weak acid on its own. Thus, “nuclein” became “nucleic acids”, a term coined by the German chemist Richard Altmann in 1899 (Altmann, 1949).

Phoebus Levene was an American scientist of Lithuanian origin who studied medicine but took an interest in biochemistry under the tutelage of Albrecht Kossel and Emil Fischer.

Levene was the one who proved, in 1909, that the carbohydrate component of nucleic acids was ribose (Levene, 1915). Twenty years later, he discovered a slightly different sugar in the thymus, which had one less oxygen, or was de-oxygenated, and he called the compound deoxyribose. Hence, the terminology for the two types of nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), as well as the term “nucleotide”, is attributed to him. He postulated that DNA is a string of nucleo-type units connected through phosphate, and called these units “nucleotides”. These phosphate groups form the “backbone” of the DNA. He proposed a structure which had the phosphate backbone directed towards the interior, with the bases facing outward. Because of DNA’s chemical simplicity, Levene rejected the idea that it could be the carrier of genetic information and that the variety of heritable traits could arise from the plainness of the nucleobases. Around 1910, he postulated that DNA is made of equal amounts of A, G, C, and T, an idea known as the “tetranucleotide hypothesis”. This theory proposed that DNA was a repeat of “tetranucleotides” and hence could not carry genetic information. Rather, the proteins in the chromosomes were taken to be the molecules responsible for transmitting genetic information. Thus the study of nucleic acids took a back seat until the 1940s, when Astbury observed structural features in nucleic acids similar to those in proteins.

In 1912, the discovery of X-ray crystallography offered an unprecedented technique to study biological molecules. William Lawrence Bragg stated the laws of crystallographic X-ray diffraction, based on the electromagnetic wave movement of the electron cloud around an atom when incident X-rays hit the atom. Bragg’s law made it possible to calculate the positions of atoms in a crystal lattice from the pattern of X-rays diffracted by the lattice. Bragg won the Nobel Prize in 1915 for his discovery (Bragg 1915). In 1948 he started working on the structure of proteins. He reported the structure of DNA in April and May of 1953, but his reports passed unnoticed. His discovery of X-ray crystallography opened new means of exploring the three-dimensional structures of nucleic acids.
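For reference, Bragg’s law is usually written in the form below. The symbols are not defined in the text above and are introduced here only for illustration: n is the diffraction order (a positive integer), λ is the X-ray wavelength, d is the spacing between adjacent lattice planes, and θ is the angle between the incident beam and those planes.

```latex
n\lambda = 2d\sin\theta
```

Given a known wavelength and a measured diffraction angle, the relation can be solved for d, which is how a diffraction pattern is converted into distances within the crystal.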

Meanwhile, not much progress was made in elucidating nucleic acid structures. There are, however, some highlights in the nucleic acids saga worth mentioning. In 1934, Desmond Bernal, a British X-ray crystallographer working under Bragg, took the first X-ray photograph of hydrated crystals of a biological molecule, pepsin (Bernal 1935). Wendell Stanley isolated nucleic acids from tobacco mosaic virus in 1937. Around 1938, the term molecular biology was coined by Warren Weaver (Belozersky, 1945). In 1935 Andrei Nikolayevich Belozersky isolated pure DNA for the first time, and in 1939 he experimentally proved that both DNA and RNA are always present in (Belozersky 1939). X-ray crystallography became an increasingly popular method for examining biological macromolecules and complexes.

William Thomas Astbury was an English molecular biologist, with expertise in X-rays, who studied the structure of fibrous proteins in the second quarter of the 20th century. Astbury was the first to observe the difference in the X-ray diffraction pattern of moist wool and hair fibers. The moist wool had un-stretched keratin proteins which formed a helix. The hair seemed to have an extended helix as a result of un-coiling. Astbury called them α-form and β-form respectively (Astbury, 1935). Astbury’s β-form helix was in fact proven later on to be a β-pleated sheet (Pauling, 1935). A decade later, Taylor and Huggins proposed a model of the α-helix which is very close to the modern model (Astbury, 1947). Astbury has to be credited with the hydrogen bonding concept as well, a crucial concept in biomolecule folding (Astbury, 1947). He proposed that hydrogen bonds between the backbone amine groups would stabilize the structure of proteins. In 1937, he took up a project studying DNA from thymus. The X-ray diffraction pattern revealed a regular structure of the DNA with 2.7 nanometer repeats, the flatness of the bases, and the 0.34 nanometer distance between the stacked bases. This distance was the same in the amino acids of a protein helix and the nucleobases in the “wet” form of nucleic acids. He stated that this “was not an arithmetical accident” (Astbury 1946 Symposium in Cambridge). But the data were not enough for Astbury to propose the correct structure of DNA.

Oswald Avery and his collaborators Colin MacLeod and Maclyn McCarty were the first research group to prove that DNA, as opposed to protein, is the molecule responsible for transmitting hereditary information, by demonstrating that changing the DNA changes bacteria (Avery, 1944). He did so by removing large cellular structures from S strain Streptococcus pneumoniae bacteria, and treating them afterwards with proteases (enzymes which digest proteins) and ribonucleases (enzymes responsible for the digestion of RNA). The newly modified R strain bacteria still transformed into a more virulent strain, which meant that some molecule other than protein is responsible for transmitting genetic information. In addition, R strain bacteria were treated with deoxyribonucleases, which digest DNA. The R strain bacteria did not transform after this treatment, which clearly indicated that DNA was the molecule responsible for carrying genetic information, the so-called “transforming principle”. Eight years later, Martha Chase and Alfred Hershey, by radioactively labeling first the sulfur (35S) of the proteins, and subsequently the phosphorus (32P) of the nucleic acids, proved that when bacteria are infected with bacteriophages, DNA enters the bacteria, but most proteins do not. Thus, they showed that DNA, not protein, must be the molecule responsible for transmitting genetic information (Dahm, 2005; Dahm, 2008).

In the history of nucleic acids, an important role was also played by the Austrian biochemist Erwin Chargaff, who looked at the composition of DNA in different organisms. He showed that Levene was not quite right in his “tetranucleotide hypothesis”: the nucleotides are not all present in equal amounts; rather, the amount of A is approximately equal to that of T, as is G to C (Chargaff, 1946; Chargaff, 1934). He also demonstrated that A-T and G-C are the same for both DNA strands (Chargaff, 1950). The rule did not apply to DNA from organelles, single-stranded viral DNA, or RNA. This discovery paved the way for Watson and Crick’s theory of base pairing in DNA. His research on various organisms revealed that the DNA composition varies among species, which was another big stepping stone in proving that DNA is the carrier of genetic information.
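The base-composition rule described above can be illustrated with a minimal sketch. The sequence below is hypothetical and chosen only for illustration, and the helper names are not taken from any tool used in this work. Counting bases over a duplex (a strand plus its Watson-Crick complement) gives A = T and G = C, while a single strand on its own need not satisfy either equality.

```python
from collections import Counter

# Watson-Crick partners used to build the complementary strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def base_counts(*strands: str) -> Counter:
    """Count A, C, G and T over one or more strands."""
    counts = Counter()
    for strand in strands:
        counts.update(strand)
    return counts


# Hypothetical single strand, chosen only for illustration.
strand = "ATGCGGATTAAGGCAATA"

single = base_counts(strand)
duplex = base_counts(strand, reverse_complement(strand))

print("single strand:", dict(single))
print("duplex:", dict(duplex))
```

In the duplex, the A and T counts match and the G and C counts match, but the A+T total differs from the G+C total, so the four bases are not present in equal amounts as the tetranucleotide hypothesis would require.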

One of the most influential scientists of the 20th century was . His work

covered a variety of topics in the fields of , biochemistry and .

Pauling’s scientific views were influenced by the quantum mechanics theorists like Niels Bohr,

Erwin Schrodinger, and Arnold Sommerfield, under which he studied. His research was centered

on studies of molecules at atomic structure, as he tried to decipher the physical laws governing

the atoms arrangement in molecules, the bond angles and the distances between atoms. In 1930

he built his own electron diffraction instrument in order to be able to study the atomic

composition of different molecules. Pauling introduced the concept of atomic orbital

hybridization. He studied both ionic bonding and covalent bonding. He also described aromatic

hydrocarbons, namely benzene, based on quantum mechanics principles, as hybrids intermediate between single- and double-bonded structures. He proposed the spheron model of atomic nuclei, in which “clusters of nucleons” are tightly packed. However, his description of the valence bond could not entirely explain some characteristics of certain molecules. He is credited with introducing the concept of

electronegativity (1932). His first biological molecule of interest was , but he later

studied the structure of different proteins. He modeled the structure of hemoglobin in a helical pattern, being the first to predict its biological 3D structure. He also proposed the correct atomic arrangement of the protein secondary-structure motif known as the β-pleated sheet, the first model in which the atoms did not clash. He suggested a triple-helix model for DNA, later proven incorrect (Pauling,

1953). In biochemistry, Pauling’s biggest contribution was the theory that enzymes stabilize the

transition state of biochemical reactions. He also proposed the structural complementarity of

antibodies to antigens. Pauling was the first to demonstrate that an abnormal protein can cause a

disease (Pauling, 1949). He believed that defective enzymes can cause mental illnesses. The idea

that vitamins might be associated with these diseases (Pauling, 1968) led to a whole movement

based on mega-vitamin therapy. His work predicted and elucidated structures of several

compounds, including biological molecules.

In 1947, Alexander Stokes joined the research unit at King’s College in

London, under the direction of John Randall. Randall led a team of scientists working on

elucidating the structure of DNA: , , and . Maurice

Wilkins and his Ph.D. student Raymond Gosling were analyzing DNA by X-ray diffraction. Rosalind

Franklin joined their lab in 1951 and started applying her expertise in X-ray diffraction and

physical chemistry to study DNA. Raymond Gosling was assigned to work with Franklin.

Stokes was asked by Wilkins to predict the X-ray diffraction pattern of a helix. Through mathematical calculation, he demonstrated that the helix should display a specific regular X-ray diffraction pattern (Wilkins, 1953).

Gosling and Franklin also noted that DNA has two forms, like the fibrous proteins,

which they called “A” (when the samples were dry, the DNA was short) and “B” (when the

DNA was wet, it was long), analogous to the way Astbury named the wet and dry protein

helices. The X-ray patterns indicated that the “B” form of DNA might be helical in structure. In 1952 Franklin and Gosling applied a type of Fourier transformation, proposed by Arthur Patterson (Patterson, 1935), and the results indicated that the B form of DNA is a double helix. Franklin’s manuscript for publication in Acta Crystallographica arrived just one day before Watson and Crick proposed the DNA model (Franklin, 1953).

Based on the X-ray diffraction images of DNA taken by Franklin, Francis Crick and James Watson started to build a model of the “B” form of DNA. The previous model of DNA structure proposed by Linus Pauling, based on Astbury’s data, was of a triple helix (Pauling, 1951).

Watson and Crick observed the helical pattern and the planar base stacking along the helical axis in the X-ray data taken by Franklin. Thus, in 1953 the correct structure of DNA as a double helix was proposed by Watson and Crick, which brought them a Nobel Prize in Physiology or Medicine (in 1962), along with Maurice Wilkins, who showed them Franklin’s data.

Unfortunately, Franklin had died four years before, and the Nobel Prize was not awarded posthumously. The April issue of Nature in 1953 published three papers on DNA structure:

Wilkins, Stokes, and Wilson; Franklin and Gosling; Crick and Watson. Watson and Crick’s model arranged the double strands of DNA in an anti-parallel fashion, with the charged phosphate backbone towards the exterior and the bases towards the interior. In their model, the hydrophilic phosphate could interact with water molecules, and the hydrophobic nucleobases were shielded in the interior of the helix. Based on Chargaff’s rule, Watson and Crick had the purines (A and G) forming base pairs with pyrimidines (T and C respectively). These base pairs are defined by the hydrogen bonding between the purines and pyrimidines, with two H-bonds between A and T and three H-bonds between C and G. This confers chemical stability to the base

pairs. At physiological pH, the phosphate groups are negatively charged, making nucleic acids polyanions, thus explaining the acidic property of “nuclein” (which was shown to dissolve in basic solutions and precipitate in acids).

Besides the “classical” nucleotide bases, several modified nucleobases were discovered or synthesized in labs. Some have methylated positions, others have a carbon saturated or replaced by nitrogen, or a carbonyl replaced by an amine or vice versa (Goodchild 1990).

Nowadays, more than a hundred modified nucleotides have been found to occur naturally. Their functions are not yet completely understood.

Several studies of RNA-containing complexes were made in the middle of the 20th century. The scientist who observed the complexes responsible for protein synthesis was the Romanian biologist George Palade in the mid-1950s (Palade, 1955). These microsome particles, made of nucleoproteins and lipids, were first observed under the microscope by Palade, but their name, “ribosomes”, was proposed a couple of years later, in 1958, by Richard B. Roberts (Roberts 1958).

Around that time, in 1957, the “central dogma” of molecular biology was stated by Francis Crick and George Gamow, who postulated that the DNA sequence specifies the protein’s sequence, stating also the unidirectional flow of genetic information from

DNA to RNA to proteins. This “sequence hypothesis” is still the basis of molecular biology. In the same year, Matthew Meselson and Frank Stahl demonstrated the replication mechanism of

DNA. Following that, the RNA molecule which would convey the message recorded in the

DNA was considered to be a “messenger” RNA, and thus mRNA (Geiduschek, 1969).

Later, it was shown experimentally that (radioactively labeled) amino acids were delivered to the ribosome by small RNAs that remained soluble during sedimentation procedures. These so-called “soluble RNAs” transfer individual amino acids to the growing polypeptide chain, and were eventually called “transfer RNA” (tRNA) (1976). Francis Crick was the first scientist to hypothesize the existence of such an adaptor molecule, in the early 1960s. tRNAs are amino acid-specific: each has its own set of enzymes that perform and proofread tRNA charging (attaching an amino acid to the tRNA). tRNAs comprise approximately 10% of the total RNA in a cell. An amino acid can have up to six different tRNAs, but each tRNA is specific to just one amino acid. The correct amino acid is delivered by the charged tRNA when its anticodon loop pairs complementarily with a codon in mRNA. Robert Holley published the first tRNA primary structure, along with three suggested secondary structures. This was the first ribonucleic acid sequence determined experimentally, an achievement that brought Holley a Nobel Prize. A decade later, two different research groups published the structure of yeast phenylalanine tRNA (Warner, 1963).

In 1961, Marshall Nirenberg synthesized a poly-U strand of mRNA and discovered that three U’s encode the amino acid phenylalanine, thus making the first step in deciphering the genetic code. Nirenberg and his collaborators Heinrich Matthaei and showed that three nucleotides (a “codon”) encode one amino acid just a few years later, in 1966 (Weinstein, 1966).

RNA was shown to be more than a carrier of the genetic code. Following the observation that RNA can form tertiary structures similar to those of proteins (Woese, 1966), Carl Woese, Francis Crick and Leslie Orgel hypothesized in 1967 that RNA can also act as an enzyme. The game changer for the comprehension of nucleic acid functions was the discovery of catalytic RNA simultaneously by Thomas Cech and Sidney Altman. In the late 1970s Cech’s group discovered a self-splicing intron in Tetrahymena thermophila (Kruger, 1982). Altman’s group discovered the existence of an RNA component in the RNase P enzyme. Moreover, this RNA was responsible for catalyzing the maturation of pre-tRNA into active tRNA (Altman 1980).

These properties of RNA led to the conclusion that RNA is the most viable candidate for the precursor molecule at the origin of life, given that it has both self-processing abilities and the capacity to carry genetic information (Woese 1967). Thus the RNA world hypothesis was proposed in 1986 by Walter Gilbert (Gilbert 1986), building on a concept advanced by Woese in 1967. The hypothesis was supported by some of the most prominent scientists of the 20th century. Furthermore, the discovery of reverse transcriptase, an RNA-dependent DNA polymerase, made even more plausible the hypothesis that RNA appeared evolutionarily before DNA and proteins. However, the instability of ribose might be a clue to the nature of a pre-RNA world, in which a molecule with both genetic information storage and catalytic capabilities would be the main actor. Since the original prebiotic molecule has to incorporate both genotype and phenotype, and moreover has to have catalytic functions along with information-carrying tasks, a derivative of RNA is the most viable candidate. Several different molecules that would incorporate genetic and catalytic properties are possible (Lazcano, 1996), among which pyranosyl RNA (pRNA) seems to be the most viable sugar-containing molecule. The six-membered sugar of pRNA would have the advantage of being more stable than ribose. Moreover, pRNA displays more selective base pairing, and the hybridization between two strands of pRNA was shown to be stronger than that between two RNA strands (Eschenmoser 1997). Another favorite among the suggested candidates seems to be peptide nucleic acid (PNA). PNA is a polymer of a glycine derivative (N-(2-aminoethyl)-glycine) with ethylenediamine monoacetic acid, and the bases are attached to the backbone by acetic acid (Nissen, 2001). PNA units are very likely prebiotic compounds, but it has not been shown that their polymers can form easily. The polymers created in the lab were shown to also bind strongly to DNA, with high specificity for base pairing.

However, there are several other possible prebiotic molecules, and the RNA we know nowadays might be the result of combining different prebiotic biopolymers (Miller 1996).

In the 1970s, research on DNA exploded, reaching new heights, from DNA cloning (California, 1972) to gene mutations and gene transfer from one organism to another, which eventually led to gene therapy. Protein cloning and laboratory production of human hormones (like insulin) opened the gates for creating vaccines and making them widely available to the public. The first chimeric mice were engineered in the 1970s (Brinster, 1974). DNA transfection between animals was successfully reported in the early 1980s (Gordon, 1981) in transgenic mice. The first artificial human chromosome was created in 1997 (Bokkelen 1997). From DNA cloning, the next goal was mammal cloning, which bore fruit in the birth of the sheep Dolly, the first mammal cloned from a somatic cell. Although applying gene therapy to the correction of genetic diseases in humans is still a long way off, the first steps towards this goal were taken a quarter of a century ago, by trying to elucidate the human genome. The project of mapping the whole human genome began in 1990 and was roughly finished in 2000, with the completion date in 2003, and data analysis still going on. However, RNA still has structural and functional aspects that have not yet been elucidated.

I.1.2. The roles of RNA and RNA nucleotides in biological processes

The utmost significance of ribonucleic acids is given by their chemical and functional properties. Chemically, RNA’s composition is almost identical to that of DNA, and thus RNA can store genetic information. Retroviruses have RNA as genetic material and they use it to transmit genetic traits. Diener and Raymer showed in 1967 that a free double-stranded RNA is the infectious agent in Potato Spindle Tuber Viroid (Raymer 1967). Functionally, due to their structure and catalytic properties, RNA molecules can perform enzymatic functions (Cech 1981; Kruger 1982; Guerrier-Takada 1983). Additionally, nucleotides or nucleotide derivatives are involved in key biological processes. Adenosine triphosphate (ATP) is the energy currency in metabolism and cellular respiration. ATP and GTP act as cofactors and regulators in various biological processes catalyzed by enzymes. Derivatives of adenosine, like NADH (Nicotinamide Adenine

Dinucleotide), FAD (Flavin Adenine Dinucleotide), Acetyl-CoA, are coenzymes involved in several steps of metabolism, aerobic respiration, citric acid cycle, electron transport chain, and oxidative phosphorylation.

I.1.3. Methods for scrutinizing RNA structure and function

RNA molecules have dual traits, resembling DNA in their molecular and chemical composition, while adopting globular structures and catalytic properties analogous to those of proteins. In bacteria, only about 2% of the DNA in the genome does not code for proteins (Mattick 2006); in humans, only 2% codes for proteins and 98% does not, although it generates perhaps as much as 50% RNA with actual functional roles. The functionality of some RNA molecules is due primarily to their tertiary structure (Hoogstraten 2007). Comprehending RNA architectural principles and the conformational changes that occur at the structural level during cellular processes can contribute to the elucidation of various molecular mechanisms, among which is protein synthesis.

The catalytic and functional roles of RNA molecules can be assessed through various methods. Biochemical methods include probing the structure enzymatically (Stern, 1988), studying the interactions by chemical footprinting (Brimacombe, 1991), Nucleotide Analog Interference Mapping (NAIM) (Strobel 1999; Das 2005), analyzing specific positions by site-directed mutagenesis (Kubarenko 2006; Triman 1999; Triman 2007) or by replacing nucleotides with chemically modified nucleotides (Eargle 2008; Piekna-Przybylska 2008; Konevega 2006).

Structurally, RNA can be studied through NMR, microscopy, X-ray crystallography and

homology modeling using comparative sequence analysis. Each of these methods contributes to

deciphering the structural information, although each one also has its own limitations.

An isotopically labeled molecule can have its structure solved using Nuclear Magnetic Resonance techniques if it is of moderate size (no bigger than 40 kDa) and highly soluble (Dorywalska 2005; Ermolenko 2007). Thus, RNA molecules of ~100 nucleotides can be investigated easily nowadays by NMR (Nonin-Lecomte 2006). Until recently, single-particle cryo-electron microscopy could resolve large complexes only at “low” resolution, ~7-12 Å (Agrawal 2000; Frank 2000; Wriggers, 2000; Mitra, 2006). Recent advances in this technique made it possible to visualize ribosome structures at resolutions as fine as 4 Å (Li 2008). Moreover,

Cryo-EM is a powerful technique for visualizing the dynamics of large molecules and the overall

conformational changes that occur during molecular processes (Gao, 2003; Myasnikov, 2005;

Mitra, 2006; Agrawal, 2000). This method is complemented by crystallographic studies when

available. X-ray crystallography does not have a limit in the size of complexes that it can resolve,

and can produce accurate results down to 1 Å resolution for small complexes, but resolution this

good is not common with big structures. The most delicate part of this method is obtaining a

pure, clean crystal of the complex in question (Korostelev, 2006).

The accuracy of homology modeling depends on the quality of the alignment between the

query sequence and the template sequence, and thus the availability of an adequate template

sequence. The disadvantage of this method is that it needs high homology between the compared structures, which limits its value. This dissertation explores the basis for homology modeling, examining what is similar between rRNAs of different organisms.

X-ray crystallography is the method of choice for structural biology, the structure-based

approach undertaken towards understanding the molecular mechanisms of biological processes.

The rapidly expanding database of RNA X-ray structures at high resolution delivers a powerful

tool for investigating RNA organizational and functional principles. Tertiary interactions mediate

and stabilize RNA folding, structure and, implicitly, function. Hundreds of ribosome structures

were published in the past decade, making it possible to study RNA down to atomic level. These

X-ray structures provide details about the interactions within the RNA, and about the tertiary and quaternary contacts between the architectural modules of RNA, and between RNA and proteins, that form the operational ribosome. Moreover, a mapping of these contacts that occur between

ribosomal RNA modular motifs can provide a clear picture of the way the ribosome is working at

the molecular level. The increasing number of X-ray structures presents a challenge for researchers to analyze and correlate structural changes with functional states. Hence, there is a

need for computer-assisted automation of the structure analysis process.

I.1.4. RNA structural organization

Ribonucleic acids have a chemical composition almost identical to that of deoxyribonucleic acids. The 3D structures described in the previous section show that they also adopt three-dimensional folds and have some functions similar to those of proteins. Like DNA, RNA is a polymer, with each unit made of a phosphate, a sugar, and a base. The chemical differences between DNA and RNA lie in the absence of the hydroxyl group at the 2’ position of the DNA sugar and in the methylation of uracil at C5 in DNA, the methylated base being called thymine. Structurally, RNA’s single-stranded nature gives the molecule the ability to fold on itself and form three-dimensional structures. With the 2’-OH group present, RNA helices tend to adopt the A-form geometry rather than the B-form helix common in DNA. The hydroxyl group also provides additional possibilities for hydrogen bonding between bases and 3’-end ribose, and further stabilization of the folded structures.

RNA folds hierarchically into globular structures similar to proteins. The structure of an RNA molecule can be analyzed at four organizational levels: sequence, secondary structure, 3D conformation, and quaternary assembly. An RNA molecule is a sequence of four nucleotides, A’s, C’s, G’s and U’s, covalently linked 5’-to-3’ through phosphodiester bonds. The nucleotide chain folds back on itself in a precise manner, so that the purine nucleotides (A and G) pair complementarily with pyrimidines (U and C, respectively) via hydrogen bonding. The resulting secondary structure outlines the skeleton of a complex RNA molecule: the complementary-paired regions, or helices, are the frames of the 3D structure. The “joints” of these frames are single-stranded regions, called “loops”. Thus, the secondary structure can be nominally categorized according to the base-pairing criterion: Watson-Crick paired regions, or helices, and non-Watson-Crick regions, or loops. According to their location, the loops can be classified into terminal or hairpin loops, which cap a helix; internal loops, which connect two helices; and junctions, which join three or more helices (Figure I.1.4.).


Figure I.1.4. Motifs in secondary structure: helices –in green; Hairpin loops –in yellow; Internal loops – in orange; Junction loops – in magenta. (from 16S T. thermophilus, PDB 1j5e).
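The loop classification described above can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it takes the secondary structure in dot-bracket notation (a common text encoding that is not used in this dissertation), ignores pseudoknots, counts bulges as internal loops, and does not count the unpaired exterior region. The function names are hypothetical.

```python
def pair_table(dot_bracket):
    """Map each position to its pairing partner (-1 if unpaired)."""
    partner = [-1] * len(dot_bracket)
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def classify_loops(dot_bracket):
    """Count hairpin loops, internal loops and multi-helix junctions."""
    partner = pair_table(dot_bracket)
    counts = {"hairpin": 0, "internal": 0, "junction": 0}
    for i, j in enumerate(partner):
        if j <= i:
            continue  # unpaired position, or the closing side of a pair
        branches = 0  # helices that start directly inside pair (i, j)
        unpaired = 0  # unpaired nucleotides directly inside pair (i, j)
        k = i + 1
        while k < j:
            if partner[k] > k:
                branches += 1
                k = partner[k] + 1  # skip over the enclosed helix
            else:
                unpaired += 1
                k += 1
        if branches == 0:
            counts["hairpin"] += 1    # loop caps a helix
        elif branches == 1 and unpaired > 0:
            counts["internal"] += 1   # loop connects two helices
        elif branches >= 2:
            counts["junction"] += 1   # loop joins three or more helices
        # branches == 1 and unpaired == 0 is a stacked pair inside a helix
    return counts


# Hypothetical toy structure: a three-way junction with two hairpins.
print(classify_loops("((..((...))..((...))..))"))
```

For the toy structure shown, the sketch reports two hairpin loops and one junction; a junction here has one closing helix plus at least two enclosed helices, matching the "three or more helices" criterion.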

The tertiary structure arises as the junction flexibility allows for helical elements to pack closely together into domains, establishing interactions between helical elements. The domains

fold together into compact globular structures, stabilized by supplementary contacts between

domains. At the level of tertiary structure, the loops in 2D are in fact highly organized 3D

structural modules, or “motifs”. They can be regarded as building blocks that can replace each

other and thus could be used to engineer novel RNA molecules (Guo, 2005; Nasalean, 2006).

Most of the nucleotides of a motif interact with each other, edge to edge or by stacking. Thus, a

tertiary modular motif can be described by denoting the combination and ordering of nucleotide

interactions (Leontis & Westhof, 2002). A thorough characterization of 3D motifs has to include,

besides the types of interactions mentioned above, information about the backbone

conformations (Richardson 2008). There are several well-established tertiary motifs (tetraloops, pseudoknots, ribose zippers, kissing hairpin loops, kink-turns, sarcin-ricin loop, T-loop, C-loop, etc.). Different 3D motifs have different functions: some have architectural roles, while others interact

with substrates and other molecules, and have specific functions.

The 3D structure of an RNA molecule arises as long-range contacts are established between tertiary motifs: hairpin loop-hairpin loop, hairpin loop-internal loop, internal loop-internal loop, and internal/hairpin loop-helix. Therefore, not only the interactions that make up tertiary motifs, but also the interactions between the motifs and the manner in which they change during biological processes, are crucial for understanding the dynamics and function of structured RNA molecules, and implicitly of the ribosome.

At the level of quaternary assembly of RNA with proteins or ligands, additional interactions occur

between the RNA nucleotides and the amino acids or ligand molecules. However, these

interactions are difficult to classify, particularly when only one hydrogen bond is formed. It is not the purpose of this work to analyze quaternary assembly interactions; this study focuses instead on the longer-range interactions that occur between the tertiary motifs composing structural elements when the globular structure folds.

I.1.5. Types of RNA

The central dogma of molecular biology states that DNA codes for RNA and RNA codes

for proteins, but this is not the entire picture. The transcribed RNAs can encode for proteins or

can be involved in various molecular processes or regulatory aspects of the gene expression

process. Thus, RNA, and some regions of mRNA, can be divided into coding and non-coding

RNA. The coding RNA is represented by messenger RNA (mRNA) which is the genetic code

emissary between DNA and proteins. Some non-coding regions of mRNA form 3D structures.

All the other types of RNA are non-coding RNAs (ncRNA). The ncRNAs have specific

functions, and some have complex tertiary structures, with their structure dictating the function.

Among the ncRNAs there are transfer RNA (tRNA), ribosomal RNA (rRNA), small nucleolar RNA (snoRNA), micro RNA (miRNA), small interfering RNA (siRNA), piwi-interacting RNA (piRNA), and long ncRNA, and probably more.

I.1.6. Types of RNA and the biological processes in which they are involved

Since Carl Woese’s proposal of three different domains of life: Archaea, Bacteria and Eukarya, the evolutionary relationship of the three has been the subject of dispute among scientists (Woese 1990). Depending on the features studied, Archaea are closer to Eukarya from several evolutionary standpoints. Still, the simplicity of organization and cellular processes would indicate that Archaea are also prokaryotes. Nonetheless, Archaea display similarities to both the Bacteria and Eukarya domains. A study of the transcriptional apparatus of all three domains poses the same dilemma: which of the three domains are evolutionarily closer.

I.1.6.1. RNA synthesis (Transcription)

Prokaryotic organisms belong to two domains of life: Bacteria and Archaea. They are

simple(r) unicellular organisms, with no membrane-bound organelles. They have small, circular

genomes with related genes clustered on “operons”. An operon is controlled by a single

promoter. A whole operon is transcribed together, leading to proteins involved in the same or

similar metabolic processes. In archaea and bacteria, DNA transcription into RNA takes place in

the cytoplasm, due to the lack of nuclear envelope (Wilson 2005; Kazanovska 2007). In

eukaryotes, transcription is done in the nucleus, and mRNA needs to be exported to the

cytoplasm, where the translation by ribosomes takes place. Organisms from all three domains of

life use one type of enzyme for transcription, RNA polymerase (RNAP), although the enzyme

differs in structure and composition between domains. 20% of transcription associated proteins

are the same throughout all domains: Archaea, Bacteria, and Eukarya. The result of transcription, pre-mRNA, has to undergo the splicing process during which introns are removed from the transcript and exons are joined together to form the mature mRNA, which will be translated by the ribosome into proteins.

There are three main phases of the transcription process: initiation, elongation and termination. Transcription initiation starts as the RNAP complex is formed on the promoter. In prokaryotes, the promoter sequence is located 10-35 base pairs upstream of the DNA gene sequence. In eukaryotes, certain proteins act as transcription factors, binding first upstream of the gene sequence and then recruiting the RNA polymerase to bind to the promoter. In the elongation phase of transcription, RNAP reads the template strand of a DNA gene, running 3’-5’, and transcribes the RNA strand 5’ to 3’. Each newly added RNA nucleotide is complementary to the next DNA template nucleotide (A to T, G to C, C to G, and U to A). This newly synthesized RNA molecule is an exact copy of the DNA coding strand, with the Ts replaced by Us. The elongation phase is governed by a proofreading mechanism which minimizes misreading, that is, the addition of a “wrong” base (one that is not complementary to the template strand). In such a case, transcription is paused, RNA editing factors bind to the RNA strand, and these factors correct the mistaken nucleotide.
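The pairing rule of the elongation phase can be illustrated with a short sketch. It assumes the template strand is written in the 3’-to-5’ direction, so that the RNA can be built 5’-to-3’ by pairing position by position; the function names and the example gene fragment are hypothetical.

```python
# RNA nucleotide added opposite each DNA template nucleotide
# (A pairs with T, G with C, C with G, and U with A).
TEMPLATE_TO_RNA = {"T": "A", "C": "G", "G": "C", "A": "U"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_from_coding(coding_5_to_3):
    """Return the template strand, written 3'-to-5', for a coding strand."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5_to_3)


def transcribe(template_3_to_5):
    """Build the RNA 5'-to-3' while reading the template 3'-to-5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


# Hypothetical coding-strand fragment, used only for illustration.
coding = "ATGGCCTTAGGA"
template = template_from_coding(coding)
rna = transcribe(template)
print(rna)  # AUGGCCUUAGGA
```

The printed transcript is identical to the coding strand with every T replaced by U, which is the relationship stated in the paragraph above.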

In prokaryotes, two types of transcription termination can be observed: factor-independent, also called intrinsic termination, and factor-dependent termination. In factor-independent termination, a sequence rich in Cs and Gs folds on itself to make a stem-loop. This sequence is followed by several Us. The stem-loop, 7-20 base pairs in length, causes strain between the RNA-Us and DNA-As, which lowers the energy of dissociation of the RNA-DNA hybrid, thus releasing the newly synthesized mRNA. A protein (factor) is necessary to trigger the pause of transcription in factor-dependent termination. This protein, called Rho factor,

binds to the RNA strand in a region (of 72 nucleotides) rich in C and G, upstream of the

termination sequence. Subsequently, it uses energy from ATP to translocate along RNA until it

reaches the RNA-DNA helical region, and it unwinds the RNA from the DNA. In eukaryotes,

the new RNA molecule is dissociated from the RNA-DNA hybrid only after the stem-loop

formation and a run of As is added to the end of the molecule. This process through which

transcribed RNA gains a poly (A) tail is known as polyadenylation. The proteins synthesized

from these transcripts can have variations towards the end, due to the differences in the length of

the poly (A) tail. The termination in eukaryotes is more complex than in prokaryotes and is not

fully understood yet.
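The factor-independent (intrinsic) terminator described above, a GC-rich stem-loop followed by a run of Us, lends itself to a simple pattern search. The sketch below is a deliberately simplified illustration, not a validated predictor: only the 7-20 base pair stem length comes from the text, while the loop length range, the minimum U-run length and the GC threshold are assumptions, and only perfect Watson-Crick stems (no G-U wobble pairs or mismatches) are recognized. All names and the test sequence are hypothetical.

```python
import re

RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def gc_fraction(seq):
    """Fraction of G and C nucleotides in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_intrinsic_terminators(rna, min_stem=7, max_stem=20,
                               min_loop=3, max_loop=8,
                               min_u_run=4, min_gc=0.6):
    """Find GC-rich perfect stem-loops immediately followed by a U run."""
    hits = []
    for match in re.finditer("U{%d,}" % min_u_run, rna):
        u_start = match.start()
        hit = None
        # Try the longest stems first so the largest hairpin is reported.
        for stem in range(max_stem, min_stem - 1, -1):
            for loop in range(min_loop, max_loop + 1):
                left_start = u_start - 2 * stem - loop
                if left_start < 0:
                    continue
                left = rna[left_start:left_start + stem]
                right = rna[u_start - stem:u_start]
                if left == reverse_complement(right) and gc_fraction(left) >= min_gc:
                    hit = {
                        "stem_5prime": left,
                        "loop": rna[left_start + stem:u_start - stem],
                        "stem_3prime": right,
                        "u_run": match.group(),
                    }
                    break
            if hit:
                break
        if hit:
            hits.append(hit)
    return hits


# Hypothetical transcript: 7-bp GC-rich stem, GAAA loop, then six Us.
transcript = "AAACA" + "GCCGCGG" + "GAAA" + "CCGCGGC" + "UUUUUU" + "AC"
terminators = find_intrinsic_terminators(transcript)
print(terminators)
```

On the toy transcript the sketch reports a single candidate, with the stem GCCGCGG paired to CCGCGGC around the GAAA loop. Real terminators tolerate imperfect stems, so a practical tool would score folding energy rather than require exact complementarity.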

One type of RNA polymerase transcribes all types of genes in all prokaryotes, although

archaeal RNAP is structurally closer to the eukaryotic RNAP II than to the bacterial RNAP. The

bacterial RNAP is an assembly of two main complexes: the core enzyme and the sigma factor.

The core enzyme, a tetra-unit assembly, is the part which presents catalytic activity, but it cannot

recognize the start site of transcription by itself. The holoenzyme (the core complex and the

sigma factor together) can bind to the DNA template, initiate RNA transcription, elongate the

transcript chain and terminate the transcription. The basic transcription initiation is similar in

Bacteria and Archaea and it starts when the RNA polymerase binds to the promoter region.

Approximately 60% of the transcription associated proteins are similar in all prokaryotes, and

archaeal and bacterial transcription regulators are homologous proteins. In both domains, the

resulting mRNA transcript has a Shine-Dalgarno sequence, 8 basepairs upstream of the start codon (AUG), used for recruiting the small subunit of the ribosome. The prokaryotic mRNA

transcript has no introns and does not require processing in either Archaea or Bacteria, except for

a small number of genes, e.g. tRNA in Archaea.

In eukaryotes there are different RNA polymerases responsible for the transcription of

different types of RNA, and all require assistance from transcription factors to bind to (specific)

promoters. Thus, RNAP I transcribes rRNA; RNAP II transcribes mRNA, and most snRNA and microRNA; RNAP III transcribes tRNA, 5S rRNA and other small RNA genes; and organelle

specific RNA polymerases transcribe RNA in chloroplasts and mitochondria. The latter are related to the prokaryotic RNAP. All other eukaryotic RNAPs are multi-assemblies of 10-17 subunits and, even though structurally more complex, they are still functionally similar to the bacterial RNAP. There are similarities between the core subunits of the three main types of

RNAPs, and even with the bacterial RNAP. RNAP II has common subunits with RNAP III and

RNAP I, but incorporates unique subunits as well which assist in protein recruiting for further primary mRNA processing (like 5’ capping and poly-A tail).

There are several similarities in transcription process between Archaea and Eukarya.

Transcription requires transcription factors in both Archaea and Eukarya, and can be regulated by various factors that can activate, enhance or suppress the transcription. Both archaeal and eukaryotic RNAPs require two (Auerbach) general transcription factors: TATA-binding protein

(TBP) and transcription factor B (TFB) (eukaryotic orthologue TFIIB). Introns can be found in

tRNA transcripts in both Archaea and Eukarya. The archaeal RNA polymerase has a core similar to that of eukaryotic RNA polymerase II.

Different mRNAs can have different life times in cells, ranging from seconds to hours in

prokaryotes, or from minutes to days in eukaryotes. mRNA is degraded by a combination of

ribonucleases in prokaryotes. In eukaryotes mRNA is protected from degradation by a complex

which includes initiation factors (IF) eIF-4E and eIF-4G and the poly (A) binding protein.

Initiation factors obstruct the decapping enzyme from acting, and poly (A) binding protein

prevents the exosome from attacking. This mRNA degradation is called turnover. If the mRNA

sequence is rich in AU on the 3’-end, certain proteins can bind to this region and promote poly

(A) tail removal, which seems to initiate mRNA degradation by the exosome complex and the decapping complex. An error in the eukaryotic gene message is corrected through non-sense-mediated decay, through which an mRNA with a premature stop codon undergoes degradation by 5’ decapping, 3’ poly (A) tail removal or endonucleolytic cleavage. Poly (A) tail removal destabilizes the P body complex, allowing mRNA degradation by the exosome or decapping complex. Other types of mRNA degradation involve RNA interference by small RNA molecules, such as siRNA and miRNA, which bind complementarily to mRNA, together with ribonucleases, like the Argonaute proteins in RISC (RNA-induced silencing complex), or endoribonucleases, like Dicer, which cleaves double-stranded RNA and facilitates mRNA digestion by the endonucleases in the RNA-induced silencing complex. Complementary binding of small RNA sequences can silence gene expression by suppressing mRNA translation and accelerating poly (A) removal. Piwi-interacting RNA (26-31 nts), as part of RNA-protein complexes, can bind to mRNA and silence the transcriptional process. The mRNAs without a stop codon are degraded by the exosome complex if the ribosome reaches the 3’ end and does not meet a stop codon, a process called non-stop decay. The non-sense mutations, which result in

premature stop codons (or non-sense codons), are corrected for through non-sense-mediated

decay. The exon junction complex (EJC) module of the RNP binds tightly 20-24 nucleotides

upstream of the splicing junction (where two exons are bound together), forcing the RNP

complex to be exported out of the nucleus into the cytoplasm through nuclear pores. Thus

truncated, incomplete, or non-functional protein precursors are degraded (by exosome complexes 27

for example). EJC can also recruit mRNA surveillance factors, which proofread each step of the

maturation process of pre-mRNA.

I.1.6.2. mRNA processing

A pre-mRNA molecule undergoes processing, including intron splicing, alternative splicing, 5’ cap addition, editing, and polyadenylation, as well as transport or nuclear export in eukaryotes. Evidently, eukaryotic pre-mRNA processing is often much more complex than the prokaryotic one. mRNA is only one of the types of RNA transcripts; another main type is ribosomal RNA. Most transcripts undergo processing before the mature RNAs can perform their respective functions. In E. coli, the genetic information for the ribosomal RNA is present as seven copies of the ribosomal RNA operon, each transcribed into 16S, 23S and 5S rRNA separated by spacers encoding tRNAs (Deutscher 2009). RNase III processes the primary transcripts by recognizing double-stranded RNA sequences and cleaving them during transcription (Robertson, Webster 1968). At this point, there is a 17S rRNA with an extra 115 and 33 nucleotides at the 5’ and 3’ ends, respectively (Young and Steitz 1999). Pre-23S rRNA has only 3-7 and 8 additional nucleotides at the 5’ and 3’ ends, respectively (Sirdeshmukh 1985). In the maturation of 17S to 16S rRNA, RNase E and RNase G process the 5’ end (Lee, Pandit 1999). RNase T is the enzyme responsible for 3’ end processing of pre-23S RNA (Hiyes 1976, Lee, Pandit 1999), but the process is not completely understood. The enzymes involved in 5’ processing of pre-23S RNA are not yet known. Pre-5S RNA is cleaved by RNase E, which yields an immature 5S RNA with 3 additional nucleotides at both ends (Roy 1983); these are removed by RNase at the 3’ end and by an unknown enzyme at the 5’ end (Lee and Deutscher 1995). Other factors that participate in RNA processing are PNPase and RNase PH (Zhou and Deutscher 1997), and RNA chaperones (Steitz 1995).

Some nucleotide modification occurs after RNA processing: 10 methylations plus 1 pseudouridylation in 16S RNA of E. coli, and 14 methylations, 9 pseudouridylations and 1 methylated pseudouridylation in 23S RNA of E. coli. These modifications include the dimethylation of A2058 in 23S, which confers resistance to specific macrolide antibiotics (Wase 1995), and methylation of 1948 by aminoglycoside (Thompson 1985). The modifications do not seem to be essential for translation, cell life, or the efficiency of translation, even though they are clustered in important regions of rRNA, like the peptidyl transferase center, and mRNA, but there are a few known exceptions. Instead, post-transcriptional modifications appear to be essential for the folding, assembly and stability of the ribosome (Noller 1981; Brimacombe 1993; Decatur 2002; O’Farrell 2008).

I.1.6.3. The RNAs involved in translation

The last step of protein synthesis is the translation of mRNA into proteins. This process is carried out by a ribonucleoprotein complex called the ribosome, which results from the quaternary assembly of ribosomal RNA (rRNA) and ribosomal proteins. In archaea and bacteria, protein synthesis takes place in the cytoplasm, concurrently with transcription. Three types of RNA are involved in protein synthesis: mRNA, tRNA, and rRNA. Since mRNA was discussed in detail in the previous section, rRNA and tRNA are the focus of this part.

I.1.6.4. The bacterial ribosome and rRNA

The ribosome is a ribonucleoprotein complex with a mass of about 2.5 million Daltons and is found in abundance in cells; bacterial cells contain tens of thousands of ribosomes each. The ribosome is composed of two subunits: the small subunit (SSU) and the large subunit (LSU). According to sedimentation analysis (“S” stands for “Svedberg”), the bacterial ribosome is a “70S” complex. In bacteria, the small subunit (30S) has an RNA molecule 1542 nucleotides long (with a 16S sedimentation coefficient) and about 21 proteins denoted with “S” (Wimberly et al., 2000). The large bacterial subunit (50S) is a di-RNA assembly of 23S rRNA, of ~2904 nucleotides, with 5S rRNA, which has 121 nucleotides, together with 34 proteins denoted with “L” for the “Large” subunit (Ban, 2000; Harms, 2001). The precursor particles of the mature 30S and 50S subunits, 21S and 45S respectively, are missing r-proteins (Maki 2003).
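As an illustrative aside, the composition figures quoted above can be collected into a small data structure. The following Python sketch is not part of the original analysis: the class and function names are hypothetical, and the numbers are simply those given in the text, with "~2904" and "about 21" treated as exact for the sake of the arithmetic. Svedberg values describe sedimentation behavior and are not additive, which is why a 30S and a 50S subunit form a 70S ribosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subunit:
    """Hypothetical record for one ribosomal subunit."""
    name: str           # sedimentation label, e.g. "30S"
    rrna_lengths: dict  # rRNA name -> length in nucleotides
    n_proteins: int     # number of r-proteins


# Values as quoted in the text for the bacterial ribosome.
SSU = Subunit("30S", {"16S": 1542}, 21)
LSU = Subunit("50S", {"23S": 2904, "5S": 121}, 34)


def total_rrna_nt(subunits):
    """Sum the rRNA lengths over all given subunits."""
    return sum(sum(s.rrna_lengths.values()) for s in subunits)


def total_proteins(subunits):
    """Sum the r-protein counts over all given subunits."""
    return sum(s.n_proteins for s in subunits)


if __name__ == "__main__":
    ribosome = [SSU, LSU]
    print(total_rrna_nt(ribosome), "rRNA nucleotides")
    print(total_proteins(ribosome), "r-proteins")
```

With these figures the bacterial ribosome carries three rRNA molecules and more than 50 proteins, consistent with the counts used later in this section.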

The eukaryotic ribosome, 80S, is more complicated in its architecture and regulation, but the mechanism of translation is the same as for the prokaryotic ribosome. The 80S ribosome also has two subunits: the 40S subunit, which contains an 18S rRNA and 33 proteins, and the 60S subunit, which contains three rRNAs (5S, 5.8S and 25S) and 46 proteins. Recently, new cryo-EM and crystallographic studies of 80S have been published (Ben-Shem 2011; Rabl 2010; Klinge 2011; Yusupova 2014).

Chloroplasts and mitochondria are organelles of endosymbiotic origin, most probably proteobacterial (Emelyanov 2001). Their translation process is similar to that of Eubacteria (Allen 2003). Chloroplast ribosomal proteins are all homologous to bacterial r-proteins, with extensions at both the N- and C-termini (Yamaguchi 2000). L25 and L30 are missing, but there are six additional proteins, four for the small subunit and two for the large subunit (Somanchi 1999).

Mitochondrial ribosomes have a shorter rRNA primary sequence than prokaryotes (Van de Peer 1995, Cedergren 1988), and so differ from prokaryotes at the primary and secondary structure levels, as their rRNA has evolved divergently. There are also differences in the ribosome between different species. Some mitochondrial r-proteins are much longer than their bacterial counterparts. Mitochondrial ribosomes also have a different sedimentation coefficient. Thus, there are significant architectural differences between mitoribosomes and eubacterial ribosomes.

Although the ribosome is composed of more than 50 proteins in bacteria (>70 in the eukaryotic ribosome) and only 3 RNA molecules (4 rRNAs in eukaryotes), the peptidyl transferase active site for the translational process is composed entirely of RNA (Green, 1997; von Ahsen, 1997); therefore the ribosome is a ribozyme, an RNA enzyme that catalyzes peptide bond formation. The subunit interface is protein-free, but some proteins can be found at functionally important sites: S12 is near the decoding region and the mRNA entry and exit sites, and the L7/L12 stalk is close to the factor binding site. Most of the ribosomal proteins have long, basic arms which intertwine with the rRNA to neutralize the negative backbone charges of the tightly packed rRNA, and thus stabilize the folded rRNA. They appear to play some roles in the structural scaffolding of the ribosome.

Modifications in proteins L4 and L22 can confer on the ribosome resistance to the antibiotic erythromycin, due to the resulting changes in the adjacent 23S rRNA (Gregory and Dahlberg 1999). They have orthologues in eukaryotes and archaea (Lecompte 2002). Prokaryotic proteins are different between species, and there are two proteins in bacteria not found in eubacteria. The proteins that have longer arms towards the exterior, for example proteins L7 and L12 (Bocharov, 2004; Gudkov, 1997), may take part in recruiting dissociable translation factors: initiation factors (IFs) (Myasnikov, 2005), elongation factors (EFs) (Aoki, 2008; Fujiwara, 2004; Matassova, 2001; Savelsbergh, 2003; Spiegel, 2007; Wintermeyer, 2001), and release factors (RFs) (Petry, 2005; Weixlbaumer, 2007). The proteins with globular structures are located on the outside of the ribosome, including proteins that protect the ribosome surface, like those of the central protuberance or stalks (L1, L10, L11), or serve as gates of the mRNA tunnel (L22 at the entrance (Nakatogawa, 2002; Mitra, 2006), and L1 at the exit (Nikulin, 2003)), guiding the directionality of translocation. The number of r-proteins increases from bacteria to archaea to eukaryotes, with 34 of the 79 proteins in eukaryotes similar to bacterial r-proteins (Lecompte 2002). Archaeal proteins are specific, with counterparts in bacteria or eukaryotes (Ramirez, Louie 1991).

I.1.6.5. The bacterial ribosome and tRNA

The tRNA is a short RNA molecule (73 to 93 nucleotides) which, when “charged”, carries an amino acid at the 3’ end. While mRNA is structurally linear, with linearity maintained in bacteria by concurrent DNA transcription and mRNA translation, tRNA has a tertiary structure. A tRNA sequence folds on itself such that in the 2D structure it forms four arms, with three capping loops, resembling a cloverleaf. The tertiary structure of tRNA, in contrast, has an inverted L shape produced by coaxial stacking of the arms. The arm formed by pairing of the 5’ end of the sequence with the 3’ end is called the acceptor stem, a seven-base-pair helix, with the last three 3’ nucleotides (CCA) free. This CCA tail serves as a recognition sequence for proteins during translation. The first hairpin loop, nominally called the D-loop, occurs at the end of the D arm, which is about four base pairs in length. The following hairpin loop, the anticodon loop, is a seven-nucleotide loop with the 3 bases in the middle arranged to Watson-Crick pair with mRNA codons. The TΨC (St-Onge, Thibault, Hamel, & Major) loop caps the five-base-pair T arm.

A small variable loop is situated between the anticodon arm and the T arm. An enzyme called aminoacyl-tRNA synthetase adds an amino acid to the 3’ end of its cognate tRNA. Thus “charged”, tRNAs are ready to participate in the translation process. Every three mRNA nucleotides correspond to a codon; for each codon there is a tRNA that has the complementary (anti-sense) nucleotide sequence at the end of one arm of the L, the anticodon stem-loop (abbreviated ASL).
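The codon-anticodon relationship described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a model of real decoding: only strict Watson-Crick complementarity is considered (wobble pairing and modified bases are ignored), and the function names are hypothetical.

```python
# Watson-Crick complements for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codons(mrna, start=0):
    """Split an mRNA (written 5'->3') into consecutive triplets from `start`.

    Trailing nucleotides that do not fill a complete codon are dropped.
    """
    return [mrna[i:i + 3] for i in range(start, len(mrna) - 2, 3)]


def anticodon_for(codon):
    """Return the strictly complementary anticodon, written 5'->3'.

    Pairing is antiparallel, so the complement is read in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(codon.upper()))


if __name__ == "__main__":
    for codon in codons("AUGGCUUAA"):
        print(codon, "->", anticodon_for(codon))
```

For the initiator codon AUG the sketch returns CAU, the anticodon carried by the initiator tRNA.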

I.1.6.6. Protein synthesis (Translation)

Translation of mRNA into proteins occurs in four phases: initiation, elongation, termination, and ribosome recycling. In bacteria the ribosome initiates translation at the 5’ end of the mRNA while RNA polymerase is still synthesizing the rest of the same mRNA. This phenomenon is enabled by the lack of a nuclear envelope. Initiation starts as the small ribosomal subunit, assisted by initiation factors (IFs), recruits the mRNA at the translation initiation region (TIR). TIRs include the initiator codon, usually “AUG”, the Shine-Dalgarno sequence, and translational enhancers. Then, the aminoacylated tRNAs are recognized by matching (base-pairing) between the mRNA codon and the tRNA anticodon. The first aminoacyl-tRNA, fMet-tRNA (carrying the start amino acid), is assisted by the initiation factor IF2 in entering the ribosome at the peptidyl site (P-site). Initiation factors IF1 and IF3 are bound to the A-site and E-site, respectively, leaving the P-site free for the initiator tRNA. The initiator tRNA is Met-tRNA in both Archaea and Eukarya, and its formylated form, fMet-tRNA, in bacteria. The large subunit, with the embedded active site where peptide bond formation is catalyzed, is then recruited by the 30S/mRNA/initiator-tRNA complex, and the initiation factors are released. At this point, the ribosome is ready to start decoding the mRNA, and the elongation phase of translation begins.
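To make the elements of a translation initiation region concrete, the following Python sketch scans an mRNA for a Shine-Dalgarno-like motif followed at a short distance by a start codon. It is an illustration only. The core motif "GGAGG" and the 5 to 9 nucleotide spacing window are assumptions chosen for the example, not values taken from this text, and the function name is hypothetical. Real initiation sites tolerate imperfect motifs and also depend on enhancers and mRNA structure, which are not modeled here.

```python
def find_tirs(mrna, sd="GGAGG", min_gap=5, max_gap=9, start="AUG"):
    """Return (sd_position, start_codon_position) pairs, 0-based.

    A hit is reported when `start` begins between `min_gap` and `max_gap`
    nucleotides after the end of an exact match to `sd`.
    All parameter defaults are illustrative assumptions.
    """
    mrna = mrna.upper()
    hits = []
    pos = mrna.find(sd)
    while pos != -1:
        sd_end = pos + len(sd)
        for gap in range(min_gap, max_gap + 1):
            s = sd_end + gap
            if mrna[s:s + 3] == start:
                hits.append((pos, s))
                break  # keep only the nearest start codon for this motif
        pos = mrna.find(sd, pos + 1)
    return hits


if __name__ == "__main__":
    example = "UUAAGGAGGUAACAUAUGGCU"  # made-up sequence
    print(find_tirs(example))
```

The window-based search keeps the sketch simple; a scoring scheme based on base-pairing with the 3’ end of 16S rRNA would be closer to the biology but is beyond this example.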

The elongation factor EF-Tu, bound to GTP and a charged tRNA, diffuses to the factor-binding site of the ribosome. This tRNA folds more tightly from its distorted conformation to meet the geometrical requirements of the subunit complex. Peptide bond formation, which proceeds by nucleophilic attack of the amine of the A-site tRNA amino acid on the ester of the P-site tRNA amino acid, seems to be catalyzed through an entropic effect of substrate orientation. The reaction is driven by the negative ΔG of ester bond hydrolysis. Hydrolysis of GTP to GDP by the elongation factor EF-G induces the translocation motion of the A-site tRNA to the P-site, and of the P-site tRNA to the E-site (exit site). This passes through a transition state, with the A-site tRNA moving to the P-site and the P-site tRNA relocating to the E-site, which induces conformational changes. In the pre-transition state EF-G binds to the A-site; in the intermediate state the tRNAs, mRNA and 30S (Richardson et al.) move (with respect to 50S) and GTP is hydrolyzed with release of Pi. The translocation movement, required to shift the mRNA and tRNAs by 1 codon (3 nucleotides), is a molecular ratcheting motion, with the two subunits rotating 5-10° with respect to each other. Conformational changes occur at the tRNA level also: starting from the pre-translocation state, the A-site tRNA moves into an A/P hybrid state, and the P-site tRNA into a P/E hybrid state. In the translocation step, only the 30S subunit rotates clockwise relative to the position of the 50S large subunit and mRNA. This allows the mRNA to be “shifted” over by one codon, and it leaves the ribosome “open” for recruiting a new tRNA. In this post-transitional state, the backward movement of the 30S head and the movement of the EF-G domain accomplish the release of GDP from the A-site, and the E-site tRNA exits.
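The sequence of tRNA positions during translocation can be summarized as a small state table. The Python sketch below is a bookkeeping illustration only, with hypothetical names. It assumes the usual notation in which a state is written as small-subunit site / large-subunit site, so that "A/P" means the anticodon end is still in the A-site of the 30S subunit while the acceptor end has moved to the P-site of the 50S subunit. It does not model factor binding, GTP hydrolysis, or subunit rotation.

```python
# Classical (pre-translocation) state -> hybrid state:
# the acceptor end moves first on the large subunit.
CLASSICAL_TO_HYBRID = {"A/A": "A/P", "P/P": "P/E"}

# Hybrid state -> post-translocation state:
# the anticodon end follows on the small subunit.
HYBRID_TO_POST = {"A/P": "P/P", "P/E": "E/E"}

CODON_LENGTH = 3  # the mRNA advances by one codon per translocation


def translocate(trnas, mrna_pos):
    """Advance every tRNA through one round of translocation.

    `trnas` maps a tRNA label to its state; `mrna_pos` is the 0-based
    position of the first nucleotide of the P-site codon.
    Returns (hybrid_states, post_states, new_mrna_pos).
    States without an entry in the tables are left unchanged.
    """
    hybrid = {t: CLASSICAL_TO_HYBRID.get(s, s) for t, s in trnas.items()}
    post = {t: HYBRID_TO_POST.get(s, s) for t, s in hybrid.items()}
    return hybrid, post, mrna_pos + CODON_LENGTH


if __name__ == "__main__":
    before = {"peptidyl-tRNA": "P/P", "aminoacyl-tRNA": "A/A"}
    hybrid, after, pos = translocate(before, mrna_pos=0)
    print(hybrid)
    print(after, pos)
```

After one round the former A-site tRNA occupies the P-site, the former P-site tRNA occupies the E-site, and the A-site is free for the next aminoacyl-tRNA.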

E-site tRNA exit is assisted by the release factors (RF1 and RF2), which recognize stop codons. The new peptide begins to form its secondary structure as it exits the ribosome tunnel; the ribosome thus appears to act as a chaperone in (partial, secondary) protein folding.

Except for the very first tRNA which initiates the translation in P-site, an aminoacyl-charged tRNA enters the ribosome at the A-site.

Recently it was shown that the translational process is driven by the entropy decrease that occurs when the ribosome efficiently organizes the substrates for reaction, rather than by acid/base catalysis of the nucleophilic attack, as thought before (Beringer, 2005; Okuda, 2005; Rodnina, 2006; Rodnina, 2007; Rodnina, 2003; Sievers, 2004; Wintermeyer, 2004). Translation begins with recruitment of the mRNA, and the initiation factor IF2-GTP assists in the capture of fMet-tRNA, the initiator tRNA, by the small subunit (Allen, 2005; Lomakin, 2006). The whole translational process, especially the polypeptide elongation step, is based on the ability of the ribosome to undergo dynamic conformational changes. These changes are driven by hydrolysis of GTP to GDP and by the elongation factor EF-G, which is thought to induce the translocation motion (Matassova, 2001). The acceptor stem of the A-site tRNA moves to the P-site, and the acceptor stem of the P-site tRNA moves to the E-site, with the tRNA anticodon loops still bound to the small subunit. This results in a transitional state, A/P - P/E (Lisi 2007). The subsequent shift of the tRNA anticodon loops from the A site to the P site, and from the P site to the E site, concludes the translocation step. The translocational movement is made in a ratchet-like motion (Frank, 2000), which involves a 6° rotation between subunits and a ~10 Å counter-clockwise movement of the 30S small subunit and mRNA with respect to the large 50S subunit (the hybrid state is a "locked" state), with the extremities (L1 arm) moving as much as 20 Å from the initial position (Frank 2000). The “open” state at the end of translocation leaves the ribosome A-site accessible for recruitment of the next cognate tRNA. The ratcheting motion is a universally accepted hypothesis for the ribosome movement during the translocation process. It was observed by J. Frank and R. Agrawal (2000) in the 80S ribosome of a eukaryotic organism, S. cerevisiae, and supported by the authors’ E. coli studies

(Agrawal 2000). Termination of protein biosynthesis is induced by binding of release factors (RFs) that recognize the stop codon (RF1 for UAA or UAG, and RF2 for UAA or UGA); these factors mimic tRNA structure and bind to the A site of the ribosome. The factor RF3 is a GTP-binding protein which accelerates termination of translation, releasing the nascent protein, subsequently dissociating RF1 and RF2, and promoting subunit dissociation. The ribosome recycling factor (RRF) is a protein which imitates the structure of a tRNA almost perfectly, although it binds to the ribosome in a different manner. RRF separates the small subunit from the large subunit, releasing the mRNA (for a summary of the roles of translational factors see Table I.6.1).

TRANSLATIONAL FACTORS

Names | Abbreviation | Function | GTPase | Location | PDB files
Initiation Factors | IF1 | blocks A-site; increases IF2 affinity; may prevent 50S binding. | | A-site | 1hr0
Initiation Factors | IF2 | IF2-GTP binds to initiator f-Met-tRNA and delivers it to P-site. | yes | P-site |
Initiation Factors | IF3 | prevents subunits association; assists in subunits dissociation; stabilizes free 30S subunits; checks for cognate fMet-tRNA; in some prokaryotes (required in E. coli!) assists 16S-Shine-Dalgarno sequence to 5'-mRNA binding. | | E site | 1i94, 1i96
Elongation Factors | EF-Tu | binds to charged tRNA and delivers aa-tRNA to A-site; delays tRNA binding until cognate tRNA is bound. | yes | A-site |
Elongation Factors | EF-G | assists translocation of A-site tRNA+mRNA to P-site - compels A/P and P/E hybrid states; complete translocation after GTP hydrolysis; involved in ribosome dissociation + recycling. | yes | A-site vicinity |
Elongation Factors | EF-Ts | facilitates GDP release from EF-Tu. | | |
Termination Factors | RF1 | recognizes termination codons UAG and UAA. | | A-site | 2b64
Termination Factors | RF2 | recognizes termination codons UGA and UAA. | | A-site | 2b9m
Termination Factors | RF3 | facilitates RF1, RF2 binding to ribosome, catalyses RF1, RF2 release from the ribosome and assists in ribosome dissociation. | yes | |
Recycling Factors | RRF | assists ribosomal subunits dissociation and mRNA dissociation from ribosome; (mimics tRNA). | | | 2v46, 2qbd

Table I.6.1. The main roles of translational factors in the prokaryotic ribosome.
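The stop-codon specificity of the two class I release factors listed in the table can be written as a simple lookup. The Python sketch below is an illustration with hypothetical function names; it only encodes the codon assignments stated in the text (RF1 for UAA and UAG, RF2 for UAA and UGA) and scans a reading frame for the first stop codon.

```python
# Stop codon -> release factors that recognize it, as stated in the text.
STOP_CODON_FACTORS = {
    "UAA": ("RF1", "RF2"),
    "UAG": ("RF1",),
    "UGA": ("RF2",),
}


def release_factors_for(codon):
    """Return the release factors recognizing `codon` (empty if not a stop)."""
    return list(STOP_CODON_FACTORS.get(codon.upper(), ()))


def first_stop(mrna, start=0):
    """Find the first in-frame stop codon, reading triplets from `start`.

    Returns (position, codon, factors) or None if no stop codon is found.
    """
    mrna = mrna.upper()
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODON_FACTORS:
            return i, codon, release_factors_for(codon)
    return None


if __name__ == "__main__":
    print(first_stop("AUGGCUUGA"))
    print(release_factors_for("UAA"))
```

A reading frame with no stop codon returns None, which corresponds to the situation handled in cells by non-stop decay.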

With respect to the regulation of prokaryotic peptide synthesis, the elongation factors and the initiation factor IF2 are G-proteins, all involved in translation accuracy:

– IF2 – the IF2-GTP complex binds to the 30S A-site, promoting binding of the first tRNA (Met-tRNA) to the P-site. Assembly of the small subunit with the large subunit induces GTP hydrolysis and release of the IF2-GDP complex from the A-site, which frees the site for future tRNA recruitment. IF2 supposedly also activates transcription of the rrnB operon and shows DNA affinity;

– EF-Tu – sequesters charged tRNA in the cytoplasm. Upon recognition by the correct mRNA codon, it hydrolyzes GTP to GDP + Pi and releases the tRNA to the A-site. EF-Tu regulates translation by postponing GTP hydrolysis and by delaying the entrance of the A-site tRNA into the ribosome, both of which give the ribosome the opportunity to recruit the correct tRNA; it also ensures that the correct amino acid is bound to the corresponding tRNA.

– EF-Ts – is engaged in releasing the GDP bound to EF-Tu. EF-Tu is then able to bind GTP, which frees EF-Ts, and a new tRNA.

– EF-G – binds, with its (incorporated) GTP, to the A-site, inducing the hybrid translocation of the A-site tRNA to an A/P hybrid state and, implicitly, of the P-site tRNA to the P/E hybrid state. The conformational changes provoked by EF-G result in a complete shift of the mRNA and the A- and P-site tRNAs by one codon (3 nucleotides), followed by departure of the EF-G-GDP complex from the ribosome A-site. This whole translocational movement, also known as the ratcheting motion, a 10 Å rotation of the ribosome subunits (with respect to each other) back and forth, is powered by GTP hydrolysis by EF-G.

EF-Tu is found in bacterial, mitochondrial and chloroplast ribosomes. Both the EF-Ts and EF-G elongation factors are highly conserved throughout evolution. IF2 is the only G-protein initiation factor in bacteria. Translational factors, together with other ribosomal proteins, are involved not only in mRNA translation, but also in replication, DNA repair, transcription, RNA processing, etc. EF-Tu binds to the replication complex, the transcriptional complex and membranes; degrades the proteins blocked by the proteasome; and assists in protein folding (together with the ribosome). The eukaryotic counterparts of EF-G and EF-Tu regulate cytoskeleton polymerization by binding to actin.

I.1.7. The structure of the ribosome

Some of the most complex crystal structures solved to date are those of ribosomes (Steitz, 2000, Selmer, 2006, Schuwirth, 2005, Korostelev, 2006). The ribosome is an intricate biomolecular machine, the biggest enzymatic macromolecule known. All rRNA helices involved in tertiary contacts with the tRNAs interact directly or indirectly with each other, and furthermore with additional structural elements, thereby establishing an intricate network of tertiary interactions that propagates structural changes throughout the whole ribosome. All the interactions between nucleotides in a structural RNA molecule can be described and mapped for better visualization and understanding. Each constituent of a nucleotide, the base, the ribose sugar and the phosphate group, is capable of forming hydrogen bonds. Two nucleotides can form base-base, base-sugar, base-phosphate, sugar-sugar, sugar-phosphate and phosphate-phosphate interactions. The latter three are not nucleotide specific, and moreover they might occur by chance through close proximity in globular folding; hence they have not been clearly classified yet. In the next chapter, the interactions between nucleotides, and the definition and classification of structural motifs, are reviewed. It was originally published in (Nasalean 2009).

My name has changed since then: formerly Nasalean, currently Parlea.


I.2. RNA Structural Motifs

Structured RNA molecules resemble proteins in the hierarchical organization of their

global structures, folding and broad range of functions. Structured RNAs are composed of

recurrent modular motifs that play specific functional roles. Some motifs direct the folding of the

RNA or stabilize the folded structure through tertiary interactions. Others bind ligands or

proteins or catalyze chemical reactions. Therefore, it is desirable, starting from the RNA

sequence, to be able to predict the locations of recurrent motifs in RNA molecules. Conversely,

the potential occurrence of one or more known 3D RNA motifs may indicate that a genomic

sequence codes for a structured RNA molecule. To identify known RNA structural motifs in new

RNA sequences, precise structure-based definitions are needed that specify the core nucleotides

of each motif and their conserved interactions. By comparing instances of each recurrent motif

and applying basepair isostericity relations, one can identify neutral mutations that preserve its

structure and function in the contexts in which it occurs.

I.2.1. Defining Motifs at Different Levels of Structure

Defining and identifying recurrent modular motifs in 3D structures and developing

bioinformatics methods to find them in sequences will improve RNA gene finding and RNA 3D structure prediction. In 2005, the RNA Ontology Consortium (http://roc.bgsu.edu/) was created as an umbrella organization to convene and coordinate working groups to reach scientific consensus on the best ways to define, classify and annotate RNA structural motifs for bioinformatics applications, including 1) identifying RNA genes in genomic sequences; 2) predicting their secondary structures from sequence and readily obtainable experimental data; 3)

inferring their function(s); and 4) modeling their three-dimensional structures (Leontis, Altman, et al., 2006). This is an area of active research in which a variety of approaches are being

investigated (Leontis, Lescoute, & Westhof, 2006; Leontis & Westhof, 2003). For comprehensive discussions of new RNA 3D structures and motifs and their functional roles the reader is referred to recent reviews (Hendrix, Brenner, & Holbrook, 2005; Holbrook, 2005).

I.2.1.1. Hierarchical architectures and folding of structured RNA molecules

Like proteins, RNA molecules fold hierarchically in time and space to form specific 3D structures necessary for molecular function. Local secondary structure elements - primarily short helices capped by hairpin (terminal) loops - form in the first stages of folding. In subsequent folding stages, these elements coalesce into local domains composed of helical elements organized by multi-stem junctions. Some of these helices are formed by complementary sequences distant in the RNA sequence. In the final, slowest stages of folding, the native, compactly folded tertiary structure is produced as the correct tertiary interactions are established between structural domains (Rangan, Masquida, Westhof, & Woodson, 2003; Thirumalai, Lee,

Woodson, & Klimov, 2001; Thirumalai & Woodson, 1996; Zhuang et al., 2000). While RNA 3D structures can be very large and complex, they are hierarchical and modular. As is the case for proteins, the global structures of RNA molecules change more slowly than their sequences or secondary structures. These features help us to analyze and understand them.

I.2.1.2. Defining the modular units of RNA Structure

We gain a better understanding of RNA structures by identifying modular subunits of structure and their interactions at each hierarchical level of organization.

Primary Sequence. At the level of the sequence, the modular subunits are individual nucleotides, covalently linked 5’-to-3’ by phosphodiester bonds. Each nucleotide consists of three chemical moieties -- the base, the sugar and the phosphate. When RNA molecules fold, the nucleotides interact with each other in characteristic ways. The most specific and best understood

interactions involve the bases -- base-base, base-sugar, and base-phosphate interactions. Base-

base interactions include edge-to-edge pairing interactions mediated by hydrogen-bonding, face-

to-face stacking interactions and (rare) edge-to-face perpendicular interactions. Sugar-sugar,

sugar-phosphate, and metal- or solvent-mediated phosphate-phosphate interactions also occur

and contribute to the stability of complex RNA structures, but are harder to classify and to relate

to sequence information. Although the sugar-phosphate backbone of RNA is very flexible, it is

possible to classify the observed conformations of nucleotides and dinucleotides in discrete,

recurrent patterns that can be associated with certain motifs or sub-motifs (Richardson et al.,

2008; Sykes & Levitt, 2005).
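The six classes of pairwise interaction named above follow directly from choosing two of the three nucleotide moieties, with repetition allowed. The short Python sketch below, with hypothetical names, enumerates them and separates the classes that involve a base from the backbone-only classes, which the text describes as harder to classify.

```python
from itertools import combinations_with_replacement

# The three chemical moieties of a nucleotide, as listed in the text.
MOIETIES = ("base", "sugar", "phosphate")


def interaction_classes():
    """Enumerate unordered pairs of moieties, e.g. 'base-phosphate'."""
    return ["-".join(pair) for pair in combinations_with_replacement(MOIETIES, 2)]


def involves_base(interaction):
    """True if at least one partner in the interaction class is a base."""
    return "base" in interaction.split("-")


if __name__ == "__main__":
    classes = interaction_classes()
    print("base-mediated:", [c for c in classes if involves_base(c)])
    print("backbone-only:", [c for c in classes if not involves_base(c)])
```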

Watson-Crick Helices and Secondary Structure. Single-stranded RNA molecules fold

back on themselves to juxtapose Watson-Crick complementary sequences in an anti-parallel

fashion. This produces Watson-Crick helices, the fundamental modular units of secondary

structure. The helices are composed of the canonical Watson-Crick (WC) basepairs, AU, UA, CG, and GC, as well as "wobble" GU and

UG pairs; the Watson-Crick basepairs are the modular subunits of secondary structure and they

stack on each other in a regular, recurrent way. The helices are generally short (no longer than 10

to 15 WC pairs) because they are interrupted or terminated on their ends by nominally unpaired

stretches of sequence that are called, depending on where they occur in the secondary structure,

hairpin, internal, or multi-helix junction “loops.” Bases in loops are usually depicted in

secondary structures as not forming basepairs. In general, about 60% of the nucleotides of a

structured RNA form Watson-Crick basepairs.
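The pairing rule for helices described above can be illustrated with a short Python check. This is a minimal sketch with hypothetical names: it tests only whether two equal-length strands, both written 5'->3', can pair position by position in an anti-parallel arrangement using the canonical and wobble pairs listed in the text. It says nothing about stability or about whether the helix actually forms.

```python
# Allowed basepairs in a secondary-structure helix, as listed in the text:
# canonical Watson-Crick pairs plus the GU/UG wobble pairs.
HELIX_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
    ("G", "U"), ("U", "G"),
}


def can_form_helix(strand_a, strand_b):
    """Check anti-parallel pairing of two strands written 5'->3'.

    Base i of `strand_a` is paired with base (n - 1 - i) of `strand_b`.
    """
    a, b = strand_a.upper(), strand_b.upper()
    if len(a) != len(b) or not a:
        return False
    return all((x, y) in HELIX_PAIRS for x, y in zip(a, reversed(b)))


if __name__ == "__main__":
    print(can_form_helix("GGCA", "UGCC"))  # all canonical pairs
    print(can_form_helix("GGCA", "UGUC"))  # includes one GU wobble pair
    print(can_form_helix("GGCA", "AGCC"))  # A opposite A: not a helix pair
```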

RNA 3D structures. The 3D structures of a relatively small number of RNA molecules

are determined to atomic resolution by x-ray crystallography or NMR spectroscopy each year.

While small in size compared to the 3D protein database or all the RNAs known from genomes,

the RNA 3D structure database has expanded rapidly in the last few years. These data show that

most “loops” in 2D representations in fact form specific 3D motifs, characterized by non-

Watson-Crick base-pairing, base-stacking and base-phosphate interactions between loop

nucleotides. For example, in a survey of the 3D structures of rRNA in the 70S ribosomes of

E. coli and T. thermophilus, and the 50S subunit of H. marismortui, only ~59% of bases form

standard WC basepairs, and ~7% of these make, in addition, at least one non-WC basepair

(Stombaugh, Zirbel, Westhof, & Leontis, In Preparation). Of the remaining rRNA bases, ~20%

form one or more non-WC basepairs while ~21% do not basepair at all. However, most of the

unpaired bases participate in base-stacking, base-phosphate, or RNA-protein interactions. Thus,

the loops comprise a significant fraction of the nucleotides of structured RNA molecules and

most of these nucleotides interact with other nucleotides, proteins or ligands.
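As a quick consistency check on the survey figures quoted above, the Python sketch below recomputes the breakdown. It assumes, as the wording suggests, that the "~7%" refers to 7% of the WC-paired bases rather than 7% of all bases; the function name is hypothetical and the inputs are the approximate percentages from the text.

```python
def base_pairing_breakdown(wc=0.59, wc_also_non_wc=0.07,
                           non_wc_only=0.20, unpaired=0.21):
    """Combine the approximate fractions quoted in the text.

    `wc_also_non_wc` is interpreted as a fraction of the WC-paired bases.
    """
    wc_and_non_wc = wc * wc_also_non_wc       # WC-paired bases that also pair non-WC
    any_non_wc = wc_and_non_wc + non_wc_only  # bases in at least one non-WC pair
    return {
        "total": wc + non_wc_only + unpaired,
        "wc_and_non_wc": wc_and_non_wc,
        "any_non_wc": any_non_wc,
    }


if __name__ == "__main__":
    for name, value in base_pairing_breakdown().items():
        print(f"{name}: {value:.1%}")
```

Under this reading the three categories account for all bases, and roughly a quarter of rRNA bases take part in at least one non-WC basepair.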

I.2.1.3. Modular and recurrent 3D motifs

Modular 3D motifs. Most 3D motifs are flanked by WC basepairs and they are modular

in the sense that they can be attached to or inserted within any double helix and still form the

same 3D structure. These observations suggest the following general definition: “Modular RNA

3D motifs are autonomous sets of interacting nucleotides that form a defined 3D structure.” This

definition distinguishes structural motifs from sequence motifs and full motifs from sub-motifs

and emphasizes the physical interactions of the nucleotides rather than the sequence identity of

each nucleotide. While the Watson-Crick helix is the most important RNA 3D motif, here we

will focus on motifs that comprise non-Watson-Crick basepairs. When one or more flanking

Watson-Crick pairs form tertiary interactions with the "loop" nucleotides of the motif, they are best considered part of the motif. For example, in the C-loop motif, both flanking WC pairs form

base triples with the nucleotides of the C-loop (Lescoute, Leontis, Massire, & Westhof, 2005).

Even when the flanking basepairs are not forming base-triples, they usually interact with

nucleotides of the 3D motif by stacking. Therefore it is not surprising that the flanking basepairs

are often conserved or show a strong statistical preference. Thus the flanking basepair for UNCG

hairpin loops is usually a cis Watson-Crick CG basepair (abbreviated “cWW” --- see below) and

the flanking basepair in the eleven nucleotide GAAA loop-receptor is cis Watson-Crick (cWW)

GU (Cate, Gooding, Podell, Zhou, Golden, Kundrot, et al., 1996). Therefore, we generally

include the flanking cWW pairs in the 3D motif.

Recurrent 3D motifs. Many 3D motifs are recurrent. Homologous RNA molecules

usually contain the same motifs at corresponding positions in their structures as a result of

evolutionary conservation. Recurrent motifs also occur in unrelated RNA molecules (or at non-

equivalent positions of homologous molecules) as a result of convergent evolution. Instances of

the same recurrent motif share a set of core nucleotides that can be superposed in 3D space; each

core nucleotide bears the same relationships to neighboring nucleotides as do the equivalent

nucleotides in the other instances of the motif. Thus two helices of the same length are instances

of the same motif because they can be superposed base-by-base, and equivalent bases in each

helix basepair and stack in geometrically similar ways. Instances of a recurrent motif have

common basepairing and base-stacking interactions but can differ significantly in sequence or in

strand topology. The (generally unknown) set of all sequences that form a particular 3D motif is

its “sequence signature.” When we speak of a recurrent RNA 3D motif we are actually talking

about all the different sequence variants that can form the same 3D structure and carry out

similar functions.

Sequence differences can result from base substitutions or from insertions or deletions.

When comparing two structures, insertions in the first structure relative to the second structure

appear as deletions in the second relative to the first, so we refer to insertions and deletions

collectively as “indels.” Due to the flexibility of the RNA backbone, even large indels can be

accommodated at certain positions to produce different versions of what is essentially the same

motif. Comparison of different instances of recurrent motifs can help us to understand the

sequence variations compatible with the 3D structure, and thus facilitate the identification of

motifs when all we have are RNA sequences. This is an important step in predicting RNA 3D

structures and improving our ability to find non-coding RNA genes in genomes.

The take-home message from the 3D data is that, to define the 3D motifs of hairpin, internal, and multi-helix junction loops precisely, the conserved interactions between motif nucleotides must be identified and classified.

I.2.1.4. Neutral substitutions in helices

The 3D structures of RNA double helices are very regular and largely independent of sequence, owing to the remarkable isostericity of the canonical cis Watson-Crick basepairs AU,

UA, GC, and CG. "Isosteric" means “occupying the same space” and in the context of base-

pairing refers to the space between the sugar-phosphate backbones of the interacting strands of

the helix. Because the canonical cis WC basepairs are isosteric, they can substitute for each other

in an RNA double helix without perturbing its structure. The key observation is that the RNA

helix is defined by the type of interactions between the nucleotides, not the specific sequence. It

is usually not meaningful to speak of a “consensus” sequence for a helix because structure-

neutral mutations can substitute one Watson-Crick basepair for another. The isostericity of the

canonical cis WC basepairs is the physical basis for the comparative approach to RNA sequence

analysis which led to accurate predictions of the secondary structures of large RNAs long before

their 3D structures were determined (Pace, Thomas, & Woese, 1999). Of course, the exact

thermodynamic stabilities of helices are sequence dependent due to variations in base-pairing

and base-stacking free energies (Mathews & Turner, 2006). Also, if the specific helix forms

tertiary RNA interactions or binds a protein or other ligand, there may be additional base-specific

constraints on the sequence. In fact many Watson-Crick basepairs in structured RNAs like the

16S and 23S rRNAs are highly conserved, and this conservation correlates with the occurrence of specific tertiary RNA or RNA-protein interactions.
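
The structure-neutral substitution of canonical pairs in a helix can be illustrated with a minimal Python sketch. The function name `forms_canonical_helix` and the example sequences are hypothetical and chosen only for illustration. The sketch checks basepair type alone and ignores thermodynamic stability and any tertiary or protein-binding constraints.

```python
# The four canonical cis Watson-Crick basepairs named in the text.
CANONICAL_CWW = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def forms_canonical_helix(strand_a, strand_b):
    """Return True if two strands, both written 5'->3', pair in an
    antiparallel fashion using only canonical cWW basepairs."""
    if len(strand_a) != len(strand_b):
        return False
    # Antiparallel pairing: first base of strand_a pairs with last base of strand_b.
    return all((x, y) in CANONICAL_CWW for x, y in zip(strand_a, reversed(strand_b)))


# Two hypothetical helices with different sequences but the same pairing pattern.
print(forms_canonical_helix("GGCA", "UGCC"))
print(forms_canonical_helix("CCGU", "ACGG"))
# A single substitution that breaks a canonical pair.
print(forms_canonical_helix("GGCA", "UGCA"))
```

Both of the first two helices pass the check even though they share no basepair identity at any position, which is why a "consensus" sequence is usually not informative for a helix.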

This idea of structure-neutral isosteric substitutions can be fruitfully applied to non-

Watson-Crick basepairs, the basic building blocks of RNA 3D motifs, as explained in the next section.

I.2.2. Identifying, classifying and annotating nucleotide interactions that stabilize

RNA 3D motifs

I.2.2.1. Reduced representations of RNA 3D structure

Atomic resolution 3D structures from x-ray crystallography provide detailed descriptions of RNA 3D structures and motifs in the form of sets of Cartesian coordinates for each atom.

However, this description is too detailed for many applications, and in any case, the reported precision of crystallographic data, to thousandths of an Ångstrom, is misleading. To make the

RNA structural data useful to bioinformatics applications, reduced representations of RNA structure are needed that capture the nature of the conserved interactions between the core nucleotides of each 3D motif. The interactions that interest us most are those that constrain the sequence and can therefore be used to identify motifs in genomic sequences. These interactions directly involve the bases. Most attention has been paid to classifying and annotating the non-

WC basepairs, as they are the recurrent modular subunits of RNA 3D motifs, just as the Watson-

Crick basepairs are for double helices. The crucial issue a classification should address is which basepairs substitute for each other in structure-neutral ways, without significantly perturbing the

3D structure of the motif.

I.2.2.2. Classification and Annotation of Base-pairing Interactions in RNA

Structures

RNA bases, purines and pyrimidines, present three edges for hydrogen-bonding interactions with other bases, the Watson-Crick, the Hoogsteen, and the Sugar Edges, illustrated for adenosine (A) in the left panel of Figure I.2.1 (Leontis & Westhof, 2001). For pyrimidines, the Hoogsteen Edge is also called the "CH" edge. The RNA Sugar Edge includes the 2'-hydroxyl, a functional group that distinguishes RNA from DNA and plays an important role in

RNA tertiary interactions and RNA chemistry. Bases can pair using any of six combinations of the three edges, for example, the Watson-Crick Edge of one base with the Watson-Crick,

Hoogsteen, or Sugar Edge of a second base. In addition, for each combination of edges, the bases can approach each other in two orientations, which are called cis and trans by analogy to the geometric isomerism at carbon-carbon double bonds. As shown in the right panel of Figure I.2.1, in cis basepairs, the glycosidic bonds joining the bases to their respective sugar moieties are found on the same side of the axis shown in grey. This axis is defined by the hydrogen bonds joining the base edges. In the trans orientation, the glycosidic bonds are on opposite sides of this axis. Thus, there are twelve basic geometric families of basepairs in RNA. Information about the basepair families, their abbreviations, and symbols for representing them in secondary structures are collected in Table I.2.1. Each geometric family is shown schematically in the upper panel of

Figure I.2.2, using right triangles to represent each base (Leontis, Stombaugh, & Westhof, 2002;

Leontis & Westhof, 2001).
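
The count of twelve families follows directly from the combinatorics described above: six unordered combinations of the three edges, each in two orientations. A minimal Python sketch of this enumeration is given here. The function name `geometric_families` is hypothetical, and the edge order in each abbreviation is simply the order produced by the enumeration.

```python
from itertools import combinations_with_replacement

EDGES = ("W", "H", "S")      # Watson-Crick, Hoogsteen, Sugar
ORIENTATIONS = ("c", "t")    # cis, trans


def geometric_families():
    """Enumerate family abbreviations such as 'cWW' or 'tHS'."""
    families = []
    for orientation in ORIENTATIONS:
        # Unordered edge combinations: WW, WH, WS, HH, HS, SS.
        for edge1, edge2 in combinations_with_replacement(EDGES, 2):
            families.append(f"{orientation}{edge1}{edge2}")
    return families


families = geometric_families()
print(len(families))
print(families)
```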

Figure I.2.1. Base edges and Base-pair geometric isomerism. (Upper left) The structure of adenosine showing the three base edges (Watson-Crick, Hoogsteen and Sugar-edge) available for hydrogen-bonding interactions. (Lower left) Representation of RNA base as a triangle (see also Figure I.2.2). The position of the ribose is indicated with a circle in the corner defined by the Hoogsteen and Sugar edges. (Right) Cis and Trans base-pairing geometries, illustrated for two bases interacting with Watson-Crick edges. (Leontis & Westhof, 2001).

The hypotenuse of each triangle represents the Hoogsteen Edge of the base. Circles or crosses are placed in the corner of the triangle defined by the Hoogsteen and Sugar Edges to indicate the direction of the sugar-phosphate backbone in the default case that all glycosidic bonds are in the anti configuration. A circle represents the sugar-phosphate backbone emerging

5' to 3' out of the plane toward the reader and the cross represents the opposite orientation

(Leontis et al., 2002; Leontis & Westhof, 2001, 2002). The six basepairs in cis are shown in the

upper half of Figure I.2.2 and the six basepairs in trans are shown immediately below the

respective cis basepairs. Each of the 12 geometric basepair types is represented by a symbol to

unambiguously annotate that pair in secondary structure diagrams, as described below (Leontis

& Westhof, 2001).

Table I.2.1. The Twelve Geometric Base-pair families. Each family is specified by the relative orientation of the glycosidic bonds (column 2) and the interacting edges of the bases (columns 3 and 4). Abbreviations and corresponding symbols for annotating basepairs in diagrams are given in columns 5 and 6. Column 7 defines the default local strand orientations for each base-pair family when both bases are in the default anti-configuration of the glycosidic bonds.

The cis/trans distinction for basepairs should not be confused with the designations syn and anti of rotational isomers of individual nucleotides that result from the rotation of the base about the glycosidic bond connecting the base to the sugar moiety.

Figure I.2.2. Schematic representations of geometric families and symbols for annotating structures. Upper panel: The twelve geometric basepair families are shown using triangles to represent bases. Circles represent Watson-Crick edges, squares Hoogsteen edges, and triangles Sugar edges. Basepair symbols are composed by combining edge symbols, with solid symbols indicating cis basepairs and open symbols, trans basepairs. Lower Left: Symbols for other pairwise interactions. Lower Right: Additional symbols for base-stacking, reversal of chain direction in hairpin loops, syn bases, and bases forming tertiary interactions (Leontis et al., 2002).

Abbreviations. The geometric basepair families are abbreviated "cWW" for cis Watson-

Crick/Watson-Crick, "tHS" for trans Hoogsteen/Sugar Edge, and so on, as summarized in Table

I.2.1. The cHH family is very rare and usually occurs with one nucleotide in the syn configuration of the glycosidic bond to minimize steric clash between the backbones of the interacting nucleotides. The cHS family usually occurs between adjacent nucleotides in the same strand to form platform motifs. The cWW, tWW, and tHH basepairs are generally symmetric (interchanging the bases produces equivalent basepairs), but the cSS and tSS pairs are not symmetric, so annotations are needed that reflect their asymmetry. In cSS pairs, the nucleotide that hydrogen bonds with its 2’-OH to both the 2’-OH and the base of the other nucleotide is assigned higher priority in the interaction and is indicated with an upper-case letter, while the other nucleotide is indicated with a lower-case letter (i.e., cSs). Thus, the basepair shown in the lower right panel of Figure I.2.9 is an A/G cSs pair, as the A has higher priority than the G. For tSS basepairs, higher priority is assigned to the base that forms an H-bond with the 2’-OH of the other nucleotide in addition to the base-to-base H-bonds (see Figure I.2.9).

Evidence Supporting the Triangle Abstraction: Base Triples and Quadruples. How realistic is the abstraction of RNA bases as triangles? It implies that a single RNA base can interact edge-to-edge in the same plane, with up to three different bases, so as to produce base quadruples. Symbolic searching using the “Find RNA 3D” (“FR3D”) RNA motif search program

(Sarver, Zirbel, Stombaugh, Mokdad, & Leontis, 2008) shows at least ten different base quadruples of this type, consistent with the prediction. Figure I.2.3 shows an example of one of these quadruples from 16S rRNA (PDB: 1j5e), where the center base, G68 (blue), forms a cWW basepair with A101 (magenta), a tsS pair with A152 (orange), and a cHW pair with G64 (red).

Figure I.2.3. Triangle abstraction for RNA bases. As implied by the triangle abstraction, RNA bases can interact with three different bases using their three edges, Watson-Crick (WC), Hoogsteen (H) and Sugar (Sug), forming “saturated” base quadruples. (Left) Example of a base quadruple of this type from T. thermophilus 16S rRNA (PDB file 1j5e) in which G68 (blue) forms a cWW pair with A101 (magenta), a cHW pair with G64 (red) and a tsS pair with A152 (orange). The green dotted lines indicate hydrogen bonds. (Right) Schematic representation showing each base as a triangle with edges labeled. The base-pairing type is given using the symbols from Figure I.2.2.

Many different base triples and quadruples occur in RNA structures. As for the base quadruple in Figure I.2.3, almost all base triples and quadruples can be decomposed into combinations of the twelve geometric basepair families. In this way, it is straightforward to classify these higher order groupings (Nasalean et al., in preparation). Most base triples comprise a central base interacting with two other bases using two distinct edges. However, a second type of base triple is also possible, in which one base, usually a purine, pairs with two other bases using the same edge, usually its Sugar Edge. This case is very frequent in tertiary interactions involving the minor groove. An example of such an interaction, which is also called a Type I A-minor motif in the literature, will be discussed below.
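
The decomposition of a higher order grouping into pairwise interactions can be sketched in Python as a simple bookkeeping of which edge of the central base each partner uses. The sketch below encodes the G68 quadruple as described in the text. The edge assigned to G68 in each pair is our reading of the “saturated” triangle picture (one partner per edge), and the function names and the second, two-partner example are hypothetical.

```python
from collections import Counter

# (partner, basepair family, edge used by the central base G68)
G68_INTERACTIONS = [
    ("A101", "cWW", "W"),
    ("G64", "cHW", "H"),
    ("A152", "tsS", "S"),
]


def edge_usage(interactions):
    """Count how many partners use each edge of the central base."""
    return Counter(edge for _, _, edge in interactions)


def is_saturated(interactions):
    """True when each of the three edges is used by exactly one partner."""
    usage = edge_usage(interactions)
    return set(usage) == {"W", "H", "S"} and all(n == 1 for n in usage.values())


def shares_an_edge(interactions):
    """True for the second type of grouping, where one edge binds two partners."""
    return any(n > 1 for n in edge_usage(interactions).values())


print(is_saturated(G68_INTERACTIONS))
print(shares_an_edge(G68_INTERACTIONS))

# Hypothetical triple in which a central base uses its Sugar Edge twice.
same_edge_triple = [("X1", "cSS", "S"), ("X2", "tSS", "S")]
print(shares_an_edge(same_edge_triple))
```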

Bifurcated and Water-inserted pairs. If one also allows for bifurcated and solvent-

inserted pairs to extend the 12 basepair families, then the vast majority of basepairs can be

classified within this framework (Leontis et al., 2002). In solvent-inserted basepairs, the basepair

opens while maintaining one direct H-bond between the bases to allow a small molecule, usually

water but sometimes an ion, to insert. The inserted molecule mediates additional interactions

between the base edges.

Bifurcated basepairs involve H-bonds between an exocyclic functional group (amino or

carbonyl oxygen) of one base and the base edge of the second base and can be accommodated in

the framework of the twelve basepair families in the following way: The twelve families form

two distinct groups of six families. Within each group the six families are related by ~90°

rotations in the basepair plane of one base relative to the other, without flipping either base.

These rotations transform one basepair into another within each group by changing one

interacting edge at a time. For example a cWW pair can be transformed in one step into a cWS,

cSW, tWH or tHW pair, but not a cSS, tSH, tHS, or cHH pair. A 3x3 matrix represents each

group of basepairs as shown in Table I.2.2. The basepairs in neighboring horizontal or vertical

cells in each matrix can be transformed by rotating one base with respect to the other ~90°

without leaving the plane. Bifurcated basepairs result when this rotation is incomplete, so that an

exocyclic functional group, G(O6), U(O4), A(N6), or C(N4) in the corner between the Watson-

Crick and Hoogsteen edges, or G(N2), U(O2), or C(O2) in the corner between the Watson-Crick

and Sugar Edges, interacts with one of the edges of the second base. The most common case

involves the WC/H corner of one base and the WC edge of the second base. The bifurcated and

water-inserted basepairs have been described previously in more detail (Auffinger & Hashem,

2007; Leontis et al., 2002; Leontis & Westhof, 1998).
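
The single-step transformations described above can be sketched in Python. This sketch rests on two assumptions that we infer from the cWW example in the text rather than take from the published table: the edges are ordered Hoogsteen, Watson-Crick, Sugar, so that only neighbors in that order are one ~90° step apart, and a step between the Watson-Crick and Hoogsteen edges switches cis and trans while a step between the Watson-Crick and Sugar edges does not. All function names are hypothetical.

```python
EDGE_ORDER = ("H", "W", "S")  # assumed order; adjacent entries are one step apart


def flip(orientation):
    return "t" if orientation == "c" else "c"


def neighbors(pair):
    """Return the basepair types reachable from `pair` (e.g. 'cWW') in one step."""
    orientation, edges = pair[0], [pair[1], pair[2]]
    result = []
    for i in (0, 1):                       # rotate one base at a time
        idx = EDGE_ORDER.index(edges[i])
        for j in (idx - 1, idx + 1):
            if 0 <= j < len(EDGE_ORDER):
                new_edges = list(edges)
                new_edges[i] = EDGE_ORDER[j]
                # Assumed rule: W<->H switches cis/trans, W<->S keeps it.
                switched = {edges[i], EDGE_ORDER[j]} == {"W", "H"}
                new_orientation = flip(orientation) if switched else orientation
                result.append(new_orientation + "".join(new_edges))
    return result


def group(start):
    """Collect every cell of the 3x3 matrix reachable from `start`."""
    seen, frontier = {start}, [start]
    while frontier:
        for nxt in neighbors(frontier.pop()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def family(pair):
    """Ignore which base is listed first, so that tHW and tWH count once."""
    return pair[0] + "".join(sorted(pair[1:], key="WHS".index))


print(sorted(neighbors("cWW")))
print(sorted({family(p) for p in group("cWW")}))
print(sorted({family(p) for p in group("tWW")}))
```

Under these assumptions, each group fills the nine cells of a 3x3 matrix, which correspond to six distinct families once the order of the two bases is ignored.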

Table I.2.2. Spatial relationships between the geometric base-pair families. The geometric families form two distinct groups. Within each group, basepair types can be transformed by a ~90° rotation of one base in the basepair plane, changing one interacting edge at a time as shown by arrows connecting basepair families.

I.2.2.3. Annotation of secondary structures

Annotations for 2D diagrams have been developed to communicate essential features of

RNA 3D structures accurately and succinctly. In addition to the classical secondary structure, the

annotations show 1) all non-Watson-Crick basepairs with unique symbols that specify the

geometric family of the basepair; 2) all bases that are in the syn glycosidic configuration; 3) all

points in the chain where the backbone reverses direction; 4) key base-stacking and base-

phosphate interactions; and 5) sequential numbering of nucleotides in the 5’-to-3’ direction.

Annotations of Group I introns, 16S rRNA and many aptamers and small have been

published (Adams, Stahley, Kosek, Wang, & Strobel, 2004; Lescoute & Westhof, 2006a).

Annotation of basepairs. The basepair symbols are derived in a simple way by

associating a different symbol with each edge: the circle with the Watson-Crick Edge, the square with the Hoogsteen Edge and the triangle with the Sugar Edge. Solid symbols indicate cis basepairs and open symbols, trans basepairs. For bases pairing with different edges, the symbols indicate the edge used by each base to form the pair. When the same edge is used by both bases, the basepair type is indicated by a single symbol, filled or open, placed on a line joining the letters designating the bases. The basepair symbols with their respective pairing types are also shown in Figure I.2.2 and compiled in Table I.2.1 and are used throughout this chapter to annotate diagrams representing 3D RNA motifs.

Symbols in common use that do not conflict with the new conventions can still be used, notably ‘-’ for AU or UA and ‘=’ for GC or CG. “Wobble” GU or UG, being a type of cWW, is designated with a filled circle, not an open circle, to avoid confusion with trans basepairs.

When only one hydrogen bond occurs between two bases or sugar atoms, a dashed line '- -' is used to denote the interaction. To denote cWW bifurcated or cWW water-inserted pairs, the

letter 'B' or 'W' is added to the filled circle used to represent cWW pairs, as shown in the lower

left corner of Figure I.2.2.

Helix packing interactions. Two nucleotides can interact by interlocking of their Sugar

Edges, without direct contact between the bases per se. This has been called

variously “A-minor type 0” or “helix packing” motif and can be designated using the letter “P”

placed in an open triangle (Gagnon & Steinberg, 2002; Mokdad, Krasovska, Sponer, & Leontis,

2006; Nissen, Ippolito, Ban, Moore, & Steitz, 2001).

Base-stacking. Two RNA bases can stack face-to-face in four different ways, depending

on the base faces that come in contact, the 5'-face or the 3'-face of each base. The 5’- and 3’-

faces are defined by reference to the normal orientation of each base in the Watson-Crick helix,

in which all bases are in the anti-glycosidic conformation; the 5’-face points toward the 5’-end of the strand and the 3’-face toward the 3’-end of the strand (Sarver et al., 2008). To show that two adjacent bases in the RNA chain are stacked, the letters representing them are drawn right above or below each other in the secondary structure. If one base is bulged out and not stacked on its neighbors in the chain, it is drawn to one side. In some motifs, “cross-strand stacking” occurs between bases in the same motif but on opposite strands. This can be indicated with an 'I-beam' connecting the two stacked bases. When the stacked bases are far apart, one base can be represented by a rectangle placed above or below the base with which it stacks and connected by a line to the letter representing it in the secondary structure.

Base-phosphate interactions are hydrogen bonds between the WC, Hoogsteen or Sugar edges of a base and the phosphate oxygen atoms of a second nucleotide. Base-phosphate interactions are indicated by symbols composed of a circle containing a ‘P’ to indicate the phosphate connected by a line to a circle, square or triangle to indicate the interacting edge

(Watson-Crick, Hoogsteen or Sugar) of the base. Different classes of the base-phosphate interactions can be proposed depending on the specific base and base edge interacting with the phosphate (Stombaugh et al., In Preparation).

Additional annotations. Some other descriptive symbols are used to denote changes in strand orientation (dashed line arrow or red solid line arrow) or to show that a base is in the syn conformation (bold nucleotide letter or red letter). A box is placed around nucleotides that participate in tertiary interactions and the box is connected with the appropriate interaction symbol to the interacting base(s).

I.2.2.4 Structure-neutral mutations in recurrent RNA 3D motifs

Structure-neutral mutations. Mutations in RNA sequence that disrupt the 3D structure of a functionally important motif are less likely to be passed on to subsequent generations as a result of the evolutionary process of natural selection. This is because the function of a molecule depends on its ability to fold into the functionally active 3D structure. Mutations that preserve

3D structure are called structure-neutral mutations. Two kinds of mutations need to be considered: substitutions and insertions or deletions (indels).

Insertions and deletions. Indels can be structure neutral, depending on where they occur in a motif. A consequence of the high flexibility of the RNA backbone is that even a single nucleotide can be bulged out of an RNA motif without significantly perturbing its 3D structure.

Sites that can accommodate one such insertion often allow two or more, as long as they do not interfere by steric clash with tertiary interactions the motif must form. Mutations that disrupt the structure of the motif and consequently impair its function will be selected against. By comparing instances of the same motif in 3D structures we can determine the nucleotide positions that tolerate insertions and thus improve our ability to predict the motif from

sequences. This idea will be illustrated below for hairpin loop motifs.

Base substitutions and basepair isostericity. Base substitutions for basepairs are

structure-neutral when they result in isosteric basepairs. The geometric basepair classification

groups isosteric basepairs in the same geometric families. Basepairs from different geometric

families are not isosteric. However, not all basepairs in the same geometric family are isosteric.

Rather each family comprises one or more subsets of isosteric basepairs (Leontis et al., 2002).

This is illustrated in Figure I.2.4. Two basepairs are isosteric when they meet three criteria: 1)

The C1’-C1’ distances are the same; 2) the paired bases are related by the same rotation in 3D

space; and 3) H-bonds form between equivalent base positions. The cWW GC, CG, and AU

basepairs (upper and lower left and upper center of Figure I.2.4) meet all three criteria and are

isosteric to each other, as shown. The cWW AG pair (lower center) and GU pair (upper right) are

in the same geometric family and so the paired bases are related by the same 3D rotation.

However, the cWW AG pair has a significantly longer C1’-C1’ distance (12.7 Å) and so is not isosteric to the other pairs, even though it meets the other two criteria. While the C1’-C1’ distance in the cWW GU (wobble) pair is about the same as for the GC, CG, AU, and UA (i.e. the canonical cWW pairs), the U in the GU pair is shifted toward the major groove so H-bonding does not occur between equivalent atomic positions compared to the canonical cWW pairs. This change is more subtle and so GU is considered near isosteric to the canonical cWW pairs AU,

UA, GC, and CG, consistent with its ability to substitute in Watson-Crick helices for these pairs.

The last example, cWH AG (Figure I.2.4, lower right), has about the same C1’-C1’ distance as the canonical cWW pairs, but belongs to a different geometric family. The bases are related by a very different 3D rotation so it is not isosteric or near isosteric to any of the cWW basepairs shown in Figure I.2.4. For each of the 12 geometric families, isosteric and near isosteric

subgroups have been identified (Leontis et al., 2002). These are applied to predict structure-

neutral substitutions in 3D motifs to supplement observed instances from the 3D database. The

isostericity relations are summarized in Isostericity Matrices, as will be illustrated below (section

I.2.3.2).
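
The three isostericity criteria can be expressed as a short decision procedure, sketched below in Python. This is a deliberate simplification and not the procedure used to build the Isostericity Matrices. The geometric family stands in for the 3D rotation relating the bases, a text label (`hbond_register`, a hypothetical field) stands in for the pattern of H-bonded positions, and the 0.5 Å tolerance is an arbitrary placeholder. Only the 12.7 Å distance for cWW AG comes from the text; the value of 10.5 Å used for the other pairs is an approximate placeholder.

```python
from dataclasses import dataclass

# Hypothetical tolerance (in Angstroms) for treating two C1'-C1' distances as equal.
C1_TOLERANCE = 0.5


@dataclass(frozen=True)
class Basepair:
    name: str
    family: str           # proxy for criterion 2 (same 3D rotation)
    c1_distance: float    # criterion 1 (C1'-C1' distance, in Angstroms)
    hbond_register: str   # proxy for criterion 3 (equivalent H-bonded positions)


def compare(a, b):
    """Classify two basepairs as isosteric, near isosteric, or not isosteric."""
    if a.family != b.family:
        return "not isosteric"
    if abs(a.c1_distance - b.c1_distance) > C1_TOLERANCE:
        return "not isosteric"
    if a.hbond_register != b.hbond_register:
        return "near isosteric"
    return "isosteric"


# Placeholder distances, except 12.7 for cWW AG, which is quoted in the text.
GC = Basepair("GC", "cWW", 10.5, "canonical")
AU = Basepair("AU", "cWW", 10.5, "canonical")
GU = Basepair("GU", "cWW", 10.5, "shifted")
AG_WW = Basepair("AG", "cWW", 12.7, "canonical")
AG_WH = Basepair("AG", "cWH", 10.5, "other")

for other in (AU, GU, AG_WW, AG_WH):
    print(f"GC vs {other.family} {other.name}: {compare(GC, other)}")
```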

Figure I.2.4. Isosteric relationships between basepairs. Two basepairs are isosteric when they meet three criteria: 1) The C1’-C1’ distances are the same; 2) the paired bases are related by the same rotations in 3D space; and 3) H-bonds form between equivalent base positions. The cWW GC, CG, and AU basepairs (upper and lower left and upper center) meet all three criteria and are isosteric to each other, as shown. The cWW AG pair (lower center) and GU pair (upper right) are in the same geometric family and so the paired bases are related by the same 3D rotation. However, the cWW AG pair has a significantly longer C1’-C1’ distance (12.7 Å) and so is not isosteric to the other pairs, even though it meets the other two criteria. The C1’-C1’ distance in the cWW GU (wobble) pair is about the same, but the U is shifted toward the major groove so H-bonding does not occur between the same positions as in the other cWW pairs. This change is more subtle and so GU is considered near isosteric to the canonical cWW pairs AU, UA, GC, and CG, consistent with its ability to substitute in Watson-Crick helices for these pairs. The last example, cWH AG (lower right), has about the same C1’-C1’ distance as the canonical cWW pairs, but belongs to a different geometric family. The bases are related by a very different 3D rotation so it is not isosteric or near isosteric to any of the cWW basepairs.

I.2.3 Defining recurrent 3D motifs and identifying them in structures.

Concise definitions of 3D motifs are needed to automatically search for them in 3D

structures and to formulate algorithms to find them in RNA sequences. The definitions should be

sufficiently precise to differentiate motifs with similar structures.

I.2.3.1 Classification of “loop” motifs.

The “loop” motifs of secondary structure are classified according to their locations: 1)

Hairpin (or terminal) loops are positioned at the ends of helices, 2) internal loops are located

within (or between) helices, and 3) multi-helix junction loops join three or more helices. Loop

motifs have been further classified according to the number of nucleotides they contain: Hairpin

loops have been classified as tri-loops (Lee, Cannone, & Gutell, 2003; Lisi & Major, 2007),

tetra-loops (Woese, Winker, & Gutell, 1990), penta-loops (Stefl & Allain, 2005) and so on, and internal loops as symmetric internal loops (2x2, 3x3, etc.) or asymmetric internal loops (1x2,

1x3, 2x3, etc.) depending on the number of nucleotides in the component strands. Likewise,

junction loops have been classified according to the number of helices (3-way, 4-way or higher

order junctions) and the numbers of nominally unpaired bases linking the helices to each other

(Altona, 1996; Gan et al., 2004). These numerical classifications, however, can be misleading.

On the one hand, nucleotides can be inserted or deleted at certain positions in motifs without significantly perturbing the rest of the structure. On the other hand, sequences of the same length can fold in very different ways. These effects are due to the high flexibility of the RNA backbone, and the sequence specific folding of RNA. Consequently, the number of nucleotides in a “loop” is not a robust criterion for classification, as illustrated in the following examples.

Tetraloops and Pentaloops that form the same motif. Figure I.2.5 shows examples of

hairpin loops that are classified differently at the level of sequence and secondary structure, yet

form the same 3D structures. Figures I.2.5a and I.2.5b compare the pentaloop 5’-CAGAA-3’ with the tetraloop 5’-GAGA-3’. GAGA is an example of a “GNRA tetraloop” motif, so-named to indicate the consensus sequence identified by comparing secondary structures (Woese et al.,

1990): G exclusively as the first base, A, C, G, or U (“N”) as the second, A or G (“R”) as the third, and A as the fourth base. The GAGA hairpin conforms to the “GNRA” consensus sequence and, as expected, forms the well-known 3D structure with the tSH closing basepair

between the first and fourth bases of the loop (Ban, Nissen, Hansen, Moore, & Steitz, 2000). The

strand reverses direction after the first nucleotide of the loop, as indicated by the curved arrow in

Figure I.2.5c, and the second and third bases stack continuously on the fourth on the 3’-side of

the loop. The CAGAA pentaloop sequence does not conform to the GNRA consensus, but the

3D structure shows that it forms the same 3D motif (Wimberly et al., 2000). The extra base of the penta-loop, A497, is bulged out, but this does not significantly perturb the 3D structure, as shown by superposition of the two motifs (Figure I.2.5e). Although the closing basepair is different, CA in the pentaloop and GA in the tetraloop, the basepair type, tSH, is the same in both structures. Moreover, tSH CA and tSH GA are isosteric (Leontis et al., 2002).
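
The limitation of consensus-based naming can be made concrete with a minimal Python sketch that matches loop sequences against a consensus written in IUPAC nucleotide codes. The helper names are hypothetical, and only the codes needed here are included. The sketch operates on sequence alone, which is precisely why it fails to recognize the pentaloop.

```python
import re

# Subset of IUPAC nucleotide codes, written as regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "N": "[ACGU]",  # any base
    "R": "[AG]",    # purine
    "Y": "[CU]",    # pyrimidine
}


def consensus_to_regex(consensus):
    """Translate a consensus such as 'GNRA' into a compiled pattern."""
    return re.compile("".join(IUPAC[code] for code in consensus))


def matches(consensus, loop):
    """True when the whole loop sequence conforms to the consensus."""
    return consensus_to_regex(consensus).fullmatch(loop) is not None


print(matches("GNRA", "GAGA"))    # tetraloop discussed in the text
print(matches("GNRA", "CAGAA"))   # pentaloop that forms the same 3D motif
```

A sequence-only test accepts GAGA and rejects CAGAA, although the 3D structures show that both form the same motif.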

As a second example, the 5’-GAAAG-3’ pentaloop and the 5’-UUCG-3’ tetraloop appear

unrelated at the level of sequence and secondary structure. UUCG conforms to the consensus

“UNCG” sequence and not surprisingly its 3D structure exhibits the characteristic features of

these well-known motifs, including the syn configuration of the fourth base, the tSW closing basepair between the first and fourth base, the stacking of the third base on the first base and the chain reversal after the third base instead of the first base, as in the GNRA tetraloops

(Krasilnikov, Yang, Pan, & Mondragon, 2003). The second base of UNCG loops is bulged out.

The 3D structure of the GAAAG pentaloop (Ban et al., 2000), annotated in Figure I.2.5h and

superposed on that of UUCG, shows that the GAAAG pentaloop and the UNCG tetraloop have very similar 3D structures. As in the previous example, the closing basepair type is the same

(tSW), although the bases are different (GA vs. UG). The tSW GA and tSW UG pairs are near isosteric.

Figure I.2.5. Structurally similar hairpin loop motifs. a) – e) Comparison of sequences and structure annotations of two geometrically similar hairpin loop motifs, only one of which is a tetraloop and conforms to the consensus sequence “GNRA.” f) – j) Comparison of sequences and structure annotations of two geometrically similar hairpin loop motifs, only one of which is a tetraloop and conforms to the consensus sequence “UNCG.” e) Stereo superpositions of motifs in c) and d). j) Stereo superpositions of motifs in h) and i).

I.2.3.2 Defining and naming 3D motifs

The examples discussed above illustrate the confusion that results, especially for

newcomers to RNA structural bioinformatics, from the use of names for motifs that are based on

consensus sequences and number of nucleotides. These examples indicate the need for precise

definitions and names for RNA motifs to provide concise communication between humans and

software agents, and to make automated reasoning about RNA possible. We demonstrate the

process of constructing rigorous, structure-based definitions for 3D motifs, using “GNRA” loops

as examples.

Defining the “GNRA” hairpin motif: Instances of the same recurrent motif share a common set of core nucleotides and conserved interactions between them. The first step in constructing a structure-based definition is to identify all geometric instances in 3D structures to

determine the core nucleotides and their interactions. We have written the “Find RNA 3D”

(FR3D) suite of software tools to facilitate this process (Sarver et al., 2008). Using FR3D, we

carried out a geometric search of the non-redundant RNA structure database using as query motif

the centroid of a previous search. The query motif included the closing Watson-Crick basepair of

the adjacent double helix, which, for the reasons discussed above, is treated as part of the motif.

The search identified 108 instances with geometric discrepancy less than 0.75, as defined

in Sarver et al. (2008). These instances correspond to 45 unique sequences, and are listed in

Figure I.2.6, with representative examples for each sequence. For each motif candidate, FR3D

lists all base-pairing, base-stacking, and base-phosphate interactions between motif nucleotides

and creates a structural alignment of all instances. The structural alignment identifies bases that

superpose in 3D space as well as inserted bases not present in the query motif. Examination of

the alignment shows that insertions occur between the 3rd and 4th, 4th and 5th and 5th and 6th

nucleotide positions. The search shows that none of the insertions occurs frequently and so the

core motif consists of the 6 nucleotide positions of the query motif.

The search reveals two conserved basepair interactions – a cWW pair between the 1st and

6th bases and a tSH pair between the 2nd and 5th bases of the core motif. The strand always

changes direction between the 2nd and 3rd nucleotides and stacking occurs between the two

basepairs and also between the 3rd and 4th and the 4th and 5th bases. These structural features are

summarized in the panel labeled “GNRA Definition” in Figure I.2.6. The positions where

insertions are observed are also shown.
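The structural definition summarized above can be written down as a small data structure, which makes it straightforward to test whether a candidate's annotated interactions satisfy it. The following is a minimal sketch, not FR3D code: the names `MotifDefinition`, `GNRA`, and `matches` are hypothetical, positions are 1-based core positions, and the stacking between the two basepairs is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotifDefinition:
    """Structure-based motif definition (hypothetical helper, not part of FR3D)."""
    n_core: int            # number of core nucleotide positions
    basepairs: frozenset   # {(i, j, family)} using 1-based core positions
    stacks: frozenset      # {(i, j)} using 1-based core positions


# Transcribed from the text: cWW between positions 1 and 6, tSH between
# positions 2 and 5, and stacking between 3-4 and 4-5.
GNRA = MotifDefinition(
    n_core=6,
    basepairs=frozenset({(1, 6, "cWW"), (2, 5, "tSH")}),
    stacks=frozenset({(3, 4), (4, 5)}),
)


def matches(definition, basepairs, stacks):
    """Return True if the candidate contains every conserved interaction.

    Extra interactions in the candidate are allowed; only the conserved
    ones listed in the definition are required.
    """
    return (definition.basepairs <= set(basepairs)
            and definition.stacks <= set(stacks))
```

A candidate annotated with both basepairs and both stacks passes, while one lacking the tSH pair does not. Allowing extra interactions mirrors the observation that instances may carry insertions without leaving the motif family.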

The search also returns valuable co-variation information for each basepair in the motif.

These data are summarized as 4x4 contingency tables for each basepair superposed on the

corresponding Isostericity Matrix for that basepair type (lower left panels of Figure I.2.6). The

same background shading is used to indicate basepairs in each family that are isosteric. Similar

shading indicates near isosteric relations while white boxes indicate basepairs that do not occur

in that geometric family. These data show that all canonical cWW pairs occur for the basepair

between the 1st and 6th bases, but that CG and GC predominate. A significant fraction is cWW

UG pairs, but no GUs occur. As shown by the background shading, the canonical cWW basepairs are isosteric to each other and near isosteric to UG. GU is not isosteric to UG.
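The 4x4 contingency tables described above can be tallied directly from the aligned core sequences of the motif instances. The sketch below assumes each instance is given as a string of its core nucleotides in 5' to 3' order; the function name `pair_contingency` is hypothetical and any sequences passed to it in this illustration are made up, not taken from Figure I.2.6.

```python
from collections import Counter

BASES = "ACGU"  # row and column order of the table


def pair_contingency(sequences, i, j):
    """Count base combinations at 1-based core positions i and j.

    Returns a 4x4 nested list: rows are the base at position i,
    columns are the base at position j, both in the order A, C, G, U.
    """
    counts = Counter((seq[i - 1], seq[j - 1]) for seq in sequences)
    return [[counts[(a, b)] for b in BASES] for a in BASES]
```

For the closing cWW pair of a six-position core, one would call `pair_contingency(instances, 1, 6)`; a zero in the G row and U column alongside a nonzero U row and G column would reproduce the UG versus GU asymmetry noted in the text.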

The tHS basepair has two isosteric families. Almost all observed instances belong to the isosteric group 10.1 consisting of AN, CA, CC, and CU tHS basepairs. While most instances have the AG tHS basepair, a significant fraction do not. A small number of instances have GG tHS pairs, which belong to the second tHS isosteric group, indicated by a different color. These loops form a similar, but not identical, hairpin loop that forms specific tertiary interactions.

While tHS AA is not observed in this set of structures, isostericity considerations indicate it may occur in new structures. The basepair information is included in the motif definition by indicating the geometric family and the preferred isosteric group within the family (see the upper

left panel of Figure I.2.6).

Figure I.2.6. Structural Definition of “GNRA” hairpin loop motif. Upper left: Annotations showing key features of structural definition of “GNRA” motif, including conserved base-pairing and base-stacking interactions and positions of insertions. Right: Unique instances of “GNRA” hairpin loop motif obtained by geometric search of non-redundant RNA structure database using FR3D. Lower left: Isostericity matrices for conserved basepairs in “GNRA” motif instances obtained by geometric search.

Conserved tertiary interactions. The question arises whether motifs that match the structural definition for a motif but vary in length or sequence can still function in the same way.

Again we use the example of the “GNRA” hairpin loops, which occur widely in RNA structures and function by mediating tertiary interactions. For example, the 3D structures of the 16S rRNAs

of E. coli and T. thermophilus each contain 13 hairpin loops that meet the structural definition proposed above. Twelve of these mediate long-range tertiary interactions. Can structurally similar pentaloops also mediate these interactions? The answer is yes. Figure I.2.7 shows an example of tertiary interactions involving homologous hairpin loops in the 23S rRNAs of H. marismortui and T. thermophilus, one of which is a pentaloop and the other a tetraloop. As shown by the annotations, they form identical tertiary interactions (Stombaugh, 2004).

Figure I.2.7. Hairpin loops mediating tertiary interactions. Both the CAACU and GAAA hairpin loops meet the structural definition of a “GNRA” hairpin loop in Figure I.2.6. They occur at homologous sites in H. marismortui (left) and T. thermophilus (right) 23S rRNA and mediate identical tertiary interactions (PDB files 1s72 and 2j01).

I.2.3.3 Defining tertiary interaction motifs

Local vs. Composite motifs: Procedures similar to those outlined above for hairpin loop motifs are used to define internal and junction loop motifs. Again, it is important to find all instances of each motif to create accurate definitions. By definition, modular and recurrent internal loop motifs comprise two strand segments and are flanked by two helices. However, motifs first identified as internal loops often are found to also occur within multi-helix junction

loops or more complex topologies involving pseudo-knots. The sarcin/ricin and kink-turn motifs have local and composite instances. When a motif that was first identified in an internal loop motif is also found in a junction or pseudo-knot composed of three or more different strand segments, it is called a composite motif. The original internal loop version is called a local motif.

The search program FR3D was designed to find composite as well as local versions of recurrent motifs. In Figure I.2.8, local (left panel) and composite (right panel) versions of the sarcin/ricin motif are compared. The local version is the original sarcin/ricin motif of 23S rRNA and the composite is from a complex junction in Domain II of 23S rRNA. The annotated diagrams show that the two motifs comprise the same core nucleotides and the same interactions between them.

Figure I.2.8. Local vs. Composite motifs. Left: Local (internal loop) sarcin/ricin motif from H. marismortui 23S rRNA comprising two strand segments. Right: Composite sarcin/ricin motif from E. coli 23S rRNA comprising four different strand segments. Bottom: The 3D structures.

Long-range tertiary interaction motifs form when different elements of the secondary structure dock to stabilize the native 3D structure of an RNA molecule. Many long-range tertiary motifs are recurrent. They are also defined by their core nucleotides and conserved interactions.

Many form through the docking of hairpin or internal loop motifs in the minor grooves of helices or loop-receptor motifs. Some of these motifs have been given names, for example the

“canonical” and “cis ribose zipper” (RZ) motifs shown in Figure I.2.9 (Cate, Gooding, Podell,

Zhou, Golden, Kundrot, et al., 1996; Tamura & Holbrook, 2002). In this figure, the schematic diagrams of Tamura & Holbrook and the corresponding basepair annotations are shown side-by-side for the canonical and cis ribose zippers. Each of these tertiary interaction motifs is a combination of sugar-edge basepairs formed when two adjacent Watson-Crick basepairs in a helix interact with two stacked “loop” nucleotides, usually adenosines, that (most often) belong to a hairpin or internal loop. In the canonical RZ, one of the loop nucleotides forms a cSs pair with one base of a cWW pair and a tSs pair with the other, as shown in Figure I.2.9. The second loop nucleotide forms a csS pair with one base of the second cWW basepair. The diagram in the upper right panel of Figure I.2.9 illustrates how a purine (A in this case) can pair with two different bases using its sugar edge to form a distinct kind of base-triple composed of cWW, cSs and tSs basepairs. In the cis RZ, both of the loop nucleotides form cSs pairs with a cWW basepair. The lower right panel of Figure I.2.9 shows the difference between cSs and csS basepairs. The lateral shift that transforms cSs into csS is also shown. These examples show how tertiary interactions can be precisely defined in terms of the specific combinations of pairwise interactions of which they are composed. More complex tertiary interactions involve three and sometimes more pairwise interactions.

Figure I.2.9. Ribose zippers are tertiary interaction motifs composed of two Sugar-edge basepairs. Left: Schematic representation adapted from (Tamura & Holbrook, 2002) and basepair annotation (Leontis & Westhof, 2001) of “canonical” and “cis” Ribose Zipper (RZ) tertiary motifs. Upper Right: Base triple composed of GC cWW, AG cSs and AC tSs basepairs. The A forms two pairs with its Sugar edge and is assigned higher priority in each interaction, as explained in the text. Lower Right: Comparison of A/G cSs (left) and A/G csS (right) basepairs. The dotted black arrow indicates the lateral shift that transforms one type into the other. In the A/G cSs, the A is the dominant base so the arrow points from the A to the G. The roles are reversed in the A/G csS pair. The dashed green arrows indicate hydrogen-bonds.

3D Motifs and Sub-motifs: We have argued that it is useful to describe 3D motifs in

terms of recurrent pairwise interactions because these interactions are conserved in geometrically

similar 3D motifs and provide the means to precisely define recurrent motifs. Moreover, the

pairwise interactions can be combined to describe more complex interactions, such as base

triples and quadruples, and tertiary interaction motifs such as ribose zippers. Finally, software

has been written to automatically classify pairwise interactions in 3D RNA structures and thus

facilitate 3D searches for motifs.

In certain contexts it is useful to decompose motifs using more complex sub-motifs than basepairs or other pairwise interactions. For example, when predicting the thermodynamic stability of an RNA (or DNA), the free energies of proposed double helices are calculated using the nearest neighbor model, which requires decomposing each helix into overlapping pairs of neighboring basepairs. Each pair of stacked basepairs is assigned a free energy specific to the nucleotides composing the stacked pairs. A similar approach is used in the decomposition of 3D motifs into cycles of interacting nucleotides, as introduced by F. Major and co-workers (Lemieux

& Major, 2006; St-Onge et al., 2007). The cycles are used to define graph grammars for predicting the 3D structures of RNA molecules (Lemieux & Major, 2006).
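The nearest neighbor decomposition mentioned above can be illustrated with a short sketch. It assumes a fully paired helix given as one strand read 5' to 3' and its partner read 3' to 5', so that equal indices are paired. The function names are hypothetical, and the energy table is supplied by the caller; any values used to exercise this sketch are placeholders, not measured thermodynamic parameters.

```python
def stacked_basepair_steps(strand_5to3, partner_3to5):
    """Decompose a helix into overlapping pairs of neighboring basepairs.

    strand_5to3:  one strand, read 5' -> 3'
    partner_3to5: the other strand, read 3' -> 5', so index k pairs with index k
    Returns one (top, bottom) dinucleotide tuple per stacked basepair step.
    """
    if len(strand_5to3) != len(partner_3to5):
        raise ValueError("strands must have equal length")
    return [
        (strand_5to3[k:k + 2], partner_3to5[k:k + 2])
        for k in range(len(strand_5to3) - 1)
    ]


def helix_free_energy(steps, step_energy):
    """Sum per-step free energies looked up in a caller-supplied table."""
    return sum(step_energy[step] for step in steps)
```

A helix of n basepairs yields n - 1 overlapping steps, which is what makes the model additive; terms such as helix initiation or terminal penalties are deliberately left out of this sketch.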

I.2.4. Classification of motifs according to function

Structural vs. functional classifications. RNA motifs are classified structurally to identify geometrically similar motifs, or functionally, to identify motifs that serve the same function. Recurrent RNA motifs often play the same or similar functional roles in different RNA molecules or in different places in the same RNA, so identifying them in sequences provides information about how the RNA folds and functions. For example, motifs that impart a sharp bend to helices toward the minor groove have been called kink-turn motifs (Klein, Schmeing,

Moore, & Steitz, 2001; Strobel, Adams, Stahley, & Wang, 2004). The sequence signatures of kink-turn motifs have been defined to facilitate finding them in sequences (Lescoute et al., 2005).

The functional roles RNA motifs play can be roughly classified as architectural, structure-stabilizing, ligand-binding, or catalytic. Architectural motifs direct the organization and folding of the 3D structure. Kink-turns play architectural roles. Multi-helix junctions are key architectural motifs of RNA molecules. They create branch points in the secondary structure,

making complex RNA structures possible. Junctions direct the folding by establishing specific

co-axial stacking between pairs of helices at the junction thus organizing them in 3D space

(Klosterman, Hendrix, Tamura, Holbrook, & Brenner, 2004). For many junctions, non-Watson-

Crick basepairs formed by junction “loop” nucleotides stabilize the native co-axial stacking

(Lescoute & Westhof, 2006c).

Structure stabilizing motifs include a variety of hairpin and internal loops that form 3D

structures which stack two or more bases (usually A's) in appropriate geometries to form tertiary

interactions. GNRA hairpin loops are the most common motifs of this kind. A number of internal

loops mediate tertiary interactions very similar to those of GNRA loops (Elgavish, Cannone,

Lee, Harvey, & Gutell, 2001; Gutell, Cannone, Shang, Du, & Serra, 2000). While GNRA loops

can interact with canonical Watson-Crick helices, more stable tertiary interactions can result

when they bind to loop-receptor motifs (Costa & Michel, 1997). These are generally internal

loops that use non-Watson-Crick basepairs to construct platforms on which GNRA loops can

dock by base-stacking as well as base-pairing. The best-known motif of this type is the recurrent

“11-nucleotide” GAAA loop receptor, first observed in the Group I intron (Cate, Gooding,

Podell, Zhou, Golden, Szewczak, et al., 1996). Platforms usually project into the minor-groove

side of helices.

Intercalation motifs “pinch” or “bulge out” a base that can then interact with a second motif by intercalation. The second motif creates a pocket for the intercalating base that consists of two bases, usually purines, that stack on either side of it and usually a third base that can base-pair with it, thus creating a stable tertiary interaction. T-loops, first observed in tRNA, are examples of recurrent motifs that have as one of their functions to accept an intercalating base

(Nagaswamy & Fox, 2002). T-loops occur in many different locations, mediate RNA-RNA or

RNA-ligand interactions, and they often interact with other hairpin loops.

Different motifs for the same function. Different motifs can play the same role and can

therefore substitute (“swap”) for each other in the course of evolution. This is especially true for

motifs that mediate long-range RNA-RNA interactions. Examples are shown in Figure I.2.10 of

internal and hairpin loop motifs that occur at equivalent locations in Helix 101 of evolutionarily

distant H. marismortui and E. coli 23S rRNA. The motifs mediate corresponding, conserved tertiary interactions with Helix 63. Moreover, the geometry of the interaction is identical as shown by the 3D superposition of the interacting elements in the two structures.

Figure I.2.10. Conserved tertiary interaction in 23S rRNA mediated by different motifs. Upper panels: Annotated secondary structures of conserved interaction between Helices 101 and 63 in 23S rRNA of H. marismortui (left) and E. coli (right). In 23S of H. marismortui, the interaction is mediated by an internal loop in H101 (nts 2874, 2875, 2882, and 2883), whereas in the E. coli structure it is mediated by a GNRA hairpin loop at the equivalent position of H101 (nucleotides 2857-2860). Lower panel: Stereo superposition of the 3D structures of Helices 101 and 63 from 23S rRNA of H. marismortui and E. coli. (PDB files 1s72 and 2aw4.) Color coding: H. marismortui H 101 (blue), H 63 (cyan), E. coli H 101 (orange), H 63 (Evans et al.).

I.2.5. Conclusions

Internal, junction, and hairpin loops that appear in secondary structures are, in many cases, instances of recurrent modular RNA motifs. Different sequences can form the same recurrent 3D motif, as a result of structure-neutral mutations. RNA 3D motifs are defined by listing the conserved pairwise interactions between the core nucleotides (including base-pairing,

-stacking, and -phosphate interactions). Definitions should include the geometric type of each conserved basepair, as well as the isosteric basepair groups represented in motif instances. All motif positions where insertions can occur without significantly perturbing the 3D structure should be identified and noted. For motifs that mediate RNA-RNA or RNA-protein interactions, the nucleotides that participate directly in these interactions are noted with the type of interaction formed, since these interactions may impose additional nucleotide-specific constraints that help identify them in sequences.

Motifs can be classified according to structural or functional similarity. During evolution, global structure changes more slowly than sequence or even local 3D structure; mutations can accumulate, including insertions, deletions, or substitutions that change the structure of a motif.

However, if the motif is involved in crucial long-range interactions, the global function is preserved, resulting in a motif “swap” in which the tertiary or quaternary contact is mediated by geometrically distinct but functionally equivalent 3D motifs.

II. METHODS AND MATERIALS

The X-ray crystal structures of T. thermophilus, E. coli and H. marismortui ribosomes

were downloaded from the Protein Data Bank (http://www.rcsb.org/pdb/) (Deshpande, 2005).

For visual inspection of the structures, the Swiss PDB DeepView program was downloaded from the ExPASy website (http://expasy.org/spdbv/) (Guex, 1997). Deneba’s Canvas software

(http://www.deneba.com/) was used for creating 2D diagrams for certain fragments of 3D

structures.

“Find RNA 3D Motifs” (or familiarly “FR3D”) is a Matlab-based program

(http://www.mathworks.com/) that was initially developed for geometric, symbolic, or mixed

geometric and symbolic RNA motif searches, based on the Leontis-Westhof classification of

basepairing (http://rna.bgsu.edu/FR3D) (Sarver et al., 2008). Modules for identifying

base-stacking, base-phosphate and packing interactions were added afterwards. FR3D reads PDB files

and classifies symbolically a query interaction – e.g. a certain type of basepair – or basepairs

which have the same geometrical occupancy as a query motif. The program superimposes the

geometric centers of the query and the candidate basepairs, and calculates the fitting error (L) and the

orientation error (A) as RMS sums of the distances and angles, respectively. The discrepancy

between motifs is calculated according to the formula $D = \frac{1}{m}\sqrt{L^{2} + A^{2}}$, where m is the number of nucleotides in the motif. Only the motifs

with a discrepancy smaller than the cutoff discrepancy D0, set by the user, will be considered as

possible candidate motifs. Different parameters, including the guaranteed and relaxed cutoffs,

can be adjusted for an optimal search. FR3D allows for the candidate motifs to be graphically

displayed, and it can provide structural alignments of these motifs. Further FR3D programs were

developed afterwards to analyze tertiary interactions based on FR3D’s capability of identifying

basepairs, base-phosphate, base-stacking and packing interactions (see below). Annotations of structural RNAs, according to the Leontis-Westhof nomenclature (Leontis 2001), are now done automatically by FR3D and deposited in the Nucleic Acid Database (NDB). The widely accepted Leontis-Westhof annotation (Leontis 2002) and FR3D’s user-friendly interface (Sarver

2008) are likely to make this program a useful tool for RNA structural analysis.
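The discrepancy screening step can be sketched in a few lines. This illustration assumes that the fitting error L and the orientation error A have already been computed for each candidate, and that they are combined as the square root of L squared plus A squared, divided by the number of nucleotides m, following the definition in Sarver et al. (2008). The function names and the candidate tuple layout are hypothetical and are not FR3D's Matlab interface.

```python
import math


def geometric_discrepancy(fit_error, orientation_error, n_nucleotides):
    """Combine fitting error L and orientation error A into a discrepancy D.

    Assumed form: D = sqrt(L^2 + A^2) / m, with m the number of nucleotides.
    """
    return math.sqrt(fit_error ** 2 + orientation_error ** 2) / n_nucleotides


def filter_candidates(candidates, cutoff):
    """Keep candidates with discrepancy below the user-set cutoff D0.

    candidates: iterable of (name, L, A, m) tuples
    Returns a list of (name, D) sorted by increasing discrepancy.
    """
    scored = [(name, geometric_discrepancy(L, A, m))
              for name, L, A, m in candidates]
    kept = [item for item in scored if item[1] < cutoff]
    return sorted(kept, key=lambda item: item[1])
```

Dividing by m keeps the cutoff comparable across motifs of different sizes, which is why a single threshold can be applied to hairpin, internal, and junction loop searches alike.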

Each residue in the bacterial ribosome is assigned hierarchically to the substructures to which it belongs, beginning with 1) the motif (helix, hairpin loop, internal loop, linker); 2) the helical element to which the motif belongs; 3) the continuously stacked structural element to which the helical element belongs; 4) the domain to which the stacking element belongs; and finally 5) the molecule of which the domain is a part. “Structural elements” were defined by visual inspection of the T. thermophilus and the E. coli ribosomes. A structural element is a helix, or a group of coaxially stacked helices, regardless of whether they are adjacent to each other or stacked across junctions. Excel spreadsheets with this information were created for each structure. A new Matlab module, nElementElementInteractions, accepts as input the

Excel file containing hierarchical assignments of each residue and the PDB file of the structure of interest. Previously written modules of FR3D detect and classify all base-pairing, base-stacking and base-phosphate interactions. The interactions can be investigated at the desired hierarchical level of analysis. For the purpose of this work, RNA tertiary interactions within each element and between each pair of elements were determined. The new FR3D module outputs 1) a matrix showing the number of interactions between each pair of elements; 2) individual PDB files for each pair of elements for visual inspection; 3) separate files with detailed lists of interactions between each pair of elements (see Figure II.1.). The analysis includes bound tRNA and mRNA molecules, when present, and facilitates the identification of changes in RNA-RNA

interactions resulting from the binding of substrates or antibiotics to the ribosome. The Results

and Discussion section presents a detailed analysis of the functional networks of tertiary

interactions linking the elements in both subunits directly involved in ribosome functions

(decoding site, peptidyl transferase, factor-binding site, mRNA and tRNA binding sites).

Figure II.1. Algorithm for searching tertiary interactions between structural elements in structured RNA molecules.

Furthermore, additional modules were developed to output a detailed matrix which distinguishes between the types of base-base interactions, since a basepair type of interaction is

considered stronger than base-stacking or base-phosphate interactions. Also, a program which

compares two different structures was tested. Isosteric neutral mutations are taken into

consideration when comparing two different structures (Lescoute et al., 2005; Stombaugh, 2004).

“nElementElementInteractions” is the master module which analyzes an RNA structure

geometrically. The program needs as input the PDB file of the structure in question and an Excel

table (.xls type of file, Excel version 2003 or earlier) with the nucleotides assigned

hierarchically. It outputs a text file with the interactions between elements listed, a CSV file with

the number of interactions between elements, and PDB files for each pair of interacting elements.

The CSV file is an n x n table, with n = number of elements, where the actual number of interactions between elements is listed. The new FR3D module includes several programs which perform specific tasks:

– “nReadNucleotideElementAssignment.m” – Reads the Excel spreadsheet to assign

hierarchical elements of each nucleotide (molecule, chain ID, domain, motif, helical

element, structural element);

– If there is no precomputed data for the structure of interest, the previously created

“zAddData” program loads the structure from the PDB website and analyzes the 3D

data and computes the atomic coordinates;

– “nElementLookup.m” finds all the nucleotides belonging to an element;

– “nFindElementList” compiles a list of structural elements in that file;

– “nListInteractionsForOneElement” lists the interactions of a particular element (i);

– “nLoopThroughElements.m” finds pairs of interacting elements;

– “nListInteractions” extracts the interactions between 2 different structural elements,

counts the number of interactions, and lists the interactions on the screen. This

program also outputs a PDB file with each pair of interacting elements.

“nListInteractionsDetail” performs the same task, but it outputs basepair,

base-stacking and base-phosphate interactions separately; in the output matrix the number of each

type of interaction is listed in a separate cell;

– “nWriteColorInformation”, a sub-module of “nListInteractions”, colors the

nucleotides in the PDB output files: red for the nucleotides belonging to element (i) and green for the nucleotides in

element (j);

– “nElementElementMatrix” creates an n x n table, where n is the number of

elements in a given file. The program outputs the table to a CSV file, which can be

opened in Excel. This table has the actual number of interactions between element

(i) and element (j);

– “nListInteractionsBetweenElements” compiles the tertiary interactions between

elements and/or other hierarchical levels.
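The core bookkeeping performed by the modules listed above, counting interactions between each pair of elements and writing the n x n table to CSV, can be sketched as follows. This is an illustrative Python outline rather than the Matlab implementation: the function names, the nucleotide-to-element dictionary, and the interaction tuple layout are all assumptions made for the example.

```python
import csv
import io


def element_matrix(assignment, interactions):
    """Count interactions between every pair of structural elements.

    assignment:   {nucleotide_id: element_name}
    interactions: iterable of (nt1, nt2, kind) tuples, one per pairwise interaction
    Returns (elements, matrix); matrix[i][j] is the number of interactions
    between elements i and j, and the diagonal counts within-element contacts.
    """
    elements = sorted(set(assignment.values()))
    index = {name: k for k, name in enumerate(elements)}
    n = len(elements)
    matrix = [[0] * n for _ in range(n)]
    for nt1, nt2, _kind in interactions:
        i, j = index[assignment[nt1]], index[assignment[nt2]]
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the table symmetric
    return elements, matrix


def to_csv(elements, matrix):
    """Render the table as CSV text with element names as row and column headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + elements)
    for name, row in zip(elements, matrix):
        writer.writerow([name] + row)
    return buffer.getvalue()
```

Keeping the interaction type in each tuple, even though this sketch ignores it, leaves room for the detailed variant that reports basepair, base-stacking, and base-phosphate counts in separate cells.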


III. RESULTS AND DISCUSSION

III.1. Structural elements in the bacterial ribosome.

III.1.1. Motivation and overview

Large molecular complexes that process other molecules by moving them across their surface in non-random ways are called “molecular machines”. Ribosomes, like other macromolecular complexes, are hierarchically organized, and understanding that organization is key to understanding their function as “molecular machines”. This involves: 1) identifying distinct levels of structural organization, 2) assigning individual nucleotides to particular structural elements at each level of organization, 3) identifying interactions between elements, 4) associating function with specific elements, and 5) monitoring changes in these interactions as the macromolecular machine cycles through its functional states, or has different ligands bound.

The sum total of local and long-range interactions defines the “interaction network” of the macromolecular machine. These interactions include basepairing (especially non-Watson-Crick pairs), base stacking, and base-backbone interactions, all of which are annotated automatically by the FR3D annotation pipeline for each atomic-resolution structure in the PDB and NDB databases and stored in a relational database.

The long-range interactions are individually weak, non-covalent interactions that can be easily disrupted or replaced by other interactions; this dynamism is in fact crucial for the function of molecular machines. Therefore, to deepen our understanding of how the ribosome works, we need to be able to record and track changes in its interaction network and to associate the observed changes with the appropriate level of hierarchical organization. Defining the interactions associated with distinct functional states will help to better understand the function

and how it emerges from the complex structure of the ribosome. While the current database contains interaction data, it lacks complete information regarding the hierarchical organization of the SSU and LSU rRNA or other large RNAs. The purpose of this study is to fill that gap.

The first part of this chapter addresses the problem of defining the relevant hierarchical levels, building on widely accepted concepts of secondary structure of RNA. Our innovation is to consistently group together helical units that stack across multi-helix junctions into higher order

“helical stacking elements”, or “structural elements”, building on an idea introduced by Harry

Noller in a cover article (Noller 2005). Next, we describe in detail how to partition all nucleotides in an RNA structure using the hierarchy we define in the first part of the chapter. Then, we describe computer programs written to calculate the interaction network of an rRNA, based on the hierarchical structure. In the last part of the chapter, we present one way of visualizing the interaction network in the form of an annotated matrix. This can serve as a template for creating a dynamic visualization that can be used to present successive “snapshots” of the interaction network or differences between different states of the ribosome.

III.1.2. Annotations of long-range tertiary interactions

The long-range tertiary interactions were analyzed by manually inspecting each RNA structure of interest using Swiss-PDB Viewer. The hairpin loops and the long-range contacts they make were annotated first in the small eubacterial subunit (Figure III.1.2.1).

Figure III.1.2.1. Annotation of hairpin loops and their long-range interactions in SSU 16S rRNA from E. coli, based on the structure 2AVY.pdb.

The loops of 23S E. coli were also annotated. When a hairpin loop caps an internal loop

involved directly or indirectly with tRNA, the whole element was annotated, as shown in Figure

III.1.2.2. This is the case for hairpin loop (HL) 62 in 23S rRNA, where the hairpin loop is separated from the internal loop (IL) by just one Watson-Crick pair.

[Diagram for Figure III.1.2.2: annotated secondary structures of HL 39 and HL 62 + IL 62 from E. coli 23S rRNA; panel labels: GNRA type, UNCG type.]

Figure III.1.2.2. Examples of annotated hairpin loops in E. coli 23S rRNA. Left: Hairpin loop and its tertiary interactions (HL 39). Right: Hairpin loop capping an internal loop and their tertiary interactions.

III.1.3. Helical elements vs. stacking structural elements

The smallest ribosomal RNA is 5S rRNA and is found in all ribosomes, prokaryotic and

eukaryotic. 5S rRNA consists of three helical elements organized by a three way junction (3WJ).

Two of the elements end in hairpin loops called “loop C” and “loop D”. There are also two

internal loops, called “loop B” and “loop E”. The 3WJ is called “loop A” (Figure III.1.3., left). A

summary of tertiary interactions for the 5S rRNA is shown in the table in Figure III.1.3.

However, a closer structural analysis reveals (Figure III.1.3., Middle) that at the 3WJ helical element 2 is stacked on helical element 3, and the two therefore form a single “helical stacking

element”. Therefore, if a conformational change or movement occurs

in one of the stacked helices, it could propagate to the other one. Thus, helical elements 2 and 3

are joined into a single element at the higher hierarchical level of stacked helical elements.

Additionally, if a helix responds to structural rearrangements in the neighboring

elements, all stacked helices would be affected in a similar manner. This is the basis of the

idea of treating stacked helices as single structural elements (Figure III.1.3., right).

Figure III.1.3. Helical elements vs. stacking structural elements in 5S rRNA. Upper Left: 2D representation of helical elements. Upper Right: 2D representation of structural elements. Upper Middle: 3D Structure seen in Swiss-PDB Viewer. Bottom: Summary table with the tertiary interactions in 5S rRNA.


III.1.4. Defining the hierarchical structure of an RNA molecule

At each level, each nucleotide is uniquely assigned to one of the categories at that level.

The first organizational level is the 1D sequence. At the second level, a nucleotide is assigned either to a hairpin loop (hl), internal loop (il), or a helix (h). The nucleotides between the helical elements are assigned to joining regions (jr). At the third level, helical elements that stack are grouped into structural elements (e). In the SSU, the abbreviations are written in lowercase letters. In the LSU, the abbreviations for the helical and structural elements are capitalized. Thus, in the LSU, the annotations are HL, IL, H, JR, and E. The structural framework proposed in this work is illustrated below. Figure III.1.4 summarizes the hierarchical scheme we have devised for analyzing complex RNA structures. At the primary level, all nucleotides are represented by letters in the sequence. At the secondary structure (2D) level, nucleotides are partitioned into four disjoint categories: “helices”, “hairpin loops”, “internal loops” and a catch-all category that includes nominally single-stranded regions at the 5’ or 3’ ends (“terminal sequences”), nucleotides linking helices together at multi-helix junctions, and single-stranded regions linking domains in the 2D structure. Nucleotides are assigned uniquely to one category at each level; however, nucleotides in the same category at one level may be split into different categories at the next level. Referring to Figure III.1.4, some nucleotides in the 4th category of the 2D level are added to the “helical elements” category on the next level, while others are placed in the “joining regions” category. Nucleotides that stack on the ends of helices and form non-Watson-Crick basepairs that extend helices are included in the respective “helical elements”. Other nucleotides are placed in the “joining regions” category. At the next level some of these nucleotides are made part of the “stacked structural elements” whereas others are assigned to “linkers” that connect the structural elements to form domains and the entire molecule.

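[Figure III.1.4 diagram, levels of organization from top to bottom: entire molecule (SSU; LSU); domains (1-4; 1-6); structural elements (e_; E_) and linkers (l_; L_); helical elements (h_; H_) and joining regions (jr_; JR_); hairpin loops (hl_; HL_), internal loops (il_; IL_), helices (h_; H_), and terminal sequences (t_; T_); RNA sequence. The diagram labels the upper levels 3D, the secondary-structure levels 2D, and the sequence 1D.]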

Figure III.1.4. Levels of hierarchies in large RNA structures. At each hierarchical level, each nucleotide is assigned to one category. The arrows indicate how nucleotides assigned to a category at one level are assigned to categories at the next higher level, culminating with the entire molecule. The principle that is enforced in the hierarchical scheme is to assign each nucleotide to one category at each level of hierarchy.
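The one-category-per-level rule can be expressed as one record per nucleotide. The following is a minimal Python sketch and is not part of FR3D: the class, field, and function names are hypothetical, and the three example rows reuse labels that appear in the 23S assignments of Table III.1.6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NucleotideAssignment:
    """One nucleotide with exactly one label per hierarchical level.

    Field names are illustrative assumptions, not FR3D identifiers.
    """
    molecule: str            # e.g. "23S"
    number: int              # nucleotide number
    helical_element: str     # helical element or joining region (H_ / JR_)
    structural_element: str  # structural element or linker (E_ / L_)
    domain: str


def check_unique_assignment(assignments):
    """Return nucleotides listed more than once, which would break the
    one-category-per-level rule."""
    seen = set()
    duplicates = []
    for a in assignments:
        key = (a.molecule, a.number)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def split_at_next_level(assignments):
    """Return helical-level labels whose nucleotides are divided among
    several labels at the structural-element level."""
    by_helical = {}
    for a in assignments:
        by_helical.setdefault(a.helical_element, set()).add(a.structural_element)
    return {h: sorted(s) for h, s in by_helical.items() if len(s) > 1}


example = [
    NucleotideAssignment("23S", 213, "H11", "E_11", "Domain_1"),
    NucleotideAssignment("23S", 214, "JR_11_12", "E_11", "Domain_1"),
    NucleotideAssignment("23S", 216, "JR_11_12", "L_11_12", "Domain_1"),
]
```

In this example the joining region JR_11_12 is split at the next level: one of its nucleotides joins the structural element E_11 and another is placed in the linker L_11_12.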

III.1.5. Structural elements in the ribosome

Accordingly, both 16S and 23S rRNAs from T. thermophilus and E. coli were partitioned into “structural elements” and the “linkers” connecting them (Figure III.1.5.1 and Figure III.1.5.2). We define a tertiary structural element as comprising all continuously stacked helical elements. A structural element may comprise a single helix together with its internal and/or hairpin loops, but it can also contain coaxially stacked helical elements, including cases in which the stacking crosses a junction. A structural element also includes the sequentially adjacent nucleotides that stack on its constituent helical element(s).

Thus, at the secondary level, the rRNA was divided into helical elements and joining regions, the single-stranded nucleotide segments between them. Helical elements are abbreviated “h” and joining regions “jr”, with capital letters for the large subunit (h and jr for 16S; H and JR for 23S). Abbreviations were also used for hairpin loops (hl), internal loops (il), and junction loops (jl). At the tertiary level, the structure was divided into structural elements (abbreviated e_ or E_) and the linkers between them (l_ or L_).

In order to describe the tertiary contacts that occur upon folding, the ribosome was divided, following the 2D structure, into helical elements and joining single-stranded regions.

A helix comprises two or more consecutive Watson-Crick base pairs. A helical element includes, apart from the helix itself, any internal and hairpin loops it contains (Figure III.1.5.2). Thus a helical element can be interrupted by an internal loop, but not by a junction. A junction loop is formed where three or more helical elements meet, with each pair of neighboring helices sharing a common strand of RNA. Each helix was treated as an individual module.
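The helix definition above (two or more consecutive Watson-Crick base pairs) can be illustrated with a short Python sketch. This is not the FR3D implementation; the function name and input format are assumptions. It further assumes plain integer nucleotide numbers with no insertion codes or chain breaks, and that the input already contains only Watson-Crick pairs.

```python
def find_helices(pairs, min_length=2):
    """Group base pairs (i, j) into helices.

    Two pairs are treated as consecutive when (i, j) is followed by
    (i + 1, j - 1). Runs shorter than min_length are discarded, so an
    isolated pair is not reported as a helix.
    """
    pair_set = {(min(i, j), max(i, j)) for i, j in pairs}
    helices = []
    for i, j in sorted(pair_set):
        if (i - 1, j + 1) in pair_set:
            continue  # this pair continues a run that started earlier
        run = [(i, j)]
        while (run[-1][0] + 1, run[-1][1] - 1) in pair_set:
            run.append((run[-1][0] + 1, run[-1][1] - 1))
        if len(run) >= min_length:
            helices.append(run)
    return helices


# Toy input (invented for illustration): one three-pair helix, two lone pairs.
toy_pairs = [(1, 20), (2, 19), (3, 18), (7, 12), (30, 40)]
toy_helices = find_helices(toy_pairs)
```

A helical element in the sense used here would then be built by merging helices that are separated only by an internal loop, a step this sketch does not attempt.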

Identification of stacking at the junctions was carried out manually for the E. coli 16S and 23S rRNAs in structures 2AVY and 2AW4, respectively, and for the T. thermophilus 16S and 23S rRNAs in structures 2J00 and 2J01, respectively.

The SSU was analyzed for coaxial stacking of the elements, and each structural element was colored differently (see Figure III.1.5.3). The elements were also mapped onto the 2D diagram of the T. thermophilus SSU (Figure III.1.5.2), and all the information was compiled in a table (Figure III.1.5.4). Structural elements were named in a self-explanatory manner, so that each name lists the stacked helices it incorporates; e.g., coaxially stacked helices h1, h2, h3, h28, and h29 form the structural element e_1_2_3_28_29 (Figure III.1.5.3). The same coloring scheme was preserved across all files (PDB, PDF, and Excel), with each element colored identically in the tertiary structure, in the 2D diagram, and in the table. The same methodology was applied to partition the LSU into structural elements (Figure III.1.5.5).
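The naming convention can be sketched in Python as follows. Here the stacking was identified manually; the sketch only shows how a list of pairwise stacks could be merged into elements and named. The function names are hypothetical, the specific helix-on-helix stacks in the example are assumed for illustration (only the resulting element names come from the text), and ordering helices numerically within a name is an assumption that not every element name in this chapter follows.

```python
import re


def element_name(helices, prefix="e"):
    """Build a name such as e_1_2_3_28_29 from labels such as h1, h28, h33b."""
    def sort_key(label):
        match = re.fullmatch(r"[hH](\d+)([a-z]*)", label)
        return (int(match.group(1)), match.group(2))

    ordered = sorted(helices, key=sort_key)
    return prefix + "_" + "_".join(label[1:] for label in ordered)


def group_stacked(helices, stacks):
    """Merge helices connected by coaxial stacks (union-find) and name each group."""
    parent = {h: h for h in helices}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path compression
            h = parent[h]
        return h

    for a, b in stacks:
        parent[find(a)] = find(b)

    groups = {}
    for h in helices:
        groups.setdefault(find(h), []).append(h)
    return sorted(element_name(group) for group in groups.values())


# Assumed pairwise stacks, chosen so that the groups match names used in the text.
helix_labels = ["h1", "h2", "h3", "h28", "h29", "h33a", "h33b", "h33c"]
assumed_stacks = [("h1", "h2"), ("h2", "h3"), ("h3", "h28"),
                  ("h28", "h29"), ("h33b", "h33c")]
names = group_stacked(helix_labels, assumed_stacks)
```

A helix with no stacking partner, such as h33a, ends up as a single-helix element.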

16S rRNA was partitioned into 31 helical stacking elements, similar to Noller's partition (Noller, 2005) (Figure III.1.5.1 and Figure III.1.5.2). The differences between our structural element partitioning and Noller’s helical stacking include:

1) Helix h7 is not considered coaxially stacked on part of h11;

2) Helix h11 is treated as two separate helical elements because of the ~120º angle introduced by il_11;

3) Helices h6 and h12 are not coaxially stacked, and are therefore considered different structural elements;

4) Helix h19 is stacked on h35, and the two are therefore treated as one element, e_19_35;

5) Helix h33a is not stacked on h34, nor on h33b or h33c, and is thus considered a separate element;

6) Helices h33b and h33c are coaxially stacked, therefore forming e_33b_33c;

7) Part of il_30_31 (nucleotides 956-959) loops back to fold on nucleotide A1225, similar to a hairpin loop. Therefore, these nucleotides are considered part of element e_30;

8) Helix h36 is not coaxially stacked on h40; helix h39 is coaxially stacked on h40, and the two are therefore treated as one element, e_39_40.

In 23S rRNA there are 74 structural elements. Stacking across large multi-helix junctions sometimes occurs between helices that are separated by one or more elements, as in the case of E_5_6_10, E_26_26a_61, and E_36_39_61 (Figure III.1.5.5). The network of tertiary interactions is even more complex than in 16S rRNA, and some elements, such as E_81, make contacts with several other elements (see Figure III.1.8.7).

Moreover, because the manual technique is time consuming, computer-aided methods were needed to simplify and shorten the manual structural analysis.

Figure III.1.5.1. Definition of helical stacking structural elements for 16S rRNA. Same color lines are drawn on stacked helices belonging to the same structural element. Left: Structural elements classified according to coaxial stacking of helical elements in this work. Right: Coaxial stacking from previous work (Noller 2005). The main differences in the two schemes are in coaxial stacking of the helices 4 and 18; 6 and 12; 7 and 11; 19 and 25; 30; 31; 33 and 34; 39 and 40; 41; 43; 45.

Figure III.1.5.2. Enlarged picture of helical stacking structural elements for T. thermophilus 16S rRNA.

Figure III.1.5.3. 16S rRNA of T. thermophilus (PDB file 2J00) colored by stacked structural elements, using the coloring scheme of the 2D representation in Figure III.1.5.2. Top: Front view. Bottom: 90° rotations of the view. Bottom left: front view; Bottom center: side view; Bottom right: back view.

Figure III.1.5.4. A snapshot of the table summarizing the structural elements of T. thermophilus 16S rRNA.

Figure III.1.5.5. Partition of nucleotides in T. thermophilus rRNA structures into structural elements (hierarchical level 4) comprising continuously stacked helices and motifs. Each continuous colored solid line indicates one element (74 elements total).

Figure III.1.5.6. 23S rRNA of T. thermophilus colored by stacked structural elements. Top: Front view. Bottom: 90° rotations of the view. Bottom left: front view; Bottom center: side view; Bottom right: back view.

Elements | Helix number | Helix nucleotides | Stacked nucleotides | Colored
E_1 | H1 | 1-8/2896-2902 | 1-11/2895-2902 | pink
E_2_3_4 | H2 | 15-30/510-525 | 12-45 | red
 | H3 | 31-32/43474 | 431-454 |
 | H4 | 35-45/433-445 | 473-474 |
 | | | 510-526 |
E_4a | H4a | not in 2D | 46-48/178-180 | purple
E_5_6_10 | H5 | 53-56/114-117 | 49-74 | cyan
 | H6 | 57-70 | 114-120 |
 | H10 | 150-176 | 149-177 |
E_7 | H7 | 76-110 | 75-113 | salmon pink
E_8_9 | H9 | 131-148 | 121-148 | yellow
E_11 | H11 | 183-213 | 181-215 | fluorescent green
E_12 | H12 | 224-231 | 222-232 | aqua blue
E_13 | H13 | 235-262 | 233-264 | light yellow
E_14 | H14 | 266-268/424426 | 266-271(2)/424-427 | purple
E_15 | H15 | * | missing | red
E_16_17_18 | H16/H18 | 271-274/363-366 | 271-300 | yellow-orange
 | H18 | 281-297/341-359 | 322-366 |
 | H17 | 278-281/360-362 | |
E_19 | H19 | 301-316 | 301-320 | blue navy
E_20 | H20 | 325-337 | 324-339 | white
E_21 | H21 | 375-399 | 371-403 | light cyan
E_22 | H22 | 406-421 | 404-421 | magenta
E_23 | H23 | 461-468 | 455-472 | cyan
E_24 | H24 | 483-496 | 475-509 | orange
E_24a | HL24a | 497-506 | 482-496 | white
E_25 | H25 | 533-560 | 533-561 | light turquoise
E_25a | H25a | 564-577 | 563-578 | turquoise
E26_46a_61 | H26 | 579-584/1256-1261 | 579-585 | salmon pink
 | H46a | 1262-1270/2010-2017 | 1254-1270 |
 | H61 | 1648-1667/1993-2009 | 1648-1674 |
 | | | 1993-2017 |
E_27 | H27 | 589-601/656-668 | 588-602/656-670 | cyan
E_28 | H28 | 604-624 | 604-626 | pink
E_29 | H29 | 628-635 | 628-636 | light yellow
E_30 | H30 | * | |
E_31 | H31 | 637-650 | 638-650 | red
E_32_33 | H32 | 671-683/794-809 | 671-698 | fluorescent green
 | H33 | 687-698/763-775 | 762-775 |
 | | | 794-810 |
E_34_35 | H34 | 700-732 | 699-761 | magenta
 | H35 | 736-760 | |
E_35a | H35a | 777-787 | 777-787 | light yellow
E_39_36_46 | H36 | 812-817/1190-1195 | 812-821 | petroleum grey
 | H39 | 946-971 | 946-973 | silver grey
 | H46 | 1190-1250 | 1187-1250 |
E_37_38 | H37 | 822-835 | 822-943 | candy pink
 | H38 | 837-942 | |
E_39a | H39a | 976-987 | 974-990 | light pink purple
E_40_41_45 | H40 | 991-1004/1151-1163 | 991-1019 | brown
 | H41 | 1011-1019/1144-1150 | 1144-1185 |
 | H45 | 1164-1185 | |
E_IL_41_42 | | | 1020-1025/1135-1143 | orange
E_42_44 | H42 | 1030-1055/1104-1124 | 1027-1055 | tilt
 | H44 | 1087-1102 | 1087-1126 |
E_43_43a | H43 | 1057-1081 | 1057-1086 | brown
 | H43a | 1082-1086 | |

Figure III.1.5.7.a. A snapshot of the table with structural elements summary of 23S rRNA T. thermophilus (First half).

Elements | Helix number | Helix nucleotides | Stacked nucleotides | Colored
E_47_48_60 | H47 | 1276-1294 | 1275-1299 | yellow
 | H48 | 1295-1298/1642-1645 | 1627-1645 |
 | H60 | 1627-1639 | |
E_49 | H49 | 1303-1310/1603-1610 | 1301-1313/1603-1626 | dark blue
E_50 | H50 | 1313-1339 | 1314-1339 | yellow mustard
E_51 | H51 | 1345-1347/1598-1601 | 1342, 1345-1349/1598-1602 | light yellow
E_52 | H52 | 1350-1381 | 1350-1383 | orange
E_53_54_55 | H53 | 1385-1402 | 1343-1344 | blue aqua
 | H54 | 1405-1417/1581-1597 | 1385-1428 |
 | H55 | 1420-1424/1574-1578 | 1569-1597 |
E_56_57 | H56 | 1429-1444/1547-1564 | 1429-1466 | red
 | H57 | 1445-1466 | 1547-1568 |
E_58_59 | H58 | 1467-1525 | 1467-1546 | cyan
 | H59 | 1528-1543 | |
E_59a | H59a | 1611-1621 | 1611-1621 | burgundy
E_62 | H62 | 1682-1706 | 1682-1706 | salmon pink
E_63 | H63 | 1707-1751 | 1707-1756 | light green
E_64_67 | H64 | 1764-1772/1979-1988 | 1764-1772 | magenta
 | H67 | 1830-1833/1972-1975 | 1829-1834 |
 | | | 1970-1991 |
E_65 | H65 | 1775-1789 | 1773-1790 | light cyan
E_66 | H66 | 1792-1827 | 1791-1828 | light yellow
E_68 | H68 | 1835-1905 | 1835-1905 | purple mov
E_69 | H69 | 1906-1924 | 1906-1929 | cyan
E_70_71 | H70 | 1929-1940/1962-1970 | 1930-1969 | fluorescent green
 | H71 | 1945-1961 | |
E_72 | H72 | 2023-2040 | 2023-2042 | white
E_73 | H73 | 2043-2054/2611-2625 | 2043-2054/2614-2628 | light lavender
E_74_75 | H74 | 2063-2075/2434-2447 | 2063-2092 | navy blue
 | H75 | 2077-2090/2229-2243 | 2226-2245 |
 | | | 2434-2447 |
E_76_77_79 | H76 | 2093-2110/2179-2196 | 2093-2126/2162-2225 | turquoise
 | H79 | 2200-2223 | |
 | H77 | 2120-2124/2174-2178 | |
E_78 | H78 | 2127-2161 | * | red
E_80 | H80 | 2246-2258 | 2246-2258 | candy pink
E_81 | H81 | 2259-2281 | 2259-2282 | purple mov
E_82 | H82 | 2284-2285/2383-2384 | 2283-2285/2383-2384 | yellow
E_83 | H83 | 2288-2295/2337-2344 | 2287-2296/2333-2344 | cyan
E_84 | H84 | 2299-2322 | 2297-2321 | dark blue
E_85 | H85 | 2323-2332??? | 2322-2332 | fluorescent green
E_86_87 | H86 | 2347-2370 | 2345-2382 | orange
 | H87 | 2372-2381 | |
E_88 | H88 | 2395-2421 | 2391-2429 | dark mustard
E_89 | H89 | 2455-2496 | 2454-2498 | v light pink
E_90_91 | H90 | 2507-2517/2567-2582 | 2505-2546 | red
 | H91 | 2518-2546 | 2567-2584 | salmon pink
E_92 | H92 | 2547-2561 | 2547-2566 | dark green
E_93 | H93 | 2588-2606 | 2587-2608 | light yellow
E_94 | H94 | 2630-2643/2771-2788 | 2630-2644/2771-2789 | fluorescent green
E_95_96 | H95 | 2646-2674 | 2646-2732 | salmon pink
 | H96 | 2675-2732 | |
E_97 | H97 | 2735-2769 | 2733-2770 | aqua blue
E_98 | H98 | 2791-2805 | 2792-2808??? | silver grey
E_99 | H99 | 2811-2814/2886-2889 | 2811-2814/2886-2891 | dark red
E_99a | H99a | 2815-2831 | 2815-2831 | pink
E_100 | H100 | 2836-2846/2870-2882 | 2832-2848/2867-2883 | yellow
E_101 | H101 | 2852-2865 | 2850-2866 | orange

Figure III.1.5.7.b. A snapshot of the table with structural elements summary of 23S rRNA T. thermophilus (Second half).

III.1.6. Assigning nucleotides to each level of hierarchical organization

All of this information about structural elements and the assignment of nucleotides to each hierarchical level was stored in a CSV table, in which each nucleotide (nt) is assigned to its category at each hierarchical level. This table is used as input for the new program module in FR3D (Table III.1.6).

Molecule | Nt No. | Chain | Domain | Motif | Helical element | Structural element
23S | 210 | A | Domain_1 | | H11 | E_11
23S | 211 | A | Domain_1 | | H11 | E_11
23S | 212 | A | Domain_1 | | H11 | E_11
23S | 213 | A | Domain_1 | | H11 | E_11
23S | 214 | A | Domain_1 | | JR_11_12 | E_11
23S | 215 | A | Domain_1 | | JR_11_12 | E_11
23S | 216 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 217 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 218 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 219 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 220 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 221 | A | Domain_1 | | JR_11_12 | L_11_12
23S | 222 | A | Domain_1 | | JR_11_12 | E_12
23S | 223 | A | Domain_1 | | JR_11_12 | E_12
23S | 224 | A | Domain_1 | | H12 | E_12
23S | 225 | A | Domain_1 | | H12 | E_12
23S | 226 | A | Domain_1 | | H12 | E_12
23S | 227 | A | Domain_1 | | H12 | E_12
23S | 228 | A | Domain_1 | | H12 | E_12
23S | 229 | A | Domain_1 | | H12 | E_12
23S | 230 | A | Domain_1 | | H12 | E_12
23S | 231 | A | Domain_1 | | H12 | E_12
23S | 232 | A | Domain_1 | | JR_12_13 | E_12
23S | 233 | A | Domain_1 | | JR_12_14 | E_13_14
23S | 234 | A | Domain_1 | | JR_12_15 | E_13_14
23S | 235 | A | Domain_1 | | H13 | E_13_14
23S | 236 | A | Domain_1 | | H13 | E_13_14
23S | 237 | A | Domain_1 | | H13 | E_13_14

Table III.1.6. Nucleotide assignments to each level in 23S rRNA of T. thermophilus.
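An assignment table of this kind can be read with a few lines of code. The sketch below is a Python illustration only, not the FR3D module; the CSV column names (`Molecule`, `NtNo`, `Chain`, `Domain`, `HelicalElement`, `StructuralElement`) and the function name are assumptions, and the sample rows copy values from Table III.1.6.

```python
import csv
import io
from collections import defaultdict

# Sample rows with values from Table III.1.6; the header names are hypothetical.
SAMPLE = """Molecule,NtNo,Chain,Domain,HelicalElement,StructuralElement
23S,213,A,Domain_1,H11,E_11
23S,214,A,Domain_1,JR_11_12,E_11
23S,216,A,Domain_1,JR_11_12,L_11_12
23S,224,A,Domain_1,H12,E_12
"""


def read_assignments(handle):
    """Map each structural element or linker to its (chain, nucleotide number) list."""
    by_element = defaultdict(list)
    for row in csv.DictReader(handle):
        by_element[row["StructuralElement"]].append((row["Chain"], int(row["NtNo"])))
    return dict(by_element)


assignments = read_assignments(io.StringIO(SAMPLE))
```

With a file on disk, the same function could be called on an open file handle instead of the in-memory sample.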

III.1.7. Programs for calculating the interaction network

Because FR3D can recognize basepair interactions, additional modules could be developed to search for long-range interactions between structural elements. The existing program “zAddNTData” reads a PDB file and catalogs the pairwise interactions. The new module “nReadNucleotideElementAssignment” reads the manually prepared Excel spreadsheet containing the assignments and allocates each nucleotide to the corresponding domain, structural element, and helical element. The module “nFindElementList” compiles a non-redundant list of structural elements. “nListInteractions” looks for interactions between pairs of elements e(i) and e(j), with 1 ≤ i ≤ n-1 and i < j ≤ n, where n is the total number of structural elements in the file. This module loops through all possible values of i and j and extracts all the tertiary interactions between elements. The program outputs a matrix with the number of interactions, a text file listing the long-range interactions, and PDB files for each pair of interacting elements (Figure III.1.8.1).
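The pairwise search over elements e(i) and e(j) with i < j can be sketched in Python as follows. This is an illustration of the idea, not a port of “nListInteractions”: the function name, the input format (a nucleotide-to-element dictionary and a list of interacting nucleotide pairs), and the toy data are all assumptions. Rather than testing every element pair against every interaction, the sketch tallies interactions once and then walks the upper triangle of the matrix, which yields the same counts.

```python
from itertools import combinations


def count_element_interactions(nt_to_element, interactions):
    """Count interactions between distinct elements.

    nt_to_element: dict mapping a nucleotide identifier to its element label.
    interactions: iterable of (nt1, nt2) pairs annotated as interacting.
    Returns the sorted element list, an upper-triangular count matrix, and a
    list of (element_i, element_j, count) for pairs with at least one contact.
    """
    elements = sorted(set(nt_to_element.values()))
    index = {e: k for k, e in enumerate(elements)}
    n = len(elements)
    matrix = [[0] * n for _ in range(n)]
    for nt1, nt2 in interactions:
        e1, e2 = nt_to_element.get(nt1), nt_to_element.get(nt2)
        if e1 is None or e2 is None or e1 == e2:
            continue  # skip unassigned nucleotides and within-element pairs
        a, b = sorted((index[e1], index[e2]))
        matrix[a][b] += 1
    interacting = [
        (elements[i], elements[j], matrix[i][j])
        for i, j in combinations(range(n), 2)  # all i < j
        if matrix[i][j] > 0
    ]
    return elements, matrix, interacting


# Toy data, invented for illustration only.
toy_map = {1: "e_1", 2: "e_1", 10: "e_2", 11: "e_2", 20: "e_3"}
toy_pairs = [(1, 10), (2, 11), (1, 2), (11, 20)]
toy_elements, toy_matrix, toy_list = count_element_interactions(toy_map, toy_pairs)
```

The returned list corresponds to the interaction list output, and the matrix to the interaction matrix; writing a coordinate file per interacting pair is omitted here.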

The information about structural elements was summarized in a table (Figure for 16S and Figure for 23S). This information was further conveyed into a CSV file, where each nucleotide was assigned to the helical element or joining regions for the secondary level of organization, and structural elements or linker regions for the third level of organization (see Table III.1.6).

III.1.8. Visualization of results for 2J00 and 2J01

The program outputs a matrix (Figure III.1.8.1), an interaction list (Figure III.1.8.2), and PDB files with each pair of interacting elements (Figure III.1.8.3).

[Interaction matrix image; the individual cell values are not recoverable from the extracted text. Title: RNA-RNA interactions between structural elements in the 70S ribosome (Tt_R1_2J00_2J01). The axes list the 16S rRNA structural elements and linkers (e_, l_), the 23S rRNA structural elements and linkers (E_, L_), the 5S rRNA elements (e_1_in5s, e_2_3_4_5_in5s), mRNA, and the A-, P-, and E-site tRNAs. For each element, columns headed No. Nts, No. I-A, and I-A/Nt are given. Legend entries: mRNA interacting elements; tRNA interacting elements; elements directly interacting with tRNA; elements indirectly interacting with tRNA (2nd degree); long-range interacting elements; local interacting elements; inter-subunit interactions; 5S rRNA. Labeled features: A-site finger, L11 protuberance, L1 protuberance, P-site loop, E-site, A-site loop, sarcin loop.]

Figure III.1.8.1. A snapshot of the interaction matrix in 70S T. thermophilus, PDB files 2J00 and 2J01.

Figure III.1.8.2. A snapshot of the interaction list in 16S T. thermophilus colored by interaction type.

Figure III.1.8.3. A picture of the interacting element e_1_2_3_28_29 with e_4_15 in 16S T. thermophilus (PDB file 2J00) colored by elements.

[Figure III.1.8.4 image labels: hl24 – P-site; il34 – A-site; hl23 – E-site; hl24 – E-site; hl31 – P-site; hl30 – P-site; il18 – A-site; mhj42_29 – E-site; mhj42_29 – P-site; il44 – A-site.]

Figure III.1.8.4. tRNA interactions with 16S rRNA in the bacterial ribosome.

[Figure III.1.8.5 image labels: HL 78 (L1 protuberance) – E tRNA; HL 69 – A tRNA, P tRNA; HL 80 (P-site loop) – P tRNA; HL 88 – E tRNA; HL 38 (A-site finger) – A tRNA; HL 89 (PT) – A tRNA; HL 91 – A tRNA; HL 92 (A-site loop) – A tRNA.]

Figure III.1.8.5. tRNA interactions with 23S rRNA in the bacterial ribosome.

[Figure III.1.8.6 image labels: 85, 81, 39, 80, 88, L_82_88, 38, 41.]

Figure III.1.8.6. Distant helical elements connected by long-range interactions in 23S rRNA, interconnected by element E_81.

Figure III.1.8.7. Basepair annotations of long-range interactions between distant helical elements, interconnected by the element E_81.

III.1.9. Conclusions

Thus, RNA molecules can be organized into domains (usually predefined), structural elements, and the linkers between structural elements. A structural element is composed of stacked secondary motifs and may comprise either a helix, including the adjacent bases that stack on it, together with its internal and/or hairpin loops, or coaxially stacked helices with the loops they contain. The helical elements are composed of helices and tertiary motifs, such as hairpin or internal loops, and junctions.

Both the local and the global structure of the ribosome are affected during the translocation step. For example, A-site tRNA directly interacts in the small subunit with helices h18, h34, and h44, which are in three separate structural domains of the 30S subunit (see Figure III.1.8.4). U244 of helix h11 makes a cis Watson-Crick/Watson-Crick basepair with C893 and is stacked on G894, part of helix h27. Therefore, it is likely that h11 will be affected by any conformational changes of helix h27. Helix h1 is geometrically situated between h27 and h18; the latter comprises the “530 loop”, which is involved in EF-Tu binding and mRNA decoding. The nucleotide G530 bulges out to monitor the mRNA-tRNA codon-anticodon interaction to determine whether it is Watson-Crick complementary. Helix h34 is the linker element between the two parts of the small subunit head, and it is also likely to undergo conformational changes during the rotation of the head in the translocation step. Helix h27 of the small subunit makes basepair interactions with the basepairs 29-41 and 30-40 of the P-site tRNA, and it rearranges its local secondary structure when translocation is initiated. The conformational changes are likely to be transmitted to the structurally neighboring helices: h1, h11, h18, h24, and h44. Helix h1 is part of the structural core of 16S rRNA, and it is likely to further convey conformational changes throughout the small subunit. Moreover, h1 is stacked on h2, which is stacked on h27, which interacts directly with mRNA in addition to P-site tRNA. P-tRNA also makes phosphate-sugar hydrogen bonds with h30; these interactions have not yet been classified. Helix h24 is involved in direct interactions with E-site tRNA and with h44. Protein S12 intertwines between h27, h18, and h44, and may play a role in the transfer of conformational changes between them. Helix h44 is arguably one of the most important structural elements of the 16S rRNA, and perhaps of the whole ribosome, because it makes tertiary interactions with mRNA and with the A-site and P-site tRNAs, as well as several contacts with the 50S subunit. The purines A1492 and A1493 proofread the recruitment of cognate tRNA by flipping out to interact with the mRNA in the minor groove. A1492 also makes a tWW pairwise interaction with G530. Nucleotide 1400 of helix h44 stacks underneath nucleotide (C)34 of the P-tRNA. This nucleotide forms a near-cWW basepair with G966 from hl31. Within the small ribosomal subunit, h44 also interacts with elements h27, h13, and h24. Moreover, h44 engages in tertiary contacts with helices H62, H69, and H70 of the large subunit, creating structural bridges between the subunits.

The large subunit comprises the Peptidyl Transferase Center (PTC), assembled from helices H68, H69, H70, H71, H89, H90, H92, and H93 and protein L16 (Ban, 2000). Helix H68 acts as a connector between the A-site and P-site tRNAs, and changes in it would entail structural changes in the spatially neighboring helices H22, H64, H66, H70, H75, H76, H88, and H93 in the 50S subunit and in h44 in the 30S subunit, which could further transmit the conformational changes throughout the small ribosomal subunit. Helix H68 packs close to the CCA tails of the tRNAs and is a plausible candidate for promoting the exchange of structural information from the tRNAs to the ribosome, among tRNA binding sites, and between ribosomal subunits. Upon subunit association, H69 elongates so that it reaches the small subunit at h44. Helix H69 directly interacts with H68 and h44, and possibly also with h45 (the 2’-OH of A1919 from H69 packs in the minor groove of G1517 from h45, forming a near-cSS interaction at a perpendicular angle). Helices H70 and H71 connect with H64 and H65, in addition to the PTC helices H68, H69, H92, and H93, and might transfer conformational changes to the more distant neighboring helices (4-6 Å) H22, H58, H75, H76, and H88. In addition, H70 makes inter-subunit bridges with h23 and h44. H89 interacts with A-site tRNA and is in close vicinity to helices H39, H42, and H91. Helix H39 coaxially stacks across the junction on helix H36, which is further stacked on H46. Altogether, E_36_39_46 makes long-range tertiary contacts with H3, H20, the linker region that connects H4 with H23 (considered part of E_2_3_4, since these nucleotides stack on H4), H25, H37, H38, H39a, H72, H80, H81, and H89. The base of helix H90, where nucleotides 2505-2507 pair with 2582-2584, comes close to contacting nucleotide A76 of the A-tRNA. A2585 is stacked underneath A2602 and might be involved in guiding mRNA through the tunnel. HL92 is the highly conserved A loop and directly interacts (cSS/tSS) with nucleotides 2582-2507 at the base of H90. Helix H92 also interacts with H39 and makes several phosphate-sugar or phosphate-phosphate contacts with the joining region between helices H61 and H62. The universally conserved A2602, part of H93, is directly involved in properly arranging the substrates in the PTC, bulging out to make tertiary contacts with nucleotides 75 and 76 of the P-tRNA. Another important structural element is H38, the A-site finger. H38 connects with the P-site through H81 and with the E-site through H82. H81 further interacts with H80, making several basepair interactions; it stacks underneath H88, forming a Watson-Crick basepair in this helix with C2477, and packs along H85 and the linker L_82_88, establishing several A-minor-type motifs. One of the nucleotides situated in the linker region between H74 and H89 (which are nearly coaxially stacked, if this linker region is considered part of the entire structural element), A2451, also stacks on nucleotide 76 of the P-tRNA. A2451 was thought to assist in peptide bond formation through conventional chemical catalysis, before the mechanism based on the enthalpic effect of substrate rearrangement was proposed. The hairpin loop HL80 forms complementary basepairs with nucleotides 74-76 of the P-tRNA, earning it the name “P-site loop”. Helix H80 also interacts with H39. E-site tRNA contacts the small subunit at h23 and h28, and the large subunit at helices H68, H76, and H88 and at the joining regions between H76 and H77 (JR_76_77) and between H78 and H77 (JR_78_77).

Termination of protein biosynthesis is induced by the binding of release factors (RFs), which recognize the stop codon (RF1 recognizes UAG or UAA, and RF2 recognizes UGA or UAA), mimic tRNA structure, and bind to the A-site of the ribosome. The factor RF3 accelerates termination of translation, releasing the nascent protein, and promotes subunit dissociation. All the helices involved in tertiary contacts with the tRNAs interact with each other, directly or indirectly, and also with additional structural elements, thereby establishing an intricate network of tertiary interactions for propagating structural changes throughout the whole ribosome.

The organization, packing, and ultimately the functioning of the ribosome are mediated by tertiary contacts between structural elements. Tertiary interactions can be conveyed by describing the basepair interactions that occur at the interfaces between structural elements. Each structural element plays an integral role in the function of the ribosome as a biomolecular machine. Elucidating the structural aspects of the ribosome will assist in deciphering the mechanism by which this molecular machine crafts naturally complex biomolecules, the proteins. The goal of this work is to thoroughly investigate the ribosome structure and to describe and map the long-range interactions that occur between RNA modules. The immediate next steps of this study are to compare available ribosomal crystal structure files and to analyze the differences between structures with different ligands bound. A complete automation of the process should also include a program module that designates the structural elements of an RNA molecule (work in progress). Moreover, a scoring method assessing the “strength” of the interaction between two elements is essential for establishing the most functionally important structural elements of the ribosome.

III.2. Towards automated extraction of key information of ribosome structures in the Nucleic Acid Database (NDB)

In this section we investigate the conservation of interactions between tRNAs and the SSU, with the goal of writing automated procedures to recognize these interactions, to classify tRNAs as being docked into the A-, P-, and/or E-sites, and to determine the nature of the tRNA-mRNA interaction: cognate, near-cognate, or non-cognate. In order to choose structures suited for quantitative analysis, several criteria have to be taken into consideration. The resolution of the structure (in Å) is of primary importance because it reflects the amount of detail in the diffraction pattern of the crystal. The lower the resolution value (in Å), the cleaner the electron density map, the more X-ray diffraction data are available for the crystallized molecule, and therefore the better the structure can be modeled at the atomic level. The number of annotated basepairs per nucleotide (bps/nt) is another criterion that can be used in selecting representative structures for further analysis. This number is calculated for each PDB file by a FR3D module and reported on the BGSU RNA website (http://rna.bgsu.edu/main/). The bps/nt value for each structure can be found by hovering the cursor over the structure’s ID in the desired version of the non-redundant PDB file list (usually the most current one) on the website http://rna.bgsu.edu/rna3dhub/nrlist. Only structures with bps/nt > 0.4 were used in the analysis described here. The functional state of the ribosome is yet another criterion used in selecting structures to analyze. The functional state is determined by the molecular components found in the ribosomal complex in addition to the rRNA and ribosomal proteins (rproteins), including mRNA, tRNAs, translation factors, and antibiotics. The number and type of tRNAs and the sites to which they are bound are crucial to the state of the ribosome. The presence of translation factors (IFs, EFs, RFs, RRFs) in the structure and their binding sites on the ribosome also influence the state. The ribosome has three standard binding sites for tRNA, called the Aminoacyl (A-), the Peptidyl (P-), and the Exit (E-) sites, and a particular functional state of the ribosome can have 0, 1, 2, or all 3 tRNAs bound. In addition to the standard functional states, there are hybrid states, which capture the ribosome as it transitions between functional states to translocate the tRNAs through the ribosome as the mRNA sequence is read and proteins are made. The structure databases also contain structures with proteins or mRNA sequences that trigger recoding of the message. Moreover, different ligands, such as antibiotics, can trap the ribosome in diverse states, and the effects of their presence are also important in choosing structures to analyze.
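The selection criteria described above can be combined in a simple filter. The following Python sketch is only an illustration under stated assumptions: the record type, field names, and function name are hypothetical, the structure identifiers and numerical values are invented placeholders rather than data from any real PDB entry, and ranking the retained structures by resolution is an assumed convention. Only the bps/nt > 0.4 threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureRecord:
    """Hypothetical summary of one ribosome structure."""
    structure_id: str
    resolution: float      # in angstroms; lower values mean finer detail
    bps_per_nt: float      # annotated basepairs per nucleotide
    trna_sites: frozenset  # subset of {"A", "P", "E"}


def select_structures(records, min_bps_per_nt=0.4):
    """Keep records with bps/nt above the threshold, best resolution first."""
    kept = [r for r in records if r.bps_per_nt > min_bps_per_nt]
    return sorted(kept, key=lambda r: r.resolution)


def occupancy_label(record):
    """Describe which of the three standard tRNA sites are occupied."""
    sites = [s for s in ("A", "P", "E") if s in record.trna_sites]
    return "/".join(sites) if sites else "no tRNA"


# Placeholder records with invented values, for illustration only.
records = [
    StructureRecord("S1", 3.0, 0.45, frozenset({"A", "P", "E"})),
    StructureRecord("S2", 2.8, 0.35, frozenset({"P"})),
    StructureRecord("S3", 3.5, 0.41, frozenset()),
]
selected = select_structures(records)
```

Hybrid states, bound factors, and antibiotics would need additional fields; they are left out to keep the sketch short.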

Presently, advanced searches for sets of structures in specific functional states cannot be fully carried out on the NDB website because there is no functionality for specifying many of the variables listed above. Hence, the structure selection process must currently be done manually. As a first step towards improving functional annotation of ribosome structures, the relevant data were compiled manually for the small ribosomal subunit of the bacterium Thermus thermophilus, which belongs to the non-redundant structures list (NR_4.0_81883.24), and a table with the type of data collected is presented further below (Table III.2.2). These structures were selected because they represent the most diverse set of functional states. The table can be accessed online at http://tinyurl.com/16S-T-Thermophilus-summary. Because the manual collection of all this information is very time consuming, it is imperative to automate the data collection process. So many ribosome structures are continually being solved that it is not possible for NDB to manually extract this information for all new ribosome structures on an ongoing basis. Scientists need to be able to easily find specific structures in the PDB database representing the ribosome in each functional state. The data in the table produced in this study will serve as a test set for developing an automated pipeline to add functional annotations.

The first step in designing a pipeline to automate the data collection process is to carry out the process manually in order to trace the steps and translate them into a programming algorithm. A hand-curated table was created with all the relevant information, some of it collected manually. Ribosomes from three bacteria have been studied intensively: Thermus thermophilus (abbreviated T. thermophilus or T. th.), Escherichia coli (abbreviated E. coli or E.c.), and Deinococcus radiodurans (abbreviated D. radiodurans or D.r.). The largest variety and number of functionally distinct SSU structures exist for Thermus thermophilus. The SSUs of the bacterium D. radiodurans and the archaeon Haloarcula marismortui (abbreviated H. marismortui or H.m.) have not yet been solved crystallographically. Representative eukaryotic ribosomes have only been solved recently, and both Saccharomyces cerevisiae (abbreviated S. cerevisiae or S.c.) and Tetrahymena thermophila (Tetrahymena for short) have only a few structures published to date. The LSU of the mitochondrial yeast ribosome has also been solved. Therefore, for this study data were collected for the structures of the small ribosomal subunit of T. thermophilus. At the time when the table was compiled (Spring 2014), the most current version of the non-redundant list of RNA structures was version 1.56, found at http://rna.bgsu.edu/rna3dhub/nrlist/release/1.56, with the non-redundant list of 30S T. thermophilus structures designated NR_4.0_81883.24. This list is updated weekly, and each new version of the list has a different number.

Table III.2.1. A snapshot of the Motif Atlas validation table posted electronically that contains functional annotations for all the NDB structures in the equivalence class for the 30S ribosomal subunit of T. thermophilus. These are the structures in the class NR_4.0_81883.24 downloaded from rna.bgsu.edu in April 2014. The table can be found online at http://tinyurl.com/16S-T-Thermophilus-summary.

A summary of the data collected and the source of each data item is presented in Table III.2.2. The structures were downloaded from the PDB website (http://www.rcsb.org/pdb/). The mRNA sequences, when present, were obtained from the NDB website (http://ndbserver.rutgers.edu/) to identify the corresponding chain ID. Manual analysis was necessary to determine the location, the type, and the anti-codon sequences of the tRNAs, and was carried out using the Swiss-PDB program, downloaded from http://spdbv.vital-it.ch/ (Guex 1996). Basic metadata for each structure, e.g. the ID under which the structure can be found on the PDB website, the title of the structure, the resolution, the citation, the source organism and the method of analysis, can be obtained electronically by directly querying the PDB API (Application Programming Interface). More detailed information, like the type of compounds in the ribosomal complex, the compounds' chain IDs, and information regarding the presence of additional molecules such as translation factors, recoding RNA sequences or proteins, can also be found on the PDB website, but much of this information is not standardized. For example, there are no assigned chain IDs that correspond to mRNA in all structures. The mRNA sequences were obtained from the NDB website, where RNA sequences are listed for each structure by chain ID. Some general information and even short descriptions of other compounds in the 30S ribosomal complex can be found on the PDB website. But the positioning of tRNAs in the ribosome, the tRNAs' anti-codons, the mRNA codon sequences in each functional site, and even some mutations found in the actual structures are not listed clearly on the PDB website in defined data fields, and have to be collected manually by visual inspection of each structure using a 3D viewing program. The type of tRNA was deduced by reading the anti-codon sequence: the tRNA anticodon and the corresponding mRNA codon are Watson-Crick complementary, and for each codon the tRNA type can be found from a standard table. However, it is also necessary to read the corresponding publications because sometimes mutations are made in the anti-codon sequence without changing the overall identity of the tRNA, so it still carries the original amino acid.

The Notes - Observations column in this table was populated with miscellaneous information considered important, such as discrepancies between information listed on the PDB website and the actual data found in the structure. For example, for structures 3KNH, 3KNJ, 3KNL, and 3KNN the authors listed Gln-tRNA on the PDB website as the type of E-site tRNA, but according to the anticodon the tRNA is tRNA(His). Some observations also recorded in this column included information about the role or the effect of certain additional compounds such as toxins, recoding RNA sequences and protein factors, or details about some unusual antibiotics.
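The deduction of the tRNA type from the anti-codon described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: it uses strict Watson-Crick complementarity and the standard genetic code, and it ignores wobble pairing, modified nucleotides and engineered anti-codon mutations, which is exactly why the corresponding publications still have to be consulted. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons enumerated in UCAG order
# (first base varies slowest, third base fastest); '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def cognate_codon(anticodon):
    """Return the Watson-Crick complementary codon (5'->3') for an anticodon (5'->3')."""
    return "".join(COMPLEMENT[nt] for nt in reversed(anticodon.upper()))

def trna_type(anticodon):
    """One-letter code of the amino acid expected for a tRNA with this anticodon."""
    return CODON_TABLE[cognate_codon(anticodon)]
```

For instance, under these assumptions the anti-codon 5'-GAA-3' pairs with the codon UUC and is read as tRNA(Phe), while 5'-GUG-3' pairs with CAC and is read as tRNA(His).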

The structures in the NDB and on rna.bgsu.edu (http://rna.bgsu.edu/main/) are organized into “Equivalence Classes” based on the sequence of the longest RNA in the structure. Thus all structures that contain E. coli 16S rRNA are in the same equivalence class, while all E. coli 23S rRNA and all T. thermophilus 16S rRNA structures are placed into different classes. The structures in each equivalence class are ranked by the number of FR3D-annotated basepairs of all types per nucleotide (bps/nt), a useful metric for the quality of the 3D modeling of a structure. Only structures with bps/nt > 0.4 were analyzed.

The structure with the highest value of this metric is selected for inclusion in the non-redundant data set posted each week on rna.bgsu.edu (http://rna.bgsu.edu/main/) and on the NDB website (http://ndbserver.rutgers.edu/), and is used for FR3D searches for building new versions of the 3D Motif Atlas each month. For most equivalence classes, the representative structure has a value > 0.4 bps/nt.
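The choice of one representative per equivalence class can be sketched as follows. This is only an illustration of the ranking rule stated above (highest bps/nt within a class); the class key is a hypothetical stand-in for the comparison of the longest RNA sequence, and the function is not the procedure actually used to build the non-redundant lists.

```python
from collections import defaultdict

def pick_representatives(structures):
    """structures: iterable of (pdb_id, longest_rna_key, bps_per_nt) tuples.

    Returns {longest_rna_key: pdb_id of the member with the highest bps/nt}.
    """
    classes = defaultdict(list)
    for pdb_id, rna_key, bps_per_nt in structures:
        classes[rna_key].append((bps_per_nt, pdb_id))
    # max() compares bps/nt first; ties fall back to the PDB ID string.
    return {key: max(members)[1] for key, members in classes.items()}

# Hypothetical IDs, class labels and values.
reps = pick_representatives([
    ("AAA1", "class_1", 0.41),
    ("AAA2", "class_1", 0.43),
    ("BBB1", "class_2", 0.39),
])
```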

The key information was collected for each of the 204 structures in the NR_4.0_81883.24 list for T. thermophilus 16S rRNA. Due to the size of the table it is not possible to display it in the manuscript; only a snapshot of a portion of the table is shown in Table III.2.1. The entire table can be found online at http://tinyurl.com/16S-T-Thermophilus-summary. An algorithm for collecting these data automatically was proposed.

Because each of the structures in this comprehensive list was analyzed manually, it was possible to identify the location of each tRNA, its anti-codon sequence and the mRNA codon sequence at each functional site. However, this process is tedious and time-consuming, and therefore not sustainable in the long term. It needs to be automated. Automation requires devising an algorithm that can be implemented using the existing database of 3D structure annotations, which are collected automatically for each new structure added to the NDB. The position of each tRNA within the ribosome can be deduced from the pairwise nucleotide interactions it makes with 16S rRNA and mRNA. To determine the consensus interactions between tRNA, mRNA, and rRNA, a statistical sample of ribosome structures was examined and interactions were compiled by binding site to construct a statistical signature. There are 204 structures in the NR_4.0_81883.24 list, but only 46 have all three tRNAs and mRNA present. In twelve of these, the translation factor EF-Tu is also present and the A-site tRNA is actually in the A/T hybrid state, with its aminoacyl end bound by EF-Tu; these structures were excluded from analysis. Of the remaining 34 structures, twelve were chosen for further analysis. All have a reasonable resolution and a high value of bps/nt. Of the twelve structures selected, ten have a resolution of 3.3 Å or better, one has a resolution of 3.5 Å and only one has a resolution of 3.9 Å (3V6U). A value of 0.4 or above was the cutoff for bps/nt (Table III.2.3).

Information Collected Source PDB ID PDB website, field “Split Entry” Corresponding 50S Structure Manual analysis of 3D structure Title Structure PDB website, field “Title” RNA source organism PDB website, field “Organism” Method of analysis PDB website, field “Experimental details” Resolution (A) PDB website, field “Experimental details” Features (mutations, modifications, etc) PDB website, field “Molecular Description”, Line: “Molecule” 16S Chain ID PDB website, field “Molecular Description”,Line: “Chains” Anti-codon sequence Manual analysis of 3D structure

tRNA type Manual analysis of 3D structure A-site tRNA Description PDB website, field “Molecular Description”,Line: “Molecule” Chain ID Inspection of the structure Anti-codon sequence Manual analysis of 3D structure

tRNA type Manual analysis of 3D structure P-site tRNA Description PDB website, field “Molecular Description”,Line: “Molecule” Chain ID Inspection of the structure Anti-codon sequence Manual analysis of 3D structure

tRNA type Manual analysis of 3D structure E-site tRNA Description PDB website, field “Molecular Description”,Line: “Molecule” Chain ID Inspection of the structure

State Manual analysis of 3D structure

Hybrid Description PDB website, field “Molecular Description”,Line: “Molecule”

Chain ID Inspection of the structure A-site codon sequence Visual analysis of 3D structure P-site codon sequence Visual analysis of 3D structure mRNA E-site codon sequence Visual analysis of 3D structure Chain ID PDB website, field “Molecular Description”, Line: “Chains” IFs, EFs, RFs, RRF PDB website, field “Molecular Description”,Line: “Molecule” Translation Factors Chain ID PDB website, field “Molecular Description”, Line: “Chains” RNA PDB website, field “Molecular Description”,Line: “Molecule” Recoding RNA Chain ID PDB website, field “Molecular Description”, Line: “Chains”

Protein PDB website, field “Molecular Description”,Line: “Molecule” Recoding proteins Chain ID PDB website, field “Molecular Description”, Line: “Chains” Antibiotics PDB website, field “Ligand Chemical Component” Ligands Others PDB website, field “Biologically Interesting Molecule (BIRD)” Observations PDB website, interesting information, and personal remarks Notes Mutations PDB website, field “Modified Residues” Compounds Proteins PDB website, field “Molecular Description”,Line: “Molecule” Citation Authors, Title, Year PDB website, field “Primary Citation”

Table III.2.2. Sources of data collected to functionally annotate structures of the T. thermophilus 30S ribosome, non-redundant class NR_4.0_81883.24, downloaded April 2014 from rna.bgsu.edu.

The FR3D annotations of all pairwise interactions between the tRNA and rRNA and between the tRNA and mRNA were collected for each of these structures from the website rna.bgsu.edu, where FR3D analyzes all RNA-containing 3D structures in each equivalence set. For example, data for 1FJG are found at http://rna.bgsu.edu/rna3dhub/pdb/1FJG/interactions/fr3d/all. The interactions were arranged according to the anticodon nucleotide position and summarized in tables (Table III.2.4 and Table III.2.5).

After a careful analysis of the tertiary interactions, the frequency with which they appear in the structures, and the type of interaction, scores were assigned to each interaction. The greatest weight was given to the most conserved interactions. For example, A-site anticodon nucleotide 2 is almost always (in 91.67% of the structures, i.e. 11 of 12) annotated as “perpendicular” to rRNA nucleotide G530. The only structure in which this interaction was not found is 4K0L (3.3 Å resolution), but this structure seems to have the fewest of the interactions specific to the A-site.
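The frequency on which such weights are based can be computed as in the following sketch. It assumes, hypothetically, that the annotations of each structure have already been reduced to a set of (anticodon position, rRNA nucleotide, interaction type) tuples; the toy input uses placeholder structure names rather than real PDB entries, and using the raw frequency as the weight is an illustrative choice, not necessarily the scoring scheme applied in this work.

```python
from collections import Counter

def interaction_frequencies(annotations):
    """annotations: {structure_id: set of (anticodon_position, rrna_nt, interaction_type)}.

    Returns {interaction: fraction of structures in which it is annotated}.
    """
    counts = Counter()
    for interactions in annotations.values():
        counts.update(set(interactions))  # count each interaction once per structure
    n = len(annotations)
    return {interaction: count / n for interaction, count in counts.items()}

# Toy input: 12 hypothetical structures, 11 of which carry the same interaction.
key = (2, "G530", "perp")
toy = {f"S{i:02d}": {key} for i in range(1, 12)}
toy["S12"] = set()
weights = interaction_frequencies(toy)
print(f"{100 * weights[key]:.2f}%")  # 11 of 12 structures
```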

In these 12 structures, as well as in other structures analyzed throughout this work, the E-site tRNA is not located in a well-defined site with consistently annotated rRNA and mRNA interactions; therefore no data could be summarized. A different, less granular approach will be needed to identify the E-site tRNA in the SSU. By contrast, the interactions at the E-site of the LSU are much more reliable and reproducible.

| PDB ID | Class # | Corres. 50S | bps/nt | Res. (Å) | A-tRNA type | P-tRNA type | E-tRNA type | A-site codon | P-site codon | E-site codon | mRNA chain | A-tRNA/mRNA interaction | Antibiotics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2J00 | 124 | 2J01 | 0.402 | 2.8 | (Phe) | (Phe) | (Phe) | UUC | AUG | AAA | X | Cognate | Paromomycin |
| 2WDG | 139 | 2WDI | 0.399 | 3.3 | (Phe) | (Phe) | (Phe) | UUC | AUG | AAA | X | Cognate | Paromomycin |
| 3TVG | 77 | 3TVH | 0.402 | 3.1 | (Leu) | (Phe) | (Phe) | CUU | AUG | UUU | 1 | Wobble | |
| 3UYD | 107 | 3UYE | 0.401 | 3.0 | (Leu) | (Phe) | (Phe) | UUU | AUG | UUU | 1 | Near-cognate | |
| 2WDH | 146 | 2WDJ | 0.400 | 3.3 | (Phe) | (Phe) | (Phe) | UUC | AUG | AAA | X | Cognate | Paromomycin |
| 3I8H | 111 | 3I8I | 0.420 | 3.1 | (Phe) | (Phe) | (Phe) | UUU | UUU | UUU | 1 | Wobble | |
| 3KNL | 102 | 3KNM | 0.407 | 3.5 | (Gln) | (Phe) | (His) | CAG | AUG | AAA | V | Cognate | Capreomycin |
| 3UZ3 | 98 | 3UZ1 | 0.403 | 3.3 | (Leu) | (Phe) | (Phe) | UUU | AUG | UUU | 1 | Near-cognate | Paromomycin |
| 3UZ6 | 79 | 3UZ9 | 0.411 | 3.0 | (Tyr) | (Phe) | (Tyr) | UAC | AUG | AAA | 1 | Cognate | Paromomycin |
| 3UZM | 180 | 3UZN | 0.399 | 3.3 | (Tyr) | (Phe) | (Tyr) | UGC | AUG | AAA | 1 | Non-cognate | Paromomycin |
| 3V6U | 131 | 3V6W | 0.401 | 3.9 | (Phe) | (Phe) | (Phe) | UUC | AUG | A | X | Cognate | |
| 4K0L | 156 | 4K0M | 0.402 | 3.3 | (Ser) | (Phe) | (Phe) | PSUAG | AUG | A | X | Non-cognate | |
| 2Y10 | 173 | 2Y10 | 0.418 | 3.1 | (Trp) | (Phe) | (Phe) | UGG | UUC | AUG | X | Cognate | Kirromycin |

Table III.2.3. 16S PDB files chosen for analysis of tRNA-rRNA and tRNA-mRNA interactions.


A-site tRNA INTERACTIONS EXTRACTED BY FR3D - A 34 is the first nucleotide of the anticodon (Position 1)

Columns: Anti-codon Nt Pos | A-tRNA NT type | A-tRNA NT No. | 16S rRNA NT type | 16S rRNA NT No. | Interactions observed in PDB files: 2J00 2WDG 3TVG 3UYD 2WDH 3I8H 3KNL 3UZ3 3UZ6 3UZM 4K0L

A-site tRNA/mRNA interaction for each PDB file, in the same order: Cognate Cognate Wobble Near Wobble Near Cognate Near Cognate Non-cogn Non-cogn

1 G 34 C 1054 perp ns53 ns53 ns53 s53 perp ns53 ns53 ns53 perp

2 A 35 G 530 perp perp perp perp perp perp perp perp perp perp

2 A 35 A 1492 ntSS ntSS ntSS ntSS ntSS ntSS ntSS ntSS

3 G 36 G 530 perp perp perp perp

3 A 36 A 1493 tSS tSS ntSS tSS tSS tSS tSS tSS ntSS

+' 1 (4) A 37 A 1493 ns53 ns53 ns53

Table III.2.4. Summary of interactions between A-site tRNA and 16S rRNA in chosen files from the 30S subunit equivalence class of T. thermophilus, downloaded from rna.bgsu.edu, April 2014. “Near” indicates a near-cognate interaction between the A-site tRNA and mRNA. “Non-cogn” indicates a non-cognate interaction between the A-site tRNA and mRNA. “Wobble” indicates a G-U interaction between the third nucleotide of the mRNA codon and the first nucleotide of the tRNA anti-codon.


P-site tRNA INTERACTIONS EXTRACTED BY FR3D - A 34 is the first nucleotide of the anticodon (Position 1)

Columns: Anti-codon Nt Pos | P-tRNA NT type | P-tRNA NT No. | 16S rRNA NT type | 16S rRNA NT No. | Interactions observed in PDB files: 2J00 2WDG 3TVG 3UYD 2WDH 3I8H 3KNL 3UZ3 3UZ6 3UZM 4K0L

Codon/anti-codon interaction for each PDB file, in the same order: Cognate Cognate Cognate Cognate Cognate Wobble Cognate Cognate Cognate Cognate Cognate

-4 G 30 G 1338 ntSS ntSS ns53 ntSS ntSS ntSS ntSS ntSS

-4 G 30 A 1339 tSS tSS tSS tSS tSS tSS ntSS ntSS tSS ntSS tSS

-3 G 31 A 1339 ntSS ntSS ntSS

1 C 34 C 1400 s55 s55 s55 s55 s55 s55 s55 s55 s55 s55 s55

6 C 40 A 1339 cSS cSS cSS cSS cSS cSS cSS cSS cSS

7 C 41 G 1338 cSS cSS ncSS cSS cSS ncSS cSS cSS cSS cSS cSS

Table III.2.5. Summary of interactions between P-site tRNA and 16S rRNA in chosen files from the 30S subunit equivalence class of T. thermophilus, downloaded from rna.bgsu.edu, April 2014. “Wobble” indicates a G-U interaction between the third nucleotide of the mRNA codon and the first nucleotide of the tRNA anti-codon.

During the analysis of these twelve structures, we discovered considerable variation in the number and type of 16S rRNA-tRNA and 16S rRNA-mRNA interactions. Further scrutiny showed that there was also variation in the tRNA-mRNA anti-codon/codon interactions that needed to be taken into account, as some structures had near-cognate and others non-cognate interactions between the tRNA and mRNA. The data were reorganized to group the structures according to the type of codon/anti-codon interaction. The first step of an automated algorithm will therefore have to extract the codon and anti-codon sequences and determine whether the interaction is cognate, near-cognate or non-cognate. For the present analysis, we did this manually. We find that the A-site tRNA-mRNA interaction is cognate for 23 structures, wobble for 6 structures, near-cognate for 4 structures, and non-cognate for 12 structures. One of the structures, 4K0P, has only one mRNA nucleotide in the A-site, and therefore the type of A-site tRNA/mRNA interaction could not be deduced.
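A first, deliberately conservative version of this classification step might look like the sketch below. It only separates the cases that are defined unambiguously in this chapter: cognate (three Watson-Crick pairs) and wobble (a G-U pair between the third codon nucleotide and the first anti-codon nucleotide). Because the exact criteria separating near-cognate from non-cognate are not formalized here, the sketch leaves those two classes merged; that merged label, like the function name, is an assumption of the example. Incomplete codons and modified residues are reported as undetermined.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
GU_PAIRS = {("G", "U"), ("U", "G")}

def classify_decoding(codon, anticodon):
    """Classify a codon/anticodon pair (both written 5'->3').

    Returns "cognate", "wobble", "near- or non-cognate", or "undetermined".
    """
    codon, anticodon = codon.upper(), anticodon.upper()
    if len(codon) != 3 or len(anticodon) != 3:
        return "undetermined"  # e.g. only one mRNA nucleotide modeled in the site
    # Antiparallel pairing: codon position i pairs with anticodon position 2 - i.
    pairs = [(codon[i], anticodon[2 - i]) for i in range(3)]
    first_two_ok = all(p in WATSON_CRICK for p in pairs[:2])
    if first_two_ok and pairs[2] in WATSON_CRICK:
        return "cognate"
    if first_two_ok and pairs[2] in GU_PAIRS:
        return "wobble"
    return "near- or non-cognate"
```

With an anti-codon of 5'-GAA-3', for example, the codon UUC is classified as cognate and UUU as wobble, whereas a single-nucleotide A-site codon is returned as undetermined.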

Focusing on the cognate structures, we find that these interactions are conserved:

- the second anticodon nucleotide of A-tRNA is perpendicular to G530 and forms an ntSS interaction with A1492; the third anticodon nucleotide of A-tRNA forms a tSS interaction with A1493;

- the nucleotide four positions upstream of the anticodon loop of P-tRNA is involved in a triplex with nucleotides 1338 and 1339, with the tSS interaction between this nucleotide and 1339 being highly conserved; the first nucleotide of the anticodon is stacked (s55) on nucleotide 1400 of the 16S rRNA; a few nucleotides downstream of the anticodon loop, nucleotides (+6) and (+7) form a ribose zipper (two cSS interactions) with nucleotides 1338 and 1339, mentioned above.

By combining the mRNA and rRNA interactions of the A-site tRNA it should be possible to automatically determine not only the presence of the tRNA, but the type of mRNA interaction as well.
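The conserved contacts listed above suggest how a tRNA chain could be assigned to the A- or P-site automatically. The following sketch scores a chain by the fraction of each site's signature that it reproduces. The signature sets are taken from the list above, but the matching rule, the 0.5 threshold and the treatment of a leading "n" as a "near" variant of the same interaction are assumptions made for illustration. The E-site is omitted because no consistent SSU interactions were found for it.

```python
# Conserved contacts named in the text, as (16S rRNA nucleotide number, interaction type).
SIGNATURES = {
    "A": {(530, "perp"), (1492, "tSS"), (1493, "tSS")},
    "P": {(1339, "tSS"), (1400, "s55"), (1339, "cSS"), (1338, "cSS")},
}

def normalize(interaction_type):
    """Assumption: a leading 'n' marks a 'near' annotation (ntSS, ncSS, ns53)."""
    return interaction_type[1:] if interaction_type.startswith("n") else interaction_type

def assign_site(contacts, min_fraction=0.5):
    """contacts: iterable of (rrna_nt_number, interaction_type) for one tRNA chain.

    Returns "A", "P", or None when no signature is matched well enough.
    """
    observed = {(nt, normalize(kind)) for nt, kind in contacts}
    scores = {site: len(observed & sig) / len(sig) for site, sig in SIGNATURES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_fraction else None
```

A weighted version, in which each contact contributes its observed frequency rather than a unit score, would follow the scoring idea described earlier in this section more closely.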

III.3. Evaluation of classification of recurrent motifs in the ribosome structures by the RNA 3D Motif Atlas

Recently, Petrov and co-workers (Petrov 2013) reported an automated procedure for extracting and clustering hairpin and internal loop motifs from a non-redundant (NR) set of high-quality RNA 3D structures in the NDB. The Motif Atlas can be found at http://rna.bgsu.edu/rna3dhub/motifs. The question we address in this chapter is how well the Motif Atlas groups together geometrically similar motifs. The ability of the procedure to cluster homologous, recurrent motifs from ribosomal RNA structures has not been previously evaluated. Currently there are enough ribosome structures from diverse organisms to make it meaningful to undertake this evaluation.

An important question for any automatic clustering procedure is whether it groups together objects that scientists agree belong together while also excluding instances that do not. The first step in evaluating a clustering procedure is to identify data sets that include related objects which form distinct groups, preferably based on multiple criteria. Objects belonging to the same group should be similar but not identical to provide an adequate test. In structural bioinformatics, a promising place to look for similar structures is in homologous molecules, for example the set of all tRNA or ribosomal RNA structures. There are many different atomic-resolution tRNA structures in the NDB, so this presents a rich data set, but tRNAs are small RNA molecules and effectively offer only three distinct hairpin motifs (the dihydrouracil (D) hairpin loop, the anti-codon loop, and the TΨC (T) loop), one multihelix junction, and no internal loops.

In the course of this work we evaluated the clustering of those loops from tRNA and found, as expected, that practically all T-loops from tRNA are clustered together, whereas D-loops, which are more loosely structured and more variable in length, are placed in several different but related motif groups by the Motif Atlas pipeline. Surprisingly, anti-codon loops also form more than one group. Most form one group with the conserved U33 forming the U-turn. The other groups are diverse and populated by loops distorted by interactions with proteins, principally cognate aminoacyl-tRNA synthetases that bind and unfold the anti-codon to “read” its sequence to ensure correct recognition.

A far richer source of hairpin and internal loop motifs is the ribosome. Until recently, however, the database of structures was limited to the LSU (large subunit) of one archaeon (H. marismortui) and three bacteria (E. coli, T. thermophilus, and D. radiodurans), and only two SSU (small subunit) structures (E. coli and T. thermophilus). By the time this study was carried out, eukaryal SSU and LSU structures from two organisms (S. cerevisiae and Tetrahymena thermophila) and one mitochondrial LSU structure had also become available for analysis.

The basic premise of this study is that hairpin and internal loops from homologous structural elements in the ribosomes of different organisms that belong to the conserved core of the ribosome and are conserved in length are also likely to be conserved in structure, if not in sequence as well, and therefore should be grouped by the automated procedure into the same motif groups in the Atlas. The first step of the analysis is to identify structural elements of the SSU and LSU that are conserved in length within as well as across all phylogenetic Domains (Archaea, Bacteria and Eukarya). This was done by consulting the summary 2D figures prepared by the Gutell laboratory.

The Motif Atlas compiles the non-redundant list of RNA structures by extracting from the PDB the structures with the best resolution and highest value of bps/nt, one representative for each type of molecule, including 16S rRNA of T. thermophilus, E. coli, S. cerevisiae, and Tetrahymena, and 23S rRNA of T. thermophilus, E. coli, H. marismortui, D. radiodurans, Tetrahymena, and S. cerevisiae. Representative structures from the following ribosomes were the subject of motif analysis: for the SSU, T. thermophilus (PDB file 1FJG), E. coli (PDB file 2AW7), Tetrahymena thermophila (PDB file 4BPP), and S. cerevisiae (PDB file 3U5F); for the LSU, T. thermophilus (PDB file 4NVV), E. coli (PDB file 2QBG), Tetrahymena (PDB file 4A1B), S. cerevisiae (PDB file 3U5H), H. marismortui (PDB file 1S72), and D. radiodurans (PDB file 4IOA). The version of the non-redundant structure list used was 1.56, found at http://rna.bgsu.edu/rna3dhub/nrlist/release/1.56.

Blake Sweeny assisted with downloading the metadata for the analysis from the NDB and the rna.bgsu.edu website, providing a CSV file with the loop IDs, motif IDs and their URLs, the PDB IDs from which each motif was extracted, the sequences of the motifs and their respective nucleotide ranges. This allowed the location of each motif to be deduced from the nucleotide numbers, so that each motif could be assigned to the helical element to which it belongs and grouped with homologous motifs. The molecule names (SSU or LSU, Group I intron, riboswitch, etc.) were added manually according to the PDB file name. Hairpin loops and internal loops were placed in separate tables. First, the motif data summary tables were sorted by motif ID, so that each type of motif could be given a different color. Next, the data in the tables were sorted by helical element location, molecule type and motif group, so that homologous motifs from homologous molecules are displayed together (Figure III.3.1). The entire tables can be found at http://tinyurl.com/Motif-Atlas-validation in the “Hairpin loops” and “Internal loops” spreadsheets, respectively. All instances of the motifs found in the ribosome were visually analyzed and motif clustering was evaluated using the motif URLs. The results of the analysis were summarized by marking each hairpin loop and internal loop on 2D structure diagrams using colored circles to indicate the level of structural similarity, and therefore the success of the Motif Atlas clustering procedure in placing motif instances from homologous locations in the appropriate motif group. Certain regions are highly conserved in structure and often in sequence, and all motif instances from those locations have the same or very similar structures. These motifs are overlaid with filled blue circles in the 2D diagrams in Figures III.3.2 through III.3.7. Figures III.3.2 and III.3.5 show the hairpin loops and internal loops of the SSU rRNA, and Figures III.3.3-III.3.4 and III.3.6-III.3.7 show the hairpin loops and internal loops of the LSU rRNA. The results were mapped onto the 2D diagrams of the SSU and LSU, respectively.
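The second sorting step described above (helical element, then molecule type, then motif group) can be reproduced with the Python standard library, as in the sketch below. The column names and the example rows are hypothetical placeholders, not the layout of the CSV file actually used, and helix labels are assumed to consist of one letter followed by a number (labels such as 33b would need extra handling).

```python
import csv
import io

def helix_key(label):
    """Split a helix label such as 'H24' or 'h12' into (prefix, number)."""
    return (label[0], int(label[1:]))

def sort_motif_rows(csv_text):
    """Order motif instances so that homologous motifs are listed next to each other."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: (helix_key(r["helix"]), r["molecule"], r["motif_id"]))

# Placeholder rows; real motif and PDB IDs would come from the downloaded CSV file.
example = "\n".join([
    "helix,molecule,motif_id,pdb_id",
    "H25,LSU,IL_B,PDB2",
    "H24,LSU,IL_A,PDB1",
    "H24,LSU,IL_A,PDB2",
])
ordered = sort_motif_rows(example)
```

Because Python's sort is stable, instances that share helix, molecule and motif group keep their original relative order.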

If the clustering procedure places all the structurally similar motifs in the same motif group, the solid blue circle is given a blue outline. If the motif instances are placed in two or more structurally similar motif groups, the blue circle is given a green outline. Both these outcomes are considered successful clustering. In fact, most HL and IL motifs in both the SSU and LSU that have solid blue circles, indicating that all instances have similar structures, also have blue outlines, and relatively few have green outlines. If, however, two or more structurally similar motif instances are placed in structurally different motif groups, this is indicated with a blue circle with a red outline. The only case of a blue circle with a red outline is HL35 of the LSU.

A second case concerns non-conserved regions of the ribosome, which can vary in length and structure at multiple levels. A good example is helix 6 in the SSU, which varies in length within and across phylogenetic groups. Each of the HL instances is different within and across Domains, and the difference is indicated in the 2D diagrams with a red circle with a red outline. This is also the case for hairpin loops 9, 10, 17 and 26 (Figures III.3.6 and III.3.7).

Certain regions are conserved within the Bacteria Domain, but are structurally different from the homologous regions in Eukarya. This is the case for several hairpin loops in both the SSU and LSU, and the Motif Atlas places the similar motifs, corresponding to structures from the same Domain, in the same motif group. Thus, the same or similar motifs from Bacteria are grouped together, while the motifs from Eukarya form a separate group. These cases, illustrated on the 2D diagrams by yellow circles with blue outlines, account for more than half of the instances in which the Motif Atlas places similar structures in the same motif group and different structures in different motif groups. These outcomes are also considered successful motif clustering.

In certain cases the Motif Atlas separates motifs which are similar, placing them in different motif groups. Sometimes the partitioning is due to conformational changes arising from differences in functional state. We can distinguish three sub-cases for the purpose of evaluating the success of the clustering. In the first sub-case, each similar instance is clustered with those like it in the same group and all those that are different are placed in separate groups. These cases are indicated by solid yellow circles with a blue outline. When similar instances are placed in similar groups, a green outline is used. When similar instances are placed in different groups or different instances in the same group, a red outline is used. Examining the 2D figures, one sees that there is a considerable number of yellow circles, especially for hairpin loops in the LSU rRNA, but most have a blue outline and only a few a green outline. No yellow circles with a red outline appear.
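The color code used on the 2D diagrams can be summarized as a small lookup, sketched below. The category labels (for example "all_similar" or "misplaced") are hypothetical names introduced for this illustration; the colors themselves follow the scheme described in the preceding paragraphs.

```python
FILL = {
    "all_similar": "blue",              # all instances have the same or very similar structure
    "similar_within_domain": "yellow",  # similar within a Domain, different across Domains
    "all_different": "red",             # non-conserved region, every instance differs
}
OUTLINE = {
    "same_group": "blue",       # similar instances share one motif group
    "similar_groups": "green",  # similar instances fall in structurally similar groups
    "misplaced": "red",         # similar instances split apart, or different ones merged
}

def circle_style(similarity, clustering):
    """Return (fill, outline) colors for one loop on the 2D diagram."""
    if similarity == "all_different":
        # Non-conserved loops are drawn as red circles with a red outline.
        return ("red", "red")
    return (FILL[similarity], OUTLINE[clustering])
```

Blue and green outlines both count as successful clustering; only a red outline on a blue or yellow circle flags a disagreement between the Motif Atlas and the manual evaluation.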

Within each phylogenetic group we expect homologous motifs from conserved helices to be placed in the same or related motif groups. Exceptions would be some motifs from variable-length helices. In the small subunit of bacteria, helices 6, 9, 10, 16, 17, 26, 33b, and 44 are not conserved in length between T. thermophilus and E. coli. Between bacteria and eukaryotes, there are in addition differences in the lengths or structures of helices 21 and 33 in the SSU. In the LSU, the differences among the bacterial rRNAs are in helices 10, 16, 58, 59, 63, 68, 79, 84, and 98. When comparing bacteria with archaea, LSU helices 9, 16-18, 25, 55, 58, 59, 63, 66, 68, and 98 have different lengths, and motifs located there should be expected to vary. Based on comparison of 2D structures alone, helices 8, 9, 10, 25, 30, 31, 38, 52, 55, 58, 63, 78, 79, 86, and 98 are not conserved from the bacterial LSU to the eukaryotic LSU.

Overall, the Motif Atlas clusters the same motifs in the same family, with a few cases in which it clusters the same type of motif in related families. The Motif Atlas does not always put related motifs in the same group, but it does show which groups are related in structure; such groups differ in 1-2 interactions.

Figure III.3.1. A snapshot of the Motif Atlas validation table. Motifs shown in the same color, like the internal loop (IL_02359.4) found in H24 of the LSU, are similar and placed in the same motif group. The next internal loop, found in helix H25 of the LSU, varies among organisms due to variation in the helix length, so it is expected that the Motif Atlas does not cluster its instances together (see Figure ). Motif IL_78141.1, also located in H25, is found only in eukaryotes and does not have a counterpart in the bacterial LSU.

About half of the hairpin loops in the small subunit are conserved throughout the bacterial and eukaryotic ribosomes (Figure ). There are cases where the Motif Atlas does not put motifs in the same group, like the hairpin loops capping h12, h13, h15, h40 and h42, but it does show which groups are related in structure. A closer analysis reveals that these hairpin loops differ in 1-2 interactions, and the cutoff used by the Motif Atlas to cluster motif families places them in related families. The hairpin loops of helices 16, 39, 43, and 44 are conserved within the bacterial structures and within the eukaryotic structures, respectively, but they are different types of loops in the eukaryotic ribosome compared with the bacterial ribosome. The Motif Atlas clusters them accordingly, placing similar motifs in the same family and reflecting the evolutionary changes between Domains.

The large subunit presents a very similar situation (Figure and Figure ). Half of the hairpin loops are conserved among organisms and placed in the same corresponding families. A third of the hairpin loops found in the LSU vary between bacteria, archaea and eukaryotes, and the Motif Atlas clusters them accordingly. In the 5’ half of the LSU, HLs from helices 12, 22 and 25 differ between bacteria and eukaryotes, as reflected in Figure , and the Motif Atlas clusters the similar loops in the same or similar families, and different loops in different families. There are a few helices capped by different types of hairpin loops in different organisms: H9, H10, H39, H45, H78, and H79, reflecting the differences in helices between organisms. They are separated by the Motif Atlas into different families. HL25 has two instances of loops that are similar but classified in different motif families due to actual structural differences between the corresponding crystal structures.

The classification of internal loops by the Motif Atlas is done in a very similar manner. The conserved internal loops are classified accordingly by the Motif Atlas, in the same family, with the related motifs in related families. The variations between the non-conserved regions are reflected by classifying different types of motifs in different families. Yet there are cases in which the Motif Atlas classifies corresponding internal loops as belonging to different families because of the cutoff imposed. A motif is considered to be flanked by Watson-Crick basepairs, and when such basepairing occurs within an internal loop, the Motif Atlas separates the loop into two distinct motifs. This is a trade-off, since a more relaxed cutoff would allow related motifs to be classified as belonging to the same family.
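The cutoff trade-off can be illustrated with a toy example. The sketch below is not the Motif Atlas clustering procedure. It assumes hypothetical loop names and invented pairwise discrepancy values, and it simply links any two instances whose discrepancy falls below a chosen cutoff. The function name `group_by_cutoff` and both cutoff values are illustrative assumptions.

```python
def group_by_cutoff(names, discrepancy, cutoff):
    """Group items by linking every pair whose discrepancy is below `cutoff`.

    This is plain single-linkage grouping with a union-find structure,
    used here only to illustrate the effect of the cutoff.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in discrepancy.items():
        if value < cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical loop instances and invented discrepancy values.
names = ["loop_A", "loop_B", "loop_C", "loop_D"]
discrepancy = {
    ("loop_A", "loop_B"): 0.10,
    ("loop_A", "loop_C"): 0.45,
    ("loop_A", "loop_D"): 0.50,
    ("loop_B", "loop_C"): 0.40,
    ("loop_B", "loop_D"): 0.55,
    ("loop_C", "loop_D"): 0.15,
}

strict = group_by_cutoff(names, discrepancy, cutoff=0.3)
relaxed = group_by_cutoff(names, discrepancy, cutoff=0.6)
print(strict)   # two groups under the strict cutoff
print(relaxed)  # one merged group under the relaxed cutoff
```

With the stricter cutoff the four instances fall into two groups, and with the relaxed cutoff they merge into one. This mirrors the observation that related motifs may be split across families when the cutoff is tight.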

Generally, the Motif Atlas performs a very robust clustering of the motifs. Its automated analysis agrees closely with the manual analysis, yet takes a fraction of the time. In addition, the Motif Atlas outputs various files that aid the manual analysis, such as the motif structure, its basepair signature, lists of motifs classified in the same family or in related families, the sequence variants, and the discrepancy scores and discrepancy map between related motifs. Overall, the Motif Atlas is an excellent program suite for automatically analyzing and classifying RNA tertiary motifs.

LEGEND Motif Atlas validation for the hairpin loops in 16S

All instances are similar and all are placed in the same Motif Atlas group

All instances are similar and all are placed in similar Motif Atlas groups

All instances are similar and all are placed in different Motif Atlas groups

Some instances are similar and some are different. Similar ones are placed in the same Motif Atlas group

Some instances are similar and some are different. Similar ones are placed in similar Motif Atlas groups

Some instances are similar and some are different. Similar ones are placed in different Motif Atlas groups

All instances are different but some are placed in the same Motif Atlas group

All instances are different but some are placed in similar Motif Atlas groups

All instances are different and all are placed in different Motif Atlas groups

Large structural change in eukaryotes

Figure III.3.2. Comparative analysis of Motif Atlas classification of hairpin loops in the small subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.
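The legend categories combine two judgments: how similar the instances of a loop are across organisms, and how the Motif Atlas grouped them. The sketch below is one possible way to formalize that reading and is not the procedure used to build the figures, which were annotated manually. The loop type labels, group identifiers such as `HL_x`, and the rule for combining pairwise comparisons are all assumptions made for illustration.

```python
from itertools import combinations


def group_relation(g1, g2, related_groups):
    """Return 'same', 'similar', or 'different' for two group identifiers."""
    if g1 == g2:
        return "same"
    if frozenset((g1, g2)) in related_groups:
        return "similar"
    return "different"


def validation_category(manual_class, atlas_group, related_groups=frozenset()):
    """Map one helix to a (similarity, placement) legend category.

    manual_class:   organism -> loop type assigned by manual inspection
    atlas_group:    organism -> group identifier (hypothetical IDs here)
    related_groups: set of frozenset pairs of group IDs regarded as related
    """
    organisms = sorted(manual_class)
    if len(organisms) < 2:
        raise ValueError("need at least two instances to compare")
    pairs = list(combinations(organisms, 2))
    similar = [(a, b) for a, b in pairs if manual_class[a] == manual_class[b]]

    if len(similar) == len(pairs):
        similarity = "all similar"
    elif similar:
        similarity = "some similar"
    else:
        similarity = "all different"

    judged = similar if similar else pairs
    relations = {
        group_relation(atlas_group[a], atlas_group[b], related_groups)
        for a, b in judged
    }
    if similar:
        # Similar instances: report the weakest agreement among them.
        order = ["different", "similar", "same"]
    else:
        # All different: report whether any pair was still grouped together.
        order = ["same", "similar", "different"]
    placement = next(r for r in order if r in relations)
    return similarity, placement


organisms = ["T. thermophilus", "E. coli", "Tetrahymena", "S. cerevisiae"]

# A loop conserved in all four organisms and placed in one group.
conserved = validation_category(
    dict.fromkeys(organisms, "type_1"),
    dict.fromkeys(organisms, "HL_x"),
)

# A loop whose type differs between bacteria and eukaryotes.
domain_specific = validation_category(
    dict(zip(organisms, ["type_1", "type_1", "type_2", "type_2"])),
    dict(zip(organisms, ["HL_x", "HL_x", "HL_y", "HL_y"])),
)

# A conserved loop with one instance placed in a related group.
split = validation_category(
    dict.fromkeys(organisms, "type_1"),
    dict(zip(organisms, ["HL_x", "HL_x", "HL_x", "HL_z"])),
    related_groups={frozenset(("HL_x", "HL_z"))},
)
print(conserved, domain_specific, split)
```

Reporting the weakest agreement among similar instances is a conservative choice: a single instance placed in a related or different group is enough to move the helix out of the "same group" category.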

LEGEND Motif Atlas validation for the hairpin loops in 23S – 5’ (color categories as in Figure III.3.2)

Figure III.3.3. Comparative analysis of Motif Atlas classification of hairpin loops in the 5’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.

LEGEND Motif Atlas validation for the hairpin loops in 23S – 3’ (color categories as in Figure III.3.2)

Figure III.3.4. Comparative analysis of Motif Atlas classification of hairpin loops in the 3’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.

LEGEND Motif Atlas validation for the internal loops in 16S (color categories as in Figure III.3.2)

Figure III.3.5. Comparative analysis of Motif Atlas classification of internal loops in the small subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.

LEGEND Motif Atlas validation for the internal loops in 23S – 5’ (color categories as in Figure III.3.2)

Figure III.3.6. Comparative analysis of Motif Atlas classification of internal loops in the 5’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.

LEGEND Motif Atlas validation for the internal loops in 23S – 3’ (color categories as in Figure III.3.2)

Figure III.3.7. Comparative analysis of Motif Atlas classification of internal loops in the 3’ half of the large subunit of T. thermophilus, E. coli, Tetrahymena, and S. cerevisiae.

REFERENCES

Abu Almakarem, A. S., Petrov, A. I., Stombaugh, J., Zirbel, C. L., & Leontis, N. B. (2012). Comprehensive survey and geometric classification of base triples in RNA structures. Nucleic Acids Res, 40(4), 1407-1423. doi: 10.1093/nar/gkr810 Afonin, K. A., Cieply, D. J., & Leontis, N. B. (2008). Specific RNA self-assembly with minimal paranemic motifs. J Am Chem Soc, 130(1), 93-102. doi: 10.1021/ja071516m Afonin, K. A., Danilov, E. O., Novikova, I. V., & Leontis, N. B. (2008). TokenRNA: a new type of sequence-specific, label-free fluorescent biosensor for folded RNA molecules. Chembiochem, 9(12), 1902-1905. doi: 10.1002/cbic.200800183 Afonin, K. A., & Leontis, N. B. (2006). Generating new specific RNA interaction interfaces using C-loops. J Am Chem Soc, 128(50), 16131-16137. doi: 10.1021/ja064289h Agrawal, R. K., Spahn, C. M., Penczek, P., Grassucci, R. A., Nierhaus, K. H., & Frank, J. (2000). Visualization of tRNA movements on the Escherichia coli 70S ribosome during the elongation cycle. J Cell Biol, 150(3), 447-460. Allen, G. S., Zavialov, A., Gursky, R., Ehrenberg, M., & Frank, J. (2005). The cryo-EM structure of a translation initiation complex from Escherichia coli. Cell, 121(5), 703-712. doi: 10.1016/j.cell.2005.03.023 Altmann, R. (1949). [Not Available]. Z Kreislaufforsch, 38(13-14), 386-391. Altona, C. (1996). Classification of nucleic acid junctions. J Mol Biol, 263(4), 568-581. Aoudia, M., Guliaev, A. B., Leontis, N. B., & Rodgers, M. A. (2000). Self-assembled complexes of oligopeptides and metalloporphyrins: measurements of the reorganization and electronic interaction energies for photoinduced electron-transfer reactions. Biophys Chem, 83(2), 121- 140. Arienti, K. L., Brunmark, A., Axe, F. U., McClure, K., Lee, A., Blevitt, J., . . . Breitenbucher, J. G. (2005). Checkpoint kinase inhibitors: SAR and radioprotective properties of a series of 2- arylbenzimidazoles. J Med Chem, 48(6), 1873-1885. doi: 10.1021/jm0495935 Arrhenius, S. (1908). 
On the Danysz Effect. J Hyg (Lond), 8(1), 1-8. Astbury, W. T. (1945). The structural proteins of the cell. Biochem J, 39(5), lvi. Astbury, W. T. (1947). X-ray studies of nucleic acids. Symp Soc Exp Biol(1), 66-76. Astbury, W. T., Dickinson, S., & Bailey, K. (1935). The X-ray interpretation of denaturation and the structure of the seed globulins. Biochem J, 29(10), 2351-2360 2351. Auerbach, L. I. (1870). Das Jüdische Obligationenrecht V1: Umriss der Entwicklungsgeschichte des Jüdischen Rechts. . Kessinger Publishing, LLC., p. 646. Auffinger, P., & Hashem, Y. (2007). SwS: a solvation web service for nucleic acids. Bioinformatics. Avery, O. T., Macleod, C. M., & McCarty, M. (1944). Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types : Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type Iii. J Exp Med, 79(2), 137-158. Baker, S. J., Akama, T., Zhang, Y. K., Sauro, V., Pandit, C., Singh, R., . . . Maples, K. R. (2006). Identification of a novel boron-containing antibacterial agent (AN0128) with anti- inflammatory activity, for the potential treatment of cutaneous diseases. Bioorg Med Chem Lett, 16(23), 5963-5967. doi: 10.1016/j.bmcl.2006.08.130 138

Ban, N., Nissen, P., Hansen, J., Moore, P. B., & Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 A resolution. Science, 289(5481), 905-920. Ban, N., Nissen, P., Hansen, J., Moore, P. B., & Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 A resolution. Science, 289(5481), 905-920. Ban, S. H., Kwon, Y. R., Pandit, S., Lee, Y. S., Yi, H. K., & Jeon, J. G. (2010). Effects of a bio- assay guided fraction from Polygonum cuspidatum root on the viability, acid production and glucosyltranferase of mutans streptococci. Fitoterapia, 81(1), 30-34. doi: 10.1016/j.fitote.2009.06.019 Belozersky, A. N., & Paskhina, T. S. (1945). [Chemical nature of gramicidin C]. Biokhimiia, 10(4), 344-352. Belozersky, A. N., & Spirin, A. S. (1958). A correlation between the compositions of deoxyribonucleic and ribonucleic acids. Nature, 182(4628), 111-112. Ben-Shem, A., Garreau de Loubresse, N., Melnikov, S., Jenner, L., Yusupova, G., & Yusupov, M. (2011). The structure of the eukaryotic ribosome at 3.0 A resolution. Science, 334(6062), 1524-1529. doi: 10.1126/science.1212642 Beringer, M., & Rodnina, M. V. (2007). The ribosomal peptidyl transferase. Mol Cell, 26(3), 311-321. doi: 10.1016/j.molcel.2007.03.015 Besseova, I., Reblova, K., Leontis, N. B., & Sponer, J. (2010). Molecular dynamics simulations suggest that RNA three-way junctions can act as flexible RNA structural elements in the ribosome. Nucleic Acids Res, 38(18), 6247-6264. doi: 10.1093/nar/gkq414 Birmingham, A., Clemente, J. C., Desai, N., Gilbert, J., Gonzalez, A., Kyrpides, N., . . . Laederach, A. (2011). Meeting report of the RNA Ontology Consortium January 8-9, 2011. Stand Genomic Sci, 4(2), 252-256. doi: 10.4056/sigs.1724282 Blaha, G., Stanley, R. E., & Steitz, T. A. (2009). Formation of the first peptide bond: the structure of EF-P bound to the 70S ribosome. Science, 325(5943), 966-970. 
doi: 10.1126/science.1175800 Bolli, M., Micura, R., & Eschenmoser, A. (1997). Pyranosyl-RNA: chiroselective self-assembly of base sequences by ligative oligomerization of tetranucleotide-2',3'-cyclophosphates (with a commentary concerning the origin of biomolecular homochirality). Chem Biol, 4(4), 309- 320. Bonifield, H. R., Yamaguchi, S., & Hughes, K. T. (2000). The flagellar hook protein, FlgE, of Salmonella enterica serovar typhimurium is posttranscriptionally regulated in response to the stage of flagellar assembly. J Bacteriol, 182(14), 4044-4050. Brimacombe, R. (1991). RNA-protein interactions in the Escherichia coli ribosome. Biochimie, 73(7-8), 927-936. Brinster, R. L. (1974). The effect of cells transferred into the mouse blastocyst on subsequent development. J Exp Med, 140(4), 1049-1056. Brown, J. W., Birmingham, A., Griffiths, P. E., Jossinet, F., Kachouri-Lafond, R., Knight, R., . . . Westhof, E. (2009). The RNA structure alignment ontology. Rna, 15(9), 1623-1631. doi: 10.1261/rna.1601409 Cate, J. H., Gooding, A. R., Podell, E., Zhou, K., Golden, B. L., Kundrot, C. E., . . . Doudna, J. A. (1996). Crystal structure of a group I ribozyme domain: principles of RNA packing. Science, 273(5282), 1678-1685. Cate, J. H., Gooding, A. R., Podell, E., Zhou, K., Golden, B. L., Szewczak, A. A., . . . Doudna, J. A. (1996). RNA tertiary structure mediation by adenosine platforms. Science, 273(5282), 1696-1699. 139

Cech, T. R., Zaug, A. J., & Grabowski, P. J. (1981). In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell, 27(3 Pt 2), 487-496. Cedergren, R., Lang, B. F., & Gravel, D. (1988). The relationship between RNA catalytic processes. Orig Life Evol Biosph, 18(3), 299-305. Chargaff, E. (1946). A nucleoprotein from avian tubercle bacilli. Fed Proc, 5(1 Pt 2), 129. Chargaff, E., & Abel, G. (1934). On the mechanism of the formation of choleic acids. Biochem J, 28(5), 1901-1906. Chargaff, E., Zamenhof, S., & Green, C. (1950). Composition of human desoxypentose nucleic acid. Nature, 165(4202), 756-757. Cheetham, G. M., & Steitz, T. A. (2000). Insights into transcription: structure and function of single-subunit DNA-dependent RNA polymerases. Curr Opin Struct Biol, 10(1), 117-123. Chen, B., Shen, B., & Frank, J. (2014). Particle migration analysis in iterative classification of cryo-EM single-particle data. J Struct Biol, 188(3), 267-273. doi: 10.1016/j.jsb.2014.10.006 Clancy, J. L., Nousch, M., Rodnina, M., & Preiss, T. (2007). The ins and outs of translation. Genome Biol, 8(12), 321. doi: 10.1186/gb-2007-8-12-321 Coimbatore Narayanan, B., Westbrook, J., Ghosh, S., Petrov, A. I., Sweeney, B., Zirbel, C. L., . . . Berman, H. M. (2014). The Nucleic Acid Database: new features and capabilities. Nucleic Acids Res, 42(Database issue), D114-122. doi: 10.1093/nar/gkt980 Collaboration, S., Abelev, B. I., Aggarwal, M. M., Ahammed, Z., Alakhverdyants, A. V., Alekseev, I., . . . Zoulkarneeva, Y. (2010). Observation of an antimatter hypernucleus. Science, 328(5974), 58-62. doi: 10.1126/science.1183980 Costa, M., & Michel, F. (1997). Rules for RNA recognition of GNRA tetraloops deduced by in vitro selection: comparison with in vivo evolution. Embo J, 16(11), 3289-3302. Cruz, J. A., Blanchet, M. F., Boniecki, M., Bujnicki, J. M., Chen, S. J., Cao, S., . . . Westhof, E. (2012). 
RNA-Puzzles: a CASP-like evaluation of RNA three-dimensional structure prediction. Rna, 18(4), 610-625. doi: 10.1261/rna.031054.111 Csaszar, K., Spackova, N., Stefl, R., Sponer, J., & Leontis, N. B. (2001). Molecular dynamics of the frame-shifting pseudoknot from beet western yellows virus: the role of non-Watson-Crick base-pairing, ordered hydration, cation binding and base mutations on stability and unfolding. J Mol Biol, 313(5), 1073-1091. doi: 10.1006/jmbi.2001.5100 Dahm, R. (2005). Friedrich Miescher and the discovery of DNA. Dev Biol, 278(2), 274-288. doi: 10.1016/j.ydbio.2004.11.028 Dahm, R. (2008). Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum Genet, 122(6), 565-581. doi: 10.1007/s00439-007-0433-0 Darwin, C. (1859). On the origin of species by means of natural selection. Book. Das, R., Travers, K. J., Bai, Y., & Herschlag, D. (2005). Determining the Mg2+ stoichiometry for folding an RNA metal ion core. J Am Chem Soc, 127(23), 8272-8273. doi: 10.1021/ja051422h Decatur, W. A., & Fournier, M. J. (2002). rRNA modifications and ribosome function. Trends Biochem Sci, 27(7), 344-351. Dhingra, K., Sneige, N., Pandita, T. K., Johnston, D. A., Lee, J. S., Emami, K., . . . Hittelman, W. N. (1994). Quantitative analysis of chromosome in situ hybridization signal in paraffin- embedded tissue sections. Cytometry, 16(2), 100-112. doi: 10.1002/cyto.990160203 Diener, T. O., & Raymer, W. B. (1967). Potato spindle tuber virus: a plant virus with properties of a free nucleic acid. Science, 158(3799), 378-381. 140

Dorywalska, M., Blanchard, S. C., Gonzalez, R. L., Kim, H. D., Chu, S., & Puglisi, J. D. (2005). Site-specific labeling of the ribosome for single-molecule spectroscopy. Nucleic Acids Res, 33(1), 182-189. doi: 10.1093/nar/gki151 Dow, R. L., Schneider, S. R., Paight, E. S., Hank, R. F., Chiang, P., Cornelius, P., . . . DaSilva- Jardine, P. (2003). Discovery of a novel series of 6-azauracil-based thyroid receptor ligands: potent, TR beta subtype-selective thyromimetics. Bioorg Med Chem Lett, 13(3), 379-382. Eargle, J., Black, A. A., Sethi, A., Trabuco, L. G., & Luthey-Schulten, Z. (2008). Dynamics of Recognition between tRNA and elongation factor Tu. J Mol Biol, 377(5), 1382-1405. doi: 10.1016/j.jmb.2008.01.073 Elgavish, T., Cannone, J. J., Lee, J. C., Harvey, S. C., & Gutell, R. R. (2001). [email protected]: A:A and A:G base-pairs at the ends of 16 S and 23 S rRNA helices. J Mol Biol, 310(4), 735-753. Emelyanov, V. V. (2001). Evolutionary relationship of Rickettsiae and mitochondria. FEBS Lett, 501(1), 11-18. Ermolenko, D. N., Majumdar, Z. K., Hickerson, R. P., Spiegel, P. C., Clegg, R. M., & Noller, H. F. (2007). Observation of intersubunit movement of the ribosome in solution using FRET. J Mol Biol, 370(3), 530-540. doi: 10.1016/j.jmb.2007.04.042 Eschenmoser, A. (1997). Towards a chemical etiology of nucleic acid structure. Orig Life Evol Biosph, 27(5-6), 535-553. Franklin, R. E., & Gosling, R. G. (1953). Evidence for 2-chain helix in crystalline structure of sodium deoxyribonucleate. Nature, 172(4369), 156-157. Friedman, M., Caldarelli, D. D., Venkatesan, T. K., Pandit, R., & Lee, Y. (1996). Endoscopic sinus surgery with partial middle turbinate resection: effects on olfaction. Laryngoscope, 106(8), 977-981. Gagnon, M. G., Lin, J., Bulkley, D., & Steitz, T. A. (2014). Crystal structure of elongation factor 4 bound to a clockwise ratcheted ribosome. Science, 345(6197), 684-687. doi: 10.1126/science.1253525 Gagnon, M. G., & Steinberg, S. V. (2002). 
GU receptors of double helices mediate tRNA movement in the ribosome. Rna, 8(7), 873-877. Gan, H. H., Fera, D., Zorn, J., Shiffeldrim, N., Tang, M., Laserson, U., . . . Schlick, T. (2004). RAG: RNA-As-Graphs database--concepts, analysis, and features. Bioinformatics, 20(8), 1285-1291. Gao, H., Sengupta, J., Valle, M., Korostelev, A., Eswar, N., Stagg, S. M., . . . Frank, J. (2003). Study of the structural dynamics of the E coli 70S ribosome using real-space refinement. Cell, 113(6), 789-801. Geiduschek, E. P., & Haselkorn, R. (1969). Messenger RNA. Annu Rev Biochem, 38, 647-676. doi: 10.1146/annurev.bi.38.070169.003243 Gewirth, D. T., Abo, S. R., Leontis, N. B., & Moore, P. B. (1987). Secondary structure of 5S RNA: NMR experiments on RNA molecules partially labeled with nitrogen-15. Biochemistry, 26(16), 5213-5220. Gordon, J. W., & Ruddle, F. H. (1981). Integration and stable germ line transmission of genes injected into mouse pronuclei. Science, 214(4526), 1244-1246. Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., & Altman, S. (1983). The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35(3 Pt 2), 849-857. 141

Guliaev, A. B., & Leontis, N. B. (1999). Cationic 5,10,15,20-tetrakis(N-methylpyridinium-4- yl)porphyrin fully intercalates at 5'-CG-3' steps of duplex DNA in solution. Biochemistry, 38(47), 15425-15437. Gutell, R. R., Cannone, J. J., Shang, Z., Du, Y., & Serra, M. J. (2000). A story: unpaired adenosine bases in ribosomal RNAs. J Mol Biol, 304(3), 335-354. Haeckel, E. (1868). The History of Creation. Haeckel, E. (1947). [Not Available]. Schweiz Med Wochenschr, 77(19), 531. Hansen, J. L., Schmeing, T. M., Klein, D. J., Ippolito, J. A., Ban, N., Nissen, P., . . . Steitz, T. A. (2001). Progress toward an understanding of the structure and enzymatic mechanism of the large ribosomal subunit. Cold Spring Harb Symp Quant Biol, 66, 33-42. Hardy, P. A., & Zacharias, H. (2009). Walther Flemming on histology in medicine 1878: a newly discovered letter to his father. Ann Anat, 191(2), 171-185. doi: 10.1016/j.aanat.2009.01.002 Harrington, J. J., Van Bokkelen, G., Mays, R. W., Gustashaw, K., & Willard, H. F. (1997). Formation of de novo centromeres and construction of first-generation human artificial microchromosomes. Nat Genet, 15(4), 345-355. doi: 10.1038/ng0497-345 Havrila, M., Reblova, K., Zirbel, C. L., Leontis, N. B., & Sponer, J. (2013). Isosteric and nonisosteric base pairs in RNA motifs: molecular dynamics and bioinformatics study of the sarcin-ricin internal loop. J Phys Chem B, 117(46), 14302-14319. doi: 10.1021/jp408530w Helmink, B. A., Bredemeyer, A. L., Lee, B. S., Huang, C. Y., Sharma, G. G., Walker, L. M., . . . Sleckman, B. P. (2009). MRN complex function in the repair of chromosomal Rag-mediated DNA double-strand breaks. J Exp Med, 206(3), 669-679. doi: 10.1084/jem.20081326 Hendrix, D. K., Brenner, S. E., & Holbrook, S. R. (2005). RNA structural motifs: building blocks of a modular biomolecule. Q Rev Biophys, 38(3), 221-243. Hitchings, G. H., & Falco, E. A. (1944). The Identification of Guanine in Extracts of Girella Nigricans: The Specificity of Guanase. 
Proc Natl Acad Sci U S A, 30(10), 294-297. Holbrook, S. R. (2005). RNA structure: the long and the short of it. Curr Opin Struct Biol, 15(3), 302-308. Hoogstraten, C. G., & Sumita, M. (2007). Structure-function relationships in RNA and RNP enzymes: recent advances. Biopolymers, 87(5-6), 317-328. doi: 10.1002/bip.20836 Hoppe-Seyler, F. A., & Schummelfeder, N. (1946). [Not Available]. Z Naturforsch B, 1B, 696- 699. Hunt, C. R., Pandita, R. K., Laszlo, A., Higashikubo, R., Agarwal, M., Kitamura, T., . . . Pandita, T. K. (2007). Hyperthermia activates a subset of ataxia-telangiectasia mutated effectors independent of DNA strand breaks and heat shock protein 70 status. Cancer Res, 67(7), 3010-3017. doi: 10.1158/0008-5472.CAN-06-4328 Jaeger, L., & Leontis, N. B. (2000). Tecto-RNA: One-Dimensional Self-Assembly through Tertiary Interactions This work was carried out in Strasbourg with the support of grants to N.B.L. from the NIH (1R15 GM55898) and the NIH Fogarty Institute (1-F06-TW02251-01) and the support of the CNRS to L.J. The authors wish to thank Eric Westhof for his support and encouragement of this work. Angew Chem Int Ed Engl, 39(14), 2521-2524. Jaeger, L., Westhof, E., & Leontis, N. B. (2001). TectoRNA: modular assembly units for the construction of RNA nano-objects. Nucleic Acids Res, 29(2), 455-463. Jo, J. Y., Jeong, J. A., Pandit, S., Stern, J. E., Lee, S. K., Ryu, P. D., . . . Park, J. B. (2011). Neurosteroid modulation of benzodiazepine-sensitive GABAA tonic inhibition in supraoptic magnocellular neurons. Am J Physiol Regul Integr Comp Physiol, 300(6), R1578-1587. doi: 10.1152/ajpregu.00627.2010 142

Jones, M. E. (1953). Albrecht Kossel, a biographical sketch. Yale J Biol Med, 26(1), 80-97. Kavran, J. M., & Steitz, T. A. (2007). Structure of the base of the L7/L12 stalk of the Haloarcula marismortui large ribosomal subunit: analysis of L11 movements. J Mol Biol, 371(4), 1047- 1059. doi: 10.1016/j.jmb.2007.05.091 Kim, S. H., Quigley, G., Suddath, F. L., McPherson, A., Sneden, D., Kim, J. J., . . . Rich, A. (1973). X-ray crystallographic studies of polymorphic forms of yeast phenylalanine transfer RNA. J Mol Biol, 75(2), 421-428. Kim, S. H., Quigley, G. J., Suddath, F. L., McPherson, A., Sneden, D., Kim, J. J., . . . Rich, A. (1973). Three-dimensional structure of yeast phenylalanine transfer RNA: folding of the polynucleotide chain. Science, 179(4070), 285-288. Kim, Y., Eom, S. H., Wang, J., Lee, D. S., Suh, S. W., & Steitz, T. A. (1995). Crystal structure of Thermus aquaticus DNA polymerase. Nature, 376(6541), 612-616. doi: 10.1038/376612a0 Klein, D. J., Schmeing, T. M., Moore, P. B., & Steitz, T. A. (2001). The kink-turn: a new RNA secondary structure motif. Embo J, 20(15), 4214-4221. Klinge, S., Voigts-Hoffmann, F., Leibundgut, M., Arpagaus, S., & Ban, N. (2011). Crystal structure of the eukaryotic 60S ribosomal subunit in complex with initiation factor 6. Science, 334(6058), 941-948. doi: 10.1126/science.1211204 Klosterman, P. S., Hendrix, D. K., Tamura, M., Holbrook, S. R., & Brenner, S. E. (2004). Three- dimensional motifs from the SCOR, structural classification of RNA database: extruded strands, base triples, tetraloops and U-turns. Nucleic Acids Res, 32(8), 2342-2352. Konevega, A. L., Soboleva, N. G., Makhno, V. I., Peshekhonov, A. V., & Katunin, V. I. (2006). [The effect of modification of tRNA nucleotide-37 on the tRNA interaction with the P- and A-site of the 70S ribosome Escherichia coli]. Mol Biol (Mosk), 40(4), 669-683. Korostelev, A., Trakhanov, S., Laurberg, M., & Noller, H. F. (2006). 
Crystal structure of a 70S ribosome-tRNA complex reveals functional interactions and rearrangements. Cell, 126(6), 1065-1077. doi: 10.1016/j.cell.2006.08.032 Kothe, U., Paleskava, A., Konevega, A. L., & Rodnina, M. V. (2006). Single-step purification of specific tRNAs by hydrophobic tagging. Anal Biochem, 356(1), 148-150. doi: 10.1016/j.ab.2006.04.038 Krasilnikov, A. S., Yang, X., Pan, T., & Mondragon, A. (2003). Crystal structure of the specificity domain of ribonuclease P. Nature, 421(6924), 760-764. Kruger, K., Grabowski, P. J., Zaug, A. J., Sands, J., Gottschling, D. E., & Cech, T. R. (1982). Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31(1), 147-157. Kubarenko, A., Sergiev, P., Wintermeyer, W., Dontsova, O., & Rodnina, M. V. (2006). Involvement of helix 34 of 16 S rRNA in decoding and translocation on the ribosome. J Biol Chem, 281(46), 35235-35244. doi: 10.1074/jbc.M608060200 Lazcano, A., & Miller, S. L. (1996). The origin and early evolution of life: prebiotic chemistry, the pre-RNA world, and time. Cell, 85(6), 793-798. LeBarron, J., Mitra, K., & Frank, J. (2007). Displaying 3D data on RNA secondary structures: coloRNA. J Struct Biol, 157(1), 262-270. doi: 10.1016/j.jsb.2006.08.018 Lee, B. C., Pandit, A., Croonquist, P. A., & Hoff, W. D. (2001). Folding and signaling share the same pathway in a photoreceptor. Proc Natl Acad Sci U S A, 98(16), 9062-9067. doi: 10.1073/pnas.111153598 Lee, J. C., Cannone, J. J., & Gutell, R. R. (2003). The lonepair triloop: a new motif in RNA structure. J Mol Biol, 325(1), 65-83. 143

Lee, J. W., Weiner, R. S., Sailstad, J. M., Bowsher, R. R., Knuth, D. W., O'Brien, P. J., . . . Wagner, J. A. (2005). Method validation and measurement of biomarkers in nonclinical and clinical samples in drug development: a conference report. Pharm Res, 22(4), 499-511. doi: 10.1007/s11095-005-2495-9 Lee, W., Jiang, Z., Liu, J., Haverty, P. M., Guan, Y., Stinson, J., . . . Zhang, Z. (2010). The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature, 465(7297), 473-477. doi: 10.1038/nature09004 Lee-Hoeflich, S. T., Pham, T. Q., Dowbenko, D., Munroe, X., Lee, J., Li, L., . . . Stern, H. M. (2011). PPM1H is a p27 phosphatase implicated in trastuzumab resistance. Cancer Discov, 1(4), 326-337. doi: 10.1158/2159-8290.CD-11-0062 Lemieux, S., & Major, F. (2006). Automated extraction and classification of RNA tertiary structure cyclic motifs. Nucleic Acids Res, 34(8), 2340-2346. Leontis, N., Sweeney, B., Haque, F., & Guo, P. (2013). Conference Scene: Advances in RNA nanotechnology promise to transform medicine. Nanomedicine (Lond), 8(7), 1051-1054. doi: 10.2217/nnm.13.105 Leontis, N. B., Altman, R. B., Berman, H. M., Brenner, S. E., Brown, J. W., Engelke, D. R., . . . Westhof, E. (2006). The RNA Ontology Consortium: an open invitation to the RNA community. Rna, 12(4), 533-541. Leontis, N. B., Lescoute, A., & Westhof, E. (2006). The building blocks and motifs of RNA architecture. Curr Opin Struct Biol, 16(3), 279-287. Leontis, N. B., Stombaugh, J., & Westhof, E. (2002). Motif prediction in ribosomal RNAs Lessons and prospects for automated motif prediction in homologous RNA molecules. Biochimie, 84(9), 961-973. Leontis, N. B., Stombaugh, J., & Westhof, E. (2002). The non-Watson-Crick base pairs and their associated isostericity matrices. Nucleic Acids Res, 30(16), 3497-3531. Leontis, N. B., & Westhof, E. (1998). Conserved geometrical base-pairing patterns in RNA. Q Rev Biophys, 31(4), 399-455. Leontis, N. B., & Westhof, E. (2001). 
Geometric nomenclature and classification of RNA base pairs. Rna, 7(4), 499-512. Leontis, N. B., & Westhof, E. (2002). The Annotation of RNA Motifs. Comparative and Functional Genomics, 3(6), 518-524. Leontis, N. B., & Westhof, E. (2003). Analysis of RNA motifs. Curr Opin Struct Biol, 13(3), 300-308. Lescoute, A., Leontis, N. B., Massire, C., & Westhof, E. (2005). Recurrent structural RNA motifs, Isostericity Matrices and sequence alignments. Nucleic Acids Res, 33(8), 2395-2409. Lescoute, A., & Westhof, E. (2006). The interaction networks of structured RNAs. Nucleic Acids Res, 34(22), 6587-6604. Lescoute, A., & Westhof, E. (2006). Topology of three-way junctions in folded RNAs. Rna, 12(1), 83-93. Levene, P. A., & La Forge, F. B. (1915). On Chondrosamine. Proc Natl Acad Sci U S A, 1(4), 190-191. Levy, M., & Miller, S. L. (1999). The prebiotic synthesis of modified purines and their potential role in the RNA world. J Mol Evol, 48(6), 631-637. Li, W., Agirrezabala, X., Lei, J., Bouakaz, L., Brunelle, J. L., Ortiz-Meoz, R. F., . . . Frank, J. (2008). Recognition of aminoacyl-tRNA: a common molecular mechanism revealed by cryo- EM. Embo J, 27(24), 3322-3331. doi: 10.1038/emboj.2008.243 144

Li, W., & Frank, J. (2007). Transfer RNA in the hybrid P/E state: correlating molecular dynamics simulations with cryo-EM data. Proc Natl Acad Sci U S A, 104(42), 16540-16545. doi: 10.1073/pnas.0708094104
Li, Z., Pandit, S., & Deutscher, M. P. (1998). Polyadenylation of stable RNA precursors in vivo. Proc Natl Acad Sci U S A, 95(21), 12158-12162.
Li, Z., Pandit, S., & Deutscher, M. P. (1999). Maturation of 23S ribosomal RNA requires the exoribonuclease RNase T. Rna, 5(1), 139-146.
Li, Z., Pandit, S., & Deutscher, M. P. (1999). RNase G (CafA protein) and RNase E are both required for the 5' maturation of 16S ribosomal RNA. Embo J, 18(10), 2878-2885. doi: 10.1093/emboj/18.10.2878
Lisi, V., & Major, F. (2007). A comparative analysis of the triloops in all high-resolution RNA structures reveals sequence structure relationships. Rna, 13(9), 1537-1545.
Lo, P. K., Lee, J. S., Liang, X., Han, L., Mori, T., Fackler, M. J., . . . Sukumar, S. (2010). Epigenetic inactivation of the potential tumor suppressor gene FOXF1 in breast cancer. Cancer Res, 70(14), 6047-6058. doi: 10.1158/0008-5472.CAN-10-1576
Lomakin, I. B., Shirokikh, N. E., Yusupov, M. M., Hellen, C. U., & Pestova, T. V. (2006). The fidelity of translation initiation: reciprocal activities of eIF1, IF3 and YciH. Embo J, 25(1), 196-210. doi: 10.1038/sj.emboj.7600904
Mathews, D. H., & Turner, D. H. (2006). Prediction of RNA secondary structure by free energy minimization. Curr Opin Struct Biol, 16(3), 270-278.
Miescher, F. (1957). [Behavior of bovine erythrocytes in the circulation of cats]. Helv Physiol Pharmacol Acta, 15(4), 485-490.
Miescher, F. (1960). [A new incidence of hereditary double albuminemia]. Schweiz Med Wochenschr, 90, 1273-1274.
Mokdad, A., Krasovska, M. V., Sponer, J., & Leontis, N. B. (2006). Structural and evolutionary classification of G/U wobble basepairs in the ribosome. Nucleic Acids Res, 34(5), 1326-1341.
Myasnikov, A. G., Marzi, S., Simonetti, A., Giuliodori, A. M., Gualerzi, C. O., Yusupova, G., . . . Klaholz, B. P. (2005). Conformational transition of initiation factor 2 from the GTP- to GDP-bound state visualized on the ribosome. Nat Struct Mol Biol, 12(12), 1145-1149. doi: 10.1038/nsmb1012
Nagaswamy, U., & Fox, G. E. (2002). Frequent occurrence of the T-loop RNA folding motif in ribosomal RNAs. Rna, 8(9), 1112-1119.
Nasalean, L., Baudrey, S., Leontis, N. B., & Jaeger, L. (2006). Controlling RNA self-assembly to form filaments. Nucleic Acids Res, 34(5), 1381-1392. doi: 10.1093/nar/gkl008
Nasalean, L., Stombaugh, J., & Leontis, N. B. (In Preparation).
Nissen, P., Hansen, J., Ban, N., Moore, P. B., & Steitz, T. A. (2000). The structural basis of ribosome activity in peptide bond synthesis. Science, 289(5481), 920-930.
Nissen, P., Ippolito, J. A., Ban, N., Moore, P. B., & Steitz, T. A. (2001). RNA tertiary interactions in the large ribosomal subunit: the A-minor motif. Proc Natl Acad Sci U S A, 98(9), 4899-4903.
Noller, H. F., Kop, J., Wheaton, V., Brosius, J., Gutell, R. R., Kopylov, A. M., . . . Woese, C. R. (1981). Secondary structure model for 23S ribosomal RNA. Nucleic Acids Res, 9(22), 6167-6189.
Noller, H. F., & Woese, C. R. (1981). Secondary structure of 16S ribosomal RNA. Science, 212(4493), 403-411.

Nonin-Lecomte, S., Felden, B., & Dardel, F. (2006). NMR structure of the Aquifex aeolicus tmRNA pseudoknot PK1: new insights into the recoding event of the ribosomal trans-translation. Nucleic Acids Res, 34(6), 1847-1853. doi: 10.1093/nar/gkl111
Novikova, I. V., Hassan, B. H., Mirzoyan, M. G., & Leontis, N. B. (2011). Engineering cooperative tecto-RNA complexes having programmable stoichiometries. Nucleic Acids Res, 39(7), 2903-2917. doi: 10.1093/nar/gkq1231
O'Farrell, F., Esfahani, S. S., Engstrom, Y., & Kylsten, P. (2008). Regulation of the Drosophila lin-41 homologue dappled by let-7 reveals conservation of a regulatory mechanism within the LIN-41 subclade. Dev Dyn, 237(1), 196-208. doi: 10.1002/dvdy.21396
Pace, N. R., Thomas, B. C., & Woese, C. R. (1999). Probing RNA Structure, Function, and History by Comparative Analysis. In R. F. Gesteland, T. R. Cech, & J. F. Atkins (Eds.), The RNA World (2nd ed., pp. 113-141). Cold Spring Harbor: Cold Spring Harbor Laboratory Press.
Palade, G. E. (1955). A small particulate component of the cytoplasm. J Biophys Biochem Cytol, 1(1), 59-68.
Palade, G. E. (1955). Studies on the endoplasmic reticulum. II. Simple dispositions in cells in situ. J Biophys Biochem Cytol, 1(6), 567-582.
Pandit, A., Aryal, M. R., Aryal Pandit, A., Hakim, F. A., Giri, S., Mainali, N. R., . . . Mookadam, F. (2014). Preventive PCI versus culprit lesion stenting during primary PCI in acute STEMI: a systematic review and meta-analysis. Open Heart, 1(1), e000012. doi: 10.1136/openhrt-2013-000012
Pandit, A., Aryal, M. R., Aryal Pandit, A., Jalota, L., Hakim, F. A., Mookadam, F., . . . Tleyjeh, I. M. (2014). Cangrelor versus clopidogrel in percutaneous coronary intervention: a systematic review and meta-analysis. EuroIntervention, 9(11), 1350-1358. doi: 10.4244/EIJV9I11A226
Pandit, A., Aryal, M. R., Pandit, A. A., Jalota, L., Kantharajpur, S., Hakim, F. A., & Lee, H. R. (2014). Amplatzer PFO occluder device may prevent recurrent stroke in patients with patent foramen ovale and cryptogenic stroke: a meta-analysis of randomised trials. Heart Lung Circ, 23(4), 303-308. doi: 10.1016/j.hlc.2013.12.003
Pandit, A., Mookadam, F., Boddu, S., Aryal Pandit, A., Tandar, A., Chaliki, H., . . . Lee, H. R. (2014). Vitamin D levels and left ventricular diastolic function. Open Heart, 1(1), e000011. doi: 10.1136/openhrt-2013-000011
Pandit, A. S., Expert, P., Lambiotte, R., Bonnelle, V., Leech, R., Turkheimer, F. E., & Sharp, D. J. (2013). Traumatic brain injury impairs small-world topology. Neurology, 80(20), 1826-1833. doi: 10.1212/WNL.0b013e3182929f38
Pandit, D., Tuske, S. J., Coales, S. J., E, S. Y., Liu, A., Lee, J. E., . . . Hamuro, Y. (2012). Mapping of discontinuous conformational epitopes by amide hydrogen/deuterium exchange mass spectrometry and computational docking. J Mol Recognit, 25(3), 114-124. doi: 10.1002/jmr.1169
Pandit, R. P., & Lee, Y. R. (2014). Efficient one-pot synthesis of novel and diverse tetrahydroquinolines bearing pyranopyrazoles using organocatalyzed domino Knoevenagel/hetero Diels-Alder reactions. Mol Divers, 18(1), 39-50. doi: 10.1007/s11030-013-9482-6
Pandit, R. P., & Lee, Y. R. (2014). Novel one-pot synthesis of diverse gamma,delta-unsaturated beta-ketoesters by thermal cascade reactions of diazodicarbonyl compounds and enol ethers: transformation into substituted 3,5-diketoesters. Org Biomol Chem, 12(25), 4407-4411. doi: 10.1039/c4ob00664j
Pandit, S., Jeong, J. A., Jo, J. Y., Cho, H. S., Kim, D. W., Kim, J. M., . . . Park, J. B. (2013). Dual mechanisms diminishing tonic GABAA inhibition of dentate gyrus granule cells in Noda epileptic rats. J Neurophysiol, 110(1), 95-102. doi: 10.1152/jn.00727.2012
Pandit, S., Kim, G. R., Lee, M. H., & Jeon, J. G. (2011). Evaluation of Streptococcus mutans biofilms formed on fluoride releasing and non fluoride releasing resin composites. J Dent, 39(11), 780-787. doi: 10.1016/j.jdent.2011.08.010
Pandit, S., Song, J. G., Kim, Y. J., Jeong, J. A., Jo, J. Y., Lee, G. S., . . . Park, J. B. (2014). Attenuated benzodiazepine-sensitive tonic GABAA currents of supraoptic magnocellular neuroendocrine cells in 24-h water-deprived rats. J Neuroendocrinol, 26(1), 26-34. doi: 10.1111/jne.12123
Patterson, J. T., & Stone, W. S. (1935). Some Observations on the Structure of the Scute-8 Chromosome of Drosophila Melanogaster. , 20(2), 172-178.
Pauling, L. (1935). The Oxygen Equilibrium of Hemoglobin and Its Structural Interpretation. Proc Natl Acad Sci U S A, 21(4), 186-191.
Pauling, L. (1968). Orthomolecular psychiatry. Varying the concentrations of substances normally present in the human body may control mental disease. Science, 160(3825), 265-271.
Pauling, L., & Corey, R. B. (1951). The structure of hair, muscle, and related proteins. Proc Natl Acad Sci U S A, 37(5), 261-271.
Pauling, L., & Corey, R. B. (1953). A Proposed Structure For The Nucleic Acids. Proc Natl Acad Sci U S A, 39(2), 84-97.
Pauling, L., & Corey, R. B. (1953). Stable configurations of polypeptide chains. Proc R Soc Lond B Biol Sci, 141(902), 21-33.
Pauling, L., & Corey, R. B. (1953). Structure of the nucleic acids. Nature, 171(4347), 346.
Pauling, L., Itano, H. A., & et al. (1949). Sickle cell anemia a molecular disease. Science, 110(2865), 543-548.
Perez-Alvarado, G. C., Martinez-Yamout, M., Allen, M. M., Grosschedl, R., Dyson, H. J., & Wright, P. E. (2003). Structure of the nuclear factor ALY: insights into post-transcriptional regulatory and mRNA nuclear export processes. Biochemistry, 42(24), 7348-7357. doi: 10.1021/bi034062o
Petrov, A. I., Zirbel, C. L., & Leontis, N. B. (2011). WebFR3D--a server for finding, aligning and analyzing recurrent RNA 3D motifs. Nucleic Acids Res, 39(Web Server issue), W50-55. doi: 10.1093/nar/gkr249
Petrov, A. I., Zirbel, C. L., & Leontis, N. B. (2013). Automated classification of RNA 3D motifs and the RNA 3D Motif Atlas. Rna, 19(10), 1327-1340. doi: 10.1261/rna.039438.113
Piekna-Przybylska, D., Decatur, W. A., & Fournier, M. J. (2008). The 3D rRNA modification maps database: with interactive tools for ribosome analysis. Nucleic Acids Res, 36(Database issue), D178-183. doi: 10.1093/nar/gkm855
Powers, T., Stern, S., Changchien, L. M., & Noller, H. F. (1988). Probing the assembly of the 3' major domain of 16 S rRNA. Interactions involving ribosomal proteins S2, S3, S10, S13 and S14. J Mol Biol, 201(4), 697-716.
Qiu, M., Khisamutdinov, E., Zhao, Z., Pan, C., Choi, J. W., Leontis, N. B., & Guo, P. (2013). RNA nanotechnology for computer design and in vivo computation. Philos Trans A Math Phys Eng Sci, 371(2000), 20120310. doi: 10.1098/rsta.2012.0310

Rabl, J., Leibundgut, M., Ataide, S. F., Haag, A., & Ban, N. (2011). Crystal structure of the eukaryotic 40S ribosomal subunit in complex with initiation factor 1. Science, 331(6018), 730-736. doi: 10.1126/science.1198308
Randall, P. A., Lee, C. A., Nunes, E. J., Yohn, S. E., Nowak, V., Khan, B., . . . Salamone, J. D. (2014). The VMAT-2 inhibitor tetrabenazine affects effort-related decision making in a progressive ratio/chow feeding choice task: reversal with antidepressant drugs. PLoS One, 9(6), e99320. doi: 10.1371/journal.pone.0099320
Rangan, P., Masquida, B., Westhof, E., & Woodson, S. A. (2003). Assembly of core helices and rapid tertiary folding of a small bacterial group I ribozyme. Proc Natl Acad Sci U S A, 100(4), 1574-1579.
Riaz, I. B., Husnain, M., Riaz, H., Asawaeer, M., Bilal, J., Pandit, A., . . . Lee, K. S. (2014). Meta-analysis of revascularization versus medical therapy for atherosclerotic renal artery stenosis. Am J Cardiol, 114(7), 1116-1123. doi: 10.1016/j.amjcard.2014.06.033
Rich, A., Davies, D. R., Crick, F. H., & Watson, J. D. (1961). The molecular structure of polyadenylic acid. J Mol Biol, 3, 71-86.
Richardson, J. S., Schneider, B., Murray, L. W., Kapral, G. J., Immormino, R. M., Headd, J. J., . . . Berman, H. M. (2008). RNA backbone: Consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). Rna.
Ridanpaa, M., Ward, L. M., Rockas, S., Sarkioja, M., Makela, H., Susic, M., . . . Makitie, O. (2003). Genetic changes in the RNA components of RNase MRP and RNase P in Schmid metaphyseal chondrodysplasia. J Med Genet, 40(10), 741-746.
Sarver, M., Zirbel, C. L., Stombaugh, J., Mokdad, A., & Leontis, N. B. (2008). FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J Math Biol, 56(1-2), 215-252.
Savelsbergh, A., Katunin, V. I., Mohr, D., Peske, F., Rodnina, M. V., & Wintermeyer, W. (2003). An elongation factor G-induced ribosome rearrangement precedes tRNA-mRNA translocation. Mol Cell, 11(6), 1517-1523.
Schlessinger, D., Bolla, R. I., Sirdeshmukh, R., & Thomas, J. R. (1985). Spacers and processing of large ribosomal RNAs in Escherichia coli and mouse cells. Bioessays, 3(1), 14-18. doi: 10.1002/bies.950030105
Seeman, N. C. (1998). DNA nanotechnology: novel DNA constructions. Annu Rev Biophys Biomol Struct, 27, 225-248.
Sen, M., Thomas, S. M., Kim, S., Yeh, J. I., Ferris, R. L., Johnson, J. T., . . . Grandis, J. R. (2012). First-in-human trial of a STAT3 decoy oligonucleotide in head and neck tumors: implications for cancer therapy. Cancer Discov, 2(8), 694-705. doi: 10.1158/2159-8290.CD-12-0191
Shi, Y., Cao, J., Gao, J., Zheng, L., Goodwin, A., An, C. H., . . . Morse, D. (2012). Retinoic acid-related orphan receptor-alpha is induced in the setting of DNA damage and promotes pulmonary emphysema. Am J Respir Crit Care Med, 186(5), 412-419. doi: 10.1164/rccm.201111-2023OC
Singh, N., Lee, J. Z., Huang, J. J., Low, S. W., Howe, C., Pandit, A., . . . Lee, K. S. (2014). Benefit of statin pretreatment in prevention of contrast-induced nephropathy in different adult patient population: systematic review and meta-analysis. Open Heart, 1(1), e000127. doi: 10.1136/openhrt-2014-000127

Skeggs, P. A., Thompson, J., & Cundliffe, E. (1985). Methylation of 16S ribosomal RNA and resistance to aminoglycoside antibiotics in clones of Streptomyces lividans carrying DNA from Streptomyces tenjimariensis. Mol Gen Genet, 200(3), 415-421.
Slayter, H. S., Warner, J. R., Rich, A., & Hall, C. E. (1963). The Visualization of Polyribosomal Structure. J Mol Biol, 7, 652-657.
Somanchi, A., & Mayfield, S. P. (1999). Nuclear-chloroplast signalling. Curr Opin Plant Biol, 2(5), 404-409.
Sponer, J., Mokdad, A., Sponer, J. E., Spackova, N., Leszczynski, J., & Leontis, N. B. (2003). Unique tertiary and neighbor interactions determine conservation patterns of Cis Watson-Crick A/G base-pairs. J Mol Biol, 330(5), 967-978.
Sponer, J., Sponer, J. E., Petrov, A. I., & Leontis, N. B. (2010). Quantum chemical studies of nucleic acids: can we construct a bridge to the RNA structural biology and bioinformatics communities? J Phys Chem B, 114(48), 15723-15741. doi: 10.1021/jp104361m
Stefl, R., & Allain, F. H. (2005). A novel RNA pentaloop fold involved in targeting ADAR2. Rna, 11(5), 592-597.
Steitz, J. A., & Tycowski, K. T. (1995). Small RNA chaperones for ribosome biogenesis. Science, 270(5242), 1626-1627.
Steitz, T. A. (2010). From the structure and function of the ribosome to new antibiotics (Nobel Lecture). Angew Chem Int Ed Engl, 49(26), 4381-4398. doi: 10.1002/anie.201000708
Sternberg, C. N., Davis, I. D., Mardiak, J., Szczylik, C., Lee, E., Wagstaff, J., . . . Hawkins, R. E. (2010). Pazopanib in locally advanced or metastatic renal cell carcinoma: results of a randomized phase III trial. J Clin Oncol, 28(6), 1061-1068. doi: 10.1200/JCO.2009.23.9764
Stombaugh, J. (2004). Developing Isostericity Matrices: A Tool for RNA Structural Alignment. M.S. Thesis.
Stombaugh, J., Zirbel, C. L., Westhof, E., & Leontis, N. B. (2009). Frequency and isostericity of RNA base pairs. Nucleic Acids Res, 37(7), 2294-2312. doi: 10.1093/nar/gkp011
Stombaugh, J., Zirbel, C. L., Westhof, E., & Leontis, N. B. (In Preparation). Systematic Evaluation of RNA Basepair Isostericity Matrices.
St-Onge, K., Thibault, P., Hamel, S., & Major, F. (2007). Modeling RNA tertiary structure motifs by graph-grammars. Nucleic Acids Res, 35(5), 1726-1736.
Strobel, S. A. (1999). A chemogenetic approach to RNA function/structure analysis. Curr Opin Struct Biol, 9(3), 346-352. doi: 10.1016/S0959-440X(99)80046-3
Strobel, S. A., Adams, P. L., Stahley, M. R., & Wang, J. (2004). RNA kink turns to the left and to the right. Rna, 10(12), 1852-1854.
Sykes, M. T., & Levitt, M. (2005). Describing RNA structure by libraries of clustered nucleotide doublets. J Mol Biol, 351(1), 26-38.
Tamura, M., & Holbrook, S. R. (2002). Sequence and structural conservation in RNA ribose zippers. J Mol Biol, 320(3), 455-474.
Thirumalai, D., Lee, N., Woodson, S. A., & Klimov, D. (2001). Early events in RNA folding. Annu Rev Phys Chem, 52, 751-762.
Thirumalai, D., & Woodson, S. A. (1996). Kinetics of Folding of Proteins and RNA. Acc Chem Res, 29(9), 433-439.
Thompson, J., Skeggs, P. A., & Cundliffe, E. (1985). Methylation of 16S ribosomal RNA and resistance to the aminoglycoside antibiotics gentamicin and kanamycin determined by DNA from the gentamicin-producer, Micromonospora purpurea. Mol Gen Genet, 201(2), 168-173.

Triman, K. L. (1999). Mutational analysis of 23S ribosomal RNA structure and function in Escherichia coli. Adv Genet, 41, 157-195.
Triman, K. L. (2007). Mutational analysis of the ribosome. Adv Genet, 58, 89-119. doi: 10.1016/S0065-2660(06)58004-6
VanLoock, M. S., Agrawal, R. K., Gabashvili, I. S., Qi, L., Frank, J., & Harvey, S. C. (2000). Movement of the decoding region of the 16 S ribosomal RNA accompanies tRNA translocation. J Mol Biol, 304(4), 507-515. doi: 10.1006/jmbi.2000.4213
Warner, J. R., Knopf, P. M., & Rich, A. (1963). A multiple ribosomal structure in protein synthesis. Proc Natl Acad Sci U S A, 49, 122-129.
Watson, J. D., & Crick, F. H. (1953). Genetical implications of the structure of deoxyribonucleic acid. Nature, 171(4361), 964-967.
Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356), 737-738.
Watson, J. D., & Crick, F. H. (1953). The structure of DNA. Cold Spring Harb Symp Quant Biol, 18, 123-131.
Weaver, C. H. (1937). Concerning a proposed federal cancer law. Cal West Med, 46(6), 442.
Weinstein, I. B., Friedman, S. M., & Ochoa, M., Jr. (1966). Fidelity during translation of the genetic code. Cold Spring Harb Symp Quant Biol, 31, 671-681.
Wilkins, M. H., Seeds, W. E., Stokes, A. R., & Wilson, H. R. (1953). Helical structure of crystalline deoxypentose nucleic acid. Nature, 172(4382), 759-762.
Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Jr., Morgan-Warren, R. J., Carter, A. P., Vonrhein, C., . . . Ramakrishnan, V. (2000). Structure of the 30S ribosomal subunit. Nature, 407(6802), 327-339.
Woese, C. R., Dugre, D. H., Saxinger, W. C., & Dugre, S. A. (1966). The molecular basis for the genetic code. Proc Natl Acad Sci U S A, 55(4), 966-974.
Woese, C. R., Winker, S., & Gutell, R. R. (1990). Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops". Proc Natl Acad Sci U S A, 87(21), 8467-8471.
Wriggers, W., Agrawal, R. K., Drew, D. L., McCammon, A., & Frank, J. (2000). Domain motions of EF-G bound to the 70S ribosome: insights from a hand-shaking between multi-resolution structures. Biophys J, 79(3), 1670-1678. doi: 10.1016/S0006-3495(00)76416-2
Yusupova, G., & Yusupov, M. (2014). High-resolution structure of the eukaryotic 80S ribosome. Annu Rev Biochem, 83, 467-486. doi: 10.1146/annurev-biochem-060713-035445
Zaug, A. J., & Cech, T. R. (1982). The intervening sequence excised from the ribosomal RNA precursor of Tetrahymena contains a 5-terminal guanosine residue not encoded by the DNA. Nucleic Acids Res, 10(9), 2823-2838.
Zhou, H., Pandit, S. B., Lee, S. Y., Borreguero, J., Chen, H., Wroblewska, L., & Skolnick, J. (2007). Analysis of TASSER-based CASP7 protein structure prediction results. Proteins, 69 Suppl 8, 90-97. doi: 10.1002/prot.21649
Zhuang, X., Bartley, L. E., Babcock, H. P., Russell, R., Ha, T., Herschlag, D., & Chu, S. (2000). A single-molecule study of RNA catalysis and folding. Science, 288(5473), 2048-2051.
Zirbel, C. L., Sponer, J. E., Sponer, J., Stombaugh, J., & Leontis, N. B. (2009). Classification and energetics of the base-phosphate interactions in RNA. Nucleic Acids Res, 37(15), 4898-4918. doi: 10.1093/nar/gkp468