UNIVERSITY OF CALIFORNIA

SANTA CRUZ

CHARACTERIZATION OF ARCHAEAL SPECIES THROUGH RNASE P AND TRANSFER RNAS

A dissertation submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

BIOINFORMATICS

by

Patricia Pak Lee Chan

December 2010

The Dissertation of Patricia P. L. Chan is approved:

______Professor Todd M. J. Lowe, Chair

______Professor David Haussler

______Professor Karen M. Ottemann

______Professor Manuel Ares, Jr.

______Tyrus Miller, Vice Provost and Dean of Graduate Studies

Copyright © by

Patricia Pak Lee Chan

2010

Table of Contents

List of Figures
List of Tables
Abstract
Acknowledgments

Chapter 1 Introduction
1.1 Archaea, the third domain of life
1.2 A mix of bacterial and eukaryotic features
1.3 Ribonuclease P, a “nearly” universal ribozyme
1.4 Disrupted transfer RNAs – how common are they?
1.5 Atypical genes in nature

Chapter 2 Transcriptional and Translational Signal Detection in Archaea
2.1 Introduction
2.2 Results
2.2.1 Absence of transcription factor B recognition element
2.2.2 Low conservation of Shine-Dalgarno motifs in , Cenarchaeum, and Nanoarchaeum
2.2.3 A mixture of leadered and leaderless transcripts in and Euryarchaeota
2.2.4 Highly variable 5′ UTR length in methanogens
2.2.5 Evidence of gene coordinate mis-annotations
2.2.6 Internal promoters leading to two modes of transcription
2.2.7 Search for Shine-Dalgarno-less 5′ UTRs
2.3 Discussion
2.4 Materials and Methods

Chapter 3 Discovery of a minimal form of RNase P in Pyrobaculum
3.1 Abstract
3.2 Introduction
3.3 Results and Discussion
3.3.1 Pre-tRNAs in Pyrobaculum have 5ʹ leaders
3.3.2 cell extract processes 5ʹ leader from pre-tRNA
3.3.3 Evidence for three out of four known archaeal RNase P proteins in Pyrobaculum
3.3.4 Discovery and in vitro activity of the minimized Pyrobaculum RNase P RNA
3.3.5 Phylogenetic distribution of the minimized form of RNase P RNA
3.3.6 Search for RNase P RNA in Aquifex and Related Species
3.4 Conclusions
3.5 Materials and Methods
3.6 Author Contributions
3.7 Acknowledgments

Chapter 4 Modeling the RNase P RNA
4.1 Introduction
4.2 Results and Discussion
4.2.1 Common features of type T RNase P RNAs
4.2.2 Type T RNase P RNA variants
4.2.3 Search with type T RNase P RNA covariance model
4.2.4 Type M RNase P RNA variants
4.3 Conclusions
4.4 Materials and Methods

Chapter 5 Discovery of Permuted and Recently Split Transfer RNAs in Archaea
5.1 Abstract
5.2 Introduction
5.3 Results
5.3.1 Split tRNAAsp(GUC) in Aeropyrum and Thermosphaera consist of adjacent halves
5.3.2 tRNALys(CUU) in Staphylothermus resembles its ortholog in Nanoarchaeum
5.3.3 Permuted tRNAs in Thermofilum pendens have the same structure as in red alga
5.4 Discussion
5.5 Materials and Methods
5.6 Author Contributions
5.7 Acknowledgments

Chapter 6 GtRNAdb: A database of transfer RNA genes detected in genomic sequence
6.1 Abstract
6.2 Introduction
6.3 Database Features
6.3.1 tRNA identification information
6.3.2 tRNA secondary structures and alignments
6.3.3 tRNA search and BLAST server
6.3.4 Error and request tracking
6.4 Future Directions
6.5 Funding

Chapter 7 Characterization of a crenarchaeal-rich metagenome through RNase P and tRNAs
7.1 Introduction
7.2 Results and Discussion
7.2.1 At least six crenarchaeal species co-exist in Cistern Spring
7.2.2 Search for RNase P RNAs in metagenome
7.2.3 Majority of tRNAs in Cistern Spring have introns
7.2.4 Trans-spliced split tRNAs in
7.2.5 Novel intron-bearing split tRNA in
7.3 Future Directions
7.4 Materials and Methods

Chapter 8 Conclusions

Bibliography

List of Figures

Figure 1.1 Predicted secondary structures of archaeal type A and type M RNase P RNAs (RPRs)
Figure 1.2 Precursor tRNA with canonical intron
Figure 1.3 Locations of tRNA introns in Pyrobaculum calidifontis
Figure 2.1 Comparison of promoter motif conservation
Figure 2.2 Predicted Shine-Dalgarno position distributions and motif conservation
Figure 2.3 Predicted promoter motif position distributions
Figure 2.4 Schematic diagram of leadered and leaderless transcript layout
Figure 2.5 Transcription start site mapping of Shine-Dalgarno-less transcripts in Pyrococcus furiosus
Figure 2.6 Highly variable lengths of 5′ UTRs in Methanocaldococcus jannaschii
Figure 2.7 Internal promoters in Pyrobaculum aerophilum
Figure 2.8 Shine-Dalgarno-less 5′ UTR for PF0250 in Pyrococcus furiosus
Figure 2.9 Summary of predicted promoters and Shine-Dalgarno motifs in 46 archaeal genomes
Figure 3.1 Alignment of tRNA promoters and 5′-leader sequences across four Pyrobaculum species for three sets of tRNA orthologs
Figure 3.2 Partial purification of P. aerophilum RNase P by ion-exchange chromatography
Figure 3.3 The pre-tRNA 5ʹ-processing activity from P. aerophilum (Pae) cell extract has all the cleavage properties of RNase P
Figure 3.4 Analysis of the site of cleavage in pre-tRNAPhe (G-1) and (U-1) by partially-purified native P. aerophilum (Pae) RNase P and in vitro transcribed RNase P RNAs (RPRs)
Figure 3.5 RNase P activity from P. aerophilum (Pae) cell extract requires both protein and RNA subunits
Figure 3.6 P. aerophilum RNase P RNA displayed on the Archaeal Genome Browser (Schneider, Pollard et al. 2006) and alignment of the RNase P RNA sequences from five Pyrobaculum species
Figure 3.7 Predicted secondary structure, native expression, and in vitro activity of P. aerophilum (Pae) and Caldivirga maquilingensis (Cma) RNase P RNAs (RPRs)
Figure 3.8 Predicted secondary structure of P. calidifontis and V. distributa RNase P RNAs
Figure 3.9 P. aerophilum (Pae) and C. maquilingensis (Cma) RNase P RNAs (RPRs) can process pre-tRNAPhe (G-1) with a 4-nt leader
Figure 4.1 Predicted secondary structures of type T RNase P RNAs
Figure 4.2 Structural alignments of Pyrobaculum, Caldivirga, and RNase P RNAs
Figure 4.3 Predicted secondary structure and sequence alignment of Thermoproteus tenax RNase P RNA
Figure 4.4 Predicted secondary structures of type M RNase P RNA variants
Figure 5.1 Predicted promoter score distribution in A. pernix
Figure 5.2 Predicted secondary structures of trans-spliced and permuted precursor tRNAs
Figure 5.3 RT-PCR and northern analysis of tRNAAsp(GUC) in A. pernix
Figure 5.4 Proposed evolutionary relationship between tRNAAsp(GUC) in D. kamchatkensis, A. pernix, and T. aggregans
Figure 5.5 tRNALys(CUU) in S. marinus and S. hellenicus loci display strong synteny on the Archaeal Genome Browser (Schneider, Pollard et al. 2006)
Figure 5.6 Alignment of tRNA promoters in T. pendens
Figure 5.7 Phylogenetic distribution of trans-spliced and permuted tRNAs in Archaea
Figure 6.1 tRNA summary statistics with codon usage for Escherichia coli K12
Figure 6.2 Secondary structure prediction of tRNAGlu(CUC) in chromosome III of Caenorhabditis elegans
Figure 6.3 Multiple sequence alignments of tRNAPhe(GAA) in Homo sapiens
Figure 7.1 Phylogenetic relationships of Cistern Spring samples and crenarchaea based on 16S rRNA
Figure 7.2 Phylogenetic relationships of Cistern Spring samples and crenarchaea based on RNase P RNA (RPR)
Figure 7.3 Genomic sequence alignments of tRNAAla(GGC) in Cistern Spring with Desulfurococcaceae genomes
Figure 7.4 tRNALys(CUU) in Cistern Spring metagenome in comparison with crenarchaeal homologs
Figure 7.5 Predicted secondary structures and sequences of trans-spliced pre-tRNAGlu(UUC) and pre-tRNAGly in Caldivirga maquilingensis (CM) and Cistern Spring metagenome (CS)
Figure 7.6 Predicted secondary structure of trans-spliced pre-tRNAMet(CAU) in Cistern Spring metagenome

List of Tables

Table 1.1 Total number of predicted tRNA introns in archaeal genomes
Table 2.1 Predicted promoter relative positions in 46 archaeal genomes
Table 2.2 Predicted Shine-Dalgarno (SD) motifs in relationship with predicted promoters
Table 3.1 tRNA genes found to have a transcribed 5′-leader by high-throughput RNA sequencing
Table 3.2 Annotated or computationally identified RNase P proteins and associated DUF54 protein
Table 3.3 RNase P RNA search using Infernal v1.0 (Nawrocki, Kolbe et al. 2009)
Table 4.1 RNase P RNA search in Thermoproteaceae using Infernal v1.0 (Nawrocki, Kolbe et al. 2009)
Table 5.1 Summary of pre-tRNA intron size in 90 archaeal genomes
Table 5.2 Predicted promoters of tRNA genes in Aeropyrum pernix
Table 5.3 Summary of trans-spliced and permuted tRNAs
Table 5.4 Partial predicted archaeal tRNAs
Table 7.1 Predicted RNase P proteins in Cistern Spring metagenome
Table 7.2 Predicted tRNAs in Cistern Spring metagenome

Abstract

CHARACTERIZATION OF ARCHAEAL SPECIES THROUGH RNASE P AND TRANSFER RNAS

by

Patricia Pak Lee Chan

Archaea, the third domain of life, includes some of the least-studied organisms. The mixture of eukaryotic and bacterial features in their transcription and translation mechanisms motivates a better understanding of transcription unit structure in archaea. Existing computational algorithms for operon prediction are tailored to and validated largely with bacterial training data, and may not apply well to archaea. Previous studies of archaeal transcriptional and translational signals limited the analyzed region and focused on only a small portion of the genes in a genome. Here, I extended the search area and identified promoters and Shine-Dalgarno motifs in 46 archaeal genomes. Besides revealing the general patterns of transcript structure in these genomes, the analysis showed that dominance by leadered or by leaderless transcripts is equally spread across Crenarchaeota and Euryarchaeota, the two main archaeal phyla. I also identified and experimentally verified internal promoters within upstream coding regions, and uncovered Shine-Dalgarno-less leadered transcripts in Pyrococcus furiosus. These findings point to the complexity of archaeal transcription units and serve as a basis for finding novel genes, including the elusive Pyrobaculum RNase P and permuted and recently split tRNAs in archaea.

RNase P is best known for its role in removing the 5ʹ leaders of pre-tRNAs, an essential step in tRNA maturation. The RNA component of this holoenzyme is thought to be a universal feature of life, but the inability to identify RNase P in some organisms, including Pyrobaculum, has cast doubt on this universality. Using comparative genomics and improved computational methods, our lab, in collaboration with the Gopalan lab and the Brown lab, has now identified a radically minimized form of the RNase P RNA in five Pyrobaculum species and the related crenarchaea Caldivirga maquilingensis and Vulcanisaeta distributa, all retaining a conventional catalytic domain but lacking a recognizable specificity domain. These “Type T” RNase P RNAs are the smallest naturally occurring form yet discovered to function as trans-acting ribozymes. Because almost half of a typical gene is absent, the archaeal RNase P RNA covariance search model fails to detect them. I therefore developed a type T-specific covariance model that located another shortened RNase P RNA in Thermoproteus tenax.

As in eukaryotes, pre-tRNAs in archaea often contain introns that are removed during tRNA maturation. Two unrelated archaeal species display unique pre-tRNA processing complexity in the form of “split” tRNA genes: two or three segments of a tRNA are transcribed from different loci and then trans-spliced to form a mature tRNA. Another rare type of pre-tRNA, previously found only in eukaryotic algae, is “permuted”: the 3′ half is encoded upstream of the 5′ half, and the transcript must be processed to be functional. Using an improved version of the gene-finding program tRNAscan-SE, comparative analyses, and experimental verification, I have identified four novel trans-spliced tRNA genes, each in a different species of the Desulfurococcales branch of the Archaea. Additionally, I identified the first examples of permuted tRNA genes in Archaea, which appear to be permuted in the same arrangement seen previously in red alga. These findings illustrate that split tRNAs are sporadically spread across a major branch of the Archaea, and that permuted tRNAs are a new shared characteristic between archaeal and eukaryotic species. The split tRNA discoveries also provide new clues to their evolutionary history, supporting hypotheses for recent acquisition via viral or other mobile elements.

The advancement of high-throughput sequencing technologies has significantly increased the availability of microbial metagenomes. Building on the knowledge of RNase P and tRNAs gained in the preceding studies, I identified six archaeal RNase P RNAs in a metagenome sequenced from samples collected at Yellowstone National Park. Together with the intron-bearing and split tRNAs identified, these essential noncoding RNA genes can be used as markers for species characterization in microbial communities. The discovery of the first example of a split tRNA that carries multiple introns further reaffirms the opportunities for biological exploration in metagenomics.

Acknowledgments

I thank my advisor, Todd Lowe, for valuable suggestions and support throughout my research, and for the opportunities to work on areas of my interest. I also thank my committee members, David Haussler, Karen Ottemann, and Manny Ares, who went through my very lengthy thesis proposal and now my dissertation. I am grateful to our collaborators, Lien Lai and Venkat Gopalan at The Ohio State University, for developing the RNase P assay that was a major achievement of the type T RNase P discovery project, and Jim Brown at North Carolina State University for predicting the secondary structure of Pyrobaculum RNase P RNA. I am indebted to Sean Eddy and Eric Nawrocki at HHMI Janelia Farm for insightful comments and ideas on Infernal optimization. I thank Bill Inskeep at Montana State University for introducing me to the world of metagenomes, which has brought a new level of challenge to my research career. In addition, I thank all the members of our lab for creating a very enjoyable working environment. I am especially grateful to Aaron Cozen and David Bernick, who provided me with plenty of wet lab assistance. I appreciate Aaron sharing his northern analysis results as part of the promoter prediction verification and providing many excellent suggestions on cell culturing, tRNA experiments, and many other aspects of the research. I also thank two other lab members, Julie Murphy and Lauren Lui, who spent many hours culturing low-density and hard-to-grow organisms. I do not have enough words to express my appreciation for the efforts and kindness of everyone in our lab. I cannot pass over the other part of my career without especially thanking Alan Williams, my former supervisor at Affymetrix, who gave me all the flexibility I wanted to pursue my education and interests. Lastly, I sincerely thank my family and friends for the support and encouragement they have given me.

The text of this dissertation includes a reprint of the following previously published material: Lai, L.B., P.P. Chan, A.E. Cozen, D.L. Bernick, J.W. Brown, V. Gopalan, and T.M. Lowe (2010). Discovery of a minimal form of RNase P in Pyrobaculum. PNAS (In Press). Lien B. Lai and I contributed equally to this work. I identified the Caldivirga and Vulcanisaeta RNase P RNA genes using Infernal covariance model searches, performed promoter and tRNA sequencing analyses, developed the Pyrobaculum RNase P RNA covariance model, performed RNase P RNA gene class analysis for all available archaeal species, created RNase P protein alignments, co-wrote the publication, generated figures, and contributed ideas to the discussion.

The text of this dissertation also includes a reprint of the following previously published material: Chan, P.P. and T.M. Lowe (2009). GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37(Database issue):D93-97. The co-author listed in this publication directed and supervised the research which forms the basis for the dissertation.

Chapter 1

Introduction

1.1 Archaea, the third domain of life

Living organisms on Earth were once divided into two main groups: bacteria and eukaryotes. Using ribosomal RNA (rRNA) sequences for phylogenetic analysis, Carl R. Woese and George E. Fox discovered the uniqueness of methanogenic bacteria, first naming them archaebacteria and later proposing Archaea as the third domain of life (Woese and Fox 1977; Woese, Kandler et al. 1990). As the number of isolated species has increased, the Archaea domain has come to include different categories of extremophiles such as thermophiles, acidophiles, and halophiles. However, the understanding of these organisms is still very limited compared with model bacteria like E. coli or vertebrates like human and mouse. Owing to long-standing research interest, one-third of the completely sequenced archaeal genomes (31 out of 92) are from methanogens. Many gaps along the archaeal tree of life are yet to be filled. The open questions about the nature of a universal ancestor and the strong biases in the available archaeal data leave ample opportunity to explore this widely diverse lineage of life.

1.2 A mix of bacterial and eukaryotic features

Since archaea were once classified as bacteria, one would expect these organisms to have features highly similar to those of bacteria. Surprisingly, archaea use a basal transcription machinery that resembles the eukaryotic one, but in a simplified version (Bell, Magill et al. 2001). Instead of three different RNA polymerases, only one has been identified in archaea, and its subunits are homologs of those of eukaryal RNA polymerase II (RNA pol II) (Langer, Hain et al. 1995). In addition, archaeal genomes encode transcription factor B (TFB), transcription factor E (TFE), and TATA-binding protein (TBP), which are homologs of the eukaryal RNA pol II transcription factors (Baumann, Qureshi et al. 1995; Qureshi, Bell et al. 1997). Upstream of a transcript, there is a eukaryal RNA pol II-like promoter that includes a TATA-box, where TBP binds, and a transcription factor B recognition element (BRE) located upstream of the TATA-box (Qureshi, Bell et al. 1997; Soppa 1999). However, homologs of TFIIA, TFIIF, and TFIIH have not been found in archaea and do not seem to be required for transcription (Bell, Jaxel et al. 1998).

Similar to bacteria, genes in archaeal genomes are arranged in polycistronic operons. These operons may include protein-coding genes, noncoding RNA genes, or a mix of both. Furthermore, archaea employ bacterial-like translation initiation mechanisms. In many cases, they rely on the interaction between Shine-Dalgarno motifs in mRNAs and the 3′ end of 16S rRNA (Bell and Jackson 1998). Yet, for the leading genes of leaderless transcripts, which do not have a Shine-Dalgarno motif, the ribosome is positioned on the start codon by the presence of an initiator transfer RNA (tRNAi) (Benelli, Maone et al. 2003). While Shine-Dalgarno motifs are absent for these genes, they were mostly found to precede internal coding genes within an operon, where they facilitate translation initiation (Tolstrup, Sensen et al. 2000; Slupska, King et al. 2001; Benelli, Maone et al. 2003). In a recent study, transcripts with 5′ UTRs but without Shine-Dalgarno motifs were identified in halophilic archaea (Brenneis, Hering et al. 2007). A novel mechanism responsible for translation initiation of these Shine-Dalgarno-less leadered transcripts was proposed (Hering, Brenneis et al. 2009), which might be unique to archaea.
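The Shine-Dalgarno-dependent mode of initiation described above can be illustrated with a minimal sketch that scans the region upstream of a start codon for a purine-rich motif complementary to the 3′ end of 16S rRNA. The motif core (GGAGG), the spacing window, and the function name are assumptions chosen for illustration only; they are not the parameters or the method used in chapter 2, and real archaeal motifs vary between genomes.

```python
import re

# Assumed Shine-Dalgarno core for illustration; real motifs vary by genome.
SD_CORE = "GGAGG"


def find_sd_motif(upstream, min_spacing=3, max_spacing=12, core=SD_CORE):
    """Locate the SD core closest to the start codon.

    `upstream` is the DNA sequence immediately 5' of the start codon, so its
    last base is adjacent to the start codon. Spacing is the number of bases
    between the end of the motif and the start codon.

    Returns (motif_start_index, spacing), or None when no motif falls inside
    the allowed spacing window (as expected for leaderless or
    Shine-Dalgarno-less transcripts).
    """
    upstream = upstream.upper()
    best = None
    # A lookahead pattern reports overlapping occurrences of the core as well.
    for match in re.finditer(f"(?={re.escape(core)})", upstream):
        spacing = len(upstream) - (match.start() + len(core))
        if min_spacing <= spacing <= max_spacing:
            if best is None or spacing < best[1]:
                best = (match.start(), spacing)
    return best


if __name__ == "__main__":
    # Hypothetical leadered transcript: motif sits 6 bases before the start codon.
    print(find_sd_motif("TTTAAGGAGGTGATCA"))  # (5, 6)
    # Hypothetical leaderless transcript: no upstream sequence, so no motif.
    print(find_sd_motif(""))  # None
```

A fixed core string keeps the sketch short; a position weight matrix or a free-energy calculation against the 16S rRNA 3′ end would be the more usual choice when motif conservation is low.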

To understand the transcription unit structure and gene organization in archaea, a detailed investigation of transcriptional and translational signals is key. Previous studies mostly focused on bacterial genomes and did not include noncoding RNA genes such as the single-copy tRNA genes that tend to have strong promoters. The research results described in chapter 2 include the prediction of promoters and Shine-Dalgarno motifs in 46 archaeal genomes. Detailed analyses that led to the identification of internal promoters within upstream coding regions, and of Shine-Dalgarno-less leadered transcripts in a genome outside the halophiles, provide new evidence for further studies of archaeal transcription unit structure.

1.3 Ribonuclease P, a “nearly” universal ribozyme

Ribonuclease P (RNase P) is best known for its role in removing the 5′ leaders of precursor tRNAs (pre-tRNAs), one of the steps in tRNA maturation. RNase P is typically composed of an RNA subunit and a number of protein subunits that depends on the domain of life (nine or more in the eukaryotic nucleus, one in bacteria, and at least four in archaea) (Hall and Brown 2001; Lai, Vioque et al. 2010). The secondary structure of RNase P RNA is mostly conserved across species, with the specificity domain more variable than the catalytic domain (Brown 1999). The RNA was thought to be a universal feature of life, although recent studies found that the RNA component is missing in human and Arabidopsis organellar RNase P (Holzmann, Frank et al. 2008; Gobert, Gutmann et al. 2010).

The four conserved archaeal RNase P proteins, Pop5, Rpp30, Rpp21, and Rpp29, are homologous to those found in the eukaryotic nucleus, but do not share similarity with the bacterial RNase P protein (Hall and Brown 2002). While Pop5 and Rpp30 interact with the catalytic domain of the RNA component, Rpp21 and Rpp29 interact with the specificity domain (Tsai, Pulukkunat et al. 2006; Xu, Amero et al. 2009). Although the secondary structure of the archaeal RNase P RNA is conserved in general, structural differences divide it into two subgroups: type A and type M (Harris, Haas et al. 2001). Archaeal type A RNase P RNAs closely resemble the bacterial type A RNAs, while four to five stems and a loop region are absent in a type M RNase P RNA (Figure 1.1). RNase P has not been found in all archaeal species. Nanoarchaeum equitans, an obligate symbiont that lives with another archaeal organism, was found to have no identifiable RNase P gene or detectable RNase P activity, perhaps due to the lack of 5′ leaders in its pre-tRNAs (Randau, Schroder et al. 2008). This leaves Pyrobaculum, a hyperthermophilic crenarchaeon, and its closely related species as the only ones in Archaea with a missing RNase P. In chapter 3, I describe the discovery of a shortened form of RNase P RNA (dubbed type T) in the family Thermoproteaceae, which includes Pyrobaculum, Caldivirga, and Vulcanisaeta. Like the type A RNase P RNA, this shortened form has a well-conserved catalytic domain, but it lacks most of the specificity domain. Although collectively grouped as type T RNase P RNAs, these genes differ in secondary structure between genera; chapter 4 provides a more detailed comparison. That chapter also presents a type T-specific covariance search model that led to the identification of another shortened RNase P RNA in a newly available Thermoproteaceae genome.

1.4 Disrupted transfer RNAs – how common are they?

Transfer RNAs (tRNAs), best known for their cloverleaf secondary structure, play an essential role in protein translation in all living cells. During tRNA maturation in eukaryotes and archaea, introns are removed in addition to the cleavage of 5′ leaders. While the majority of archaeal tRNA introns are located one nucleotide downstream of the anticodon, some have been found at seemingly random, “noncanonical” positions in tRNA genes. Both the canonical and noncanonical introns preserve a general bulge-helix-bulge (BHB) secondary structure (Figure 1.2) (Marck and Grosjean 2003) that is recognized by the splicing endonuclease during removal (Biniszkiewicz, Cesnaviciene et al. 1994; Abelson, Trotta et al. 1998; Li, Trotta et al. 1998). While the well-characterized eukaryotic version of this endonuclease in Saccharomyces cerevisiae consists of four subunits, namely Sen2p, Sen34p, Sen54p, and Sen15p (Li, Trotta et al. 1998), only the homologs of Sen2p and Sen34p have been identified in archaea (Calvin and Li 2008). These two subunits form three different protein structures: homodimer (α2), homotetramer (α4), and heterotetramer (α2β2), with the first two observed only in Euryarchaeota and Korarchaeota. The heterotetrameric form of the splicing endonuclease, which can cleave introns with a relaxed BHB motif in addition to the canonical form, was found in all crenarchaea, thaumarchaea, Nanoarchaeum equitans, and Methanopyrus kandleri, all genomes that contain noncanonical tRNA introns (Hall and Brown 2002; Marck and Grosjean 2003; Calvin, Hall et al. 2005; Randau, Calvin et al. 2005). It has been a common belief that this heterotetrameric form of the enzyme might be required for removing the noncanonical introns, but the details of this mechanism are still not known.

Although most archaeal species carry only a small number of noncanonical tRNA introns, if any, Marck and Grosjean observed that the crenarchaeon Pyrobaculum aerophilum contains more than four times as many noncanonical introns as any other species (Marck and Grosjean 2003). Recent studies show that species in Thermoproteales, particularly the genera Pyrobaculum and Thermofilum, contain the largest numbers of noncanonical introns, with P. calidifontis harboring a superlative total of 71 introns in 46 tRNAs (Figure 1.3, Table 1.1) (Sugahara, Kikuta et al. 2008; Chan and Lowe 2009). Interestingly, Caldivirga maquilingensis, also a member of Thermoproteales, does not have as many atypical tRNA introns, but does contain a number of trans-spliced split tRNAs (Fujishima, Sugahara et al. 2009), a rare trait shared only with Nanoarchaeum equitans (Randau, Munch et al. 2005). These split tRNAs are composed of two or three transcripts, each with its own promoter, that are distantly separated in the genome. Like introns, split tRNAs form a BHB secondary structure at the exon-splicing junction, which is processed by the splicing endonuclease (Randau, Calvin et al. 2005). Atypical tRNAs are not limited to archaeal species. Permuted tRNAs, in which the 3′ half of the sequence lies upstream of the 5′ half and a BHB-like motif forms at the termini, were identified in the unicellular eukaryotic red alga Cyanidioschyzon merolae, the nucleomorph genome of the chlorarachniophyte Bigelowiella natans, and the prasinophytes Ostreococcus and Micromonas (Soma, Onodera et al. 2007; Maruyama, Sugahara et al. 2009). Various hypotheses have been raised to explain the existence and origin of tRNA introns and fragmented tRNA genes, collectively described as disrupted tRNAs (Di Giulio 2008; Randau and Soll 2008; Di Giulio 2009; Fujishima, Sugahara et al. 2009; Sugahara, Fujishima et al. 2009). In chapter 5, I describe the discovery of four novel split tRNAs in four different archaeal species, and the prediction of the first examples of permuted tRNAs in archaea. These findings provide important information for understanding their evolutionary origins.
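The intron counts cited above for P. calidifontis and related genomes (Table 1.1) can be cross-checked with a short sketch. The four rows are copied from Table 1.1; the record type and helper names are assumptions introduced for illustration and are not part of tRNAscan-SE or Infernal.

```python
from typing import NamedTuple


class IntronCounts(NamedTuple):
    """One row of Table 1.1 (illustrative record type, not from the source)."""
    genome: str
    trna_genes: int
    total: int
    canonical: int
    noncanonical: int


# Values copied from Table 1.1.
ROWS = [
    IntronCounts("Pyrobaculum calidifontis JCM 11548", 46, 71, 16, 55),
    IntronCounts("Thermofilum pendens Hrk 5", 46, 63, 14, 49),
    IntronCounts("Caldivirga maquilingensis IC-167", 47, 11, 8, 3),
    IntronCounts("Methanopyrus kandleri AV19", 34, 10, 8, 2),
]


def is_consistent(row):
    """Canonical and noncanonical introns should add up to the total."""
    return row.total == row.canonical + row.noncanonical


def noncanonical_fraction(row):
    """Share of a genome's tRNA introns found at noncanonical positions."""
    return row.noncanonical / row.total


def introns_per_gene(row):
    """Average number of introns per predicted tRNA gene."""
    return row.total / row.trna_genes


if __name__ == "__main__":
    for row in ROWS:
        print(
            f"{row.genome}: consistent={is_consistent(row)}, "
            f"noncanonical={noncanonical_fraction(row):.2f}, "
            f"introns/gene={introns_per_gene(row):.2f}"
        )
```

For P. calidifontis this gives roughly 1.5 introns per tRNA gene, which reflects that many of its tRNAs carry more than one intron.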

As the number of complete genomes increases, more tRNA genes are identified. Among the numerous tRNA search programs developed in the last decade, tRNAscan-SE (Lowe and Eddy 1997) remains a standard tool for whole-genome annotation of tRNA genes. To build a collection of all the predicted and manually curated tRNA genes, a genomic tRNA database (http://gtrnadb.ucsc.edu) was developed as a repository for all identifications made by tRNAscan-SE (Lowe and Eddy 1997). In chapter 6, I describe the features of this database, which include gene annotation retrieval, similarity searching, and gene context viewing through the UCSC Genome Browsers (Schneider, Pollard et al. 2006; Rhead, Karolchik et al. 2010).

1.5 Atypical genes in nature

Most microbiological knowledge of genome characterization is based on observations and analyses of experimental results obtained from cultivated clonal cultures. Scientists can develop and maintain multiple controlled environments to evaluate gene expression and abundance, mutation rate, gene transfer events, and so on. Results from these analyses are generally very informative and carry a high level of certainty. However, in a natural environment where multiple microbial species and strains live together, organisms may behave and evolve very differently. Because of their requirements for particular living conditions and their dependencies, microorganisms in many environments cannot be cultured in a laboratory. For example, growing hyperthermophilic archaea on plates to select isolated colonies is always a challenge. Culturing alone therefore cannot represent the diversity of microbial communities.

Realizing the importance of learning from nature, archaeal researchers have been collecting environmental samples at places all around the world. Hot springs in Kamchatka, Russia; Beppu, Japan; and, locally, Yellowstone National Park are just some examples. Ribosomal RNA sequences obtained from uncultured samples have long been used as the major phylogenetic marker for analysis (Stahl, Lane et al. 1985; Schmidt, DeLong et al. 1991; Pace 1997), although many researchers also study the organisms based on metabolic genes (Inskeep, Rusch et al. 2010).

With the advancement of high-throughput sequencing technologies, metagenomic research has become more feasible than ever. A growing number of programs introduced in the last five years, such as the Human Microbiome Project at the National Institutes of Health and metagenome sequencing at the Joint Genome Institute, encourage and develop new methodologies to study microbial communities in their natural environments. With the discovery of the shortened form of RNase P RNA and of disrupted tRNAs, it will be interesting to investigate whether these atypical genes exist in nature and whether they can be used for species characterization in a community. In chapter 7, I describe my first attempt at analyzing these genes in a metagenome sequenced from samples isolated at Yellowstone National Park and conclude with a proposed method to characterize archaeal-rich metagenomes using RNase P and tRNAs.


Figure 1.1 Predicted secondary structures of archaeal type A and type M RNase P RNAs (RPRs). Methanobacterium thermoautotrophicum RNase P RNA (RPR) represents a typical archaeal type A RPR. Archaeoglobus fulgidus has a typical archaeal type M RPR (Brown 1999; Harris, Haas et al. 2001). Red regions highlight the major differences between type A and type M RPRs. Black circles represent the universally conserved nucleotides in RPRs.


Figure 1.2 Precursor tRNA with canonical intron. A bulge-helix-bulge motif is formed at the exon-intron splicing junction. Gray dots highlight the tRNA intron. Black arrows indicate the splicing sites. Blue dots represent the anticodon.


Figure 1.3 Locations of tRNA introns in Pyrobaculum calidifontis. A total of 71 introns in 46 tRNAs were found in P. calidifontis. The arrows point to the positions where tRNA introns are located. The red arrow indicates the position where canonical introns are located. The numbers at the end of the arrows represent the number of introns found at each particular position.

Table 1.1 Total number of predicted tRNA introns in archaeal genomes. tRNA genes were predicted by tRNAscan-SE (Lowe and Eddy 1997). Introns in predicted tRNAs were identified by the same software, by a bulge-helix-bulge motif search using a covariance model with Infernal v1.0 (Nawrocki, Kolbe et al. 2009), and by sequence alignments. Orange: Crenarchaeota; Green: Euryarchaeota; Blue: Thaumarchaeota; Yellow: Nanoarchaeota; Gray: Korarchaeota.

Genome | Total number of predicted tRNA genes | Total number of introns | Number of canonical introns | Number of noncanonical introns
Pyrobaculum calidifontis JCM 11548 | 46 | 71 | 16 | 55
Thermofilum pendens Hrk 5 | 46 | 63 | 14 | 49
Pyrobaculum islandicum DSM 4184 | 46 | 52 | 12 | 40
Thermoproteus neutrophilus V24Sta | 46 | 49 | 13 | 36
Pyrobaculum aerophilum str. IM2 | 46 | 30 | 6 | 24
Pyrobaculum arsenaticum DSM 13514 | 46 | 27 | 7 | 20
Staphylothermus hellenicus DSM 12710 | 46 | 27 | 23 | 4
Staphylothermus marinus F1 | 46 | 27 | 23 | 4
Vulcanisaeta distributa DSM 14429 | 46 | 26 | 10 | 16
Metallosphaera sedula DSM 5348 | 46 | 26 | 23 | 3
Sulfolobus tokodaii str. 7 | 46 | 24 | 20 | 4
Thermosphaera aggregans DSM 11486 | 46 | 23 | 18 | 5
Sulfolobus islandicus Y.G.57.14 | 47 | 21 | 18 | 3
Sulfolobus acidocaldarius DSM 639 | 46 | 21 | 19 | 2
Sulfolobus islandicus L.D.8.5 | 46 | 20 | 17 | 3
Sulfolobus islandicus L.S.2.15 | 46 | 20 | 17 | 3
Sulfolobus islandicus M.14.25 | 46 | 20 | 17 | 3
Sulfolobus islandicus M.16.27 | 46 | 20 | 17 | 3
Sulfolobus islandicus M.16.4 | 46 | 20 | 17 | 3
Sulfolobus islandicus Y.N.15.51 | 46 | 20 | 17 | 3
Sulfolobus solfataricus P2 | 46 | 20 | 17 | 3
Desulfurococcus kamchatkensis 1221n | 47 | 17 | 15 | 2
Aciduliprofundum boonei T469 | 46 | 15 | 15 | 0
Ignicoccus hospitalis KIN4/I | 47 | 14 | 11 | 3
Methanosaeta thermophila PT | 47 | 14 | 14 | 0
Cenarchaeum symbiosum | 46 | 13 | 6 | 7
Acidilobus saccharovorans 345-15 | 46 | 13 | 8 | 5
Aeropyrum pernix K1 | 46 | 13 | 8 | 5
Nitrosopumilus maritimus SCM1 | 44 | 12 | 6 | 6
Caldivirga maquilingensis IC-167 | 47 | 11 | 8 | 3
Ignisphaera aggregans DSM 17230 | 46 | 13 | 9 | 4
Hyperthermus butylicus DSM 5456 | 46 | 10 | 8 | 2
Methanopyrus kandleri AV19 | 34 | 10 | 8 | 2
Natrialba magadii ATCC 43099 | 49 | 7 | 7 | 0
Candidatus Korarchaeum cryptofilum OPF8 | 46 | 6 | 3 | 3
Methanohalobium evestigatum Z-7303 | 49 | 6 | 6 | 0
Archaeoglobus fulgidus DSM 4304 | 46 | 5 | 5 | 0
Methanococcoides burtonii DSM 6242 | 50 | 5 | 5 | 0
Uncultured methanogenic archaeon RC-I | 53 | 4 | 2 | 2
Methanothermobacter marburgensis str. Marburg | 40 | 4 | 3 | 1
Methanothermobacter thermautotrophicus str. Delta H | 39 | 4 | 3 | 1
Nanoarchaeum equitans Kin4-M | 44 | 4 | 3 | 1
Haloarcula marismortui ATCC 43049 | 50 | 4 | 4 | 0
Haloterrigena turkmenica DSM 5511 | 51 | 4 | 4 | 0
Methanohalophilus mahii DSM 5219 | 52 | 4 | 4 | 0
Methanosarcina acetivorans C2A | 61 | 4 | 4 | 0
Methanosarcina barkeri str. Fusaro | 61 | 4 | 4 | 0
Methanosarcina mazei Go1 | 57 | 4 | 4 | 0
Natronomonas pharaonis DSM 2160 | 45 | 4 | 4 | 0
Thermoplasma acidophilum DSM 1728 | 45 | 4 | 4 | 0
Thermoplasma volcanium GSS1 | 45 | 4 | 4 | 0
Candidatus Methanoregula boonei 6A8 | 49 | 3 | 3 | 0
Candidatus Methanosphaerula palustris E1-9c | 56 | 3 | 3 | 0
Ferroglobus placidus DSM 10642 | 50 | 3 | 3 | 0
Halalkalicoccus jeotgali B3 | 48 | 3 | 3 | 0
Halobacterium salinarum R1 | 47 | 3 | 3 | 0
Halobacterium sp. NRC-1 | 47 | 3 | 3 | 0
Haloferax volcanii DS2 | 51 | 3 | 3 | 0
Halomicrobium mukohataei DSM 12286 | 47 | 3 | 3 | 0
Halorubrum lacusprofundi ATCC 49239 | 51 | 3 | 3 | 0
Methanocella paludicola SANAE | 47 | 3 | 3 | 0
Methanoculleus marisnigri JR1 | 49 | 3 | 3 | 0
Methanoplanus petrolearius DSM 11571 | 48 | 3 | 3 | 0
Methanospirillum hungatei JF-1 | 51 | 3 | 3 | 0
Picrophilus torridus DSM 9790 | 47 | 3 | 3 | 0
Archaeoglobus profundus DSM 5631 | 47 | 2 | 2 | 0
Haloquadratum walsbyi DSM 16790 | 45 | 2 | 2 | 0
Methanobrevibacter smithii ATCC 35061 | 36 | 2 | 2 | 0
Methanocaldococcus fervens AG86 | 37 | 2 | 2 | 0
Methanocaldococcus infernus ME | 38 | 2 | 2 | 0
Methanocaldococcus jannaschii DSM 2661 | 37 | 2 | 2 | 0
Methanocaldococcus sp. FS406-22 | 37 | 2 | 2 | 0
Methanocaldococcus vulcanius M7 | 37 | 2 | 2 | 0
Methanococcus aeolicus Nankai-3 | 38 | 2 | 2 | 0
Methanococcus maripaludis C5 | 37 | 2 | 2 | 0
Methanococcus maripaludis C6 | 37 | 2 | 2 | 0
Methanococcus maripaludis C7 | 37 | 2 | 2 | 0
Methanococcus maripaludis S2 | 37 | 2 | 2 | 0
Methanococcus vannielii SB | 37 | 2 | 2 | 0
Methanocorpusculum labreanum Z | 53 | 2 | 2 | 0
Methanosphaera stadtmanae DSM 3091 | 42 | 2 | 2 | 0
Pyrococcus abyssi GE5 | 46 | 2 | 2 | 0
Pyrococcus furiosus DSM 3638 | 46 | 2 | 2 | 0
Pyrococcus horikoshii OT3 | 46 | 2 | 2 | 0
Thermococcus gammatolerans EJ3 | 46 | 2 | 2 | 0
Thermococcus kodakarensis KOD1 | 46 | 2 | 2 | 0
Thermococcus onnurineus NA1 | 46 | 2 | 2 | 0
Thermococcus sibiricus MM 739 | 46 | 2 | 2 | 0
Halorhabdus utahensis DSM 12940 | 45 | 1 | 1 | 0

Chapter 2

Transcriptional and Translational Signal Detection in Archaea¹

¹ This chapter is a draft manuscript to be submitted for publication as Chan, P.P. and T.M. Lowe, Analysis of Gene Organization in Archaea through Transcriptional and Translational Signal Detection.

2.1 Introduction

The availability of complete microbial genomes has enabled the investigation of gene organization and transcription unit prediction even in the absence of cDNA library sequencing data. Computational algorithms for predicting operons in prokaryotes have been previously developed based on identifying conserved gene clusters (Ermolaeva, White et al. 2001; Edwards, Rison et al. 2005; Westover, Buhler et al. 2005), gene expression correlations (Sabatti, Rohlin et al. 2002), and intergenic distance distributions (Moreno-Hagelsieb and Collado-Vides 2002). However, the accuracy of these predictions is limited by a number of factors. These methods were trained and validated largely with bacterial data, as the amount of experimentally based archaeal transcription data is very limited. Because the basal transcriptional machinery of archaeal species is most comparable to that of eukaryotes (Langer, Hain et al. 1995), it is uncertain how applicable existing bacteria-centric methods are to archaea. Furthermore, mis-annotation of gene start codons and over-prediction of spurious genes result in incorrect determination of intergenic distances (Moreno-Hagelsieb and Collado-Vides 2002). Although gene-finding programs such as GeneMark (Lukashin and Borodovsky 1998), Glimmer (Delcher, Bratke et al. 2007), CRITICA (Badger and Olsen 1999), and Prodigal (Hyatt, Chen et al. 2010) produce reasonably good open reading frame (ORF) predictions, non-coding RNA genes are rarely, if ever, included when predicting transcription units. Therefore, operon and transcript prediction could be improved by specifically addressing these issues.

A detailed study of transcriptional and translational signals is essential for a better understanding of archaeal genomes. Similar to eukaryal RNA polymerase II promoters, most archaeal promoters contain TATA-like motifs. In addition, transcription factor B, a homolog of eukaryal TFIIB, binds to its recognition element (BRE) upstream of the TATA box, although the BRE may not exist in the promoters of some genes (Qureshi, Bell et al. 1997; Soppa 1999; Baliga and Dassarma 2000). Archaea adopt three different mechanisms to initiate translation. In many cases, they rely on the interaction between Shine-Dalgarno motifs in mRNAs and the 3′ end of 16S rRNA. Yet leaderless transcripts, mostly single genes and the first genes in operons, were identified in Halobacterium salinarum, Haloferax volcanii, Pyrobaculum aerophilum, and Sulfolobus solfataricus (Tolstrup, Sensen et al. 2000; Slupska, King et al. 2001; Brenneis, Hering et al. 2007). In such cases, the ribosome is positioned on the start codon by the presence of an initiator tRNA (tRNAi) (Benelli, Maone et al. 2003). Recently, a novel mechanism of translation initiation for leadered transcripts that do not have Shine-Dalgarno motifs in their 5′ UTRs was identified in halophilic archaea (Hering, Brenneis et al. 2009). Although the molecular aspects of this mechanism are yet to be determined, it was shown that known mechanisms, including eukaryotic-style scanning of the 5′ UTR for the first start codon, are not involved. Previous studies have reported the positions of conserved promoters and Shine-Dalgarno motifs in a number of archaeal genomes (Wan, Bridges et al. 2004; Torarinsson, Klenk et al. 2005; Chang, Halgamuge et al. 2006). However, these analyses limit the analyzed region to within approximately 50 nt upstream of the translation initiation codon and offer only generalized results for the whole genome. Here, we extended the upstream region and identified transcriptional and translational signals in 46 archaeal genomes. Besides recognizing the general patterns of gene organization within these genomes in terms of operon arrangements, we found that, unlike Pyrobaculum and Sulfolobus, the Desulfurococcales and Acidilobales in Crenarchaeota contain mostly leadered transcripts. Moreover, the 5′ UTRs of transcripts in methanogens vary widely in length, which probably explains the low detection of promoters in previous studies. Our identification of Shine-Dalgarno-less leadered transcripts outside of the halophiles and of internal promoters within upstream coding regions provides more biological evidence for future analyses of archaeal transcription and translation.

2.2 Results

Using the 90-nt upstream region of each known coding and non-coding gene, we developed a position-specific scoring matrix (PSSM) (Henikoff and Henikoff 1996) for a 16-nt promoter motif that includes the BRE and TATA box for each archaeal genome. A promoter was predicted for each annotated coding and non-coding gene by selecting the highest-scoring candidate, based on analyzing the 150-nt upstream region with the PSSM and on the relative position obtained from the training data set. A similar approach was applied to predict a 10-nt Shine-Dalgarno motif within the 25-nt region upstream of the translation start codon of each protein. The number of genes found to be associated with a predicted promoter and/or a predicted Shine-Dalgarno motif in the 46 studied genomes is summarized in Tables 2.1 and 2.2. The results of non-gene-specific, genome-wide promoter and Shine-Dalgarno motif scanning using the PSSMs are collected at the UCSC Archaeal Genome Browser (Schneider, Pollard et al. 2006).
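As an illustration of the scoring step described above, the following Python sketch builds a log-odds PSSM from aligned motifs and selects the highest-scoring window in an upstream region. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names, the uniform background frequency, the pseudocount, and the short training motifs are all hypothetical, and the positional weighting derived from the training data set is omitted.

```python
import math

BASES = "ACGT"

def build_pssm(motifs, pseudocount=1.0, background=0.25):
    """Return a log-odds PSSM: one dict of base -> score per motif column."""
    pssm = []
    for column in zip(*motifs):
        counts = {b: pseudocount for b in BASES}
        for base in column:
            counts[base] += 1
        total = sum(counts.values())
        # Log-odds against an assumed uniform background frequency.
        pssm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pssm

def best_hit(pssm, upstream):
    """Scan every window of the upstream sequence; return (position, score).

    Position is the offset of the window's first base relative to the gene
    start, so the base immediately upstream of the gene is -1.
    """
    width = len(pssm)
    best = None
    for i in range(len(upstream) - width + 1):
        window = upstream[i:i + width]
        # Unknown characters (for example N) contribute a neutral score of 0.
        score = sum(col.get(base, 0.0) for col, base in zip(pssm, window))
        if best is None or score > best[1]:
            best = (i - len(upstream), score)
    return best

# Hypothetical training motifs and upstream sequence, for illustration only.
training = ["TTTATATA", "TTTAAATA", "TTTATAAA"]
pssm = build_pssm(training)
upstream = "GCGCGC" + "TTTATATA" + "GCGCGCGCGC"
position, score = best_hit(pssm, upstream)
```

In practice the window width would be 16 nt for the promoter motif and 10 nt for the Shine-Dalgarno motif, and the candidate score would be combined with the expected position learned from the training set.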

2.2.1 Absence of transcription factor B recognition element

To analyze the conservation of promoter sequences, we aligned the predicted 16-nt promoter motifs and generated a sequence logo for each genome. While we found both the BRE and the TATA-box in the alignments for all the studied genomes, the level of conservation of these two promoter elements varies. Compared with the highly conserved BRE and TATA-box in euryarchaeal Pyrococcus furiosus (Figure 2.1A) and crenarchaeal Caldivirga maquilingensis (Figure 2.1B), Ignicoccus hospitalis, a hyperthermophilic member of the Desulfurococcales that is the only known host for Nanoarchaeum equitans, has high conservation in the TATA-box but a much weaker BRE region (Figure 2.1C). Aeropyrum pernix, also a member of the Desulfurococcales, shows the same conservation pattern. Yet the weak BRE region does not extend to all Desulfurococcales species, for example Thermosphaera aggregans (Figure 2.1D). It was found that the A-rich BRE region upstream of the TATA-box is important for promoter efficiency (Qureshi and Jackson 1998). However, previous studies show that the bop gene in Halobacterium does not require the presence of a BRE for orientation of the transcription machinery (Bell, Kosa et al. 1999; Baliga and Dassarma 2000). This exceptional case may provide an explanation for the lack of BRE conservation in these genomes. Alternatively, an unrecognized BRE motif may be used for transcription initiation. The low AT content of these genomes may have been the driving force.
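The degree of conservation displayed in a sequence logo can be quantified as the information content of each alignment column. The sketch below computes this value in bits for DNA as 2 minus the Shannon entropy of the column. It is an illustrative sketch only: the function name and the example motifs are hypothetical, and no small-sample correction is applied, which logo-generating software typically includes.

```python
import math
from collections import Counter

def column_information(motifs):
    """Information content in bits for each column of equal-length DNA motifs.

    Computed as 2 - H, where H is the Shannon entropy of the observed base
    frequencies; no small-sample correction is applied in this sketch.
    """
    info = []
    for column in zip(*motifs):
        counts = Counter(column)
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info

# Hypothetical aligned motifs: a fully conserved column followed by a variable one.
example = ["AT", "AT", "AC", "AG"]
bits = column_information(example)
```

A strongly conserved TATA-box would give columns close to 2 bits, whereas a weak BRE region such as that described for Ignicoccus hospitalis would give columns closer to 0 bits.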

2.2.2 Low conservation of Shine-Dalgarno motifs in Pyrobaculum, Cenarchaeum, and Nanoarchaeum

Similar to the promoters, predicted Shine-Dalgarno motifs are highly conserved among most of the studied genomes. As shown in the position distribution of Pyrococcus furiosus (Figure 2.2A), Shine-Dalgarno motifs are mostly found around -10 from the translation start codon. The sequence, which is complementary to the 3′ end of 16S rRNA, has a consensus of 5′-GAGGUGA-3′ (Figure 2.2B), with the exception of Pyrobaculum species, Cenarchaeum symbiosum, and Nanoarchaeum equitans. When aligning the 25-nt upstream region, only genes separated by 20 nt or less in these genomes reveal a Shine-Dalgarno sequence pattern. Yet these predicted motifs are relatively short and weakly conserved (Figure 2.2C). This characteristic may increase the flexibility of their positions in these genomes, resulting in a more uniform distribution between -20 and -6 than that found at the typical positions in other genomes (Figure 2.2A). Although the exceptionally low abundance of Shine-Dalgarno motifs in N. equitans (Table 2.2) would have contributed to the weak signal observed, the predominance of leaderless transcripts, which results in the absence of Shine-Dalgarno motifs, does not explain this weak conservation in general, as shown by the difference between Pyrobaculum and Sulfolobus.
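For illustration, a Shine-Dalgarno candidate can also be located by simple matching against the consensus reported above. The following sketch is not the PSSM-based method used in this study; it swaps in a plain consensus match, and the function name, the minimum-match threshold, and the example sequence are hypothetical.

```python
SD_CONSENSUS = "GAGGTGA"  # DNA form of the 5'-GAGGUGA-3' consensus

def find_sd(upstream, consensus=SD_CONSENSUS, min_matches=5):
    """Return (position, matches) of the best consensus match, or None.

    Position is the offset of the match's first base relative to the start
    codon, so the base immediately upstream of the start codon is -1.
    min_matches is an assumed threshold, not a value from this study.
    """
    best = None
    for i in range(len(upstream) - len(consensus) + 1):
        window = upstream[i:i + len(consensus)]
        matches = sum(a == b for a, b in zip(window, consensus))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (i - len(upstream), matches)
    return best

# Hypothetical 25-nt upstream region containing a perfect match.
hit = find_sd("TTTTTTTTT" + "GAGGTGA" + "TTTTTTTTT")
```

Lowering `min_matches` would admit the shorter, weakly conserved motifs described for Pyrobaculum, at the cost of more spurious hits.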

Another unexpected finding is the consensus Shine-Dalgarno motif in Aeropyrum pernix, which is 5′-GGGGUGA-3′ (Figure 2.2D), although many predicted motifs retain the typical sequence pattern. With no obvious difference between the 3′ end of the 16S rRNA sequence in A. pernix and that of other archaeal genomes, the only explanation supported by the computational results is perhaps the 56.3% G/C content of the genome, which leads to a more G-rich region. Interestingly, this is not observed in the halophiles, which have an even higher G/C content and a much lower abundance of predicted Shine-Dalgarno motifs (Table 2.2).

2.2.3 A mixture of leadered and leaderless transcripts in Crenarchaeota and Euryarchaeota

Previous studies suggested that a majority of transcripts in Pyrobaculum aerophilum and Sulfolobus solfataricus are leaderless (Tolstrup, Sensen et al. 2000; Slupska, King et al. 2001). Our results are consistent with these studies: 60% of the predicted promoters in these genomes are located between -24 and -30, and between -22 and -43, respectively, relative to the translation start codon of proteins or the 5′ end of the mature ncRNA genes (Figure 2.3A). These positions imply the absence or minimal length of 5′ UTRs in most of the transcripts. The failure to identify a Shine-Dalgarno motif in the upstream region of over 60% of the proteins further supports this conclusion. These leaderless transcripts were found not only in other Pyrobaculum and Sulfolobus species, but also in the genomes that belong to the same Thermoproteaceae and Sulfolobaceae families, respectively.

Because of the limited availability of complete genomes, it was commonly believed that crenarchaea were dominated by leaderless transcripts (Karlin, Mrazek et al. 2005; Benelli and Londei 2009). However, we found that over 60% of proteins in Desulfurococcales species, Acidilobus saccharovorans, and Thermofilum pendens have predicted Shine-Dalgarno motifs in the upstream region (Table 2.2). The majority of these genes are associated with predicted promoters, implying that each is either a single gene or the leading gene of a polycistronic operon. The Garrett Lab reported that, to accommodate a 5′ UTR before each start codon, promoters of leadered transcripts are generally located at least 10 nt further upstream than those associated with leaderless transcripts. Over 60% of the predicted promoters were found at positions ranging from -22 to -53 in these genomes, suggesting the existence of 5′ UTRs in many of the transcripts (Table 2.1). Interestingly, with the exception of Hyperthermus butylicus, over 20% of the predicted promoters are located between -22 and -30 (Figure 2.3B). The genes in this category were found to be ncRNAs such as tRNAs or potential leaderless transcripts. For example, an ATPase subunit transporter (Igag_0186) in Ignisphaera aggregans, which is oriented divergently from an inner membrane component of transport systems (Igag_0187), has a strong predicted promoter at position -27 from the translation start codon. However, no Shine-Dalgarno motif was detected for this gene. On the other hand, both a promoter and a Shine-Dalgarno motif were found associated with Igag_0187 (Figure 2.4A). This observation suggests that two translation initiation mechanisms, with and without the requirement of Shine-Dalgarno motifs, are utilized in these genomes simultaneously.
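The reasoning applied to individual genes in this section can be summarized as a simple decision rule combining the predicted promoter position with the presence or absence of a predicted Shine-Dalgarno motif. The sketch below is a hypothetical illustration rather than a procedure used in this study: the function name, the category labels, and the assumed leaderless promoter window of -30 to -22 are chosen for illustration, and real transcripts require experimental verification.

```python
def classify_transcript(promoter_pos, has_sd, leaderless_window=(-30, -22)):
    """Classify a gene from its predicted signals.

    promoter_pos: promoter offset relative to the start codon, or None.
    has_sd: whether a Shine-Dalgarno motif was predicted upstream.
    leaderless_window: assumed promoter positions that leave no room
        for a 5' UTR (an illustrative choice, not a fitted value).
    """
    if promoter_pos is None:
        # No promoter of its own: likely an internal gene of an operon.
        return "operon-internal" if has_sd else "unassigned"
    low, high = leaderless_window
    if low <= promoter_pos <= high:
        return "ambiguous" if has_sd else "leaderless"
    if promoter_pos < low:
        return "leadered" if has_sd else "leadered, SD-less"
    return "unassigned"

# Promoter positions and motif calls taken from examples in this chapter.
examples = {
    "Igag_0186": classify_transcript(-27, False),
    "SSO1460": classify_transcript(-61, True),
    "PF0250": classify_transcript(-50, False),
}
```

Because short 5′ UTRs cannot be distinguished from leaderless transcripts by promoter position alone, a rule of this kind can only nominate candidates for primer extension or RNA sequencing.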

We then asked whether these two translation initiation mechanisms also exist in crenarchaeal genomes that are dominated by leaderless transcripts. We noticed that over 20% of the proteins with a predicted Shine-Dalgarno motif in Thermoproteaceae and Sulfolobaceae also have a predicted promoter. When we looked closely at the top-scoring candidates, we found that a penicillin acylase precursor (SSO1460) in Sulfolobus solfataricus has a very strong predicted promoter (5′GAAACTTTTTTATAAA) located at -61, which is approximately 30 nt further upstream than those of the leaderless transcripts. It has a predicted Shine-Dalgarno motif (5′GAGGTG) at position -12 (Figure 2.4B). RNA sequencing analyses revealed that SSO1460 has a transcription start site 28 nt upstream of the translation start codon (Wurtzel, Sapra et al. 2010). This implies that SSO1460 has a 5′ UTR with a Shine-Dalgarno motif, in contrast to the majority of transcripts in the genome, which are leaderless.

Similar to Pyrobaculum and Sulfolobus, euryarchaea including Archaeoglobus, Thermoplasma, Halobacterium, and Haloferax were previously found to have mostly leaderless transcripts (Torarinsson, Klenk et al. 2005; Brenneis, Hering et al. 2007). In contrast, Thermococcaceae and methanogens such as Methanococcus maripaludis S2 are dominated by leadered transcripts (Torarinsson, Klenk et al. 2005). We examined the possibility of leaderless transcripts in genomes dominated by leadered transcripts. We found that both the FurR family transcriptional regulator (PF1194) and the putative HTH-type transcriptional regulatory protein (PF1851) in Pyrococcus furiosus, a member of the Thermococcaceae, have predicted promoters at -29, matching the typical promoter locations found for leaderless transcripts (Figure 2.5A). Furthermore, no Shine-Dalgarno motif was identified for either gene. To verify that these two genes are leaderless, we applied primer extension and identified the 5′ transcription start site of PF1194 at the first nucleotide of the start codon, while the transcription of PF1851 started 2 nucleotides downstream of the annotated start codon (Figure 2.5B). We noted that EasyGene predicted the start codon of this protein at +3 (Nielsen and Krogh 2005). This matches the BLASTP (Altschul, Gish et al. 1990; Altschul, Madden et al. 1997) alignments with its orthologs in the closely related P. abyssi and Thermococcus kodakaraensis. Thus, we suggest that while PF1194 is a leaderless transcript, PF1851 might have an incorrectly annotated translation start site and a 1-nt 5′ UTR.

2.2.4 Highly variable 5′ UTR length in methanogens

Previous analyses showed that promoters in methanogens were not conserved, with undetectable BRE and TATA-box elements, but that Shine-Dalgarno motifs were well defined (Torarinsson, Klenk et al. 2005). While our predictions show Shine-Dalgarno motif patterns and positions consistent with those in other archaeal genomes, we found that over 60% of the predicted promoters in methanogens are located between -24 and -76, the broadest region observed among all studied genomes (Table 2.1, Figure 2.6). A recent study by Olsen’s lab on Methanocaldococcus jannaschii mapped the transcription start sites of 134 gene transcripts (Zhang, Li et al. 2009). Although 52 of them have a 5′ UTR between 21 and 40 nt in length, 53% of the transcripts have a transcription start site more than 40 nt upstream of the translation start site, and 21% more than 100 nt upstream. The high variability in length and the extra-long 5′ UTRs may also apply to other methanogens, which would explain the wide range of predicted promoter locations observed in our study. We attempted to verify our promoter predictions in M. jannaschii against the predicted results in Olsen’s study, but found only about 53% consistency because of the high A/T content of the genome and the long 5′ UTRs, some of which lie outside our prediction region. We then used the experimentally verified transcription start sites from Olsen’s study as reference positions instead of the translation start sites of these genes. Our revised promoter predictions match 90% of Olsen’s predicted results, and 96% of them are located around the typical promoter positions.
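The summary statistics quoted above follow directly from the mapped coordinates. As a minimal sketch, assuming 1-based genomic coordinates for the transcription start site and the first base of the start codon, the 5′ UTR length and the fraction of transcripts exceeding a given length can be computed as follows. The function names and the example records are hypothetical and are not data from Zhang, Li et al. (2009).

```python
def utr_length(tss, start_codon, strand):
    """5' UTR length in nt from 1-based coordinates; 0 means leaderless."""
    return start_codon - tss if strand == "+" else tss - start_codon

def fraction_longer_than(lengths, cutoff):
    """Fraction of transcripts whose 5' UTR exceeds cutoff nt."""
    return sum(n > cutoff for n in lengths) / len(lengths)

# Hypothetical (TSS, start codon, strand) records, for illustration only.
records = [
    (1000, 1030, "+"),
    (5200, 5150, "-"),
    (800, 800, "+"),
    (3000, 3120, "+"),
]
lengths = [utr_length(*record) for record in records]
over_40 = fraction_longer_than(lengths, 40)
over_100 = fraction_longer_than(lengths, 100)
```

Using the transcription start site rather than the translation start site as the reference position removes the 5′ UTR length from the promoter search, which is the change that improved agreement in the comparison described above.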

2.2.5 Evidence of gene coordinate mis-annotations

In most of the studied genomes, we found less than 5% of the genes to be associated with a predicted promoter or a predicted Shine-Dalgarno motif. Moreover, about 10% of the predicted promoters are located beyond -100 of the target genes. While some transcripts may have long 5′ UTRs, as previously identified in Methanocaldococcus jannaschii (Zhang, Li et al. 2009), another possible cause of this observation is mis-annotation of the start codon. Detailed study of the genes with a predicted promoter beyond -100 in Pyrococcus furiosus suggested that the translation start codons of more than 80% of these genes are mis-annotated. As an example, we detected that a strong promoter (5′GAAAGTTTATATATC) for a putative sodium-dependent transporter (PF1254) is located 129 nt downstream of the annotated translation start codon, with a predicted Shine-Dalgarno motif (5′GGAGGTGG) at position +168. Using the multiple genome alignments available at the UCSC Archaeal Genome Browser (Schneider, Pollard et al. 2006), we found that the region of sequence conservation with other Pyrococcus species starts at position +176 of PF1254. This matches closely the JCVI annotation of NT01PF1407 (Davidsen, Beck et al. 2010) and the EasyGene prediction (Nielsen and Krogh 2005), both of which have a start codon at +175. Homologs in the related Pyrococcus species were also annotated with a translation start site in the same vicinity, suggesting that the start site of this gene is incorrectly annotated.

2.2.6 Internal promoters leading to two modes of transcription

Like bacteria, archaeal genomes contain genes that overlap on the same strand or have very small upstream intergenic regions. Pyrobaculum aerophilum and Sulfolobus solfataricus, the most striking examples, have 28% and 29.5% of genes, respectively, with upstream intergenic regions of ≤ 15 nt. Previous analyses typically considered overlapping or closely packed genes to be transcribed in the same operon (Salgado, Moreno-Hagelsieb et al. 2000; Ermolaeva, White et al. 2001; Moreno-Hagelsieb and Collado-Vides 2002). Although internal promoters within the coding regions of upstream genes were identified in bacterial genomes (LeBlanc, Lang et al. 1999; Ludwig, Homuth et al. 2001), no significant attempts have been made to investigate the possibility of using internal promoters for transcription initiation in archaea. Our study shows that about 10% of the predicted promoters in P. aerophilum are internal within upstream coding regions, and almost half of them rank in the top 50% by score. Since a number of the corresponding genes were predicted as internal genes within operons (Ermolaeva, White et al. 2001; Price, Huang et al. 2005), we suspected that they could be transcribed both from their own internal promoter and as part of a larger transcription unit. To verify our hypothesis, we selected three genes, PAE1971, PAE2207, and PAE3600, for experimental analysis.
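The two quantities discussed above, the size of the upstream intergenic region and whether a predicted promoter falls inside an upstream coding region, can be derived from gene coordinates alone. The following sketch is illustrative only: the `Gene` record, the function names, the restriction to same-strand neighbors, and the example coordinates are assumptions made for this example rather than details of the analysis performed in this study.

```python
from typing import NamedTuple, Optional, Sequence

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: str  # "+" or "-"

def upstream_gap(gene: Gene, genes: Sequence[Gene]) -> Optional[int]:
    """Intergenic distance to the nearest same-strand upstream gene.

    A negative value means the two genes overlap; None means that no
    same-strand gene lies upstream.
    """
    if gene.strand == "+":
        candidates = [g for g in genes if g.strand == "+" and g.start < gene.start]
        if not candidates:
            return None
        return gene.start - max(g.end for g in candidates) - 1
    candidates = [g for g in genes if g.strand == "-" and g.end > gene.end]
    if not candidates:
        return None
    return min(g.start for g in candidates) - gene.end - 1

def promoter_is_internal(gene: Gene, offset: int, genes: Sequence[Gene]) -> bool:
    """True if a promoter at the given (negative) offset from the gene's
    first coding base lies within another gene on the same strand."""
    coord = gene.start + offset if gene.strand == "+" else gene.end - offset
    return any(
        g != gene and g.strand == gene.strand and g.start <= coord <= g.end
        for g in genes
    )

# Hypothetical gene coordinates, for illustration only.
genes = [
    Gene("a", 100, 400, "+"),
    Gene("b", 405, 900, "+"),
    Gene("c", 1000, 1500, "-"),
    Gene("d", 1490, 2000, "-"),
]
gaps = {g.name: upstream_gap(g, genes) for g in genes}
short_fraction = sum(
    gap is not None and gap <= 15 for gap in gaps.values()
) / len(genes)
```

Genes flagged by both criteria, a short upstream gap and an internal promoter, are the candidates for the dual mode of transcription tested experimentally below.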

PAE1971, the ribosomal protein L4 in P. aerophilum, was predicted as an internal gene of an operon that includes ribosomal protein L3 (PAE1970) (Ermolaeva, White et al. 2001; Price, Huang et al. 2005). Although this is consistent with the existence of a strong predicted promoter located at position -28 relative to PAE1970, we identified an internal promoter located 25 nt upstream of PAE1971 (Figure 2.7A). Primer extension with a pair of reverse primers starting at +54 and +90, respectively, showed that PAE1971 has a transcription start site at the same position as the translation start codon (Figure 2.7B), which implies a leaderless transcript. Similarly, PAE2207, which was predicted to be part of an operon with PAE2208 (Ermolaeva, White et al. 2001; Price, Huang et al. 2005), was found to have an internal promoter at -28 (Figure 2.7A). Mapping of its 5′ transcription start site by extending from a reverse primer at +39 also verified it as a leaderless transcript (Figure 2.7C). Interestingly, nitrite reductase (PAE3600), which overlaps with the methyltransferase (PAE3601) located downstream, was not predicted as part of an operon. The strong band at 1.6 kb and the fainter band at 2.2 kb obtained from northern analysis revealed that PAE3600 is mostly transcribed as a single gene but is also co-transcribed with PAE3601 (Figure 2.7A and 2.7D).

2.2.7 Search for Shine-Dalgarno-less 5′ UTRs

Although transcripts in the halophiles are mostly leaderless, Brenneis and colleagues identified a small portion of genes having Shine-Dalgarno-less 5′ UTRs (Brenneis, Hering et al. 2007). We compared their experimental results with our predictions and found them to be consistent. In fact, compared with all the studied genomes, including the ones dominated by leaderless transcripts, the halophiles have the smallest number of genes associated with predicted Shine-Dalgarno motifs (Table 2.2), but have over 5% more genes with predicted promoters located upstream of the median position ranges (Table 2.1), which implies the existence of 5′ UTRs.

To find out whether Shine-Dalgarno-less leadered transcripts exist only in the halophiles, we studied Pyrococcus furiosus and observed that 9.7% of the genes have predicted promoters upstream of -43 but do not have predicted Shine-Dalgarno motifs. Close examination of the Lrp family transcriptional regulator (PF0250) showed that the gene was predicted to belong to a polycistronic operon consisting of PF0248, PF0249 (Price, Huang et al. 2005), and possibly PF0251 (Ermolaeva, White et al. 2001). RT-PCR with a PF0249-specific forward primer and a PF0250-specific reverse primer confirmed that PF0249 and PF0250 belong to the same transcript. On the other hand, we identified a strong promoter at -50 from PF0250, within the coding region of PF0249, but no Shine-Dalgarno motif was found. Primer extension using a pair of reverse primers starting at +32 and +38 of PF0250, respectively, mapped the 5′ transcription start site at -22 (Figure 2.8A). To further demonstrate the transcription of PF0250 as a single gene, we performed northern analysis with probes for PF0250 and observed a strong band at around 800 nt and a weaker band at 480 nt (Figure 2.8B). The smaller band is consistent with the size of PF0250 as a single gene. Although the larger band does not match the size of the 1-kb PF0249-PF0250 operon, it matches the size of PF0250 with its downstream intergenic region, which was transcribed and cloned as accession number AA113467 in the P. furiosus λ-ZAP II library (Borges 1996). The internal promoter of PF0250 is thus responsible for transcribing PF0250 as a single gene and possibly as an operon of PF0250 and AA113467 (Figure 2.8C). Together with PF1851, which has a 1-nt 5′ UTR (discussed earlier), the absence of a Shine-Dalgarno motif within the 5′ UTR of PF0250 makes it an example of a Shine-Dalgarno-less leadered transcript outside of the halophiles, and a new instance of a Shine-Dalgarno-less internal gene of an operon.

2.3 Discussion

In this work, we covered forty-six completely sequenced archaeal genomes and extended the analyzed region beyond that covered in previous promoter and Shine-Dalgarno motif studies to provide a more comprehensive understanding of transcription and translation in Archaea. While Shine-Dalgarno motifs are absent for leading genes of leaderless transcripts, they were mostly found to precede internal coding genes within an operon to facilitate translation initiation (Tolstrup, Sensen et al. 2000; Slupska, King et al. 2001; Benelli, Maone et al. 2003). Genomes with a high percentage of Shine-Dalgarno motifs but a low percentage of promoters tend to have more genes arranged in polycistronic operons. Methanopyrus kandleri represents the most extreme case in our analysis, with 77% and 41% of genes having a predicted Shine-Dalgarno motif and a predicted promoter, respectively (Figure 2.8). At the opposite end of the spectrum, Nanoarchaeum equitans consists of mostly single genes, 92% of them predicted with a promoter and only 8% with a Shine-Dalgarno motif. Interestingly, we noticed that Pyrobaculum aerophilum has a similar proportion of predicted promoters and Shine-Dalgarno motifs to Pyrobaculum islandicum, while Pyrobaculum arsenaticum and Pyrobaculum calidifontis are most similar to each other in this respect (Figure 2.8). This observation does not follow the phylogenetic relationships among these species based on 23S rRNA alignments or average protein similarity, in which P. aerophilum is more closely related to P. arsenaticum while P. islandicum is closer to P. calidifontis.

Based on the predicted promoters and Shine-Dalgarno motifs in the studied genomes, we found that genomes dominated by leadered transcripts and genomes dominated by leaderless transcripts are about equally represented in Crenarchaeota (a ratio of 9:11) and Euryarchaeota (a ratio of 9:14), despite the earlier understanding that crenarchaea are dominated by leaderless transcripts and euryarchaea by leadered transcripts (Besemer, Lomsadze et al. 2001; Karlin, Mrazek et al. 2005; Benelli and Londei 2009). We cannot draw any conclusions about Nanoarchaeota, Korarchaeota, and Thaumarchaeota, as the number of sequenced genomes in these phyla is very limited. Surprisingly, however, Nanoarchaeum equitans is very similar to Pyrobaculum in terms of pattern conservation and position distribution for both promoters and Shine-Dalgarno motifs. While previous phylogenetic analysis indicates that N. equitans diverged before the emergence of the Euryarchaeota and Crenarchaeota (Waters, Hohn et al. 2003), its similarity to Pyrobaculum may suggest that Crenarchaeota diverged earlier than Euryarchaeota, with Pyrobaculum closer in descent to the Nanoarchaeota.

The extended range of promoter analysis, which includes coding regions, allowed us to identify internal promoters in multiple species. A combination of RT-PCR and northern analysis confirmed that these internal promoters are active in transcription initiation and provide an alternative mode of single-gene expression in addition to transcription within a polycistronic operon. Because internal promoters are included in our predictions, the overall number of predicted polycistronic operons in each genome, based on the number of predicted promoters, may have been underestimated (Figure 2.8). However, existing operon prediction algorithms that do not consider internal promoters do not cover the archaeal transcript structure comprehensively. The existence of internal promoters reveals a complexity of transcript structure that was not previously anticipated and calls for a more thorough transcription unit prediction model for archaea.

The discovery of Shine-Dalgarno-less leadered transcripts in halophiles has introduced a novel mechanism of translation initiation in Archaea (Brenneis, Hering et al. 2007; Hering, Brenneis et al. 2009). We expect that the new examples we observed in Pyrococcus furiosus also employ the same mechanism. However, the absence of a Shine-Dalgarno motif for PF0250, which is also transcribed as an internal gene of an operon, raises the question of how translation initiates in this situation. We hypothesize that the same novel mechanism of translating a Shine-Dalgarno-less leadered transcript will be used, although it is also possible that the mechanism used for leaderless transcripts is employed. These Shine-Dalgarno-less transcripts rule out the use of Shine-Dalgarno motifs as an indicator of the presence or absence of 5′ UTRs. Although promoter positions can provide an estimated length of the 5′ UTRs, short ones like that of PF1851 in P. furiosus and those presented in the halophile analyses (Brenneis, Hering et al. 2007) are difficult to determine without experimental verification.

The variable 5′ UTR length and the high A/T content in methanogens decrease the accuracy of promoter predictions that have to rely on the translation start sites as the reference positions. Although employing the transcription start sites for promoter predictions significantly increases the accuracy rate, the lack of knowledge of transcription start sites in most archaeal genomes limits the available computational prediction approaches and emphasizes the increasing need for experimental data.

The advancement of high-throughput RNA sequencing (RNA-seq) technologies has brought biological studies into a new arena. Genome-wide transcriptome analyses in microbes such as Sulfolobus solfataricus and Helicobacter pylori provide new insights into alternative transcription start sites and operon arrangements, and identify novel noncoding RNA genes and antisense transcripts (Sharma, Hoffmann et al. 2010; Wurtzel, Sapra et al. 2010). With the increasing adoption of RNA-seq, we anticipate that a large amount of transcript mapping data will become available that can be applied to optimize promoter and Shine-Dalgarno motif prediction models. Non-gene-specific genome-wide promoter and Shine-Dalgarno motif scanning can further improve transcript assembly based on short sequencing reads, leading to novel gene discovery and a better understanding of transcription unit structure.

2.4 Materials and Methods

Genomic data. Complete genomic sequences and annotated ORFs were obtained from NCBI RefSeq (Pruitt, Tatusova et al. 2007). Non-coding RNA transcripts that were not available in GenBank were predicted by using tRNAscan-SE (Lowe and Eddy 1997) and Snoscan (Lowe and Eddy 1999), or extracted from the RNase P Database (Brown 1999) and Ribosomal Database Project II (Cole, Chai et al. 2003). EasyGene (Nielsen and Krogh 2005) and JCVI ORF predictions were obtained from the EasyGene prediction website and the Comprehensive Microbial Resource (Davidsen, Beck et al. 2010), respectively.

Transcriptional signal analysis. Alignments of the 100-nt upstream regions of tRNAs provided an initial approximation of the promoter position and consensus motif pattern. To generate a training set, potential operons were predicted with the criterion of intergenic distance more than or equal to 100 bp on the same strand. A generic consensus 16-nt promoter motif that includes the BRE (1 to 3 As) and the TATA-box was searched within 90 nt upstream of the translation start codon of single known genes and of potential operons with a known leading gene by using MEME (Bailey and Elkan 1994). Known genes were defined as those that are not annotated as hypothetical/putative features nor have COG annotations. Results from MEME were filtered and supplemented with the missing tRNA promoter motifs. A position-specific scoring matrix (PSSM) (Henikoff and Henikoff 1996) was generated from the alignment of the promoter motifs found in known genes. The PSSM was then used to scan the 150-nt upstream region of all protein-coding and non-coding genes to identify potential promoter regions. Ten virtual genomes for each target genome were generated using a fifth-order Markov chain to retain the base frequency of the target genome. To form the null hypothesis, the PSSM was applied to 16-nt fragments randomly selected from each of the virtual genomes, with the number of fragments equal to the number of features in the target genome. Promoter regions identified with the PSSM were filtered according to relative position and a p-value threshold equivalent to that of the lowest-scoring known feature. Results were analyzed based on expected and unexpected positioning. Sequence logos of consensus promoter motifs were generated by using WebLogo (Crooks, Hon et al. 2004).
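The virtual-genome step described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names (train_markov, virtual_genome, random_fragments), the seeding with the first k-mer of the real genome, and the handling of contexts without a successor are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def train_markov(genome, order=5):
    """Count which base follows each k-mer (k = order) in the genome."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - order):
        counts[genome[i:i + order]][genome[i + order]] += 1
    return counts

def virtual_genome(genome, order=5, rng=None):
    """Generate a random sequence of the same length as `genome` that
    follows the genome's k-th order transition frequencies."""
    rng = rng or random.Random()
    counts = train_markov(genome, order)
    seq = list(genome[:order])  # assumption: seed with the real first k-mer
    while len(seq) < len(genome):
        followers = counts.get("".join(seq[-order:]))
        if not followers:
            # Context seen only at the very end of the genome:
            # restart from a randomly chosen observed context.
            seq.extend(rng.choice(sorted(counts)))
            continue
        bases, weights = zip(*followers.items())
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:len(genome)])

def random_fragments(seq, n, width=16, rng=None):
    """Draw n random fragments of the given width for null-model scoring."""
    rng = rng or random.Random()
    starts = [rng.randrange(len(seq) - width + 1) for _ in range(n)]
    return [seq[s:s + width] for s in starts]
```

Scoring the fragments returned by random_fragments with the PSSM would give one null score distribution per virtual genome, against which the score of the lowest-scoring known feature can be compared.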

Translational signal analysis. A generic consensus 10-nt Shine-Dalgarno motif was searched by using MEME (Bailey and Elkan 1994) within the 25-nt upstream region of known proteins with an upstream intergenic distance of at most 20 nt in Cenarchaeum, Nanoarchaeum, and Pyrobaculum, and of at most 40 nt in the other studied genomes. Results from MEME were filtered based on pattern similarity to the consensus motif and position relative to the translation start codon. A PSSM (Henikoff and Henikoff 1996) was generated from the alignment of the Shine-Dalgarno motifs found in known genes. The PSSM was then used to scan the 25-nt upstream region of all protein-coding genes to identify potential Shine-Dalgarno regions. To form the null hypothesis, the PSSM was also applied to fragments randomly selected from each of the virtual genomes, with the number of fragments equal to the number of features in the target genome. Shine-Dalgarno regions identified with the PSSM were filtered according to relative position and a p-value threshold equivalent to that of the lowest-scoring known feature. Results were analyzed together with the existence and positioning of promoter regions. Sequence logos of consensus Shine-Dalgarno motifs were generated by using WebLogo (Crooks, Hon et al. 2004).
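As a rough illustration of the PSSM scanning used for both promoter and Shine-Dalgarno motifs, the following Python sketch builds a log-odds matrix from aligned motif instances and reports the best-scoring window in an upstream region together with its center position relative to the translation start. Several choices are assumptions, not the settings of this study: a uniform background, a simple add-one pseudocount (rather than the position-based pseudocounts of Henikoff and Henikoff), the center defined as the first base plus half the motif width, and the helper names build_pssm and scan_upstream. Input sequences are assumed to contain only A, C, G, and T.

```python
import math

BASES = "ACGT"

def build_pssm(motifs, pseudocount=1.0, background=None):
    """Log-odds PSSM from equal-length aligned motif instances."""
    background = background or {b: 0.25 for b in BASES}
    total = len(motifs) + pseudocount * len(BASES)
    pssm = []
    for column in zip(*motifs):
        pssm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background[b])
            for b in BASES
        })
    return pssm

def scan_upstream(pssm, upstream):
    """Return (score, center) of the best window in `upstream`.

    `upstream` is the sequence immediately 5' of the start codon,
    written 5'->3', so its last base is position -1.
    """
    width = len(pssm)
    hits = []
    for i in range(len(upstream) - width + 1):
        window = upstream[i:i + width]
        score = sum(col[base] for col, base in zip(pssm, window))
        first = i - len(upstream)          # position of the window's first base
        hits.append((score, first + width // 2))
    return max(hits) if hits else None
```

For a 10-nt motif occupying positions -15 to -6 of a 25-nt upstream region, this convention reports a center of -10.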

Primer extension. To map the 5′ end transcription start site, gene-specific primers complementary to the 5′ end of the target cDNAs were designed and end-labeled with 32P using T4 polynucleotide kinase. The labeled primers (0.375 pmol/µl) were annealed to Pyrobaculum aerophilum or Pyrococcus furiosus total RNA (3 µg) at 60°C for 5 minutes and extended at 55°C for 30 minutes using the Invitrogen Superscript III or Thermoscript reverse transcriptase (RT) system. Products were analyzed by electrophoresis next to a 10-bp DNA ladder on a 10% polyacrylamide-urea gel.
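The arithmetic behind reading a transcription start site from a primer extension product can be sketched as follows. This is an illustrative helper, not part of the original protocol; it assumes that the primer position refers to where the 5′ end of the primer anneals relative to the translation start (first base of the start codon = +1, no position 0) and that the product length read from the ladder includes the primer.

```python
def tss_position(primer_5p_pos, product_len):
    """Transcription start site relative to the translation start.

    primer_5p_pos: position (+1 = first base of the start codon) at which
                   the 5' end of the reverse primer anneals (assumption).
    product_len:   length in nt of the extended cDNA, primer included.
    """
    if primer_5p_pos < 1 or product_len < 1:
        raise ValueError("positions and lengths must be positive")
    pos = primer_5p_pos - product_len + 1
    # There is no position 0: the base before +1 is -1.
    return pos if pos >= 1 else pos - 1
```

Under these assumptions, a primer annealing at +49 that yields a 49-nt product indicates a leaderless transcript (start site at +1), while a 60-nt product places the start site at -11.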

Northern analysis. Pyrobaculum aerophilum or Pyrococcus furiosus total RNA aliquots (15 µg) were electrophoresed on 1% agarose gel containing glyoxal, blotted to Hybond-N nylon membranes in 20X SSC, and UV cross-linked. Probes of target genes generated by polymerase chain reaction (PCR) with gene-specific primers were body-labeled with 32P using Ambion Strip-EZ PCR kit and hybridized to blotted membranes in Ambion ULTRAhyb hybridization buffer at 42°C overnight. Final high stringency washing conditions were 0.01X SSC, 0.1% SDS at 68°C for 15 minutes twice.

RT-PCR. First strand cDNAs were synthesized from Pyrococcus furiosus total RNA using gene-specific reverse primers in the Invitrogen Superscript III reverse transcriptase system. These cDNA templates were PCR-amplified using primers spanning genes included in predicted operons. PCR parameters were 30 cycles of denaturation at 94°C for 30 seconds, annealing at 50°C for 1 minute, and extension at 72°C for 2 minutes using Applied Biosystems AmpliTaq DNA polymerase.


Figure 2.1 Comparison of promoter motif conservation Sequence logo for each genome was generated by multiple alignments of 16-nt predicted promoters for all annotated proteins and noncoding RNAs. Predicted promoters include transcription factor B recognition element (BRE) and TATA-box. Relatively low level of conservation is shown in the BRE region in Ignicoccus hospitalis in comparison to the other genomes in the same and different phylogenetic families, while the TATA-box region is conserved in all studied genomes.


Figure 2.2 Predicted Shine-Dalgarno position distributions and motif conservation A. The center position of the 10-nt predicted Shine-Dalgarno motif was used to determine motif positions relative to translation start sites. Percentage on the Y-axis represents the proportion of predicted Shine-Dalgarno motifs observed at a particular relative position. The typical observed Shine-Dalgarno motif positioning is represented by Pyrococcus furiosus and Sulfolobus solfataricus in comparison to the relatively uniform distribution in Pyrobaculum aerophilum. B-D. Sequence logo for each genome was generated by multiple alignments of predicted Shine-Dalgarno motifs for all annotated protein-coding genes. B. The typical predicted consensus Shine-Dalgarno motif is represented by P. furiosus. C. In comparison, P. aerophilum has a shorter consensus Shine-Dalgarno motif. D. Aeropyrum pernix has a more G-rich consensus Shine-Dalgarno motif that may be due to its high G-C content.


Figure 2.3 Predicted promoter motif position distributions The center position of the 16-nt predicted promoters was used to determine motif positions relative to translation start sites. Percentage on the Y-axis represents the amount of predicted promoters observed at a particular range of relative positions. A. Pyrobaculum aerophilum represents genomes with mostly leaderless transcripts. The majority of the predicted promoters were found between -24 and -30. B. Predicted promoters in Ignicoccus hospitalis were mostly observed between -22 and -53, implying a mix of leadered and leaderless transcripts.

Figure 2.4 Schematic diagram of leadered and leaderless transcript layout A. Igag_0186 in Ignisphaera aggregans was predicted to be a leaderless transcript or to have a very short 5′ UTR that does not have a Shine-Dalgarno motif. Igag_0187 was predicted to be a leadered transcript with a Shine-Dalgarno motif upstream of the translation start site. B. SSO1460 in Sulfolobus solfataricus was found to be a leadered transcript with a predicted Shine-Dalgarno motif and a transcription start site (highlighted T) verified by RNA sequencing results. Orange, coding regions. Green, predicted promoters. Blue, predicted Shine-Dalgarno motifs. Solid arrows, direction of transcription. Dotted arrows, association of regulatory elements with transcripts.


Figure 2.5 Transcription start site mapping of Shine-Dalgarno-less transcripts in Pyrococcus furiosus A. Schematic representation of transcript layout for PF1194 and PF1851. Orange, coding regions. Green, predicted promoters. Arrows, direction of transcription at transcription start sites. B. Primer extension was conducted with gene-specific primers and P. furiosus total RNA for reverse transcription. TSS, transcription start site. Lane 1, PF1194 mapped by a 20-nt reverse primer started at +49. Lane 2, PF1851 mapped by a 21-nt reverse primer started at +41. M, 10bp DNA ladder.


Figure 2.6 Highly variable lengths of 5′ UTRs in Methanocaldococcus jannaschii The center position of the 16-nt predicted promoters was used to determine motif positions relative to translation start sites. Percentage on the Y-axis represents the amount of predicted promoters observed at a particular position. 60% of the predicted promoters were found between positions -26 and -62.


Figure 2.7 Internal promoters in Pyrobaculum aerophilum A. Schematic representation of transcript layout for PAE1971, PAE2207, and PAE3601 utilizing internal promoters for transcription. Orange, coding regions. Red box, coding region of the upstream gene. Green, predicted internal promoters. Arrows, direction of transcription. Highlighted bases, translation start sites. B. Primer extension was conducted with PAE1971-specific primers and P. aerophilum total RNA for reverse transcription. TSS, transcription start site. Lane 1, PAE1971 mapped by a 19-nt reverse primer started at +54. Lane 2, PAE1971 mapped by a 19-nt reverse primer started at +90. PC, PAE1488 mapped by a 22-nt reverse primer started at +58 as a leaderless transcript positive control. M, 10bp DNA ladder. C. Primer extension was conducted with PAE2207-specific primer and P. aerophilum total RNA for reverse transcription. TSS, transcription start site. Lane 1, PAE2207 mapped by a 20-nt reverse primer started at +39. M, 10bp DNA ladder. D. P. aerophilum total RNA extracted from cell cultures at time points 3hrs (Lane 1) and 6hrs (Lane 2) was hybridized with probes derived from PCR products of PAE3600 for northern analysis. Band at 1.6KB represents PAE3600 transcribed as a single gene. Band at 2.2KB represents PAE3600 transcribed in an operon with PAE3601.


Figure 2.8 Shine-Dalgarno-less 5′ UTR for PF0250 in Pyrococcus furiosus A. Primer extension was conducted with PF0250-specific primers and P. furiosus total RNA for reverse transcription. TSS, transcription start site. Lane 1, PF0250 mapped by an 18-nt reverse primer started at +32. Lane 2, PF0250 mapped by an 18-nt reverse primer started at +38. M, 10bp DNA ladder. B. Northern analysis was performed with hybridization of probes derived from PCR products of PF0250 to Pyrococcus furiosus total RNA. M1, Ambion RNA century marker. M2, Ambion RNA millennium marker. C. Schematic representation of possible transcripts that include PF0250. Orange, coding regions. Red box, coding region of the upstream gene. Green, predicted internal promoter. Blue, predicted Shine-Dalgarno motif. Solid arrows, direction of transcription. Dotted arrow, association of internal predicted promoter with transcript.


Figure 2.9 Summary of predicted promoters and Shine-Dalgarno motifs in 46 archaeal genomes Promoter percentage represents the amount of predicted promoters identified with all annotated coding and noncoding genes in a genome. Shine-Dalgarno percentage represents the amount of predicted Shine-Dalgarno motifs identified with all annotated coding genes in a genome.

Table 2.1 Predicted promoter relative positions in 46 archaeal genomes. Both protein-coding and non-coding genes annotated in NCBI RefSeq (Pruitt, Tatusova et al. 2007) were included for promoter predictions. Promoter positions are relative to the translation start codon of protein-coding genes or to mature non-coding RNA start sites. The center position of the 16-nt predicted promoter motifs that include BRE and TATA box was used as the promoter position. Median position range was computed as the smallest range of relative positions that contains at least 60% of the predicted promoters. The accuracy rate of promoter predictions may be lower for genomes with high A/T content, misannotated translation start codons, or no transcription factor B recognition element (BRE).
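The median position range defined in this caption can be computed with a sliding window over the sorted positions. The short Python sketch below is one possible way to do so; the function name smallest_range and the tie-breaking rule (the first narrowest window is kept) are assumptions for illustration, and the positions in the usage note are made up.

```python
import math

def smallest_range(positions, fraction=0.6):
    """Smallest (low, high) range of relative positions that contains
    at least `fraction` of the predicted promoter positions."""
    if not positions:
        raise ValueError("no positions given")
    pos = sorted(positions)
    need = max(1, math.ceil(fraction * len(pos)))  # promoters the range must hold
    best = None
    for i in range(len(pos) - need + 1):
        low, high = pos[i], pos[i + need - 1]
        if best is None or high - low < best[1] - best[0]:
            best = (low, high)
    return best
```

For the hypothetical positions [-30, -28, -27, -26, -26, -25, -24, -50, -60, -10], six of the ten values must be covered, and the narrowest such range is -28 to -24.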


Table 2.2 Predicted Shine-Dalgarno (SD) motifs in relationship with predicted promoters. Protein-coding genes annotated in NCBI RefSeq (Pruitt, Tatusova et al. 2007) were included for Shine-Dalgarno motif predictions. The center position of the 10-nt predicted Shine-Dalgarno motifs and that of the 16-nt predicted promoter motifs were used to determine distances.


Chapter 3

Discovery of a minimal form of RNase P in Pyrobaculum2

2 This chapter is a manuscript co-written with Lien B. Lai, Venkat Gopalan, and Todd M. Lowe, and appears in Lai L.B.*, P.P. Chan*, A.E. Cozen, D.L. Bernick, J.W. Brown, V. Gopalan, and T.M. Lowe. (2010) Discovery of a minimal form of RNase P in Pyrobaculum. PNAS (In Press).

*These authors contributed equally to this work.

3.1 Abstract

RNase P RNA is an ancient, nearly universal feature of life. As part of the ribonucleoprotein RNase P complex, the RNA component catalyzes essential removal of 5ʹ leaders in precursor transfer RNAs (pre-tRNAs). In 2004, Li and Altman computationally identified the RNase P RNA gene in all but three sequenced microbes: Nanoarchaeum equitans, Pyrobaculum aerophilum, and Aquifex aeolicus (all hyperthermophiles). A recent study concluded that N. equitans does not have or require RNase P activity because it lacks 5ʹ tRNA leaders. The “missing” RNase P RNAs in the other two species are perplexing given evidence or predictions that tRNAs are trimmed in both, prompting speculation that they may have developed novel alternatives to 5ʹ pre-tRNA processing. Using comparative genomics and improved computational methods, we have now identified a radically minimized form of the RNase P RNA in five Pyrobaculum species and the related crenarchaea Caldivirga maquilingensis and Vulcanisaeta distributa, all retaining a conventional catalytic domain, but lacking a recognizable specificity domain. We confirmed 5ʹ tRNA processing activity by high-throughput RNA sequencing and in vitro biochemical assays. The Pyrobaculum and Caldivirga RNase P RNAs are the smallest naturally occurring form yet discovered to function as trans-acting ribozymes. Loss of the specificity domain in these RNAs suggests altered substrate specificity, and could be a useful model for finding other potential roles of RNase P. This study illustrates an effective combination of next-generation RNA sequencing, computational genomics, and biochemistry to identify a divergent, formerly undetectable variant of an essential non-coding RNA gene.

3.2 Introduction

RNase P is best known for its role in removing the 5ʹ leaders of pre-tRNAs, an essential step in tRNA maturation. It also processes other RNAs in bacteria and eukaryotes, but these roles are less understood (Kazantsev and Pace 2006; Coughlin, Pleiss et al. 2008; Liu and Altman 2010). RNase P typically functions as an RNA-protein complex, comprised of one conserved RNA and a varying number of protein subunits, depending on the domain of life: one in Bacteria, at least four in Archaea, and nine or more in the eukaryotic nucleus (Hall and Brown 2001; Lai, Vioque et al. 2010). A precedent in which the RNA component is missing entirely is found in human and Arabidopsis organellar RNase P (Holzmann, Frank et al. 2008; Gobert, Gutmann et al. 2010), although a recent study suggests the possible co-existence of protein-only and RNA-protein-based RNase P complexes in human mitochondria (Wang, Chen et al. 2010).

The inability to identify RNase P in some organisms has sown doubts about whether it is a universal feature of life. Studies of the hyperthermophilic bacterium Aquifex aeolicus showed that it exhibits RNase P-like trimming of tRNAs (Lombo and Kaberdin 2008; Marszalkowski, Willkomm et al. 2008), yet a gene for the expected protein component is absent and the RNA has remained elusive (Willkomm, Minnerup et al. 2005), prompting speculation that it may have developed a unique solution for pre-tRNA processing (Marszalkowski, Willkomm et al. 2008). Perhaps most surprisingly, Söll and colleagues (Randau, Schroder et al. 2008) demonstrated that the archaeal symbiont Nanoarchaeum equitans does not contain any identifiable RNase P genes or detectable RNase P activity, and appears to lack tRNA leaders entirely. These findings leave RNase P conspicuously absent in just one other studied microbial species: Pyrobaculum aerophilum, a hyperthermophilic crenarchaeon that has been refractory to prior biochemical (Randau, Schroder et al. 2008) and computational (Li and Altman 2004; Gardner, Daub et al. 2009) identification efforts.

Now, with the advent of new genome and RNA sequencing, augmented by improved computational search methods, we were able to uncover a unique form of RNase P in multiple Pyrobaculum species and related genera.

3.3 Results and Discussion

3.3.1 Pre-tRNAs in Pyrobaculum have 5ʹ leaders

We first obtained evidence for RNase P activity in Pyrobaculum using comparative genomics and RNA sequencing. The genomes of four Pyrobaculum species (Pyrobaculum arsenaticum, Pyrobaculum calidifontis, Pyrobaculum islandicum, and Thermoproteus neutrophilus [to be reclassified as a Pyrobaculum species]) were recently sequenced in collaboration with the Joint Genome Institute, providing extensive comparative information. As in P. aerophilum, the RNase P RNA genes could not be identified in these genomes using existing computational methods (Li and Altman 2004; Gardner, Daub et al. 2009) (see Methods). However, alignment of orthologous tRNA loci and upstream promoter regions from all these Pyrobaculum species provided compelling evolutionary evidence for pre-tRNA leaders, thus hinting at a requirement for RNase P. If no RNase P activity is present to remove 5ʹ leaders, then one should expect very little or no variation in the distance between the TATA sequence and the 5ʹ end of the mature tRNA gene, especially among orthologs. In fact, we counted more than 12 tRNA ortholog groups where the distance between the promoter and tRNA gene varies by at least two nucleotides among species (Figure 3.1). Next, we examined native transcripts from tRNA loci for four of these species to confirm the computational observations. High-throughput RNA sequencing reads of small RNAs identified many tRNA transcripts with 5ʹ leaders: 15 in P. aerophilum, 17 in P. arsenaticum, 18 in P. calidifontis, and 21 in P. islandicum have 1- to 6-nt leaders (Table 3.1 shows counts by leader length). All of the sequenced pre-tRNAs with 5ʹ leaders were also found in their mature form. These data strongly suggest that Pyrobaculum has some form of pre-tRNA 5ʹ-processing activity.
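The ortholog comparison described above amounts to measuring, for each tRNA ortholog group, the spread of promoter-to-tRNA distances across species. A minimal Python sketch follows; the function name variable_groups and the nested-dictionary input layout are assumptions for illustration, and the distances in the usage note are invented rather than taken from Figure 3.1.

```python
def variable_groups(distances_by_group, min_spread=2):
    """Return the ortholog groups whose promoter-to-tRNA distance
    differs by at least `min_spread` nt among species.

    distances_by_group maps a group name to {species: distance in nt}.
    """
    return sorted(
        group
        for group, by_species in distances_by_group.items()
        if len(by_species) > 1
        and max(by_species.values()) - min(by_species.values()) >= min_spread
    )
```

With the hypothetical input {"tRNA-Phe": {"Pae": 25, "Par": 27, "Pca": 25}, "tRNA-Gly": {"Pae": 24, "Par": 24}}, only "tRNA-Phe" is returned, because its distances span two nucleotides.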

3.3.2 Pyrobaculum aerophilum cell extract processes 5ʹ leader from pre-tRNA

Because RNase P activity was previously not found in a P. aerophilum crude extract (Randau, Schroder et al. 2008), we opted to assay this activity after chromatographic fractionation. When a crude lysate of P. aerophilum was subjected to successive weak cation (CM)- and anion (DEAE)-exchange matrices, a peak of 5ʹ-processing activity (Figure 3.2) was observed with a P. aerophilum pre-tRNAPhe (Figure 3.3A) as the substrate. The products generated were identical in size to those obtained with in vitro reconstituted E. coli RNase P (Guerrier-Takada, Gardiner et al. 1983) when pre-tRNAPhe containing uridine at the -1 position (U-1) was used (Figure 3.3B); cleavage took place as expected between nucleotides -1 and +1 (Figure 3.4 provides additional data characterizing cleavage-site selection). Another hallmark of RNase P-mediated processing is the presence of a 5ʹ phosphate in the mature tRNA product (Guerrier-Takada, Gardiner et al. 1983). Thin-layer chromatography of mature tRNAPhe generated by P. aerophilum RNase P and subsequently digested with RNase T2 showed the presence of a 5ʹ phosphate on G+1 (pGp in Figure 3.3C), as observed with E. coli RNase P. Collectively, these results establish that the partially purified pre-tRNA 5ʹ-processing activity in P. aerophilum is indeed RNase P.

3.3.3 Evidence for three out of four known archaeal RNase P proteins in Pyrobaculum

All known archaeal RNase P complexes require both RNA and protein components, so we tested P. aerophilum RNase P for these constituents. Treatment with proteinase K, which degrades proteins, eliminated the P. aerophilum RNase P activity (Figure 3.5A), raising the question of the protein component identities. Of the four established archaeal RNase P proteins (Rpp29, Rpp30, Pop5, Rpp21), Pfam (Finn, Tate et al. 2008) and other sequence profile searches identified Pyrobaculum orthologs for Rpp29 [first noted by Hartmann and Hartmann (Hartmann and Hartmann 2003)], gave weak support for Rpp30 candidates, but had no predictions for Pop5 or Rpp21 orthologs. Koonin and colleagues (Koonin, Wolf et al. 2001) previously noted that the Rpp30 and Pop5 genes are sometimes adjacent to each other, and are often in the same operon as the gene for a predicted exosome protein (DUF54/COG1325). In Pyrobaculum species, this exosome gene clearly identifies a well-conserved operon that contains both the candidate Rpp30 genes and plausible Pop5 homologs (Table 3.2). Alignment of the Pyrobaculum candidates (PAE1830 for Rpp30; PAE1829 for Pop5) with the known archaeal protein homologs reveals potential homology: the Pyrobaculum candidates retain most of the highly conserved residues, but have lost some conserved segments, making the proteins 10-25% shorter than known archaeal family members. Alignment of the predicted Rpp29 ortholog (PAE1777) to archaeal Rpp29 proteins provides similar positive support. Further operon and sequence analyses, including a highly sensitive structure-based protein search (personal communication, K. Karplus), failed to identify the Rpp21 ortholog in any Pyrobaculum genome. These results suggest that only three of the four known archaeal RNase P proteins exist in Pyrobaculum.

3.3.4 Discovery and in vitro activity of the minimized Pyrobaculum RNase P RNA

Given these apparent changes in the RNase P proteins, it was still unclear how the missing Pyrobaculum RNA might have changed to elude prior detection. Therefore, we made no assumptions about the RNA’s primary sequence length or secondary structure features. Using the UCSC Archaeal Genome Browser (Schneider, Pollard et al. 2006) (archaea.ucsc.edu), we aligned the four new Pyrobaculum genomes with P. aerophilum, allowing us to focus on the most highly conserved regions common to all five species. After removing known RNA genes (rRNAs, tRNAs, C/D box sRNAs, signal recognition particle RNA), only a few regions were conserved >90%, forming an enriched group to study. In one of these conserved regions, we detected a very weak partial hit to archaeal RNase P RNA using the Infernal RNA search program (Nawrocki, Kolbe et al. 2009), but we also observed properties expected in good candidates: high (73-77%) guanine/cytosine content typical for a hyperthermophilic structural RNA, a strong transcription factor B recognition element (BRE)-TATA promoter sequence upstream, and extremely high conservation (Figure 3.6). Although the encoded RNA is much shorter than any identified non-organellar RNase P RNAs (Brown 1999), the Pyrobaculum candidates can be folded into an RNase P-like consensus secondary structure (Figures 3.7A and 3.8) with one surprising difference. All known bacterial and archaeal RNase P RNAs consist of both a substrate-specificity (S) domain that aids substrate recognition, and a catalytic (C) domain essential for phosphodiester cleavage (Loria and Pan 1996; Tsai, Pulukkunat et al. 2006). The Pyrobaculum RNase P RNA candidates have lost most of their S domain, but retain an intact C domain that includes all 11 universally conserved nucleotides (Marquez, Harris et al. 2005) (Figures 3.7A and 3.8).
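Two of the candidate properties mentioned above, guanine/cytosine content and conservation across the aligned genomes, are simple to compute. The Python sketch below is illustrative only: the function names are hypothetical, and conservation is approximated as the fraction of alignment columns in which all sequences share the same base, which is an assumption and not necessarily how conservation was measured in the genome browser.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    bases = sum(seq.count(b) for b in "ACGTU")
    gc = seq.count("G") + seq.count("C")
    return gc / bases if bases else 0.0

def identical_column_fraction(aligned):
    """Fraction of alignment columns in which every sequence has the
    same non-gap base (a simple proxy for conservation)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    if length == 0:
        return 0.0
    same = sum(
        1 for col in zip(*aligned)
        if col[0] != "-" and len(set(col)) == 1
    )
    return same / length
```

A region would pass the filters described in the text if, for example, identical_column_fraction exceeded 0.9 across the five genomes and gc_content fell in the 0.73 to 0.77 range.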

To verify the candidate P. aerophilum RNase P RNA, we first confirmed expression by northern analysis of total RNA (Figure 3.7B). Second, to assess association of this RNA with the partially purified, native P. aerophilum RNase P holoenzyme, we designed PaeRPR-L15, an antisense 14-mer RNA oligonucleotide which we expected would invade an essential loop region in the RNA moiety (nucleotides 157 to 170; Figure 3.7A) and thereby interfere with activity. As shown for bacterial RNase P (Gruegelsiepe, Willkomm et al. 2003), we observed a progressive decrease in native P. aerophilum RNase P activity when the enzyme was pre-incubated with increasing concentrations of this oligonucleotide (Figure 3.5B), and the extent of this inhibition was more pronounced than that observed with a non-specific oligonucleotide (see Methods). We did not expect complete inhibition since PaeRPR-L15’s access to the RNA moiety of P. aerophilum RNase P could be hampered by enveloping protein subunits. Such a premise is consistent with our findings that P. aerophilum RNase P could not be inactivated with micrococcal nuclease, for which there is a precedent in another archaeal hyperthermophile (Darr, Pace et al. 1990). Last, but most importantly, an in vitro transcript of the P. aerophilum candidate RNase P RNA was able to process pre-tRNAPhe (Figure 3.7C), confirming it is indeed the missing RNase P ribozyme. As a negative control, deletion of the universally conserved, bulged U51 (Figure 3.7A) in the same transcript eliminated the RNA-alone activity (Figure 3.7C, lane 2); this bulged U in bacterial RNase P RNA appears to be required for maintaining a unique geometry essential for substrate positioning and binding of catalytically important Mg2+ ions (Kaye, Zahler et al. 2002). As with the partially purified native enzyme, pre-incubation of the in vitro RNase P RNA transcript with PaeRPR-L15 drastically decreased its activity as expected.

3.3.5 Phylogenetic distribution of the minimized form of RNase P RNA

Equipped with a broader view of RNase P RNA, we surveyed all 88 currently available, complete archaeal genomes to determine which have traditional, Pyrobaculum-like, or possibly other missing/undetectable versions. To search for traditional RNase P RNA genes, we used the Rfam archaeal RNase P covariance model (Gardner, Daub et al. 2009) and the rule-based method by Li and Altman (Li and Altman 2004). Although slower, the covariance model was much more sensitive, finding matches in every species except N. equitans, Pyrobaculum spp., and unexpectedly, Caldivirga maquilingensis and Vulcanisaeta distributa, two crenarchaea in the same family (Thermoproteaceae) as Pyrobaculum (Table 3.3). By employing a covariance model based on the unusual Pyrobaculum RNAs, we uncovered a similarly shortened form of RNase P RNA in C. maquilingensis and V. distributa (Figures 3.7A and 3.8; Table 3.3), now accounting for an RNase P RNA gene in all complete archaeal genomes except N. equitans (Randau, Schroder et al. 2008). We established expression of the C. maquilingensis candidate RNase P RNA in vivo (Figure 3.7B) and that an in vitro transcript of this RNA can process pre-tRNAPhe (Figures 3.7C and 3.9). As with its Pyrobaculum counterpart, deletion of the bulged U63 together with C62 and C64 in the C. maquilingensis transcript eliminated the RNA-alone activity (Figure 3.7C).

Both of these RNAs display all the hallmarks of other RNase P RNAs: they cleave at the expected site (Figure 3.9B), they generate 5ʹ-phosphate and 3ʹ-hydroxyl termini (Figure 3.9C), and they are capable of cleaving pre-tRNAPhe with a short and long leader (Figure 3.9B). Taken together, these data suggest that our new Pyrobaculum-based covariance search model is complementary to the existing Rfam RNase P model, and should facilitate detection of other shortened forms of RNase P (dubbed “type T” for the phylogenetic family Thermoproteaceae).

A search for the RNase P proteins in Caldivirga and Vulcanisaeta revealed likely homologs for Pop5, Rpp30, and Rpp29 (see Methods), but, as in Pyrobaculum, no apparent ortholog of Rpp21. Because Pyrobaculum, Caldivirga, and Vulcanisaeta are closely related genera, it is most parsimonious to assume that the type T RNase P RNA was a feature of the common ancestor of this family. Outside of Thermoproteaceae, the next most closely related species with a decoded genome is Thermofilum pendens, which seems to possess a typical RNase P RNA (Gardner, Daub et al. 2009) with a traditional S domain and all four traditional archaeal RNase P proteins (Tables 3.2 and 3.3). Thus, the conspicuous absence of Rpp21 in Thermoproteaceae may indicate that the smaller type T RNA has fewer cognate RNase P proteins, or that Rpp21 has diverged too extensively to be detected by sequence similarity searches. Reconstitution studies have revealed that the archaeal RNase P proteins function as two binary complexes: Pop5•Rpp30 and Rpp21•Rpp29 (Tsai, Pulukkunat et al. 2006). Footprinting studies indicate that Pop5•Rpp30 interacts with the C domain, and Rpp21•Rpp29 with the S domain of archaeal RNase P RNA (Tsai, Pulukkunat et al. 2006; Xu, Amero et al. 2009). Thus, the loss or radical change of Rpp21 would be consistent with loss of the traditional S domain in type T RNase P RNAs.

3.3.6 Search for RNase P RNA in Aquifex and Related Species

With the identification of the smaller RNase P RNA in Pyrobaculum and related archaeal species, we decided to apply existing and our newly developed covariance search models to four bacterial species in which RNase P RNA appears to be absent. These species include the well-studied Aquifex aeolicus, as well as three other species also in the Aquificaceae family (Hydrogenivirga sp. 128-5-R1-1, Hydrogenobacter thermophilus, and Hydrogenobaculum sp. Y04AAS1). We searched these genomes using all existing bacterial, archaeal, and eukaryotic RNase P RNA models in Rfam (Gardner, Daub et al. 2009), as well as our new Pyrobaculum-specific model, producing no good candidates. Because there are significant structural differences between bacterial and archaeal RNase P RNAs (Hall and Brown 2001), we also created a new model based on the RNase P RNAs from the most closely related bacterial species that have known RNase P RNAs, and artificially removed the specificity domain. The shortened bacterial search model produced no good candidates in any of the four genomes.

3.4 Conclusions

The reduced size of the type T form of RNase P RNA may correlate with unique features of this group of organisms. Pyrobaculum and Vulcanisaeta species are unusual for the large number of tRNAs with more than one intron, many at “non-canonical” positions (Marck and Grosjean 2003; Chan and Lowe 2009). Caldivirga does not have as many atypical tRNA introns, but does contain a number of trans-spliced split tRNAs (Fujishima, Sugahara et al. 2009), a rare trait shared only with N. equitans (Randau, Munch et al. 2005). It is unclear if changes in the pre-tRNA structures are related to the altered substrate recognition domain, but a link would not be surprising.

Our discovery of the drastically smaller type T RNase P emphasizes the striking and surprising plasticity in the subunit composition of an essential and ubiquitous enzyme. This flexibility may be an adaptive trait related to the variability in the RNAs that are substrates for RNase P. In bacteria, these assorted substrates include 4.5S RNA, tmRNA, viral RNAs, riboswitches, and some mRNAs (Kazantsev and Pace 2006). In the Thermoproteaceae, type T RNase P may play a similarly expanded role in small RNA processing.

Collectively, our findings highlight the rich evolutionary story of RNase P and offer new opportunities for structural, comparative, and functional studies. As the cost of genome and transcriptome sequencing continues to decrease, we expect the combination of comparative genomics and improved RNA search models will continue to reveal exceptional cases advancing RNA biology.

3.5 Materials and Methods

Archaeal RNase P RNA sequence search. Infernal v1.0 (Nawrocki, Kolbe et al. 2009) was used to search for RNase P RNA candidates in archaeal genomes. The program was initially run in the global search mode using the Rfam (Gardner, Daub et al. 2009) archaeal RNase P RNA covariance model (RF00373). All hits with a score > 0 bits were manually examined. Local search mode was also employed, which provided better sensitivity but decreased selectivity. Using local search mode with a threshold of 0 bits (necessary to initially detect the Pyrobaculum RNase P RNA with the RF00373 model) was only feasible when searching a small set of short candidate regions (i.e., not entire genomes) because low-scoring partial hits required close manual inspection and further structural analyses.

The rule-based program by Li and Altman (Li and Altman 2004) was used to search for RNase P RNA candidates in all available archaeal genomes retrieved from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. The genome of Vulcanisaeta distributa was downloaded from the publicly accessible Integrated Microbial Genomes with Genome Encyclopedia of Bacteria and Archaea Genomes website (http://img.jgi.doe.gov/cgi-bin/geba/main.cgi?section=TaxonDetail&taxon_oid=2502790013). Because hits were not found in a number of genomes, Infernal (global search mode) was applied using the Rfam RF00373 model. A universal primary sequence pattern for RNase P RNA, modified from the original Li and Altman pattern, was used to pre-screen candidates that were then analyzed with Infernal. For archaeal genomes producing no hits using this fast pre-screening method, whole genome searches were conducted with Infernal alone (greatly slowing search speed, but improving search sensitivity). The Pyrobaculum RNase P RNA covariance model was built with Infernal using the five Pyrobaculum RNA sequences and a manually predicted secondary structure (Figure 3.7A).
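
The pre-screen-then-score strategy described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the motif TOY_PATTERN is a hypothetical placeholder (the modified Li and Altman pattern is not reproduced in the text), and score_window is a caller-supplied stand-in for covariance-model scoring, which in practice would be delegated to Infernal. The sketch only shows the control flow: a fast primary-sequence filter on both strands, a fallback to the whole genome when the filter finds nothing, and a 0-bit reporting threshold.

```python
import re

# HYPOTHETICAL motif used only for illustration; the real pre-screening
# pattern (modified from Li and Altman) is not given in the text.
TOY_PATTERN = re.compile(r"GGAA[ACGT]{2,6}TCC")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def prescreen(genome, pattern=TOY_PATTERN, flank=20):
    """Return (strand, start, end, window) for each pattern match.

    Coordinates are relative to the strand that was scanned.
    """
    candidates = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for match in pattern.finditer(seq):
            start = max(0, match.start() - flank)
            end = min(len(seq), match.end() + flank)
            candidates.append((strand, start, end, seq[start:end]))
    return candidates


def two_stage_search(genome, score_window, threshold=0.0, flank=20):
    """Pre-screen, then score each window; keep hits scoring above threshold.

    If the pre-screen finds nothing, both full strands are scored instead
    (slower, but more sensitive).
    """
    candidates = prescreen(genome, flank=flank)
    if not candidates:
        rc = reverse_complement(genome)
        candidates = [("+", 0, len(genome), genome), ("-", 0, len(rc), rc)]
    hits = []
    for candidate in candidates:
        score = score_window(candidate[3])
        if score > threshold:
            hits.append((candidate, score))
    return hits
```

Passing the scoring step in as a function keeps the filter logic testable without a covariance model; any hit list produced this way would still need the manual inspection described above.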

Culture conditions for Pyrobaculum aerophilum. P. aerophilum cultures derived from Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) strain DSM 7523 were generously provided by Christopher House (Penn State University, University Park, PA). P. aerophilum cultures used for RNase P purification were grown in a slightly modified version of DSMZ DSM390 medium (Cozen, Weirauch et al. 2009), amended with 1% tryptone and 0.1% yeast extract to produce higher cell yields. These cultures were grown micro-aerobically at 95˚C under a gas headspace of N2 plus 2% or 3% O2 until early-log phase, then shifted to fully aerobic growth under atmospheric air and collected at mid- to late-log phase.

Culture conditions for Caldivirga maquilingensis. Cultures derived from DSMZ 13496/IC-167 were provided by Christopher House, Penn State University. C. maquilingensis cultures were grown in modified DSM883 medium containing (per liter) 2.94 g Na3-citrate•2H2O, 0.5 mg resazurin, 0.5 g yeast extract, 10 g tryptone, 0.02 g FeCl3•6H2O, 1 ml of a vitamin solution (DSM141), 100 ml of a 10x salts solution, 10 ml of a 100x trace elements solution, and 2 ml of a polysulfide solution. The vitamin solution was filter-sterilized and stored at 4°C. The 10x salts solution derived from DSM88 was prepared by combining (per liter) 13 g (NH4)2SO4, 2.8 g KH2PO4, 2.5 g MgSO4•7H2O, and 0.7 g CaCl2•2H2O. The 100x trace elements solution derived from DSM88 was prepared by combining (per liter) 0.18 g MnCl2•4H2O, 0.45 g Na2B4O7•10H2O, 0.022 g ZnSO4•7H2O, 0.005 g CuCl2•2H2O, 0.003 g Na2MoO4•2H2O, 0.003 g VOSO4•2H2O, and 0.001 g CoSO4. The polysulfide solution was prepared using 15 g Na2S•9H2O and 3 g elemental sulfur in 100 ml anoxic water under N2. The complete medium was adjusted to pH 4 using H2SO4 to create colloidal sulfur, and sterilized by autoclaving. All C. maquilingensis cultures were incubated anaerobically at 85°C, typically with 200 ml of medium in sealed 0.5 L Wheaton bottles under a gas headspace of N2. Cells were harvested at mid-log phase by centrifugation at 10,000 g for 10 minutes. Cell pellets were frozen in liquid N2 and stored at -80°C.
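
As a worked check on the recipe above, the following short Python sketch computes the final concentration of each stock component in the complete medium. It assumes that the "per liter" quantities refer to a final medium volume of 1 L, so that 100 ml of the 10x salts stock is diluted 10-fold and 10 ml of the 100x trace elements stock is diluted 100-fold. The dictionary and function names are illustrative only.

```python
# Stock recipes in g per liter of stock solution, as listed in the text.
SALTS_10X = {
    "(NH4)2SO4": 13.0,
    "KH2PO4": 2.8,
    "MgSO4.7H2O": 2.5,
    "CaCl2.2H2O": 0.7,
}
TRACE_100X = {
    "MnCl2.4H2O": 0.18,
    "Na2B4O7.10H2O": 0.45,
    "ZnSO4.7H2O": 0.022,
    "CuCl2.2H2O": 0.005,
    "Na2MoO4.2H2O": 0.003,
    "VOSO4.2H2O": 0.003,
    "CoSO4": 0.001,
}


def final_g_per_l(stock_g_per_l, stock_ml, final_ml=1000.0):
    """Concentration after diluting stock_ml of stock to final_ml (assumed 1 L)."""
    return stock_g_per_l * stock_ml / final_ml


# 100 ml of 10x salts and 10 ml of 100x trace elements per liter of medium.
final_salts = {name: final_g_per_l(c, 100.0) for name, c in SALTS_10X.items()}
final_trace = {name: final_g_per_l(c, 10.0) for name, c in TRACE_100X.items()}

for name, conc in {**final_salts, **final_trace}.items():
    print(f"{name}: {conc:.5f} g/L")
```

Under this assumption the complete medium contains, for example, 1.3 g/L (NH4)2SO4 and 1.8 mg/L MnCl2•4H2O.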

Total RNA preparation. Total RNA was extracted from the frozen cell pellets using a Polytron tissue homogenizer and TRI Reagent (Sigma-Aldrich). RNA samples were treated with TURBO DNase (Ambion) to remove any residual DNA, re-extracted with TRI Reagent, and normalized to 1 µg/µl.

High-throughput RNA sequencing. Total RNA from P. aerophilum, P. arsenaticum, P. calidifontis, and P. islandicum (100 µg each) was denatured and resolved by electrophoresis in separate lanes on a 15% (w/v) polyacrylamide-urea gel. RNA with a size of 15-70 nt (migrating just ahead of the tRNA band, down to and including 75% of the region between the xylene cyanol and bromophenol blue loading dye bands) was excised from the gel. Samples were eluted and precipitated with ethanol. A 3′-linker with 5′-adenylation and a 3′-terminal dideoxy-C base (Linker-1 from Integrated DNA Technologies, IDT) was ligated to the RNA as described by Lau and colleagues (Lau, Lim et al. 2001). A second gel purification was performed to remove excess 3′-linker by extracting the ligated RNA which migrated above the xylene cyanol dye band. Resulting linked molecules were reverse transcribed using Superscript III (Invitrogen) and a complementary DNA primer. Exonuclease I (Thermo) followed by treatment with EDTA and sodium hydroxide were used to degrade excess primer, inactivate reverse transcriptase and selectively hydrolyze RNA, respectively. The resulting cDNA was purified using a NucAway spin column (Ambion) before the addition of a 5′-adenylated linker (Linker-2 from IDT) using T4 RNA ligase (Ambion). cDNA was amplified by a 16-cycle polymerase chain reaction (PCR) which was followed by a second 16-cycle PCR reamplification using Roche/454 specific hybrid primers based on the method described by Hannon (Hannon 2006). A four-base barcode, as described by Ambros (Ambros), was included in the 5′ hybrid primer. The final reaction was purified using the DNA Clean and Concentration kit (Zymo Research). Samples were then sent to the Joint Genome Institute, where they were pooled in equal quantities, amplified using the manufacturer’s protocol, and analyzed on a Roche GS-FLX sequencer.

RNase P purification. P. aerophilum cells were grown as described above. One gram of cells was resuspended in 5 mL cold extraction buffer (EB; 20 mM Tris-HCl, pH 8 @ 25°C and 5 mM MgCl2) containing 50 mM NaCl, 10 mM DTT and 1 mM PMSF. Cell lysate was generated by sonication and clarified by centrifugation at 14,000 g and 4°C for 30 min. The crude lysate was then loaded on a 1-mL HiTrap CM FF Sepharose column (GE Healthcare). EB supplemented with various NaCl concentrations, 10% (v/v) glycerol, 2 mM DTT and 0.2 mM PMSF was used in all the following steps. With an FPLC apparatus, fractions were eluted using a 50-2000 mM NaCl gradient and subsequently assayed for RNase P activity (Vioque, Arnez et al. 1988). The active CM fractions were pooled, dialyzed and loaded on a 1-mL HiTrap DEAE FF Sepharose column (GE Healthcare). The activity was eluted with three sequential NaCl gradients: 50-400 mM, 400-1000 mM, and 1000-1250 mM. The peak of RNase P activity eluted at ~700 mM NaCl.

Cloning and transcription of P. aerophilum and C. maquilingensis RNase P RNAs, and P. aerophilum pre-tRNAPhe. PCR was used to amplify the coding sequence of P. aerophilum (Pae) RNase P RNA using the P. aerophilum genomic DNA as the template, and the primers PaeRPR-F (5′-GGCGCCGAGGGGACG-3′) and PaeRPR-R (5′-GGGGCGCCGCGTACC-3′).

Cloning of C. maquilingensis (Cma) RNase P RNA was also performed with PCR amplification. Since the sequence of the P1 helix in C. maquilingensis RNase P RNA is long and completely complementary (Figure 3.7A), only one primer (CmaRPR-FR: 5′-CCCAGTGGCCATGGTGC-3′) was used to amplify the sequence from a BAC clone harboring the C. maquilingensis RNase P RNA gene.

The P. aerophilum pre-tRNAPhe with an 18-nt leader was amplified from P. aerophilum genomic DNA using the primers PaePhe-F (5′-AAGAGATGAGCTCGAKGCGGCCGTAGCTCAGC-3′) and PaePhe-R (5′-GCGAATTCCTGGTGCGGCCGCC-3′); the degenerate nucleotide K (G or T) allowed for cloning both the pre-tRNAPhe (G-1) and (U-1), discriminated by the presence and absence of an XhoI site, respectively. The 18-nt leader facilitated electrophoretic separation of pre-tRNAPhe and its cleavage products after the RNase P assay. For cloning the 4-nt-leadered pre-tRNAPhe, PaePhe(4)-F (5′-AGGCGGCCGTAGCTCAGC-3′) was used with PaePhe-R. The underlined sequences in PaePhe-F and PaePhe(4)-F correspond to the 18-nt and 4-nt 5′-leaders, respectively. Note that the 5′-GG of each leader sequence was encoded by the host vector on account of the cloning strategy employed.
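
The role of the degenerate position in PaePhe-F can be checked with a short Python sketch. It expands the IUPAC code K into its two concrete primers and tests each for the XhoI recognition sequence (CTCGAG). The primer sequence is taken from the text; the helper names are illustrative, and the check considers only the primer itself, not the full cloned insert.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
XHOI_SITE = "CTCGAG"
PAEPHE_F = "AAGAGATGAGCTCGAKGCGGCCGTAGCTCAGC"  # primer sequence from the text


def expand_degenerate(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


variants = expand_degenerate(PAEPHE_F)
for variant in variants:
    status = "XhoI site present" if XHOI_SITE in variant else "no XhoI site"
    print(variant, status)
```

The G variant contains CTCGAG and the T variant does not, which matches the statement that the (G-1) and (U-1) clones are distinguished by the presence and absence of an XhoI site.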

Subsequent to amplification, all PCR products were cloned downstream of the T7 promoter in pBT7 (Tsai, Lai et al. 2002). Both the P. aerophilum and the C. maquilingensis RNase P RNA fragments were ligated blunt-ended into the StuI site, and the pre-tRNAPhe fragments were digested with EcoRI (the italicized sequence introduced in PaePhe-R) before ligating into the StuI and EcoRI sites. Deletion mutants of P. aerophilum and C. maquilingensis RNase P RNA were generated by PCR using these clones as the respective templates; pairs of primers were designed to flank the deleted nucleotide(s) and orient outward such that the complete plasmids would be amplified and then circularized. All clones were confirmed by DNA sequencing.
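
The outward-facing primer design used for the deletion mutants can be sketched in Python as follows. This is an illustration of the geometry only, under stated assumptions: the plasmid is given as its top strand in circular form, the deletion is a 0-based half-open interval, and a fixed primer length is used with no melting-temperature optimization. The function names and the toy sequence in the usage note are hypothetical; the actual mutagenic primers are not listed in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def deletion_primers(plasmid, del_start, del_end, primer_len=20):
    """Design outward-facing primers flanking plasmid[del_start:del_end].

    The forward primer anneals immediately downstream of the deletion and
    extends away from it; the reverse primer anneals immediately upstream
    and extends in the opposite direction, so the whole plasmid except the
    deleted bases is amplified.
    """
    n = len(plasmid)
    if not (0 <= del_start < del_end <= n) or primer_len > n - (del_end - del_start):
        raise ValueError("deletion or primer length does not fit the plasmid")
    doubled = plasmid + plasmid  # lets slices wrap around the circular origin
    forward = doubled[del_end:del_end + primer_len]
    upstream = doubled[del_start - primer_len + n:del_start + n]
    return forward, reverse_complement(upstream)


def deleted_product(plasmid, del_start, del_end):
    """Top strand of the re-circularized plasmid after the deletion."""
    return plasmid[:del_start] + plasmid[del_end:]
```

For example, with the toy sequence "AAAACCCCGGGGTTTTACGT", deleting positions 8 to 9 with 4-nt primers gives the forward primer GGTT and the reverse primer GGGG. Sequencing the circularized product, as was done for all clones, would then be compared against deleted_product.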

Transcripts were generated from run-off in vitro transcription with T7 RNA polymerase using these clones as the templates after linearizing pBT7-PaeRPR (RNase P RNA) and pBT7-CmaRPR with BstBI (present in the vector), and pBT7-pre-tRNAPhe with BstNI (depicted in bold in PaePhe-R).

RNase P assay. The P. aerophilum pre-tRNAPhe was in vitro transcribed and radioactively labeled either internally using [α-32P]-GTP during transcription or at the 5ʹ end using T4 polynucleotide kinase (New England Biolabs, NEB) and [γ-32P]-ATP after dephosphorylation of the transcript by calf intestinal alkaline phosphatase (NEB). RNase P activity was assayed at 55°C in 50 mM Tris-HCl (pH 7.5 @ 25°C), 400 mM NH4OAc and 10 mM Mg(OAc)2, 5 mM DTT, 0.1 unit/µL RiboLock (Fermentas), 100 nM unlabeled pre-tRNAPhe, and trace amounts of labeled pre-tRNAPhe (~1 nM). Activity of the in vitro transcribed P. aerophilum and C. maquilingensis RNase P RNA was assayed, without a pre-folding incubation, at 55°C for 20 h in 50 mM Tris-HCl (pH 7.5 @ 25°C), 1.5 M NH4OAc, trace amounts (~1 nM) of labeled pre-tRNAPhe, and Mg(OAc)2 at 75 and 100 mM, respectively. The cleavage products were then separated on 12% (w/v) polyacrylamide/7 M urea gels (or 15% for the 4-nt leadered pre-tRNAPhe) and radiographed on a phosphorimager.

Thin-layer chromatography (TLC). Subsequent to cleavage of internally labeled pre-tRNAPhe by P. aerophilum and E. coli RNase P or P. aerophilum, C. maquilingensis and E. coli RNase P RNA, the mature tRNAPhe products were purified using an 8% (w/v) polyacrylamide/7 M urea gel. Uncleaved internally labeled pre-tRNAPhe was used as a negative marker and uncleaved 5ʹ end-labeled pre-tRNAPhe (with the 5ʹ monophosphorylated G being the only labeled residue) as the positive marker of pGp. These RNAs were then completely digested with 12 units of RNase T2 (Gibco BRL) at 50°C for 1 h in 20 mM NaOAc (pH 5.2 @ 25°C), 1 mM EDTA, and 0.4 µg/µL yeast tRNAs. Each sample was spotted on a TLC PEI cellulose F plate (EMD Chemicals), separated with 0.5 M potassium phosphate (pH 6.3):methanol in an 80:20 ratio, and radiographed with a phosphorimager.

Inactivation of P. aerophilum RNase P activity by treatment with proteinase K. An aliquot of partially-purified P. aerophilum RNase P (DEAE fraction 9 in Figure 3.2B) was first incubated with proteinase K (8 µg/µl; Roche) in 50 mM Tris-HCl (pH 7.9 @ 25°C) and 5 mM CaCl2 for 30 min at 55°C, before incubating at 55°C for 20 h to assay for pre-tRNAPhe-cleavage activity.

Inactivation of P. aerophilum RNase P activity using an antisense RNA oligonucleotide complementary to the P. aerophilum RNase P RNA. Aliquots of partially-purified P. aerophilum (Pae) RNase P (DEAE fraction 9 in Figure 3.2B) were pre-incubated at 55°C for 30 min in the presence of all components needed for RNase P assay, except for pre-tRNAPhe (U-1). Where indicated, PaeRPR-L15, an antisense RNA oligonucleotide (Sigma Genosys; 5′-CUUGCCCCCUACCC-3′), designed to invade L15 and the downstream region (nucleotides 157-170 in the Pae RNase P RNA) was also present in the pre-incubation at either 50 or 150 µM (Figure 3.5B). Subsequently, the assay for RNase P activity was initiated by addition of pre-tRNAPhe (U-1) and incubated at 55°C for 20 h. In independent experiments undertaken to assess the specificity of PaeRPR-L15, we employed as a negative control the RNA oligonucleotide pGln-16 (IDT Inc., 5′-UGGGGUGUAGCCAAGC-3′), which has a GC content similar to that of PaeRPR-L15. In contrast to PaeRPR-L15, at concentrations below 120 µM, pGln-16 fails to inhibit Pae RNase P (in fact, it was weakly stimulatory); even at 180 µM where pGln-16 does inhibit, it is at least 2-fold lower in potency relative to PaeRPR-L15. This latter finding is consistent with high concentrations of GC-rich oligonucleotides exhibiting non-specific binding to GC-rich RNAs, such as the Pae RNase P RNA. Nevertheless, the inhibition of Pae RNase P activity observed at lower concentrations of PaeRPR-L15 lends support for its specificity.

Northern analysis. Three microgram samples of total RNA extracted from P. aerophilum and C. maquilingensis cell cultures were combined with two parts (v/v) load buffer (95% formamide, 18 mM EDTA and 0.025% each of SDS, xylene cyanol, and bromophenol blue), and denatured for 5 min at 70°C. Denatured RNA samples were resolved in a 6% (v/v) acrylamide, 8 M urea gel, and transferred to Hybond N+ membranes (GE Healthcare). Probes for detection of the P. aerophilum and C. maquilingensis RNase P RNAs were prepared by PCR of genomic DNA, followed by asymmetric amplification of the PCR product in the presence of [α-32P]-CTP using only the antisense (reverse) primers to generate strand-specific radiolabeled probes. To prepare template for the P. aerophilum probe, a 200-bp segment of the P. aerophilum RNase P RNA gene (corresponding to nucleotides 10-199; Fig. 3.7A) was amplified from P. aerophilum genomic DNA using the forward primer (5′-GGCCCCTTCTGGAACCTC-3′) and the reverse primer (5′-CCTCACAGGCCCTGCTTG-3′). To prepare template for the C. maquilingensis probe, a 150-bp segment of the C. maquilingensis RNase P RNA gene (corresponding to nucleotides 48-197; Fig. 3.7A) was amplified from C. maquilingensis genomic DNA using the forward primer (5′-GGCCCCTTCTGGAACCTC-3′) and the reverse primer (5′-CCTCACAGGCCCTGCTTG-3′). Hybridizations were performed at 42˚C using UltraHyb buffer (Ambion).

tRNA gene and promoter identification. tRNAs were predicted using tRNAscan-SE (Lowe and Eddy 1997). Manual inspection and adjustments were made due to difficulty in identifying tRNAs with introns placed outside of the canonical position (between nucleotides 37 and 38). Annotations and sequences were deposited into the Genomic tRNA Database (Chan and Lowe 2009) (http://gtrnadb.ucsc.edu).

To generate a training set for promoter identification, potential operons were predicted genome-wide with the requirement of an intergenic separation of at least 100 bp (on the same strand). A 16-mer motif search of the 90 bp upstream of known genes (not annotated as putative or hypothetical genes) using MEME (Bailey and Elkan 1994) was conducted to identify the consensus promoter, including the transcription factor B response element (BRE – 1 to 3 As) plus the TATA box. A position-specific scoring matrix (PSSM) was generated from the alignments of the MEME results after manual inspection. Each organism’s PSSM was used to scan the 150-bp upstream region of all non-coding and protein-coding genes to identify potential promoter regions. Ten virtual genomes for each target genome were generated with the use of a fifth-order Markov chain to retain the base frequency of the target genome. The PSSM was applied to the same features in the target genome for null hypothesis formation. The promoter candidates previously identified were filtered according to expected position (Slupska, King et al. 2001) and a threshold p-value equivalent to that of the lowest-scoring known gene was established.
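
The scoring and null-model steps above can be illustrated with a minimal Python sketch. It is an approximation under stated assumptions rather than the code used in this work: the PSSM is a simple log-odds matrix with a pseudocount (the exact matrix construction from the MEME alignments is not specified in the text), sequences are assumed to contain only uppercase A, C, G, and T, and the p-value is a plain empirical estimate from scores obtained on simulated sequence. All function names are illustrative.

```python
import math
import random
from collections import Counter, defaultdict

BASES = "ACGT"


def build_pssm(sites, background, pseudocount=1.0):
    """Log-odds (bits) PSSM from aligned, equal-length promoter sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for column in zip(*sites):
        counts = Counter(column)
        pssm.append({
            b: math.log2((counts[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pssm


def best_hit(pssm, region):
    """Return (score, offset) of the best window; region must be >= motif width."""
    width = len(pssm)
    return max(
        (sum(col[base] for col, base in zip(pssm, region[i:i + width])), i)
        for i in range(len(region) - width + 1)
    )


def simulate_genome(genome, order=5, rng=None):
    """Same-length sequence drawn from an order-k Markov chain fit to genome."""
    rng = rng or random.Random(0)
    transitions = defaultdict(Counter)
    for i in range(len(genome) - order):
        transitions[genome[i:i + order]][genome[i + order]] += 1
    overall = Counter(genome)  # fallback when a context has no observed successor
    out = list(genome[:order])
    while len(out) < len(genome):
        nxt = transitions.get("".join(out[-order:])) or overall
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


def empirical_p(observed, null_scores):
    """Fraction of null scores >= observed, with a +1 correction."""
    return (1 + sum(s >= observed for s in null_scores)) / (1 + len(null_scores))
```

In use, best_hit would be applied to each 150-bp upstream region of the real genome and to the corresponding regions of each simulated genome, and empirical_p would convert a real score into a p-value against the simulated scores. Filtering by expected promoter position would be a separate step on the returned offsets.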

RNase P RNA sequence search in Aquificaceae. Infernal v1.0 (Nawrocki, Kolbe et al. 2009) was used to search for RNase P RNA candidates in Aquifex aeolicus, Hydrogenivirga sp. 128-5-R1-1, Hydrogenobacter thermophilus, and Hydrogenobaculum sp. Y04AAS1. The program was run in both global and local search modes using the Rfam (Gardner, Daub et al. 2009) bacterial (RF00010 and RF00011), archaeal (RF00373), and eukaryotic (RF00009) RNase P RNA covariance models. All hits with a score > 0 bits were manually examined. The covariance model built with Pyrobaculum RNase P RNA sequences was also applied in global and local search modes. After these searches failed to identify an RNase P RNA candidate, a shortened bacterial covariance model was developed to search against the genomes by aligning the RNase P RNA sequences of Persephonella marina, Sulfurihydrogenibium sp. YO3AOP1, and Sulfurihydrogenibium azorense against the Rfam (Gardner, Daub et al. 2009) bacterial type A RNase P RNA covariance model (RF00010) using Infernal (Nawrocki, Kolbe et al. 2009) and removing the specificity domain from the alignments manually.

RNase P protein database searches and sequence alignments. Protein sequences of Pop5, Rpp30, Rpp29, and Rpp21 for all archaeal genomes were retrieved from GenBank. PSI-BLAST (Altschul, Madden et al. 1997), Pfam (Finn, Mistry et al. 2010) domain searches (RNase_P_Rpp14 [Pop5]: PF01900; RNase_P_p30 [Rpp30]: PF01876; UPF0086 [Rpp29]: PF01868; and Rpr2 [Rpp21]: PF04032), and Phylo-HMM (Siepel and Haussler 2004) multiple alignments in the Archaeal Genome Browser (Schneider, Pollard et al. 2006) were used to predict homology. Default scoring thresholds for PSI-BLAST (E-value: 10; word size: 3) and Pfam (trusted cutoff for Pop5: 23.4 bits; Rpp30: 20.3 bits; Rpp29: 21.1 bits; Rpp21: 23.2 bits) searches were initially adopted. Thresholds were further adjusted (E-value: 100 and word size: 2 for PSI-BLAST; trusted cutoff as -80 bits for Pfam) to search for proteins not identified with the default scan. Multiple sequence alignments across all archaeal genomes for each RNase P protein were generated using MUSCLE v3.7 (Edgar 2004) with default options. Alignments were visualized with the ClustalX color scheme within Jalview v2.4 (Waterhouse, Procter et al. 2009), a multiple alignment editor freely available at http://www.jalview.org/. The RNase P protein sequence alignments in FASTA format are provided in a separate supplementary dataset file.
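
The two-tier Pfam thresholding described above can be summarized in a small Python sketch. The cutoff values are those quoted in the text; the function name, the tier labels, and the choice of an inclusive (greater than or equal) comparison are assumptions made for illustration. Hits in the relaxed tier would still require the manual alignment inspection described in this section.

```python
# Pfam trusted cutoffs (bits) quoted in the text for each RNase P protein family.
TRUSTED_CUTOFF_BITS = {
    "Pop5": 23.4,   # RNase_P_Rpp14, PF01900
    "Rpp30": 20.3,  # RNase_P_p30, PF01876
    "Rpp29": 21.1,  # UPF0086, PF01868
    "Rpp21": 23.2,  # Rpr2, PF04032
}
RELAXED_CUTOFF_BITS = -80.0  # adjusted threshold used in the second pass


def classify_hit(protein, bit_score):
    """Label a domain hit by the tier it passes (inclusive comparison assumed)."""
    if bit_score >= TRUSTED_CUTOFF_BITS[protein]:
        return "trusted"
    if bit_score >= RELAXED_CUTOFF_BITS:
        return "relaxed"
    return "rejected"


# Hypothetical scores, for illustration only.
for protein, score in [("Pop5", 30.0), ("Rpp21", 3.5), ("Rpp29", -95.0)]:
    print(protein, score, classify_hit(protein, score))
```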

3.6 Author Contributions

L.B.L. did the cloning, designed the purification, optimized and performed all biochemical assays for Pyrobaculum RNase P activity, co-identified Pop5 and Rpp30 candidates, and co-wrote the manuscript. P.C. performed promoter and tRNA sequencing analyses, identified Caldivirga and Vulcanisaeta RNase P RNA genes using Infernal covariance model searches, and created RNase P protein alignments. A.C. grew Pyrobaculum cultures, purified total RNA, performed Pyrobaculum and Caldivirga northern analyses, and co-edited the manuscript. D.B. grew Pyrobaculum cultures, designed and carried out the RNA sequencing, and performed initial bioinformatic analyses of sequencing reads. J.B. created secondary structure predictions for the Pyrobaculum RNase P candidates. V.G. guided the biochemical studies, co-identified Pop5 and Rpp30 candidates, and co-wrote the manuscript. T.L. identified the Pyrobaculum RNase P RNAs, analyzed the operonic context of RNase P proteins, and co-wrote the manuscript. All authors generated figures and contributed ideas to the discussion.

3.7 Acknowledgments

We thank J. Murphy (UCSC) for her contribution of Pyrobaculum cell material, L. Lui (UCSC) for her contribution of Caldivirga cell material for northern analysis, and K. Karplus (UCSC) for his Rpp21 structural protein search and expert advice on protein alignment interpretation. We are grateful to members of the Joint Genome Institute for making 454 sequencing possible (P. Richardson and J. Bristow for providing resources, and E. Lindquist and N. Zvenigorodsky for sample preparation and analysis). We thank J. Jackman (OSU) for advice and reagents for the TLC analysis, and E. J. Behrman (OSU) for helpful discussions. We are indebted to S. Eddy (HHMI Janelia Farm) for insightful comments that helped improve the manuscript. This work was supported by grants from the National Science Foundation (MCB0238233 and MCB0843543 to V.G. and EF-0827055 to T.L.) and National Institutes of Health (GM067807 to Mark P. Foster and V.G., and HG004002-02 subaward to T.L.).

Figure 3.1 Alignment of tRNA promoters and 5′-leader sequences across four Pyrobaculum species for three sets of tRNA orthologs
The predicted TATA box of each tRNA gene is highlighted in green, mature tRNA-encoding sequence in blue, and the 5′-leader sequence that is supported by RNA sequencing in orange. Scales above sequences are positions relative to the 5′ end of mature tRNAs. The black arrows indicate the direction of transcription.

Figure 3.2 Partial purification of P. aerophilum RNase P by ion-exchange chromatography
A. P. aerophilum RNase P was first purified from crude extract using CM-Sepharose, and the peak of activity (~900 mM NaCl) was then dialyzed to remove NaCl before loading on to the DEAE column. B. Peak of RNase P activity from DEAE-Sepharose. Fractions were eluted with three sequential NaCl gradients: 50-400 mM (fractions 1-6), 400-1000 mM (fractions 7-12), and 1000-1250 mM (fractions 13-16). The RNase P activity in eluted fractions was assayed with internally labeled pre-tRNAPhe (G-1). Fraction 9 eluted from DEAE-Sepharose above was used in all subsequent characterization assays. PC, the positive control obtained from processing of pre-tRNAPhe by E. coli RNase P; SC, the substrate control incubated without RNase P.

Figure 3.3 The pre-tRNA 5ʹ-processing activity from P. aerophilum (Pae) cell extract has all the cleavage properties of RNase P
A. A P. aerophilum pre-tRNAPhe with an 18-nt leader used for RNase P assays; pre-tRNAPhe (G-1) and pre-tRNAPhe (U-1) differ only in the identity of the -1 nucleotide. B. Pae RNase P cleaves internally labeled pre-tRNAPhe (U-1) at the canonical cleavage site (lane 3), similar to E. coli (Eco) RNase P (lane 2; see also Figs. S2 and S3A). SC, a substrate control incubated without RNase P. C. TLC analysis of tRNAPhe containing a 5ʹ-pGp, produced by Eco and Pae RNase P (lanes 3 and 4). NM and PM, a negative marker without and a positive marker with pGp.

Figure 3.4 Analysis of the site of cleavage in pre-tRNAPhe (G-1) and (U-1) by partially-purified native P. aerophilum (Pae) RNase P and in vitro transcribed RNase P RNAs (RPRs)
(A - Left panel) Internally labeled P. aerophilum pre-tRNAPhe (G-1) with an 18-nt leader was used for these assays. Pae RNase P cleaves with equal frequency at the canonical cleavage site and one nucleotide upstream (lane 3) unlike E. coli (Eco) RNase P (lane 2) which does not miscleave. (A - Right panel) Figure 3.3B is provided here for comparison to contrast the correct cleavage of pre-tRNAPhe (U-1) by partially-purified Pae RNase P. SC, a substrate control incubated without RNase P. B. Both partially-purified Pae and E. coli (Eco) RNase P cleave the 5′ end-labeled pre-tRNAPhe (U-1) between U-1 and G+1 as expected for bona fide RNase P (lanes 2 and 3). Similarly, in vitro transcribed Pae and C. maquilingensis (Cma) RPRs exhibit correct processing of the substrate (lanes 6 and 9). T1, a G ladder generated by partial digestion of denatured 5′ end-labeled pre-tRNAPhe (U-1) with RNase T1 (Ambion) that cleaves 3′ to the G residues; subsequent to RNase T1 digestion, the cleaved RNA fragments were treated with T4 polynucleotide kinase to remove their 3′-phosphate (lanes 4 and 8) and normalize their migration in the gel with the 5′-leader of pre-tRNAPhe, which is not 3′ phosphorylated. SC, a substrate control without RNase P; the three SC reactions (lanes 1, 5 and 10) correspond, respectively, to controls which mimic the assay conditions employed for the native Pae holoenzyme (lane 2), Pae RPR (lane 6) and Cma RPR (lane 9).

Figure 3.5 RNase P activity from P. aerophilum (Pae) cell extract requires both protein and RNA subunits
A. Treatment with proteinase K (PK) eliminated Pae RNase P activity. Pae RNase P was pre-incubated with PK (lane 3) or without PK (lane 4) before assaying for RNase P activity. B. Pre-incubation of Pae RNase P with the oligo PaeRPR-L15 (complementary to nucleotides 157 to 170 in Pae RNase P RNA) resulted in decreased activity. Pae RNase P was pre-incubated without (lane 3) or with 50 and 150 µM PaeRPR-L15 (lanes 4 and 5) before assaying for RNase P activity. All assays were performed with internally labeled pre-tRNAPhe (U-1). SC, a substrate control incubated without RNase P; PC, a positive control with pre-tRNAPhe processed by E. coli RNase P.

Figure 3.6 P. aerophilum RNase P RNA displayed on the Archaeal Genome Browser (Schneider, Pollard et al. 2006) and alignment of the RNase P RNA sequences from five Pyrobaculum species
A. The red segment located between protein-coding genes PAE0934 and PAE0936 corresponds to the P. aerophilum RNase P RNA. The arrows on the genes indicate the 5′-to-3′ expression direction. The blue track above the genes represents the G/C content computed with a 20-base sliding window. Compared to its neighboring protein-coding genes, P. aerophilum RNase P RNA has a high G/C content that is required for structural RNA stability in hyperthermophiles. Green lines in “Promoter +” indicate BRE-TATA promoter signals on the positive strand. A strong promoter signal just upstream of the P. aerophilum RNase P RNA locus is consistent with start of transcription precisely at the predicted RNase P RNA gene. The black “Conservation” track at the bottom displays a graph of the nucleotide conservation level of the alignment with multiple genome sequences (including P. aerophilum and the other four Pyrobaculum genomes). The most highly conserved region within the sequence window corresponds to P. aerophilum RNase P RNA, consistent with a structural non-coding RNA gene (as opposed to the weaker nucleotide conservation of the flanking protein-coding genes). B. Multiple RNA sequence alignment shows high conservation and a few single-nucleotide insertions/deletions, which are common in non-coding RNA genes, but unlikely in protein-coding genes due to frameshifts.

Figure 3.7 Predicted secondary structure, native expression, and in vitro activity of P. aerophilum (Pae) and Caldivirga maquilingensis (Cma) RNase P RNAs (RPRs)
A. The Pae RPR structure shared by four other Pyrobaculum spp., with unlabeled nucleotides identical in all five species, and others highlighted as follows: universally conserved nucleotides (black circles), pairs showing covariation among different Pyrobaculum RPRs (green), conservative G-C to G-U changes (yellow), non-conserved insertions in some Pyrobaculum species (lowercase), differences in unpaired regions (blue), and complementary region of PaeRPR-L15 (red line). The RPR from C. maquilingensis (Cma) shows low sequence identity (<50%) but high secondary structure similarity. Arrow, deletion of this catalytically important bulge abolishes activity. B. Pae and Cma RPRs are expressed in vivo, as shown by northern analysis of total RNA. M, size markers. C. RNase P assay of in vitro transcribed, wild-type (WT) Pae (100 µM, lane 3) or Cma (40 µM, lane 7) RPRs with ~1 nM internally labeled pre-tRNAPhe (G-1). Deletion of the bulged U51 (Figure 3.7A, left, arrow) in the P4 helix rendered Pae RPR inactive (lane 2). Similarly, deletion of C62-U63-C64 (Figure 3.7A, right, arrow) in the P4 helix inactivated Cma RPR (lane 6). SC, substrate control and PC, positive control, as in Figure 3.5.

Figure 3.8 Predicted secondary structure of P. calidifontis and V. distributa RNase P RNAs
Black circles indicate universally conserved nucleotides, and all others highlighted relative to P. aerophilum (Figure 3.7A) as follows: pairs showing covariation (green), conservative G-C to G-U changes (yellow), and differences in unpaired regions (blue).

Figure 3.9 P. aerophilum (Pae) and C. maquilingensis (Cma) RNase P RNAs (RPRs) can process pre-tRNAPhe (G-1) with a 4-nt leader
A. Cleavage of a 5′-labeled substrate was assessed using wild-type (WT) Pae (lane 3) or Cma (lane 7) RPRs, and their corresponding inactive mutant derivatives (lanes 2 and 6; Figure 3.7). B. Pae RPR cleaves pre-tRNAPhe (G-1) with either a 4- or 18-nt leader with equal efficiency. SC, a substrate control incubated without the RNase P RNA; PC, a positive control with pre-tRNAPhe processed by E. coli RNase P. C. TLC analysis of mature tRNAPhe containing a 5′-pGp, produced by E. coli (Eco), Pae and Cma RPRs (lanes 3, 4, and 5). NM and PM, a negative marker without and a positive marker with pGp.

Table 3.1 tRNA genes found to have a transcribed 5′-leader by high-throughput RNA sequencing. For all counts, the existence of a 5′-leader is consistent with its prediction by computational promoter detection.

Table 3.2 Annotated or computationally identified RNase P proteins and associated DUF54 protein. Each protein is represented by the gene locus tag with the protein accession number in parentheses. Proteins located in the same neighborhood within three genes are highlighted in blue.

Table 3.3 RNase P RNA search using Infernal v1.0 (Nawrocki, Kolbe et al. 2009). The Rfam archaeal RNase P RNA covariance model (RF00373) and the Pyrobaculum RNase P RNA covariance model were used with Infernal v1.0 (Nawrocki, Kolbe et al. 2009) to search 88 archaeal genomes. For each genome and model, the RNase P RNA candidate with the highest bit score is reported.

Covariance model search scores (bits)

| Genome | Archaeal RNase P RNA model (RF00373) | Pyrobaculum RNase P RNA model |
|---|---|---|
| Crenarchaeota | | |
| Aeropyrum pernix K1 | 125.72 | 5.07 |
| Caldivirga maquilingensis IC-167 | Not Found | 52.95 |
| Desulfurococcus kamchatkensis 1221n | 166.33 | 7.44 |
| Hyperthermus butylicus | 124.17 | Not Found |
| Ignicoccus hospitalis KIN4-I | 117.03 | 2.32 |
| Ignisphaera aggregans AQ1.S1 | 174.72 | 4.79 |
| Metallosphaera sedula | 121.51 | 7.21 |
| Pyrobaculum aerophilum | Not Found | 206.22 |
| Pyrobaculum arsenaticum | 1.34 | 206.63 |
| Pyrobaculum calidifontis | Not Found | 192.45 |
| Pyrobaculum islandicum | 0.05 | 217.18 |
| Thermoproteus neutrophilus V24Sta [to be reclassified as a Pyrobaculum species] | Not Found | 196.85 |
| Pyrobaculum oguniense | Not Found | 210.93 |
| Staphylothermus hellenicus | 139.85 | Not Found |
| Staphylothermus marinus F1 | 165.76 | 3.05 |
| Sulfolobus acidocaldarius | 184.72 | Not Found |
| Sulfolobus islandicus L.D.8.5 | 169.11 | Not Found |
| Sulfolobus islandicus L.S.2.15 | 169.11 | Not Found |
| Sulfolobus islandicus M.14.25 | 169.11 | Not Found |
| Sulfolobus islandicus M.16.4 | 169.11 | Not Found |
| Sulfolobus islandicus M.16.27 | 169.11 | Not Found |
| Sulfolobus islandicus Y.G.57.14 | 169.15 | Not Found |
| Sulfolobus islandicus Y.N.15.51 | 164.82 | Not Found |
| Sulfolobus solfataricus P2 | 180.39 | 4.08 |
| Sulfolobus tokodaii str. 7 | 162.43 | 8.65 |
| Thermofilum pendens Hrk 5 | 72.24 | 6.24 |
| Thermosphaera aggregans M11TL | 143.48 | Not Found |
| Vulcanisaeta distributa IC-017 | 5.32 | 48.24 |
| Euryarchaeota | | |
| Aciduliprofundum boonei T469 | 175.42 | Not Found |
| Archaeoglobus fulgidus | 124.98 | Not Found |
| Archaeoglobus profundus Av18 | 178.91 | 4.65 |
| Ferroglobus placidus AEDII12DO | 162.95 | 5.37 |
| Haloarcula marismortui | 89.22 | 5.35 |
| Halobacterium salinarum R1 | 136.86 | 5.19 |
| Halobacterium sp. NRC-1 | 136.86 | 5.19 |
| Haloferax volcanii DS2 | 133.15 | 9.89 |
| Halogeometricum borinquense PR3 | 60.64 | 8.00 |
| Halomicrobium mukohataei | 106.3 | 4.40 |
| Haloquadratum walsbyi | 172.81 | Not Found |
| Halorhabdus utahensis | 125.24 | 6.72 |
| Halorubrum lacusprofundi | 162.14 | 7.91 |
| Haloterrigena turkmenica | 112.04 | 5.92 |
| Methanothermobacter thermautotrophicus str. Delta H | 228.58 | 13.49 |
| Methanobrevibacter ruminantium M1 | 193.70 | Not Found |
| Methanobrevibacter smithii ATCC 35061 | 206.29 | Not Found |
| Methanocaldococcus sp. FS406-22 | 147.62 | Not Found |
| Methanocaldococcus fervens AG86 | 144.16 | Not Found |
| Methanocaldococcus infernus ME | 126.83 | Not Found |
| Methanocaldococcus vulcanius M7 | 144.13 | Not Found |
| Methanocella paludicola SANAE | 165.96 | 9.18 |
| Methanococcoides burtonii | 184.98 | Not Found |
| Methanococcus aeolicus | 102.82 | Not Found |
| Methanocaldococcus jannaschii | 149.37 | Not Found |
| Methanococcus maripaludis C5 | 117.63 | Not Found |
| Methanococcus maripaludis C6 | 113.98 | Not Found |
| Methanococcus maripaludis C7 | 115.95 | Not Found |
| Methanococcus maripaludis S2 | 120.92 | Not Found |
| Methanococcus vannielii SB | 113.31 | Not Found |
| Methanocorpusculum labreanum Z | 173.66 | Not Found |
| Methanoculleus marisnigri JR1 | 196.35 | 12.39 |
| Methanohalophilus mahii | 155.97 | Not Found |
| Methanopyrus kandleri AV19 | 53.55 | 5.29 |
| Candidatus Methanoregula boonei | 199.44 | 3.83 |
| Methanosaeta thermophila PT | 192.62 | Not Found |
| Methanosarcina acetivorans C2A | 181.08 | 6.52 |
| Methanosarcina barkeri str. Fusaro | 187.82 | Not Found |
| Methanosarcina mazei Go1 | 191.12 | 8.56 |
| Methanosphaera stadtmanae | 137.68 | Not Found |
| Candidatus Methanosphaerula palustris E1-9c | 201.92 | 5.85 |
| Methanospirillum hungatei JF-1 | 197.20 | 9.90 |
| Natrialba magadii | 96.05 | 8.78 |
| Natronomonas pharaonis | 153.40 | 5.56 |
| Picrophilus torridus | 205.24 | 3.44 |
| Pyrococcus abyssi GE5 | 196.36 | 4.10 |
| Pyrococcus furiosus | 196.16 | 4.77 |
| Pyrococcus horikoshii OT3 | 190.46 | Not Found |
| Thermococcus barophilus | 171.24 | Not Found |
| Thermococcus gammatolerans EJ3 | 175.13 | 8.20 |
| Thermococcus kodakarensis KOD1 | 171.24 | 4.66 |
| Thermococcus onnurineus NA1 | 175.74 | 4.99 |
| Thermococcus sibiricus MM 739 | 199.28 | 3.91 |
| Thermoplasma acidophilum | 206.76 | Not Found |
| Thermoplasma volcanium GSS1 | 186.71 | 2.59 |
| Uncultured methanogenic archaeon RC-I | 191.28 | 6.72 |
| Nanoarchaeota | | |
| Nanoarchaeum equitans Kin4-M | Not Found | Not Found |
| Korarchaeota | | |
| Candidatus Korarchaeum cryptofilum OPF8 | 151.76 | 8.67 |
| Thaumarchaeota | | |
| Cenarchaeum symbiosum | 173.78 | 7.99 |
| Nitrosopumilus maritimus SCM1 | 184.41 | Not Found |

Chapter 4

Modeling the Thermoproteaceae RNase P RNA3

3 This chapter is a draft manuscript to be submitted in collaboration with James W. Brown.

4.1 Introduction

Ribonuclease P (RNase P) is best known for removing the 5ʹ-leaders of pre-tRNAs during maturation. This enzyme typically includes a catalytic RNA subunit and at least one protein, although the RNA component is missing in human and Arabidopsis organellar genomes. Although RNase P RNA has been identified in almost all organisms, distinctive structural differences have been observed among species in different domains of life (Brown 1999). The Rfam database (Gardner, Daub et al. 2009) classifies these RNAs into four families: nuclear RNase P in eukaryotes, bacterial RNase P type A, bacterial RNase P type B, and archaeal RNase P. Archaeal RNase P RNAs can be further divided into two subgroups (Harris, Haas et al. 2001). Type A, which closely resembles the structure of bacterial type A RNase P RNA, is the typical archaeal RNase P RNA found in most of the sequenced genomes, whereas Archaeoglobus, Methanocaldococcus, Methanococcus, and Methanothermococcus have a type M RNase P RNA that lacks a number of stems in the specificity and catalytic domains. The recent discovery of a shortened form of RNase P RNA in Thermoproteaceae introduces a new type T to the archaeal RNA families (Lai, Chan et al. 2010). Because this type T RNA lacks most of the specificity domain, the archaeal RNase P RNA covariance model in Rfam (Gardner, Daub et al. 2009) fails to detect it. Here we describe the features of archaeal type T RNase P RNA and the development of a covariance model for finding this elusive form of ancient RNA. We detected another type T RNase P RNA in the recently available Thermoproteus tenax genome using the new model, and unexpectedly identified a novel type M variant in Archaeoglobaceae.

4.2 Results and Discussion

4.2.1 Common features of type T RNase P RNAs

The shortened form (type T) of RNase P RNA has only been identified in Pyrobaculum (P. aerophilum, P. arsenaticum, P. calidifontis, P. islandicum, P. oguniense, and Thermoproteus neutrophilus [to be reclassified as a Pyrobaculum species]), Caldivirga maquilingensis, and Vulcanisaeta distributa, all belonging to the family Thermoproteaceae (Lai, Chan et al. 2010). In general, this RNA has a catalytic domain closely resembling that of type A RNase P RNA, but lacks most of the specificity domain (Figure 4.1). The universally conserved positions in the P4 stem, the P2/P4 joining region, and the P2/P15 joining region are in place. However, the P2 stem is only 3 bp long, which is short compared with the 6 bp stem in other archaeal RNAs and the 7 bp stem in bacterial ones (Haas, Armbruster et al. 1996). The 4-nt P2/P3 joining region in C. maquilingensis and most of the Pyrobaculum RNase P RNAs does not seem to compensate for the difference. The P15 stem in all the identified type T RNAs is also 1 bp shorter than the typical one in type A. The P5/P15 linker is 2 nt rather than the typical 3 nt, which reduces that region further. The P10 stem, which typically extends to P11 and P12 of the specificity domain in type A RNase P RNAs, ends in a small loop in Pyrobaculum and V. distributa (Figures 4.1B and 4.1D). The relatively reduced region extending from the P7 stem of C. maquilingensis RNase P RNA could be a P10 stem similar to that of the other type T RNAs, or a shortened P8 or P9 stem (Figure 4.1C).

4.2.2 Type T RNase P RNA variants

Closer inspection of the secondary structures among the identified type T RNase P RNAs revealed three variants, one for each genus (Figure 4.1). The 20-nt P1 stem in C. maquilingensis and V. distributa RNase P RNAs is about twice the length of that in Pyrobaculum. Although a long P1 stem has been observed in genomes such as Aeropyrum pernix (Brown 1999), the length in these two type T genes is among the longest of all the verified archaeal RNase P RNAs. Previous studies found that P1 interacts with L9 as part of the mechanism for orienting the catalytic and specificity domains in bacterial RNase P RNAs (Massire, Jaeger et al. 1997; Massire, Jaeger et al. 1998). The catalytic activity of RNase P in Methanothermobacter thermautotrophicus increased significantly with a longer P1 stem that could contact L9 (Li, Willkomm et al. 2009). While both Pyrobaculum and V. distributa RNase P RNAs have a typical GNRA tetraloop at L9 (Massire, Jaeger et al. 1997), the extension of the P7 stem into an atypical P9 or partial P10 stem in C. maquilingensis (Figure 4.1C) raises the question of how the interaction occurs in this variant.

A typical P8 stem, similar to the one in type A RNase P RNAs, is observed only in V. distributa and not in the other two variants (Figures 4.1B-D). Interestingly, type M RNase P RNAs are also known for their loss of the P8 stem (Harris, Haas et al. 2001). Since P8 was found to be involved in T-loop recognition of pre-tRNAs in bacteria (Nolan, Burke et al. 1993; Harris, Nolan et al. 1994), it is still unclear whether another region of the holoenzyme in archaea compensates for this essential missing piece. On the other hand, C. maquilingensis and V. distributa RNase P RNAs have the shortest P5 stem (2 bp versus a typical 4 bp) observed in archaea. In addition, the V. distributa variant has a 2-nt joining region between P5 and P7, which makes the structure of this RNA differ from the other known archaeal RNase P RNAs.

4.2.3 Search with type T RNase P RNA covariance model

Due to the higher G/C content of Pyrobaculum genomes, their RNase P RNAs are also more GC-rich than those in C. maquilingensis and V. distributa. Combined with the slight differences in secondary structure, this means that the original covariance model, built with only the Pyrobaculum RNase P RNA sequences, does not perform well in searching for the two other type T variants (Lai, Chan et al. 2010). We therefore structurally aligned the RNase P RNA sequences from Caldivirga maquilingensis and Vulcanisaeta distributa together with those from the Pyrobaculum genomes (Figure 4.2), and built a type T specific covariance model using Infernal (Nawrocki, Kolbe et al. 2009). By employing this newly developed model, we identified another shortened form of RNase P RNA in Thermoproteus tenax, a crenarchaeon in the same Thermoproteaceae family (Figure 4.3A). As positive controls, we applied the same covariance model to the source sequences and found very close scores between the T. tenax candidate and the Pyrobaculum RNase P RNAs (Table 4.1). Structural comparison showed that the RNase P RNA in T. tenax is a Pyrobaculum type T variant whose sequence is highly conserved with the Pyrobaculum ones (Figure 4.3B). The 16S rRNA of T. tenax has over 96% identity with that of P. calidifontis, much higher than with C. maquilingensis and V. distributa (93% and 94% respectively), which further supports our finding. To verify that T. tenax does not include an archaeal RNase P RNA in another form, we searched the genome with the Rfam (Gardner, Daub et al. 2009) archaeal covariance model and did not find any matches. A search for the RNase P proteins in T. tenax revealed likely homologs of Pop5, Rpp30, and Rpp29, but not Rpp21, as in the Pyrobaculum species. This extends the confirmed phylogenetic distribution of type T RNase P within the Thermoproteaceae family.
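The 16S rRNA comparison above rests on pairwise percent identity. The text does not state how identity was computed, so the following is only a minimal sketch of one common definition: identical residues divided by the number of alignment columns in which neither sequence has a gap. The function name and the gap-handling choice are assumptions for illustration, not the procedure used in this chapter.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both inputs are rows from the same pairwise or multiple
    alignment, so they have equal length; '-' and '.' are treated as gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "-." or b in "-.":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

Other conventions (for example, counting gapped columns in the denominator) give lower values for the same alignment, so identities are only comparable when computed the same way.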

4.2.4 Type M RNase P RNA variants

Previous studies have reported in detail the structural differences between type A and type M archaeal RNase P RNAs: type M is missing P8 in the specificity domain, and L15 with P16, P17 and P6 in the catalytic domain (Figure 4.4A) (Harris, Haas et al. 2001). While conducting a structural comparison between the type T and type M RNase P RNAs in available genomes, we unexpectedly identified a novel type M variant in two recently sequenced euryarchaea, Archaeoglobus profundus and Ferroglobus placidus. Since Archaeoglobus fulgidus, which belongs to the same family Archaeoglobaceae, was found to have a type M RNase P RNA (Brown 1999; Harris, Haas et al. 2001), we anticipated that closely related species would carry the same form of this RNA gene. Like the other type M RNase P RNAs, the genes in A. profundus and F. placidus do not have L15, P16, P17 and P6 in their secondary structures. However, they both include a typical P8 stem of the kind observed in type A RNase P RNAs (Figure 4.4B). Interestingly, P8 is also one of the variable components observed among the three type T structural variants. Phylogenetic analysis based on 16S rRNAs showed that A. profundus is closer to F. placidus than to other Archaeoglobus species including A. fulgidus (von Jan, Lapidus et al. 2010), which provides an explanation for the similarity of the RNase P RNA genes between these two genomes.

4.3 Conclusions

Type T RNase P RNAs in Thermoproteaceae display significant differences from the typical archaeal ones. How this shortened form of RNA interacts with the protein subunits is still an open question. The undetectable Rpp21 and the lack of most of the specificity domain suggest that one or more unknown subunits may substitute for them, although we were not able to find a separate specificity domain-like gene in these genomes (Lai, Chan et al. 2010). Crystallization of this holoenzyme may help resolve some of the uncertainties. Additionally, the discovery of multiple type T and type M RNase P RNA variants introduces a new level of complexity to the mechanism of pre-tRNA 5ʹ-processing. The presence or absence of the P8 stem in different variants may suggest that this region was the most recent loss during evolution. With the increasing availability of sequenced genomes, we anticipate that more RNase P RNA variants will be identified for further study.

4.4 Materials and Methods

Genomic data. Complete genomic sequences and annotated ORFs for all archaeal genomes except Pyrobaculum oguniense and Thermoproteus tenax were obtained from NCBI RefSeq (Pruitt, Tatusova et al. 2007). Pyrobaculum oguniense is a pre-release draft genome sequenced by the Lowe Lab at the University of California, Santa Cruz. The genomic sequence and gene annotations of T. tenax were provided by Bettina Siebers (Universität Duisburg-Essen, Germany).

Type T RNase P RNA covariance model development. RNase P RNA sequences from Pyrobaculum (P. aerophilum, P. arsenaticum, P. calidifontis, P. islandicum, and T. neutrophilus), Caldivirga maquilingensis, and Vulcanisaeta distributa were manually aligned against the predicted secondary structures (Figure 4.1) to create a Stockholm file. Infernal v1.0 (Nawrocki, Kolbe et al. 2009) was used to build and calibrate the type T specific covariance model.
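To make the input format concrete, the sketch below writes a structure-annotated alignment in Stockholm format, with the consensus secondary structure on a `#=GC SS_cons` line. The helper name `to_stockholm`, the two 10-nt toy sequences, and the restriction to `<` and `>` pair symbols are illustrative assumptions; the real alignment (Figure 4.2) was built by hand and is far longer.

```python
def to_stockholm(aligned: dict[str, str], ss_cons: str) -> str:
    """Format aligned sequences plus a consensus structure as Stockholm text.

    Assumptions: sequence names contain no whitespace, all rows have the same
    length as ss_cons, and base pairs are written only with '<' and '>'.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    for name in aligned:
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid sequence name: {name!r}")
    lengths = {len(seq) for seq in aligned.values()}
    lengths.add(len(ss_cons))
    if len(lengths) != 1:
        raise ValueError("sequences and SS_cons must all have the same length")

    # Check that every '<' is closed by a later '>'.
    depth = 0
    for ch in ss_cons:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced SS_cons: '>' without '<'")
    if depth != 0:
        raise ValueError("unbalanced SS_cons: unclosed '<'")

    label = "#=GC SS_cons"
    width = max(len(label), *(len(name) for name in aligned))
    lines = ["# STOCKHOLM 1.0", ""]
    for name, seq in aligned.items():
        lines.append(f"{name.ljust(width)} {seq}")
    lines.append(f"{label.ljust(width)} {ss_cons}")
    lines.append("//")
    return "\n".join(lines) + "\n"


# Toy hairpin (hypothetical sequences): the third pair covaries A-U versus U-A.
toy = {"seq1": "GGAGCAAUCC", "seq2": "GGUGCAAACC"}
print(to_stockholm(toy, "<<<....>>>"))
```

A file in this form is the kind of input that Infernal's model-building step reads; the covarying pair in the toy example is the sort of signal a covariance model captures beyond primary sequence.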

Archaeal RNase P RNA sequence search. Infernal v1.0 (Nawrocki, Kolbe et al. 2009) was used to search for RNase P RNA candidates in archaeal genomes using the type T RNase P RNA covariance model and the Rfam (Gardner, Daub et al. 2009) archaeal RNase P RNA covariance model (RF00373). The program was initially run in the global search mode. All hits with a score > 0 bits were manually examined. Local search mode was also employed, which provided better sensitivity but decreased selectivity.
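The post-processing described above (keep hits scoring above 0 bits for manual examination, then report the top-scoring candidate per genome, as in Table 4.1) can be sketched as follows. The `Hit` record and its fields are hypothetical; this does not parse Infernal's actual output format, and the example values are invented.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One covariance-model hit (hypothetical record, not Infernal's output format)."""
    genome: str
    start: int
    end: int
    bits: float


def best_hits(hits: Iterable[Hit], min_bits: float = 0.0) -> dict[str, Hit]:
    """Keep hits scoring strictly above min_bits; return the top hit per genome."""
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.bits <= min_bits:
            continue  # at or below threshold: not examined further
        current = best.get(hit.genome)
        if current is None or hit.bits > current.bits:
            best[hit.genome] = hit
    return best


# Invented example values: genomeC has no hit above 0 bits and is dropped.
example = [
    Hit("genomeA", 100, 350, 120.5),
    Hit("genomeA", 9000, 9200, 3.1),
    Hit("genomeB", 5000, 5240, 48.2),
    Hit("genomeC", 70, 300, -0.45),
]
for genome, hit in best_hits(example).items():
    print(genome, hit.bits)
```

A genome missing from the result corresponds to a "Not Found" entry in the tables only under the assumption that "Not Found" means no hit above the threshold; the text does not define it explicitly.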

RNase P protein database searches in T. tenax. Protein sequences of Pop5, Rpp30, Rpp29, and Rpp21 for T. tenax were retrieved from Pfam (Finn, Mistry et al. 2010) domain searches (RNase_P_Rpp14 [Pop5]: PF01900; RNase_P_p30 [Rpp30]: PF01876; UPF0086 [Rpp29]: PF01868; and Rpr2 [Rpp21]: PF04032), and Phylo-HMM (Siepel and Haussler 2004) multiple alignments in the UCSC Archaeal Genome Browser (Schneider, Pollard et al. 2006) were used to predict homology. Default scoring thresholds for PSI-BLAST (E-value: 10; word size: 3) and Pfam (trusted cutoff for Pop5: 23.4 bits; Rpp30: 20.3 bits; Rpp29: 21.1 bits; Rpp21: 23.2 bits) searches were initially adopted. Thresholds were further adjusted (E-value: 100 and word size: 2 for PSI-BLAST; trusted cutoff as -80 bits for Pfam) to search for proteins not identified with the default scan.


Figure 4.1 Predicted secondary structures of type T RNase P RNAs. A. Methanobacterium thermoautotrophicum RNase P RNA (RPR), a typical archaeal type A RPR, has catalytic and specificity domains (Harris, Haas et al. 2001). It is shown for comparison with type T RPRs. Red, common structural differences between type A and type T RPRs. Black circle, universally conserved nucleotide. B-D. Type T RPRs found in Pyrobaculum aerophilum, Caldivirga maquilingensis, and Vulcanisaeta distributa have structural differences in P1, P5, P7, P8, and P9 stems (blue).


Figure 4.2 Structural alignments of Pyrobaculum, Caldivirga, and Vulcanisaeta RNase P RNAs. RNase P RNA sequences from five Pyrobaculum species, C. maquilingensis, and V. distributa were structurally aligned in Stockholm file format (Nawrocki, Kolbe et al. 2009) to create the type T RNase P RNA covariance model.


Figure 4.3 Predicted secondary structure and sequence alignment of Thermoproteus tenax RNase P RNA A. Predicted secondary structure of RNase P RNA (RPR) in T. tenax resembles Pyrobaculum type T RPR variant (Figure 4.1B). Black circle, universally conserved nucleotide. B. Multiple RNA sequence alignment shows high conservation between Pyrobaculum species and T. tenax, with a few single-nucleotide insertions/deletions, which are common in non-coding RNA genes, but unlikely in protein-coding genes due to frameshifts.


Figure 4.4 Predicted secondary structures of type M RNase P RNA variants A. Archaeoglobus fulgidus has a typical archaeal type M RNase P RNA (RPR) and is shown for comparison (Harris, Haas et al. 2001). B. Newly identified type M RPR variant in Archaeoglobus profundus has a P8 stem (orange) that is missing in other type M RPRs. Black circle, universally conserved nucleotide.

Table 4.1 RNase P RNA search in Thermoproteaceae using Infernal v1.0 (Nawrocki, Kolbe et al. 2009). The Rfam archaeal RNase P RNA covariance model (RF00373) and the type T RNase P RNA covariance model were used with Infernal v1.0 (Nawrocki, Kolbe et al. 2009) to search the Thermoproteaceae genomes. For each genome and model, the RNase P RNA candidate with the highest bit score is reported.

Covariance model search scores (bits)

| Genome | Archaeal RNase P RNA model (RF00373) | Type T RNase P RNA model |
|---|---|---|
| Caldivirga maquilingensis IC-167 | Not Found | 151.30 |
| Pyrobaculum aerophilum | Not Found | 126.02 |
| Pyrobaculum arsenaticum | 1.34 | 135.34 |
| Pyrobaculum calidifontis | Not Found | 124.20 |
| Pyrobaculum islandicum | 0.05 | 141.56 |
| Thermoproteus neutrophilus V24Sta [to be reclassified as a Pyrobaculum species] | Not Found | 125.71 |
| Pyrobaculum oguniense | Not Found | 145.73 |
| Thermoproteus tenax | -0.45 | 117.04 |
| Vulcanisaeta distributa IC-017 | 5.32 | 167.00 |

Chapter 5

Discovery of Permuted and Recently Split Transfer RNAs in Archaea4

4 This chapter is a manuscript co-written with Todd M. Lowe and co-edited with Aaron E. Cozen. It was submitted for publication as Chan, P.P., A.E. Cozen, L.M. Lui, and T.M. Lowe. Discovery of Permuted and Recently Split Transfer RNAs in Archaea.

5.1 Abstract

As in eukaryotes, precursor transfer RNAs in archaea often contain introns that are removed in tRNA maturation. Two unrelated archaeal species display unique pre-tRNA processing complexity in the form of “split” tRNA genes: 2-3 segments of tRNAs are transcribed from different loci, then trans-spliced to form a mature tRNA. Another rare type of pre-tRNA, found only in eukaryotic algae, is “permuted”, where the 3′ half is encoded upstream of the 5′ half, and must be processed to be functional. Using an improved version of the gene-finding program tRNAscan-SE, comparative analyses, and experimental verifications, we have now identified four novel trans-spliced tRNA genes, each in a different species of the Desulfurococcales branch of the Archaea: tRNAAsp(GUC) in Aeropyrum pernix and Thermosphaera aggregans, and tRNALys(CUU) in Staphylothermus hellenicus and Staphylothermus marinus. Each of these includes features surprisingly similar to previously studied split tRNAs, yet comparative genomic context analysis and phylogenetic distribution suggest several independent, relatively recent splitting events. Additionally, we identified the first examples of permuted tRNA genes in Archaea: tRNAiMet(CAU) and tRNATyr(GUA) in Thermofilum pendens, which appear to be permuted in the same arrangement seen previously in red algae. Our findings illustrate that split tRNAs are sporadically spread across a major branch of the Archaea, and that permuted tRNAs are a new shared characteristic between archaeal and eukaryotic species. The split tRNA discoveries also provide new clues to their evolutionary history, supporting hypotheses for recent acquisition via viral or other mobile elements.

5.2 Introduction

Transfer RNA (tRNA) genes play an essential role in protein translation in all living cells. Among the various stages of maturation that are required to generate functional tRNAs, intron removal is a key processing event occurring in precursor tRNAs (pre-tRNAs) in eukaryotes and archaea. While the majority of these introns are found one nucleotide downstream of the anticodon, some archaeal species have introns scattered among seemingly random, “noncanonical” positions in tRNA genes. These noncanonical introns preserve a general bulge-helix-bulge (BHB) secondary structure that is similar to canonical introns (Marck and Grosjean 2003), and are found almost exclusively in members of the phylum Crenarchaeota, with most species containing only a small number of noncanonical introns. The hyperthermophile Pyrobaculum aerophilum is an exceptional case, containing a total of 21 noncanonical introns in 46 tRNAs, more than four times as many as in any other species (Marck and Grosjean 2003). With the availability of four recently sequenced Pyrobaculum genomes (Pyrobaculum arsenaticum, Pyrobaculum calidifontis, Pyrobaculum islandicum, and Thermoproteus neutrophilus, to be reclassified as a Pyrobaculum species), it was found that P. calidifontis harbors a superlative total of over 70 tRNA introns, the vast majority being noncanonical (Sugahara, Kikuta et al. 2008; Chan and Lowe 2009).

A separate but potentially related class of tRNAs requiring unusual processing was first found in Nanoarchaeum equitans, an obligate archaeal hyperthermophilic symbiont with an extremely small genome. It contains six trans-spliced split tRNAs, encoded by genes that are broken into halves and distantly separated in the genome (Randau, Munch et al. 2005; Randau, Pearson et al. 2005). These trans-spliced tRNAs are similar to pre-tRNAs containing canonical or noncanonical introns in that they form a BHB secondary structure at the exon-splicing junction, which is likely processed by a common endonuclease (Randau, Calvin et al. 2005). Although this finding was considered an exceptional process in an unusual organism, the discovery of trans-spliced tRNAs in the free-living thermophilic crenarchaeon Caldivirga maquilingensis hints at a broader relevance (McClay and van den Oord 2005; Fujishima, Sugahara et al. 2009). Unfortunately, the large evolutionary distance between N. equitans and C. maquilingensis and the lack of similarity between their split tRNAs limit estimates of their age or of the genome dynamics that might be involved.

Atypical tRNAs are not limited to archaeal species. Permuted tRNAs with a 3′ half positioned upstream of the 5′ half, creating a BHB-like motif when paired at their termini, were identified in the genomes of several unicellular algae including the red alga Cyanidioschyzon merolae (Soma, Onodera et al. 2007; Maruyama, Sugahara et al. 2009). Various hypotheses have been proposed to explain the existence of these introns and fragmented tRNA genes, including a possible archaeal origin (Di Giulio 2008; Randau and Soll 2008; Di Giulio 2009; Fujishima, Sugahara et al. 2009; Sugahara, Fujishima et al. 2009). However, the mechanism of their acquisition and their biological significance are still not known.

With the increasing number of sequenced archaeal genomes, there are opportunities to uncover new cases of exceptional tRNA encoding, which could provide important clues to understanding the evolutionary origins and mechanistic details shared among trans-spliced, permuted, and atypical intron-containing tRNAs.

Here we report two key discoveries with new evolutionary implications. First, we describe four novel trans-spliced tRNAs, distributed among half (4 out of 8) of the species with decoded genomes in the Desulfurococcales branch of the Crenarchaeota. Unlike the trans-spliced tRNAs previously observed in N. equitans and C. maquilingensis (Randau, Munch et al. 2005; Fujishima, Sugahara et al. 2009), the genomic proximity of the newly discovered tRNA gene fragments suggests relatively recent introduction into these genomes. Second, we describe strong evidence for the first examples of permuted tRNAs in Archaea, which bear a striking resemblance to permuted tRNAs recently found in eukaryotic algal genomes. Our findings demonstrate that these special tRNA features are not as rare as once perceived, increasing their biological relevance as well as providing valuable additional data points for future efforts to determine their origins and required co-factors.

5.3 Results

5.3.1 Split tRNAAsp(GUC) in Aeropyrum and Thermosphaera consist of adjacent halves

Our initial investigation focused on tRNAs with properties that place them at the extremes of archaeal tRNA characteristics. Transfer RNA introns identified in sequenced archaeal genomes have sizes ranging from 11 to 129 nucleotides (nt) with a median of 15 nt (Chan and Lowe 2009). Besides introns of tRNATrp(CCA) that encode an embedded C/D box small RNA (sRNA) (Omer, Lowe et al. 2000; Clouet d'Orval, Bortolin et al. 2001; Singh, Gurha et al. 2004), only one tRNA gene, tRNAAsp(GUC) of Aeropyrum pernix, contains an intron exceeding 100 nt (Table 5.1). Using an archaeal-specific version of snoscan (Lowe and Eddy 1999), and manual alignment to other predicted C/D box sRNAs in A. pernix and other archaeal species, we failed to find any trace of a C/D box sRNA that might explain the long tRNA intronic region. Promoter analyses in the region revealed a strong candidate with a transcription factor B recognition element (BRE) and TATA box, in the middle of the tRNAAsp(GUC) intron, on the same coding strand as the tRNA gene. This high-confidence promoter prediction scores better than 88% of all predicted transcripts in the genome (Figure 5.1), better than 63% of the predicted promoters for annotated tRNA genes in A. pernix, and is highly similar to the promoter upstream of the 5′ end of the tRNAAsp(GUC) gene (Table 5.2). The promoter’s placement predicts a transcription start site 20 to 25 nt upstream of the 3′ end of the intron, consistent with production of a leader sequence whose length and high G/C content are similar to those of leaders found in the 3′ halves of trans-spliced tRNA genes (Randau, Pearson et al. 2005; Fujishima, Sugahara et al. 2009).
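Statements such as "scores better than 88% of all predicted transcripts" are rank comparisons of one score against a background set. The promoter scoring scheme itself is not described in this section, so the sketch below only shows the ranking arithmetic, with a hypothetical function name and invented scores; counting only strictly lower scores is also an assumption about how ties are treated.

```python
def fraction_outscored(score: float, background: list[float]) -> float:
    """Fraction of background scores that are strictly lower than `score`.

    Assumption: ties do not count as outscored. The source does not say
    how ties were handled.
    """
    if not background:
        raise ValueError("background must not be empty")
    lower = sum(1 for s in background if s < score)
    return lower / len(background)


# Invented scores: the candidate beats 3 of the 5 background predictions.
print(fraction_outscored(7.5, [1.0, 2.0, 8.0, 9.0, 3.0]))  # 0.6
```

Running the same comparison against two different backgrounds (all predicted transcripts versus tRNA promoters only) yields two different percentages for one promoter, as with the 88% and 63% figures above.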

With the aid of an improved version of tRNAscan-SE (Lowe and Eddy 1997), we found that tRNAAsp(GUC) can be modeled as a combination of two separate transcripts joined between positions 37 and 38 (canonical intron position) (Figure 5.2A, Table 5.3). Similar to other split tRNAs, a canonical BHB (hBHBh′) secondary structure at the exon-splicing junction followed by a 14 bp G/C-rich stem is observed. RT-PCR with transcript-specific primers and sequencing of PCR-derived clones verify expression of the mature tRNA and the two tRNA halves (Figure 5.3A). To further confirm that the two halves are separate transcripts, we conducted northern analysis with specific probes antisense to the two halves and the full-length 199-nt precursor tRNA. Results showed that the two pre-tRNA halves are separately expressed, as is the predicted mature tRNA (Figure 5.3B). The absence of a detectable 199-nt pre-tRNAAsp(GUC) in these experiments supports the conclusion that this tRNA is derived from two separate primary transcripts rather than a single, intron-containing precursor (Figure 5.3B).

tRNAAsp(GUC) in A. pernix is the first example of a trans-spliced tRNA encoded by separate, but directly adjacent transcripts. Upon re-examination of all tRNAs in Archaea for a similar pattern, we found that the tRNAAsp(GUC) ortholog in Thermosphaera aggregans, a crenarchaeon in the same phylogenetic order (Desulfurococcales), also appears to be split, with the two halves joined at the canonical intron position (Figure 5.2A, Table 5.3). Similar to the arrangement in A. pernix, the 5′ half of the T. aggregans split tRNA is located adjacent to the 3′ half. Unlike A. pernix, the two halves are encoded on opposite strands in a convergently transcribed orientation (Figure 5.4).

To get an evolutionary perspective on the lineage of this tRNA gene, we examined the syntenic region in Desulfurococcus kamchatkensis, the closest sequenced relative of T. aggregans. In both of these species, a sequence of three genes is present: eIF-2A, Nop10p, then tRNAAsp(GUC) (Figure 5.4). However, the syntenic region ends in the middle of tRNAAsp(GUC), where apparently the ancestral, uninterrupted form of tRNAAsp(GUC) observed in D. kamchatkensis has been split and inverted in T. aggregans by an unknown series of genome rearrangement events. Examination of the tRNAAsp(GUC) genomic regions for syntenic blocks in the eight sequenced Desulfurococcales species showed that every species had genome rearrangements downstream of tRNAAsp(GUC) relative to every other species, and three had rearrangements upstream as well. The split tRNAAsp(GUC), like many other tRNAs (She, Brugger et al. 2002; Krupovic and Bamford 2008), appears to be a common position for genome recombination and/or viral integration. Although A. pernix and T. aggregans are both Desulfurococcales, they are separated by four other species which are more closely related, but lack the split tRNAAsp(GUC). Given that none of the other six Desulfurococcales species have a split tRNAAsp(GUC), the most parsimonious explanation for these observations is that a tRNA-splitting event happened relatively recently in two independent instances among the Desulfurococcales. The agent causing the splitting event could be specific for these tRNA sequences, as there are only 4 nucleotide changes between the A. pernix and T. aggregans tRNA sequences. Notably, there are many more changes in length and sequence identity (20 nt differences) in the complementary region required to join the split halves, thus disfavoring the possibility of recent lateral transfer of a precursor split tRNA gene. As such, these are likely to be the first compelling examples of “late” acquisition of split tRNA genes.

5.3.2 tRNALys(CUU) in Staphylothermus resembles its ortholog in Nanoarchaeum

Using an improved version of the tRNAscan-SE computer program (Lowe and Eddy 1997) (manuscript in preparation), we predicted 46 tRNAs in Staphylothermus marinus. The set lacked a tRNALys(CUU) gene and included an extra, low-scoring tRNALeu(CAA) prediction (26.23 bits, whereas all other archaeal tRNAs score >30.0 bits). This suggested a potential tRNA isotype mis-assignment. Indeed, closer examination of the low-scoring tRNALeu(CAA) revealed an almost exact match to tRNALys(UUU) from position 30 to the 3′ terminus, with no similarity upstream of position 30. The single difference between these two sequences is a U to C change, aligned to anticodon position 34 in tRNALys(UUU), consistent with the fragment being the 3′ end of the missing tRNALys(CUU). Additional sequence similarity searches identified a candidate 5′ half fragment of the tRNALys(CUU) encoded 4500 nucleotides away (Figure 5.5A, Table 5.3), ruling out the possibility of an extremely long polycistronic intron. Strong promoters matching the promoter consensus (scoring better than 40% of other identified tRNA promoters) were found at the expected distance upstream of both candidate loci.

Intriguingly, the two halves of the candidate split tRNALys(CUU) in S. marinus join between positions 30 and 31, precisely the same position as in the split tRNALys(CUU) of the distantly related Nanoarchaeum equitans (Table 5.3) (Randau, Pearson et al. 2005). A canonical BHB (hBHBh′) secondary structure at the exon-splicing junction is also observed, just as in N. equitans tRNALys(CUU), followed by a perfect 13-nucleotide trans-pairing between the region downstream of the 5′ half and the region upstream of the 3′ half (Figure 5.2B). Fortuitously, the genome of another species in the genus Staphylothermus, S. hellenicus, was recently sequenced, enabling identification of the orthologous split tRNALys(CUU), which closely resembles its counterpart in S. marinus (Figure 5.2B, Table 5.3). While the sequences of the 3′ halves in the two species are identical, there is one nucleotide difference near the end of the 5′ tRNA half. The difference conserves a base pair in the trans-paired region, changing the predicted G-U pairing in S. marinus to a G-C pair in S. hellenicus (Figure 5.2B), further suggesting selective pressure to maintain a perfect 13 nt trans interaction.
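The notion of a "perfect" trans-pairing that tolerates a G-U to G-C change can be made concrete with a small check. The following is a minimal sketch, assuming that Watson-Crick and G-U wobble pairs both count as paired positions; the function names are hypothetical and the 13 nt segments are made-up placeholders, not the Staphylothermus sequences.

```python
# Allowed RNA base pairs: Watson-Crick plus the G-U wobble pair (assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_trans_pairs(downstream_of_5p_half, upstream_of_3p_half):
    """Count paired positions when two segments are aligned antiparallel.

    Both arguments are RNA strings written 5'->3'. The second segment is
    reversed so that position i of the first faces its pairing partner.
    """
    a = downstream_of_5p_half.upper()
    b = upstream_of_3p_half.upper()[::-1]
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return sum((x, y) in PAIRS for x, y in zip(a, b))


def is_perfect_duplex(seg_a, seg_b):
    """True if every position of seg_a pairs with its partner in seg_b."""
    return count_trans_pairs(seg_a, seg_b) == len(seg_a)


# Placeholder 13 nt segments (illustrative only, not taken from any genome).
seg_a = "GGCUAGCCGUAGC"
seg_b_gc = "GCUACGGCUAGCC"        # exact reverse complement of seg_a
seg_b_gu = "GCUAUGGCUAGCC"        # same, but one G-C pair replaced by G-U
seg_b_mismatch = "GCUAAGGCUAGCC"  # same position changed to a non-pairing base

print(is_perfect_duplex(seg_a, seg_b_gc))        # True
print(is_perfect_duplex(seg_a, seg_b_gu))        # True
print(count_trans_pairs(seg_a, seg_b_mismatch))  # 12
```

Under this definition a G-U to G-C substitution leaves the duplex fully paired, whereas any other substitution at that position would break it.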

These two closely related Staphylothermus species provide the first case in which the relationship between genome stability and split tRNAs can be examined. The neighboring genes located upstream of the 5′ half (Smar_1316 – Smar_1321, Shell_1133 – Shell_1128), between the 5′ and 3′ halves (Smar_1322 – Smar_1326, Shell_1127 – Shell_1123), and downstream of the 3′ half (Smar_1327 – Smar_1332, Shell_1122 – Shell_1116) retain complete synteny between the species, showing a lack of recombination or integration events in this region (Figure 5.5). However, tRNA genes are known to be positions of genome rearrangement due to processes including transposon and viral integration (Reiter, Palm et al. 1989; Hall and Collis 1995; She, Peng et al. 2001; Krupovic and Bamford 2008). For context, we examined genome rearrangement events adjacent to the other 45 ortholog pairs of tRNAs, and found that 20 of them (44%) had a break in synteny upstream, downstream, or on both sides of the tRNAs. Thus, the split tRNALys(CUU) arrangement has been preserved with no local recombination since the divergence of S. marinus and S. hellenicus, which is consistent with hypotheses proposing enhanced genome stability from split tRNAs.
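The synteny comparison around each tRNA ortholog pair can be illustrated with a short sketch. It assumes that a "break" means the immediate upstream or downstream neighbor of a tRNA in one species is not the ortholog of the corresponding neighbor in the other species; the gene identifiers and the ortholog map are invented for illustration, and gene orientation is ignored for simplicity.

```python
def has_synteny_break(flanks_a, flanks_b, orthologs):
    """Return True if the upstream or downstream neighbor is not conserved.

    flanks_a, flanks_b: (upstream_gene, downstream_gene) around the tRNA
        in species A and species B.
    orthologs: dict mapping species-A gene IDs to species-B ortholog IDs.
    """
    up_a, down_a = flanks_a
    up_b, down_b = flanks_b
    return orthologs.get(up_a) != up_b or orthologs.get(down_a) != down_b


def fraction_with_breaks(loci, orthologs):
    """Return (number of loci with a break, total loci, fraction)."""
    breaks = sum(has_synteny_break(a, b, orthologs) for a, b in loci)
    return breaks, len(loci), breaks / len(loci)


# Hypothetical ortholog map and two tRNA loci (placeholders, not real genes).
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
loci = [
    (("a1", "a2"), ("b1", "b2")),  # both neighbors conserved
    (("a3", "a4"), ("b3", "b9")),  # downstream neighbor differs
]
print(fraction_with_breaks(loci, orthologs))  # (1, 2, 0.5)

# The proportion reported in the text: 20 of 45 ortholog pairs.
print(round(100 * 20 / 45))  # 44
```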

5.3.3 Permuted tRNAs in Thermofilum pendens have the same structure as in red alga

We computationally screened for other atypical tRNA transcripts by aligning tRNAs and their upstream promoter regions to identify unusual spacing between candidate promoters and the predicted tRNA gene. When we applied this screen to the thermophilic crenarchaeon Thermofilum pendens, we found that the promoters of 44 of the 46 mature tRNA genes in the genome are located between 30 and 49 nucleotides upstream of the 5′ end of the mature tRNAs (Figure 5.6), a range that can be explained by natural variation in the lengths of the 5′ leaders of pre-tRNAs. The promoters of two outliers, tRNAiMet(CAU) and tRNATyr(GUA), were found at positions -72 and -65 respectively, implying 5′ leaders at least 16 nucleotides longer than all others. As with the initial prediction of the split tRNA in S. marinus, tRNAiMet(CAU) and tRNATyr(GUA) in T. pendens scored relatively low (26.23 and 32.76 bits respectively, compared to > 60 bits for other tRNA genes in this genome). Secondary structure analysis showed that these tRNA predictions fail to form the requisite tRNA cloverleaf secondary structure at the T-Ψ-C loop and the acceptor stem.
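The spacing screen described above amounts to flagging tRNA genes whose predicted promoter lies outside the typical window. The sketch below assumes the window of -49 to -30 reported for 44 of the 46 genes; the function names are hypothetical, and the two non-outlier entries are placeholders rather than real T. pendens predictions.

```python
TYPICAL_RANGE = (-49, -30)  # promoter window relative to the mature 5' end


def flag_spacing_outliers(promoter_positions, typical_range=TYPICAL_RANGE):
    """Return the tRNAs whose promoter position falls outside the window."""
    low, high = typical_range
    return {
        name: pos
        for name, pos in promoter_positions.items()
        if not low <= pos <= high
    }


def extra_leader_length(position, typical_range=TYPICAL_RANGE):
    """Nucleotides by which a promoter lies farther upstream than the window."""
    return max(0, typical_range[0] - position)


positions = {
    "tRNAiMet(CAU)": -72,    # outlier reported in the text
    "tRNATyr(GUA)": -65,     # outlier reported in the text
    "placeholder-tRNA-1": -35,  # hypothetical typical gene
    "placeholder-tRNA-2": -49,  # hypothetical gene at the window edge
}

outliers = flag_spacing_outliers(positions)
print(sorted(outliers))          # ['tRNATyr(GUA)', 'tRNAiMet(CAU)']
print(extra_leader_length(-65))  # 16
print(extra_leader_length(-72))  # 23
```

The value 16 for tRNATyr(GUA) corresponds to the "at least 16 nucleotides longer" statement: a promoter at -65 sits 16 nt beyond the far edge of the typical window.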

Results from an improved version of tRNAscan-SE (Lowe and Eddy 1997) showed that the original predictions of tRNAiMet(CAU) and tRNATyr(GUA) in T. pendens are only the 5′ halves of these tRNA genes. For each tRNA, a precisely matching 3′ half was found between the 5′ half and its predicted promoter on the same strand, suggesting that the two fragments belong to the same transcript and use a single promoter (Figure 5.6). Primary and secondary structure analyses strongly support the conclusion that these are circularly permuted tRNAs with a BHB motif at the exon-splicing junction, located between positions 59 and 60. Unexpectedly, the split position is precisely the same as in the T-Ψ-C loop-permuted tRNAs identified in the distantly related eukaryotic alga Cyanidioschyzon merolae (Figure 5.2C) (Soma, Onodera et al. 2007). The intervening sequences between the 3′ and 5′ halves of tRNAiMet(CAU) and tRNATyr(GUA) in T. pendens are 7 nt and 1 nt long respectively, whereas intervening sequences in known algal permuted tRNAs range from 5 nt to 85 nt. Both genes also include a canonical intron, indicating that permuted tRNA structure and normal intron splicing are not mutually exclusive. Unlike the permuted tRNAs in eukaryotes, these two archaeal tRNAs have genomically encoded 3′-terminal CCA sequences, and the tRNA sequences are quite different from those found in the red alga, ruling out a recent inter-domain transfer between algal and archaeal species. However, the sequences and proteins flanking these tRNAs share strongest similarity to species outside of the Thermoproteales, suggesting these may have been part of laterally transferred regions acquired after the divergence from other sequenced Thermoproteales. Both tRNAiMet(CAU) and tRNATyr(GUA) are single-copy genes in T. pendens with essential decoding functions that cannot be supplanted by other tRNAs, indicating that processing of permuted genes is an essential activity in this species.
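The gene arrangements discussed in this chapter (halves on opposite strands, halves at distant loci, and a 3′ half lying just upstream of its 5′ half) can be told apart from coordinates alone only to a first approximation. The sketch below is an illustrative classifier under stated assumptions: the `Half` record, the labels, and the 200 nt gap threshold are all hypothetical, the coordinates are invented, and a same-strand colinear pair cannot be resolved into "intron" versus "adjacent split" without promoter or expression evidence of the kind used in the text.

```python
from dataclasses import dataclass


@dataclass
class Half:
    start: int   # leftmost genomic coordinate (1-based)
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def classify_arrangement(five, three, max_gap=200):
    """Give a rough label for how two tRNA halves are arranged.

    max_gap is an arbitrary illustrative threshold (assumption) for how
    far apart two halves may lie and still be treated as one locus.
    """
    if five.strand != three.strand:
        return "split: opposite strands"
    left, right = sorted((five, three), key=lambda h: h.start)
    gap = right.start - left.end - 1
    if gap > max_gap:
        return "split: distant loci"
    # On the + strand the left element is transcribed first; on - the right.
    first_transcribed = left if five.strand == "+" else right
    if first_transcribed is three:
        return "permuted candidate"
    return "colinear: intron or adjacent split"


# Invented coordinates: a 3' half 7 nt upstream of its 5' half.
print(classify_arrangement(Half(1028, 1090, "+"), Half(1000, 1020, "+")))
# Invented coordinates: halves 4500 nt apart on the same strand.
print(classify_arrangement(Half(1, 30, "+"), Half(4531, 4580, "+")))
```

With these placeholder inputs the first call yields "permuted candidate" and the second "split: distant loci".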

5.4 Discussion

RNA trans-splicing, a processing event that joins two separate transcripts, was first discovered in the messenger RNAs of trypanosomes (Sutton and Boothroyd 1986) and Caenorhabditis elegans (Krause and Hirsh 1987), and more recently in human endometrial stromal cells (Li, Wang et al. 2008). Since the surprising discovery of trans-spliced tRNAs encoded in the minimal genome of N. equitans (Randau, Munch et al. 2005), and their more recent identification in a single unrelated crenarchaeal species (Fujishima, Sugahara et al. 2009), the broader significance and phylogenetic scope of these intriguing RNAs have remained uncertain. Our discovery of four novel trans-spliced tRNAs and two novel permuted tRNA genes greatly expands the total number of rearranged tRNAs in the Archaea to 18, in seven different species (Randau, Munch et al. 2005; Fujishima, Sugahara et al. 2009). All archaeal trans-spliced tRNAs have been found either in the crenarchaeal orders Desulfurococcales or Thermoproteales (Figure 5.7), or in N. equitans, an obligate symbiont of a Desulfurococcales species (Huber, Hohn et al. 2002). Re-examination of tRNA predictions in all sequenced genomes in the Crenarchaeota did not find any additional trans-spliced or permuted tRNAs, although we expect many more examples as the pace of new genome sequencing accelerates (Wu, Hugenholtz et al. 2009).

Whether there are ecological or genetic characters that distinguish organisms with split or permuted tRNA genes from organisms that lack such tRNA variants is an open and interesting question. Among the seven species now identified as harboring split tRNAs, the most prominent ecological trait is that all are hyperthermophiles.

Soon after the discovery of split tRNAs in N. equitans and permuted tRNAs in C. merolae, Randau and Söll suggested that split tRNA genes, like tRNA introns, present a strategy for preventing the integration of viral genomes or other mobile elements into otherwise conserved tRNA genes (Randau and Soll 2008; Heinemann, Soll et al. 2009). This is particularly relevant for tRNAs in hyperthermophiles, which must preserve a high number of strong G-C base pairs in stems in order to maintain stable secondary structure. In combination with the many other tRNA sequence identity elements needed for proper modification, aminoacylation, and decoding function, this further limits sequence diversity, making thermophile tRNAs more similar to each other than those of non-thermophiles. For example, there are just 7 nt changes in the mature tRNALys(CUU) between the phylogenetically distant Nanoarchaeum and Staphylothermus split tRNA orthologs (Figure 5.2).

Pre-tRNAs across the archaeal phyla (Crenarchaeota, Thaumarchaeota, Nanoarchaeota, Korarchaeota and three thermophilic euryarchaeal methanogens) contain introns located at canonical as well as noncanonical positions (Marck and Grosjean 2003; Hallam, Konstantinidis et al. 2006; Sugahara, Kikuta et al. 2008; Chan and Lowe 2009). The majority of pre-tRNAs have zero or one intron, although some include two or even three introns (Sugahara, Kikuta et al. 2008; Chan and Lowe 2009). Most of these multi-intronic pre-tRNAs have been identified in Pyrobaculum and Thermofilum, giving these genera the highest tRNA intron counts observed in Archaea. However, none of the five sequenced Pyrobaculum species has any trans-spliced or permuted tRNAs. C. maquilingensis and N. equitans have the most split tRNAs (six each), yet each has a relatively small number of tRNA introns. We concur with prior suggestions that the evolutionary selective pressures to maintain intronic or split tRNAs may be similar (Randau and Soll 2008); however, the specific genetic component or environmental vector(s) necessary for acquisition or maintenance of split versus intronic tRNAs are potentially quite different. Discovery of split or permuted tRNAs in five new species allows more focused searches for requisite proteins, protein variants, or genomic properties that correspond to the species that support these types of pre-tRNA variants. For example, it is now reasonably clear that small genome size, an important feature of N. equitans, where split tRNAs were discovered (Randau, Munch et al. 2005), is not strongly correlated with split tRNAs: only four of the eight Desulfurococcales, which have relatively small genomes among the crenarchaea, have split tRNAs.

The relative age of tRNA introns and split tRNAs has been a matter of open debate as well. Di Giulio proposed that split tRNAs such as those in N. equitans are the ancestral forms of the single-locus tRNA genes observed in most genomes (Di Giulio 2008; Di Giulio 2009). Sugahara et al. have suggested that the conserved intron sequences observed in Pyrobaculum support a late, rather than early, origin for tRNA introns (Sugahara, Kikuta et al. 2008). In this study, instances of split tRNAAsp(GUC) from species in different genera, and instances of split tRNALys(CUU) from species within the same genus, suggest multiple, relatively recent splitting or lateral transfer events within the Desulfurococcales. These are fundamentally different from prior examples of split tRNAs in several respects: the split halves are adjacent or relatively close in the genome, there is just one split tRNA per genome, and an orthologous split tRNA exists to help gauge the age and stability of flanking sequences. Careful examination of the exon-splicing junctions reveals that tRNALys(CUU) in S. marinus and S. hellenicus are ancestrally related and are located in a syntenic region of their genomes that has been stable since their divergence. Examination of the disparate exon-splicing junctions between tRNAAsp(GUC) in A. pernix and T. aggregans suggests two different local genome rearrangement events, created by a viral or mobile element that targets precisely the same position in the same tRNA. Viral genes have been identified integrated within tRNA genes in the euryarchaeal species Thermococcus kodakarensis KOD1 and Methanococcus voltae A3 (Krupovic and Bamford 2008). Integrated elements that overlap tRNA genes were also found in Sulfolobus and A. pernix (She, Peng et al. 2001; She, Brugger et al. 2002). We also observed six partial tRNA fragments in A. pernix and one in T. aggregans (Table 5.4), potentially due to other recent viral integrations or rearrangements at tRNA loci.

The trans-spliced and permuted tRNAs identified in this study indicate that rearrangement of tRNA genes is relatively common in at least one major branch of the thermophilic crenarchaea. Interactions between pre-tRNA splicing, 5′ and 3′ trimming of pre-tRNAs, tRNA modification, and tRNA editing have not been thoroughly investigated, but these are multiple related processes that may modulate the ability to support split or permuted tRNAs. The discovery of the six uniquely rearranged tRNAs in the Archaea presents new opportunities to study their evolution via their genomic context and basic sequence attributes. Based on these new examples, improved methods for detection, and increasing availability of sequenced genomes, we anticipate that numerous additional split or permuted tRNAs will be identified for future study.

5.5 Materials and Methods

Genomic data. Complete genomic sequences of Aeropyrum pernix, Staphylothermus hellenicus, Staphylothermus marinus, Thermofilum pendens, and Thermosphaera aggregans were obtained from NCBI RefSeq (accession numbers: NC_000854, NC_014205, NC_009033, NC_008698, and NC_014160, respectively).

tRNA gene prediction. Trans-spliced and permuted tRNAs were predicted using an improved version of tRNAscan-SE (manuscript in preparation) (Lowe and Eddy 1997). Archaeal tRNA-specific and BHB motif covariance models were created for similarity searching using Infernal 1.0 (Nawrocki, Kolbe et al. 2009) after pre-filtering possible candidates with tRNAscan (Fichant and Burks 1991) and an A/B box motif detection algorithm (Pavesi, Conterio et al. 1994). The default cutoff score was set to 20 bits.

Promoter identification. To generate a training set for promoter identification, potential operons were predicted genome-wide, requiring an intergenic separation of at least 100 nt (on the same strand). A 16-mer motif search of the 90 nt upstream of known genes (those not annotated as putative or hypothetical) was conducted using MEME (Bailey and Elkan 1994) to identify the consensus promoter, including the transcription factor B recognition element (BRE; 1 to 3 adenosines) plus the TATA box. A position-specific scoring matrix (PSSM) was generated from the alignments of the MEME results after manual inspection. Each organism’s PSSM was used to scan the 150-bp upstream region of all non-coding and protein-coding genes to identify potential promoter regions. Ten virtual genomes for each target genome were generated using a fifth-order Markov chain to retain the base frequency of the target genome, and scanned to identify the score distribution of false positives. The promoter candidates identified were filtered according to expected position (Slupska, King et al. 2001) and a threshold p-value equivalent to that of the lowest-scoring known gene.
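The scoring and background steps above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this work: it does not call MEME, it assumes a uniform background and a pseudocount of 1 when converting aligned sites to log-odds scores, the function names are hypothetical, and the sites and sequences are toy placeholders.

```python
import math
import random
from collections import Counter, defaultdict

BASES = "ACGT"


def build_pssm(aligned_sites, background=None, pseudocount=1.0):
    """Log2-odds position-specific scoring matrix from equal-length sites."""
    background = background or {b: 0.25 for b in BASES}  # uniform (assumption)
    total = len(aligned_sites) + pseudocount * len(BASES)
    pssm = []
    for i in range(len(aligned_sites[0])):
        counts = Counter(site[i] for site in aligned_sites)
        pssm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in sequence."""
    width = len(pssm)
    return max(
        (sum(col[base] for col, base in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )


def markov_virtual_genome(genome, order=5, seed=0):
    """Sample a sequence of the same length from an order-k Markov chain."""
    rng = random.Random(seed)
    transitions = defaultdict(Counter)
    for i in range(len(genome) - order):
        transitions[genome[i:i + order]][genome[i + order]] += 1
    out = list(genome[:order])
    while len(out) < len(genome):
        nxt = transitions.get("".join(out[-order:]))
        if not nxt:  # context seen only at the very end of the genome
            out.extend(genome[:order])  # restart from the initial context
            continue
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:len(genome)])


# Toy example: four aligned 6-mers and a short upstream region.
sites = ["TTTAAA", "TTTAAA", "CTTAAA", "TTTATA"]
pssm = build_pssm(sites)
score, offset = best_hit(pssm, "GCGCGCTTTAAAGCGC")
print(offset)  # 6

# Score distribution on virtual genomes approximates the false-positive scores.
toy_genome = "ACGTACGGTTAACCGGATATCGCGTTAAGC" * 20
null_scores = [
    best_hit(pssm, markov_virtual_genome(toy_genome, order=5, seed=s))[0]
    for s in range(10)
]
```

Comparing a candidate's score with `null_scores` gives an empirical sense of how often a score that high arises by chance in sequence with the same local composition.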

Culture conditions for Aeropyrum pernix. A. pernix K1 was obtained from DSMZ. A. pernix cultures were grown in TY medium (0.4% tryptone, 0.2% yeast extract, pH 7) supplemented with 3.86 mM sodium thiosulfate (Kim and Lee 2003). These cultures were incubated aerobically at 90°C and collected at mid- to late-log phase. Cell pellets were frozen in liquid N2 and stored at -80°C.

Total RNA preparation. Total RNA was extracted from the frozen cell pellets using a Polytron tissue homogenizer and TRI Reagent (Sigma-Aldrich). RNA samples were treated with TURBO DNase (Ambion) to remove any residual DNA, re-extracted with TRI Reagent, and normalized to 1.5 µg/µl.

Genomic DNA preparation. Cell pellets were incubated overnight in SNET lysis buffer (400 mM NaCl, 1% SDS, 20 mM Tris-Cl, 5 mM EDTA) with 400 µg/ml proteinase K at 55°C. Genomic DNA was then extracted from the cell lysate using phenol:chloroform:isoamyl alcohol. DNA samples were treated with RNase A to remove any residual RNA and re-extracted with phenol:chloroform:isoamyl alcohol.

RT-PCR and sequencing. Total RNA from A. pernix was denatured at 100°C for 5 min and cooled on ice for 5 min. First-strand cDNAs were synthesized from denatured total RNA using gene-specific reverse primers and SuperScript III reverse transcriptase (Invitrogen) at 65°C for 30 min according to the manufacturer’s instructions. These cDNA templates were PCR-amplified using forward and reverse primers spanning the mature tRNAs, the 5′ tRNA halves, and the 3′ tRNA halves. PCR parameters for 30-cycle amplifications were denaturation at 94°C for 30 seconds, annealing at 58°C for 1 minute, and extension at 72°C for 2 minutes using AmpliTaq DNA polymerase (Applied Biosystems). PCR products were cloned with the pCR-2.1-TOPO cloning kit (Invitrogen). Plasmid DNA was extracted using the Zyppy plasmid miniprep kit (Zymo Research). DNA samples were sequenced at the University of California Berkeley DNA Sequencing Facility. Primers used for RT-PCR are listed below:

Forward primer for mature tRNAAsp(GUC) in A. pernix and its 5′ half:

5′-CGCGGTAGTATAGCCTGGA-3′

Forward primer for 3′ half of tRNAAsp(GUC) in A. pernix:

5′-CGGGCCTGCGGAGAG-3′

Reverse primer for mature tRNAAsp(GUC) in A. pernix and its 3′ half:

5′-GCGGCCGGGATTTGAAC-3′

Reverse primer for 5′ half of tRNAAsp(GUC) in A. pernix:

5′-GCGGGGCCCTTGACAG-3′

Northern analysis. Five micrograms of total RNA extracted from A. pernix cell cultures was denatured for 3 minutes at 90°C in an equal volume of 95% formamide gel loading buffer (Ambion), resolved on an 8% polyacrylamide-urea gel by electrophoresis, and blotted onto Hybond N+ membrane (GE Healthcare) by overnight electro-transfer. A 239 nt sequence including the A. pernix tRNAAsp(GUC) and the region between the two fragments was amplified by PCR using A. pernix genomic DNA and gene-specific primers Sn-Ape-tRNA-Asp (5′-CCCAGTGGTAAGATATGTGAACC-3′) and Asn-Ape-tRNA-Asp (5′-GGCCGCGAGGATTATTG-3′). The resulting PCR product served as the template for generating single-stranded, [α-32P]-ATP-labeled DNA probes by linear PCR using Asn-Ape-tRNA-Asp. Hybridizations were carried out at 42°C in UltraHyb buffer (Ambion). Hybridization patterns were determined using a PhosphorImager (Molecular Dynamics).

5.6 Author Contributions

P.C. identified trans-spliced and permuted tRNA genes, purified Aeropyrum total RNA and genomic DNA, performed RT-PCR and clone-based sequencing, performed Aeropyrum northern analysis, and co-wrote the manuscript. A.C. grew Aeropyrum cultures, contributed ideas for experimental verifications, and co-edited the manuscript. L.L. contributed ideas for scientific advancements. T.L. guided the research studies and co-wrote the manuscript.

5.7 Acknowledgments

This work was supported by a grant from the National Institutes of Health (HG004002-01A2 subaward to T.L.).

Figure 5.1 Predicted promoter score distribution in A. pernix. The histogram represents the score distribution of predicted promoters for all transcripts in the A. pernix genome. The scores of predicted promoters for the 5′ half and the 3′ half of pre-tRNAAsp(GUC) are as marked.

Figure 5.2 Predicted secondary structures of trans-spliced and permuted precursor tRNAs. A. Mature tRNAAsp(GUC) in A. pernix and T. aggregans are formed by joining the 5′ half and the 3′ half at position 37/38 after splicing at the bulge-helix-bulge (BHB) motif. B. The 5′ half and the 3′ half of trans-spliced tRNALys(CUU) in S. hellenicus and S. marinus join at position 30/31, the same position as in the previously identified split tRNALys(CUU) in N. equitans (Randau, Pearson et al. 2005). C. Circularized permuted tRNAiMet(CAU) and tRNATyr(GUA) in T. pendens have the 3′ half located upstream of the 5′ half, separated by intervening sequences represented in green. The two fragments join at position 59/60, the same position as in the T-Ψ-C loop permuted tRNAs in the red alga C. merolae (Soma, Onodera et al. 2007). Pre-tRNAAla(UGC) in C. merolae is shown for comparison. The 5′ halves of tRNA transcripts are represented in blue, the 3′ halves in orange. Black arrows indicate positions of splicing. Anticodons are boxed in light blue.

Figure 5.3 RT-PCR and northern analysis of tRNAAsp(GUC) in A. pernix. A. Expression analysis of mature tRNAAsp(GUC) (Mature), 5′ half of pre-tRNAAsp(GUC) (5′ h), and 3′ half of pre-tRNAAsp(GUC) (3′ h) using RT-PCR. M represents the 10 bp DNA ladder. The band sizes correspond to the sizes of the PCR products based on selected primers, but not the transcript sizes of the mature tRNA and the halves. B. Northern analysis of tRNAAsp(GUC) using a radiolabeled DNA probe that spans the mature tRNAAsp(GUC) and the region between the two fragments. The mature tRNA, the 5′ half transcript, and the 3′ half transcript are as marked. No expression was found corresponding to the 199 nt transcript originally predicted as pre-tRNAAsp(GUC) with a 121 nt intron. Bands at approximately 90 nt and 125 nt are expected due to cross-hybridization of highly similar tRNA sequences in other tRNA transcripts. M1 and M2 represent the 10 bp and 100 bp RNA ladders respectively.

Figure 5.4 Proposed evolutionary relationship between tRNAAsp(GUC) in D. kamchatkensis, A. pernix, and T. aggregans. D. kamchatkensis has a typical linear tRNAAsp(GUC), which represents the likely ancestor of the split tRNAAsp(GUC) genes in A. pernix and T. aggregans. The 5′ and 3′ halves of pre-tRNAAsp(GUC) are located adjacent to each other in A. pernix and T. aggregans. The two halves in A. pernix are transcribed on the forward strand while those in T. aggregans are transcribed on opposite strands. Breaks in synteny were observed between the tRNA halves and upstream of the 5′ half in A. pernix.

Figure 5.5 tRNALys(CUU) loci in S. marinus and S. hellenicus display strong synteny on the Archaeal Genome Browser (Schneider, Pollard et al. 2006). The blue segments located at the bottom tRNA gene track correspond to the 5′ half and the 3′ half of tRNALys(CUU). The arrows on the genes indicate the 5′-to-3′ expression direction. The colors on the protein-coding genes represent annotations of different functional classes of clusters of orthologous groups (COG). The blue track above the genes represents the G/C content computed with a 20-base sliding window. Compared to the neighboring protein-coding genes, the two half transcripts of tRNALys(CUU) have a high G/C content that is required for structural RNA stability in hyperthermophiles.

Figure 5.6 Alignment of tRNA promoters in T. pendens. The predicted promoter, including BRE and TATA box, of each tRNA gene is highlighted in yellow. The 5′ end of the mature tRNA-encoding sequence is highlighted in cyan. The 3′ halves of permuted mature tRNAiMet(CAU) and tRNATyr(GUA) are highlighted in orange. Gray is the splicing region of the 3′ half of pre-tRNAs. Green is the intervening sequence between the 5′ half and 3′ half of the permuted tRNAs. Scales above sequences are positions relative to the 5′ end of mature tRNAs. The black arrows indicate the direction of transcription.

Figure 5.7 Phylogenetic distribution of trans-spliced and permuted tRNAs in Archaea. Trans-spliced tRNAs were identified in A. pernix, T. aggregans, S. marinus, and S. hellenicus in this study, in N. equitans by Randau and colleagues (Randau, Munch et al. 2005; Randau, Pearson et al. 2005), and in C. maquilingensis by Fujishima and colleagues (Fujishima, Sugahara et al. 2009) (highlighted in gray). Permuted tRNAs were found in T. pendens in this study (yellow). The phylogenetic tree was generated based on the concatenation of 23S and 16S ribosomal RNAs. Sequences were aligned using ClustalW (Larkin, Blackshields et al. 2007). Alignments were manually adjusted using Jalview (Waterhouse, Procter et al. 2009) to remove introns. The maximum likelihood tree was computed using PhyML (Guindon, Dufayard et al. 2010) with the general time-reversible model of sequence evolution. Numbers at nodes represent non-parametric bootstrap values computed by PhyML (Guindon, Dufayard et al. 2010) with 1,000 replications of the original dataset.

Table 5.1 Summary of pre-tRNA intron size in 90 archaeal genomes. tRNAs and their introns were predicted using an improved version of tRNAscan-SE (Lowe and Eddy 1997) and are publicly available at the Genomic tRNA Database (Chan and Lowe 2009). The atypical predicted maximum intron size in pre-tRNAAsp is marked with an asterisk.

tRNA      Anticodon  Total Number   Total Number  Intron Length
Isotype              of tRNA Genes  of Introns    Minimum  Maximum  Median
Trp       CCA        88             95            13       129      65
Asp       GTC        105            15            15       121*     18
Tyr       GTA        91             60            13       94       13
Met       CAT        280            122           12       84       17
Pro       CGG        73             26            15       79       18
Thr       GGT        93             10            12       68       14
Ser       CGA        70             28            13       61       24
Thr       CGT        80             33            13       56       13
Glu       TTC        106            37            14       55       15
Thr       TGT        97             34            12       54       15
Arg       GCG        91             11            13       52       16
Arg       CCG        68             6             14       51       14
His       GTG        90             11            16       50       17
Ile       GAT        94             24            12       49       12
Gln       TTG        91             17            14       48       16
Pro       TGG        90             19            15       48       18
Cys       GCA        125            45            11       47       25
Gln       CTG        78             16            15       47       16
Arg       TCT        90             23            13       44       15
Ala       TGC        121            12            13       40       15
Glu       CTC        75             31            15       37       15
Pro       GGG        88             39            13       37       21
Gly       CCC        74             8             12       36       14
Leu       CAA        75             19            12       36       15
Lys       CTT        74             23            15       36       22
Lys       TTT        98             19            14       36       23
Leu       TAA        93             17            12       35       15
Asn       GTT        95             30            11       34       14
Gly       GCC        102            9             12       34       25
Ala       GGC        90             4             14       33       14
Phe       GAA        96             10            14       33       17
Ala       CGC        76             7             15       32       16
Ile       TAT        2              2             19       32       19
Val       TAC        90             16            14       32       25
Val       CAC        82             9             16       31       18
Gly       TCC        92             11            14       30       15
Ser       GGA        93             9             12       29       14
Val       GAC        94             10            13       28       20
Arg       CCT        78             23            13       24       13
Arg       TCG        90             10            16       24       20
Leu       CAG        73             6             15       23       15
Ser       TGA        91             10            11       23       12
Ser       GCT        91             2             12       22       12
Leu       GAG        96             2             19       20       19
Leu       TAG        90             2             16       18       16
Ala       AGC        0              0             0        0        0
Arg       ACG        0              0             0        0        0
Asn       ATT        0              0             0        0        0
Asp       ATC        0              0             0        0        0
Cys       ACA        0              0             0        0        0
Gly       ACC        0              0             0        0        0
His       ATG        0              0             0        0        0
Ile       AAT        0              0             0        0        0
Leu       AAG        0              0             0        0        0
Phe       AAA        0              0             0        0        0
Pro       AGG        0              0             0        0        0
Sel       TCA        5              0             0        0        0
Ser       ACT        0              0             0        0        0
Ser       AGA        0              0             0        0        0
Supres    CTA        0              0             0        0        0
Supres    TTA        0              0             0        0        0
Thr       AGT        0              0             0        0        0
Tyr       ATA        0              0             0        0        0
Val       AAC        0              0             0        0        0

Table 5.2 Predicted promoters of tRNA genes in Aeropyrum pernix. Promoter motifs that include the transcription factor B recognition element (BRE) and TATA box were predicted using a genome-specific position-specific scoring matrix (see Methods). Relative position of the promoter represents the center position of the predicted promoter motif relative to the 5′ end of the mature tRNA genes (except for the 3′ half of tRNAAsp, which includes the 5′ leader).

tRNA genes              Promoter          Relative Position  Score (bits)
chr.tRNA4-AspGTC-exon1  GCAAAGCTTTAAACCC  -33                14.88422966
chr.tRNA34-PheGAA       TTTAAGGTTAAAAACC  -43                13.85530281
chr.tRNA48-ThrTGT       CAAACCCTTTAAACCC  -40                13.48234177
chr.tRNA28-LeuGAG       CTAACACTTTATAGCC  -37                13.46852493
chr.tRNA38-GlyGCC       GTAAATCTTTAACCCT  -36                12.29589653
chr.tRNA39-AlaGGC       CTAATCCTTAAAACCT  -68                12.20191097
chr.tRNA50-SerGCT       GTAAACTTTTATTCCC  -40                12.04985428
chr.tRNA7-ValTAC        GTAAACCTATAAGACC  -43                11.99695683
chr.tRNA19-GlyTCC       TTATAGGTTAAAAACC  -56                11.44040298
chr.tRNA6-LeuCAA        CCTAGACTATAAAATA  -17                11.40154266
chr.tRNA27-AlaTGC       TAATAGGTTAAAAACT  -39                11.16594601
chr.tRNA45-ValCAC       CGTAAGCTATAAACCC  -44                11.05230331
chr.tRNA25-LeuCAG       CAAAGCGTTTAAAGGC  -36                11.00018215
chr.tRNA30-ProGGG       TTCAAACTTTTTACCC  -39                10.86858368
chr.tRNA10-IleGAT       TCTAGACTTAATAAGC  -40                10.86446857
chr.tRNA29-SerGGA       GTAGATCTTTATAACG  -34                10.75248814
chr.tRNA4-AspGTC-exon2  GCAAACGTTTTTAACC  -32                10.72685909
chr.tRNA32-GlyCCC       GCTAACCTTTAAGCCC  -37                10.67765903
chr.tRNA42-LeuTAG       TAAACCGTTTTAACCC  -37                10.55155659
chr.tRNA5-IleCAT        GAAACAGTATTAAACC  -42                10.43042278
chr.tRNA52-ThrCGT       GATAACCCTTAAACCC  -41                9.53168869
chr.tRNA12-ArgTCT       TCCATACTATAAAGGC  -39                9.501112938
chr.tRNA13-iMetCAT      ATGACACTATAATACT  -24                9.24230957
chr.tRNA14-GluCTC       GAAAGACTCAAAAACC  -72                8.909661293
chr.tRNA16-ValGAC       GGAAACCTATAAGACC  -43                8.867957115
chr.tRNA24-GluTTC       TCTAACCCTTAAAGCC  -36                8.451208115
chr.tRNA22-HisGTG       CTAGATCTATATAGGT  -138               8.201022148
chr.tRNA3-AlaCGC        GCAAGGCTTATTATCC  -42                8.062055588
chr.tRNA20-GlnCTG       GCACACCTATTTAACC  -35                7.943173409
chr.tRNA23-CysGCA       GTGAACCTCTAAAACC  -35                7.922645092
chr.tRNA51-ProCGG       AGAAGCCTTTAAGCCT  -57                7.771871567
chr.tRNA9-ArgGCG        GAACCACTATAACCAC  -33                7.722810268
chr.tRNA47-MetCAT       ACCAAGCTTAACACAC  -46                7.426821709
chr.tRNA31-ThrGGT       CTAACCAATTAAACCC  -41                7.340769291
chr.tRNA49-LysCTT       GAAAACTCTTATAGCC  -41                7.306219578
chr.tRNA15-LysTTT       AATAACCTTATGAGAG  -18                7.306199074
chr.tRNA26-ArgCCT       GTAACGCTTTTAGCCA  -55                7.185641766
chr.tRNA41-AsnGTT       GAAACCGTTAAAGCCA  -35                7.106212139
chr.tRNA33-GlnTTG       AATATCCTTTAGGGCA  -43                6.95100832
chr.tRNA43-SerCGA       CCAAGGGATTAAAACC  -54                6.744953156
chr.tRNA1-ArgCCG        GAGACGATAAAAATTC  -33                6.734880924
chr.tRNA21-TyrGTA       TTCAAACCTTATAGGT  -35                6.627279282
chr.tRNA35-SerTGA       GAGACTCTATTAACTC  -37                6.212833881
chr.tRNA40-TrpCCA       TTAGCGATTTTAAGCC  -42                6.016829968
chr.tRNA17-ArgTCG       GAAAACCATAACACAC  -34                5.531616211
chr.tRNA36-LeuTAA       GCGAGTATTTTTAGCC  -38                3.697657108
chr.tRNA44-ProTGG       -

Table 5.3 Summary of trans-spliced and permuted tRNAs. The novel trans-spliced and permuted tRNAs identified in this study are marked by asterisks. The previously identified split tRNAs in N. equitans (Randau, Munch et al. 2005; Randau, Pearson et al. 2005) and permuted tRNAs in C. merolae (Soma, Onodera et al. 2007) are listed for comparison. The 5′ start value represents the coordinates of the 5′ start site of the mature tRNA.

Table 5.4 Partial predicted archaeal tRNAs. tRNA fragments were predicted by tRNAscan-SE (Lowe and Eddy 1997) and verified by manual sequence alignments for tRNA isotype identification.

Organism                 Partial tRNA coordinates  tRNA isotype  Anticodon  Consensus tRNA positions
Aeropyrum pernix         200292-200336 (+)         Arg           CCG        32-76
                         1666802-1666846 (+)       Arg           CCG        32-76
                         478150-478213 (-)         Leu           TAA        21-76
                         898974-899036 (+)         Thr           TGT        7-69
                         328963-329007 (-)         Val           CAC        32-76
                         580684-580733 (+)         Val           TAC        27-76
Thermosphaera aggregans  843106-843141 (+)         Ser           TGA        30-55

Chapter 6

GtRNAdb: A database of transfer RNA genes detected in genomic sequence5

5 This chapter is an updated version of a manuscript co-written with Todd M. Lowe. The published version appears in Chan, P.P. and T.M. Lowe (2009) GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37: D93-7.

6.1 Abstract

Transfer RNAs represent the single largest, best-understood class of non-protein coding RNA genes found in all living organisms. By far, the major source of new transfer RNAs is computational identification of genes within newly sequenced genomes. To organize the rapidly growing collection and enable systematic analyses, we created the Genomic tRNA Database (GtRNAdb), currently including over 91,000 tRNA genes predicted from 759 species. The web resource provides overview statistics of tRNA genes within each analyzed genome, including information by isotype and genetic locus, easily downloadable primary sequences, graphical secondary structures, and multiple sequence alignments. Direct links for each gene to the UCSC eukaryotic and microbial genome browsers provide a graphical display of tRNA genes in the context of all other local genetic information. The database can be searched by primary sequence similarity, tRNA characteristics, or phylogenetic group. The database is publicly available at http://gtrnadb.ucsc.edu.

6.2 Introduction

Transfer RNA (tRNA) genes play an essential role in protein translation in all living cells. Among the numerous tRNA search programs created in the last ten years, tRNAscan-SE (Lowe and Eddy 1997) remains a popular standard for whole-genome annotation of tRNA genes. This Perl program uses the original tRNAscan program (Fichant and Burks 1991) and a linear sequence signal detection algorithm by Pavesi and colleagues (Pavesi, Conterio et al. 1994) as pre-filters to obtain an initial list of tRNA candidates. The program then passes these candidates to a highly sensitive and selective covariance model search program (Eddy and Durbin 1994) to obtain a final set of gene predictions that represent 99-100% of true tRNAs with a false positive rate of fewer than 1 per 15 gigabases (Lowe and Eddy 1997).

To catalog the increasing number of predicted tRNA genes found in complete genomes, we developed the Genomic tRNA Database (GtRNAdb) as a repository for all identifications made by tRNAscan-SE. This database has been in regular use by the community for over seven years, but was never formally described. Recently, we updated the interface, content, and search capabilities, justifying a new report of this improved resource. As before, the database provides summary statistics of predicted tRNA genes and the number of isotypes detected in each genome. Researchers can view tRNA genes by retrieving primary sequences, secondary structure information, and isotype alignments. Alternatively, tRNA genes can now be viewed within the eukaryotic-specific UCSC Genome Browser (Karolchik, Kuhn et al. 2008) or similar microbial genome browsers (Schneider, Pollard et al. 2006). In addition, a new database search page and BLAST (Altschul, Gish et al. 1990) server enable similarity studies of tRNA genes across species. To date, GtRNAdb contains 91,054 predicted tRNA genes derived from 43 eukaryotes, 86 archaea, and 630 bacteria. Together with tRNAscan-SE (Lowe and Eddy 1997), this public database provides an important information resource to the transfer RNA and genomics research communities.

6.3 Database Features

6.3.1 tRNA identification information

Transfer RNAs from individual species can be selected from a full organism list on the GtRNAdb front page. Researchers can study the summary statistics of tRNA gene predictions from each genome, including the number of tRNAs with introns and the distribution of tRNAs belonging to each isotype. tRNA isotypes are grouped by “two-box”, “four-box”, or “six-box” codon families, with highlighting colors to indicate potentially missing tRNAs (Figure 6.1). Users can study the frequency of tRNA genes in relationship to the codon usage, which is computed using protein gene annotations in NCBI RefSeq (Pruitt, Tatusova et al. 2007) for all prokaryotes and fungi, or obtained from the Codon Usage Database (Nakamura, Gojobori et al. 2000) for other eukaryotes. The GtRNAdb provides two viewing modes for gene lists: organized by isotype, or by genome locus. Both views include tRNA gene and intron positions relative to the source chromosome (or plasmid); upstream and downstream sequence flanking the tRNA genes; and covariance model search scores that are broken down by contribution from primary sequence patterns versus secondary structures (this breakdown enables identification of some types of tRNA pseudogenes). If the eukaryotic or microbial genomes are available in external genome browsers (Schneider, Pollard et al. 2006; Karolchik, Kuhn et al. 2008), users can follow the provided links to study each tRNA within the context of neighboring genes. tRNA gene information can also be displayed and saved as plain text in the standard tRNAscan-SE output file format. In addition, researchers can download the tRNA sequences for each species in FASTA format, or as part of a full set for each phylogenetic domain.
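As an illustration of how the plain-text gene lists might be summarized after download, the following Python sketch parses rows in a tRNAscan-SE-style tabular layout and counts genes by isotype. The nine-column layout (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score) and the sample rows are assumptions made for this example, not data taken from the database.

```python
from collections import Counter

# Hypothetical sample in an assumed nine-column tRNAscan-SE-style layout.
SAMPLE = """\
Sequence        tRNA    Bounds  tRNA    Anti    Intron Bounds   Cove
Name    tRNA #  Begin   End     Type    Codon   Begin   End     Score
--------        ------  ----    ------  ----    -----   -----   ----    ------
chr1    1       1000    1072    Ala     GGC     0       0       78.10
chr1    2       5200    5102    Lys     CTT     5163    5140    61.40
chr1    3       9000    9075    Ala     GGC     0       0       77.95
"""


def parse_trnascan(text):
    """Parse whitespace-delimited rows into a list of dictionaries."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine fields and a numeric tRNA index in column 2;
        # header and separator lines fail one of these checks.
        if len(fields) != 9 or not fields[1].isdigit():
            continue
        name, _, begin, end, isotype, anticodon, i_begin, _i_end, score = fields
        records.append({
            "seq": name,
            "begin": int(begin),
            "end": int(end),
            # Assumption: begin > end marks a gene on the minus strand.
            "strand": "+" if int(begin) <= int(end) else "-",
            "isotype": isotype,
            "anticodon": anticodon,
            # Assumption: intron bounds of 0 mean that no intron was found.
            "has_intron": int(i_begin) != 0,
            "score": float(score),
        })
    return records


records = parse_trnascan(SAMPLE)
by_isotype = Counter(r["isotype"] for r in records)
with_introns = sum(r["has_intron"] for r in records)
```

The resulting `by_isotype` counter and `with_introns` total correspond to the kind of per-genome summary statistics described above.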

6.3.2 tRNA secondary structures and alignments

Although all mature non-organellar tRNAs form a general cloverleaf secondary structure, variations in the length of stem-loops exist. tRNAscan-SE (Lowe and Eddy 1997) provides highly accurate secondary structure predictions via covariance model analysis (Eddy and Durbin 1994) for each tRNA. These secondary structures can be viewed within GtRNAdb in linear string representations or as graphical two-dimensional images (Figure 6.2). To enable critical evaluation of lower-scoring tRNA identifications, the database also provides multiple sequence alignments across all tRNAs of the same isotype within a species. These structural alignments are constructed via alignment to domain-specific tRNA covariance models (Eddy and Durbin 1994). Each stem-loop in the alignments is color-coded (similar to alignments found in Rfam 8.1) for easy viewing (Figure 6.3). For comparison to older reference tRNA sequences, multiple alignments also include aligned entries from the original Sprinzl tRNA database (Sprinzl and Vassilenko 2005), when present from the same species and isotype.
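The linear string representation of a secondary structure can be converted into explicit base pairs with a simple stack. The sketch below assumes the convention in which `>` marks the 5′ partner of a base pair, `<` marks the 3′ partner, and `.` marks an unpaired base; the 73-nt example string is a hypothetical cloverleaf written for this illustration, not a database entry.

```python
def pairs_from_structure(struct):
    """Return sorted 0-based (i, j) base-pair indices from a '>'/'<'/'.' string."""
    stack, pairs = [], []
    for j, symbol in enumerate(struct):
        if symbol == ">":
            stack.append(j)          # remember the open 5' partner
        elif symbol == "<":
            if not stack:
                raise ValueError(f"unmatched '<' at position {j}")
            pairs.append((stack.pop(), j))  # close the most recent open pair
    if stack:
        raise ValueError(f"unmatched '>' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical cloverleaf: acceptor stem, D arm, anticodon arm,
# variable loop, T arm, 3' side of the acceptor stem, discriminator base.
EXAMPLE = (
    ">>>>>>>" ".."
    ">>>>" "........" "<<<<" "."
    ">>>>>" "......." "<<<<<"
    "....."
    ">>>>>" "......." "<<<<<"
    "<<<<<<<" "."
)

example_pairs = pairs_from_structure(EXAMPLE)
```

Each stem in the cloverleaf appears as a run of nested pairs, which is the information the color-coded alignments convey graphically.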

6.3.3 tRNA search and BLAST server

One of the goals in developing the GtRNAdb is to provide a tool for comparative analysis across multiple genomes. The search capabilities allow researchers to query the database with criteria including phylogenetic domain and clade, partial species name, chromosome or scaffold name, any combination of amino acids and anticodons, nucleotide identity at the -1 upstream and +1 downstream positions, number of introns, and the existence of a genome-encoded terminal CCA sequence. In addition to being viewed within the web browser interface, search results, including gene annotations and sequences, can be downloaded for further analysis. Researchers can use this search functionality to address various biological questions. For example, “which eukaryotes have predicted selenocysteine tRNAs in their genomes?” By selecting the domain “Eukarya” and amino acid “selenocysteine”, we find that there are 87 total selenocysteine tRNA predictions across 25 genomes such as human, mouse, horse, fruit fly, and the model legume Medicago truncatula.

Although genome-encoded anticodons starting with guanosine (G) or adenosine (A) are commonly used to decode codons ending with cytosine (C) or uridine (U), tRNAs with anticodons starting with A were not found in complete archaeal genomes (Grosjean, Marck et al. 2007). To search for possible exceptions, we selected the domain Archaea and all anticodons starting with A as the search criteria. The result shows that Ferroplasma acidarmanus includes a tRNA for leucine with anticodon AAG. Considering (a) the relatively low covariance model score of 45.65 bits as compared to the other tRNAs identified in the same genome, and (b) the absence of an expected leucine tRNA with anticodon GAG, this “flags” either a potential sequencing error, or a target for further study in terms of post-transcriptional modification or RNA editing.
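The relationship between the two leucine anticodons in this example can be made concrete with a short sketch. The function below returns only the strict Watson-Crick codon for a genome-encoded (DNA) anticodon; wobble pairing and base modification, which are central to the biological question, are deliberately not modeled, and the record list is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def watson_crick_codon(anticodon):
    """Reverse complement of a DNA anticodon, i.e. its Watson-Crick codon."""
    return anticodon.upper().translate(_COMPLEMENT)[::-1]


def anticodons_starting_with_a(records):
    """Select records whose anticodon begins with adenosine."""
    return [r for r in records if r["anticodon"].upper().startswith("A")]


# Hypothetical records; only the anticodons AAG and GAG come from the text.
example_records = [
    {"isotype": "Leu", "anticodon": "AAG"},
    {"isotype": "Leu", "anticodon": "GAG"},
    {"isotype": "Ala", "anticodon": "GGC"},
]

flagged = anticodons_starting_with_a(example_records)
```

Under these assumptions, anticodon AAG pairs with codon CTT and anticodon GAG with codon CTC, both of which encode leucine in the standard genetic code.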

To search any given sequence directly against the tRNAs in the database, the tRNA BLAST server can be used. Options include searching for tRNA matches in all species, or only in one of the three domains of life. Standard BLAST options including expect value threshold and word size can be set for each query (Altschul, Gish et al. 1990). Users can also enter advanced BLAST options in a free-text window. Pair-wise alignments are listed upon the completion of the search. If tRNA matches occur in genomes available in the external UCSC genome browsers (Schneider, Pollard et al. 2006; Karolchik, Kuhn et al. 2008), users can view tRNA hits within genomic context by clicking on the provided links.

6.3.4 Error and request tracking

In order to document tRNA gene predictions in a rapidly expanding list of completed genomes, most annotations in the database are automated without experimental verification or inspection against published literature. We acknowledge there are exceptions to general anticodon-based isotype identification rules and other occasional errors due to post-transcriptional anticodon modification, unrecognized pseudogenes, some classes of short interspersed nuclear elements (SINEs), and other tRNA-derived sequences. In some cases, tRNA introns are also misidentified by automated searches (e.g., noncanonical introns found in many crenarchaeal species), which can cause incorrect determination of the anticodon and tRNA type. We have manually examined and corrected some of these errors (including crenarchaeal noncanonical introns, trans-spliced split tRNAs recently identified in Nanoarchaeum equitans (Randau, Munch et al. 2005; Randau, Pearson et al. 2005) and Caldivirga maquilingensis (Fujishima, Sugahara et al. 2009), and some tRNA-derived SINEs), yet we continue to search for new cases of obvious tRNA misidentification. We sincerely encourage feedback on any unaddressed discrepancies by submitting a report through our bug and request tracking system. We also welcome ideas for new features within the database, and often accept special requests for manually reviewed tRNA analyses from the user community. Users can monitor the progress of their requests and search through the development of other reports in the system.

6.4 Future Directions

Due to the design of a static web interface, the capability of data searching across genomes is currently limited. We plan to expand the database features by providing functionality to execute queries with more criteria such as ecotype of organisms, or allowing specification of sequence patterns at multiple positions within the tRNAs. Genes found via searches will be dynamically aligned with secondary structure information for comparative studies. Users will be able to download gene information in various file formats, including the BED format developed for the UCSC Genome Browser (Karolchik, Kuhn et al. 2008), and the Stockholm format used in Pfam (Finn, Tate et al. 2008) and Rfam (Gardner, Daub et al. 2009) for multiple sequence and secondary structure alignments. We will also continue to update the database with new tRNA identifications as additional genomes are made available. Although the GtRNAdb generally focuses on collections of tRNAs from complete genomes, we encourage members of the research community to request analyses of draft or incomplete genomes.
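To indicate what a BED export of a tRNA gene could look like, the sketch below converts a 1-based, inclusive coordinate pair into a six-column BED line with a 0-based start. It is an illustration only and not the planned implementation; the convention that begin exceeds end on the minus strand, the gene name, and the coordinates are assumptions for this example.

```python
def to_bed(chrom, begin, end, name, score=0):
    """Convert 1-based inclusive coordinates into a tab-delimited BED6 line.

    Assumption: begin > end denotes a gene on the minus strand.
    """
    strand = "+" if begin <= end else "-"
    start, stop = min(begin, end), max(begin, end)
    # BED uses a 0-based start and an exclusive end coordinate.
    fields = [chrom, str(start - 1), str(stop), name, str(score), strand]
    return "\t".join(fields)


# Hypothetical minus-strand gene on a hypothetical chromosome name.
bed_line = to_bed("chr", 478213, 478150, "tRNA-Leu-TAA")
```

The score column is left at 0 here; how a covariance model score would be mapped onto the BED score range is a design decision that this sketch does not address.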

6.5 Funding

Funding for open access charges: A gift from Hewlett-Packard via the UC Santa Cruz Center for Information Technology Research in the Interest of Society (CITRIS).


Figure 6.1 tRNA summary statistics with codon usage for Escherichia coli K12 Number of total tRNA genes and genes by isotypes and anticodons were provided by tRNAscan-SE (Lowe and Eddy 1997) identification results. Protein-coding genes annotated in RefSeq (Pruitt, Tatusova et al. 2007) were used to compute codon usage of the genome. Side menus include links to detailed information for tRNA genes and external databases for gene analysis.


Figure 6.2 Secondary structure prediction of tRNAGlu(CUC) in chromosome III of Caenorhabditis elegans A. Linear string representation of secondary structure prediction generated within tRNAscan-SE by COVE (Eddy and Durbin 1994). B. Graphic representation of secondary structure prediction rendered by NAVIEW (Bruccoleri and Heinrich 1998).


Figure 6.3 Multiple sequence alignments of tRNAPhe(GAA) in Homo sapiens Sequence alignments are grouped by identical secondary structures with the linear string representation listed on top of each block. Each color in the alignments codes for the base pairing of each stem loop in the secondary structure. The tRNA genes marked as “pseudo” were identified as pseudogenes. The last tRNA RF9990_GAA_HUMAN_PLACENTA was retrieved from Sprinzl tRNA database (Sprinzl and Vassilenko 2005).

Chapter 7

Characterization of a crenarchaeal-rich metagenome through RNase P and tRNAs

7.1 Introduction

Many microbiology research studies have focused on model organisms cultured in a laboratory. Under controlled growth conditions, scientists can objectively analyze and compare experimental results that reveal biological significance. However, in a natural environment where multiple microbial species and strains live together, growth responses may differ significantly. Cells may evolve at a different rate and metabolic pathways may change based on variable growth conditions. In addition, many microorganisms cannot be cultured in a laboratory. It was estimated that only 0.1% to 1% of soil bacteria can grow on media under standard conditions (Handelsman 2004). The "great plate count anomaly" (Staley and Konopka 1985), which refers to the discrepancy between the populations cultured on plates and those observed under a microscope, further indicates how poorly cultured microorganisms represent natural communities.

Recognizing the diversity of microorganisms in nature, scientists have been developing methods to study microbes in an uncultured environment. Pace and colleagues used rRNA sequences obtained directly from environmental samples for phylogenetic analysis of the uncultured organisms (Stahl, Lane et al. 1985; Pace 1997). The development of PCR technology further enabled the use of rRNAs as a phylogenetic marker for analyzing microbial communities (Giovannoni, Britschgi et al. 1990; Schmidt, DeLong et al. 1991). With the advancement of high-throughput sequencing technologies, DNA extracted from environmental samples can be sequenced and assembled as metagenomes. Researchers can better characterize the microbial communities by using genes with different functionalities in addition to rRNAs (McDaniel, Breitbart et al. 2008; Inskeep, Rusch et al. 2010; Qin, Li et al. 2010). To support this emerging field of metagenomics, an increasing number of bioinformatics tools and data repositories such as CAMERA (Sun, Chen et al. 2010) and the metagenomics RAST server (Glass, Wilkening et al. 2010) have been developed. The knowledge gained from metagenomic studies can be used to strengthen and deepen basic microbiology research.

Yellowstone National Park has one of the largest calderas in the world. The diverse geothermal systems, composed of over ten thousand thermal features and three hundred geysers, attract countless researchers seeking to uncover the unique and potentially novel organisms in Bacteria, Archaea, and Eukarya that may be hosted in these environments. In particular, phylogenetic analyses using 16S rRNA sequences and predicted protein similarity searches from the assembled metagenome reveal that Cistern Spring, a hot spring located at the back basin thermal area of Norris Geyser Basin with an average temperature between 80°C and 90°C and an average pH between 4.7 and 7.1 (Inskeep and Young), is dominated by crenarchaea (Markowitz, Ivanova et al. 2008). This suggests that the shortened form of archaeal RNase P RNA (type T) (Lai, Chan et al. 2010) and disrupted tRNAs that contain multiple introns (Sugahara, Kikuta et al. 2008; Chan and Lowe 2009) or are split into fragments (Randau, Munch et al. 2005; Fujishima, Sugahara et al. 2009) may exist in the organisms living in this environment. By applying the knowledge developed in previous studies, I identified three type A and three type T archaeal RNase P RNAs in the Cistern Spring metagenome, which provides further evidence, in addition to the 16S rRNA analysis, that crenarchaeal genomes dominate the environmental samples. The identification of intron-bearing and split tRNAs verifies the existence of Thermoproteaceae in the metagenome. The discovery of a novel split tRNAMet(CAU) in Thermoproteus further reaffirms the unlimited opportunities for biological exploration in metagenomics.

7.2 Results and Discussion

7.2.1 At least six crenarchaeal species co-exist in Cistern Spring

To obtain a better understanding of the species distribution in the metagenome of Cistern Spring, and for comparison, I used a traditional approach: phylogenetic analysis based on 16S rRNA. By using sequence similarity searches against 16S rRNA sequences in completely sequenced archaeal genomes, I noticed a number of 16S rRNA fragments in the metagenome, ranging from about 100 bp to 1000 bp in length. They are mostly included as a single gene fragment in a small contig that might not be easily assembled with the others due to gaps in sequencing results. Since model crenarchaeal genomes only have a single copy of 16S rRNA, the number of genes found in the metagenome can be used to estimate the number of species in the sequenced samples. The six most complete 16S rRNAs (three full-length and three partial fragments) identified in the metagenome suggest that there might be two Pyrobaculum-related species, one Caldivirga species, one Vulcanisaeta-related species, one Acidilobus species, and one Sulfolobus-related species (Figure 7.1). These findings show consistency between the basic environmental conditions (temperature and pH) and the standard growing conditions of these species.

7.2.2 Search for RNase P RNAs in metagenome

By employing both the Rfam (Gardner, Daub et al. 2009) archaeal RNase P RNA covariance model and the type T RNase P RNA covariance model, I identified three type A and three type T RNase P RNA genes. Due to the lack of primary sequence conservation in these genes, structural alignments against the covariance models were used to evaluate their phylogenetic relationships (Figure 7.2). Similar to the 16S rRNA analysis results, the three type T genes are closely related to Pyrobaculum, Caldivirga, and Vulcanisaeta respectively, with the predicted secondary structures matching the corresponding genus-specific variants described in Chapter 4 of this work. While one of the type A RNase P RNA genes may be related to Desulfurococcaceae, no 16S rRNA related to this family was found. Instead, an extra Pyrobaculum-related 16S rRNA was identified. This could be caused by the incompleteness of the metagenome.

A search for the archaeal RNase P proteins in the Cistern Spring metagenome revealed that not all four conserved proteins (Pop5, Rpp30, Rpp21, and Rpp29) were detected for each species with a predicted RNase P RNA (Table 7.1). However, the identification of a Pop5 gene and an Rpp29 gene most similar to those of Staphylothermus hellenicus and Thermosphaera aggregans, respectively, further supports the existence of a Desulfurococcaceae species in the environmental samples. On the other hand, two copies of Pop5, Rpp30, and Rpp21 were found to be highly similar to the corresponding proteins in the same species. For example, two Pop5, two Rpp30, and one Rpp29 were found to be Pyrobaculum-like. Rpp21 was not identified in this case, as in the model organisms. With only one Pyrobaculum RNase P RNA variant detected, the extra proteins may associate with the missing RNA component that exists with the extra copy of Pyrobaculum-related 16S rRNA. However, Caldivirga and Acidilobus were found to be associated with two Pop5 and two Rpp30, and two Rpp21, respectively, while only one RNase P RNA gene was detected for each of these genera. This could be a result of another two missing RNA subunits in the metagenome assembly. Alternatively, the holoenzyme may function in a form that has not been observed in an isolated culture, such as one that includes two homologs of the same protein.

7.2.3 Majority of tRNAs in Cistern Spring have introns

With the improved version of tRNAscan-SE (Lowe and Eddy 1997), I identified 253 predicted tRNA genes in the metagenome of Cistern Spring, with a total of 185 introns (Table 7.2). Six tRNAs were found to have three introns while 34 of them have two introns. Only 45% of the predicted tRNAs do not carry any introns.
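The counts reported above are mutually consistent, as the following short calculation shows. It assumes that no predicted tRNA carries more than three introns, which the text does not state explicitly.

```python
total_trnas = 253
total_introns = 185
three_intron_trnas = 6
two_intron_trnas = 34

# Introns not accounted for by multi-intron tRNAs must sit in
# single-intron tRNAs (assuming a maximum of three introns per tRNA).
one_intron_trnas = total_introns - 3 * three_intron_trnas - 2 * two_intron_trnas

trnas_with_introns = three_intron_trnas + two_intron_trnas + one_intron_trnas
trnas_without_introns = total_trnas - trnas_with_introns
percent_without = 100 * trnas_without_introns / total_trnas

print(one_intron_trnas, trnas_with_introns, trnas_without_introns)
print(f"{percent_without:.1f}% of predicted tRNAs carry no intron")
```

Under this assumption, 99 tRNAs carry a single intron, 139 carry at least one, and 114 (about 45%) carry none, in agreement with the stated percentage.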

As with 16S rRNA genes, archaeal genomes mostly contain single copies of tRNA genes. A typical crenarchaeal genome includes 46 tRNA genes to decode 62 codons (Grosjean, Marck et al. 2007). Multiple tRNA fragments upstream and downstream of the contig or scaffold boundaries were found and filtered by the gene finding software. This might explain the inconsistency in the number of genes identified for each tRNA isotype (Table 7.2). Surprisingly, I observed that four copies (the highest number of copies in the tRNA gene set) of tRNAAla(GGC) have the same sequence and are highly conserved (2-nt difference) with Desulfurococcus, Staphylothermus, and Thermosphaera in the family Desulfurococcaceae (Figure 7.3). They could belong to multiple closely related Desulfurococcaceae species, or to a single genome, which would be consistent with the findings of the RNase P study but rare in model crenarchaeal organisms.

Mature tRNA sequences are highly conserved across hyperthermophilic archaea such as crenarchaea. Structural RNAs in organisms with high growth temperatures have higher G/C content than the average genome to maintain their stability against thermal denaturation (Galtier and Lobry 1997; Grogan 1998). A typical tRNA sequence in a hyperthermophilic archaeal genome has an average of about 70% Gs and Cs. Therefore, using mature tRNA sequences for species characterization in crenarchaeal-rich metagenomes is not very effective. However, when looking more closely at the predicted tRNAs with multiple introns, I noticed that one of the predicted tRNALys(CUU) probably belongs to a Pyrobaculum or Thermoproteus species. This tRNA gene has one canonical (position 37/38) intron, and two noncanonical introns at positions 30/31 and 59/60 respectively (Figure 7.4A). Reviewing the known intron positions for tRNALys(CUU) in complete crenarchaeal genomes showed that Acidilobus saccharovorans and 4 out of 8 Desulfurococcales species have a noncanonical intron, but consistently located at position 45/46 (Figure 7.4A). Only canonical introns were found for this tRNA isotype in Sulfolobales. Interestingly, three Thermoproteales genomes have a noncanonical intron located at the same position 30/31. Two of these genomes, Pyrobaculum islandicum and Thermoproteus tenax, additionally have a canonical intron. Sequence alignments revealed a 100% identity between the mature tRNALys(CUU) in Cistern Spring and five Pyrobaculum genomes in addition to Thermoproteus tenax (Figure 7.4B). Although the first noncanonical intron only has 58% identity among Cistern Spring, P. islandicum, and T. tenax, the canonical intron is highly conserved between these genomes, with 6-nt and 3-nt differences compared to P. islandicum and T. tenax respectively. This suggests that tRNALys(CUU) in Cistern Spring may be more closely related to T. tenax.
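Two quantities used in this section, G/C content and percent identity between aligned sequences, can be computed as sketched below. The definition of identity used here (matching columns divided by alignment length, with gap columns counted as mismatches) is an assumption for illustration; the text does not state which definition was used, and the example sequences are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical, non-gap characters."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(
        a == b and a != "-"
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


# Hypothetical sequences, not taken from the metagenome.
gc_example = gc_fraction("GGCCAT")
identity_example = percent_identity("ACGT-A", "ACGA-A")
```

Because highly similar mature sequences yield identities near 100% across genera, the more variable intron sequences give better resolution, which is the point made in the paragraph above.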

7.2.4 Trans-spliced split tRNAs in Caldivirga

Caldivirga maquilingensis, also a member of Thermoproteaceae, does not have as many tRNA introns as its closest relative Pyrobaculum, but it carries six trans-spliced split tRNAs, two of them being tri-split (Fujishima, Sugahara et al. 2009). The existence of a Caldivirga species in the Cistern Spring metagenome revealed by the 16S and RNase P studies increases the possibility that split tRNAs may also be found. With the improved version of tRNAscan-SE (Lowe and Eddy 1997) and sequence similarity searches, I uncovered four Caldivirga split tRNAs in the metagenome, namely tRNAGlu(UUC), tRNAGly(CCC), and the tri-split tRNAGly(GCC) and tRNAGly(UCC). The sequence of the mature tRNAGlu(UUC) and the position (25/26) where the two halves join are identical to those in Caldivirga maquilingensis. However, the complementary regions downstream of the 5′ half and upstream of the 3′ half vary by 4 nt, making the complementary stem at the splicing region 3 bp shorter (Figure 7.5A). The two halves in the metagenome are located in separate scaffolds with their own strong promoters (better than 40% of identified tRNA promoters in C. maquilingensis) at the upstream regions. The neighboring genes upstream and downstream of the 5′ half, although shown to be of Caldivirga origin, differ from the ones in C. maquilingensis. In contrast, the upstream and downstream genes of the 3′ half retain complete synteny with C. maquilingensis.

Similar to the three split tRNAGly in C. maquilingensis, the first fragment of the tri-split genes and the third fragment of the three split genes only have a single copy in the metagenome, suggesting the sharing of the two fragments among the three genes. There is a one-nucleotide substitution from U to C in the first fragment of the mature tri-split tRNAGly and the mature split tRNAGly(CCC) in comparison to C. maquilingensis. As with tRNAGlu(UUC), the complementary regions around the fragments of tRNAGly in Cistern Spring differ from those in C. maquilingensis by a low percentage (Figure 7.5B). All the fragments are located on separate scaffolds with their own promoters, and no breaks in synteny between the Cistern Spring metagenome and C. maquilingensis were observed around any of the fragments. The finding of these four split tRNAs further confirms the existence of Caldivirga in the Cistern Spring samples. Due to the variation in the splicing regions of the split tRNAs, the species in Cistern Spring is unlikely to be C. maquilingensis.

7.2.5 Novel intron-bearing split tRNA in Thermoproteus

While reviewing the tRNA predictions in the Cistern Spring metagenome, I noticed two candidates that had relatively low scores (21.0 and 30.4 bits respectively as compared to at least 50 bits in all crenarchaeal tRNAs) and would not fold into a typical cloverleaf secondary structure. Closer analysis revealed that these two candidates combine together as the 5′ half and 3′ half of the trans-spliced split tRNAMet(CAU). The two halves of this novel split tRNA join at position 30/31 (Figure 7.6A), the same as the split tRNALys(CUU) in Nanoarchaeum equitans (Randau, Pearson et al. 2005) and Staphylothermus (Chapter 5 of this work). Sequence comparison displays only 1 nt difference between the mature split tRNAMet(CAU) sequence and the non-fragmented tRNA gene in Thermoproteus tenax, suggesting that this split tRNA belongs to a Thermoproteus species, a genus with no split tRNAs identified previously.

The uniqueness of this split tRNAMet(CAU) lies in the fact that both the 5′ and 3′ halves carry an intron, a characteristic previously observed only in circularized permuted tRNAs (Soma, Onodera et al. 2007; Maruyama, Sugahara et al. 2009). The noncanonical intron in the 5′ half is located at position 29/30, 1 nt upstream of the split site (Figure 7.6B). This is the same position where a noncanonical intron was observed in the linear tRNAMet(CAU) of T. tenax. In order to form a bulge-helix-bulge motif between the 5′ and 3′ halves at the exon-splicing junction, this noncanonical intron in the 5′ half has to be removed first. The 3′ half harbors a canonical intron at position 37/38 (Figure 7.6C), the same as the permuted tRNAs in Thermofilum pendens. The two halves are located only about 700 bp apart with their own promoters. Two homologs of a putative archaeal replication gene (paREP1) were found between the two tRNA transcripts, supporting the previous evidence that tRNAs are a favored region for viral insertions that may give rise to split tRNAs (She, Brugger et al. 2002; Krupovic and Bamford 2008) (Chapter 5 of this work).

7.3 Future Directions

The emerging field of metagenomics provides researchers opportunities to study and discover new biology. The species characterization in the Cistern Spring metagenome using RNase P introduces a new alternative marker in addition to the commonly used 16S rRNA and metabolic genes. Although the highly conserved tRNA sequences in hyperthermophilic archaea are not very effective in genome classification, the addition of tRNA introns makes it possible to determine the existence of a species in a metagenome. However, instead of studying each individual tRNA gene closely to obtain an answer that may not be consistent or reproducible, a more systematic and repeatable method should be applied. I therefore propose the generation of a tRNA feature profile for phylogenetic analysis. Similar to profiles used for protein alignments, a tRNA feature profile will contain feature information of an individual tRNA gene including but not limited to the tRNA isotype, anticodon, tRNA structural type, G/C content, stem and loop sizes, number of introns and their properties such as positions, lengths, bulge-helix-bulge motif specifications, and primary sequences. Each tRNA feature can be represented by a specific code. The profile can then be used for tRNA alignments and phylogenetic analysis in metagenomes.
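One possible shape for such a profile is sketched below. Because the profile is only proposed here, every name in the sketch (`Intron`, `TRNAFeatureProfile`, `encode`) and the encoding scheme itself are hypothetical, and only a subset of the listed features is included. The intron positions follow the tRNALys(CUU) example in Section 7.2.3, while the intron lengths and G/C value are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Intron:
    position: str      # insertion site in tRNA numbering, e.g. "37/38"
    length: int        # intron length in nucleotides
    canonical: bool    # True if located at the canonical 37/38 site


@dataclass
class TRNAFeatureProfile:
    isotype: str
    anticodon: str
    gc_percent: float
    introns: List[Intron] = field(default_factory=list)

    def encode(self):
        """Encode the features as a compact, comparable string (hypothetical scheme)."""
        intron_codes = ",".join(
            f"{'C' if i.canonical else 'N'}{i.position}:{i.length}"
            for i in self.introns
        ) or "-"
        gc_bin = int(self.gc_percent // 10)   # 10-percent bins, e.g. 70% -> 7
        return f"{self.isotype}|{self.anticodon}|GC{gc_bin}|{intron_codes}"


# Intron positions from Section 7.2.3; lengths and G/C value are invented.
profile = TRNAFeatureProfile(
    isotype="Lys",
    anticodon="CUU",
    gc_percent=70.0,
    introns=[
        Intron("30/31", 20, False),
        Intron("37/38", 16, True),
        Intron("59/60", 25, False),
    ],
)
code = profile.encode()
```

Encoded strings of this kind could be compared position by position, in the same way that characters are compared in a sequence alignment.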

The discovery of the novel trans-spliced split tRNAMet(CAU) in Thermoproteus that was not found in model organisms provides an example of opportunities in metagenomics. With the combination of the knowledge developed from standard biological studies and the increasing availability of sequenced metagenomes, I anticipate more unexpected findings will be identified in the near future.

7.4 Materials and Methods

Genomic data. Complete genomic sequences and annotated ORFs for all archaeal genomes were obtained from NCBI RefSeq (Pruitt, Tatusova et al. 2007). Metagenome assembly of Cistern Spring was provided by William Inskeep at Montana State University. Protein annotations of Cistern Spring metagenome were retrieved from Integrated Microbial Genomes with Microbiome Samples (Markowitz, Ivanova et al. 2008).

16S ribosomal RNA sequence searches in Cistern Spring. BLAT (Kent 2002) was used for sequence similarity searches of 16S rRNA sequences in available complete archaeal genomes against Cistern Spring metagenome. Hits with over 90% sequence identity and a length of over 600 bp were selected.
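The selection criteria above amount to a simple filter, sketched here in Python. The dictionary keys (`identity`, `length`) and the example hits are hypothetical; the sketch does not parse BLAT output, and how identity was derived from the alignments is not specified in the text.

```python
def select_16s_hits(hits, min_identity=90.0, min_length=600):
    """Keep hits with over min_identity percent identity and over min_length bp."""
    return [
        h for h in hits
        if h["identity"] > min_identity and h["length"] > min_length
    ]


# Hypothetical hits for illustration only.
example_hits = [
    {"contig": "contig_a", "identity": 97.2, "length": 1450},
    {"contig": "contig_b", "identity": 93.0, "length": 350},   # too short
    {"contig": "contig_c", "identity": 88.5, "length": 900},   # identity too low
]

selected = select_16s_hits(example_hits)
```

Both thresholds are applied as strict inequalities to match the wording "over 90%" and "over 600 bp".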

RNase P RNA sequence searches in Cistern Spring. Infernal v1.0 (Nawrocki, Kolbe et al. 2009) was used to search for RNase P RNA candidates in Cistern Spring metagenome using the type T RNase P RNA covariance model and the Rfam (Gardner, Daub et al. 2009) archaeal RNase P RNA covariance model (RF00373). The program was initially run in the global search mode. All hits with a score > 0 bits were manually examined. Local search mode was also employed, which provided better sensitivity but decreased selectivity.

RNase P protein database searches in Cistern Spring. Protein sequences of Pop5, Rpp30, Rpp29, and Rpp21 for Cistern Spring metagenome were retrieved from Pfam (Finn, Mistry et al. 2010) domain searches (RNase_P_Rpp14 [Pop5]: PF01900; RNase_P_p30 [Rpp30]: PF01876; UPF0086 [Rpp29]: PF01868; and Rpr2 [Rpp21]: PF04032), and PSI-BLAST similarity searches against RNase P protein homologs in Thermoproteaceae and NCBI non-redundant database. Default scoring thresholds for PSI-BLAST (E-value: 10; word size: 3) and Pfam (trusted cutoff for Pop5: 23.4 bits; Rpp30: 20.3 bits; Rpp29: 21.1 bits; Rpp21: 23.2 bits) searches were initially adopted. Thresholds were further adjusted (E-value: 100 and word size: 2 for PSI-BLAST; trusted cutoff as -80 bits for Pfam) to search for proteins not identified with the default scan.

Phylogenetic analysis of Cistern Spring metagenome based on 16S rRNA.

Phylogenetic tree was generated based on 16S ribosomal RNA. Sequences were aligned using ClustalW (Larkin, Blackshields et al. 2007). Alignments were manually adjusted using Jalview (Waterhouse, Procter et al. 2009) to remove introns. Maximum likelihood tree was computed using PhyML (Guindon, Dufayard et al. 2010) with general time-reversible model of sequence evolution. Numbers at nodes represent non-parametric bootstrap values computed by PhyML (Guindon, Dufayard et al. 2010) with 1,000 replications of the original dataset.

Phylogenetic analysis of Cistern Spring metagenome based on RNase P RNA.

Phylogenetic tree was generated based on RNase P RNA. Sequences were structurally aligned against Rfam (Gardner, Daub et al. 2009) archaeal RNase P RNA covariance model (RF00373) and type T archaeal RNase P RNA covariance model separately using Infernal v1.0 (Nawrocki, Kolbe et al. 2009). The two sets of structural alignments were manually merged together. Maximum likelihood tree was computed using PhyML (Guindon, Dufayard et al. 2010) with general time-reversible model of sequence evolution. Numbers at nodes represent non-parametric bootstrap values computed by PhyML (Guindon, Dufayard et al. 2010) with 1,000 replications of the original dataset.

tRNA gene predictions in Cistern Spring metagenome. tRNAs were predicted using tRNAscan-SE (Lowe and Eddy 1997). Archaeal tRNA-specific and BHB motif covariance models were created for similarity searching using Infernal v1.0 (Nawrocki, Kolbe et al. 2009) after pre-filtering possible candidates with tRNAscan (Fichant and Burks 1991) and an A/B box motif detection algorithm (Pavesi, Conterio et al. 1994). The default cutoff score was set to 20 bits. BLAT (Kent 2002) was used to verify tRNA candidates against predicted archaeal tRNA sequences available in the Genomic tRNA Database (Chan and Lowe 2009).

Promoter identification. To generate a training set for promoter identification, potential operons were predicted genome-wide, requiring an intergenic separation of at least 100 nt (on the same strand). A 16-mer motif search of the 90 nt upstream of known genes (those not annotated as putative or hypothetical) was conducted using MEME (Bailey and Elkan 1994) to identify the consensus promoter, including the transcription factor B recognition element (BRE – 1 to 3 adenosines) plus the TATA box. A position-specific scoring matrix (PSSM) was generated from the alignments of the MEME results after manual inspection. Each organism’s PSSM was used to scan the 150-bp upstream region of all non-coding and protein-coding genes to identify potential promoter regions. Ten virtual genomes for each target genome were generated using a fifth-order Markov chain to retain the base frequency of the target genome, and scanned to identify the score distribution of false positives. The promoter candidates identified were filtered according to expected position (Slupska, King et al. 2001) and a threshold p-value equivalent to that of the lowest-scoring known gene.
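The scan-and-calibrate procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the log-odds PSSM with a pseudocount, the restart rule for unseen contexts, the non-overlapping 150-nt null windows, the toy motif sites, and all function names are hypothetical choices.

```python
import math
import random
from collections import defaultdict

BASES = "ACGT"


def simulate_genome(genome, order=5, rng=None):
    """Generate a virtual genome from a Markov chain trained on the real one."""
    rng = rng or random.Random(0)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - order):
        counts[genome[i:i + order]][genome[i + order]] += 1
    contexts = list(counts)
    seq = list(genome[:order])
    while len(seq) < len(genome):
        successors = counts.get("".join(seq[-order:]))
        if not successors:
            # Context with no observed successor: restart from a random observed context.
            seq.extend(rng.choice(contexts))
            continue
        bases, weights = zip(*successors.items())
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:len(genome)])


def build_pssm(sites, background, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length aligned motif sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def best_hit(pssm, region):
    """Return (score, offset) of the best-scoring window in an upstream region."""
    width = len(pssm)
    return max(
        (sum(pssm[j][region[i + j]] for j in range(width)), i)
        for i in range(len(region) - width + 1)
    )


def empirical_p_value(score, null_scores):
    """Fraction of null (virtual-genome) scores at least as high as the observed score."""
    return sum(s >= score for s in null_scores) / len(null_scores)


# Toy data: three hypothetical motif sites and a random 600-nt "genome".
background = {b: 0.25 for b in BASES}
pssm = build_pssm(["TTTATA", "TTTAAA", "TTTATA"], background)
score, offset = best_hit(pssm, "GGGGTTTATAGGGG")

toy_genome = "".join(random.Random(7).choices(BASES, k=600))
virtual = [simulate_genome(toy_genome, rng=random.Random(k)) for k in range(10)]
null_scores = [
    best_hit(pssm, g[i:i + 150])[0]
    for g in virtual
    for i in range(0, len(g) - 150 + 1, 150)
]
p_value = empirical_p_value(score, null_scores)
```

In practice the p-value threshold would be set from the lowest-scoring known gene, as stated above, and candidates would additionally be filtered by their position relative to the gene start.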

Figure 7.1 Phylogenetic relationships of Cistern Spring samples and crenarchaea based on 16S rRNA. A maximum likelihood tree was generated using multiple alignments of 16S rRNA sequences (see Methods). Numbers at nodes represent non-parametric bootstrap values from 1,000 replicates of the original dataset. Gray boxes highlight the 16S rRNAs identified in the Cistern Spring metagenome.

Figure 7.2 Phylogenetic relationships of Cistern Spring samples and crenarchaea based on RNase P RNA (RPR). A maximum likelihood tree was generated using multiple structural alignments of RNase P RNA sequences (see Methods). Numbers at nodes represent non-parametric bootstrap values from 1,000 replicates of the original dataset. Gray boxes highlight the RPRs identified in the Cistern Spring metagenome. Magenta represents type T archaeal RPRs. Green represents type A archaeal RPRs.

Figure 7.3 Genomic sequence alignments of tRNAAla(GGC) in Cistern Spring with Desulfurococcaceae genomes. The first four sequences are identical copies of the predicted tRNAAla(GGC) in the Cistern Spring metagenome. The bottom four rows are sequences of the same tRNA isotype in Desulfurococcaceae. The line below the sequences is the linear string representation of the tRNA secondary structure. Note that only two nucleotides (excluding the 3′ CCA) differ between the Cistern Spring tRNAs and the Desulfurococcaceae tRNAs in the alignment.

Figure 7.4 tRNALys(CUU) in Cistern Spring metagenome in comparison with crenarchaeal homologs. A. The secondary structure displays the consensus sequence of tRNALys(CUU) in complete crenarchaeal genomes. Arrows point to positions where introns were found in the specified clade. Red arrows mark positions where introns were found in tRNALys(CUU) in the Cistern Spring metagenome. B. Multiple genomic sequence alignments of pre-tRNALys(CUU) in the Cistern Spring metagenome, Pyrobaculum, and Thermoproteus tenax. The line below the sequences is the linear string representation of the tRNA secondary structure. The black boxes highlight the intron sequences.

Figure 7.5 Predicted secondary structures and sequences of trans-spliced pre-tRNAGlu(UUC) and pre-tRNAGly in Caldivirga maquilingensis (CM) and Cistern Spring metagenome (CS). A. Mature tRNAGlu(UUC) is formed by joining the 5′ half and the 3′ half at position 25/26 after splicing at the bulge-helix-bulge (BHB) motif. Only sequences at the splicing regions vary between CM and CS. B. The two halves of tRNAGly(CCC) join at position 37/38. The three fragments of tRNAGly(GCC) and tRNAGly(UCC) join at positions 25/26 and 37/38. A one-nucleotide substitution was found in the mature tRNAs between CM and CS. Gray represents the splicing regions. Black arrows indicate positions of splicing. Anticodons are highlighted in red. Black boxes are the complementary regions between fragments. Yellow highlights the base substitutions.

Figure 7.6 Predicted secondary structure of trans-spliced pre-tRNAMet(CAU) in Cistern Spring metagenome. A. Mature tRNAMet(CAU) in Thermoproteus of the Cistern Spring metagenome is formed by joining the 5′ half and the 3′ half at position 30/31 after splicing at the bulge-helix-bulge (BHB) motif. Introns in the 5′ half and the 3′ half (represented by gray U-shaped lines) were predicted to be removed before the two halves are joined. The black circle highlights the base that differs from the linear tRNAMet(CAU) in Thermoproteus tenax. B. Predicted secondary structure of the 5′ half transcript with a BHB motif formed at the exon-intron junction. C. Predicted secondary structure of the 3′ half transcript with a relaxed BHB motif formed at the exon-intron junction. The 5′ halves of tRNA transcripts are shown in blue, the 3′ halves in orange. Black arrows indicate positions of splicing. Anticodons are boxed in light blue.

Table 7.1 Predicted RNase P proteins in Cistern Spring metagenome. RNase P proteins were predicted using Pfam (Finn, Mistry et al. 2010) domain searches and PSI-BLAST (Altschul, Madden et al. 1997) similarity searches. Genomes listed are those whose proteins gave the best-scoring PSI-BLAST matches to the protein candidates in Cistern Spring. The numbers in the table represent the numbers of RNase P protein candidates identified in the metagenome.

Family              Highest similarity           Pop5  Rpp30  Rpp29  Rpp21
Acidilobaceae       Acidilobus saccharovorans    1     0      0      2
Desulfurococcaceae  Staphylothermus hellenicus   1     0      0      0
Desulfurococcaceae  Thermosphaera aggregans      0     0      1      0
Sulfolobaceae       Sulfolobus acidocaldarius    1     1      0      0
Sulfolobaceae       Sulfolobus islandicus        0     0      0      1
Thermoproteaceae    Caldivirga maquilingensis    2     2      0      0
Thermoproteaceae    Pyrobaculum islandicum       2     2      1      0
Thermoproteaceae    Vulcanisaeta distributa      0     0      1      0

Table 7.2 Predicted tRNAs in Cistern Spring metagenome. tRNA genes were predicted using tRNAscan-SE (Lowe and Eddy 1997). Introns in predicted tRNAs were identified using the same software, a bulge-helix-bulge motif search with a covariance model in Infernal v1.0 (Nawrocki, Kolbe et al. 2009), and sequence alignments.

tRNA isotype  Anticodon  Number of genes  Total number of introns  Number of canonical introns  Number of noncanonical introns
Met           CAT        20               14                       8                            6
Thr           TGT        9                13                       5                            8
Leu           CAA        9                9                        2                            7
Pro           CGG        7                9                        5                            4
Thr           CGT        6                8                        5                            3
Tyr           GTA        5                8                        5                            3
Lys           TTT        8                7                        4                            3
Pro           GGG        6                7                        5                            2
Lys           CTT        5                5                        2                            3
Glu           CTC        7                5                        2                            3
Cys           GCA        5                5                        2                            3
Val           TAC        4                5                        2                            3
Gly           TCC        6                5                        3                            2
Ala           TGC        7                5                        3                            2
Ser           CGA        5                5                        4                            1
Ile           GAT        5                5                        4                            1
Ala           GGC        8                5                        4                            1
Arg           GCG        7                4                        0                            4
Ser           GGA        7                4                        1                            3
Trp           CCA        5                4                        2                            2
Ala           CGC        7                4                        2                            2
Leu           TAG        6                4                        2                            2
Val           CAC        6                4                        3                            1
Phe           GAA        5                4                        3                            1
Leu           CAG        5                3                        1                            2
Arg           CTT        6                3                        1                            2
Leu           GAG        4                3                        1                            2
Glu           TTC        6                3                        1                            2
Ser           GCT        4                3                        2                            1
Pro           TGG        7                3                        2                            1
Arg           TCT        3                3                        3                            0
Gly           CCC        4                2                        0                            2
Gln           TTG        5                2                        1                            1
Val           GAC        5                2                        2                            0
Thr           GGT        3                2                        2                            0
Leu           TAA        4                2                        2                            0
Gln           CTG        3                1                        1                            0
Gly           GCC        4                1                        1                            0
Asp           GTC        6                1                        1                            0
His           GTG        5                1                        1                            0
Asn           GTT        3                1                        1                            0
Ser           TGA        3                1                        1                            0
Arg           CCG        4                0                        0                            0
Arg           TCG        4                0                        0                            0

Chapter 8

Conclusions

RNase P and tRNAs are both ancient molecules that are essential in all domains of life. The discovery of RNase P with a catalytic RNA subunit (Stark, Kole et al. 1978; Guerrier-Takada, Gardiner et al. 1983) supports the “RNA World” hypothesis, a term coined by Walter Gilbert in 1986, in which genetic continuity depended on the replication of RNA and RNA was responsible for the full range of catalytic roles. Although Nanoarchaeum equitans was found not to have RNase P (Randau, Schroder et al. 2008), its dependency on a crenarchaeal host that requires RNase P for growth may explain this isolated case.

Over the years, researchers have documented the increasing complexity of the RNase P holoenzyme, from one RNA subunit and one protein in bacteria to one RNA subunit and at least nine proteins in eukaryotes, with archaea intermediate at one RNA subunit and at least four proteins (Hall and Brown 2001; Lai, Vioque et al. 2010). However, the evolution of RNase P across the three domains of life remains an open question. Interestingly, bacterial RNase P, which includes only one protein, has a larger RNA subunit (typically 350-400 nt) than the archaeal and eukaryotic enzymes (typically 300-350 nt). If this trend held, the shortened (type T) RNase P RNA of approximately 200 nt (Lai, Chan et al. 2010) (Chapter 3 of this work) should be compensated by the largest number of proteins. The absence of Rpp21, one of the four typical archaeal RNase P proteins, from all Thermoproteaceae species may suggest the opposite. One possible explanation is that the relatively small number of pre-tRNAs with 5ʹ leaders in this family leads to a different selective pressure. Whether other, as yet unrecognized, proteins or RNAs work together with this atypical RNase P remains to be determined, and the answer will in turn shed light on the active evolution of this ancient enzyme.

The identification of RNase P substrates other than tRNAs (Kazantsev and Pace 2006; Coughlin, Pleiss et al. 2008) raises the question of whether the type T RNase P RNA, which lacks most of the specificity domain, acts on a wider range of substrates. Although the relationship between disrupted tRNAs and the shortened form of RNase P RNA is still unknown, the recent splitting or lateral transfer events suggested by the discovery of the split tRNAs in Desulfurococcales (Chapter 5 of this work) illustrate an active process of tRNA evolution. The two halves of the split tRNAMet(CAU) in the Cistern Spring metagenome (Chapter 7 of this work), being separated by replicable genes, further support the hypothesis that a disrupted tRNA descends from a linear, non-intron-bearing tRNA and that the disruption may be introduced by viral or mobile element insertion. The mechanism of disruption, which always results in a bulge-helix-bulge motif at the exon-splicing junction, is yet to be determined.

On the strength of biochemical experiments and structural studies, the forms of ancient RNAs were considered better understood than those of fast-evolving genes. The discovery of the type T RNase P RNA and of permuted and recently split tRNAs in Archaea suggests that these ancient RNAs are actively evolving in response to environmental changes and selective pressures. Improved methods for gene finding and the increasing availability of sequenced genomes from model organisms and uncultured microbial communities provide new opportunities to reinvestigate the connections of these evolving ancient RNAs with other noncoding RNAs and proteins in the contemporary world.

Bibliography

Abelson, J., C. R. Trotta, et al. (1998). tRNA splicing. J Biol Chem 273: 12685-8.

Altschul, S. F., W. Gish, et al. (1990). Basic local alignment search tool. J Mol Biol 215: 403-10.

Altschul, S. F., T. L. Madden, et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-402.

Ambros, V. "MicroRNA cloning protocol." from http://146.189.76.171/lab/MicroRNAs/Ambros_microRNAcloning.htm.

Badger, J. H. and G. J. Olsen (1999). CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 16: 512-24.

Bailey, T. L. and C. Elkan (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28-36.

Baliga, N. S. and S. Dassarma (2000). Saturation mutagenesis of the haloarchaeal bop gene promoter: identification of DNA supercoiling sensitivity sites and absence of TFB recognition element and UAS enhancer activity. Mol Microbiol 36: 1175-83.

Baumann, P., S. A. Qureshi, et al. (1995). Transcription: new insights from studies on Archaea. Trends Genet 11: 279-83.

Bell, S. D. and S. P. Jackson (1998). Transcription and translation in Archaea: a mosaic of eukaryal and bacterial features. Trends Microbiol 6: 222-8.

Bell, S. D., C. Jaxel, et al. (1998). Temperature, template topology, and factor requirements of archaeal transcription. Proc Natl Acad Sci U S A 95: 15218-22.

Bell, S. D., P. L. Kosa, et al. (1999). Orientation of the transcription preinitiation complex in archaea. Proc Natl Acad Sci U S A 96: 13662-7.

Bell, S. D., C. P. Magill, et al. (2001). Basal and regulated transcription in Archaea. Biochem Soc Trans 29: 392-5.

Benelli, D. and P. Londei (2009). Begin at the beginning: evolution of translational initiation. Res Microbiol 160: 493-501.

Benelli, D., E. Maone, et al. (2003). Two different mechanisms for ribosome/mRNA interaction in archaeal translation initiation. Mol Microbiol 50: 635-43.

Besemer, J., A. Lomsadze, et al. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29: 2607-18.

Biniszkiewicz, D., E. Cesnaviciene, et al. (1994). Self-splicing group I intron in cyanobacterial initiator methionine tRNA: evidence for lateral transfer of introns in bacteria. EMBO J 13: 4629-35.

Borges, K. M., S. R. Brummet, A. Bogert, M. C. Davis, K. M. Hujer, S. T. Domke, J. Szasz, J. Ravel, J. DiRuggiero, C. Fuller, J. W. Chase and F. T. Robb (1996). A survey of the genome of the hyperthermophilic archaeon, Pyrococcus furiosus. Genome Science and Technology 1: 37-46.

Brenneis, M., O. Hering, et al. (2007). Experimental characterization of Cis-acting elements important for translation and transcription in halophilic archaea. PLoS Genet 3: e229.

Brown, J. W. (1999). The Ribonuclease P Database. Nucleic Acids Res 27: 314.

Bruccoleri, R. E. and G. Heinrich (1998). An improved algorithm for nucleic acid secondary structure display. Comput Appl Biosci 4: 167-73.

Calvin, K., M. D. Hall, et al. (2005). Structural characterization of the catalytic subunit of a novel RNA splicing endonuclease. J Mol Biol 353: 952-60.

Calvin, K. and H. Li (2008). RNA-splicing endonuclease structure and function. Cell Mol Life Sci 65: 1176-85.

Chan, P. P. and T. M. Lowe (2009). GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37: D93-7.

Chang, B., S. Halgamuge, et al. (2006). Analysis of SD sequences in completed microbial genomes: non-SD-led genes are as common as SD-led genes. Gene 373: 90-9.

Clouet d'Orval, B., M. L. Bortolin, et al. (2001). Box C/D RNA guides for the ribose methylation of archaeal tRNAs. The tRNATrp intron guides the formation of two ribose-methylated nucleosides in the mature tRNATrp. Nucleic Acids Res 29: 4518-29.

Cole, J. R., B. Chai, et al. (2003). The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res 31: 442-3.

Coughlin, D. J., J. A. Pleiss, et al. (2008). Genome-wide search for yeast RNase P substrates reveals role in maturation of intron-encoded box C/D small nucleolar RNAs. Proc Natl Acad Sci U S A 105: 12218-23.

Cozen, A. E., M. T. Weirauch, et al. (2009). Transcriptional map of respiratory versatility in the hyperthermophilic crenarchaeon Pyrobaculum aerophilum. J Bacteriol 191: 782-94.

Crooks, G. E., G. Hon, et al. (2004). WebLogo: a sequence logo generator. Genome Res 14: 1188-90.

Darr, S. C., B. Pace, et al. (1990). Characterization of ribonuclease P from the archaebacterium Sulfolobus solfataricus. J Biol Chem 265: 12927-32.

Davidsen, T., E. Beck, et al. (2010). The comprehensive microbial resource. Nucleic Acids Res 38: D340-5.

Delcher, A. L., K. A. Bratke, et al. (2007). Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23: 673-9.

Di Giulio, M. (2008). Permuted tRNA genes of Cyanidioschyzon merolae, the origin of the tRNA molecule and the root of the Eukarya domain. J Theor Biol 253: 587-92.

Di Giulio, M. (2008). The split genes of Nanoarchaeum equitans are an ancestral character. Gene 421: 20-6.

Di Giulio, M. (2009). A comparison among the models proposed to explain the origin of the tRNA molecule: A synthesis. J Mol Evol 69: 1-9.

Di Giulio, M. (2009). Formal Proof that the Split Genes of tRNAs of Nanoarchaeum equitans Are an Ancestral Character. J Mol Evol.

Eddy, S. R. and R. Durbin (1994). RNA sequence analysis using covariance models. Nucleic Acids Res 22: 2079-88.

Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-7.

Edwards, M. T., S. C. Rison, et al. (2005). A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res 33: 3253-62.

Ermolaeva, M. D., O. White, et al. (2001). Prediction of operons in microbial genomes. Nucleic Acids Res 29: 1216-21.

Fichant, G. A. and C. Burks (1991). Identifying potential tRNA genes in genomic DNA sequences. J Mol Biol 220: 659-71.

Finn, R. D., J. Mistry, et al. (2010). The Pfam protein families database. Nucleic Acids Res 38: D211-22.

Finn, R. D., J. Tate, et al. (2008). The Pfam protein families database. Nucleic Acids Res 36: D281-8.

Fujishima, K., J. Sugahara, et al. (2009). Tri-split tRNA is a transfer RNA made from 3 transcripts that provides insight into the evolution of fragmented tRNAs in archaea. Proc Natl Acad Sci U S A 106: 2683-7.

Galtier, N. and J. R. Lobry (1997). Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol 44: 632-6.

Gardner, P. P., J. Daub, et al. (2009). Rfam: updates to the RNA families database. Nucleic Acids Res 37: D136-40.

Giovannoni, S. J., T. B. Britschgi, et al. (1990). Genetic diversity in Sargasso Sea bacterioplankton. Nature 345: 60-3.

Glass, E. M., J. Wilkening, et al. (2010). Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc 2010: pdb prot5368.

Gobert, A., B. Gutmann, et al. (2010). A single Arabidopsis organellar protein has RNase P activity. Nat Struct Mol Biol 17: 740-4.

Grogan, D. W. (1998). Hyperthermophiles and the problem of DNA instability. Mol Microbiol 28: 1043-9.

Grosjean, H., C. Marck, et al. (2007). The various strategies of codon decoding in organisms of the three domains of life: evolutionary implications. Nucleic Acids Symp Ser (Oxf): 15-6.

Gruegelsiepe, H., D. K. Willkomm, et al. (2003). Antisense inhibition of Escherichia coli RNase P RNA: mechanistic aspects. Chembiochem 4: 1049-56.

Guerrier-Takada, C., K. Gardiner, et al. (1983). The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35: 849-57.

Guindon, S., J. F. Dufayard, et al. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59: 307-21.

Haas, E. S., D. W. Armbruster, et al. (1996). Comparative analysis of ribonuclease P RNA structure in Archaea. Nucleic Acids Res 24: 1252-9.

Hall, R. M. and C. M. Collis (1995). Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Mol Microbiol 15: 593-600.

Hall, T. A. and J. W. Brown (2001). The ribonuclease P family. Methods Enzymol 341: 56-77.

Hall, T. A. and J. W. Brown (2002). Archaeal RNase P has multiple protein subunits homologous to eukaryotic nuclear RNase P proteins. RNA 8: 296-306.

Hallam, S. J., K. T. Konstantinidis, et al. (2006). Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc Natl Acad Sci U S A 103: 18296-301.

Handelsman, J. (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669-85.

Hannon, G. (2006). "Cloning Small RNAs for Sequencing with 454 Technology." from http://www.454.com/downloads/hannon_smallRNA-cloning_protocol2.pdf.

Harris, J. K., E. S. Haas, et al. (2001). New insight into RNase P RNA structure from comparative analysis of the archaeal RNA. RNA 7: 220-32.

Harris, M. E., J. M. Nolan, et al. (1994). Use of photoaffinity crosslinking and molecular modeling to analyze the global architecture of ribonuclease P RNA. EMBO J 13: 3953-63.

Hartmann, E. and R. K. Hartmann (2003). The enigma of ribonuclease P evolution. Trends Genet 19: 561-9.

Heinemann, I. U., D. Soll, et al. (2009). Transfer RNA processing in archaea: Unusual pathways and enzymes. FEBS Lett.

Henikoff, J. G. and S. Henikoff (1996). Using substitution probabilities to improve position-specific scoring matrices. Comput Appl Biosci 12: 135-43.

Hering, O., M. Brenneis, et al. (2009). A novel mechanism for translation initiation operates in haloarchaea. Mol Microbiol 71: 1451-63.

Holzmann, J., P. Frank, et al. (2008). RNase P without RNA: identification and functional reconstitution of the human mitochondrial tRNA processing enzyme. Cell 135: 462-74.

Huber, H., M. J. Hohn, et al. (2002). A new phylum of Archaea represented by a nanosized hyperthermophilic symbiont. Nature 417: 63-7.

Hyatt, D., G. L. Chen, et al. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119.

Inskeep, W. P., D. B. Rusch, et al. (2010). Metagenomes from high-temperature chemotrophic systems reveal geochemical controls on microbial community structure and function. PLoS One 5: e9773.

Inskeep, W. P. and J. M. Young. "The YNP RCN Geothermal Features Database."

Karlin, S., J. Mrazek, et al. (2005). Predicted highly expressed genes in archaeal genomes. Proc Natl Acad Sci U S A 102: 7303-8.

Karolchik, D., R. M. Kuhn, et al. (2008). The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 36: D773-9.

Kaye, N. M., N. H. Zahler, et al. (2002). Conservation of helical structure contributes to functional metal ion interactions in the catalytic domain of ribonuclease P RNA. J Mol Biol 324: 429-42.

Kazantsev, A. V. and N. R. Pace (2006). Bacterial RNase P: a new view of an ancient enzyme. Nat Rev Microbiol 4: 729-40.

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12: 656-64.

Kim, K. W. and S. B. Lee (2003). Growth of the hyperthermophilic marine archaeon Aeropyrum pernix in a defined medium. J Biosci Bioeng 95: 618-22.

Koonin, E. V., Y. I. Wolf, et al. (2001). Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Res 11: 240-52.

Krause, M. and D. Hirsh (1987). A trans-spliced leader sequence on actin mRNA in C. elegans. Cell 49: 753-61.

Krupovic, M. and D. H. Bamford (2008). Archaeal proviruses TKV4 and MVV extend the PRD1-adenovirus lineage to the phylum Euryarchaeota. Virology 375: 292-300.

Lai, L. B., P. P. Chan, et al. (2010). Discovery of a minimal form of RNase P in Pyrobaculum. PNAS (In Press).

Lai, L. B., A. Vioque, et al. (2010). Unexpected diversity of RNase P, an ancient tRNA processing enzyme: challenges and prospects. FEBS Lett 584: 287-96.

Langer, D., J. Hain, et al. (1995). Transcription in archaea: similarity to that in eucarya. Proc Natl Acad Sci U S A 92: 5768-72.

Larkin, M. A., G. Blackshields, et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947-8.

Lau, N. C., L. P. Lim, et al. (2001). An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294: 858-62.

LeBlanc, H., A. S. Lang, et al. (1999). Transcript cleavage, attenuation, and an internal promoter in the Rhodobacter capsulatus puc operon. J Bacteriol 181: 4955-60.

Li, D., D. K. Willkomm, et al. (2009). Minor changes largely restore catalytic activity of archaeal RNase P RNA from Methanothermobacter thermoautotrophicus. Nucleic Acids Res 37: 231-42.

Li, H., C. R. Trotta, et al. (1998). Crystal structure and evolution of a transfer RNA splicing enzyme. Science 280: 279-84.

Li, H., J. Wang, et al. (2008). A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science 321: 1357-61.

Li, Y. and S. Altman (2004). In search of RNase P RNA from microbial genomes. RNA 10: 1533-40.

Liu, F. and S. Altman (2010). Ribonuclease P. New York, Springer-Verlag.

Lombo, T. B. and V. R. Kaberdin (2008). RNA processing in Aquifex aeolicus involves RNase E/G and an RNase P-like activity. Biochem Biophys Res Commun 366: 457-63.

Loria, A. and T. Pan (1996). Domain structure of the ribozyme from eubacterial ribonuclease P. RNA 2: 551-63.

Lowe, T. M. and S. R. Eddy (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955-64.

Lowe, T. M. and S. R. Eddy (1999). A computational screen for methylation guide snoRNAs in yeast. Science 283: 1168-71.

Ludwig, H., G. Homuth, et al. (2001). Transcription of glycolytic genes and operons in Bacillus subtilis: evidence for the presence of multiple levels of control of the gapA operon. Mol Microbiol 41: 409-22.

Lukashin, A. V. and M. Borodovsky (1998). GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26: 1107-15.

Marck, C. and H. Grosjean (2003). Identification of BHB splicing motifs in intron-containing tRNAs from 18 archaea: evolutionary implications. RNA 9: 1516-31.

Markowitz, V. M., N. N. Ivanova, et al. (2008). IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 36: D534-8.

Marquez, S. M., J. K. Harris, et al. (2005). Structural implications of novel diversity in eucaryal RNase P RNA. RNA 11: 739-51.

Marszalkowski, M., D. K. Willkomm, et al. (2008). 5'-end maturation of tRNA in Aquifex aeolicus. Biol Chem 389: 395-403.

Maruyama, S., J. Sugahara, et al. (2009). Permuted tRNA genes in the nuclear and nucleomorph genomes of photosynthetic eukaryotes. Mol Biol Evol.

Massire, C., L. Jaeger, et al. (1997). Phylogenetic evidence for a new tertiary interaction in bacterial RNase P RNAs. RNA 3: 553-6.

Massire, C., L. Jaeger, et al. (1998). Derivation of the three-dimensional architecture of bacterial ribonuclease P RNAs from comparative sequence analysis. J Mol Biol 279: 773-93.

McClay, J. L. and E. J. van den Oord (2005). Split genes uncovered through science fusion. Heredity 95: 1-2.

McDaniel, L., M. Breitbart, et al. (2008). Metagenomic analysis of lysogeny in Tampa Bay: implications for prophage gene expression. PLoS One 3: e3263.

Moreno-Hagelsieb, G. and J. Collado-Vides (2002). A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18: S329-S336.

Nakamura, Y., T. Gojobori, et al. (2000). Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res 28: 292.

Nawrocki, E. P., D. L. Kolbe, et al. (2009). Infernal 1.0: inference of RNA alignments. Bioinformatics 25: 1335-7.

Nielsen, P. and A. Krogh (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21: 4322-9.

Nolan, J. M., D. H. Burke, et al. (1993). Circularly permuted tRNAs as specific photoaffinity probes of ribonuclease P RNA structure. Science 261: 762-5.

Omer, A. D., T. M. Lowe, et al. (2000). Homologs of small nucleolar RNAs in Archaea. Science 288: 517-22.

Pace, N. R. (1997). A molecular view of microbial diversity and the biosphere. Science 276: 734-40.

Pavesi, A., F. Conterio, et al. (1994). Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions. Nucleic Acids Res 22: 1247-56.

Price, M. N., K. H. Huang, et al. (2005). A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res 33: 880-92.

Pruitt, K. D., T. Tatusova, et al. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35: D61-5.

Qin, J., R. Li, et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464: 59-65.

Qureshi, S. A., S. D. Bell, et al. (1997). Factor requirements for transcription in the Archaeon Sulfolobus shibatae. EMBO J 16: 2927-36.

Qureshi, S. A. and S. P. Jackson (1998). Sequence-specific DNA binding by the S. shibatae TFIIB homolog, TFB, and its effect on promoter strength. Mol Cell 1: 389-400.

Randau, L., K. Calvin, et al. (2005). The heteromeric Nanoarchaeum equitans splicing endonuclease cleaves noncanonical bulge-helix-bulge motifs of joined tRNA halves. Proc Natl Acad Sci U S A 102: 17934-9.

Randau, L., R. Munch, et al. (2005). Nanoarchaeum equitans creates functional tRNAs from separate genes for their 5'- and 3'-halves. Nature 433: 537-41.

Randau, L., M. Pearson, et al. (2005). The complete set of tRNA species in Nanoarchaeum equitans. FEBS Lett 579: 2945-7.

Randau, L., I. Schroder, et al. (2008). Life without RNase P. Nature 453: 120-3.

Randau, L. and D. Soll (2008). Transfer RNA genes in pieces. EMBO Rep 9: 623-8.

Reiter, W. D., P. Palm, et al. (1989). Transfer RNA genes frequently serve as integration sites for prokaryotic genetic elements. Nucleic Acids Res 17: 1907-14.

Rhead, B., D. Karolchik, et al. (2010). The UCSC Genome Browser database: update 2010. Nucleic Acids Res 38: D613-9.

Sabatti, C., L. Rohlin, et al. (2002). Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res 30: 2886-93.

Salgado, H., G. Moreno-Hagelsieb, et al. (2000). Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 97: 6652-7.

Schmidt, T. M., E. F. DeLong, et al. (1991). Analysis of a marine picoplankton community by 16S rRNA gene cloning and sequencing. J Bacteriol 173: 4371-8.

Schneider, K. L., K. S. Pollard, et al. (2006). The UCSC Archaeal Genome Browser. Nucleic Acids Res 34: D407-10.

Sharma, C. M., S. Hoffmann, et al. (2010). The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464: 250-5.

She, Q., K. Brugger, et al. (2002). Archaeal integrative genetic elements and their impact on genome evolution. Res Microbiol 153: 325-32.

She, Q., X. Peng, et al. (2001). Gene capture in archaeal chromosomes. Nature 409: 478.

Siepel, A. and D. Haussler (2004). Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 11: 413-28.

Singh, S. K., P. Gurha, et al. (2004). Sequential 2'-O-methylation of archaeal pre-tRNATrp nucleotides is guided by the intron-encoded but trans-acting box C/D ribonucleoprotein of pre-tRNA. J Biol Chem 279: 47661-71.

Slupska, M. M., A. G. King, et al. (2001). Leaderless transcripts of the crenarchaeal hyperthermophile Pyrobaculum aerophilum. J Mol Biol 309: 347-60.

Soma, A., A. Onodera, et al. (2007). Permuted tRNA genes expressed via a circular RNA intermediate in Cyanidioschyzon merolae. Science 318: 450-3.

Soppa, J. (1999). Transcription initiation in Archaea: facts, factors and future aspects. Mol Microbiol 31: 1295-305.

Sprinzl, M. and K. S. Vassilenko (2005). Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res 33: D139-40.

Stahl, D. A., D. J. Lane, et al. (1985). Characterization of a Yellowstone hot spring microbial community by 5S rRNA sequences. Appl Environ Microbiol 49: 1379-84.

Staley, J. T. and A. Konopka (1985). Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol 39: 321-46.

Stark, B. C., R. Kole, et al. (1978). Ribonuclease P: an enzyme with an essential RNA component. Proc Natl Acad Sci U S A 75: 3717-21.

Sugahara, J., K. Fujishima, et al. (2009). Disrupted tRNA gene diversity and possible evolutionary scenarios. J Mol Evol.

Sugahara, J., K. Kikuta, et al. (2008). Comprehensive analysis of archaeal tRNA genes reveals rapid increase of tRNA introns in the order Thermoproteales. Mol Biol Evol 25: 2709-16.

Sun, S., J. Chen, et al. (2010). Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res.

Sutton, R. E. and J. C. Boothroyd (1986). Evidence for trans splicing in trypanosomes. Cell 47: 527-35.

Tolstrup, N., C. W. Sensen, et al. (2000). Two different and highly organized mechanisms of translation initiation in the archaeon Sulfolobus solfataricus. Extremophiles 4: 175-9.

Torarinsson, E., H. P. Klenk, et al. (2005). Divergent transcriptional and translational signals in Archaea. Environ Microbiol 7: 47-54.

Tsai, H. Y., L. B. Lai, et al. (2002). A modified pBluescript-based vector for facile cloning and transcription of RNAs. Anal Biochem 303: 214-7.

Tsai, H. Y., D. K. Pulukkunat, et al. (2006). Functional reconstitution and characterization of Pyrococcus furiosus RNase P. Proc Natl Acad Sci U S A 103: 16147-52.

Vioque, A., J. Arnez, et al. (1988). Protein-RNA interactions in the RNase P holoenzyme from Escherichia coli. J Mol Biol 202: 835-48.

von Jan, M., A. Lapidus, et al. (2010). Complete genome sequence of Archaeoglobus profundus type strain (AV18T). Standards in Genomic Sciences 2: 327-346.

Wan, X. F., S. M. Bridges, et al. (2004). Revealing gene transcription and translation initiation patterns in archaea, using an interactive clustering model. Extremophiles 8: 291-9.

Wang, G., H. W. Chen, et al. (2010). PNPASE regulates RNA import into mitochondria. Cell 142: 456-67.

Waterhouse, A. M., J. B. Procter, et al. (2009). Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189-91.

Waters, E., M. J. Hohn, et al. (2003). The genome of Nanoarchaeum equitans: insights into early archaeal evolution and derived parasitism. Proc Natl Acad Sci U S A 100: 12984-8.

Westover, B. P., J. D. Buhler, et al. (2005). Operon prediction without a training set. Bioinformatics 21: 880-8.

Willkomm, D. K., J. Minnerup, et al. (2005). Experimental RNomics in Aquifex aeolicus: identification of small non-coding RNAs and the putative 6S RNA homolog. Nucleic Acids Res 33: 1949-60.

Woese, C. R. and G. E. Fox (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74: 5088-90.

Woese, C. R., O. Kandler, et al. (1990). Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87: 4576-9.

Wu, D., P. Hugenholtz, et al. (2009). A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462: 1056-60.

Wurtzel, O., R. Sapra, et al. (2010). A single-base resolution map of an archaeal transcriptome. Genome Res 20: 133-41.

Xu, Y., C. D. Amero, et al. (2009). Solution structure of an archaeal RNase P binary protein complex: formation of the 30-kDa complex between Pyrococcus furiosus RPP21 and RPP29 is accompanied by coupled protein folding and highlights critical features for protein-protein and protein-RNA interactions. J Mol Biol 393: 1043-55.

Zhang, J., E. Li, et al. (2009). Protein-coding gene promoters in Methanocaldococcus (Methanococcus) jannaschii. Nucleic Acids Res 37: 3588-601.
