Shiraz A. Shah

Characterising the CRISPR immune system in Archaea using genome sequence analysis

Department of Biology, University of Copenhagen
Danish Archaea Centre
http://dac.molbio.ku.dk
September 2012

Shiraz A. Shah: Characterising the CRISPR immune system in Archaea using genome sequence analysis, Ph. D. thesis, September 2012.

Website: http://dac.molbio.ku.dk/index.php?page=shirazcv
E-mail: [email protected]

This Ph. D. thesis is based on work carried out between May 2007 and July 2012 at the Danish Archaea Centre, University of Copenhagen (http://dac.molbio.ku.dk/).

ABSTRACT

Archaea, a group of microorganisms distinct from bacteria and eukaryotes, are equipped with an adaptive immune system called the CRISPR system, which relies on an RNA interference mechanism to combat invading viruses and plasmids. Using a genome sequence analysis approach, the four components of archaeal genomic CRISPR loci were analysed, namely repeats, spacers, leaders and cas genes. Based on analysis of spacer sequences it was predicted that the immune system combats viruses and plasmids by targeting their DNA. Furthermore, analysis of repeats, leaders and cas genes revealed that CRISPR systems exist as distinct families with key differences between them. Closely related archaea were seen harbouring different CRISPR systems, while some distantly related archaea carried similar systems, indicating frequent horizontal exchange. Moreover, it was found that cas genes of Type I CRISPR systems could be divided into functionally independent modules which occasionally exchange to form new combinations of Type I systems. In addition, Type III systems were found to be genomically associated with various combinations of accessory genes which may play a role in functionally extending the activity of the Type III interference complexes. This dynamic nature of the CRISPR immune systems may be a prerequisite for their continued efficacy against the ever-changing threats from which they protect their hosts.

SUMMARY

Archaea comprise a group of microorganisms distinct from both bacteria and eukaryotes. These organisms are equipped with an adaptive immune system against invading viruses and plasmids. The immune system works by taking up DNA from a virus, and saving it on the host’s own chromosome as a template to produce interference RNA. The RNA recognises the virus the next time it infects and signals the degradation of its genetic material. All the components of this system are encoded on the chromosomes of the organisms, and by looking into these components using genome sequence analysis, a number of insights were gained. It was found that the immune system kills viruses by targeting their DNA, first and foremost. Furthermore, different archaea have different variants of the immune system, and most archaea harbour several variants at the same time, probably to aid them in targeting different types of viruses. The systems themselves are composed of independent modules which are responsible for different stages of the immune response. By combining the modules in various combinations and extending them with additional components as well as exchanging them with other archaea, the organisms ensure that their immune systems are fit to handle diverse and continuously evolving threats.

DANISH SUMMARY

Archaea make up a group of organisms which are different from both bacteria and eukaryotes. They are equipped with an adaptive immune system against invading viruses and plasmids. The immune system works by taking up the DNA of the virus, which is stored in the host's own chromosome and thereby used as a template for producing interference RNA. The RNA recognises the virus the next time it infects, and thereby signals the degradation of the genetic material of the virus. All components of the immune system are encoded in the DNA of the organisms, and by examining the components through genome sequence analysis a number of discoveries were made. We found that the immune system kills viruses first and foremost by attacking their DNA. In addition, different archaea have different variants of the immune system, and most archaea possess several of the variants at once, which probably helps them to tackle different types of viruses. The immune systems themselves consist of independent modules, each of which is responsible for a different stage of the immune reaction. By combining the modules in different combinations, or extending them with additional components, as well as exchanging them with other archaea, the organisms ensure that their immune system is up to date and able to withstand the diverse threats which are constantly evolving.

PREFACE

The work presented in this thesis was carried out at the Danish Archaea Centre at the Department of Biology, University of Copenhagen from May 2007 to July 2012 under the supervision of Professor Roger A. Garrett.

The initial objective of the Ph.D. study was to characterise the CRISPR immune system in Sulfolobus species using computational methods, especially comparative genome sequence analysis. Sulfolobus species have been extensively studied with regard to the viruses and plasmids which infect them, with many genome sequences available of hosts as well as their extrachromosomal elements. Furthermore, Sulfolobus species harbour extensive and diverse CRISPR immune systems. Thus there was more than enough data to begin the analyses.

After exhausting the possible ways of analysing CRISPR spacer, repeat, leader and cas gene sequences from Sulfolobales, the analyses were extended to the rest of the available archaeal genomes in collaboration with Dr. Gisle A. Vestergaard, starting July 2010. Gisle worked with me on the project until October 2011, after which I took it over. Extending the study to other archaea proved fruitful, but was also a big mouthful, and these analyses are still in the process of being completed. Preliminary results have, however, been included in this thesis, and the last part in particular is dedicated to them.

The results from this Ph.D. study have been published across many individual research papers, which are all enclosed, and most of which have multiple co-authors. Therefore the extent of my own contributions to each of these papers has been stated on the sheet preceding every paper.

To clarify the extent of my collaboration with Dr. Gisle A. Vestergaard and its influence on what is presented in this thesis: he was deeply involved in all work concerning the classification of archaeal cas genes into separate functional modules (aCas, iCas, etc.), and the classification of iCmr modules into 5 families, A through E. In these studies our workload was more or less equal. Some aspects of this work are already published while others remain unpublished. As for the analysis of archaeal CRISPR repeats and leaders, as well as the definition of iCmr accessory genes, these analyses were conducted by myself after Gisle left the project and are still unpublished.

ACKNOWLEDGEMENTS

First and foremost I’d like to thank my supervisor, Professor Roger A. Garrett. We had many long, inspiring discussions. Also, with Roger I saw how experience and wisdom go hand in hand. But most importantly I want to thank him for having patience and maintaining his trust in me during the challenges which were faced.

During my Ph.D. I made friends. Chandra, Gisle, Chao and Ling. Four friends in four years. It’s difficult not to be thankful for that.

Also, I had the privilege of working in a lab full of absolutely terrific people. I can’t think of a single exception. Despite many of us having to deal with our own day-to-day challenges, between us there was a genuinely positive and understanding atmosphere. When comparing with most other workplaces you realise that this is the kind of thing you mustn’t take for granted.


CONTENTS

1 Introduction
  1.1 The tree of life
  1.2 Archaea - the third domain of life
    1.2.1 Defining characteristics of Archaea
    1.2.2 Archaeal diversity: Ecology, taxonomy and genomics
  1.3 Sulfolobus, a crenarchaeal model organism
    1.3.1 Sulfolobus genomics
    1.3.2 Sulfolobales viruses and plasmids
  1.4 CRISPR: Adaptive immunity in archaea and bacteria
    1.4.1 A history of the research
    1.4.2 CRISPR bioinformatics
2 Results
  2.1 Spacers
    2.1.1 A note on nucleotide vs. amino acid spacer matches
  2.2 Repeats
  2.3 Leaders
  2.4 cas genes and CRISPR genomic context
    2.4.1 Evidence for iCmr accessory genes
3 Discussion & perspectives
4 References
5 Publications
  5.1 Paper 1
  5.2 Paper 2
  5.3 Paper 3
  5.4 Paper 4
  5.5 Paper 5
  5.6 Paper 6
  5.7 Paper 7
  5.8 Paper 8
  5.9 Paper 9
  5.10 Paper 10
  5.11 Book chapter A
  5.12 Book chapter B


LIST OF TABLES

Table 1 Features of Archaea, Bacteria and Eucarya
Table 2 Extremophile record-holders
Table 3 Sequenced Sulfolobales genomes
Table 4 Properties of Sulfolobus viruses & plasmids
Table 5 cas genes and their functions
Table 6 Overview of accessory iCmr genes


LIST OF FIGURES

Figure 1 Haeckel’s tree of life
Figure 2 16S-RNA universal phylogeny
Figure 3 RNA polymerases from the three domains
Figure 4 Archaeal 16S RNA phylogenetic tree
Figure 5 A Sulfolobus infected with a virus
Figure 6 Morphologies of select archaeal viruses
Figure 7 CRISPR immunity: mode of action
Figure 8 Gene maps of accessory iCmr genes
Figure 9 Tree of csx1 genes from Sulfolobales


1 INTRODUCTION

1.1 THE TREE OF LIFE

Life is diverse. In an attempt to make sense of this diversity, throughout history biologists have tried to classify life into classes and sub classes. One of the earliest schemes involved the classification of all life into plant and animal ‘kingdoms’¹. The limitations of this scheme became more and more obvious with the continuous discovery of organisms which seemed to resemble neither plants nor animals.

¹ The animals vs. plants classification dates back to Aristotle, ca. 300 BC.

Later, Ernst Haeckel² devised a third kingdom, Protista, a catch-all class for organisms which did not fit the animal-plant dichotomy. Inspired by the emerging theory of evolution, Haeckel was the first to depict the relationships between various life forms within the three kingdoms in what resembles a modern phylogenetic tree (Figure 1). Initially, most unicellular organisms, as well as algae, diatoms, slime moulds, marine sponges, etc. were placed in this new non-plant, non-animal kingdom. Eventually, the kingdom was redefined to contain unicellular organisms only, with all multicellular organisms reclassified as being either plants or animals.

² Ernst Heinrich Philipp August Haeckel, 1834 - 1919, also known as the German Darwin.

Figure 1: Haeckel’s Monophyletischer Stammbaum der Organismen, i.e. ‘Monophyletic family tree of organisms’. This tree was based on the assumption that all organisms on earth are ‘monophyletic’, i.e. share a common ancestor. Haeckel also made other, ‘polyphyletic’ trees, but was inclined towards the monophyletic model as being the most attractive one[14].

With the advent of electron microscopy in the 1940s it became increasingly clear that the concept of ‘kingdoms’, at least as the upper-most rank, was rather artificial. On electron micrographs the differences between nucleated and non-nucleated cells seemed far greater than those between cells of different kingdoms. This marked the beginning of a new dichotomy, i.e. that of prokaryotes vs. eukaryotes.

To classify organisms hierarchically and meaningfully into classes and sub classes, as is done in the discipline of taxonomy, the organisms must be compared to one another. Comparison is only possible between inherently comparable traits, i.e. the organisms being compared must have some characters in common. But the classification of microorganisms based on phenotypic characters was difficult. Phenotypically, diverse microorganisms had practically no characters in common, so a unified classification scheme was impossible to establish. Thus the new Prokaryote domain of life was no more than a catch-all term for non-nucleated cells and therefore potentially artificial. Likewise, Protista was retained as a catch-all term for eukaryotes which defied classification (everything except plants and animals at the time). In other words, the new Eukaryote/Prokaryote dichotomy, despite becoming widely accepted, hadn’t resolved the underlying problem. There was still a real possibility that this scheme would be deemed artificial in the light of some new insight.

1.2 ARCHAEA - THE THIRD DOMAIN OF LIFE

In a 1977 paper[82] Carl Woese and colleagues demonstrated how phylogeny could be based on molecular criteria rather than morphological criteria. Furthermore, by using small subunit (SSU) ribosomal RNA (or 16S rRNA), for the first time ever it became possible to make an apples-to-apples comparison of all life on earth. In principle, this enabled a unified classification scheme for all known life forms without having to rely on any artificial criteria. However, the picture which emerged (Figure 2) from basing phylogeny on 16S rRNA was so radically different from what was established at the time that it took years for the scientific community to come to terms with it[14].

The 16S rRNA based tree offered a glimpse of what the diversity of life looks like when seen in full view. Crucially, the plant and animal kingdoms which had previously taken up the most space were now confined to a corner of the new tree of life, while the rest of the tree was occupied by microorganisms. But even more importantly, organisms with a prokaryotic cell structure did not comprise a monophyletic group, rendering the Prokaryote-Eukaryote dichotomy meaningless. Instead it became obvious that life exists in three fundamentally different flavours termed ‘domains’, one of which, the Archaea, had so far gone unnoticed.
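The idea of comparing all organisms through one shared molecule can be illustrated with a small calculation. The sketch below computes the proportion of differing positions (the p-distance) between pairs of aligned sequences, which is one simple way to turn an alignment into the pairwise distances that a tree can be built from. It is only an illustration under stated assumptions: the three sequences are invented toy fragments rather than real 16S rRNA data, the names `p_distance`, `distance_matrix` and `toy_alignment` are hypothetical, and this is not the procedure used in the cited work.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip alignment gaps in either sequence
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances for every pair of names in the alignment."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Hypothetical, made-up aligned fragments (NOT real 16S rRNA sequences).
toy_alignment = {
    "organism_A": "ACGUACGUAC",
    "organism_B": "ACGUACGUAU",
    "organism_C": "ACGAACG-AU",
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(toy_alignment).items():
        print(pair, round(dist, 3))
```

Because every organism carries the same molecule, every pair can be compared on the same scale, which is what phenotypic characters could not offer for microorganisms.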

Figure 2: Universal phylogenetic tree based on 16S rRNA, showing the three domains, adapted from [83]. The numbers correspond to, Bacteria: 1, the Thermotogales; 2, the flavobacteria and relatives; 3, the cyanobacteria; 4, the purple bacteria; 5, the Gram-positive bacteria; and 6, the green nonsulfur bacteria. Archaea: the kingdom Crenarchaeota: 7, the genus Pyrodictium; and 8, the genus Thermoproteus; and the kingdom Euryarchaeota: 9, the Thermococcales; 10, the Methanococcales; 11, the Methanobacteriales; 12, the Methanomicrobiales; and 13, the extreme halophiles. Eucarya: 14, the animals; 15, the ciliates; 16, the green plants; 17, the fungi; 18, the flagellates; and 19, the microsporidia.

1.2.1 Defining characteristics of Archaea

The domain Archaea covers a diverse group of microorganisms. They share some features with eukaryotes and others with bacteria. Also, they have their own features which are completely unique to this new domain of life. Put shortly, archaea can be described as:

• having a ‘prokaryotic’ cell structure, i.e. no nucleus and organelles, and cells which superficially resemble bacteria

• having a core cellular machinery (i.e. replication, transcription, translation, etc.) which resembles the one in eukaryotes

• thriving in extreme environments (high temperature, pressure, acidity, salinity, etc.) where they are the dominating life form, outcompeting bacteria and eukaryotes.

Table 1 summarises some differences and similarities between Archaea and the two other domains of life. The combinations of shared and unique features paint an interesting picture. Consider, as an example, that multicellular macroorganisms are exclusive to the Eucarya, or the fact that virtually no eukaryotes are extremophiles or chemolithotrophs, whereas these qualities are widespread within both the Bacteria and Archaea. So the fact that archaea share a feature as fundamental as their information processing machinery with the eukaryotes rather than bacteria cements their being a separate and unique domain of life; they are equally different, so to speak, from both Bacteria and Eucarya, sharing some features with either, as well as having their own unique features.

Cell walls

Most known archaeal species have a cell wall composed entirely of protein subunits arranged in hexagonal symmetry encompassing the entire cell surface. This gives the cell surface a lattice-like appearance in scanning electron micrographs. Such a cell wall is called a ‘surface layer’ or S-layer. The S-layer proteins are often glycosylated. Other species of archaea have a cell wall composed entirely of polysaccharide. Additionally, in some species the cell wall is composed of a compound called pseudopeptidoglycan, which bears some resemblance to bacterial peptidoglycan, but also has key differences. Thus we see considerable diversity in archaeal cell walls, in contrast to bacteria, where peptidoglycan is almost universally present.

Feature                      Archaea      Eucarya      Bacteria

Nucleated cells              no           yes          no
Membrane link                Ether        Ester        Ester
Cell wall                    misc.        misc.        peptidoglycan
Actin-based cytoskeleton     yes          yes          no
Circular chromosomes         yes          no           yes
Histones                     yes          yes          no

Macroorganisms               no           yes          no
Chemolithotrophy             yes          no           yes
Extremophily                 yes          rare         yes

Operons                      yes          no           yes
Plasmids                     yes          no           yes

Replication sliding clamp    PCNA         PCNA         β-complex
Transcription factors        TBP, TFII    TBP, TFII    σ-factor
Initiator tRNA               methionine   methionine   formylmethionine

Table 1: Some features of Archaea, Bacteria and Eucarya compared. In summary, Archaea are equidistant from both Bacteria and Eucarya, sharing some features with either, as well as having their own unique features.

A striking difference between Archaea and the two other domains is in the chemical structure of their membrane lipids. While in bacteria and eukaryotes long chain hydrocarbons are attached to the glycerol backbone via ester linkages, in archaea the bond to glycerol is an ether linkage. Thus the presence of ether-linked lipids in, e.g., an environmental sample is always an indication of the presence of archaea. Moreover, the long chain hydrocarbons are often branched and can even contain cyclic structures rarely seen in bacteria and eukaryotes. Also, in many species, opposing hydrocarbon chains in the lipid bilayer are covalently linked (effectively making it a monolayer), resulting in a membrane which is more rigid and less permeable.

Cytoskeleton

The cytoskeleton, an organelle previously thought to be exclusive to Eucarya, is composed of long filaments of the protein actin. The filaments traverse the inside of cells and provide the structural support which gives the cell its shape. A recent study[18] revealed the presence of an actin-based cytoskeleton in members of the hyperthermophilic crenarchaeal order Thermoproteales. The cytoskeleton is likely to be responsible for maintaining the rod-shaped morphology which is a key trait of this particular order of crenarchaea.

Information processing

Archaeal information processing mechanisms (i.e. replication, transcription and translation) bear unmistakable resemblance to their counterparts within Eucarya.

Figure 3: Projections of bacterial (left), archaeal (middle) and eukaryotic (right) RNA polymerase 3D structures. The archaeal and eukaryotic polymerases are very similar.

Each of the three types of RNA polymerase present in eukaryotic cells resembles the archaeal RNA polymerase more closely than any of them resembles bacterial RNA polymerase[60]. The archaeal RNA polymerase looks like a simplified version of eukaryotic RNA polymerase II (Figure 3), which is the RNA polymerase used by eukaryotes to transcribe protein coding genes. Although the five bacterial RNA polymerase subunits (2 α, β, β′ and ω) have counterparts in both archaea and eukaryotes, archaeal polymerases carry an additional 5 to 6 subunits, absent from bacteria, which all have counterparts in eukaryotes[77]. Eukaryotic RNA polymerases do, however, have some additional subunits which are not found in archaea, or in bacteria for that matter.

Likewise, DNA replication in Archaea resembles a simplified version of its eukaryotic counterpart[4]. This has made archaeal organisms attractive models for studying eukaryotic replication. Also, like eukaryotes, archaeal species can possess multiple replication origins[42].

Bacteria and archaea have 70S ribosomes and 16S SSU rRNA, while eukaryotes have 80S ribosomes and 18S SSU rRNA. Despite the similarity in size between the bacterial and archaeal ribosomes, the archaeal ribosomes functionally more closely resemble their eukaryotic counterparts. Although translation in archaea has not yet been studied in great detail, antibiotic and toxin sensitivities testify to the similarities to the eukaryotic system. In Bacteria, for example, erythromycin inhibits translation by binding the large ribosomal subunit and blocking the exit tunnel for the growing peptide, which is not the case for archaea and eukaryotes. On the other hand, bacterial ribosomes are insensitive to diphtheria toxin, which interferes with protein synthesis by chemically modifying Elongation Factor II in both archaea and eukaryotes[34].

Extremophily

Not all archaea are extremophiles which thrive in hostile environments. Nevertheless, extremophilic species are by far over-represented within Archaea when compared with Bacteria and Eucarya. This is exemplified in Table 2, where most of the current record holders for life in extreme environments are archaeal species, rather than bacteria or eukaryotes.

It has been proposed that this added ability to cope with extreme environments is an inherent consequence of the nature of the archaeal cell and its biochemistry, which has adapted to deal with chronic energy stress and make the best of low net energy yields[73]. Although the environments in which extremophilic species dwell may not be low in energy as such, making chemical reactions thermodynamically favourable requires upholding strong chemical gradients which are taxing on the cell’s resources. In addition, high temperatures and acidic environments cause chemical breakdown, resulting in considerable maintenance costs even when cells are not growing. The low permeability of the ether-linked membranes in archaea, along with a metabolism optimised for energy efficiency, helps the cells to deal with persistent energy stress. This in turn allows archaea to thrive on the fringes of what the planet has to offer.

1.2.2 Archaeal diversity: Ecology, taxonomy and genomics

Taxonomically, Archaea can be divided into a number of phyla, two of which, the Euryarchaeota and Crenarchaeota, are well established. In addition to the two major phyla, the phyla Korarchaeota[16], Thaumarchaeota[9] and recently, Aigarchaeota[54] have been proposed. Although these new phyla are less well represented and subject to some debate¹, it is worth noting that all five phyla are separated by deep branches on phylogenetic trees like the one in Figure 4. More significantly, however, there are a number of physiological properties² which separate them, such as the different sets of features which they have in common with eukaryotes. For example, crenarchaea and thaumarchaea lack Eucarya-type histones, whereas euryarchaea and aigarchaea have them[54]. On the other hand, crenarchaeal and thaumarchaeal species harbour components of the eukaryotic ESCRT-III (endosomal sorting complex required for transport-III) system, which they use for cell division[11, 66], while euryarchaea and possibly also aigarchaea employ a homolog of the bacterial cell division protein, FtsZ, for this task. All this is despite thaumarchaea being more similar to euryarchaea[9], and aigarchaea being more similar to crenarchaea[54], as revealed by comparative genomics. The genome sequence of C. subterraneum, the first and so far only aigarchaeal genome to have been almost fully sequenced, encodes a Eucarya-type ubiquitin modification and degradation system[54], the first of its kind to be found in any non-eukaryote.

¹ Until recently thaumarchaea as well as aigarchaea were believed to be part of Crenarchaeota[9, 54], and Nanoarchaeum equitans, which initially formed its own phylum, is now believed to be a fast-evolving euryarchaeon[8].

² Inferred from genome sequences.

Extreme        Value               Domain      Species

temperature    122 °C              Archaea     Methanopyrus kandleri
temperature    -12 °C              Bacteria    Psychromonas ingrahamii
pH             -0.06               Archaea     Picrophilus oshimae
pH             12                  Archaea     Natronobacterium gregoryi
pressure       120 MPa             Archaea     Pyrococcus yayanosii
salinity       32% (saturation)    Archaea     Halobacterium salinarum
radiation      30 kGy              Archaea     Thermococcus gammatolerans

Table 2: Current record-holders for life in extreme environments, most of which are archaea, and some of which are polyextremophiles: P. oshimae is also a thermophile, with an optimal growth rate at 60 °C. Likewise N. gregoryi is also an extreme halophile, and P. yayanosii and T. gammatolerans are both hyperthermophiles with optimal growth at more than 80 °C.
This patchy distribution of common and unique features between the phyla themselves, and between them and Eucarya, is a testament to the depth of their divergence and thus supports their status as separate phyla³.

³ This is reminiscent of the patchy distribution of features common and unique to the three domains in Table 1, and how this supported that the domains were in fact separate.

Almost certainly, additional archaeal phyla exist in nature, which may be at least as divergent as the ones described here, but their formal classification is hampered by the limitations of culture-dependent methods. Thus the archaeal diversity so far recognised, although appreciable, is almost certainly underestimated.

Figure 4 gives an overview of the diversity of the 125 archaeal genomes sequenced at the time of writing. All species indicated as inhabiting ‘hydrothermal’ habitats are thermophiles and hyperthermophiles which thrive at temperatures between 50 and 120 °C. Taxonomically, they are diverse, with representatives from four of the five phyla. They comprise the majority of characterised archaeal species, probably due to the ease of culturing them rather than their abundance in nature. They inhabit everything from terrestrial hot springs to deep hydrothermal vents, where geothermally heated water forms the basis for their lifestyle. Sulphur, an abundant element in such environments, is often an integral part of the energy metabolism of such species, many of which can grow chemolithotrophically, as aerobes or anaerobes.

Figure 4: Phylogenetic tree based on 16S rRNA sequences from all (125) available completely sequenced archaeal genomes. Members of the same taxonomic order have been grouped into triangles, where the triangle’s vertical height indicates the number of members, also indicated in brackets, and its horizontal width the diversity within the group. The branch length corresponding to a 1% difference in 16S rRNA sequence is indicated by the ruler in the upper left. The two major and three minor phyla are colour coded on the bar to the left. Crenarchaeota: green, Aigarchaeota: light green, Korarchaeota: yellow, Thaumarchaeota: light blue, Euryarchaeota: dark blue. The ecological habitat is indicated by the line to the right of the bar. Hydrothermal: orange, marine: blue, wetland sediment: brown, hyper-saline lake: pink.

The methanogens inhabiting ‘wetland sediments’ (Figure 4) are mesophiles which produce methane from carbon dioxide or acetate in anoxic environments, using carbon as an electron acceptor for energy. Although this type of metabolism provides little energy surplus[73], it enables the exploitation of an unoccupied niche. Since only methanogens are able to carry out methanogenesis, the environmental occurrence of methane¹ is an unambiguous indicator of their presence. Apart from wetlands, some of these species are also found in sewage, while others are known to colonise the digestive tracts of animals, where they aid in the metabolism of complex carbohydrates.

¹ With the exception of natural gas deposits.

Species inhabiting ‘hyper-saline’ habitats comprise a monophyletic group from a single (although diverse) taxonomic order, the Halobacteriales. They thrive in salt lakes where salinity approaches saturation. Some halobacterial species are polyextremophiles, since the salt lakes which they inhabit can get excessively hot, as well as alkaline. The intense pink colour of salt lakes is caused by a halobacterial pigment, halorhodopsin, which captures photons from sunlight to generate energy in a manner which is unrelated to photosynthesis.

Archaeal species inhabiting ‘marine’ environments live in sea water and are both abundant and diverse, though notoriously difficult to culture due to a lifestyle where they are part of complex microbial communities. Until now only two species have had their genomes sequenced completely, namely Nitrosopumilus maritimus and Cenarchaeum symbiosum, both of which are members of the phylum Thaumarchaeota. Both species are able to obtain energy through ammonia oxidation and carbon through carbon fixation, but while the former lives planktonically and is very abundant throughout surface sea waters all over the world, the latter is restricted to living symbiotically with a particular type of marine sponge.

1.3 SULFOLOBUS, A CRENARCHAEAL MODEL ORGANISM

Genera within the crenarchaeal order Sulfolobales include Sulfolobus, Acidianus, Metallosphaera and Stygiolobus. Members of the Sulfolobales grow optimally at around 80 °C and were amongst the first thermophilic archaea to be isolated. With optimum growth around pH 2 to 3, these organisms are acidophiles in addition to being hyperthermophiles. Terrestrial geothermal environments which are acidic and rich in sulphur comprise their natural habitats. Such environments are found near volcanically active areas in the shape of hot springs, fumaroles and mud pots.

Figure 5: A Sulfolobus cell infected with the virus STSV1. The bar in the bottom left corresponds to 1 µm.

Sulfolobales species are metabolically versatile and can grow chemoorganotrophically, obtaining energy by catabolising organic compounds, or chemolithotrophically by oxidising H2S or elemental sulphur (S0) to H2SO4 (sulfuric acid). The latter mode of growth is responsible for acidifying the environments they live in. Some species such as M. sedula or S. metallicus have the added capability of using iron as an electron donor by oxidising Fe(II) to Fe(III), with potential applications in bioleaching. Moreover, species of the genus Acidianus can grow by anaerobic respiration, using S0 as an electron acceptor instead of a donor, and converting it to H2S in the presence of H2. Furthermore, some Sulfolobales species are capable of autotrophic growth by fixing CO2, which is an abundant gas in geothermally emitted fumes.

Owing to their relative ease of culturing, various Sulfolobus species have long served as model organisms for studying some fundamental processes in Archaea, including DNA replication and repair, transcription and translation. Versatile genetic systems have also been developed[15, 76], and a multitude of genomes have been sequenced (Table 3), making the genus an especially attractive model system to work with.

To￿.￿. date,￿ Sulfolobus sixteen Sulfolobales genomics genomes have been sequenced and are publicly available (Table 3). Of these, thirteen are from the genus Sulfolobus, making it the most extensively sequenced ar- chaeal genus so far. Sulfolobales genomes range approximately in size between 2 and 3 Mbps and and are A/T-rich. Most Sulfolobus genomes are rich in transposable elements which contribute to their plasticity. The genomes contain many different families of transposable elements (Publication 5.7[25]) as well as multiple copies of similar transposable elements. The latter, in turn, act as sites of homologous recombination within the genomes, resulting in frequent inversions as well as gain and loss events. As a result, even closely related Sulfolobus species exhibit considerable differences in gene content. Most of these differences, as well as the transposable elements which facilit- ￿.￿ ￿￿￿￿￿￿￿￿￿￿,￿ ￿￿￿￿￿￿￿￿￿￿￿￿ ￿￿￿￿￿ ￿￿￿￿￿￿￿￿ 15 ate them, are contained within a hyper-variable region which is around 0.7 Mbps in length, comprising around 25% of the genome. Even when comparing two closely related strains of the same species, this particular region is extensively shuffled and heterogeneous, providing a hot spot for horizontal gene trans- fer. The hyper-variable region generally contains genes which may be beneficial, but are non-essential in nature. Examples in- clude gene cassettes associated with certain metabolic pathways, carbohydrate uptake, CRISPR/Cas loci (Section 1.4) and toxin antitoxin systems (Publication 5.12). On the other hand, house keeping genes which are essential for the cell at any given time are located outside this region throughout the rest of the genome, which is very conserved in gene content and synteny.

Organism                              Size   Ref.   Accession #

Sulfolobus solfataricus P2            3.0    [70]   AE006641
Sulfolobus tokodaii str. 7            2.7    [33]   BA000023
Sulfolobus acidocaldarius DSM 639     2.2    [13]   CP000077
Metallosphaera sedula DSM 5348        2.2    [2]    CP000682
Sulfolobus islandicus Y.G.57.14       2.7    [64]   CP001403
Sulfolobus islandicus Y.N.15.51       2.8    [64]   CP001404
Sulfolobus islandicus L.D.8.5         2.7    [64]   CP001731
Sulfolobus islandicus L.S.2.15        2.7    [64]   CP001399
Sulfolobus islandicus M.16.4          2.6    [64]   CP001402
Sulfolobus islandicus M.16.27         2.7    [64]   CP001401
Sulfolobus islandicus M.14.25         2.6    [64]   CP001400
Sulfolobus islandicus HVE10/4         2.7    [25]   CP002426
Sulfolobus islandicus REY15A          2.5    [25]   CP002425
Metallosphaera cuprina Ar-4           1.8    [41]   CP002656
Acidianus hospitalis W1               2.1    [85]   CP002535
Sulfolobus solfataricus 98/2          2.7    u.p.   CP001800
Acidianus brierleyi                   ?      u.p.

Table 3: List of Sulfolobales strains with sequenced genomes. The list is ordered according to their date of publication. The size for each genome is given in megabase pairs (Mbps), along with an article reference and the genome accession number. u.p. = unpublished.

Sulfolobales viruses and plasmids

Viruses which infect archaea are strikingly distinct from those which infect bacteria and eukaryotes. The vast majority of bacterial viruses, or bacteriophages, exhibit a morphology with an icosahedral head attached to a tail. Archaeal viruses on the other hand have diverse morphologies (reviewed in[58]), with the bacteriophage-like head/tail morphology being observed only occasionally. The most common morphology seen amongst archaeal viruses is that of the ‘fuselloviruses’, which resembles the shape of a lemon. The lemon-shaped viruses have been seen associated with diverse euryarchaeal as well as crenarchaeal hosts from the orders Sulfolobales[63], Halobacteriales[5] and Thermococcales[22], but it is unknown to what extent the viruses are related, as no shared genes have been found. Although variations on this central lemon-shaped theme are seen in other archaeal virus families, which are ‘droplet’-shaped or tailed, there is little evidence that any of them are related to each other or to fuselloviruses in general.

Apart from lemon-shaped virions, linear viruses are also found, some of which resemble stiff rods, while others are flexible filaments. Furthermore, spherical viruses have been characterised, some of which are icosahedral whereas others are enveloped in lipid membranes. Finally, a number of viruses have been observed which do not fall under any of the morphological categories mentioned above. Examples include the ‘bottle’-shaped virus ABV1[28], as well as a single-stranded DNA (ssDNA) virus with a ‘pleomorphic’ morphology[55]. With the exception of the latter, all archaeal viruses described so far are double-stranded DNA (dsDNA) viruses. Electron micrographs of representative archaeal viruses are shown in Figure 6.
Sulfolobales is the archaeal order most extensively characterised in terms of infecting viruses and plasmids, and most of our current knowledge about the diversity of archaeal viruses and plasmids stems from studies carried out initially on Sulfolobales. At the time of writing, a total of around ninety complete archaeal virus and plasmid genome sequences have been deposited in public sequence databases, around half of which are from Sulfolobales hosts. Many Sulfolobales strains are good hosts for viruses and plasmids, making them attractive as models for studying host-virus interactions in archaea.

Viruses which infect Sulfolobales include the rod-shaped rudiviruses, filamentous lipothrixviruses, lemon-shaped fuselloviruses, tailed bicaudaviruses, ‘turreted’ icosahedral viruses, a particular droplet-shaped virus and a peculiar bottle-shaped virus. Thus Sulfolobales species act as hosts for almost the entire spectrum of viral diversity currently known for Archaea. Some of the exceptional morphologies of the viruses are accompanied by equally remarkable characteristics. For example, viruses of the bicaudavirus family exit the host cells as lemon-shaped particles, only later to grow tails independently when subject to certain environmental conditions[59]. Most of the viruses which infect Sulfolobales do not cause cell lysis. Instead the virus maintains a stable relationship with the host, possibly because unnecessarily lysing the cell would excessively expose the virus to the harsh chemical environment outside, diminishing its chances of propagation in the long run. Some representative viruses along with their genomic properties have been outlined in Table 4.

As for plasmids which infect Sulfolobales, they can be crudely divided into conjugative and cryptic plasmids. Little is known about the role of the cryptic plasmids in the cells where they propagate, hence their name. The simplest of such plasmids, such as those of the pRN family, encode little more than the proteins required for their own replication. As for the large conjugative plasmids, they confer cellular conjugation. They spread efficiently within cultures, and severely retard the cellular growth rate in the process[40]. Thus, judging from the impact on growth, this class of plasmids seems more dangerous to the cell than many of the stably maintained viruses, which only have a small effect on host growth. However, there is a possibility that the presence of conjugative plasmids in Sulfolobales habitats is advantageous to the host populations in the long run, as they may be the vehicle for the widespread horizontal gene transfer which we know takes place between Sulfolobales cells. Representative plasmids are listed in Table 4 along with their genomic properties.

Figure 6: Typical morphologies of representatives of different archaeal viral families. a, SNDV; b, STSV1; c, ATV; d, SIFV; e, AFV1; f, PSV; g, SSV4; h, ARV1. Bars are 100 nm [adopted from Publication 5.11].

Name    Type                          Size (kbps)   # of ORFs

SIRV1   rod shaped virus              32            46
AFV9    filamentous virus             41            74
SSV5    lemon shaped virus            15            29
STIV2   turreted icosahedral virus    17            35
ATV     tailed lemon shaped virus     63            73
ABV     bottle shaped virus           23            58

pRN1    cryptic plasmid               5             7
pNOB8   conjugative plasmid           41            53

Table 4: Representative Sulfolobales extrachromosomal elements, listed along with their genomic properties. The type of each extrachromosomal element is given along with the size of its genome in kbps as well as the number of open reading frames (genes) annotated. Viruses with names beginning with ‘A’ were isolated from Acidianus strains, whereas the ones beginning with ‘S’ were isolated from Sulfolobus strains. However, some of these viruses exhibit a broad host specificity range, cross-infecting different genera, while others are much more specific, down to the strain level. As for the plasmids, both of the ones shown here were isolated from Sulfolobus strains, although pNOB8-like conjugative plasmids have also been isolated from Acidianus hosts.

1.4 CRISPR: adaptive immunity in archaea and bacteria

[Footnote: This section contains an overview of the field. For a more in-depth review, please see Publication 5.11.]

1.4.1 A history of the research

Clustered regularly inter-spaced short palindromic repeats (CRISPRs) are a family of tandem repeats found in the genomes of bacterial and archaeal species. Depending on the type of CRISPR, the individual repeats are between 20 and 40 base pairs (bp) in length, separated from each other by intervening spacer sequences, 30 to 50 bp long. A CRISPR array, composed of multiple repeats and spacers, can contain anywhere from a few to a hundred or more repeat/spacer units, with the total length of the array thus exceeding several thousand bp. The array is flanked on one side by a non-coding sequence called the ‘leader’, which is between 100 and 500 bp long, again depending on the CRISPR type. The leader contains the promoter sequence responsible for generating full-length transcripts of the whole array. Also, a particular set of genes called cas genes (CRISPR-associated genes) is typically found near the CRISPR array and leader. These genes encode protein complexes which work together with the CRISPR array and its RNA to confer the host cell with an adaptive immune system that fights invading viruses and plasmids. Figure 7 provides an overview of how this system works.
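The leader/repeat/spacer layout described above can be illustrated with a minimal sketch. It assumes, for simplicity, that every repeat in the array is an exact copy of a known repeat sequence; real arrays often carry degenerate repeats, so this is not how arrays were annotated in the thesis. The function name and all sequences are hypothetical and far shorter than real repeats and spacers.

```python
def split_array(array: str, repeat: str) -> list[str]:
    """Return the spacers lying between consecutive exact copies of `repeat`."""
    parts = array.split(repeat)
    # parts[0] is the sequence upstream of the first repeat (here the leader)
    # and parts[-1] is whatever follows the last repeat; the pieces in
    # between are the spacers.
    return parts[1:-1]


# Toy example with invented sequences (assumption: exact repeats).
leader = "ATATATATAT"
repeat = "GATTAATCCC"
spacers = ["ACGTACGTAA", "TTGCATGCAT", "CCGGAATTCC"]

# leader, then repeat, then alternating spacer/repeat units
array = leader + repeat + "".join(s + repeat for s in spacers)

found = split_array(array, repeat)
print(len(found), "spacers recovered")
```

With three spacers the toy array holds four repeats, mirroring the repeat/spacer units described in the text.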

Figure 7: Simplified overview of how CRISPR immunity works. Upon a first encounter with an invader, the aCas proteins involved in the adaptation stage excise a small piece of DNA (30-50 bp) adjacent to a small sequence motif (the PAM) from the invader chromosome and incorporate it as a spacer sequence between the repeats of a CRISPR array on the host chromosome. The entire CRISPR array, which also contains a record of previous invader encounters, is transcribed, and the RNA is processed by pCas proteins into small CRISPR RNAs (crRNAs). The crRNAs are loaded onto an interference complex composed of the iCas proteins, which targets and degrades invader DNA in the event of a subsequent infection. Recognition of the PAM (protospacer adjacent motif) is required during both the adaptation and interference stages.

CRISPR arrays have been described on genomes of various bacteria and archaea for decades, with almost all archaeal genomes and roughly half of bacterial genomes harbouring them. However, their function remained enigmatic until the crucial discovery that the spacer sequences originated from viruses and plasmids[7, 37, 50, 57]. Given that the CRISPR arrays were already known to be transcribed and that their RNA was processed into small CRISPR RNAs (crRNAs)[72], this new insight led to the hypothesis that the CRISPR arrays conferred the host with a nucleic acid based adaptive immune system. This aroused the attention of much larger parts of the scientific community, and progress in the field has been rapid ever since.

In 2007 it was shown that newly added spacers in CRISPR arrays of Streptococcus thermophilus conferred immunity against phages which the strains had been challenged by[3]. The following year the genes responsible for the second and third stages of the immune system (see pCas and iCas on Figure 7) were determined in E. coli[10]. [Footnote: implying aCas genes by inference.] Later that year it was shown that interference in Staphylococcus epidermidis was directed against invader DNA[47]. In 2009 a different kind of interference complex, the Cmr-complex, from Pyrococcus furiosus was shown to target RNA rather than DNA[27]. Further advances were made in 2010 and 2011, shedding light on particulars of the interference mechanism, such as mismatch tolerance in Sulfolobus[24], the mechanism for distinguishing self from non-self DNA[48], or the three-dimensional structure of the multi-subunit interference complex[80]. Progress regarding the adaptation process involving the aCas gene products (Figure 7) was only made recently, when three independent studies demonstrated how the uptake of new spacers into the CRISPR array could be achieved reproducibly in laboratory conditions in E. coli[71, 84] and Sulfolobus solfataricus[17].

1.4.2 CRISPR bioinformatics

As mentioned above, the 2005 discovery that spacer sequences originated from extrachromosomal elements was what triggered widespread interest in the field. In the time it took for experimental studies to gain momentum, there was an opportunity to shed light on some fundamental questions through purely bioinformatical studies. Such studies, being more or less successful, were concerned with issues like predicting the mode of action or devising families of CRISPR systems using data from sequenced archaeal and bacterial genomes. As the field is slowly maturing, bioinformatics of this sort will come to play a less important role. This section contains a brief overview of the bioinformatical studies made throughout the years.

Bioinformatics in the field has, for the most part, been concerned with sequence analysis of the elements of the CRISPR system, namely the repeats, spacers, leaders and cas genes, which is also what will be reviewed here. [Footnote: although very recently quite a few mathematical simulations have also begun to appear.]

Repeats

The characteristic nature of the CRISPR repeats and the way they are organised was what led to their discovery in 1987 in E. coli[30]. Early on it was apparent, from comparing CRISPRs from the few organisms in which they had been found[51], that the 3’ end of the repeat was more conserved, with its GAAAN motif, and that some repeats were partially “palindromic”, having internal inverted repeats indicative of secondary structure. A more comprehensive study was made years later[36], when data from whole genome sequencing had become abundant. CRISPRs were identified in nearly 200 bacterial and archaeal genomes, and their repeats fell within twelve clusters, half of which were palindromic and showed compensating base changes. It was found that most bacterial repeats were palindromic, whereas most archaeal repeats were not.
Also, with the emergence of abundant genomic data, new genomes were automatically scanned for the presence of short interspaced repeats, and the results were made available on public web-based CRISPR databases such as CRISPRdb[23] or CRISPI[65]. Such databases were useful for getting an overview of CRISPRs in a given genome and the genes associated with them, or for retrieving and analysing spacer sequences. However, the databases contain no information about the correct orientation of the CRISPR arrays with respect to the direction of the genome sequence, as this would require the manual identification of leader sequences, which are also absent from the databases.

Spacers

Early attempts at matching spacers to viral or plasmid genomes were hampered by the fact that searches were performed against large public sequence databases. With search databases that large, the small spacer sequences cannot produce nucleotide alignments long enough to yield significant scores. Studies like the one carried out by Mojica et al.[50], which sought to prove for the first time the extrachromosomal origin of spacers, were based on such an approach and were met with sceptical referees as a result. Back then there was no other choice than to search large public databases, since the origin of the spacers was completely unknown. In Publication 5.2[68] spacer searches were limited to relevant virus and plasmid genomes, yielding many more significant matches. The number of matches was enough to make an attempt at statistically predicting the target nucleic acid, which was found to be invader DNA rather than mRNA.

In a similar study, based on a metagenomic analysis of a biofilm containing acidophilic archaea and viruses, spacers were taken from host contigs and mapped onto viral contigs. This yielded sufficient data to conclude, using statistics, that the viral recombination rate was high enough to avoid long-term targeting by the host, rendering CRISPR-based immunity short-lived[1].

Leader

The significance of the leader sequence was recognised already in the earliest of studies in the field[31, 51]. [Footnote: Even the genome sequencing projects predating these studies managed to identify leader sequences as “long repeats” due to their multiple occurrences within the same genome.] Back then the leaders were noted for always being located on the same side of the CRISPR array, their length, their non-coding nature, the low-complexity sequence they were composed of, and finally their lack of conservation outside the species boundary. The latter was reiterated in studies which came much later[29, 37], and not much new was added to the original findings for a long time. Even with the advent of widespread bacterial and archaeal genome sequencing, there have been no systematic and large-scale bioinformatical studies addressing leaders, which is contrary to what we saw for the spacers, repeats and cas genes. Leaders have been almost ignored, probably due to the lack of any easy (i.e. automatic) way of identifying them, owing to their low sequence conservation.

A brief exception to this trend was seen when a motif searching approach (as opposed to a sequence alignment approach) was employed to identify leaders, based on the underlying assumption that various proteins must bind to sequence motifs throughout the leader sequence (Publication 5.3[38]). This approach enabled the identification of similar leaders across different species and even genera of Sulfolobales, and the results were in agreement with corresponding data for repeats and cas genes. This finding has not been followed up since, and much potential still exists for bioinformatical characterisation of leader sequences.

cas genes

Of the elements of the CRISPR system, the cas genes have been the subject of the most intense bioinformatical research. The field has been historically complicated, firstly by the immense diversity of the CRISPR system and thereby its cas gene products, and secondly by the simultaneous attempts made by different research groups to devise classification schemes so the diversity could somehow be made sense of[10, 20, 21, 26, 31, 35, 38, 43–45, 74]. The various classification schemes have contradicted one another on various aspects, even though each had some merits of its own. The complicated cas gene nomenclature also presented a steep learning curve for anyone new to the field.

The problem arose because of the diversity of the cas genes and because of the small number of bacterial and archaeal genomes sequenced back in 2005, which made it impossible to bridge the gaps between diverse members of the same gene family. Consequently such members were presumed to comprise separate gene families. This was what led to an overestimate of 45 cas gene families[26], with each family of CRISPR systems carrying several of their own specific genes. In addition to this, bioinformatic predictions were made for the functions of the various cas genes and models were proposed for the mode of action of the CRISPR system as a whole[43]. Many such predictions later turned out to be wrong, wasting the efforts of researchers who in the meantime had been basing their experiments on them.

Despite the murky history, it seems that cas gene nomenclature is finally nearing maturation (due in part to insights provided by experimental findings). The numerous “subtype specific” genes which were a key part of the previous nomenclature have been practically eliminated and replaced by the universally present cas7 and cas8 genes. Without dwelling on the old nomenclature, our latest knowledge of the cas genes is presented in Table 5.

The aCas, pCas and iCas genes are normally found together and adjacent to CRISPR arrays as shown in Figure 7. Together, these components comprise CRISPR/Cas loci. [Footnote: In this thesis the term CRISPR/Cas locus is defined to refer to genomic loci composed of adjacent cas genes and CRISPR arrays.] Additionally, iCmr modules, which are sometimes located elsewhere on the genome, encode a different type of interference complex which also uses crRNA to target invader nucleic acid, albeit in a manner which is different from the iCas complex. [Footnote: The term iCmr (interference Cas module: RAMP) covers all systems referred to elsewhere as Type III, including Type III-A and B[45], as well as any other uncharacterised modules which show the defining characteristics of encoding cas10, a small subunit gene, and a variable number of genes encoding RAMP superfamily proteins.] Key differences include that iCmr complexes are not dependent on PAMs (Figure 7), and that some are able to target RNA, whereas iCas complexes are only known to target dsDNA. A recent study proposed that the iCas (also called Type I) and iCmr (also called Type III) systems are evolutionarily related even though their protein subunits have diverged considerably[44].

Type   Gene         Description/Function

aCas   cas1         required for spacer acquisition[84]
aCas   cas2         required for spacer acquisition[84]
aCas   cas4         unknown, part of adaptation complex[56]
aCas   csa1         homolog of cas4, part of aCas complex[56]

pCas   cas6         crRNA processing[12]

iCas   cas3         target unwinding and cleavage[79]
iCas   cas5         unknown, part of interference complex[80]
iCas   cas7         crRNA binding backbone[39, 80]
iCas   cas8         PAM recognition during interference[52]
iCas   csa5/cse2    R-loop stabilisation[78]
iCas   cas3”        separate nuclease domain of cas3

iCmr   cas10        large subunit of iCmr complex[86]
iCmr   cmr5/csm2    small subunit of iCmr complex
iCmr   RAMP genes   RNA binding[27], related to cas5 and cas7[44]. There are 3 to 6 RAMP genes in a module[21]

Table 5: Overview of cas genes and the functions of their proteins. Genes from the Bacteria-specific Type II system are not presented here. Shaded cells indicate genes which are non-universal and only found in a minority of systems. In some systems, the helicase and nuclease domains of cas3 are encoded as two separate genes termed cas3’ and cas3”.

Both iCas and iCmr modules are diverse, while the aCas modules are very conserved. Thus both types of interference modules can be further divided into a number of subtypes which have key differences. So far iCmr systems have been divided into two major types, A and B[45], although there has long been evidence for at least two additional types, referred to in this thesis as C and D (MTH326-like and ST0012-like respectively[43]). As for iCas systems, these do not yet have a separate formal classification. Instead, iCas and aCas systems together have been divided into types A through F[45]. However, the boundaries between A, B and C are ill-defined, due in part to inconsistencies arising from a lack of distinction between aCas and iCas. Ideally the classification of aCas and iCas should remain separate, since the two modules are functionally independent and sometimes exchange partners (Publication 5.10[21]). Nevertheless, this classification scheme is widely accepted and will also be used in parts of the thesis. Many archaeal genomes contain multiple CRISPR/Cas loci and iCmr modules of different subtypes, possibly adding to the versatility of the immune response (see Publication 5.3[38]).
Also, most subtypes of iCmr and iCas are distributed both amongst archaea and bacteria, although biases exist, with some systems being more prevalent in archaea and others in bacteria.

2 RESULTS

This chapter contains a brief overview of the results of this PhD study with references to the relevant publications included in Chapter 5. Also, results which have not made their way into any publications yet are briefly presented. Four major areas were the subject of this PhD study, namely, sequence analysis of spacers, repeats, leaders and the genomic context of CRISPR arrays, including the cas genes. This chapter has been divided into sections accordingly. Analyses were carried out for Sulfolobales to begin with, and later extended, where possible, to all sequenced archaeal genomes. [Footnote: still work in progress.]

2.1 Spacers

Most of the work with sequence analysis of spacers was carried out in 2007 and 2008. Back then it was still not known whether the CRISPR system inhibited invader propagation by targeting their DNA or RNA. Also, the studies which had documented spacer matches to viruses and plasmids back then had only found few matches (see Section 1.4.2), so the focus here was to try to maximise the number of spacer matches using different strategies. Such strategies included:

• restriction of search database size (see Section 1.4.2 under ‘Spacers’)

• searching with translated sequences, which allows for the detection of more distant matches

• employing negative controls to estimate match significance [Footnote: the E-values reported by most sequence alignment programs become excessively inaccurate when sequences are very short.]

• employing sensitive Smith-Waterman alignments, instead of word-based alignment programs like BLAST

As a result of these approaches, up to 40% of the spacers from individual Sulfolobales species could be traced back to viruses and plasmids, and large parts of the viral and plasmid genomes were consequently covered by CRISPR spacer matches. This kind of data allows one to ask questions about the nature of the CRISPR adaptation and interference processes, as we did in Publications 5.2[68] and 5.3[38], or the significance of different parts of the viral or plasmid genome, as we did in Publications 5.1[75] and 5.4[62], or even about virus/plasmid host specificities, which we did in Publication 5.5[19].
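Two of the strategies listed above, Smith-Waterman alignment and negative controls, can be combined in a small sketch. The scoring values, the use of shuffled spacers as the negative control, and the function names are assumptions made for illustration; the thesis does not specify these details here, and the sketch searches one strand only.

```python
import random


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def empirical_p(spacer: str, target: str, n: int = 200, seed: int = 0) -> tuple[int, float]:
    """Score the spacer, then compare against n shuffled copies of it.

    Shuffling keeps length and base composition, so the control scores show
    what a sequence of this size achieves against the target by chance.
    """
    rng = random.Random(seed)
    observed = sw_score(spacer, target)
    hits = 0
    for _ in range(n):
        letters = list(spacer)
        rng.shuffle(letters)
        if sw_score("".join(letters), target) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n + 1)


# Invented 30 nt spacer embedded in an invented "viral" target.
spacer = "ACGGTCATTGCAAGCTTAGGCATCGATACC"
target = "TTTTGACCA" + spacer + "GGATCTTTAA"
observed, p_value = empirical_p(spacer, target)
print(observed, p_value)
```

Because the significance estimate comes from the controls rather than from an analytical E-value, it stays meaningful for sequences as short as spacers, which is the motivation given in the footnote above.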

2.1.1 A note on nucleotide vs. amino acid spacer matches

Because DNA sequences diverge so much more rapidly than protein sequences, nucleotide-level spacer matches and amino acid-level spacer matches have different implications. Whereas a nucleotide-level spacer match can reveal exactly which virus the spacer came from, an amino acid-level match can only say something about which family of viruses the spacer arose from. This is because different members of a viral or plasmid family may have nothing in common on the DNA sequence level, while still encoding many similar proteins. Furthermore, the proteins which are highly conserved within a viral family will yield more significant spacer matches than proteins which are less conserved, despite the underlying spacer uptake process being essentially random. The most conserved viral or plasmid proteins will also, often, be the ones most important for the life cycle of the extrachromosomal element. So, in summary, nucleotide-level spacer matches are useful for determining the exact identity of viruses and plasmids because they act as a fingerprint, whereas the number of amino acid-level matches against viral/plasmid proteins can give hints about the importance of such proteins and the parts of the genome where they are encoded.
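The difference between the two levels of matching can be made concrete with a short sketch. It uses the standard genetic code and two invented 15 nt sequences that differ only by synonymous codon changes; the sequences and function names are hypothetical and only the first reading frame is translated.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate frame 0 of a DNA string, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented fragments from two hypothetical members of one viral family.
dna_a = "GCTAAAGAACTGTCT"
dna_b = "GCGAAGGAGTTAAGC"

print("nucleotide identity:", identity(dna_a, dna_b))
print("protein identity:   ", identity(translate(dna_a), translate(dna_b)))
```

Here fewer than half of the nucleotide positions agree, yet both fragments encode the same peptide, which is why a translated search can still link a spacer to a viral family when a nucleotide search cannot.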

2.2 Repeats

Most archaeal genomes contain multiple CRISPR arrays. Sometimes such arrays in a genome all have identical or nearly identical repeats, but it is just as common to find multiple divergent array types simultaneously present in the same host genome. To gain an overview of the repeat types present in Sulfolobales, they were clustered based on their sequence, and initially three distinct families were found. This led us to ask whether certain repeat families were always accompanied by certain families of cas genes, leaders, and PAM motifs, and what we found was that this was indeed the case (as presented in Publications 5.3[38] and 5.6[69]). This result was contrary to a previous study[36] which had found no clear-cut correlations between repeat families and cas gene families, in Archaea especially. However, that study was based on an error-prone automatic approach to determining the orientation of the CRISPR array, which could have led to wrongly defined repeat families.

Towards the end of this PhD study, we carried out a manual annotation of all (detectable) CRISPR arrays in more than a hundred archaeal genomes. This approach, although tedious, guaranteed the correct orientation of the arrays, with the added benefit of obtaining the leader sequences accompanying the arrays. The repeats were clustered using similarity scores from Needleman-Wunsch alignments coupled with Markov clustering, and the resulting clusters were compared with corresponding clusters for leaders and cas genes. Although this is still work in progress, it is already clear that the correlations between repeats, leaders and cas gene families when looking at all archaea do not turn out as clear as what we found when we looked at the Sulfolobales alone (Publication 5.3[38]). There are a number of cases where a given repeat family is split between different families of cas genes and vice versa.
Although the reason for this is still an open question, I believe that the small size of the repeat sequence does not contain enough information to yield well-defined clusters when too many repeat sequences are analysed at the same time. Thus the anomaly observed may be a result of a methodological constraint, rather than being a reflection of something biological.
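The clustering step can be sketched as follows. The Needleman-Wunsch scoring is standard, but the clustering shown here is plain single-linkage thresholding, swapped in as a simple stand-in for the Markov clustering actually used; the scoring values, the threshold and the repeat sequences are all invented for illustration.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def cluster(seqs: dict[str, str], min_score: int) -> list[list[str]]:
    """Single-linkage clusters: join any pair scoring at least min_score."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if nw_score(seqs[x], seqs[y]) >= min_score:
                parent[find(x)] = find(y)

    groups: dict[str, list[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


r1 = "GATTAATCCCAAAAGGAATTGAAAG"          # invented 25 bp repeat
r2 = "GATTAATCCTAAAAGGAATTGAAAG"          # same repeat with one substitution
repeats = {"r1": r1, "r2": r2, "r1_flipped": revcomp(r1)}

clusters = cluster(repeats, min_score=15)
print(clusters)
```

The flipped copy of the first repeat does not cluster with its own forward copy, which illustrates why array orientation has to be fixed before repeat families are defined.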

2.3 Leaders

In an early attempt to define some universally conserved sequence element within leader sequences to aid in their identification, we gathered leaders from Sulfolobales genomes and constructed multiple sequence alignments. We found that leaders were poorly conserved and that large parts of the multiple sequence alignments consisted of gaps, indicating that the sequences were unalignable.

Assuming that a leader in the cell acts as an assembly point for various Cas proteins, transcriptional regulators as well as RNA polymerase, we imagined that it might consist of a series of sequence motifs which act as binding sites for the different proteins and complexes. If this were the case, the actual order in which the sequence motifs appear throughout the length of the leader may be less important, as long as the sequence motifs are there. Following this line of thought, it wasn’t surprising that the different leaders were unalignable, since motifs, although conserved, could be spread throughout the sequence in an unconserved order. This prompted us to analyse the leaders using motif analysis, which in turn enabled us to identify similar leader sequences in divergent species of Sulfolobales. The results are presented in Publication 5.3[38].

When dwelling on the results from this analysis it becomes clear that the order in which motifs appear is mostly conserved, even though most of the motifs are family specific. Thus viewing leaders as we initially did, i.e. as long conserved stretches of sequence, was not invalid, and traditional sequence alignments should be suitable as long as leaders from the same family are compared.

This is attractive because identifying which family a genomic CRISPR system belongs to has always been a challenge. [Footnote: Anyone who has tried it knows this.] Repeat sequences are too small, and thus contain too little information to unambiguously pinpoint which family the CRISPR array belongs to.
There are multiple cas genes, and different cas genes can give different results (Publication 5.6[69]). Also, cas genes are “too conserved” so to speak, with many cas genes showing a high degree of sequence similarity across divergent CRISPR systems. If leaders within the same CRISPR family are easily alignable while being unalignable across families, they present an ideal tool for defining CRISPR families as well as easing their subsequent identification.

Building on this, after manual annotation of the leaders in all archaeal genomes, we performed sequence clustering based on traditional Smith-Waterman alignments and Markov clustering. The clusters which emerged correlated consistently with certain families of cas genes and repeat families, giving the first indications that the exercise had been successful. However, there was a marked tendency for the leaders to “over-cluster”, meaning that leaders from clearly similar CRISPR systems (judging from cas genes and repeats) would fall into separate unalignable clusters. Thus the leader sequences may be too divergent and “noisy” to yield ideal clustering. [Footnote: Divergence alone may not be the problem. If many positions along the length of the leader are subject to neutral drift, the sequence will diverge in a random manner and thus won’t contribute to accurate clustering.]

2.4 cas genes and the genomic context of CRISPR arrays

Back in 2010, before the new cas gene classification[45], the classification schemes in use[26, 43] were old and outdated. Many new genomes had been sequenced which revealed new major families of CRISPR systems (such as the cyanobacterial type) not included in the old schemes. In a collaboration with Gisle A. Vestergaard, we decided to attempt to devise an updated classification scheme, based on all available archaeal genomes at the time, that could serve as a basis for an updated cas gene nomenclature.

The project involved combining automatic and manual genome sequence analysis approaches. The automatic approach involved first identifying key cas genes (cas1, cas7 and cas10) from over a hundred archaeal genomes. Ten flanking genes in either direction of the key genes (resulting in around 5000 genes from all genomes) were pooled and clustered, and each cluster was analysed extensively. Furthermore, all the genomes were manually inspected near the cas gene cassettes and CRISPR arrays to define operons and modules, and this was compared to the gene clusters from the automatic analysis to define which gene clusters co-occurred in distinct modules.

This project, although still unfinished, has provided key insights which I will outline throughout the remainder of this chapter and the next. These insights include:

• CRISPR systems are located within genomic hyper-variable regions and major differences in CRISPR/Cas locus content are observed even between closely related organisms (Publication 5.6). Genes neighbouring CRISPR systems include mobilome genes, such as transposable elements and toxin-antitoxin loci, which appear to modulate the mobility of the systems in and out of the host genomes (Publications 5.7[25], 5.9[85] and 5.12).

• cas gene cassettes are best viewed as consisting of a combination of functional modules responsible for adaptation, processing and interference (aCas, pCas, iCas and iCmr). The functional modules are exchangeable and are found in different combinations. This also means that their classification should be kept separate (see Publication 5.10[21]).

• iCmr modules consist of the known core components, i. e. a gene encoding the large subunit (cas10) and one encoding the small subunit (cmr5), plus 3 to 6 genes encoding RAMP superfamily proteins (see Publication 5.8[20]). However, a long list of additional genes is found occasionally, possibly extending the functional capabilities of the iCmr complex. This unpublished result is summarised below.

2.4.1 Evidence for iCmr accessory genes

The flanking gene and clustering analysis outlined above revealed that certain families of genes were over-represented near iCmr modules. These genes have the following characteristics:

• The genes are sometimes found as part of the iCmr modules, and in other cases immediately flanking the modules.
• They are never found associated with anything other than iCmr modules, save for a few exceptions.
• They are extremely diverse; many carry ATPase and nucleotide binding motifs.
• Some of them consistently occur as operons of two or three genes, sometimes with their own transcriptional regulator.
• Some of the genes are not widely distributed taxonomically, and are only associated with certain subfamilies of iCmr modules.
• Others show the opposite trend. They are widely distributed taxonomically, and associate themselves with disparate iCmr families.¹
• Multiple genes can be found near the same iCmr module. In some cases the number of accessory genes exceeds the number of core iCmr genes.

¹ In many cases the same gene is seen associated with iCmr types A, B, and D. Also, very similar genes are found associated with iCmr modules in diverse organisms within Crenarchaea, Euryarchaea, Korarchaeota as well as Aigarchaea.

A few of these gene families have been identified as CRISPR-associated in previous studies, such as csx1 through 6[26], and most recently herA and nurA[6], but the exclusive link with iCmr modules has not been made in published literature.

Table 6 provides an overview of some of the accessory iCmr genes found. Figure 8 shows gene maps of iCmr modules (red), where the accessory genes have been mapped (purple). In conclusion it seems that the minimal core of an iCmr complex is functionally extendable by sets of optional accessory proteins.

Name    Rank  Count  Pfam/CDD            Example locus
Csx1a   18    40     Cas_DxTHG/Csx1      PH0168
Csm6    30    17     regulator           SSO 1393
CmrA    39    12     DUF87/ATPase        Cmaq 1512
Csx6a   44    10     Cas_NE0113/Csx1     Ahos 0357
CmrB    54    9      NurA                Vdis 1157
CmrC    60    8      AAA-ATPase          Tneu 0569
Csx6b   61    8      Cas_NE0113/Csx1     PTO0065
CmrD    66    7      retroviral          M1627 1087
CmrE    67    7      AAA-ATPase          Vdis 1155
Csx1b   70    7      Cas_DxTHG/Csx1      SiRe 0884
Cmr7    87    5      GSH_synth_ATP       SSO 1986
CmrF    93    5      ?                   Tneu 1153
Csx4a   95    5      Cas_Cas02710        Maeo 1077
CmrG    114   4      AAA-ATPase          P186 0965
CmrH    124   4      ?                   Pcal 0278
Csx4b   134   3      Cas_Cas02710/Csx1   MTH 1076
Csx3    139   3      csx3                PYCH 08020
Csx4c   165   3      Cas_Cas02710        PYCH 07970
CmrI    199   3      ?                   VMUT 1493
Csx1c   221   3      Cas_DxTHG/Csx1      Metvu 1145
Csx2    222   3      TM1812              Metvu 1146
cmrJ    1311  2      ?                   Tneu 0570
cmrK    1122  2      ?                   Pcal 0268

Table 6: Accessory iCmr genes found by clustering flanking genes. The pool of around 5000 genes flanking key cas genes in 125 archaeal genomes was gathered into gene families (clusters) based on sequence similarity. Clusters were ranked according to their count (i. e. the number of members within them). Some clusters were found exclusively near iCmr modules, even though they were not core iCmr genes as such. These are the clusters shown in this table. The ones named with the ‘Csx’ prefix have been noted in previous bioinformatic studies[26] and Cmr7 was characterised in an experimental study[86], whereas the ones found in this study have been named, tentatively, ‘CmrA’ to ‘K’. CmrA and B have just been described in a recent bioinformatical study as the putative novel cas genes herA and nurA, a helicase and a nuclease respectively[6]. Protein sequences were run against the Pfam[61] and Conserved Domain[46] databases and their closest matches are recorded in the table.
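The ranking and filtering described in the caption of Table 6 can be illustrated with a short sketch. Everything in it is assumed for illustration: the input format (one `(cluster_id, neighbouring_module)` pair per pooled gene), the function name and the toy data. The strict "exclusively near iCmr" test is also a simplification, since the text notes that a few exceptions exist, and core iCmr genes would still have to be excluded separately.

```python
from collections import defaultdict


def rank_clusters(assignments):
    """Rank gene clusters by member count and flag iCmr-exclusive ones.

    assignments  iterable of (cluster_id, module) pairs, one per gene, where
                 module names the kind of module the gene was found next to
                 (hypothetical labels such as "iCmr" or "iCas").
    """
    members = defaultdict(list)
    for cluster_id, module in assignments:
        members[cluster_id].append(module)

    # Largest cluster gets rank 1; ties are broken by cluster id.
    ordered = sorted(members.items(), key=lambda item: (-len(item[1]), item[0]))
    table = []
    for rank, (cluster_id, modules) in enumerate(ordered, start=1):
        table.append({
            "cluster": cluster_id,
            "rank": rank,
            "count": len(modules),
            "icmr_only": all(module == "iCmr" for module in modules),
        })
    return table


toy_assignments = (
    [("c1", "iCas")] * 5
    + [("c2", "iCmr")] * 3
    + [("c3", "iCmr"), ("c3", "iCas")]
)
ranked = rank_clusters(toy_assignments)
accessory_candidates = [row["cluster"] for row in ranked if row["icmr_only"]]
```

In the toy data only cluster `c2` is reported as a candidate, because `c3` also occurs next to an iCas module.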

Figure 8: Gene maps (A through H) showing examples of iCmr modules in various archaea, which are associated with the accessory genes. The type of iCmr system (A, B or D) is also indicated, along with the type of iCas if present. Accessory genes listed in Table 6 seem to associate themselves with iCmr modules regardless of their type (A, B or D). For illustration, most of the examples shown here are of iCmr modules with multiple accessory genes. However, many iCmr modules have no accessory genes associated, while most modules have a single or two.

3

DISCUSSION & PERSPECTIVES

Although experimental studies resolving the mechanistic details of the Cas interference complexes have started to gain momentum, with more and more articles being published each month, there are still many holes in our understanding of key parts of the interference process. Despite this, there are some marked differences which are already established between iCas and iCmr, such as the manner in which self vs. non-self nucleic acid is distinguished¹, or the species of mature crRNA (long[10] vs. short[27]) utilised by either system.

¹ With PAMs[49] as opposed to base-pairing[48], respectively.

In addition to such specific differences, a deeper divergence between the two systems is becoming increasingly apparent. For the iCas protein complex, the findings[32, 67] first made for the system in E. coli (Type E) are, remarkably, being rediscovered in the diverse iCas systems (types A, D and F) of Sulfolobus, Bacillus and Pseudomonas[39, 53, 81]. As for iCmr, the opposite seems to be the case. Here we see surprising mechanistic diversity despite the apparent homology between the systems. E. g. the Type A iCmr system of Staphylococcus targets DNA[47], and while two different Type B systems, in Pyrococcus[27] and Sulfolobus[86] respectively, both target RNA, they do so in ways which are very different². Furthermore, there is now evidence for another Sulfolobus Type B iCmr system targeting DNA (Deng et al., under revision), obscuring the picture even further.

² Both types utilise crRNA to target complementary ssRNA, but while the former always cleaves the target RNA at a fixed position employing some kind of ruler-mechanism, the latter cleaves the target in a sequence specific manner at each ‘UA’ dinucleotide encountered.

This tendency for iCas systems being mechanistically conserved, and iCmr systems exhibiting diversity, is also reflected on the genomic level. Although sequences of the individual iCas genes have diverged considerably between the subtypes, some even beyond recognition, the overall gene composition is constant, with cas3, cas5, cas7 and cas8 comprising a universal core. iCmr modules on the other hand vary with regard to the content of RAMP genes depending on their being types A, B, C or D. Also, and perhaps more importantly, very similar iCmr modules are sometimes seen accompanied by different combinations of accessory genes (Section 2.4.1) which encode proteins that may be responsible for modifying the core functionality of the iCmr complex, possibly accounting for the mechanistic diversity so far observed. In line with this, it has been found that the DNA targeting activity of the Sulfolobus Type B system mentioned above is abolished when the accessory csx1b gene is deleted. The same activity is restored upon complementation of the gene (Deng et al., under revision). In a similar fashion, possibly, the difference seen in RNA cleavage activity between the Sulfolobus and Pyrococcus RNA-targeting Type B complexes is a result of the former complex carrying an additional protein encoded by the accessory gene cmr7[86], whereas the latter complex carried no additional proteins[27].

Despite this observed diversity between the iCmr modules, they still seem to share a deep link. This is generally evident from their defining features such as their signature gene composition¹, their mobile nature[69], or their tendency to be located away from aCas/iCas cassettes. Most strikingly however, the link between them is apparent from their ability to share and exchange accessory genes. Apart from the widespread exchange of csx1 (Figure 9), cmrA and cmrB also exchange between widely different types of iCmr modules (see Figure 8 for one example of this). cmrE and cmrI only exchange between Type A modules, and cmr7, encoding an apparent structural protein[86], also exchanges, although only between some subtypes of Type B systems in Sulfolobales.

¹ An iCmr module is defined as carrying a cas10 gene, a small subunit gene, and a variable number of genes encoding members of the RAMP superfamily. The various iCmr types (A, B, C and D so far) show differences in the number and nature of the RAMP genes they carry.

The presence of accessory genes and their ability to exchange tells us some important things about the nature of the iCmr systems in general. Firstly, iCmr must confer a platform which has to be versatile enough to accommodate and benefit from diverse accessory proteins. Secondly, this platform has to be conserved enough for accessory proteins to be shared between different types of iCmr systems. Thirdly, the iCmr complex must have a core functionality which is beneficial enough to the cell on its own to account for the widespread occurrence of “bareback” iCmr modules. And finally, the accessory genes must provide added functionality to the core complex which is significant enough to account for their widespread presence as well.

The picture this paints of the iCmr system is quite different from what we know about iCas. Presumably, iCas systems are designed from the bottom up, so to speak, to target dsDNA in that they encode their own helicase and HD nuclease. Since, probably, dsDNA is the most widespread type of invader nucleic acid, this setup has a proven track record and isn’t in need of any added functionality. This may have allowed for the streamlining of the iCas systems with their operons being considerably smaller than iCmr operons, which in turn would allow for their efficient packaging into fully functional CRISPR/Cas loci.

Figure 9: Phylogenetic tree showing the csx1 genes present in Sulfolobales genomes. csx1 from Pyrococcus furiosus is also included for reference. The branch length corresponding to a 5% change in amino acid sequence is indicated by the bar in the upper left. The type of iCmr module associated with each gene is also indicated in the column to the right. There is no correlation between the phylogeny of csx1 and the iCmr type, indicating widespread and frequent exchange. The same pattern is seen for other archaeal genomes analysed. [Adapted from Deng et al., under revision]

iCmr systems, on the other hand, are more open-ended, functionally. The iCmr complex contains the core functionality of targeting single stranded nucleic acid which is complementary to the crRNA it carries. Although basic, this functionality may be directed towards a host of purposes depending on the accessory proteins present. Conceivably, the single-stranded targeting capability may be harnessed for cleavage of dsDNA substrates by the simple presence of a helicase and a nuclease such as the cmrAB gene pair. Other uses might include the silencing of transcripts from integrated proviruses, targeting of single stranded invader replication intermediates, targeting of non-dsDNA invaders, or even the regulation of the host’s own genes.

4

CONCLUSION

During this Ph. D. study the CRISPR systems in Sulfolobales have been extensively characterised using a bioinformatical genome sequence analysis approach. Later the analyses were extended to all available archaeal genomes. The latter work is still in the process of being concluded. To summarise:

• Sulfolobales CRISPR spacer sequences were analysed to find matches to the large number of sequenced Sulfolobales extrachromosomal elements. The matches obtained were used to successfully predict the target nucleic acid of the CRISPR system, back when the target nucleic acid was not known.

• Analysis of Sulfolobales CRISPR repeats, spacers, leaders and cas genes revealed that CRISPR systems exist in families, where leader types, repeat types, cas gene types and PAM motifs go hand in hand. At the time, this had not been shown for any other organism.

• Analysis of the genomic contexts of Sulfolobales CRISPR/Cas loci revealed that the systems are located in genomic hyper-variable regions and subject to frequent horizontal gene transfer, where transposable elements and toxin-antitoxin loci play a role in modulating their mobility.

• Extending the analyses to other archaea outside Sulfolobales revealed that the CRISPR interference modules (iCas and iCmr) in particular are very diverse, while the adaptation modules (aCas) are remarkably conserved. Individual modules were also seen interchanging, giving rise to CRISPR/Cas loci with different combinations of functional modules.

• iCmr modules of different types were found to be associated with a rich array of various accessory genes which were also found to exchange between different types of iCmr modules. It was hypothesised that these accessory genes extend the core functionality of the iCmr modules, e. g. by conferring the ability to switch target nucleic acid.

These findings underline that CRISPR systems are not only diverse per se, but also that they are configured in a manner that allows for the constant generation of new diversity, with modules recombining and individual genes coming and going. This dynamic nature of the CRISPR immune systems may be a prerequisite for their continued efficacy against the ever changing threats which they must protect their hosts against.

Although many interesting insights have been gained throughout this study, bioinformatics has obvious limitations, and the potential to gain fundamental insights by sequence analysis alone may be exhausted. However, computational studies can still play a role which is supplementary to experimental studies that address many of the open questions which still exist.

5

PUBLICATIONS

The publications resulting from this Ph. D. study are included in this chapter in chronological order. As most of the publications here have multiple authors with varying contributions, I have rated my own level of contribution to each publication as either ‘major’, ‘substantial’ or ‘minor’. ‘Major’ means that the majority of the work behind the publication was carried out by myself. ‘Substantial’ means that my contribution comprised a smaller but crucial part of the manuscript, while ‘minor’ means that my contribution was small and non-crucial to that particular manuscript, although still a part of my own Ph. D. project. In addition to my level of contribution, the exact nature of my contribution is also stipulated.

A note on iCmr family nomenclature

In conformance with the recent update of cas gene nomenclature[45], the iCmr families referred to in publications 5.7[25], 5.8[20] and 5.10[21] as ‘B’ and ‘C’ are now merged into ‘B’, while ‘E’ is now ‘A’, ‘A’ is now ‘C’, and ‘D’ remains ‘D’. The new nomenclature is used throughout this thesis and in publications to come, whereas the publications listed above contain the old nomenclature. So in summary:

in thesis   in [25], [20] and [21]
A           E
B           B and C
C           A
D           D
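For readers cross-referencing the older publications, the renaming summarised above can be written as a small lookup table. This is merely a convenience sketch; the names `OLD_TO_NEW_ICMR` and `to_thesis_nomenclature` are hypothetical and do not come from the thesis or the publications.

```python
# Old iCmr family letter (as used in [25], [20] and [21]) -> letter used in this thesis.
OLD_TO_NEW_ICMR = {
    "A": "C",
    "B": "B",
    "C": "B",  # old families B and C are merged into B
    "D": "D",
    "E": "A",
}


def to_thesis_nomenclature(old_family):
    """Translate an old iCmr family letter into the updated nomenclature."""
    return OLD_TO_NEW_ICMR[old_family]


# Reverse view: which old letters does each new family cover?
NEW_TO_OLD_ICMR = {}
for old, new in OLD_TO_NEW_ICMR.items():
    NEW_TO_OLD_ICMR.setdefault(new, []).append(old)
```

Because two old families map onto one new family, the reverse lookup returns a list of letters for each new family.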


￿.￿ ￿￿￿￿￿ ￿

Contribution: minor

My contribution to this paper was limited to the mapping of matching Sulfolobales CRISPR spacers onto the four rudiviral genomes ARV, SRV, SIRV1 and SIRV2, preparing Figure 6, interpreting the data and writing the paragraphs pertaining to this work under the results, discussion and methods sections.

JOURNAL OF BACTERIOLOGY, Oct. 2008, p. 6837–6845, Vol. 190, No. 20. 0021-9193/08/$08.00+0 doi:10.1128/JB.00795-08. Copyright © 2008, American Society for Microbiology. All Rights Reserved.

Stygiolobus Rod-Shaped Virus and the Interplay of Crenarchaeal Rudiviruses with the CRISPR Antiviral Systemᰔ†

Gisle Vestergaard,1 Shiraz A. Shah,1 Ariane Bize,2 Werner Reitberger,3 Monika Reuter,3 Hien Phan,1 Ariane Briegel,4 Reinhard Rachel,3 Roger A. Garrett,1 and David Prangishvili2*

Danish Archaea Centre and Centre for Comparative Genomics, Department of Biology, Copenhagen University, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark1; Molecular Biology of the Gene in Extremophiles Unit, Institut Pasteur, rue Dr. Roux 25, 75724 Paris Cedex 15, France2; Department of Microbiology, University of Regensburg, Universitätsstrasse 31, D-93053 Regensburg, Germany3; and Max-Planck-Institut of Biochemistry, Molecular Structural Biology, Am Klopferspitz 21, D-82152 Martinsried, Germany4

Received 6 June 2008/Accepted 11 August 2008

A newly characterized archaeal rudivirus Stygiolobus rod-shaped virus (SRV), which infects a hyperthermophilic Stygiolobus species, was isolated from a hot spring in the Azores, Portugal. Its virions are rod-shaped, 702 (± 50) by 22 (± 3) nm in size, and nonenveloped and carry three tail fibers at each terminus. The linear double-stranded DNA genome contains 28,096 bp and an inverted terminal repeat of 1,030 bp. The SRV shows morphological and genomic similarities to the other characterized rudiviruses Sulfolobus rod-shaped virus 1 (SIRV1), SIRV2, and Acidianus rod-shaped virus 1, isolated from hot acidic springs of Iceland and Italy. The single major rudiviral structural protein is shown to generate long tubular structures in vitro of similar dimensions to those of the virion, and we estimate that the virion constitutes a single, superhelical, double-stranded DNA embedded into such a protein structure. Three additional minor conserved structural proteins are also identified. Ubiquitous rudiviral proteins with assigned functions include glycosyl transferases and a S-adenosylmethionine-dependent methyltransferase, as well as a Holliday junction resolvase, a transcriptionally coupled helicase and nuclease implicated in DNA replication. Analysis of matches between known crenarchaeal chromosomal CRISPR spacer sequences, implicated in a viral defense system, and rudiviral genomes revealed that about 10% of the 3,042 unique acidothermophile spacers yield significant matches to rudiviral genomes, with a bias to highly conserved protein genes, consistent with the widespread presence of rudiviruses in hot acidophilic environments. We propose that the 12-bp indels which are commonly found in conserved rudiviral protein genes may be generated as a reaction to the presence of the host CRISPR defense system.

Viruses of the hyperthermophilic crenarchaea are extremely diverse in their morphotypes and in the properties of their double-stranded DNA (dsDNA) genomes (reviewed in references 19 and 23). Moreover, some of the virion morphotypes are unique for dsDNA viruses from any domain of life. Many of these viruses have been classified into seven new families that include rod-shaped rudiviruses, filamentous lipothrixviruses, spindle-shaped fuselloviruses, and a bottle-shaped ampullavirus (reviewed in reference 24). The bicaudavirus Acidianus two-tailed virus (ATV) exhibits an exceptional two-tailed morphology and the unique viral property of developing long tail-like appendages independently of the host cell (11). Crenarchaeal viral research is still at an early stage of development, and insights into basic molecular processes, including infection, replication, packaging, and virus-host interactions, are limited. One of the main reasons for this lies in the high proportion of predicted genes with unknown functions (25).

At present, viruses of the family Rudiviridae are the most promising for detailed studies because they can be obtained in reasonable yields, and there are already some insights into their mechanisms of replication, transcriptional regulation, and host cell adaptation (4, 12, 13, 20, 21). To date, three rudiviruses have been characterized, all from the order Sulfolobales: the closely related Sulfolobus rod-shaped virus 1 (SIRV1), and SIRV2, isolated on Iceland, which infect strains of Sulfolobus islandicus (20, 22), and Acidianus rod-shaped virus 1 (ARV1), isolated at Pozzuoli, Italy, which propagates in Acidianus strains (34). Moreover, rudivirus-like morphotypes and partial rudiviral genome sequences have been detected in environmental samples collected from both acidic and neutrophilic hot aquatic sites (27, 29, 32).

All rudiviral genomes carry linear dsDNA genomes with long inverted terminal repeats (ITRs) ending in covalently closed hairpin structures with 5′-to-3′ linkages (4, 20). The terminal structure is important for replication, which presumably is initiated by site-specific single-strand nicking within the ITR, with the subsequent formation of head-to-head and tail-to-tail intermediates, and the conversion of genomic concatemers into monomers by a virus-encoded Holliday junction resolvase (20). This basic replication mechanism appears to be similar to that used by the eukaryal poxviruses, Chlorella virus and African swine fever virus, although there is no clear similarity between the sequences of the implicated archaeal and eukaryal proteins (20, 25).

The transcriptional patterns of rudiviruses SIRV1 and SIRV2 are relatively simple, with few temporal expression differences. An exception is the gene encoding the major structural protein that binds to DNA and, at an early stage of infection, is expressed as a polycistronic mRNA but appears as a single gene transcript close to the eclipse period (12). It has also been shown that rudiviral transcription can be activated by a Sulfolobus host-encoded protein, Sta1, that interacts specifically with TATA-like promoter motifs in the viral genome (13).

For SIRV1, a detailed study of the mechanism of adaptation to foreign hosts was conducted. Upon passage of the virus through closely related S. islandicus strains, complex changes were detected that were concentrated within six genomic regions (21, 22). These changes included insertions, deletions, gene duplications, inversions, and transpositions, as well as changes in gene sizes that often involved the insertion or deletion of what appeared to be “12-bp elements.” It was concluded that the virus generated a complex mixture of variants, one or more of which were preferentially propagated when the virus entered a new host (21).

Here we describe a novel rudivirus, Stygiolobus rod-shaped virus (SRV), isolated from the Azores, Portugal, a location geographically distant from the locations of the other characterized rudiviruses (20, 34). SRV shows sufficient differences from the other rudiviruses, both morphologically and genomically, to warrant its classification as a novel species. The structural and genomic properties of the rudiviruses are compared and contrasted, and new data on the conserved virion structural proteins are presented. Different rudiviruses were selected for these studies on the basis of the virion or protein yields that were obtained. Moreover, matches between the spacer regions of the crenarchaeal chromosomal CRISPR repeat clusters, which have been implicated in a viral defense system (18) involving processed RNA transcribed from one DNA strand (reviewed in references 16 and 17), and the rudiviral genomes are analyzed and their significance, and possible relationships to the 12-bp indels, are considered.

* Corresponding author. Mailing address: Molecular Biology of the Gene in Extremophiles Unit, Institut Pasteur, rue Dr. Roux 25, 75724 Paris Cedex 15, France. Phone: 33-(0)144-38-9119. Fax: 33-(0)145-68-8834. E-mail: [email protected].
† Supplemental material for this article may be found at http://jb.asm.org/.
ᰔ Published ahead of print on 22 August 2008.

MATERIALS AND METHODS

Enrichment culture, isolation of viral hosts, and virus purification. An environmental sample was taken from a hot acidic spring (93°C, pH 2) in the Furnas Basin on São Miguel Island, the Azores, Portugal. The aerobic enrichment culture was established from the environmental sample and maintained at 80°C under conditions described previously for cultivation of members of the Sulfolobales (35). Single strains were isolated by plating on Gelrite (Kelco, San Diego, CA) containing colloidal (35) and grown in the medium of the enrichment culture. Cell-free supernatants of cultures were analyzed by transmission electron microscopy for the presence of virus particles.

SRV was isolated from the growth culture of its host strain Stygiolobus sp., which was colony purified as described above. After cells were grown to the late exponential phase and harvested by low-speed centrifugation (Sorvall GS3 rotor) (4,500 rpm), virions were precipitated from the supernatant by adding NaCl (1 M) and polyethylene glycol 6000 (10% [wt/vol]) and maintaining the mixture at 4°C overnight. They were purified further by CsCl gradient centrifugation (34).

Transmission electron microscopy. Samples were deposited on carbon-coated copper grids, negatively stained with 2% uranyl acetate (pH 4.5), and examined in a CM12 transmission electron microscope (FEI, Eindhoven, The Netherlands) operated at 120 keV. The magnification was calibrated using catalase crystals negatively stained with uranyl acetate (28). Images were digitally recorded using a slow-scan charge-coupled-device camera connected to a PC running TVIPS software (TVIPS GmbH, Gauting, Germany). To some samples, 0.1% sodium dodecyl sulfate (SDS) was added, and those samples were maintained at 22°C for 30 min in order to study the stability of the virion particles. Electron tomography of intact, negatively stained virions was performed as described previously (10, 26). Visualization of the three-dimensional (3D) data was performed using Amira software (Visage Imaging, Fürth, Germany).

Protein analyses. Proteins of SRV were separated in 13.5% SDS–polyacrylamide gels (14) and stained with Coomassie brilliant blue R-250 (Serva, Heidelberg, Germany). N-terminal protein sequences were determined by Edman degradation using a Procise 492 protein sequencer (Applied Biosystems, Foster City, CA).

SIRV2 proteins were separated in 4 to 12% SDS–polyacrylamide NuPAGE gradient gels by the use of MES (morpholineethanesulfonic acid) buffer (both from Invitrogen, Paisley, United Kingdom). The gels were stained with Sypro Ruby (Invitrogen). Protein bands were analyzed by peptide mass fingerprinting with matrix-assisted laser desorption ionization–time of flight mass spectrometry using a Voyager DE-STR biospectrometry workstation (Applied Biosystems, Framingham, MA) as described earlier (26). The analysis was performed in conjunction with the proteomic platform at the Pasteur Institute.

Cloning and heterologous expression of ARV1-ORF134b and purification of the recombinant protein and its self-assembly. ARV1-ORF134b was amplified from purified viral DNA with primers ARV1ORF134F (GGAATTCCATATGATGGCGAAAGGACACACACC) and ARV1ORF134R (GGAATTCTCGAGACTTACGTATCCGTTAGGAC). The PCR product was purified (PCR purification kit; Roche, Mannheim, Germany) and cloned into pET30a expression vector (Novagen, Madison, WI) between restriction sites for EcoRI and XbaI. The protein was expressed overnight at 20°C in the Escherichia coli Rosetta(DE3)pLysS strain. Protein expression was controlled by SDS-polyacrylamide gel electrophoresis analysis and by performing a Western blot analysis using anti-His-tag-specific antibodies (Novagen). The native protein was purified on a Ni²⁺-nitrilotriacetic acid (Ni²⁺-NTA)-agarose column (Novagen) with elution buffers containing 50 to 500 mM imidazole. The accuracy of its sequence was confirmed. Self-assembly of the recombinant protein into filamentous structures was performed at 75°C and pH 3.5 and observed by electron microscopy.

Preparation of cellular and viral DNA and DNA sequencing. DNA was extracted from Stygiolobus azoricus cells as described previously (2), and the 16S rRNA gene was amplified by PCR using primers 8aF and 1512 uR (6) and sequenced.

Viral DNA was obtained by disrupting SRV particles with 1% SDS for 1 h at room temperature and extraction with phenol-chloroform (9). A shotgun library was prepared by sonicating viral DNA to generate fragments of 2 to 4 kb and cloning these into the SmaI site of the pUC18 vector. DNA was purified from single colonies by the use of a Biorobot 8000 workstation (Qiagen, Westburg, Germany) and sequenced in MegaBACE 1000 sequenators (Amersham Biotech, Amersham, United Kingdom). The viral sequence was assembled using Sequencher 4.2 software (Gene Code, Ann Arbor, MI). PCR primers for gap closing and resolving sequence ambiguities were designed using Primers for Mac, version 1.0. Sequence alignments were obtained using MUSCLE software (7). Open reading frames (ORFs) were defined with the help of ARTEMIS software (30) and investigated in searches using the EMBL and GenBank (1), 3D-Jury (8), and SMART (15) databases. Genome maps were generated and compared using Mutagen software, version 4.0 (5).

Bioinformatical matching of crenarchaeal CRISPR spacers to rudiviral genomes. CRISPRs were predicted for each of the 14 publicly available crenarchaeal genomes in GenBank (NC_000854 [Aeropyrum pernix K1], NC_002754 [Sulfolobus solfataricus P2], NC_003106 [Sulfolobus tokodaii strain 7], NC_003364 [Pyrobaculum aerophilum strain IM2], NC_007181 [Sulfolobus acidocaldarius DSM 639], NC_008698 [Thermofilum pendens Hrk5], NC_008701 [Pyrobaculum islandicum DSM 4184], NC_008818 [Hyperthermus butylicus DSM 5456], NC_009033 [Staphylothermus marinus F1], NC_009073 [Pyrobaculum calidifontis JCM 11548], NC_009376 [Pyrobaculum arsenaticum DSM 13514], NC_009440 [Metallosphaera sedula DSM 5348], NC_009676 [Cenarchaeum symbiosum], and NC_009776 [Ignicoccus hospitalis KIN4/I]). In addition, the six sequenced repeat clusters from Sulfolobus solfataricus P1 (16) were added to the data set as well as CRISPRs from five incomplete Sulfolobus islandicus genomes publicly available through the Joint Genome Institute (http://genome.jgi.doe.gov/mic_asmb.html) and unpublished genome sequences of Sulfolobus islandicus HVE10/4 and Acidianus brierleyi from the Copenhagen laboratory. The repeat cluster sequences were found using publicly available software (3, 7).

All predictions were curated manually. The orientation of each repeat cluster was inferred from the repeat sequence and by locating the low-complexity flanking sequence that generally resides immediately upstream from the cluster and contains the transcriptional leader (16). All unique spacer sequences of the

FIG. 1. Electron micrographs of SRV virions negatively stained with 3% uranyl acetate. (A) A full virion particle, with a discontinuous central line along the virion. (B) Six virions attached to liposome-like structures. (C) Enlargement of a portion of panel B displaying the terminal fibers. (D to H) Electron tomography images of an SRV virion. (D) Horizontal x-y slice (0.7 nm) showing the accumulated stain in the central part of the virion (white arrow). (E) Vertical y-z slice (0.7 nm) through the 3D data set of the reconstructed part of an SRV particle. (F) Visualization of the 3D data set using Amira software. (G and H) Vertical x-z slice (0.7 nm) through the tomogram showing that the virion particles are embedded in negative stain and that accumulated stain visible in panel D is absent from the plug (black arrows). Bars, 200 nm (A and B); 50 nm (C); 20 nm (D, E, G, and H).

repeat clusters, corresponding to the processed spacer transcript sequence (16), were aligned to the complete nucleotide sequences on each strand of all four rudiviral genomes (SRV [accession no. FM164764], SIRV1 [AJ414696], SIRV2 [AJ344259], and ARV1 [AJ875026]) by use of Paralign, an MMX-optimized implementation of the Smith-Waterman algorithm (31). Moreover, assuming that the spacer DNA can be incorporated into the oriented CRISPRs in either direction, we also translated the two strands of the spacer DNA into all the reading frames, yielding six amino acid sequences per spacer. Reading frames containing stop codons (ca. 50%) were omitted to make the subsequent search more specific. Each translation was aligned against the amino acid sequences of all the annotated ORFs in each of the four rudiviral genomes. Significant e-value cutoffs were determined for both the nucleotide and amino acid sequence searches using the genome sequence of Saccharomyces cerevisiae as a negative control (data not shown).

RESULTS

SRV isolation and structure. The virus-producing strain was colony purified from an enrichment culture established from a sample collected from an acidic hot spring in the Azores (see Materials and Methods). Its 16S rRNA sequence represented the genus Stygiolobus of the Sulfolobales crenarchaeal order and was closely related to that of Stygiolobus azoricus. However, it differs from S. azoricus, the type species of the genus, in its capacity to grow aerobically, and a description of the new species is in preparation. The virus particles produced constituted flexible rods 702 (± 50) by 22 (± 3) nm in size, with three short fibers at each terminus (Fig. 1A to C; Table 1). A Fourier analysis of the virion (not shown) revealed the presence of regular features with a periodicity of (4.2 nm)⁻¹, which probably reflect a helical subunit arrangement. This feature is also seen in the tomographic data set (Fig. 1D to H), which revealed more structural details. The helical arrangement in the virion core occurs in two different configurations. In the central region, a zigzag structure with dark contrast, probably arising from uranyl acetate staining, is surrounded by a protein shell (Fig. 1D and E). In contrast, in the terminal plug, which is

TABLE 1. Properties of the rudiviruses

Rudivirus  Origin    Virion length (nm)  Genome size (bp)  Total no. of ORFs  G+C (%)  ITR length (bp)  Reference
SRV        Azores    702                 28,097            37                 29.3     1,030
ARV1       Pozzuoli  610                 24,655            41                 39.1     1,365            34
SIRV1      Iceland   830                 32,308            45                 25.3     2,032            20
SIRV2      Iceland   900                 35,498            54                 25.2     1,626            20

6840 VESTERGAARD ET AL. J. BACTERIOL.

about 50 nm in length, a helically arranged protein mass, with no obvious uranyl acetate inclusions, is seen (Fig. 1D to F). The three terminal fibers, anchored in the plug-like structure, appear to be built up of multiple subunits ordered in a linear array (Fig. 1D). The side view of the reconstructed virion particle (Fig. 1E), as well as cross-sections of the negatively stained virions obtained from the tomograms (Fig. 1G and H), shows that the virion particles are embedded in negative stain (Fig. 1G and H) and partially collapsed due to staining and air drying; the height of the particles was about half of the apparent diameter. Nevertheless, the accumulated central stain is clearly visible in the cross-section (Fig. 1G) of the central part of the virion (Fig. 1D), while this feature is absent from the plug (Fig. 1G and H). The rod-shaped morphology of SRV, with a regular helical core and tail fibers, is characteristic of rudiviruses.

To investigate further the fine structure of the virion, virion particles were incubated in buffer containing 0.1% SDS for 30 min at 22°C. Most of the virion remained undisturbed, with the particles showing the same diameter as native virions and the densely stained, helical core. However, in local regions the protein shell had dissociated (Fig. 2) and a fine fiber with a diameter of 3 to 4 nm that constituted either naked DNA or a DNA-protein complex was visible.

FIG. 2. Electron micrograph of a portion of an SRV virion after treatment with 0.1% SDS for 30 min (see Materials and Methods). White arrows indicate DNA or DNA-protein fibers lacking the protein core. Bar, 100 nm.

Self-assembly of the major coat protein. The major rudiviral structural protein is highly conserved in sequence and is glycosylated (20, 22, 32a, 34). In order to study its possible self-assembly properties, the ARV1 protein (ORF134b [34]) was expressed heterologously in E. coli (see Materials and Methods) and a His-tagged protein was purified to homogeneity on an Ni2+-NTA-agarose column. The protein was shown by transmission electron microscopy to self-assemble to produce filamentous structures of uniform widths and different lengths (Fig. 3). The optimal conditions for the assembly, 75°C and pH 3, were close to those of the natural environment, and no additional energy source was required for this process. The transmission electron microscopy analysis revealed that the filaments had structural parameters similar to those of the native virions, with a diameter of 21 (± 3) nm and a periodicity of (4.2 nm)⁻¹. Thus, the data suggest that the single major coat protein alone can generate the body of the virion.

Minor rudiviral virion proteins. To date, the major coat protein is the only rudiviral structural protein to have been characterized. Given the closely similar structures of the different rudiviruses, we attempted to identify minor structural proteins for the SIRV2 virus, which can be produced in high yields. Protein components of SIRV2 virions, separated on a polyacrylamide gel, yielded six distinct major bands (Fig. 4), and all except D2, which is the strongest band and corresponds to ORF134 (gp26), were analyzed by mass spectrometry. Their identities were as follows: band A contained ORF1070 (gp38), band B contained ORF488 (gp33), and band C contained ORF564 (gp39), while bands D1 and D3 both contained ORF134 (gp26), probably in a glycosylated or, in the case of D3, a proteolytically degraded form. Thus, three additional SIRV2 structural proteins were identified, each highly conserved in sequence in all rudiviruses (Table 2).

SRV genome content. A shotgun library of the viral genome was prepared, sequenced, and assembled (see Materials and Methods) to yield an approximately 10-fold coverage of a 26-kb contig. Since 1 to 2 kb of terminal sequence is always absent from shotgun libraries of linear viral genomes, these additional sequences were generated by primer walking using viral DNA, or using PCR products obtained therefrom, until subsequent rounds of walking yielded no further sequence.

FIG. 3. Electron micrograph images of the self-assembled major coat protein of ORF134 from ARV1 after negative staining with 3% uranyl acetate. Bar, 100 nm.

FIG. 4. SIRV2 virion proteins separated by SDS-polyacrylamide gel electrophoresis and stained with Sypro Ruby. Molecular masses of protein standards are indicated in kilodaltons on the left.

The total sequence obtained was 28,096 bp, with a G+C content of 29% and an ITR of about 1,030 bp (Table 1). An EcoRI restriction digest yielded fragments consistent with the genome size (data not shown).

Thirty-seven ORFs were predicted for which start codons were assigned on the basis of the upstream locations of TATA-like and transcription factor B-responsive element (BRE) promoter motifs and/or Shine-Dalgarno motifs. Details of the putative genes and operon structures are presented in Table S1 in the supplemental material, and a comparative genome map of SRV and rudiviruses SIRV1 and ARV1 is presented in Fig. 5; the genome map of SIRV2, which is closely similar to that of SIRV1, is not included (12, 20). SRV differs from the other rudiviruses in that fewer ORFs are organized in operons, and it has a lower level of gene order conservation (Fig. 5). Moreover, whereas for the other rudiviruses TATA-like motifs are often directly preceded by a conserved GTC triplet (12, 20, 34), in SRV the ensuing triplet sequence was GTA for 10 of the 30 putative TATA-like motifs (see Table S1 in the supplemental material).

Homologs of 17 SRV ORFs are present in all rudiviruses, and a further 10 SRV ORFs are conserved in some rudiviruses. Each virus type carries a few genes which are unique, and these are generally clustered near the ends of the linear genomes and yield no matches to genes in public sequence databases. In SRV, these are ORF145, -116a, -109, -59, -108, -97b, and -92 (left to right in Fig. 5). Although for SIRV1 and SIRV2 some of these nonconserved ORFs have been shown to be transcribed (12), further work is necessary to establish whether they are all protein-coding genes. Some of the proteins carry predicted structural motifs, and putative functions could be assigned to some of the conserved ORFs on the basis of public database searches; most of these are encoded in other crenarchaeal genomes (Table 2).

The host-encoded transcriptional regulator Sta1, a winged helix-turn-helix protein, was shown to bind to some SIRV1 promoters, including those of ORF134 and ORF399, and to enhance their transcription (13). A similar regulation may occur also for SRV, since the promoter regions of the homologs of ORF134 and ORF399 contain putative Sta1 binding sites. In contrast, in ARV1 only the ORF134 homolog is present in an operon for which the first ORF is a putative transcriptional regulator, and its promoter does not carry Sta1 binding motifs.

Genomic features. Sequence heterogeneities and other exceptional properties were detected in the SRV genome and in other rudiviral genomes that are described below.

(i) ITRs. For SRV, the 1,030-bp ITR is perfect, except for a 36-bp insert at positions 799 to 834 at the left end and inverted tetramer sequences (AAAA [positions 425 to 428] and TTTT [positions 27672 to 27669]). It shows little sequence similarity to ITRs of the other rudiviruses, except for the 21-bp sequence (AATTTAGGAATTTAGGAATTT) located at the terminus that is predicted to be a Holliday junction resolvase binding site occurring in all sequenced rudiviruses (34). The ITRs of SRV and SIRV1 and -2 carry four to five degenerate copies of this direct sequence repeat, while that of ARV1 carries multiple degenerate copies of other diverse repeats of similar sizes.

(ii) Genome heterogeneity in SRV and 12-bp indels. Sequence heterogeneities were detected in the SRV genome, within the 10-fold sequence coverage, and mutations were localized to groups of subpopulations, including one 180-bp deletion between positions 11896 and 12077 in two out of six clones. Moreover, a 48-bp insertion was observed in one variant (out of 18 clones) precisely at the C terminus of ORF533 (position 20285) that generated a third copy of a 16-amino-acid direct repeat. Some changes corresponding to 12-bp indels were also apparent in overlapping clones, and they are indicated in Table 3 together with those observed earlier for SIRV1 (4, 20, 21). Moreover, sequence comparison of highly conserved ORFs present in the four rudiviral genomes revealed several additional 12-bp indels. The locations of all the identified indels which occur in conserved rudiviral genes or sites corresponding to SRV ORF75, -104, -138, -163, -168, -197, -199, -286, -294, -419, -440, -464, -533, and -1059 (Fig. 5) are indicated in the SIRV1 genome map in Fig. 6.

Rudiviral matches to CRISPRs. The availability of four separate rudiviral genome sequences provided a basis for analyzing the frequency and distribution of the matches of CRISPR spacer sequences to the viral genomes. Therefore, we analyzed the repeat clusters of each of the available crenarchaeal genomes in the public EMBL/GenBank and JGI sequence databases and in our own unpublished genomes (see Materials and Methods). Fourteen complete genomes and 8 partial genomes were analyzed. In total, 82 repeat clusters from complete genomes and 44 clusters, some incomplete, from partial genomes yielded 4,283 spacer sequences. Subsequently, 278 sequences that are shared between S. solfataricus strains P1 and P2 (16) were omitted from the data set, yielding a total of 4,005 spacer sequences.

TABLE 2. Rudiviral proteins with predicted functions

SRV ORF   Rudiviral homolog(s)  Predicted function or description                 Analysis tool  E-value or score  Other crenarchaeal virus(es)

Structural proteins
ORF134    All                   Structural protein
ORF464    All                   Structural protein
ORF581    All                   Structural protein
ORF1059   All                   Structural protein

Transcriptional regulators
ORF58     All                   RHH-1                                             SMART          2.0e-08           Many
ORF95     None                  "Winged helix" repressor DNA binding domain       3D-Jury        64.00             None

Translational regulator
ORF294    SIRV1 and -2          tRNA-guanine transglycosylase                     3D-Jury        167.57            STSV1

DNA replication
ORF440    All                   RuvB Holliday junction helicase (Lon ATPase)      3D-Jury        53.71             AFV1, AFV2
ORF116c   All                   Holliday junction resolvase (archaeal)            SMART          2.4e-45
ORF199    All                   Nuclease                                          3D-Jury        63.86             AFV1, SIFV

DNA metabolism
ORF168    SIRV1 and -2          dUTPase                                           SMART          1.5e-12           STSV1
ORF257    ARV1                  Thymidylate synthase (Thy1)                       SMART          7.9e-46           STSV1
ORF159    All                   S-adenosylmethionine-dependent methyltransferase  3D-Jury        73.67             SIFV

Glycosylation
ORF335    All                   Glycosyl transferase group 1                      SMART          6.7e-09
ORF355    All                   Glycosyl transferase                              SMART          5.1e-04

Other
ORF419    SIRV1 and -2          11 transmembrane regions                          TMHMM
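As a rough illustration of the nucleotide-level comparison described in Materials and Methods, the following Python sketch scores a spacer against both strands of a target sequence with a plain Smith-Waterman recurrence. It is not Paralign and does not attempt to reproduce its optimizations or its e-value statistics; the function names and the scoring parameters (match +2, mismatch -1, linear gap -2) are assumptions chosen only for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not those of the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_score_either_strand(spacer, target):
    """Score the spacer against the target and its reverse complement."""
    spacer = spacer.upper()
    target = target.upper()
    return max(
        smith_waterman_score(spacer, target),
        smith_waterman_score(spacer, reverse_complement(target)),
    )
```

In practice a score threshold would have to be calibrated against a negative control, as was done in the study with the Saccharomyces cerevisiae genome; the sketch leaves that step out.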

FIG. 5. Genome maps of SRV, SIRV1, and ARV1 showing the predicted ORFs and the ITRs (bold lines). SRV ORFs are identified by their amino acid lengths. Homologous genes shared between the rudiviruses are color-coded. Genes above the horizontal line are transcribed from left to right, and those below the line are transcribed in the opposite direction. Predicted functions or structural characteristics of the gene products are indicated as follows: sp, structural protein; rhh, ribbon-helix-helix protein; wh, winged helix protein; tm, transmembrane; tgt, tRNA guanine transglycosylase; hjh, Holliday junction helicase; hjr, Holliday junction resolvase; n, nuclease; du, dUTPase; ts, thymidylate synthase; sm, S-adenosylmethionine-dependent methyltransferase; gt, glycosyl transferase.

TABLE 3. Occurrence of the 12-bp indels in overlapping rudiviral clone libraries

ORF or ITR    No. of +12-bp clones  No. of -12-bp clones  Sequence      Genome position (reference)
SRV ORF58     5                     1                     AATTAAATTATG  26079–26068
SRV ORF95     8                     8                     TTTTGAATTATG  7112–7101
SIRV1 ORF335  7                     3                     AACATTCATTAA  Variant (21)
SIRV1 ORF562  1                     4                     ATACAAATTTCA  Variant (21)
SIRV1-ITR     10                    29                    TTTAGCAGTTCA  (20)
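The protein-level search outlined in Materials and Methods translates each spacer in all six reading frames and discards frames that contain a stop codon. A minimal Python sketch of that step is given here; it assumes the standard genetic code, unambiguous A/C/G/T input, and hypothetical function names, and it is not the authors' own pipeline.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translations(spacer):
    """Map (strand, offset) to the translation of that reading frame."""
    spacer = spacer.upper()
    frames = {}
    for strand, seq in (("+", spacer), ("-", reverse_complement(spacer))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def stop_free_frames(spacer):
    """Keep only the frames without a stop codon ('*')."""
    return {k: v for k, v in six_frame_translations(spacer).items() if "*" not in v}
```

For the toy sequence ATGGCCTAAGGC, two of the six frames contain a stop codon and are dropped, leaving four candidate peptides for the amino acid search.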

In the first analysis, each of the 4,005 spacer sequences was compared to the four rudiviral genomes at the nucleotide level. In total, 158 spacers yielded 268 rudiviral matches. The latter number exceeds the former because (i) some spacers match to more than one locus within repeat sequences of a given virus and (ii) some spacers match to more than one virus. Second, the analysis was performed at the protein level (see Materials and Methods). This analysis revealed 148 additional matching spacers and a further 427 rudiviral genome matches exclusively at the protein level. (An additional 105 matching spacer sequences from the latter analysis that overlapped, partially or completely, with 158 of those detected within rudiviral ORFs at the nucleotide level were not counted.) Only 6 of the 14 completed crenarchaeal genomes carried spacers yielding matches to rudiviral genomes, and they are listed, together with the results for the partial genomes, in Table 4. These results reinforced the choice of criteria employed for determining the significance of sequence matches (see Materials and Methods).

The locations of the spacer sequence matches are superimposed on the genome map of SIRV1 in Fig. 6. The matches are not evenly distributed along the genome; some genes have no matches, while others carry up to 18. Although there is no strict correlation between the level of gene sequence conservation and the number of matching spacers, the five most conserved genes, ORF440, ORF1059, ORF134, ORF355, and ORF581, exhibit the highest number of matches (18, 15, 14, 14, and 13, respectively) (Fig. 3 and 6).

DISCUSSION

The morphological and genomic data for SRV and the other characterized rudiviruses are summarized in Table 1. The conservation of their morphologies and genomic properties contrasts with that of other crenarchaeal viruses and, in particular, with that of the filamentous lipothrixviruses, which exhibit a variety of surface, envelope, and tail structures and much more heterogeneous genomes (24).

The virion length of SRV, 702 (± 50) nm, shows the same direct proportionality to genome size (28 kb) as those for the other rudivirus virions, which range in length from 610 (± 50) nm (ARV1) to 900 (± 50) nm (SIRV2) (Table 1). A superhelical core, with a pitch of 4.3 nm and a width of 20 nm, terminates in 45-nm-long nonhelical "plugs," and it correlates with the internal structure observed earlier in electron micrographs of SIRV1 (22). In order to determine whether a single superhelical DNA can span the SRV virion length, we applied the following formulae to estimate the sizes and length of the superhelical DNA:

\[ L_{turn} = \sqrt{p^{2} + c^{2}} \]

where L_turn represents the arc length of a turn, p represents the turn pitch, and c represents the cylinder circumference, and

\[ L_{total} = t \times L_{turn} \]

where t represents the number of turns and L_total represents the arc length of the entire helix.

Calculations using structural parameters for B-form DNA yielded a genome size of 26 kbp without, and 30 kbp with, terminal "plugs." The estimated width (20 nm) is an upper-limit estimate. A reciprocal calculation, with a 28-kbp genome, yields a diameter of 21.2 nm without, and 18.5 nm with, the "plugs." Given that the major rudiviral coat protein is capable of self-assembly into filamentous structures similar in width to the native virion (Fig. 3), it is likely that the rod-shaped body consists of a single superhelical DNA embedded within this filamentous protein structure. Thus, the three newly identified minor structural proteins probably contribute to conserved terminal features of the virion; consistent with this, the largest

FIG. 6. CRISPR spacer sequence matches for SIRV1 are superimposed on the SIRV1 genome map. Protein-coding regions translated from left to right are shown above the line, and those translated from right to left are shown below the line. Highly conserved coding genes are presented in dark blue, while less-conserved or nonconserved genes are in light blue. The inverted terminal repeat is shaded in violet. Matches to spacers are shown as vertical lines and are color-coded as indicated. Matches to the upper DNA strand are placed above the genome, and those to the lower strand are located below the genome. The red vertical lines correspond to the nucleotide sequence matches, and the green vertical lines correspond to matching amino acid sequences, after translation of the spacer sequences from both DNA strands. In total, there were 106 matches to SIRV1 at the nucleotide level, some of them occurring more than once, and an additional 127 matches to SIRV1 ORFs at the amino acid level. The black arrowheads indicate the positions of the 12-bp indels that occur in one or more conserved rudiviral genes.

structural protein (corresponding to SRV ORF1059) was localized within the virion tail fibers of SIRV2 by studying functional groups by the use of bioconjugation (Steinmetz et al., submitted).

We still have limited insight into functional roles of rudivirus-encoded proteins (Table 2). The glycosyl transferases have been implicated in the glycosylation of the structural proteins (34). Moreover, a few proteins have been linked to viral replication. Two of these, ORF440 and ORF199, lie within an operon and are conserved in phylogenetically diverse lipothrixviruses (33). The former yielded significant matches to RuvB, the helicase facilitating branch migration during Holliday junction resolution, while ORF199 yielded the best matches to nucleases, including Holliday junction resolvases (Table 2). Thus, they are likely to facilitate rudiviral replication, which, in SIRV1, involves site-specific nicking within the ITR, formation of head-to-head and tail-to-tail intermediates, and conversion of genomic concatemers to monomers by a Holliday junction resolvase (ORF116c) (20). In addition, SRV encodes a dUTPase and a thymidylate synthase, both of which are involved in thymidylate synthesis, whereas the other rudiviruses encode only one of these enzymes, both of which are considered helpful in maintaining a low dUTP/dTTP ratio and thus in minimizing detrimental effects of misincorporating uracil into DNA. Two putative transcriptional regulators have been identified, together with the putative tRNA transglycosylase encoded by SRV, which has homologs in SIRV1 and -2 and in other crenarchaeal viruses (Table 2) and is distantly related to a tRNA-guanine transglycosylase implicated in archeosine formation.

The two approaches employed to analyze CRISPR spacers matching the four rudiviral genomes demonstrated that about 10% of the 3,042 unique acidothermophile spacers yielded positive matches. Employing alignments at the amino acid level considerably increased the number of positive matches detected, because nucleotide sequences diverge more rapidly. Thus, the genomes of SRV and SIRV1 share almost no (~4%) similarity at the DNA level, whereas most homologous proteins show, on average, 47% sequence identity or similarity.

When studying the distribution of the spacer matches in the rudiviral genomes, some trends are evident. First, there is no significant bias with regard to the DNA strand carrying the matching sequence. In SIRV1, for example, 122 matches occur on one strand and 111 on the other (Fig. 6). This is consistent with our assumption that the incorporation of viral or plasmid DNA into the orientated CRISPRs is nondirectional. Second, in accordance with earlier analyses (16), for matches to coding regions, there is no significant bias to matches occurring in a sense or antisense direction. Thus, for SIRV1, 39% of the matches are in the sense direction whereas 54% are antisense; the remaining 7% constitute nucleotide matches to non-protein-coding regions (Fig. 6). Third, when the latter nucleotide sequence-based matches are considered, the proportion of matches which occur in intergenic regions, as opposed to those occurring in protein-coding regions, is not significantly different from the overall coding percentage of the virus. For SIRV1 19% of the nucleotide matches fall within intergenic regions, whereas 20% of the genome is non-protein-coding. Finally, some genes have many matches whereas others have none at all. Five genes have 13 or more matches in SIRV1; these genes correspond to SRV ORF440, ORF1059, ORF134, ORF355, and ORF581. Apart from being conserved in each rudivirus, their gene products have important structural or functional roles (Table 2).

The results pose an important question as to how the host distinguishes between more important and less important genes when adding the spacers to its CRISPRs. Possibly, although the de novo addition of spacers may well be an unbiased process with respect to both viral genome position and direction, the selective advantage provided by some spacers would result in a population being enriched in hosts with CRISPRs carrying spacers targeting crucial viral genes.

The 12-bp viral indels were originally shown to occur commonly in SIRV1 variants that arose as a result of passage of an SIRV1 isolate through different closely related S. islandicus strains from Iceland, and it was inferred that this unusual activity reflected adaptation of the rudivirus to the different hosts (21). The positions of the 12-bp indels that have been identified in conserved rudiviral protein genes are shown together with the CRISPR spacer matches on the SIRV1 genome map in Fig. 6. Many of the sites are very close or overlap. This raises the possibility that lengthening or shortening of conserved protein genes by 12 bp could be a mechanism to overcome the host CRISPR defense system.

We conclude that the rudiviruses are excellent models for studying details of viral life cycles and virus-host interactions in crenarchaea. These viruses appear to be much more conserved in their morphologies and genomes than, for example, the equally ubiquitous lipothrixviruses. Moreover, they are relatively stably maintained in their hosts and can be isolated in reasonable yields for experimental studies.

TABLE 4. Number of CRISPR spacer sequences from complete and partial crenarchaeal genomes which match rudiviral genomes^a

                           No. of matching sequences at the indicated level^b
Strain                     Nucleotide    Amino acid    Accession no., reference, or source

Complete genomes
S. solfataricus P2         22 (14)       31 (18)       NC_002754
S. tokodaii 7              9             14            NC_003106
M. sedula                  5             15            NC_009440
S. acidocaldarius          5             9             NC_007181
S. marinus F1              2             1             NC_009033
H. butylicus               0             1             NC_008818

Incomplete genomes
S. solfataricus P1         20 (14)       30 (18)       16
S. islandicus (5 strains)  39/12/4/2/1   26/7/2/2/0    See text
S. islandicus HVE10/4      36            11            Unpublished
A. brierleyi               15            14            Unpublished

a All the acidothermophilic organisms from the family Sulfolobaceae have spacers matching those of the rudiviral genomes. However, the neutrophilic hyperthermophiles S. marinus and H. butylicus produced very few matches. Matches at the amino acid sequence level that overlapped with those at the nucleotide sequence level were excluded from the data.
b Numbers in parentheses in columns 2 and 3 indicate the number of matches that arose from spacers shared by S. solfataricus strains P1 and P2 (16).
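The superhelix estimate given in the Discussion (arc length per turn from pitch and circumference, multiplied by the number of turns) can be checked with a few lines of Python. The sketch below rests on assumptions that are not stated in the text: an axial rise of 0.34 nm per base pair for B-form DNA, a reading of "without plugs" as a helix spanning the 702-nm virion minus two 45-nm plugs, and "with plugs" as a helix spanning the full virion length. All function and constant names are hypothetical.

```python
import math

RISE_PER_BP_NM = 0.34  # assumed axial rise per base pair of B-form DNA


def turn_arc_length(pitch_nm, diameter_nm):
    """L_turn = sqrt(p^2 + c^2), with c the cylinder circumference."""
    circumference = math.pi * diameter_nm
    return math.hypot(pitch_nm, circumference)


def genome_size_bp(helix_length_nm, pitch_nm, diameter_nm):
    """L_total = t * L_turn, converted to base pairs."""
    turns = helix_length_nm / pitch_nm
    return turns * turn_arc_length(pitch_nm, diameter_nm) / RISE_PER_BP_NM


def helix_diameter_nm(genome_bp, helix_length_nm, pitch_nm):
    """Reciprocal calculation: diameter that accommodates a given genome."""
    turns = helix_length_nm / pitch_nm
    l_turn = genome_bp * RISE_PER_BP_NM / turns
    circumference = math.sqrt(l_turn ** 2 - pitch_nm ** 2)
    return circumference / math.pi


# Structural parameters quoted in the text for SRV.
VIRION_NM, PLUG_NM, PITCH_NM, WIDTH_NM = 702.0, 45.0, 4.3, 20.0
CORE_NM = VIRION_NM - 2 * PLUG_NM  # helical part if both plugs are excluded

kbp_without_plugs = genome_size_bp(CORE_NM, PITCH_NM, WIDTH_NM) / 1000
kbp_with_plugs = genome_size_bp(VIRION_NM, PITCH_NM, WIDTH_NM) / 1000
diameter_without_plugs = helix_diameter_nm(28_000, CORE_NM, PITCH_NM)
diameter_with_plugs = helix_diameter_nm(28_000, VIRION_NM, PITCH_NM)
```

Under these assumptions the sketch agrees with the reported values to the nearest kbp (26 and 30 kbp) and to within 0.1 nm for the reciprocal diameters (21.2 and 18.5 nm), which suggests, but does not prove, that the original calculation used similar parameters.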

ACKNOWLEDGMENTS

We are grateful to Georg Fuchs for providing the environmental sample from São Miguel Island, the Azores.

The research in Copenhagen was supported by grants from the Danish Natural Science Research Council, the Danish National Research Foundation, and Copenhagen University. The research in Paris was partly supported by grant NT05-2_41674 from Agence Nationale de Recherche (Programme Blanc).

REFERENCES

1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
2. Bettstetter, M., X. Peng, R. A. Garrett, and D. Prangishvili. 2003. AFV1, a novel virus infecting hyperthermophilic archaea of the genus Acidianus. Virology 315:68–79.
3. Bland, C., T. L. Ramsey, F. Sabree, M. Lowe, K. Brown, N. C. Kyrpides, and P. Hugenholtz. 2007. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8:209.
4. Blum, H., W. Zillig, S. Mallok, H. Domdey, and D. Prangishvili. 2001. The genome of the archaeal virus SIRV1 has features in common with genomes of eukaryal viruses. Virology 281:6–9.
5. Brügger, K., P. Redder, and M. Skovgaard. 2003. MUTAGEN: multi-user tool for annotating genomes. Bioinformatics 19:2480–2481.
6. Eder, W., W. Ludwig, and R. Huber. 1999. Novel 16S rRNA gene sequences retrieved from highly saline brine sediments of Kebrit Deep, Red Sea. Arch. Microbiol. 172:213–218.
7. Edgar, R. C. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.
8. Ginalski, K., A. Elofsson, D. Fischer, and L. Rychlewski. 2003. 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 19:1015–1018.
9. Häring, M., X. Peng, K. Brügger, R. Rachel, K. O. Stetter, R. A. Garrett, and D. Prangishvili. 2004. Morphology and genome organization of the virus PSV of the hyperthermophilic archaeal genera Pyrobaculum and Thermoproteus: a novel virus family, the Globuloviridae. Virology 323:233–242.
10. Häring, M., G. Vestergaard, K. Brügger, R. Rachel, R. A. Garrett, and D. Prangishvili. 2005. Structure and genome organization of AFV2, a novel archaeal lipothrixvirus with unusual terminal and core structures. J. Bacteriol. 187:3855–3858.
11. Häring, M., G. Vestergaard, R. Rachel, L. Chen, R. A. Garrett, and D. Prangishvili. 2005. Virology: independent virus development outside a host. Nature 436:1101–1102.
12. Kessler, A., A. B. Brinkman, J. van der Oost, and D. Prangishvili. 2004. Transcription of the rod-shaped viruses SIRV1 and SIRV2 of the hyperthermophilic archaeon Sulfolobus. J. Bacteriol. 186:7745–7753.
13. Kessler, A., G. Sezonov, J. I. Guijarro, N. Desnoues, T. Rose, M. Delepierre, S. D. Bell, and D. Prangishvili. 2006. A novel archaeal regulatory protein, Sta1, activates transcription from viral promoters. Nucleic Acids Res. 34:4837–4845.
14. Laemmli, U. K. 1970. Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227:680–685.
15. Letunic, I., R. R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Börk. 2006. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34:D257–D260.
16. Lillestøl, R. K., P. Redder, R. A. Garrett, and K. Brügger. 2006. A putative viral defence mechanism in archaeal cells. Archaea 2:59–72.
17. Makarova, K. S., N. V. Grishin, S. A. Shabalina, Y. I. Wolf, and E. V. Koonin. 2006. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1:7.
18. Mojica, F. J., C. Diez-Villasenor, J. Garcia-Martinez, and E. Soria. 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J. Mol. Evol. 60:174–182.
19. Ortmann, A. C., B. Wiedenheft, T. Douglas, and M. Young. 2006. Hot crenarchaeal viruses reveal deep evolutionary connections. Nat. Rev. Microbiol. 4:520–528.
20. Peng, X., H. Blum, Q. She, S. Mallok, K. Brügger, R. A. Garrett, W. Zillig, and D. Prangishvili. 2001. Sequences and replication of genomes of the archaeal rudiviruses SIRV1 and SIRV2: relationships to the archaeal lipothrixvirus SIFV and some eukaryal viruses. Virology 291:226–234.
21. Peng, X., A. Kessler, H. Phan, R. A. Garrett, and D. Prangishvili. 2004. Multiple variants of the archaeal DNA rudivirus SIRV1 in a single host and a novel mechanism of genomic variation. Mol. Microbiol. 54:366–375.
22. Prangishvili, D., H. P. Arnold, D. Gotz, U. Ziese, I. Holz, J. K. Kristjansson, and W. Zillig. 1999. A novel virus family, the Rudiviridae: structure, virus-host interactions and genome variability of the Sulfolobus viruses SIRV1 and SIRV2. Genetics 152:1387–1396.
23. Prangishvili, D., P. Forterre, and R. A. Garrett. 2006. Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 4:837–848.
24. Prangishvili, D., and R. A. Garrett. 2005. Viruses of hyperthermophilic Crenarchaea. Trends Microbiol. 13:535–542.
25. Prangishvili, D., R. A. Garrett, and E. V. Koonin. 2006. Evolutionary genomics of archaeal viruses: unique viral genomes in the third domain of life. Virus Res. 117:52–67.
26. Prangishvili, D., G. Vestergaard, M. Häring, R. Aramayo, T. Basta, R. Rachel, and R. A. Garrett. 2006. Structural and genomic properties of the hyperthermophilic archaeal virus ATV with an extracellular stage of the reproductive cycle. J. Mol. Biol. 359:1203–1216.
27. Rachel, R., M. Bettstetter, B. P. Hedlund, M. Häring, A. Kessler, K. O. Stetter, and D. Prangishvili. 2002. Remarkable morphological diversity of viruses and virus-like particles in hot terrestrial environments. Arch. Virol. 147:2419–2429.
28. Reilin, A. 1998. Preparation of catalase crystals. University of Illinois at Urbana-Champaign, Urbana, IL. http://www.itg.uiuc.edu/publications/techreports/98-009.
29. Rice, G., K. Stedman, J. Snyder, B. Wiedenheft, D. Willits, S. Brumfield, T. McDermott, and M. J. Young. 2001. Viruses from extreme thermal environments. Proc. Natl. Acad. Sci. USA 98:13341–13345.
30. Rutherford, K., J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Rajandream, and B. Barrell. 2000. ARTEMIS: sequence visualization and annotation. Bioinformatics 16:944–945.
31. Sæbø, P. E., S. M. Andersen, J. Myrseth, J. K. Lærdahl, and T. Rognes. 2005. PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res. 33:W535–W539.
32. Snyder, J. C., B. Wiedenheft, M. Lavin, F. F. Roberto, J. Spuhler, A. C. Ortmann, T. Douglas, and M. Young. 2007. Virus movement maintains local virus population diversity. Proc. Natl. Acad. Sci. USA 104:19102–19107.
32a. Steinmetz, N. F., A. Bize, K. C. Findlay, G. P. Lomonossoff, M. Manchester, D. J. Evans, and D. Prangishvili. Site-specific and spatially controlled addressability of a new viral nanobuilding block: Sulfolobus islandicus rod-shaped virus 2. Adv. Funct. Mat., in press.
33. Vestergaard, G., R. Aramayo, T. Basta, M. Häring, X. Peng, K. Brügger, L. Chen, R. Rachel, N. Boisset, R. A. Garrett, and D. Prangishvili. 2008. Structure of the Acidianus filamentous virus 3 and comparative genomics of related archaeal lipothrixviruses. J. Virol. 82:371–381.
34. Vestergaard, G., M. Häring, X. Peng, R. Rachel, R. A. Garrett, and D. Prangishvili. 2005. A novel rudivirus, ARV1, of the hyperthermophilic archaeal genus Acidianus. Virology 336:83–92.
35. Zillig, W., A. Kletzin, C. Schleper, I. Holz, D. Janekovic, H. Hain, M. Lanzendörfer, and J. K. Kristjansson. 1994. Screening for Sulfolobales, their plasmids and their viruses in Icelandic solfataras. System. Appl. Microbiol. 16:609–628.

Contribution: major

All the work for this manuscript was done by myself, except for the advanced statistical analyses of spacer match distributions, which were conducted by Dr. Niels R. Hansen. A few paragraphs pertaining to the methodology of this part of the work were also drafted by him. Professor Roger A. Garrett made extensive revisions to the text after it was finished.

Molecular Biology of Archaea

Distribution of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism

Shiraz Ali Shah*, Niels R. Hansen† and Roger A. Garrett*1 *Centre for Comparative Genomics, Department of Biology, Biocenter, Copenhagen University, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark, and †Department of Mathematical Sciences, Copenhagen University, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark

Abstract

Transcripts from spacer sequences within chromosomal repeat clusters [CRISPRs (clusters of regularly interspaced palindromic repeats)] from archaea have been implicated in inhibiting or regulating the propagation of archaeal viruses and plasmids. For the crenarchaeal acidothermophiles, the chromosomal spacers show a high level of matches (∼30%) with viral or plasmid genomes. Moreover, their distribution along the virus/plasmid genomes, as well as their DNA strand specificity, appears to be random. This is consistent with the hypothesis that chromosomal spacers are taken up directly and randomly from virus and plasmid DNA and that the spacer transcripts target the genomic DNA of the extrachromosomal elements and not their transcripts.

Archaeal CRISPR system

CRISPRs (clusters of regularly interspaced palindromic repeats) consist of identical repeats separated by unique spacer sequences of constant length which occur in the sequenced chromosomes of almost all archaea and approx. 40% of bacteria (reviewed in [1]). The archaeal repeat clusters are generally large and can constitute >1% of the chromosome. The original observation that some spacers show close sequence matches with archaeal viral genomes led to the hypothesis that spacer regions have a regulatory effect on viral propagation [2] and plasmid propagation [1], and this proposal was subsequently reinforced by several studies on both archaea and bacteria (reviewed in [1,3,4]). Moreover, a mechanism for this putative inhibitory effect was suggested, at an early stage, by the finding that RNA transcripts are produced, and processed, from at least one strand of the archaeal repeat clusters [5,6], with the smallest product corresponding roughly in size to a single spacer transcript [1]. This opened up the possibility of an antisense RNA or RNAi (RNA interference)-like mechanism acting either on the viral transcripts or directly on the viral DNA [1,3]. New spacer-repeat units are added at the end of the repeat clusters adjoining a low-complexity flanking sequence [1,7], by a process that probably involves Cas proteins which are generally encoded adjacent to the clusters [3,5,8]. Experimental evidence for such a virus-induced addition was recently provided for bacteria on infecting Streptococcus thermophilus with bacteriophages φ858 and φ2972 [9].

Hypothesis

In the present article, we explore and interpret trends which emerge when collectively analysing chromosomal CRISPR spacer matches to viral and plasmid genomes. The crenarchaeal acidothermophiles were selected for the analysis because they carry large and multiple repeat clusters [1] and because many of their viruses and plasmids have been sequenced [10]. The results should yield insights into both the mechanism of uptake of new spacer regions in CRISPRs and the mechanism of inhibition or regulation of the viruses and plasmids. We assume that, if chromosomal spacer sequence matches occur randomly on the virus or plasmid genome, then the chromosomal spacer regions are generated by DNA excision and insertion and not by reverse transcription from virus/plasmid transcripts. In contrast, a non-random distribution of matches biased to the genes would favour the latter RNA-based mechanism. A random distribution of spacer matches on the virus/plasmid genomes would also favour a DNA-directed inhibitory mechanism for the spacer transcripts, whereas a gene-biased distribution would support the spacer transcripts inhibiting virus/plasmid gene expression. Previous studies on the archaeal CRISPRs of related Sulfolobus solfataricus strains have suggested that individual spacers are quite stable and that any selective pressure acts on larger blocks of spacers [1], so we infer that any selective pressures on CRISPR spacer contents will not influence our results and interpretation significantly.

Key words: acidothermophile, archaeal plasmid, archaeal virus, cluster of regularly interspaced palindromic repeats (CRISPR), crenarchaeon.

Abbreviations used: ATV, Acidianus two-tailed virus; CRISPR, cluster of regularly interspaced palindromic repeats; ITR, inverted terminal repeat; ORF, open reading frame; SIRV1, Sulfolobus islandicus rod-shaped virus 1; STIV, Sulfolobus turreted icosahedral virus.

1 To whom correspondence should be addressed (email [email protected]).

Biochemical Society Transactions, www.biochemsoctrans.org
Biochem. Soc. Trans. (2009) 37, 23–28; doi:10.1042/BST0370023
© The Authors Journal compilation © 2009 Biochemical Society

Selection of viruses, plasmids and CRISPRs

Five crenarchaeal virus families, a class of conjugative plasmids and a family of cryptic plasmids were selected for the study (Table 1). They include six β-lipothrixviruses, family Lipothrixviridae; four rudiviruses, family Rudiviridae; seven fuselloviruses, family Fuselloviridae; a single bicaudavirus ATV (Acidianus two-tailed virus), family Bicaudaviridae; STIV (Sulfolobus turreted icosahedral virus), an unclassified icosahedral virus (reviewed in [10]), seven members of a conjugative plasmid family and four members of the pRN cryptic plasmid family (reviewed in [11]). Each extrachromosomal element can propagate in members of the related crenarchaeal thermoacidophilic genera Sulfolobus or Acidianus. Spacer sequences were derived from 13 whole crenarchaeal chromosomal sequences, from both acidothermophiles and neutrothermophiles, and the partial genomes of Acidianus brierleyi, S. solfataricus P1 and Sulfolobus islandicus HVE10/4 from our laboratory and of S. islandicus strains LD85, YG5714, YN1551, M164 and U328 which were publicly available in May 2008 (Table 1).

Identifying spacer matches

CRISPR regions were localized using publicly available software [12,13] and examined for the occurrence of spacer sequence matches to the selected viruses and plasmids. Two approaches were employed. In one, matches were identified at a nucleotide sequence level between the similarly oriented spacer sequences (corresponding to the processed transcript sequence [1,5,6]) and either strand of the virus/plasmid DNA. In a second approach, we exploited the observation that protein sequences are more highly conserved than gene sequences and tried to detect significant matches additional to those identified at a nucleotide sequence level. Each spacer strand was translated into three amino acid sequences, and, after removing sequences containing stop codons (about 50%), each translated sequence was aligned against amino acid sequences of all annotated ORFs (open reading frames) of all the viruses and plasmids. Implicit in this approach is the assumption that the uptake of spacers in the oriented CRISPRs is non-directional, and this is borne out by the results (see below). A nucleotide sequence approach was also applied to the whole acidothermophile chromosomes by searching for exact matches to CRISPR spacers (Table 1). Significant e-value cut-offs were determined for both the nucleotide and amino acid sequence searches using the genome sequence of Saccharomyces cerevisiae as a negative control (results not shown). All sequence alignments were performed using Paralign, an MMX-optimized implementation of the Smith–Waterman algorithm [14].

Analysis of the distribution of chromosomal spacer matches on the virus/plasmid genomes

In total, 82 repeat clusters, some incomplete (Table 1), yielded 4005 spacer sequences, after subtracting 278 spacer sequences shared between S. solfataricus strains P1 and P2 [1]. Approx. 30% of the spacers from the acidothermophile genomes match the virus and plasmid families (Table 1), whereas only approx. 5% matched for the neutrothermophiles. This difference probably reflects the fact that the viruses and plasmids fall only within the host specificity range of the acidothermophiles. The locations of all the spacer matches are superimposed on genome maps of representative genetic elements in Figure 1. Spacers giving nucleotide sequence matches to either DNA strand (red lines) occur mainly within genes, but a few are located intergenically or within the non-protein-coding region of the ITR (inverted terminal repeat). Translated spacers yielding amino acid sequence matches, additionally to the nucleotide sequence matches, occur within annotated ORFs on either DNA strand (green lines).

In a series of three tests, we attempted to address the question of whether or not the spacers present in host chromosomal CRISPRs match the virus/plasmid genomes in a biased non-random manner. Potential biases include the preferential matching to certain regions of the virus/plasmid genome and DNA strand biases. We exclusively used the nucleotide sequence matching data because it covered the whole genome.

First, we examined the distribution of spacer sequence matches, at a nucleotide level, along the virus/plasmid genomes. We assumed that a uniform distribution would follow, roughly, a homogeneous Poisson process, whereas an irregular distribution along the genome would yield a deviation from the homogeneous Poisson process. We investigated this using Kolmogorov–Smirnov test statistics for each virus and plasmid and we were generally unable to detect any significant deviations from a homogeneous Poisson distribution.

Secondly, we tested whether there was any detectable bias in the spacer matches to the most conserved viral genes given that they are more likely to be targets for inhibition of propagation. The number of matches to each gene was analysed using a Poisson regression model with the gene conservation and length as explanatory variables. This analysis showed that the number of matches to a given gene did not depend significantly upon the degree of its conservation, although, for SIRV1 (Sulfolobus islandicus rod-shaped virus 1), we did observe a weak effect for the seven to ten most conserved genes. Moreover, it was found that the expected number of matches was proportional to the gene length, in agreement with the homogeneous Poisson process.

Thirdly, we tested for any bias in the distribution of spacer matches in coding compared with non-coding regions or to the sense compared with antisense strands of the virus/plasmid genes using a specific alternative of a Poisson process with different intensities for matches occurring within, and outside, protein-coding regions, treating each DNA strand separately. We were unable to detect any significant deviations from a homogeneous Poisson distribution for the match intensities of the coding compared with non-coding regions, with the exception of STIV, where there is a bias to the antisense strand (Figure 1).
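The first test can be illustrated with a minimal sketch. Conditional on the number of matches, a homogeneous Poisson process places the match positions independently and uniformly along the genome, so a one-sample Kolmogorov–Smirnov comparison against the uniform distribution is one way to check it. The function names, the genome length, the example positions and the asymptotic p-value approximation below are all assumptions made for illustration; this is not the authors' implementation.

```python
import math

def ks_uniform_statistic(positions, genome_length):
    """Kolmogorov-Smirnov distance between match positions and Uniform(0, genome_length)."""
    xs = sorted(p / genome_length for p in positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; the uniform CDF at x is x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def ks_asymptotic_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = math.sqrt(n) * d
    if lam < 0.3:
        return 1.0  # the alternating series is unreliable here; p is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

genome_length = 32_000  # hypothetical genome size in bp
n_matches = 50          # hypothetical number of spacer matches
even = [(i + 0.5) * genome_length / n_matches for i in range(n_matches)]
clustered = [p / 10 for p in even]  # all matches squeezed into the first tenth

for label, positions in (("evenly spread", even), ("clustered", clustered)):
    d = ks_uniform_statistic(positions, genome_length)
    p = ks_asymptotic_pvalue(d, len(positions))
    print(f"{label}: D = {d:.3f}, p = {p:.3g}")
```

A small D with a large p-value gives no evidence against a uniform distribution of matches, whereas the clustered positions produce a large D and a very small p-value. For small numbers of matches an exact test would be preferable to the asymptotic approximation used here.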

Table 1 Summary of the chromosomal spacer matches to the virus and plasmid genomes of the crenarchaeal acidothermophiles

The numbers of CRISPR spacers are given which match virus/plasmid family genomes significantly at a nucleotide level, as well as additional matches detected at an amino acid level. Spacer matches to the host's own genome constitute only exact nucleotide matches. The total number of chromosomal spacers matching to virus/plasmid genomes differs from the number of spacers that match each plasmid and virus family because some spacers match more than one family, but have been counted only once. Rudiviruses comprise SIRV1, SIRV2, ARV and SRV1; β-lipothrixviruses constitute AFV3, AFV6, AFV7, AFV8, AFV9 and SIFV, and fuselloviruses include SSV2, SSV4, SSV5, SSVrh, SSVk1 and SSV1. The pNOB8 family contains pNOB8, pARN3, pARN4, pHVE14, pING1, pKEF9, pSOG1 and pSOG2, and the pRN family consists of pHEN7, pDL10, pRN1 and pRN2. The 278 spacers which S. solfataricus P1 shares with strain P2 were subtracted during the analysis, but have been reinserted in this Table. For the partial genomes, the total numbers of spacers are approximate, since repeat clusters may not be fully sequenced. Genome sequences for A. brierleyi, S. solfataricus P1 and S. islandicus HVE10/4 are unpublished work from our laboratory. Genomes of S. islandicus strains LD85, YG5714, YN1551, M164 and U328 are publicly available from the JGI (Joint Genome Institute) database (http://www.jgi.doe.gov/). All neutrothermophile genomes were complete and obtained through GenBank, with accession numbers NC_000854 (Aeropyrum pernix K1), NC_008818 (Hyperthermus butylicus DSM5456), NC_009776 (Ignicoccus hospitalis KIN4/I), NC_003364 (Pyrobaculum aerophilum IM2), NC_009376 (Pyrobaculum arsenaticum DSM 13514), NC_009073 (Pyrobaculum calidifontis JCM11548), NC_008701 (Pyrobaculum islandicum DSM4184), NC_009033 (Staphylothermus marinus F1) and NC_008698 (Thermofilum pendens Hrk5).

| Strain | Spacers (total) | Rudiviruses | β-Lipothrixviruses | Fuselloviruses | STIV | ATV | pNOB8 family (conjugative) | pRN family (cryptic) | Spacers (total matching) | Matches with own genome | GenBank/JGI accession number/reference |
| Sulfolobus solfataricus P2 | 415 | 53 | 24 | 15 | 9 | 20 | 26 | 12 | 135 | 0 | NC_002754 |
| Sulfolobus solfataricus P1 | 423 | 50 | 22 | 19 | 9 | 26 | 32 | 7 | 144 | 0 | [1] |
| Acidianus brierleyi | 367 | 29 | 21 | 9 | 8 | 5 | 32 | 10 | 100 | 0 | Unpublished |
| Metallosphaera sedula DSM5348 | 386 | 20 | 9 | 8 | 6 | 59 | 31 | 4 | 110 | 0 | NC_009440 |
| Sulfolobus acidocaldarius DSM639 | 223 | 14 | 5 | 2 | 1 | 2 | 15 | 4 | 38 | 0 | NC_007181 |
| Sulfolobus tokodaii 7 | 461 | 23 | 19 | 19 | 13 | 2 | 43 | 6 | 108 | 1 | NC_003106 |
| Sulfolobus islandicus, four strains (YG5714, YN1551, M164, U328) | 481 | 30 | 22 | 32 | 25 | 8 | 19 | 5 | 116 | 0 | 4023468, 4005359, 4023464, 4023466 |
| Sulfolobus islandicus HVE10/4 | 270 | 47 | 20 | 20 | 4 | 3 | 19 | 9 | 104 | 0 | Unpublished |
| Sulfolobus islandicus LD85 | 287 | 65 | 39 | 10 | 6 | 1 | 6 | 6 | 114 | 0 | 4023472 |
| Acidothermophiles (total) | 3313 | 331 | 181 | 134 | 81 | 126 | 226 | 63 | 969 | 1 | – |
| Neutrothermophiles (total) | 963 | 6 | 13 | 14 | 1 | 4 | 16 | 0 | 52 | 0 | – |
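The "approx. 30%" and "approx. 5%" figures quoted in the text can be checked against the two totals rows of Table 1. The short sketch below only redoes that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Totals rows of Table 1: (total spacers, spacers matching any virus/plasmid family).
totals = {
    "acidothermophiles": (3313, 969),
    "neutrothermophiles": (963, 52),
}

def match_fraction(total_spacers, matching_spacers):
    """Fraction of chromosomal spacers with at least one virus/plasmid match."""
    return matching_spacers / total_spacers

for group, (total, matching) in totals.items():
    print(f"{group}: {100 * match_fraction(total, matching):.1f}% of spacers match")
```

With the Table 1 totals this gives roughly 29.2% for the acidothermophiles and 5.4% for the neutrothermophiles.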


Figure 1 CRISPR spacer matches superimposed on genomes of representative viruses and plasmids SIRV1, rudiviruses; AFV9 (Acidianus filamentous virus 9), β-lipothrixviruses; SSV2 (Sulfolobus spindle-shaped virus 2), fuselloviruses; STIV, unclassified icosahedral virus; ATV, bicaudavirus; pNOB8, conjugative plasmids; pHEN7, cryptic plasmids. A preliminary version of the rudiviral data was presented in [15]. The circular genomes (SSV2, STIV, ATV, pNOB8 and pHEN7) are presented in a linear format. Protein-coding regions are boxed and shaded, according to their levels of conservation for those genomes for which comparative data are available (all except for STIV and ATV). Spacer sequence matches are indicated by lines above and below the genomes for the two DNA strands and they are colour-coded according to whether they occur exclusively at a nucleotide level (red) or additionally at an amino acid level (green).

Similar results for the first and third tests were obtained when the analysis was limited to spacer matches from family I CRISPRs (see below).

Classifying crenarchaeal acidothermophile CRISPR families

CRISPRs are oriented and they generally carry a 300–600 bp low-complexity flanking sequence immediately upstream of the repeat cluster which contains the transcriptional leader sequence [1]. Sequence analysis of the flanking sequences by multiple alignment [16] and motif analysis [17], along with sequence comparison of the repeat sequence from each cluster, suggested that the CRISPRs can be classified into families. All crenarchaeal flanking sequences share a common A/T-rich motif adjacent to the first repeat of the cluster, whereas the remainder of the flanking sequence is family-specific. At least three distinct families, each with multiple members, were found for the acidothermophiles by analysing the flanking sequences alone (Figure 2A), and this finding was reinforced by constructing a multiple alignment of repeat sequences from the clusters (Figure 2B). Thus there is a clear correlation between the nature of the flanking sequence and the repeat sequence which constitutes a repeat cluster. These CRISPR families cross species and genus barriers, and most of the acidothermophile genomes contain clusters from different families. Therefore no families are specific to a given species and no species is limited to a single family. These results strongly reinforce the hypothesis that CRISPR–Cas systems are acquired via horizontal gene transfer [1,19].

Over half of the acidothermophile repeat clusters belong to family I, where, generally, the sequence just upstream of the virus or plasmid site which matches a family I spacer carries a CC motif (Figure 2C). Insufficient data precluded our establishing whether such motifs occur adjacent to family II and family III spacer matches.

Figure 2 CRISPR families of crenarchaeal acidothermophiles
(A) Schematic representation of the three types of flanking sequence associated with CRISPR families I, II and III. All three flanking sequences share a motif adjacent to the repeat cluster, whereas the upstream region of the flank is specific for each family. (B) Phylogenetic tree created using ClustalW [18] based on a multiple alignment of a repeat from each acidothermophile repeat cluster. The CRISPRs studied are labelled by a four-letter prefix based on the genus and species name in addition to the number of repeats carried by the repeat cluster. Abri, Acidianus brierleyi; Msed, Metallosphaera sedula; Saci, Sulfolobus acidocaldarius; Sisl, Sulfolobus islandicus; Ssol, Sulfolobus solfataricus; Stok, Sulfolobus tokodaii. S. islandicus HVE10/4 and A. brierleyi repeat clusters were not completely sequenced and the total number of repeats is not given. The three major repeat cluster families are indicated by differently shaded boxes. (C) Logo-plot (http://weblogo.berkeley.edu/) of the motif located upstream of the area on a virus or plasmid genome matched by a group I spacer. The CC motif was found at approx. 75% of all matching sites.
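The family I observation can be made concrete with a small sketch that counts how often the dinucleotide immediately upstream of a matched site is CC. The toy genome, the match coordinates and the function names are hypothetical, and the strand handling (reading the flank on the strand that carries the match) is an assumption about how such a count could be set up, not a description of the original analysis.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_dinucleotide(genome, start, end, strand):
    """Return the 2 nt immediately 5' of a matched site, read on the matched strand.

    start and end are 0-based, end-exclusive coordinates on the forward strand.
    Returns None when the site lies too close to the sequence end to have a flank.
    """
    if strand == "+":
        if start < 2:
            return None
        return genome[start - 2:start]
    # For a minus-strand match, "upstream" lies to the right on the forward strand.
    if end + 2 > len(genome):
        return None
    return reverse_complement(genome[end:end + 2])

def motif_fraction(genome, matches, motif="CC"):
    """Fraction of matched sites whose upstream dinucleotide equals the motif."""
    flanks = [upstream_dinucleotide(genome, s, e, strand) for s, e, strand in matches]
    flanks = [f for f in flanks if f is not None]
    if not flanks:
        return 0.0
    return sum(f == motif for f in flanks) / len(flanks)

# Hypothetical toy genome and match coordinates (start, end, strand).
genome = "ATCCGATTACAGGTTTTAGGCATG"
matches = [(4, 9, "+"), (8, 11, "-"), (13, 17, "-"), (20, 23, "+")]
print(f"CC upstream of {100 * motif_fraction(genome, matches):.0f}% of matched sites")
```

In this toy example two of the four sites carry an upstream CC. Applied to real match coordinates, the same count could be compared with the approx. 75% reported for family I spacers.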

Conclusions

The results demonstrate that CRISPR spacer matches are uniformly distributed throughout the virus/plasmid genomes, regardless of both gene location and degree of gene conservation. Moreover, there is no significant bias to either sense or antisense strands of genes (with the exception of STIV): both strands are targeted to an equal degree. These findings strongly suggest that the spacer regions of the CRISPR are taken up randomly, and non-directionally, from the virus or plasmid DNA and are not generated by reverse transcriptase from virus/plasmid transcripts. The results are also consistent with the hypothesis that the CRISPR spacer transcripts target the virus/plasmid by hybridizing directly to their DNA, possibly priming it for degradation. The results also support a mechanism whereby virus or plasmid propagation is inhibited primarily at a DNA level and not at a gene-expression level. For example, the non-protein-coding ITR region, which is implicated in rudiviral replication [10], carries seven spacer matches in SIRV1 (Figure 1) and other spacer matches occur in intergenic regions which appear not to be involved in transcriptional regulation (results not shown). The inhibitory mechanism also appears to be highly specific for virus/plasmid DNA, since only one perfect spacer sequence match was detected within any of the acidothermophile chromosomal sequences examined (Table 1). This may be crucial for cell survival if the inhibitory mechanism involves DNA degradation, but, given that viruses and plasmids often integrate reversibly into archaeal chromosomes [20], it suggests that the CRISPR–Cas system selectively targets DNA of extrachromosomal elements, whether circular or linear. The CRISPR–Cas system has been primarily implicated in viral inhibition in both archaea and bacteria [1,3,4], but it is clear from the present analysis that, at least for archaea, its role is more complex.
The apparatus targets plasmids, both conjugative and cryptic, with a similar frequency to viruses (Figure 1). Moreover, some host CRISPR spacers match their own viruses or plasmids, suggesting a regulatory, rather than an inhibitory, role, and this possibility is reinforced by the low copy numbers, and non-lytic properties, of most crenarchaeal viruses [10]. Finally, the observation that a spacer sequence in the repeat cluster of the conjugative plasmid pKEF9 [21] matches a rudiviral genome suggests that plasmids themselves can also inhibit/regulate co-infecting viruses.

Acknowledgements

Dr Kim Brügger kindly provided unpublished genome sequence data.

Funding

Work was supported by the Danish National Research Foundation for a Centre of Comparative Genomics and the Danish Natural Science Research Council [grant number 272-06-0442].

References

1 Lillestøl, R.K., Redder, P., Garrett, R.A. and Brügger, K. (2006) A putative viral defence mechanism in archaeal cells. Archaea 2, 59–72
2 Mojica, F.J., Diez-Villasenor, C., Garcia-Martinez, J. and Soria, E. (2005) Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J. Mol. Evol. 60, 174–182
3 Makarova, K.S., Grishin, N.V., Shabalina, S.A., Wolf, Y.I. and Koonin, E.V. (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1, 7
4 Sorek, R., Kunin, V. and Hugenholtz, P. (2008) CRISPR: a widespread system that provides acquired resistance against phages in bacteria and archaea. Nat. Rev. Microbiol. 6, 181–186
5 Tang, T.-H., Bachellerie, J.-P., Rozhdestvensky, T., Bortolin, M.-L., Huber, H., Drungowski, M., Elge, T., Brosius, J. and Hüttenhofer, A. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl. Acad. Sci. U.S.A. 99, 7536–7541
6 Tang, T.-H., Polacek, N., Zywicki, M., Huber, H., Brügger, K., Garrett, R.A., Bachellerie, J. P. and Hüttenhofer, A. (2005) Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55, 469–481
7 Pourcel, C., Salvignol, G. and Vergnaud, G. (2005) CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151, 653–663
8 Jansen, R., Embden, J.D., Gaastra, W. and Schouls, L.M. (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol. 43, 1565–1575
9 Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., Romero, D.A. and Horvath, P. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712
10 Prangishvili, D., Forterre, P. and Garrett, R.A. (2006) Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 11, 837–848
11 Lipps, G. (2006) Plasmids and viruses of the thermoacidophilic crenarchaeote Sulfolobus. Extremophiles 10, 17–28
12 Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8, 18–24
13 Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C. and Hugenholtz, P. (2007) CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209–217
14 Saebø, P.E., Andersen, S.M., Myrseth, J., Laerdahl, J.K. and Rognes, T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res. 33, 535–539
15 Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., Briegel, A., Rachel, R., Garrett, R.A. and Prangishvili, D. (2008) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J. Bacteriol. 190, 6837–6845
16 Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797
17 Bailey, T.L., Williams, N., Misleh, C. and Li, W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 34, 369–373
18 Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680
19 Godde, J.S. and Bickerton, A. (2006) The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J. Mol. Evol. 62, 718–729
20 Wang, Y., Duan, Z., Zhu, H., Guo, X., Wang, Z., Zhou, J., She, Q. and Huang, L. (2007) A novel Sulfolobus non-conjugative extrachromosomal genetic element capable of integration into the host genome and spreading in the presence of a fusellovirus. Virology 363, 124–133
21 Greve, B., Jensen, S., Brügger, K., Zillig, W. and Garrett, R.A. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1, 231–239

Received 6 August 2008
doi:10.1042/BST0370023


Contribution: substantial

All CRISPR-related bioinformatics, including the preparation of figures 1, 8 and 9, was carried out by myself.

Molecular Microbiology (2009) 72(1), 259–272; doi:10.1111/j.1365-2958.2009.06641.x
First published online 2 March 2009

CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties

Reidun K. Lillestøl, Shiraz A. Shah, Kim Brügger,† Peter Redder, Hien Phan, Jan Christiansen and Roger A. Garrett*
Centre for Comparative Genomics, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen N, Denmark.

Accepted 14 February, 2009. *For correspondence. E-mail garrett@bio.ku.dk; Tel. (+45) 35322010; Fax (+45) 35322228. †Present address: Wellcome Trust Sanger Institute, Hinxton, UK.

Summary

Clusters of regularly interspaced short palindromic repeats (CRISPRs) of Sulfolobus fall into three main families based on their repeats, leader regions, associated cas genes and putative recognition sequences on viruses and plasmids. Spacer sequence matches to different viruses and plasmids of the Sulfolobales revealed some bias particularly for family III CRISPRs. Transcription occurs on both strands of the five repeat-clusters of Sulfolobus acidocaldarius and a repeat-cluster of the conjugative plasmid pKEF9. Leader strand transcripts cover whole repeat-clusters and are processed mainly from the 3′-end, within repeats, yielding heterogeneous 40–45 nt spacer RNAs. Processing of the pKEF9 leader transcript occurred partially in spacers, and was incomplete, probably reflecting defective repeat recognition by host enzymes. A similar level of transcripts was generated from complementary strands of each chromosomal repeat-cluster and they were processed to yield discrete ~55 nt spacer RNAs. Analysis of the partially identical repeat-clusters of Sulfolobus solfataricus strains P1 and P2 revealed that spacer-repeat units are added upstream only when a leader and certain cas genes are linked. Downstream ends of the repeat-clusters are conserved such that deletions and recombination events occur internally.

Introduction

Clusters of regularly interspaced short palindromic repeats (CRISPRs) consist of identical repeats separated by unique spacer sequences of constant length. They are present in the sequenced chromosomes of almost all archaea and about 40% of bacteria, as well as in some plasmids (Lillestøl et al., 2006; Grissa et al., 2007; Sorek et al., 2008). The original observation that some spacers show close sequence matches to viral genomes and plasmids (Mojica et al., 2005) led to the hypothesis that spacer regions are incorporated into the chromosome from the extra-chromosomal element and have a regulatory or inhibitory effect on their propagation (Bolotin et al., 2005; Mojica et al., 2005; Pourcel et al., 2005; Lillestøl et al., 2006). Recently, this hypothesis was reinforced experimentally for bacteria by showing that new spacers deriving from phage genomes integrate into CRISPRs of Streptococcus thermophilus in response to phage infection, which in turn leads to phage resistance (Barrangou et al., 2007; Deveau et al., 2008; Horvath et al., 2008a). In both archaea and bacteria, new spacer-repeat units are added at the end of the repeat-clusters adjoining a low complexity leader sequence (Jansen et al., 2002; Tang et al., 2002; Pourcel et al., 2005; Lillestøl et al., 2006; Barrangou et al., 2007), presumably facilitated by Cas proteins which are generally encoded adjacent to the clusters (Jansen et al., 2002; Haft et al., 2005; Makarova et al., 2006).

Despite the akaryotic nature of the CRISPR system (Forterre, 1992), there are significant differences between the archaeal and bacterial systems studied so far. First, archaeal repeat-clusters tend to be very extensive and can constitute more than 1% of the chromosome (Lillestøl et al., 2006). Second, they often exhibit a low level of, or no, dyad symmetry in their repeat sequences (Lillestøl et al., 2006; Kunin et al., 2007). Third, some of the cas genes implicated in RNA processing and spacer sequence insertion are highly divergent between archaea and bacteria (Haft et al., 2005). Fourth, the many archaeal spacer sequences which match plasmids or viruses show no clear bias to viruses (Shah et al., 2009).

A mechanism for the putative regulatory or inhibitory effect in both euryarchaea and crenarchaea was suggested, at an early stage, by the finding that RNA transcripts are produced, and processed, from one strand of archaeal repeat-clusters (Tang et al., 2002; 2005), with the smallest product corresponding approximately in both size and sequence to a single spacer transcript (Lillestøl et al., 2006). Furthermore, it was demonstrated

experimentally for a bacterium that a complex of Cas proteins was responsible for processing in the repeats to generate the small RNAs encompassing the spacer regions (Brouns et al., 2008) and for the euryarchaeon Pyrococcus furiosus, it was shown that the Cas6 protein binds to the 5′-end of the repeat transcript and cuts, by a putative ruler mechanism, within the 3′-end (Carte et al., 2008). In the crenarchaeon Sulfolobus acidocaldarius, evidence was also presented for transcription occurring from the complementary strand of the DNA spacer (Lillestøl et al., 2006). These results opened up the possibility of an antisense RNA or RNAi-like mechanism acting either on the viral/plasmid transcripts or directly on their DNA (Lillestøl et al., 2006; Makarova et al., 2006). Recent studies on P. furiosus have shown that the leader strand spacer RNAs can generate distinct RNA–protein complexes (Hale et al., 2008). Moreover, bioinformatical studies on crenarchaeal CRISPRs (Shah et al., 2009), as well as experimental studies on bacteria (Brouns et al., 2008; Marraffini and Sontheimer, 2008), support the view that spacer RNAs directly target DNA of extra-chromosomal elements, rather than their mRNAs.

Here, we characterize CRISPR families of the model crenarchaeal genus Sulfolobus, and related members of the Sulfolobales, for which several genomes and numerous viruses and plasmids have been sequenced (Prangishvili et al., 2006; Brügger, 2007). The families are classified on the basis of their repeat sequences, leader region motifs, associated cas genes, and conserved dinucleotide motifs adjoining spacer matching sequences on viruses and plasmids. Properties of transcripts from each strand of repeat-clusters in S. acidocaldarius chromosomes and the conjugative plasmid pKEF9 are examined, as well as the possible formation of double-stranded spacer RNAs. Moreover, sequencing and bioinformatical analyses of the six repeat-clusters of Sulfolobus solfataricus strains P1 and P2 were performed and conclusions are drawn concerning the dynamics of repeat-cluster development and functions of the different CRISPR families.

Results

CRISPR families in Sulfolobales

The repeat-clusters of the Sulfolobales are quite diverse structurally and we attempted to classify them into families on the basis of their repeat sequences, leader properties, associated cas genes and conserved sequences adjoining spacer sequence matches on viruses/plasmids. A total of 48 complete and eight incomplete repeat-clusters were identified for the Sulfolobales, of which at least 51 carried putative leader sequences, and they yielded 3685 spacer sequences. Phylogenetic tree building based on repeat sequences revealed three main families and some minor ones where family I dominates (Fig. 1A). Each species typically carries representatives of two repeat families (Fig. 1A). Analyses of all cas genes associated with the repeat-clusters of the Sulfolobales reinforced the family divisions. A phylogenetic tree built from alignments of the most conserved cas1 gene, encoding a predicted integrase or nuclease (Makarova et al., 2006), yielded essentially the same family tree as in Fig. 1A (data not shown). Moreover, in an all-against-all comparison of cas genes adjoining repeat-clusters, each gene generally yielded best matches to other genes of the same family, despite family I genes being overrepresented (data not shown).

For the leader regions, alignment of 300 bp of each sequence revealed a large fairly conserved downstream region which carries multiple distinct sequence motifs, most of which are specific for a given CRISPR family (Fig. 1B). Moreover, these classes of motifs show significant levels of sequence conservation despite some of them exhibiting low sequence complexity. Of these, motif A carries 70% adenines, motif B exhibits 95% purines, motifs C, F, G and J contain 50–60% thymines, while motifs D, H and I are more complex. For all families some motifs are repeated (Fig. 1B). We infer that these motifs are likely to provide, directly or indirectly, assembly sites for Cas proteins involved in processing RNA and/or in extending the repeat-clusters. Lastly, examination of spacer sequence matches on viruses/plasmids of the Sulfolobales revealed

Fig. 1. Family classification of Sulfolobales CRISPRs. A. Phylogenetic tree based on a multiple alignment of repeat sequences showing three main families I, II and III. CRISPRs are labelled by a four-letter prefix denoting the species, and the number of repeats. B. Motif maps for leader regions of the three main CRISPR families. The motifs constitute conserved sequences, 30–100 bp in length, showing on average 80% sequence identity. Sequence motifs A, B and C occur in more than one family [motif C occurs in some unclassified leaders (Fig. 1A)], whereas the other motifs are family specific. Thus motifs D, E, F and G occur only in family I leaders, motifs H and I are present only in family II leaders and motif J is exclusive to family III leaders. Leaders of each family show some variation in the number and order of the motifs present. Motif A overlaps with the transcriptional leader region. C. Logo-plot (http://weblogo.berkeley.edu/) of the motif located immediately upstream of the spacer match on viral/plasmid genomes where CC predominates in 129 matches for family I, TC in 23 matches for family II, and GT in 19 matches for family III CRISPRs, where one bit corresponds to about 75% presence and two bits correspond to 100%. The logo plots are based exclusively on spacer matches which show a maximum of five nucleotide mismatches.
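The tally behind a logo plot such as Fig. 1C can be illustrated with a short Python sketch. It is only a minimal illustration, not the authors' pipeline: the function name `upstream_dinucleotide_counts` is hypothetical, and the flanking sequences are invented placeholders rather than data from this study. It assumes each flank is written 5′ to 3′ and ends at the base adjoining the spacer match.

```python
from collections import Counter


def upstream_dinucleotide_counts(flanks):
    """Count the dinucleotide immediately upstream of each spacer match.

    Each entry in `flanks` is the sequence 5' of a match, written 5'->3',
    so its last two bases are the ones adjoining the match.
    """
    return Counter(f[-2:].upper() for f in flanks if len(f) >= 2)


# Hypothetical flanking sequences, invented for illustration only.
example_flanks = ["ATGCC", "TTACC", "GGTCC", "CATTC"]

counts = upstream_dinucleotide_counts(example_flanks)
dominant, n = counts.most_common(1)[0]
print(dominant, n / sum(counts.values()))
```

With real data, one such tally would be made per CRISPR family, using only matches below the chosen mismatch threshold.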

© 2009 The Authors. Journal compilation © 2009 Blackwell Publishing Ltd, Molecular Microbiology, 72, 259–272

[Fig. 2A, diagram. Labels recovered from the drawing: Saci-133 and Saci-78 (L, cas-genes, cas1, cas3, Saci_1871, Saci_1882; distances 6.5 kb, 3.8 kb and 8.13 kb); Saci-11 and Saci-2 (L, L, cas1, cas4, csa3, Saci_2016); Saci-5 (L, Saci_1974, Saci_1975); pKEF-7 (CAG38159, CAG38160). Scale bar: 1 kb.]

[Fig. 2B]

Repeat cluster   Repeat sequence
Saci-133/78      GTAATAACGACAAGAAACTAAAAC
Saci-11/2        GATGAATCCCAAAAGGGATTGAAAG
Saci-5           A T (as printed; the remaining positions are not shown)
pKEF-7           GTTGCAATTCCCTAAATGTGCGGG
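The presence or absence of an inverted repeat in the sequences listed for Fig. 2B can be checked with a small Python sketch. It is a minimal illustration under stated assumptions: the function names, the minimum stem length of 4 and the minimum loop of 3 nt are choices made here, not parameters from the paper. The function reports the longest perfect stem it finds, which need not coincide exactly with the stem annotated by the authors.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_inverted_repeat(seq, min_stem=4, min_loop=3):
    """Find the longest perfect inverted repeat in `seq`.

    Returns (stem_length, left_start, right_start) using 0-based indices,
    or None if no stem of at least `min_stem` bases exists. The two arms
    must be separated by at least `min_loop` bases.
    """
    n = len(seq)
    # Try the longest possible stems first so the first hit is the longest.
    for stem in range(n // 2, min_stem - 1, -1):
        for i in range(n - 2 * stem - min_loop + 1):
            target = reverse_complement(seq[i:i + stem])
            j = seq.find(target, i + stem + min_loop)
            if j != -1:
                return stem, i, j
    return None


# Repeat sequences as listed for Fig. 2B.
FAMILY_II_REPEAT = "GATGAATCCCAAAAGGGATTGAAAG"   # Saci-11/2
FAMILY_III_REPEAT = "GTAATAACGACAAGAAACTAAAAC"   # Saci-133/78

print(longest_inverted_repeat(FAMILY_II_REPEAT))
print(longest_inverted_repeat(FAMILY_III_REPEAT))
```

Under these assumptions a stem is reported for the family II repeat and none for the family III repeat, in line with the description of the latter as non-palindromic.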

Fig. 2. A. Diagram showing the genomic context of the S. acidocaldarius repeat-clusters, and of the pKEF9 cluster. Saci-133 and Saci-78 are physically linked on the chromosome, as are Saci-11 and Saci-2. Saci-5 and the plasmid cluster pKEF-7 are separate. L denotes the leader region. Identities of genes bordering the clusters or their GenBank/EMBL assignments are given and their directions of transcription are indicated. B. Repeat sequences where inverted repeat sequences are underlined, and experimentally identified processing sites are marked with ‘ ’s.

conserved upstream dinucleotide motifs: CC for family I, TC for family II and GT for family III which may direct DNA incorporation into CRISPRs (Fig. 1C). These may constitute an archaeal parallel to the AGAAA and GGNG motifs located downstream from bacterial proto-spacers of S. thermophilus (Horvath et al., 2008a).

Genome contexts of the repeat-clusters

The S. acidocaldarius chromosome carries five repeat-clusters with 133, 78, 11, 5 and 2 repeats which fall into CRISPR families II and III (Fig. 1A). Saci-133 and Saci-78 (family III) are physically linked, with shared cas genes. They exhibit 95% identical leader sequences adjoining the first repeat and carry identical, non-palindromic repeats (Fig. 2A and B). Saci-11 and Saci-2 (family II) are physically linked by cas genes (Fig. 2A) and carry identical leader sequences while the more divergent Saci-5 (family II) exhibits a repeat with two base pair changes and a leader sequence showing 75% sequence identity (Chen et al., 2005; Lillestøl et al., 2006). All of the family II repeats carry a 5 bp inverted repeat (Fig. 2B). Repeat-clusters Saci-5 and Saci-2 each carry a degenerate repeat, distal to the leader region. The repeat-cluster (pKEF-7) of conjugative plasmid pKEF9 carries no leader sequence and no associated cas genes (Fig. 2A) (Greve et al., 2004).

Repeat-clusters generate single transcripts covering the whole cluster

In order to investigate transcripts formed during the growth cycle, RNA was extracted from S. acidocaldarius, and from S. solfataricus P2 conjugated with pKEF9, harvested at different stages of exponential growth and, for the former, stationary phase. Oligonucleotide probes complementary to spacers of the repeat-clusters were tested in Northern blot analyses. Initially, Saci-78 transcripts were probed for spacer 4, adjacent to the leader region, and the results demonstrate that processing increased progressively as stationary phase was approached (Fig. 3). The maximum transcript size, about 5000 nt, exceeds the size of the 4624 bp repeat-cluster, indicating that the whole cluster was transcribed (Fig. 3). However, the majority of detected transcripts fall in the size range 3000–3500 nt suggesting that endogenous degradation, processing or premature termination had occurred towards the 3′-end. Evidence for the formation of whole transcripts was also obtained for each of the small repeat-clusters Saci-5 (Lillestøl et al., 2006), Saci-11 and Saci-2 (data not shown), and pKEF-7 (see below).

Fig. 3. Northern blot of Saci-78 transcripts using an oligonucleotide probe against spacer 4. Ten micrograms of RNA was isolated from S. acidocaldarius cells harvested at: (1) early log phase, (2) late log phase, (3) early stationary phase and (4) late stationary phase. RNA size markers (0.5–9 kb) were co-electrophoresed and excised from the gel prior to RNA blotting. [Lanes 1–4 (exponential and stationary phase) and M; marker positions shown from 1.0 to 6.0 kb.]

Transcription from the leader strand

In order to test whether transcription initiated at single or multiple sites, we determined start sites at the leader of Saci-133 and Saci-5 by identifying RNA fragments carrying 5′-terminal triphosphates using tobacco acid phosphatase in 5′-RLM RACE procedures. The results demonstrate that start sites occur immediately upstream from the first repeat sequence for both repeat-clusters (Fig. 4A and B) and the start sites are preceded upstream by archaeal BRE/TATA motifs (Torarinsson et al., 2005). For Saci-133, transcription initiated at the sequence GATGG, 17 nt upstream from the first repeat (Fig. 4A; Table 1). An identical sequence/motif pattern occurs for Saci-78. A different pattern was found for the family II clusters Saci-11, Saci-5 and Saci-2 where transcription initiates at the sequence AAGGG, 21 nt upstream from the first repeat and is also preceded by archaeal promoter motifs (Fig. 4B; Table 1).

Fig. 4. Determination of the transcriptional start sites, and processing sites of RNA products generated from Saci-133 and Saci-5 using the 5′-RLM RACE and 3′-RLM RACE procedures. A. Determination of 5′-ends of transcripts from Saci-133 where RNA was treated with (+) and without (-) tobacco acid phosphatase to remove 5′-phosphates from the 5′-end of the initial transcript, and an oligonucleotide primer specific for spacer 7 was employed. Bands exclusive to the + lane retain the transcriptional start site whereas bands present in both + and - lanes represent transcripts which have been processed at the 5′-end. B. Determination of 5′-ends of transcripts from Saci-5 using an oligonucleotide primer specific for spacer 1. The band showing in the + lane of about 160 bp is an artefact, where sequencing revealed that two adapters had ligated to each other and to the start site. C. 3′-RLM RACE experiment performed on transcripts from Saci-133 using a primer specific for spacer 130. The three bands represent transcripts which have been processed within the terminal repeats 131, 132 and 133. In each experiment processing sites are indicated by number of the repeat (from the leader) where the position of the 5′-nucleotide within the repeat is given in brackets.

Table 1. Overview of promoters, transcriptional start sites and processing sites in repeat-clusters Saci-133, Saci-5 and pKEF-7 identified by the 5′-RLM RACE method.

Cluster    BRE-TATA          Start   Distance from first repeat   Processing sites
Saci-133   GAAAATATTTATAAA   GATGG   +17 nt                       4 (9), 5 (3), 5 (23), 6 (3), 6 (18)
Saci-5     GCAAAAGTTTATTAA   AAGGG   +21 nt                       1 (8), 1 (17)
pKEF-7     GAAAAAGTTTATTA    AATCT   +32 nt                       +23, 1 (24)

Putative BRE and TATA motif sequences are located approximately 25 bp upstream from transcription start sites and the processing sites within the repeats (numbered from the leader region) give the position of the 5′-nucleotide in brackets.

We probed for transcripts initiating at the leader of Saci-133 using oligonucleotides complementary to spacers 5, 6, 59 and 131. Strong signals were obtained for each spacer (Fig. 5A) consistent with the whole cluster being transcribed in fairly high yield, as was demonstrated for Saci-78 (Fig. 3). The low level of larger transcripts detected with probes against spacers 59 and 131 suggests that transcript processing occurs primarily from the 3′-end (Fig. 5A). Probing of the repeat also revealed a similar series of bands except for the smallest RNAs (Fig. 5A).

[Fig. 5, panels A and B: lanes M1, repeat, 133(5), 133(6), 133(59) in A or 133(60) in B, 133(131), M2; marker positions from 30 to 500 nt; smallest bands labelled 40–50 in A and 50–60 in B.]

The larger products observed with both spacer and repeat probes correspond in size to transcripts of multiple repeat-spacer units (Fig. 5A) whereas the smallest products seen when using spacer probes range in size from a single repeat-spacer unit (62–68 nt) to the spacer (40 nt) suggesting that progressive exoribonuclease trimming occurs within repeats flanking the spacer, consistent with the inability to detect the smallest RNAs with the repeat probe (Fig. 5A) and the earlier observation for Saci-5 (Lillestøl et al., 2006). Saci-11 and Saci-2 were also probed with spacer-specific oligonucleotides, and Northern blots yielded closely comparable patterns (data not shown).

5′-RLM RACE analyses of Saci-133 and Saci-5 also revealed processing sites (Fig. 4A and B). Multiple processing sites were identified throughout the repeats for Saci-133 but confined to the inverted repeat for Saci-5 at positions 8 and 17 (Table 1; Fig. 2B).

3′-Termini of Saci-133 transcripts were also determined by the 3′-RLM RACE method employing a probe against spacer 130. Three main bands were produced which, on sequencing, revealed processing sites distributed throughout terminal repeats 131, 132 and 133, at positions 10, 11 and 22 (Fig. 4C). The absence of further downstream bands suggested that the transcript terminus had been efficiently excised.

In order to confirm that processing occurred exclusively within repeat sequences, Saci-133 transcripts on one membrane were probed, successively, by spacer 5-specific, and then repeat-specific, probes. Both probes yielded similar patterns for the larger transcripts but the smallest transcripts were only detected with spacer-specific probes (Fig. 5A). Thus, the final processing step occurs in the repeat leaving the spacer intact.
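The size comparison used above, in which larger bands are read as multiples of a repeat-spacer unit, can be sketched in Python. This is a rough illustration only. The 24 nt repeat length is counted from the Saci-133/78 sequence in Fig. 2B and the spacer lengths of 37 and 42 nt are those given for spacers 5 and 6 in the Fig. 5 legend; the function name is hypothetical, and the sketch assumes that every repeat is cut at the same position and ignores any exonucleolytic trimming.

```python
from itertools import accumulate


def intermediate_sizes(repeat_len, spacer_lens):
    """Idealised sizes (nt) of transcripts spanning 1, 2, 3, ... units.

    A unit is one repeat plus the adjoining spacer. Cutting every repeat
    at the same position gives fragments of exactly these lengths.
    """
    units = [repeat_len + s for s in spacer_lens]
    return list(accumulate(units))


# Repeat length from Fig. 2B; spacer lengths for spacers 5 and 6 (Fig. 5 legend).
sizes = intermediate_sizes(24, [37, 42])
print(sizes)  # [61, 127]
```

Observed bands would be expected to scatter around such values, since the processing sites reported here are distributed over several positions within the repeats.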


Fig. 5. Northern blot analyses of Saci-133 transcripts. The repeat sequence and spacers at positions 5 (37 nt), 6 (42 nt), 59 (41 nt) and 131 (36 nt) from the leader, were probed with oligonucleotides to detect: (A) transcripts initiating within the leader sequence, and (B) transcripts generated from the complementary strand. Twenty microgram RNA was isolated from cells grown to stationary phase. RNA size markers of 10–150 nt (M1) and 100–2000 nt (M2) are aligned approximately with the transcript lanes.

Complementary strand is transcribed

In a preliminary experiment, we demonstrated that Saci-5 transcripts are produced from both DNA strands (Lillestøl et al., 2006). As this raised the possibility that dsRNA intermediates could be formed, we studied these effects more systematically for Saci-133, Saci-78, Saci-11, Saci-5

and Saci-2. Transcripts from the complementary DNA strand of Saci-133 were probed for spacers 5, 6, 60 and 131 (numbered from the leader). Each showed strong signals (Fig. 5B) but they differed from those of leader strand transcripts in that products were less regular in size and larger transcripts prevailed. Nevertheless, the smallest product for each spacer probe was a discrete band of about 55 nt (Fig. 5B). Similarly sized RNAs were observed when probing each of the other four S. acidocaldarius repeat-clusters (data not shown), consistent with the earlier observation for Saci-5 (Lillestøl et al., 2006). These small RNAs must contain all or most of the spacer sequence because the strong band observed with each spacer probe was not detected with a repeat probe (Fig. 5B).

Northern analyses of each of the chromosomal repeat-clusters indicated strong signals for all tested spacer probes, indicating that transcription from the complementary strand occurred throughout each cluster, as is illustrated for Saci-133 (Fig. 5B). This result was reinforced by a Northern blot analysis in which the complementary strand transcripts from the Saci-5 cluster were probed for spacer 1, adjacent to the leader region, and the largest transcript (430 nt) exceeds the minimal size of the repeat-cluster (300 bp) (Fig. 6). Moreover, each of the five clusters carries at least one putative promoter BRE/TATA motif within 50 bp of the terminal repeat of the repeat-cluster. In addition, there are no open reading frames (ORFs) within at least 3 kb of the putative promoters, on the complementary DNA strand, for any chromosomal repeat-clusters, except Saci-2 (Fig. 2B).

Fig. 6. Northern blot analysis of transcripts from the complementary strand of Saci-5, probing for spacer 1, adjacent to the leader region. RNA size markers of 100–2000 nt (M1) are aligned approximately with the transcript lanes. The size of the smallest transcript was estimated using an independent, co-electrophoresed size marker as shown in Fig. 5. [Lanes M and Saci-5; labelled bands at 430, 190 and 52 nt.]

A comparison of transcript yields from the leader and complementary strands indicated qualitatively similar expression levels from both strands of Saci-133 (Fig. 5A and B). This is difficult to quantify accurately because of the complexity and diversity of the RNA fragment patterns (Fig. 5A and B) but qualitatively similar transcription levels were observed for all five repeat-clusters.

The possibility that functional dsRNAs were generated between spacer transcripts from each DNA strand was tested by a ribonuclease digestion approach using ssRNA-specific enzymes RNase T1 and RNase U2 which cleave preferentially 3′- to G and A residues respectively, but do not cleave regular dsRNA (Christiansen et al., 1990). Total RNA from S. acidocaldarius was treated with increasing concentrations of each ribonuclease and Northern blots were obtained by probing for spacer 6 of Saci-133 transcripts from each strand. The results revealed progressive cleavage of both the leader and complementary strand transcripts at increasing ribonuclease concentrations but no resistant dsRNA band of about 40 bp was detected (data not shown). This may reflect that specific protein–ssRNA complexes form as was shown for the leading strand spacer RNA of P. furiosus (Hale et al., 2008).

pKEF-7 transcripts are processed in both repeats and spacers

Despite its lack of associated cas genes and leader region, we considered the pKEF9 repeat-cluster to be a CRISPR system because three of the six spacers match to Sulfolobus viruses, spacer 3 to rudiviruses and spacers 5 and 6 to fuselloviruses. This is consistent with the conjugative plasmid regulating the viruses intracellularly. Therefore, RNA was isolated from S. solfataricus P2 14 h after conjugating with pKEF9 before plasmid levels rapidly decline. For the predicted leader strand (Fig. 2A), 5′-ends were determined by 5′-RLM RACE analyses using a primer specific for spacer 1. The results revealed a single transcript start site, 32 nt upstream from the first repeat, preceded by promoter motifs (Table 1). Processing sites were also identified 23 nt upstream from the first repeat, and at the junction of the first repeat and spacer (Fig. 7A; Table 1). Northern blotting experiments were then performed probing each half of each spacer sequence, as well as the repeat (Fig. 7B). The results for each probe revealed a largest product of about 465 nt, corresponding in size to a transcript from the whole repeat-cluster. The transcript patterns indicated that smaller products disappeared, stepwise, as one probed along the transcript in a 5′ to 3′ direction (Fig. 7B). The experiment was repeated, after

[Fig. 7, panels A, B and C. A: lanes M, - and +, with the start site (+32) and processing sites +23 and 1(24) marked. B: lanes 1L, 1R, 2L, 2R, 3L, 3R, 4L, 4R, 5L, 5R, 6L, 6R, Rep and M; labelled bands at 465, 410, 345, 285, 245, 210, 183, 165, 148, 139, 104 and 95 nt. C: lanes 6, Rep and M; labelled band at 145 nt.]

Fig. 7. Transcription from the pKEF-7 cluster. A. 5′-RLM RACE analyses of the transcriptional start site and processing sites near the start of the transcript. RNA was treated with (+) and without (-) tobacco acid phosphatase. B. Northern blot analyses of transcripts from the pKEF-7 cluster using oligonucleotide probes specific for the left (L) and right (R) halves of spacers 1, 2, 3, 4, 5, 6 and the repeat sequence respectively. C. Northern blot analyses of transcripts from the complementary strand from the pKEF-7 cluster using oligonucleotide probes specific for spacer 6 and the repeat sequence. RNA was isolated 14 h after conjugation in A, B and C. RNA size markers of 100–500 nt (M) are aligned and approximate fragment sizes are given.

conjugating S. solfataricus P2 for 20 h, when the smaller transcripts observed for spacer 1 were also seen for the other spacers, consistent with increased processing having occurred as stationary phase was approached (data not shown).

The transcript patterns are complicated by the presence of sets of weak and strong signals (Fig. 7B) where the former match those of the S. acidocaldarius clusters (above) in size and putative processing in repeats (Table 2). For example, the 95 nt and 104 nt transcripts observed for the spacer 1 probe are consistent in size with their extending from the start site, or processing site 9 nt downstream (Fig. 7A, Table 1), to a processing site in repeat 2 (Table 2). For the stronger transcripts, differences were observed when probing each half of the spacer transcripts (Fig. 7B). Probing upstream halves (L) revealed smaller fragments than probing downstream halves (R), seen most dramatically for probes against spacer 2 (139–148 nt) and spacer 3 (210 nt) (Fig. 7B). The strong transcripts are consistent in size with their extending from the initiation, or downstream processing, site to the downstream (R) spacer halves (Table 2).

Table 2. Summary of transcriptional start sites and estimated sizes and processing sites for transcripts deriving from pKEF-7 as illustrated in Fig. 7B.

Transcript (nt)     Start      Stop
Weak (normal)                  Repeat (position)
95/104              +23/+32    2 (9)
165/183             +23/+32    3 (22)
245                 +23/+32    4 (19)
Strong (abnormal)              Spacer (position)
139/148             +23/+32    2 (29)
210                 +23/+32    3 (25)
285                 +23/+32    4 (35)
345                 +23/+32    5 (23)
410                 +23/+32    6 (30)
Weak (abnormal)
145                 1 (24)     3 (16)

Transcripts included in the normal/weak category appear to be processed in the same manner as the S. acidocaldarius clusters. Processing sites are localized by the repeat number and the estimated nucleotide position in brackets. Transcripts in the abnormal category are processed in the right half of spacers (position denoted by the spacer number and the estimated nucleotide position in brackets). +32 denotes the transcriptional initiation site and +23 and 1 (24) indicate processing sites identified by the 5′-RLM RACE method.

The repeat probe yielded a similar transcript pattern as for spacer 2 and differed from that for spacer 1 (Fig. 7B) probably because of non-annealing of the primer to the degenerate first repeat. This was reinforced by the lack of processing in the first repeat sequence, as determined by the 5′-RLM RACE method (Fig. 7A). Transcripts from the complementary DNA strand were also detected probing for spacer 6 and the repeat sequence (Fig. 7C), and transcripts in the size range 185–480 nt were observed for the former and 145–480 nt for the latter, similar in size to transcripts observed from the leader strand. The absence of spacer-sized RNAs from either DNA strand could reflect that the final RNA processing enzymes are activated mainly in stationary phase (Fig. 3) or incompatibility


Fig. 8. Patterns of repeat-spacer units in repeat-clusters A–F of S. solfataricus strain P1 are aligned with those from S. solfataricus strain P2 (She et al., 2001), where each arrowhead represents a single spacer-repeat unit, and the number to the right indicates the total number of units. Grey boxed regions indicate sequences that are identical for a given pair of clusters. Blackened units lie within these conserved regions but yield no matches to viruses/plasmids. Spacers which yield significant matches to viruses or plasmids are colour-coded as indicated on the figure. Boxes to the left of the clusters represent leader regions that are coloured according to the leader family, blue for family I and purple for family II (Fig. 1B). The larger arrowhead in cluster D of strain P1 represents an 899 bp pNOB8-like fragment, and the large arrowhead in cluster F denotes a 106 bp insert with two atypical repeat sequences and abnormal spacer regions. Preliminary data on clusters B, C and E were presented earlier (Lillestøl et al., 2006).

between processing enzymes and the plasmid repeat sequence (Carte et al., 2008).

Functional properties of the CRISPR families

For S. acidocaldarius the 297 spacer sequences yield only 44 (15%) significant matches to virus/plasmid sequences, relatively few compared with up to 40% for other Sulfolobales genomes (Lillestøl et al., 2006; Shah et al., 2009). Therefore, to gain more insight into the functional diversity of different CRISPR families, we completed the sequencing of the six repeat-clusters A–F of S. solfataricus strain P1 because, although repeat-clusters B, C and E share regions of perfectly conserved spacer-repeat sequences with S. solfataricus strain P2, they also yielded many additional virus/plasmid sequence matches (Lillestøl et al., 2006). The primary structures of repeat-clusters A–F of strain P1 are displayed together with those of strain P2 (She et al., 2001) in Fig. 8, where the locations and distributions of virus/plasmid matches are indicated.

Repeat-clusters A and B represent family II CRISPRs, while C, D, E and F belong to family I (Fig. 1A). Each repeat-cluster of strains P1 and P2 shares identical regions of sequence enclosed in grey boxes (Fig. 8). While cluster pairs E and F are identical, the others all show evidence of repeat-spacer units having been added at the leader region, after separation of the strains, although the repeat-cluster sizes and apparent rates of extension differ greatly. Repeat-clusters B from strain P1 and D from strain P2 show evidence of putative deletions of 21 and 45 repeat spacer units respectively, and there


are minor differences within the conserved regions of cluster A. Moreover, cluster A shares a sequence of four repeat-spacer units with cluster B of strain P1, suggesting that homologous recombination has occurred between different clusters of the same family. Importantly, the downstream ends of each pair of repeat-clusters are conserved, which suggests that the clusters lose their repeat-spacer units primarily by internal deletions (Fig. 8).

Cluster D of strain P1 and cluster F of both strains carry anomalous inserts. The former is an 899 bp region showing a significant sequence match to the conjugative plasmid pNOB8 (She et al., 1998) while the latter region carries a degenerate repeat-spacer region with a different repeat sequence and an abnormally sized spacer, possibly also of plasmid origin.

The absence of newly added repeat-spacer units to cluster F, and the lack of a leader region (Fig. 8), raised the question as to whether the cluster was active. Therefore, we probed for spacer 11 of cluster F of strain P2 using a Northern blot analysis. A similar fragment pattern was obtained as for the S. acidocaldarius clusters (Fig. 5A) except that small spacer RNAs (< 66 nt) were absent (data not shown). This indicated, as for pKEF-7 (Fig. 7B), a defective final processing stage which could be caused by the lack of a leader region and/or the absence of some physically linked cas genes.

The proportion of spacers with significant matches to viruses/plasmids was 39% and 38% for strains P1 and P2, which carry a total of 431 and 417 spacers respectively. The colour coding of the matches (Fig. 8) reveals some apparent biases. For example, there is a high proportion of bicaudaviral matches in the newly added spacers of cluster D (family I), for both strains, which contrasts with the high proportion of rudiviral matches in clusters A and B (family II) and suggests that individual CRISPR families exhibit a preference for certain extra-chromosomal elements. To test this hypothesis further, data for significant spacer sequence matches to viruses/plasmids for the three main CRISPR families of the Sulfolobales were summarized in pie plots (Fig. 9). The overall ratios of viral to plasmid spacer matches for families I, II and III are 3.5, 2.0 and 3.5 respectively, suggesting that the family II CRISPRs have a relative bias to plasmids. Although no absolute biases are apparent (Fig. 9), rudiviral matches dominate for family III and conjugative plasmid matches are enhanced for family II CRISPRs. The rudiviruses, lipothrixviruses and conjugative plasmids, which predominate in the pie plot, are all abundant environmentally (Greve et al., 2004; Bize et al., 2008; Vestergaard et al., 2008).

Discussion

Biogenesis of small archaeal RNAs appears to proceed from a full-length single-stranded primary transcript that is cleaved by endoribonucleases as was recently reported for the Cas6 protein in P. furiosus (Carte et al., 2008). This suggests that the mechanism of cleavage in archaea is distinct from the Dicer endoribonuclease-dependent mechanism generating si- and miRNAs in eukarya. However, eukarya also generate small RNAs by Dicer-independent mechanisms such as seen for piRNA-like species, and although the mechanism of biogenesis of the latter in terms of trans-acting factors is unresolved, certain aspects are reminiscent of the process observed in this study. In particular, the presence of an independently processed complementary RNA strand has been reported (reviewed in Klattenhoff and Theurkauf, 2008). As there is no evidence for an RNA-dependent RNA polymerase in Sulfolobus, transcription of the complementary strand is likely to be dictated by the putative promoter elements located immediately downstream from the CRISPR loci. Inspection of downstream elements of all CRISPR clusters in S. acidocaldarius reveals BRE/TATA promoter

Fig. 9. Pie plots for the main CRISPR families I, II and III of the Sulfolobales, where the percentages of spacer sequence matches are given for the different crenarchaeal viral families and plasmid classes, which are colour-coded. Spacer matches investigated for each family: family I (2031 spacers tested, 771 significant matches), family II (710 spacers tested, 230 significant matches) and family III (298 spacers tested, 88 significant matches).
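The spacer counts quoted in the Fig. 9 legend can be turned into match percentages with a few lines of arithmetic. The following is a minimal sketch only: the names `FIG9_SPACER_COUNTS`, `match_percentage` and `summarise` are illustrative and are not taken from the paper, and the only inputs are the tested and matched counts given in the legend.

```python
# Spacer counts quoted in the Fig. 9 legend:
# family -> (spacers tested, significant matches).
FIG9_SPACER_COUNTS = {
    "I": (2031, 771),
    "II": (710, 230),
    "III": (298, 88),
}


def match_percentage(tested: int, matched: int) -> float:
    """Percentage of tested spacers with a significant match (one decimal place)."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return round(100.0 * matched / tested, 1)


def summarise(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Per-family match percentages plus a pooled value under the key 'all'."""
    summary = {
        family: match_percentage(tested, matched)
        for family, (tested, matched) in counts.items()
    }
    total_tested = sum(tested for tested, _ in counts.values())
    total_matched = sum(matched for _, matched in counts.values())
    summary["all"] = match_percentage(total_tested, total_matched)
    return summary


if __name__ == "__main__":
    for family, pct in summarise(FIG9_SPACER_COUNTS).items():
        print(f"family {family}: {pct}% of tested spacers matched")
```

The three tested counts sum to 3039, which agrees with the 3039 spacers mentioned in the Discussion. The virus-to-plasmid ratios (3.5, 2.0 and 3.5) cannot be recomputed this way, because the legend does not give the per-category counts.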

regions, that are likely to initiate full-length complementary strand RNA products, as shown for the Saci-5 cluster (Fig. 6). Further processing of the complementary transcripts is likely to proceed by an endoribonuclease distinct from that generating spacer RNAs from the leader strand transcript, because the ‘handles’ in the repeats must be different given their different RNA sizes (about 55 nt versus 40–45 nt). What is the functional significance of the complementary small RNAs? One possibility is that they neutralize the leader spacer RNAs in the absence of invading extra-chromosomal elements, although we failed to detect dsRNAs in the expected size range, but another possibility is that loading of leader-spacer RNAs onto an Argonaute-containing complex has to proceed via a dsRNA intermediate, as observed for the si- and miRNA pathways. The presence of Argonautes in archaea may facilitate a distinct mode of guide RNA presentation from that seen in bacteria, where there is no evidence of the participation of a complementary RNA strand in CRISPR function (Brouns et al., 2008; Marraffini and Sontheimer, 2008).

Cellular activity of CRISPRs

The observation that more than one CRISPR family is generally present in one organism suggested that they may provide added versatility in regulating or inhibiting invading viruses or plasmids, and this supposition received some support from the finding that putative recognition signals upstream from predicted proto-spacer sequences on viruses/plasmids are different for different CRISPR families (Fig. 1C). Analysis of the repeat-clusters of the two CRISPR families of S. solfataricus strains P1 and P2 revealed biases to bicaudaviruses for family I, and to rudiviruses for family II, CRISPRs (Fig. 8). A study of 3039 spacers from the three main families of all the Sulfolobales also showed significant biases, in particular a preference of family III spacers for rudiviruses (Fig. 9). This supports the idea that the presence of multiple CRISPR families may produce a more versatile response to invading genetic elements.

The results also show that the CRISPR system of S. acidocaldarius is primed to react rapidly to invasion in that the large cluster transcripts are present despite the absence of viruses and plasmids. The system only requires that the RNA processing enzymes are rapidly activated. The observation that processing of the leader transcript strongly increases in the stationary phase (Fig. 3) is also consistent with these cells being more susceptible to external attack.

Generation of spacer RNAs

Transcripts on the leader strand initiate just upstream from the first repeat, independently of the presence of a leader region. Processing occurs primarily from the 3′-end of a single transcript of the whole repeat-cluster, although limited processing also occurs at the 5′-end (Fig. 4A), and repeats are targeted to generate a series of fragments. The small spacer RNAs from exponentially growing and stationary phase cells, ranging in size from 40 to 52 nt and 35 to 52 nt respectively, represent a spectrum of fragments which anneal with spacer-specific, but not repeat-specific probes (Fig. 5A), consistent with earlier observations for archaea and bacteria (Lillestøl et al., 2006; Brouns et al., 2008; Hale et al., 2008). Processing activity initiates mainly at stationary phase, at least for cells lacking extra-chromosomal elements (Fig. 3).

Recently, it was shown that the Cas6 endoribonuclease binds to the 5′-end of a P. furiosus repeat, which can generate a hairpin structure, and cuts near the 3′-end (Carte et al., 2008). This result could explain the anomalous processing of the pKEF-7 transcript (Fig. 7B; Table 2) which exhibits an unusual 3′-terminal repeat sequence (Fig. 2B). Nevertheless, given the wide sequence and secondary structural diversity of repeat RNAs (Peng et al., 2003; Kunin et al., 2007), the enzymes must exhibit a wide range of recognition mechanisms.

Transcripts of the complementary strand were invariably produced from each repeat-cluster and they ranged in size from larger fragments to spacer RNAs of about 55 nt, about 16 nt larger than the probed spacer, and consistent with the earlier observation for Saci-5 (Lillestøl et al., 2006). Although no reproducible RNA expression was observed from the complementary spacer strand for the euryarchaeon P. furiosus (Hale et al., 2008), this could have a technical explanation. For the cDNA libraries only fragments < 50 nt were screened for, and in the Northern blot analysis, the 12% polyacrylamide gels used would not have resolved the large transcripts observed for Sulfolobus (Fig. 5B).

Regular and irregular development of repeat-clusters

The pairs of repeat-clusters E and F from S. solfataricus P1 and P2 are both identical and have not undergone structural changes since the strains diverged (Fig. 8). Cluster E (Ssol-8) carries a family I leader but a degenerate first repeat, which may inhibit cognate enzyme recognition of the repeat and, thereby, subsequent extension of the repeat-cluster. In contrast, cluster F (Ssol-91) lacks a leader sequence which could provide an assembly site for DNA enzymes involved in cluster extension. In addition, clusters E and F lack physically linked cas genes which could be important for DNA insertion functions (Cas1) or RNA processing (Cas2 and Cas4) (Makarova et al., 2006; Beloglazova et al., 2008).

Irregularities in archaeal repeat-clusters are extremely rare (Lillestøl et al., 2006). However, in cluster F of both

strains, a 106 bp region containing a half spacer preceded by two atypical repeat sequences is followed by a regular repeat sequence and no spacer (Fig. 8). These structures maintain the precise size of the spacer-repeat units in the cluster, suggesting that some kind of ruler mechanism regulates the insertion of new spacer-repeat units.

Another exceptional irregularity occurs in cluster D of strain P1, where an 899 bp fragment carrying a pNOB8-like conjugative plasmid sequence (She et al., 1998) is flanked by repeats. Both examples may reflect a mechanistic defect whereby large plasmid regions, the former carrying repeats, have been incorrectly excised and incorporated into the repeat-cluster. Further examination of this region may yield some insight into how DNA is obtained from extra-chromosomal elements.

Mechanism of CRISPR transfer

The commonality of CRISPR families in different Sulfolobus strains suggests that they can be transferred horizontally (Lillestøl et al., 2006; Horvath et al., 2008b) but the mechanism by which this could occur is unclear. For some bacteria it was proposed that large plasmids could carry and transmit the CRISPR apparatus (Godde and Bickerton, 2006) but known crenarchaeal cryptic plasmids are quite small (5–10 kb) and the largest conjugative plasmids are only 40–50 kb (Greve et al., 2004), insufficiently large to carry complex CRISPR systems. One possibility is that the system is transferred by chromosomal conjugation. Both S. acidocaldarius and Sulfolobus tokodaii chromosomes carry encaptured conjugative plasmids where the genes implicated in the conjugative process are maintained (Greve et al., 2004) and for S. acidocaldarius, at least, conjugative transfer of chromosomal DNA has been demonstrated experimentally (Aagaard et al., 1995; Grogan, 1996).

Experimental procedures

Growth of Sulfolobus cells and preparation of DNA

Sulfolobus acidocaldarius cells were grown at 78°C in complex medium containing 2% tryptone (Schleper et al., 1995) and harvested at exponential or stationary phase by centrifuging at 4000 r.p.m. and 4°C for 15 min. Cells of S. solfataricus strains P1 and P2 were grown at 80°C in complex medium containing 2% tryptone (Schleper et al., 1995). Total DNA used for repeat-cluster sequencing was isolated from S. solfataricus strain P1 using DNeasy Kit (Qiagen, Westberg, Germany). Conjugation was initiated by mixing a culture of S. islandicus strain Hi165, which harbours the conjugative plasmid pKEF9, with S. solfataricus P2 cells at a ratio of 1:10 000 at A600 = 0.17 (Schleper et al., 1995). Cells were harvested at 14 h after conjugation and centrifuged at 4000 r.p.m. for 6 min at 4°C. pKEF9 was isolated using the Plasmid Mini Kit (Qiagen) and digested with EcoRI to verify its presence (Greve et al., 2004).

RNA preparation, RNase digestion and Northern blotting

Total RNA from S. acidocaldarius cells, and S. solfataricus cells conjugated with pKEF9, was prepared using Trizol (Invitrogen, Paisley, UK) according to the Invitrogen protocol and treated with DNase I (Applied Biosystems/Ambion, Austin, TX) according to the protocol from Ambion, essentially as used for extracting plant si-RNAs (Sunkar et al., 2005). To detect dsRNA, 20 µg of RNA was treated with various concentrations of RNase T1 (Ambion) in RNase-digestion III buffer (Ambion), and RNase U2 (Sankyo, Japan) in digestion buffer 20 mM Na acetate (pH 4.6), 2 mM MgCl2, 100 mM KCl, at 37°C for 30 min. RNase was inactivated and the RNA was precipitated with 225 µl of RNase inactivation/precipitation solution III (Ambion) together with 150 µl ethanol at -20°C for 1 h or overnight. For Northern blotting of small RNAs, 20 µg RNA was mixed with 10 µl Gel Loading Buffer II (Applied Biosystems/Ambion) and fractionated in a 6–10% polyacrylamide gel containing 7 M urea, 90 mM Tris, 90 mM boric acid, 2 mM EDTA, pH 8.3, together with a 10–150 nt ladder (Decade Marker System, Ambion, Huntingdon, UK) or a 0.1–2.0 kb RNA ladder (Invitrogen). RNA was transferred onto Hybond N+ nylon membranes (GE Healthcare, Amersham, UK) or GeneScreen plus nylon membranes (PerkinElmer Life Sciences, Boston, USA) using the Bio-Rad semidry blotting apparatus (Bio-Rad, Hercules, CA) and 0.5× TBE (45 mM Tris, 45 mM boric acid, 1 mM EDTA, pH 8.3) as the blotting buffer. For Northern blotting with large RNAs, 12 µg RNA was mixed with Northern Max-Gly Sample Loading Dye (Applied Biosystems/Ambion) and fractionated in a 1.5% agarose-BPTE (10 mM PIPES, pH 6.5, 30 mM Bis-Tris, 1 mM EDTA) gel, together with a 0.5–9 kb Millennium Marker (Applied Biosystems/Ambion). The RNA was transferred onto Hybond N+ nylon membranes (GE Healthcare) by capillary blotting with 0.2 M NaH2PO4, pH 7.4, 3.0 M NaCl, 0.02 M EDTA. After immobilizing the RNAs using a UV Crosslinker (Stratagene, La Jolla, USA), the nylon membranes were pre-hybridized for 1 h in 6× SSPE buffer (0.9 M NaCl, 60 mM NaH2PO4, 4.6 mM EDTA, pH 7.4), 0.5% SDS and 5× Denhardt’s solution at 5°C lower than the Tm of the probe (TH). Oligonucleotide 24–26-mers complementary to a spacer, or the repeat, on either strand, were end-labelled with [γ-32P]-ATP and T4 polynucleotide kinase. Hybridization was performed at the TH of the probe in 6× SSPE, 0.5% SDS, 3× Denhardt’s solution for 18 h. The samples were washed three times at room temperature with 6× SSPE buffer and 0.1% SDS for 15 min each and, subsequently, at the TH in the same buffer. Membranes were exposed to Ultra UV-G X-ray film (Dupharma, Kastrup, Denmark) for 1 h to 3 days.

Determination of transcript ends

The RLM-RACE kit (Applied Biosystems/Ambion) was used to determine the ends of transcripts generated from repeat-clusters in S. acidocaldarius and pKEF9, with some modifications in the kit-protocol. To identify 5′-ends, 5 µg RNA was treated with tobacco acid pyrophosphatase (TAP) according to the protocol. Both TAP-treated and untreated RNA were

then linked to a 5′-RLM RACE adapter with RNA ligase, followed by reverse transcription from a spacer-specific primer according to the protocol. Products were then amplified by PCR with a 5′-RLM RACE adapter-specific primer containing a BamHI restriction site at the 5′ end and a spacer-specific primer carrying an EcoRI restriction site, in order to facilitate cloning of the PCR products into pUC18. The PCR-products were run on a 2% low melting agarose gel and purified with QIAquick Gel Extraction Kit (Qiagen). The fragments were cloned into BamHI and EcoRI-digested pUC18 at a molar ratio of 4:1 and sequenced.

Sequencing of clusters in S. solfataricus P1

Long range PCR products were obtained across the chromosomal cluster regions of S. solfataricus strain P1 using the Herculase II kit (Stratagene, La Jolla, CA) according to the protocol, with 300 ng genomic DNA in 50 µl reactions. DNA fragments were purified using Qiaquick PCR purification kit (Qiagen) and sequenced. Sequences were analysed with Sequencher (Gene Codes, Ann Arbor, MI). BLAST searches were performed against the Sulfolobus Database (http://sulfolobus.org).

Bioinformatical analysis of CRISPRs of the Sulfolobales

Repeat-clusters were identified using publicly available software (Edgar, 2007; Bland et al., 2007) in all available Sulfolobales genomes (S. solfataricus P2, S. tokodaii 7, S. acidocaldarius DSM 639, Metallosphaera sedula DSM5348 from GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), Sulfolobus islandicus strains LD85, YG5714, YN1551, M164 and U328 from JGI (http://genome.jgi.doe.gov/mic_asmb.html), and S. islandicus strains HVE10/4 and REY15A and Acidianus brierleyi (unpublished data)). Repeat-cluster names identify the species and number of repeats. Repeat-cluster orientations were determined by locating the upstream leader sequence and/or by examining the repeat sequence. Leader sequences, when present, were limited to 300 bp for the multiple alignment analyses (Edgar, 2004) and motif analyses (Bailey et al., 2006). Representative repeat sequences from each identified repeat-cluster were aligned (Edgar, 2004) and a phylogenetic tree was generated (Higgins et al., 1994). Spacer sequences from each repeat-cluster were aligned (Sæbø et al., 2005) against the genomes of extra-chromosomal elements of the Sulfolobales (http://sulfolobus.org/; Brügger, 2007) at a nucleotide level (Shah et al., 2009). Additionally, spacers were aligned against amino acid sequences of annotated ORFs of the extra-chromosomal elements, at an amino acid level (Shah et al., 2009; Vestergaard et al., 2008). Significance cut-offs were determined for both alignment types by using the genome sequence of Saccharomyces cerevisiae as a negative control.

Acknowledgements

The work was supported by grants from the Danish Natural Science Research Council and the Danish National Research Foundation.

References

Aagaard, C., Dalgaard, J., and Garrett, R.A. (1995) Intercellular mobility and homing of an archaeal rDNA intron confers selective advantage over intron− cells of Sulfolobus acidocaldarius. Proc Natl Acad Sci USA 92: 12285–12289.

Bailey, T.L., Williams, N., Misleh, C., and Li, W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 34: 369–373.

Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., et al. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315: 1709–1712.

Beloglazova, N., Brown, G., Zimmerman, M.D., Proudfoot, M., Makarova, K.S., Kudritska, M., et al. (2008) A novel family of sequence-specific endoribonucleases associated with the Clustered Regularly Interspaced Short Palindromic Repeats. J Biol Chem 29: 20361–20371.

Bize, A., Peng, X., Prokofeva, M., Maclellan, K., Lucas, S., Forterre, P., et al. (2008) Viruses in acidic geothermal environments of the Kamchatka Peninsula. Res Microbiol 159: 358–366.

Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C., and Hugenholtz, P. (2007) CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8: 209.

Bolotin, A., Quinquis, B., Sorokin, A., and Ehrlich, S.D. (2005) Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151: 2551–2561.

Brouns, S.J., Jore, M.M., Lundgren, M., Westra, E.R., Slijkhuis, R.J., Snijders, A.P., et al. (2008) Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321: 960–964.

Brügger, K. (2007) The Sulfolobus database. Nucleic Acids Res 35: D413–D415.

Carte, J., Wang, R., Li, H., Terns, R.M., and Terns, M.P. (2008) Cas6 is an endoribonuclease that generates guide RNAs for invader defense in prokaryotes. Genes Dev 22: 3489–3496.

Chen, L., Brügger, M., Skovgaard, M., Redder, P., She, Q., Torarinsson, E., et al. (2005) The genome of Sulfolobus acidocaldarius, a model organism of the Crenarchaeota. J Bacteriol 187: 4992–4999.

Christiansen, J., Egebjerg, J., Larsen, N., and Garrett, R.A. (1990) Analysis of rRNA structure: experimental and theoretical considerations. In Ribosomes and Protein Synthesis. Spedding, G. (ed.). Oxford: Oxford University Press, pp. 229–252.

Deveau, H., Barrangou, R., Garneau, J.E., Labonté, J., Fremaux, C., Boyaval, P., et al. (2008) Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J Bacteriol 190: 1390–1400.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.

Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8: 18.

Forterre, P. (1992) Neutral terms. Nature 355: 305.

Godde, J.S., and Bickerton, A. (2006) The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 62: 718–729.

Greve, B., Jensen, S., Brügger, K., Zillig, W., and Garrett, R.A. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1: 231–239.

Grissa, I., Vergnaud, G., and Pourcel, C. (2007) The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. Bioinformatics 8: 172.

Grogan, D.W. (1996) Exchange of genetic markers at extremely high temperatures in the archaeon Sulfolobus acidocaldarius. J Bacteriol 178: 3207–3211.

Haft, D.H., Selengut, J., Mongodin, E.F., and Nelson, K.E. (2005) A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 1: 474–483.

Hale, C., Kleppe, K., Terns, R.M., and Terns, M.P. (2008) Prokaryotic silencing (psi) RNAs in Pyrococcus furiosus. RNA 14: 1–8.

Higgins, D., Thompson, J., Gibson, T., Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680.

Horvath, P., Romero, D.A., Coûté-Monvoisin, A.C., Richards, M., Deveau, H., Moineau, S., et al. (2008a) Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J Bacteriol 190: 1401–1412.

Horvath, P., Coûté-Monvoisin, A.C., Romero, D.A., Boyaval, P., Fremaux, C., and Barrangou, R. (2008b) Comparative analysis of CRISPR loci in lactic acid bacteria genomes. Int J Food Microbiol doi:10.1016/j.ijfoodmicro.2008.05.030

Jansen, R., Embden, J.D., Gaastra, W., and Schouls, L.M. (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 43: 1565–1575.

Klattenhoff, C., and Theurkauf, W. (2008) Biogenesis and germline functions of piRNAs. Development 135: 3–9.

Kunin, V., Sorek, R., and Hugenholtz, P. (2007) Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol 8: R61.

Lillestøl, R.K., Redder, P., Garrett, R.A., and Brügger, K. (2006) A putative viral defence mechanism in archaeal cells. Archaea 2: 59–72.

Makarova, K.S., Grishin, N.V., Shabalina, S.A., Wolf, Y.I., and Koonin, E.V. (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 1: 7.

Marraffini, L.A., and Sontheimer, E.J. (2008) CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322: 1843–1845.

Mojica, F.J., Diez-Villasenor, C., Garcia-Martinez, J., and Soria, E. (2005) Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60: 174–182.

Peng, X., Brügger, K., Shen, B., Chen, L., She, Q., and Garrett, R.A. (2003) Genus-specific protein binding to the large clusters of DNA repeats (Short Regularly Spaced Repeats) present in Sulfolobus genomes. J Bacteriol 185: 2410–2417.

Pourcel, C., Salvignol, G., and Vergnaud, G. (2005) CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151: 653–663.

Prangishvili, D., Forterre, P., and Garrett, R.A. (2006) Viruses of the Archaea: a unifying view. Nat Rev Microbiol 11: 837–848.

Sæbø, P.E., Andersen, S.M., Myrseth, J., Laerdahl, J.K., and Rognes, T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res 33: 535–539.

Schleper, C., Holz, I., Janekovic, D., Murphy, J., and Zillig, W. (1995) A Multicopy plasmid of the extremely thermophilic archaeon Sulfolobus effects its transfer to recipients by mating. J Bacteriol 177: 4417–4426.

Shah, S.A., Hansen, N.R., and Garrett, R.A. (2009) Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem Soc Trans 37: 23–28.

She, Q., Phan, H., Garrett, R.A., Albers, S.-V., Stedman, K.M., and Zillig, W. (1998) Genetic profile of pNOB8 from Sulfolobus: the first conjugative plasmid from an archaeon. Extremophiles 2: 417–425.

She, Q., Singh, R.K., Confalonieri, F., Zivanovic, Y., Gordon, P., Allard, G., et al. (2001) The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc Natl Acad Sci USA 98: 7835–7840.

Sorek, R., Kunin, V., and Hugenholtz, P. (2008) CRISPR – a widespread system that provides acquired resistance against phages in bacteria and archaea. Nat Rev Microbiol 6: 181–186.

Sunkar, R., Girke, T., and Zhu, J.K. (2005) Identification and characterization of endogenous small interfering RNAs from rice. Nucleic Acids Res 33: 4443–4454.

Tang, T.-H., Bachellerie, J.-P., Rozhdestvensky, T., Bortolin, M.-L., Huber, H., Drungowski, M., et al. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci USA 99: 7536–7541.

Tang, T.-H., Polacek, N., Zywicki, M., Huber, H., Brügger, K., Garrett, R.A., et al. (2005) Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol Microbiol 55: 469–481.

Torarinsson, E., Klenk, H.P., and Garrett, R.A. (2005) Divergent transcriptional and translational signals in Archaea. Environ Microbiol 7: 47–54.

Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., et al. (2008) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J Bacteriol 190: 6837–6845.

© 2009 The Authors. Journal compilation © 2009 Blackwell Publishing Ltd, Molecular Microbiology, 72, 259–272


Contribution: minor

My contribution to this paper was limited to the mapping of matching Sulfolobales CRISPR spacers onto the fuselloviral genomes and preparing Figure 4.

Environmental Microbiology (2009) doi:10.1111/j.1462-2920.2009.02009.x

Four newly isolated fuselloviruses from extreme geothermal environments reveal unusual morphologies and a possible interviral recombination mechanism

Peter Redder,1* Xu Peng,2 Kim Brügger,2 Shiraz A. Shah,2 Ferdinand Roesch,1 Bo Greve,2 Qunxin She,2 Christa Schleper,3 Patrick Forterre,1 Roger A. Garrett2 and David Prangishvili1

1Unite de Biologie Moleculaire du Gene chez les Extremophiles, Institut Pasteur, 25, rue du Dr Roux, F-75015 Paris, France.
2Danish Archaea Centre, Department of Biology, Biocenter, Ole Maaløesvej 5, Copenhagen University, DK-2200 Copenhagen N, Denmark.
3Department of Genetics in Ecology, University of Vienna, Althanstrasse 14, A-1090 Vienna, Austria.

Received 2 April, 2009; accepted 18 June, 2009. *For correspondence. E-mail [email protected]; Tel. (+41) 774000253; Fax (+41) 223795108.

© 2009 Society for Applied Microbiology and Blackwell Publishing Ltd

Summary

Spindle-shaped virus-like particles are abundant in extreme geothermal environments, from which five spindle-shaped viral species have been isolated to date. They infect members of the hyperthermophilic archaeal genus Sulfolobus, and constitute the Fuselloviridae, a family of double-stranded DNA viruses. Here we present four new members of this family, all from terrestrial acidic hot springs. Two of the new viruses exhibit a novel morphotype for their proposed attachment structures, and specific features of their genome sequences strongly suggest the identity of the host-attachment protein. All fuselloviral genomes are highly conserved at the nucleotide level, although the regions of conservation differ between virus-pairs, consistent with a high frequency of homologous recombination having occurred between them. We propose a fuselloviral specific mechanism for interviral recombination, and show that the spacers of the Sulfolobus CRISPR antiviral system are not biased to the highly similar regions of the fusellovirus genomes.

Introduction

In contrast to the rather uniform landscape of virion morphotypes in aquatic systems under moderate environmental conditions, mainly represented by tailed bacteriophages (reviewed by Prangishvili, 2003), virus-like particles observed in ecological niches at high temperatures, low pH or high salinity reveal a high diversity of complex morphotypes (Guixa-Boixareu et al., 1996; Oren et al., 1997; Rice et al., 2001; Rachel et al., 2002; Häring et al., 2005; Porter et al., 2007; Bize et al., 2008). About 40 virus species isolated from such environments, all carrying double-stranded (ds) DNA genomes, have been described, which infect members of the third domain of life, the Archaea (reviewed in Prangishvili et al., 2006a). Most common are viruses with an overall spindle-shaped morphology, either tail-less, tailed or even two-tailed, which taxonomically have been assigned to the viral families Fuselloviridae (SSV1, SSV2, SSV4, SSVrh and SSVk1, single-tailed), Bicaudaviridae (ATV, two-tailed) and the genus Salterprovirus (His 1 and His 2) while some still require classification (STSV1 and PAV1) (Schleper et al., 1992; Bath and Dyall-Smith, 1998; Arnold et al., 1999; Geslin et al., 2003; Wiedenheft et al., 2004; Xiang et al., 2005; Bath et al., 2006; Prangishvili et al., 2006b; Peng, 2008).

Five fuselloviruses have so far been isolated from acidic geothermal environments in different locations in Asia, Europe and North America, and they replicate in species of the hyperthermophilic archaeal genus Sulfolobus, which represents a significant percentage of the microbial population in most acidic terrestrial hot springs. Another major player in these environments is the genus Acidianus, from which several viruses have been isolated, including the linear filamentous and rod-shaped viruses AFV1 and ARV1, respectively, which have close viral relatives that also infect Sulfolobus (Prangishvili et al., 2006a; Snyder et al., 2007). Although the two genera coexist, no fusellovirus has yet been isolated from Acidianus, even though it appears to be the most predominant Sulfolobus viral type.

The circular dsDNA genomes of five known fuselloviruses are highly similar at both nucleotide and amino acid sequence levels, with the majority of gene products being

of unknown function and lacking homologues in public sequence databases other than in other archaeal viruses (Wiedenheft et al., 2004). The viral DNA is protected against the harsh environment, at temperatures above 80°C and pH values below 2, within a spindle-shaped virion about 100 nm long and 60 nm wide, with a bunch of short, thin fibres at one of the pointed ends (Martin et al., 1984; Stedman et al., 2003; Wiedenheft et al., 2004; Peng, 2008). In the electron microscopy, the body is sometimes observed to be slightly elongated and more ‘cigar-shaped’, and the tail fibres appear to be quite sticky, readily attaching to cellular fragments, as well as linking virions to produce rosette-like aggregates (Fig. 1A – SSV7).

SSV1 is the best studied fusellovirus, and the virion has been shown to contain proteins VP1, VP2, VP3 and small amounts of SSV1_D244 and SSV1_C792 (Reiter et al., 1987a; Menon et al., 2008). VP1 and VP3 are thought to be capsid proteins, whereas VP2 has been assigned a DNA-binding role, organizing DNA, but it is not encoded by other fuselloviruses (Stedman et al., 2003; Wiedenheft et al., 2004). Four non-structural SSV1 proteins have been characterized. SSV1_D63 is considered to link two different protein complexes, while SSV1_F93 and

Fig. 1. A. Representative electron microscopy images of SSV6, SSV7 and ASV1. The end-filaments of SSV7 are very sticky, and the virus is almost always observed in ‘rosettes’ or attached to vesicles (white arrows). A rare single SSV7 is also shown (dotted white arrow). SSV6 and ASV1 do not have sticky ends and are always single, even when lying close together. Furthermore, SSV6 and ASV1 exhibit a wide range of morphotypes, varying from the standard spindle shape to an elongated sausage shape (indicated by dotted black arrows for SSV6). B. Magnification of the end-filaments of the three viruses. The filaments of SSV6 and ASV1 are thick, and seem to form a crown around the virus tips (black arrows) whereas SSV7 carries thinner filaments that protrude directly from the virus tips. All samples were negatively stained with 2% uranyl acetate and the scale bars are all 100 nm.


Fig. 2. A. Graphical alignment of the nine circular fuselloviral genomes, linearized at the first nucleotide after the VP3 stop codon (following the convention of Wiedenheft et al., 2004). All ORFs larger than 50 amino acids are indicated by arrows. Shades of blue and green: 13 ‘core’ genes. Dark grey: ORFs found in two or more fuselloviruses. Light grey: ORFs only found in one fusellovirus. Black: VP2. Yellow: SSV1_C792 homologues, both full length and partial. Red: SSV6_B1232 homologues. Orange: SSV1_B78 homologues. Light pink: SSV1_D244 homologues associated with the Integrase operon in all but ASV1 and SSVk1. Dark violet and light violet: Rad3-like helicase and Msed_2283 homologues substituting for a large part of the Integrase operon in ASV1, SSV7 and SSVk1. Dark pink: SSV1_F93 homologues. Brown: Highly conserved SSV1_C84 homologue overlapping with some of the other ‘core’ genes. Magenta: SSV1-C80 homologues and ASV1-A59. The transcripts identified by Fröls and colleagues (2007) are indicated below SSV1. B. The two different putative end-filament modules, exemplified by SSV1 and SSV6.
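The linearization convention mentioned in the Fig. 2 legend (starting each circular genome at the first nucleotide after the VP3 stop codon) amounts to rotating a circular sequence. The following is a minimal sketch of that idea under stated assumptions: the function names and the toy sequence are hypothetical, coordinates are 0-based, and the gene is assumed to lie on the forward strand. It is not the procedure actually used to prepare the figure.

```python
def origin_after_gene(gene_end: int, genome_length: int) -> int:
    """Index of the first nucleotide after a forward-strand gene.

    `gene_end` is the 0-based, inclusive index of the last base of the
    stop codon; the result wraps around the end of the circular genome.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return (gene_end + 1) % genome_length


def linearize_circular(genome: str, origin: int) -> str:
    """Rotate a circular sequence so that index `origin` becomes position 0."""
    if not genome:
        raise ValueError("genome must not be empty")
    origin %= len(genome)
    return genome[origin:] + genome[:origin]


if __name__ == "__main__":
    # Toy circular sequence (hypothetical): a tiny ORF ATG AAA TAA at indices 3-11.
    toy_genome = "GGGATGAAATAACCC"
    start = origin_after_gene(gene_end=11, genome_length=len(toy_genome))
    print(linearize_circular(toy_genome, start))
```

Rotating every genome to the same landmark keeps homologous genes at comparable positions in a graphical alignment; a gene on the reverse strand would need the sequence to be reverse-complemented first, which this sketch does not do.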

SSV1_F112 are DNA binding proteins implicated in transcriptional regulation (Kraft et al., 2004a,b; Menon et al., 2008). The fourth protein is an integrase of the tyrosine recombinase family, which catalyses site-specific integration of the viral genome into the host chromosome. As the viral recombination site (attP) is located within the integrase gene, integration leads to gene partition (Palm et al., 1991; Muskhelishvili et al., 1993). Despite this highly specialized adaptation, the integrase was recently shown to be non-essential for virus replication and basic viral functions (Clore and Stedman, 2007).

Replication of SSV1 and SSV2 can be induced by UV irradiation (Yeats et al., 1982; Stedman et al., 2003). The SSV1 transcription cycle, following UV induction, has also been elucidated by Northern analysis, physical mapping and DNA microarrays, and transcripts were classified as early (T5, T6 and T9), late (T3, Tx and T8) and UV inducible (Tind) (Fig. 2) (Reiter et al., 1987b; Fröls et al., 2007).


The proteins encoded in the early transcripts of SSV1, and their homologues in other fuselloviruses, are often cysteine-rich compared with proteins encoded in the late transcripts (Palm et al., 1991; Stedman et al., 2003; Wiedenheft et al., 2004). This has recently been proposed to be due to intra- and extra-cellular localization of the early and late proteins respectively (Menon et al., 2008).

Here we report on the isolation and properties of four novel members of the Fuselloviridae, infecting species of the hyperthermophilic archaeal genera Sulfolobus and Acidianus, almost doubling the number of known fuselloviruses and extending their host-range to a new genus, Acidianus. This merited a revised comparative genomic analysis of fuselloviruses, which provided insights into functions of some viral proteins and addressed general questions concerning the evolution of the viruses and interactions with their hosts.

Results

Isolation of virus–host systems

Three different methods were used to acquire the new viruses reported in this communication. SSV5 was discovered as an extrachromosomal element within cells of S. solfataricus P2 (DSM1617), infected as a result of mixing the cells with an Icelandic HVE14 enrichment culture (see Experimental procedures). This traditional method of isolating new viruses allows a large number of viruses to be screened, but it restricts the search for specific virus–host systems.

A different approach was used for SSV6 and SSV7, where transmission electron microscopy analysis of the supernatant from an enrichment of the G4 site at Hveragerdi, Iceland, revealed a large number of fuselloviruses. Attempts to isolate single virus–host systems by colony purification resulted in two pure strains, each harbouring a different fusellovirus. Strain G4T-1 was a host for Sulfolobus spindle-shaped virus 7 (SSV7), while strain G4ST-T-11 was the natural producer of a pleiomorphic virus named Sulfolobus spindle-shaped virus 6, SSV6. The former was found to be produced in very low amounts under normal growth conditions, but it was possible to increase SSV7 production about 10-fold (as estimated by counting viral particles in the electron microscope), either by shifting the culture to a medium with lower tryptone concentration, or by inducing with UV light. Strain G4T-1 and G4ST-T-11 may in fact be the same species, as their partial 16S rRNA sequences were identical to S. islandicus strain I7 (AY247894.1) with a single base substitution to distinguish them from S. solfataricus P2. This virus isolation approach did not impose any bias on the choice of viral host (except for choosing the growth conditions), and it provided a ‘natural’ virus–host system. However, a bias is imposed on the virus, which excludes the possibility of isolating single colonies of the host if the virus is highly lytic under the chosen conditions.

Finally, the fourth virus described here, Acidianus spindle-shaped virus 1, ASV1, was discovered as an extrachromosomal and integrated element in the course of sequencing the genome of Acidianus brierleyi DSM1651, and the production of virions was subsequently confirmed by electron microscopy (Fig. 1). While this method for isolating new viruses is not generally applicable, it is likely to become more common that extrachromosomal elements are detected while sequencing genomes from strain collections.

Morphology

The spindle-shaped virion of SSV7, ~90 nm long and ~50 nm wide, resembles virions of all previously known fuselloviruses morphologically, as well as by its tendency to form ‘rosettes’ by sticking to neighbouring viral particles (Fig. 1). In contrast, negatively stained virions of SSV6 and ASV1 both appear much more pleiomorphic than the other fuselloviruses, and assume shapes ranging from thin cigar-like to pear-like, with tail fibres at the end corresponding to where the pear ‘stalk’ would be (Fig. 1).

The virion bodies and tail fibres of ASV1 and SSV6 seem to differ from those of the other fuselloviruses. Instead of multiple thin fibres, these virions carry 3 or 4 thicker and slightly curved fibres that appear to protrude sideways, not from the particle apex but from a point slightly more towards the body (Fig. 1B). Furthermore, the ASV1 and SSV6 fibres seem to be less ‘sticky’ than their thin counterparts, and the characteristic ‘rosettes’ were never observed for ASV1 and SSV6.

To exclude that the observed pleiomorphicity of SSV6 virions was an artifact caused by the purification process, or by uranyl-acetate staining, two control experiments were carried out. (i) The virions were analysed by EM directly after removal of host cells by mild centrifugation at 4000 r.p.m. (Jouan S40 rotor), and although omitting the concentration step yielded few virions, they exhibited the normal pleiomorphicity. (ii) The virion pleiomorphicity was also observed when we used phosphotungstate as an alternative contrasting agent (not shown), confirming that the heterogeneity of the shape was an integral property of the virions rather than a result of the experimental treatment. Moreover, SSV7 virions, for which little to no pleiomorphicity was observed, were routinely treated in exactly the same manner as SSV6 and ASV1 virions (Fig. 1A).

Genomic organization and comparison

Owing to their special structural properties, we originally suspected that ASV1 and SSV6 were representatives of

a new spindle-shaped viral family. However, genome analyses revealed that they, and the SSV5 and SSV7 isolates, are all closely related to known members of the family Fuselloviridae, and we therefore assign the four newly isolated viruses to this family. The similarities are evident, both in terms of overall gene synteny and sequence similarity (Table 1, Figs 2 and 3), and also extend to the distribution of the cysteine codons in a manner that supports the findings of Menon and colleagues (2008).

Sequence similarity among the fuselloviruses

The genome of ASV1 carries 24 186 bp and is by far the largest of the fuselloviruses, and one or two gene duplications appear to have occurred (ASV1_B91 and ASV1_C137), as well as the acquisition of new genes. Most of the ASV1 genome is closely related to the other fuselloviruses, with several regions of more than 75% identity at the nucleotide level (Fig. 3). One 5.6 kb region that is similar to SSV6 starts in the middle of ASV1_C213 and ends in ASV1_B90 (Fig. 3D).

An extreme example of how closely related some fuselloviruses are can be seen by comparing SSV4 and SSV5, where a 7.9 kb region is almost 100% identical (Fig. 3B), consistent with a recent recombination event having occurred between the viruses. Moreover, the junctions of nucleotide similarity regions are generally intragenic, such that sections of high sequence similarity are mostly short, distributed all over the genomes, and often start and stop in the middle of open reading frames (ORFs) (Fig. 3). These patterns of similarity raise interesting questions concerning interplay and recombination between fuselloviral genomes.

The presence of regions of nucleotide identity between the fuselloviruses raises the question as to how they avoid the extensive antiviral CRISPR systems present in all sequenced Sulfolobus genomes. Therefore, we analysed the correlation between sequence matching of CRISPR spacers and fuselloviral genomes. A total of 3420 CRISPR spacer sequences were obtained from four complete and nine incomplete Sulfolobales genomes (after subtracting the 278 spacers which S. solfataricus P1 and P2 have in common). Ninety-one of these spacers match to one or more of the fuselloviruses on a nucleotide sequence level. An additional 101 spacers were found matching to one or more fuselloviruses when extending the search to the amino acid sequence level. Thus out of the 3420 Sulfolobales spacers, in total 192 spacers yield 436 significant matches to fuselloviral genomes. The latter number exceeds the former because many spacers, especially on the amino acid sequence level, yield matches to more than one virus, and because some spacers match to repeats within the same viral genome.

We found no biased correlation between conserved regions and spacer matches, and it is possible that fuselloviruses recombine frequently enough to reduce the effectiveness of the CRISPR system. The results are summarized in Fig. 4, exemplified by SSV2, which has the highest number of spacer matches, and by the most distantly related fusellovirus, ASV1. The spacer matches occur on both strands of the viruses, consistent with DNA recognition by the spacer transcripts, as recently proposed (Marraffini and Sontheimer, 2008; Shah et al., 2009).

Encoded proteins

Many of the ORFs encoded on ASV1 yield no, or very weak, matches in public sequence databases, especially ORFs found in the ‘extra’ ~6 kb that are not present in other fuselloviruses. Exceptions are ASV1_B276 and C106, which are homologous to genes from an integrated virus in the Sulfolobus tokodaii chromosome (ST1724 and ST1725), and ASV1_A59, which exhibits sequence similarity to CopG transcriptional regulators in M. sedula and S. acidocaldarius. Both SSV6 and ASV1 encode a homologue of the structural protein SSV1_VP2, which is absent from the other six fuselloviruses. Furthermore, SSV6 and ASV1 do not encode a full-length SSV1_C792 homologue or an SSV1_B78 homologue, as do all other fuselloviruses (Fig. 2B). Instead, they carry two other genes: a small gene (SSV6_C213 and ASV1_B208) homologous only to the C-terminal 170 aa of SSV1_C792, and following this gene, a large ORF (ASV1_A1231 and SSV6_B1232), which is similar to Saci_1002 from Sulfolobus acidocaldarius (49% identity, 65% similarity, for SSV6_B1232). No other sequence similarity is found in databases, but a clue to the function of both the SSV1_C792 and SSV6_B1232 homologues is given by the Phyre fold-prediction server (Kelley and Sternberg, 2009), which suggests they both have a fold similar to the adsorption protein P2 from bacteriophage PRD1 (E-value < 0.5, estimated precision 85%).

ASV1, SSV7 and SSVk1 differ from the other fuselloviruses by lacking all genes of the SSV1_T5 operon except the integrase and, for ASV1 and SSVk1, a predicted helix–turn–helix transcriptional regulator (Fig. 2). Instead, the three viruses carry a set of ORFs on the plus-strand, which encode a putative Rad3-like helicase, an Msed_2283 homologue (hypothetical protein) and a few small proteins (Fig. 2).

Beside these peculiarities of the individual genomes, analyses have revealed 13 genes that are conserved in all nine fuselloviruses. These ‘core’ genes include VP1 and VP3, the integrase and three putative transcriptional regulators, including one helix–turn–helix and two zinc-finger proteins (Fig. 2). The attP sites within the integrase genes


Table 1. Genes in SSV5, SSV6, SSV7 and ASV1, as well as the homologues from other fuselloviruses.

Virus (genome size) | Isolated from | Matching S. solfataricus P2 tRNA of the attP site in the integrase gene (anticodon)
SSV1 (15 465 bp) | Japan | Arg (CCG)
SSV2 (14 796 bp) | Iceland | Gly (CCC)
SSV4 (15 135 bp) | Iceland | Glu (TTC)
SSV5 (15 330 bp) | Iceland | Gln (CTG)
SSVk1 (17 385 bp) | Kamchatka, Russia | Asp (GTC), Glu (CTC), Glu (TTC)
SSVrh (16 473 bp) | USA | Leu (GAG)
SSV6 (15 684 bp) | Iceland | Gln (CTG)
SSV7 (17 602 bp) | Iceland | Gly (CCC)
ASV1 (24 186 bp) | USA | Lys (TTT)

Homologues, in column order SSV1, SSV2, SSV4, SSV5, SSVk1, SSVrh, SSV6, SSV7, ASV1 | Size range (aa) | Comments
VP2 C76 A82a | 74–82 | VP2 protein detected in the SSV1 virion and thought to be the DNA binding protein (Reiter et al., 1987a)
A82 ORF83 ORF82 gp07 B83 A83 C83 C82 A83 | 82–83 | Putative membrane protein [a]
C84 ORF88c ORF81 gp08 B90 C78 B81 B83 C97 | 81–104 | [a]
A92 ORF90 ORF89 gp09 A82 A93 C90 C90 A94 | 89–94 | Overlaps other genes
B277 ORF276 ORF280 gp10 C279 C277 A269 A281 C263 | 269–281 | Putative membrane protein [a]
A154 ORF153 ORF152 gp11 C157 C154 B149 C150 C155 | 149–157 | Also found in pSSVx [a]
B251 ORF233 ORF233 gp12 A231 A247 C234 A255 A232 | 231–255 | DnaA-like (Koonin, 1992). Also in pSSVx, ATV and A. pernix [a]
D335 ORF328 ORF330 Integrase F340 D355 F354 D336 D347 | 328–355 | Integrase [a]
E79 | 79 |
C176 | 176 |
A66* C72 A58a | 58–72 | Also found in AFV2 (gp06)
B204 A171 | 171–204 |
C74 B80 | 74–80 |
B494 A583 C559 | 494–583 | Rad3-like helicase
A460 B471 C674 | 460–674 | Similar to Metallosphaera sedula protein Msed_2283
B192 | 192 | Similar to C-terminal of ASV1_C674
A136 B119 | 119–154 |
B64 B102 | 64–102 |
D244 ORF211 ORF209 gp15 D212 F215 | 209–244 | Similar to Saci_0475
D108 | 108 | Similar to SIRV2gp12
F90 | 90 | Similar to ORFs from pARN3 and pSOG1
E94 | 94 |
F93 E81 F110 D95 | 81–110 | Putative HTH transcriptional regulator (Kraft et al., 2004b)
D63 ORF57 ORF63 gp16 F61 E60 | 57–63 | 3D X-ray structure from SSV1 (Kraft et al., 2004a)
ORF159b gp18 E152 F185 | 152–185 |
ORF61 ORF61 gp21 F62 E61 | 61–62 |
ORF79a ORF73 gp23 E73 D77 | 73–79 |
A49 | 49 | C-terminal similar to SSV7_B76
A100 ORF96 ORF96 gp24 C96 C93 C106 C96 | 93–106 | Weak hit to ARV1
C48 C49 | 48–49 |
ORF88a B87 | 87–88 |
B92 | 92 |
A59 | 59 | Similar to CopG from M. sedula and Saci_0942. Possible functional homologue of SSV1_C80
C80 ORF82A ORF79 gp26 C82 B64 A78 C80 | 64–82 | RHH protein, CopG-like [a]
A109 | 109 | Paralogue of ASV1_B91
A79 ORF82B ORF80B gp27 A80 B79 B82 B82 B91 | 79–91 | Zinc finger motif. Similar to ATV_gp28 and pHVE14–51. ASV1_B91 is a paralogue of ASV1_A109 [a]
C54 | 54 |
C102a ORF100 ORF100 gp29 B98 A102b C100 A101 | 98–102 | B-block_TFIIC-domain, Zinc finger
ORF205 ORF206 gp30 A204 C287 B206 | 204–287 | Similar to CRISPR associated gene Cas4 in Staphylothermus marinus
B129 ORF155 ORF124 gp31 B158 C150 B123 C128 C137 | 124–173 | Two Zinc finger motifs. ASV1_C137 is a paralogue of ASV1_C125
B99 | 99 |
ORF107b gp32 B111 C113 C113 | 107–113 | Similar to ST1721 from S. tokodaii [b]
ORF311 gp33 B252 | 252–311 | Similar to ST1722 from S. tokodaii
ORF111 gp35 C108 | 108–111 | Similar to ST1723 from S. tokodaii
B85 C62 | 62–85 |
C247 A298 | 247–298 |
B74 C67 | 67–74 |
B276 | 276 | Similar to ST1724 from S. tokodaii
C106 | 106 | Similar to ST1725 from S. tokodaii
C125 | 125 | Paralogue of ASV1_C137
A367 | 367 |
A137 | 137 | Similar to STS262 from S. tokodaii
C806 | 806 | 558–785 similar to APE_0858 from Aeropyrum pernix
A96 | 96 |
C792 ORF809 ORF808 gp01 B793 B812 C213 C811 B208 | 208–812 | ASV1_B208 and SSV6_C213 are similar to the C-terminal of the C792 homologues
B78 ORF79 ORF80a gp02 A79 A79 B79 | 79–80 | Part of the SSV1_C792 module
B68 A58b | 58–68 |
B1232 A1231 | 1231–1232 | Similar to Saci_1002 from S. acidocaldarius
C166 ORF176 ORF167 gp03 B169 B170 C134 C170 B130 | 130–176 | Gapped in ASV1 and SSV6. Putative membrane protein
B115 ORF112 ORF107a gp04 A123 A113 A88 B112 A82b | 82–123 | Putative HTH transcriptional regulator. Shorter in ASV1 and SSV6 [a]
VP1 ORF88b ORF136 VP1 B137 A89 A143 C88 A140 | 88–143 | VP1 structural protein in SSV1 (Reiter et al., 1987a) [a]
VP3 ORF92 ORF92 VP3 A93 C96 B94 C97 B90 | 92–96 | VP3 structural protein in SSV1 (Reiter et al., 1987a)

A, B and C indicate genes on the three reading frames of the plus-strand, and D, E and F indicate genes on the minus-strand. The number following the letter is the number of encoded amino acids. The 13 ‘core’ genes are in boldface, and proteins for which experimental data are available are underlined. The asterisk indicates an ad hoc ORF name for a gene which is not present in the NCBI annotation.
a. Core gene in Held and Whitaker (2009).
b. The upstream 40 bp of the SSV7_C113 homologues are highly conserved in all fuselloviruses, with two copies in ASV1. In SSV1, this motif is immediately next to the BRE+TATA-box of the T3 transcript.
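The ORF naming scheme in the table notes (a frame letter followed by the number of encoded amino acids) can be illustrated with a minimal sketch. It uses the annotation settings given in the Experimental procedures (minimum ORF length of 50 aa; AUG, GUG and UUG as start codons). The mapping of letters A to C and D to F onto frame offsets 0 to 2 is an assumption made here for illustration, the genome is treated as linear rather than circular, and `find_orfs` and `orf_name` are hypothetical helper names, not part of the MUTAGEN software.

```python
from typing import List, Tuple

STARTS = {"ATG", "GTG", "TTG"}   # DNA equivalents of AUG, GUG and UUG
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_name(strand: str, frame: int, n_aa: int) -> str:
    """Build a name such as 'A59' (assumed frame-to-letter mapping)."""
    letters = "ABC" if strand == "+" else "DEF"
    return f"{letters[frame]}{n_aa}"


def find_orfs(seq: str, min_aa: int = 50) -> List[Tuple[str, int, int]]:
    """Scan six frames; return (name, start on scanned strand, length in aa)."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    n_aa = (i - start) // 3        # codons before the stop codon
                    if n_aa >= min_aa:
                        found.append((orf_name(strand, frame, n_aa), start, n_aa))
                    start = None
    return found
```

For example, a 60-codon ORF placed one base into a sequence would be reported in the second plus-strand frame as `B60` under the assumed mapping.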

Fig. 3. Similarity at the nucleotide level between selected representative pairs of fuselloviral genomes. A. Comparison between SSV1 and SSV5. B. Between SSV5 and SSV4. C. Between SSV4 and SSV6. D. Between SSV6 and ASV1. E. Between ASV1 and SSVk1. F. Between SSVk1 and SSV7. Regions of high (> 70%) pairwise identity on the nucleotide level (light grey boxes) are interspersed by regions with no detectable similarity (white boxes). The dark grey box indicates an exceptional example of similarity between SSV4 and SSV5, where a 7.9 kb region is almost 100% identical between the two genomes. The junctions between similar and dissimilar regions (indicated by dotted lines) often occur in the middle of genes, and are not confined to intergenic regions. Short regions (< 100 bp) of similarity or dissimilarity are not shown. Black arrows denote ‘core’ genes, dark grey arrows denote ORFs that are found in more than one fusellovirus, and light grey arrows denote ORFs that have no homologues in the database, some of which may not be protein-coding.
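One way to picture how regions like those in Fig. 3 could be delimited is a sliding-window identity scan. The sketch below is not the procedure used for the figure, which this section does not describe. It assumes the two genomes have already been aligned to equal length (with `-` for gaps), and it borrows the 70% identity threshold and 100 bp minimum length from the caption as the window parameters. `similar_regions` is a hypothetical name.

```python
from typing import List, Tuple


def similar_regions(a: str, b: str, window: int = 100,
                    min_identity: float = 0.70) -> List[Tuple[int, int]]:
    """Return merged half-open intervals covered by windows above the threshold.

    `a` and `b` are assumed to be pre-aligned sequences of equal length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(a)
    # A column counts as identical only if both bases agree and are not gaps.
    match = [x == y and x != "-" for x, y in zip(a, b)]
    regions: List[Tuple[int, int]] = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one column: add the new one, drop the old one.
            count += match[start + window - 1] - match[start - 1]
        if count / window > min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)   # extend the open region
            else:
                regions.append((start, end))
    return regions
```

Because a window can pass the threshold while overhanging a junction, the reported boundaries are approximate to within a fraction of the window length.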

all have their best hits to tRNAs from S. solfataricus, with Gln, Gln, Gly and Lys for SSV5, SSV6, SSV7 and ASV1 respectively. Table 1 shows an overview of the genes in SSV5, SSV6, SSV7 and ASV1, as well as the corresponding homologues in the other fuselloviruses.

Discussion

In this paper we describe four new members of the family Fuselloviridae, SSV5, SSV6, SSV7 and ASV1, isolated from acidic hot springs of Iceland and USA, which infect members of the hyperthermophilic archaeal genera Sulfolobus and Acidianus.

Until now, fuselloviruses had only been found to replicate in Sulfolobus species. Our discovery of ASV1 in Acidianus brierleyi shows that fuselloviruses can propagate in both the major culturable genera from aerobic, acidic hot springs. Therefore, it is likely that fuselloviruses also infect other host species from these environments, such as Caldococcus, Vulcanisaeta and Stygiolobus (Snyder et al., 2007). Furthermore, the family Fuselloviridae presumably also extends its host range into the vast number of currently uncultured species found in other extreme environments, such as the acid mine drainage ecosystem, where a VP2 homologue, recently found by community genomics


Fig. 4. CRISPR spacer sequence matches for ASV1 and SSV2 are superimposed on linearized genome maps of ASV1 and SSV2 respectively. ORFs are shown as arrows above and below the line. Sequence matches to spacers are shown as vertical lines. The black vertical lines denote the nucleotide sequence matches, and the grey vertical lines show matching amino acid sequences, after translation of the spacer sequences from both DNA strands. The dark boxes below the genome maps indicate areas > 50 bp with nucleotide level sequence similarity to other fuselloviruses (the relevant fusellovirus is indicated to the left of the dark boxes). In total, there are 12 spacer matches to ASV1 and 22 matches to SSV2 at a nucleotide level. At an amino acid sequence level, there are 42 spacer matches to ASV1 and 28 matches to SSV2.
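The two kinds of match shown in Fig. 4 can be sketched as follows. This is an illustration only: it uses exact matching, whereas the criteria for a ‘significant’ match are not given in this section, and the function names are hypothetical. Nucleotide matches are searched on both strands, and amino acid matches are found by comparing six-frame translations of the spacer with six-frame translations of the viral genome, using the standard genetic code.

```python
from itertools import product
from typing import List, Tuple

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str, frame: int) -> str:
    """Translate one frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame, len(seq) - 2, 3))


def six_frames(seq: str) -> List[str]:
    """Translations of the three frames on each strand."""
    return [translate(s, f) for s in (seq, reverse_complement(seq)) for f in range(3)]


def nucleotide_hits(spacer: str, genome: str) -> List[Tuple[str, int]]:
    """Exact matches as (strand, 0-based start in plus-strand coordinates)."""
    hits = []
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        pos = genome.find(query)
        while pos != -1:
            hits.append((strand, pos))
            pos = genome.find(query, pos + 1)
    return hits


def protein_match(spacer: str, genome: str) -> bool:
    """True if any stop-free spacer translation occurs in any genome frame."""
    genome_peptides = six_frames(genome)
    for peptide in six_frames(spacer):
        if "*" in peptide:
            continue   # a spacer frame interrupted by a stop codon is skipped
        if any(peptide in g for g in genome_peptides):
            return True
    return False
```

A spacer that differs from the virus only at synonymous codon positions gives no nucleotide hit but still matches at the amino acid level, which is in line with the additional spacers recovered when the search was extended to amino acid sequences.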

(Andersson and Banfield, 2008), indicates the presence of fuselloviruses.

‘Core’ genes

By almost doubling the number of described fuselloviruses, we are refining the definition of ‘core’ genes of the family. The 18 conserved, or ‘core’ genes, that were defined for SSV1, SSV2, SSVk1 and SSVrh (Wiedenheft et al., 2004) can now be reduced to 13 (Table 1) and may have to be revised further as more fuselloviruses are sequenced, but our findings correlate well with a recent analysis of fuselloviral proviruses in S. islandicus strains (Held and Whitaker, 2009). We exclude the SSV1_C792 homologues from the list of ‘core’ genes, because we do not consider SSV6_C213 and ASV1_B208 to be able to fully complement the proteins found in other fuselloviruses, which are about four times larger (Table 1).

Six of the ‘core’ genes have no discernible function based on their primary sequence, except for some of them carrying predicted transmembrane segments (Table 1), and experimental data will be needed to determine their functional roles. Of the remaining seven, the integrase function was characterized experimentally (Muskhelishvili et al., 1993; Muskhelishvili, 1994; Serre et al., 2002; Letzelter et al., 2004; Clore and Stedman, 2007). Moreover, VP1 and VP3 are virion components in SSV1 virions, and VP1 is processed from the N-terminus in SSV1, to a length of 73 aa (Reiter et al., 1987a), which may explain the significant size difference we observe among the VP1 genes (Table 1). The remaining C-terminus of VP1 is similar in both length and sequence to the VP3 protein, and their roles might be partially interchangeable in the virion matrix. Bioinformatical analyses predict DnaA-like activity for SSV1_B251 homologues (Koonin, 1992) and transcriptional regulation activity for three other ‘core’ genes: SSV1_A79 and SSV1_B129, which are transcribed early during infection and are probably involved in controlling the host's transcriptional apparatus, and SSV1_B115, which is co-transcribed together with VP1, VP2, VP3 and SSV1_C792 later in infection, and may be involved in controlling the assembly and/or packaging of virions.

‘Non-core’ genes

Genes that are highly conserved but present in a subset of the fuselloviruses could provide a possible way of classifying the fuselloviruses into subgroups, albeit subgroups that overlap.

Thus, ASV1, SSV6 and SSV1 encode a VP2 homologue, indicating that they all share a DNA packaging system. However, the difference to the SSV1 protein in the C-terminus may indicate an alternative mode of interaction of the protein and viral DNA with the major virion proteins, VP1 and VP3.

Another subgroup would be the SSVs, which all encode a highly conserved homologue of SSV1_C80, a protein containing the RHH 1 CopG domain. ASV1 does not encode any gene with obvious sequence similarity to SSV1_C80. However, ASV1_A59 also has an RHH 1 CopG domain, although it groups with other RHH1-containing genes, including a few Sulfolobus chromosomal genes (e.g. Saci_0942). Furthermore, ASV1_A59 occupies the same genomic position as the SSV1_C80

homologues do in the SSV genomes, and it very likely acts as a functional homologue of SSV1_C80.

A third subgroup consists of ASV1, SSV7 and SSVk1, which all encode the Rad3-like helicase protein and the neighbouring Msed_2283 homologue (Fig. 2). The presence of the helicase strongly suggests that these two proteins are involved in DNA replication or recombination, and it is possible that the other fuselloviruses recruit host proteins to fulfill the same function.

A possible filament protein

The most striking genomic difference among the fuselloviruses is the ‘replacement’ of the SSV1_C792 module with the SSV6_B1232 module (Fig. 2B). It seems the C-terminal 170 aa from SSV1_C792 are essential, since they are retained as a small separate gene in both the ASV1 and SSV6 genomes; however, the remaining ~620 aa of SSV1_C792 and the whole of SSV1_B78 are substituted by SSV6_B1232. The presence of the SSV6_B1232 module correlates with a difference in the number and structure of the sticky terminal filaments of the SSV6 and ASV1 virions, when compared with the SSV1_C792 module viruses (Fig. 1B). Possibly, there is a phenotype–genotype link, with the SSV1_C792 module being responsible for the multiple, thin, sticky filaments and the SSV6_B1232 module for the few, thick, less sticky filaments. In support of this hypothesis, small amounts of SSV1_C792 were recently found by mass-spectrometry in SSV1 virions (Menon et al., 2008). Moreover, the Phyre prediction tool suggested that both SSV1_C792 and SSV6_B1232 have a fold similar to the receptor-binding protein P2 of bacteriophage PRD1, and it was recently shown that a large protein is responsible for the sticky end-fibres in the rudivirus SIRV2 (Steinmetz et al., 2008). Nevertheless, further studies will be needed to determine the exact functions of the SSV1_C792 and SSV6_B1232 modules in fuselloviruses.

Fuselloviral nucleotide similarity and a putative mechanism for interviral recombination

The multiple regions of high nucleotide similarity, or even identity, between the fuselloviral genomes do not represent a ‘core’ fusello-genome, since the regions of similarity differ between the various pairs of viruses, and often do not include the ‘core’ genes (Fig. 3). Instead, the pattern of similar and non-similar sections of DNA indicates frequent recombination events between fuselloviruses, similar to that observed for some bacteriophages (Hendrix et al., 1999). Possibly this occurs between pairs of fuselloviruses present in the same host; however, we do not see a similar pattern of sequence similarity for the linear non-integrating archaeal viruses (Vestergaard et al., 2008a). Therefore, we suggest that a different mechanism is more likely.

Integrated fusellovirus genomes have been found in the Sulfolobus solfataricus P2 and in four S. islandicus chromosomes, where no trace of the covalently closed circular DNA (cccDNA) form was detected (Stedman et al., 2003; Held and Whitaker, 2009). Once a virus has been ‘caught’, a second, slightly different, fusellovirus might infect the same host, and insert itself into the same tRNA gene, resulting in a concatamer of the two fuselloviruses in the host chromosome (Fig. 5). This structure might be maintained for a couple of generations, but it would be inherently unstable if the two viral genomes are reasonably similar, as there would be a high chance of homologous recombination between the two integrated viruses. Such a recombination event would lead to the formation of one cccDNA virus and one inserted virus, both of which would consist of a part of each of the original two viruses (Fig. 5). Owing to the very short sequence similarity required for homologous recombination in Sulfolobus (Grogan, 2009), the cross-over point could potentially be in many different places, and each of these recombination events would form a unique mixture of the two viruses, similar to meiosis in eukaryotes. Thus, this offers a mechanism for rapidly generating a large number of diverse viral offspring. Our model does not exclude direct recombination between the cccDNA forms of fuselloviruses, but we propose that this type of ‘tandem insertion’ event happens frequently (on an evolutionary scale) in nature, and that repeated events, each involving a different pair of ‘parent’ fuselloviruses, would eventually produce the patchwork viral genomes we see today (Fig. 3).

Our model also serves to explain why fuselloviruses have developed an integrase that is inactivated upon integration. The integrase is not essential for viral propagation (Clore and Stedman, 2007), but if the proposed recombination mechanism is correct, then the unique SSV-type integrase will help the virus in the long term, by promoting recombination with closely related viruses, since the inactivation provides a high chance of the viral genome being ‘caught’ in an integrated form in the chromosome. Nevertheless, inactivation of the integrase is not required for recombination between tandem insertions. Studies of the Sulfolobus plasmids pARN3 and pARN4 reveal stretches of nucleotide identity, which might have been generated by tandem insertions, even though these plasmids carry non-inactivatable integrases (Greve et al., 2004).

The inherent instability of a tandem insertion makes it difficult, if not impossible, to detect in nature. However, a concatamer of inserted viral genomes, similar to the one proposed in our model, was recently discovered in the chromosome of Methanococcus voltae A3. There, the two viral genomes integrated into the same tRNA gene are


Fig. 5. Proposed model for recombination between integrated fuselloviruses. A. The first fusellovirus (SSVa) infects the host, and integrates into the chromosome. B. The second fusellovirus (SSVb) infects the host, and integrates into the same tRNA as SSVa. C. The ‘tandem integration’ of SSVa and SSVb. The dashed arrows indicate examples of homologous recombination sites. D. Examples of ‘offspring’ cccDNA fuselloviruses from the recombination of SSVa and SSVb.

very different, preventing homologous recombination, thus ‘trapping’ the viral concatamer in the host chromosome (Krupovic and Bamford, 2008). The attP sites of SSV2 and SSV7 as well as SSV5 and SSV6 match the same tRNA in S. solfataricus P2 (Table 1), making it likely that fuselloviruses are also able to integrate into the same tRNA, forming concatamers, which are unstable due to the similarity between the fuselloviruses. Moreover, it was shown that SSVk1 is able to integrate into several different sites in the host genome (Wiedenheft et al., 2004), increasing the likelihood of finding a ‘partner’ for recombination. Finally, examples of related viruses infecting the same host at the same time are known for Sulfolobales, such as AFV6, AFV7 and AFV8 in Acidianus convivator (Vestergaard et al., 2008b).

If the ‘tandem insertion’ model is correct, then an evolutionary tree of an entire viral genome has no meaning, nor would that from individual ‘core’ genes (since two halves of the same gene might originate from different ‘parent’ viruses). One might instead analyse genes, described in the previous section, that are not shared by all fuselloviruses, since these genes cannot serve as cross-over points for homologous recombination. Although for the moment the data set is too small for a phylogenetic analysis based on these genes, the presence or absence of certain genes in a subset of the viruses has provided important clues to understanding protein functions in the fuselloviruses, including the putative filament proteins SSV1_C792 and SSV6_B1232.

With the current understanding of the CRISPR antiviral system, high nucleotide similarity between viruses should be disadvantageous, since a single spacer, matching a conserved region, will provide a host with immunity to several virus strains (Lillestøl et al., 2009). Nevertheless, the puzzling fact remains that fuselloviruses do possess highly similar, sometimes identical, nucleotide regions, and it is possible that the integration and/or the frequent recombination somehow provide the fuselloviruses with the means to evade the CRISPR system in their hosts.

It has been proposed that thermoacidophilic archaeal viruses are highly mobile, even between distant hot springs in the same geothermal area, and that different fuselloviruses continuously infect a more-or-less stable population of host species (Snyder et al., 2007). The high nucleotide similarity we have found, even between fuselloviruses isolated on different continents, seems to confirm that they do manage to exchange genetic material over the intercontinental distances that separate some of the geothermal ‘islands’ in the cold ‘ocean’.

Experimental procedures

Sulfolobus and Acidianus medium

Z medium: 25 mM (NH4)2SO4, 3 mM K2SO4, 1.5 mM KCl, 20 mM glycine, 4.0 mM MnCl2, 10.4 mM Na2B4O7, 0.38 mM ZnSO4, 0.13 mM CuSO4, 62 nM Na2MoO4, 59 nM VOSO4, 18 nM CoSO4, 19 nM NiSO4, 0.1 mM HCl, 1 mM MgCl2, 0.3 mM Ca(NO3)2, adjusted to pH 3.5 with H2SO4. T medium: Identical to Z medium, but with 0.2% Tryptone added. ST medium: Identical to T medium, but with small amounts of elemental sulphur added.

© 2009 Society for Applied Microbiology and Blackwell Publishing Ltd, Environmental Microbiology

Isolation and purification of hosts and viruses

Samples were collected from the Hveragerdi hot-spring area in south-western Iceland and 1 ml was used to establish an enrichment culture, by incubating in 50 ml ST medium for 9 days at 80°C, after which 1 ml of the enrichment was transferred to fresh ST medium and incubated for a further 4 days. Four millilitres of the enrichment (designated G4ST) was then centrifuged at 4000 r.p.m. for 20 min (Jouan S40 rotor) to remove cells, whereupon the supernatant was spun further at 38 000 r.p.m. for 3 h to pellet virions (Beckman SW60 rotor). Finally, the pellet was resuspended in 50 ml of the supernatant. The resuspension was then examined by electron microscopy, and several different morphotypes of virus-like particles were in evidence. Among these was a group of fusellovirus-like particles, but with different filament structures at the end, and a large diversity in their morphotypes, ranging from sausage-shaped to an almost spindle-like pear-shape (Fig. 1).

To isolate single host–virus systems, G4ST was spread on a plate containing ST medium and solidified with Gel-rite (Sigma-Aldrich, St Louis, USA). After 10 days of incubation at 80°C, 30 colonies of representative sizes, shapes and colours were transferred to 5 ml liquid ST medium and incubated with vigorous shaking for 4 days. Each of the growing strains was examined for virus in the electron microscope, and SSV6 and SSV7 were detected in the supernatant of strains G4ST-T-11 and G4T-1 respectively. The 16S rRNA genes of G4T-1 and G4ST-T-11 were amplified using the primers 8aF: TCYGGTTGATCCTGCC and 1512uR: ACGGHTACCTTGTTACGACTT (Accession number FJ870913 for G4ST-T-11 and FJ870914 for G4T-1).

SSV5 was present in HVE14, an enrichment culture, established from a natural sample collected near the G4 site, but 10 years previously (Zillig et al., 1996). It was propagated in S. solfataricus P2, by mixing a small amount of HVE14 with a well-grown S. solfataricus P2 culture (1:1000), which was then harvested and used for DNA isolation of extrachromosomal elements using a plasmid miniprep kit from Qiagen. Acidianus brierleyi was cultured at 70°C in ST medium and ASV1 was recovered from the supernatant by ultracentrifugation (38 000 r.p.m. for 3 h in a Beckman SW41 rotor).

Electron microscopy

Ten microlitres of the samples was deposited on a carbon and formvar coated grid (Ted Pella, Redding, CA, USA) and left for 2 min before removing excess fluid. Ten microlitres of 2% uranyl acetate or phosphotungstate (Sigma-Aldrich) was allowed to stain the samples negatively for 10 s. Images were taken on a JEOL1200EXII microscope with an 80 kV beam, using a CCD camera.

DNA isolation and sequencing

Six litres of G4ST-T-11 was grown in a fermentor, and after removing cells by centrifuging twice at 4000 r.p.m. for 20 min (Sorvall GS-3 rotor), the virions in the supernatant were concentrated using a Sartorius Vivaflow 200 filter cartridge (Sartorius, Goettingen, Germany). The resulting 15 ml was further concentrated by spinning at 38 000 r.p.m. for 3 h using a SW41 Beckman rotor, and finally the virions were treated with Proteinase K and the DNA was extracted with Phenol, Phenol/Chloroform and Chloroform extraction. The SSV6 DNA was then treated as described below for SSV7.

In order to sequence SSV7, 5 ml of an exponential G4T-1 culture was pelleted by centrifugation, and resuspended in Z medium. The SSV7 production was induced by 50 J cm⁻² UV radiation (254 nm) under constant mild agitation, and the cells were then transferred to 45 ml T medium for over-night incubation. Five millilitres was used for a miniprep (QIAprep Spin Miniprep Kit, QIAGEN SA, Courtaboeuf, France), which was used for amplification and subsequent library construction based on the Linker Amplified Shotgun Library method described at http://www.sci.sdsu.edu/PHAGE/LASL/index.htm.

Shot-gun library construction of SSV5 and SSV6, as well as ASV1-containing A. brierleyi total DNA, was performed as described previously using SmaI digested pUC18 as cloning vector (Peng, 2008). Plasmid DNA of clones, from all four libraries, was purified using a Model 8000 Biorobot (Qiagen, Westburg, Germany) and sequenced in MegaBACE 1000 Sequenators (Amersham Biotech, Amersham, UK). Sequences were assembled using Sequencher 4.5 (http://www.genecodes.com). Genome annotations and comparisons were done using the MUTAGEN software (Brügger et al., 2003) with a minimum ORF-length set to 50 aa and allowing AUG, GUG and UUG as possible start codons. Accession numbers are EU030939, FJ870915, FJ870916 and FJ870917 for SSV5, SSV6, SSV7 and ASV1 respectively.

CRISPR spacer analysis

To obtain a list of spacer sequences from Sulfolobales, the following partial or full genomes were used: S. solfataricus P2, S. tokodaii 7, S. acidocaldarius DSM 639, Metallosphaera sedula DSM5348 from GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), Sulfolobus islandicus strains LD85, YG5714, YN1551, M164 and U328 from JGI (http://www.jgi.doe.gov/genome-projects/), and S. islandicus strains HVE10/4 and REY15A and Acidianus brierleyi (K. Brügger and Q. She, unpubl. data). CRISPRs were identified using publicly available software (Edgar, 2007; Bland et al., 2007). Spacer sequences from each repeat-cluster were aligned (Sæbø et al., 2005) against the fuselloviral genomes at a nucleotide level (Shah et al., 2009). Additionally, spacers were aligned against amino acid sequences of annotated ORFs of the Fuselloviruses, at an amino acid level (Vestergaard et al., 2008a; Shah et al., 2009). Significance cut-offs were determined for both alignment types by using the genome sequence of Saccharomyces cerevisiae as a negative control.

Acknowledgements

P.R. was funded by grant VIRAR (NT05-2_41674) from the Agence Nationale de la Recherche, France. The research in Copenhagen was supported by grants from the Grundforskningsfond and the Research Council for Natural Sciences. We would also like to thank the Electron Microscopy Platform at Institut Pasteur for helpful advice and use of their JEOL1200EXII microscope.
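The negative-control idea used in the CRISPR spacer analysis of the Experimental procedures can be sketched as follows. This is only an illustration under stated assumptions: the alignments in the study were made with PARALIGN, whereas the sketch swaps in a naive ungapped identity count on both strands, takes the best score any spacer reaches against the control sequence as the cut-off, and uses toy sequences rather than real genomes. All function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def best_ungapped_score(spacer, target):
    """Largest number of identical bases over all ungapped placements
    of the spacer, on either strand, along the target."""
    best = 0
    for query in (spacer, reverse_complement(spacer)):
        n = len(query)
        for start in range(len(target) - n + 1):
            window = target[start:start + n]
            score = sum(a == b for a, b in zip(query, window))
            best = max(best, score)
    return best


def cutoff_from_control(spacers, control):
    """Cut-off: the best score any spacer reaches against the negative control."""
    return max(best_ungapped_score(s, control) for s in spacers)


def significant_hits(spacers, virus, control):
    """Spacers that score strictly above the control-derived cut-off."""
    cutoff = cutoff_from_control(spacers, control)
    hits = []
    for spacer in spacers:
        score = best_ungapped_score(spacer, virus)
        if score > cutoff:
            hits.append((spacer, score))
    return hits


if __name__ == "__main__":
    # Toy stand-ins for the control genome, a spacer and a viral genome.
    control = "ACGT" * 50
    spacer = "GGGTTTCCCAAAGGGTTTCCCAAA"
    virus = "AT" * 20 + spacer + "AT" * 20
    print(significant_hits([spacer], virus, control))
```

A real analysis would use a proper aligner and a much larger control sequence; the sketch only shows how a control-derived threshold separates chance matches from candidate spacer hits.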


References

Andersson, A.F., and Banfield, J.F. (2008) Virus population dynamics and acquired virus resistance in natural microbial communities. Science 320: 1047–1050.
Arnold, H.P., She, Q., Phan, H., Stedman, K., Prangishvili, D., Holz, I., et al. (1999) The genetic element pSSVx of the extremely thermophilic crenarchaeon Sulfolobus is a hybrid between a plasmid and a virus. Mol Microbiol 34: 217–226.
Bath, C., and Dyall-Smith, M.L. (1998) His1, an archaeal virus of the Fuselloviridae family that infects Haloarcula hispanica. J Virol 72: 9392–9395.
Bath, C., Cukalac, T., Porter, K., and Dyall-Smith, M.L. (2006) His1 and His2 are distantly related, spindle-shaped haloviruses belonging to the novel virus group, Salterprovirus. Virology 350: 228–239.
Bize, A., Peng, X., Prokofeva, M., Maclellan, K., Lucas, S., Forterre, P., et al. (2008) Viruses in acidic geothermal environments of the Kamchatka Peninsula. Res Microbiol 159: 358–366.
Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C., and Hugenholtz, P. (2007) CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8: 209–217.
Brügger, K., Redder, P., and Skovgaard, M. (2003) MUTAGEN: multi-user tool for annotating genomes. Bioinformatics 19: 2480–2481.
Clore, A.J., and Stedman, K.M. (2007) The SSV1 viral integrase is not essential. Virology 361: 103–111.
Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8: 18–24.
Fröls, S., Gordon, P.M., Panlilio, M.A., Schleper, C., and Sensen, C.W. (2007) Elucidating the transcription cycle of the UV-inducible hyperthermophilic archaeal virus SSV1 by DNA microarrays. Virology 365: 48–59.
Geslin, C., Le Romancer, M., Erauso, G., Gaillard, M., Perrot, G., and Prieur, D. (2003) PAV1, the first virus-like particle isolated from a hyperthermophilic euryarchaeote, 'Pyrococcus abyssi'. J Bacteriol 185: 3888–3894.
Greve, B., Jensen, S., Brügger, K., Zillig, W., and Garrett, R.A. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1: 231–239.
Grogan, D.W. (2009) Homologous recombination in Sulfolobus acidocaldarius: genetic assays and functional properties. Biochem Soc Trans 37 (Pt 1): 88–91.
Guixa-Boixareu, N., Calderon-Paz, J.I., Heldal, M., Bratbak, G., and Pedros-Alio, C. (1996) Viral lysis and bacterivory as prokaryotic loss factors along a salinity gradient. Aquat Microb Ecol 11: 215–227.
Häring, M., Rachel, R., Peng, X., Garrett, R.A., and Prangishvili, D. (2005) Viral diversity in hot springs of Pozzuoli, Italy, and characterization of a unique archaeal virus, Acidianus bottle-shaped virus, from a new family, the Ampullaviridae. J Virol 79: 9904–9911.
Held, N.L., and Whitaker, R.J. (2009) Viral biogeography revealed by signatures in Sulfolobus islandicus genomes. Environ Microbiol 11: 457–466.
Hendrix, R.W., Smith, M.C.M., Burns, R.N., Ford, M.E., and Hatfull, G.F. (1999) Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci USA 96: 2192–2197.
Kelley, L.A., and Sternberg, M.J.E. (2009) Protein structure prediction on the web: a case study using the Phyre server. Nature Protocols 4: 363–371.
Koonin, E.V. (1992) Archaebacterial virus SSV1 encodes a putative DnaA-like protein. Nucleic Acids Res 20: 1143.
Kraft, P., Kümmel, D., Oeckinghaus, A., Gauss, G.H., Wiedenheft, B., Young, M., and Lawrence, C.M. (2004a) Structure of D-63 from Sulfolobus spindle-shaped virus 1: surface properties of the dimeric four-helix bundle suggest an adaptor protein function. J Virol 78: 7438–7442.
Kraft, P., Oeckinghaus, A., Kümmel, D., Gauss, G.H., Gilmore, J., Wiedenheft, B., et al. (2004b) Crystal structure of F-93 from Sulfolobus spindle-shaped virus 1, a winged-helix DNA binding protein. J Virol 78: 11544–11550.
Krupovic, M., and Bamford, D.H. (2008) Archaeal proviruses TKV4 and MVV extend the PRD1-adenovirus lineage to the phylum Euryarchaeota. Virology 375: 292–300.
Letzelter, C., Duguet, M., and Serre, M.C. (2004) Mutational analysis of the archaeal tyrosine recombinase SSV1 integrase suggests a mechanism of DNA cleavage in trans. J Biol Chem 279: 28936–28944.
Lillestøl, R.K., Shah, S.A., Brügger, K., Redder, P., Phan, H., Christiansen, J., and Garrett, R.A. (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol Microbiol 72: 259–272.
Marraffini, L.A., and Sontheimer, E.J. (2008) CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322: 1843–1845.
Martin, A., Yeats, S., Janekovic, D., Reiter, W.-D., Aicher, W., and Zillig, W. (1984) SAV1, a temperate u.v.-inducible DNA virus-like particle from archaebacterium Sulfolobus acidocaldarius isolate B12. EMBO J 3: 2165–2168.
Menon, S.K., Maaty, W.S., Corn, G.J., Kwok, S.C., Eilers, B.J., Kraft, P., et al. (2008) Cysteine usage in Sulfolobus spindle-shaped virus 1 and extension to hyperthermophilic viruses in general. Virology 376: 270–278.
Muskhelishvili, G. (1994) The archaeal SSV integrase promotes intermolecular excisive recombination in vitro. Syst Appl Microbiol 16: 605–608.
Muskhelishvili, G., Palm, P., and Zillig, W. (1993) SSV1-encoded site-specific recombination system in Sulfolobus shibatae. Mol Gen Genet 237: 334–342.
Oren, A., Bratbak, G., and Heldal, M. (1997) Occurrence of virus-like particles in the Dead Sea. Extremophiles 1: 143–149.
Palm, P., Schleper, C., Grampp, B., Yeats, S., McWilliam, P., Reiter, W.D., and Zillig, W. (1991) Complete nucleotide sequence of the virus SSV1 of the archaebacterium Sulfolobus shibatae. Virology 185: 242–250.
Peng, X. (2008) Evidence for the horizontal transfer of an integrase gene from a fusellovirus to a pRN-like plasmid within a single strain of Sulfolobus and the implications for plasmid survival. Microbiol 154 (Pt 2): 383–391.
Porter, K., Russ, B.E., and Dyall-Smith, M.L. (2007) Virus–host interactions in salt lakes. Curr Opin Microbiol 10: 418–424.
Prangishvili, D. (2003) Evolutionary insights from studies on viruses of hyperthermophilic archaea. Res Microbiol 154: 289–294.
Prangishvili, D., Garrett, R.A., and Koonin, E.V. (2006a) Evolutionary genomics of archaeal viruses: unique viral genomes in the third domain of life. Virus Res 117: 52–67.
Prangishvili, D., Vestergaard, G., Häring, M., Aramayo, R., Basta, T., Rachel, R., and Garrett, R.A. (2006b) Structural and genomic properties of the hyperthermophilic archaeal virus ATV with an extracellular stage of the reproductive cycle. J Mol Biol 359: 1203–1216.
Rachel, R., Bettstetter, M., Hedlund, B.P., Häring, M., Kessler, A., Stetter, K.O., and Prangishvili, D. (2002) Remarkable morphological diversity of viruses and virus-like particles in hot terrestrial environments. Arch Virol 147: 2419–2429.
Reiter, W.-D., Palm, P., Henschen, A., Lottspeich, F., Zillig, W., and Grampp, B. (1987a) Identification and characterization of the genes encoding three structural proteins of the Sulfolobus virus-like particle SSV1. Mol Gen Genet 206: 144–153.
Reiter, W.D., Palm, P., Yeats, S., and Zillig, W. (1987b) Gene expression in archaebacteria: physical mapping of constitutive and UV-inducible transcripts from the Sulfolobus virus-like particle SSV1. Mol Gen Genet 209: 270–275.
Rice, G., Stedman, K., Snyder, J., Wiedenheft, B., Willits, D., et al. (2001) Viruses from extreme thermal environments. Proc Natl Acad Sci USA 98: 13341–13345.
Sæbø, P.E., Andersen, S.M., Myrseth, J., Laerdahl, J.K., and Rognes, T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res 33: 535–539.
Schleper, C., Kubo, K., and Zillig, W. (1992) The particle SSV1 from the extremely thermophilic archaeon Sulfolobus is a virus: demonstration of infectivity and of transfection with viral DNA. Proc Natl Acad Sci USA 89: 7645–7649.
Serre, M.-C., Letzelter, C., Garel, J.-R., and Duguet, M. (2002) Cleavage properties of an archaeal site-specific recombinase, the SSV1 integrase. J Biol Chem 277: 16758–16767.
Shah, S.A., Hansen, N.R., and Garrett, R.A. (2009) Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem Soc Trans 37: 23–28.
Snyder, J.C., Wiedenheft, B., Lavin, M., Roberto, F.F., Spuhler, J., Ortmann, A.C., et al. (2007) Virus movement maintains local virus population diversity. Proc Natl Acad Sci USA 104: 19102–19107.
Stedman, K.M., She, Q., Phan, H., Arnold, H.P., Holz, I., Garrett, R.A., and Zillig, W. (2003) Relationships between fuselloviruses infecting the extremely thermophilic archaeon Sulfolobus: SSV1 and SSV2. Res Microbiol 154: 295–302.
Steinmetz, N.F., Bize, A., Kindlay, K.C., Lomonosoff, G.P., Manchester, M., Evans, D.J., and Prangishvili, D. (2008) Site-specific and spatially controlled addressability of a new viral nanobuilding block: Sulfolobus islandicus rod-shaped virus 2. Adv Funct Mater 18: 1–9.
Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., et al. (2008a) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J Bacteriol 190: 6837–6845.
Vestergaard, G., Aramayo, R., Basta, T., Häring, M., Peng, X., Brügger, K., et al. (2008b) Structure of the acidianus filamentous virus 3 and comparative genomics of related archaeal lipothrixviruses. J Virol 82: 371–381.
Wiedenheft, B., Stedman, K., Roberto, F., Willits, D., Gleske, A.K., Zoeller, L., et al. (2004) Comparative genomic analysis of hyperthermophilic archaeal Fuselloviridae viruses. J Virol 78: 1954–1961.
Xiang, X., Chen, L., Huang, X., Luo, Y., She, Q., and Huang, L. (2005) Sulfolobus tengchongensis spindle-shaped virus STSV1: virus–host interactions and genomic features. J Virol 79: 8677–8686.
Yeats, S., McWilliam, P., and Zillig, W. (1982) A plasmid in the archaebacterium Sulfolobus acidocaldarius. EMBO J 1: 1035–1038.
Zillig, W., Prangishvilli, D., Schleper, C., Elferink, M., Holz, I., Albers, S., et al. (1996) Viruses, plasmids and other genetic elements of thermophilic and hyperthermophilic Archaea. FEMS Microbiol Rev 18: 225–236.

Contribution: substantial

For this manuscript I prepared figures 2 and 4, assisted Professor Roger A. Garrett in annotating and preparing the genomes for submission, and prepared the data behind Table 6. Only the latter has a direct relation to my Ph.D project, however.

Environmental Microbiology (2010) 12(11), 2918–2930 doi:10.1111/j.1462-2920.2010.02266.x

Metagenomic analyses of novel viruses and plasmids from a cultured environmental sample of hyperthermophilic neutrophiles

Roger A. Garrett,1* David Prangishvili,2 Shiraz A. Shah,1 Monika Reuter,2,3 Karl O. Stetter3 and Xu Peng1
1Archaea Centre, Department of Biology, Copenhagen University, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark.
2Institut Pasteur, Molecular Biology of the Gene in Extremophiles Unit, rue Dr. Roux 25, 75724 Paris Cedex 15, France.
3Department of Microbiology, Archaea Centre, University of Regensburg, D-93053 Regensburg, Germany.

Received 10 February, 2010; accepted 20 April, 2010. *For correspondence: E-mail [email protected]; Tel. (+45) 35322010; Fax (+45) 35322128.

Summary

Two novel viral genomes and four plasmids were assembled from an environmental sample collected from a hot spring at Yellowstone National Park, USA, and maintained anaerobically in a bioreactor at 85°C and pH 6. The double-stranded DNA viral genomes are linear (22.7 kb) and circular (17.7 kb), and derive apparently from archaeal viruses HAV1 and HAV2. Genomic DNA was obtained from samples enriched in filamentous and tadpole-shaped virus-like particles respectively. They yielded few significant matches in public sequence databases reinforcing, further, the wide diversity of archaeal viruses. Several variants of HAV1 exhibit major genomic alterations, presumed to arise from viral adaptation to different hosts. They include insertions up to 350 bp, deletions up to 1.5 kb, and genes with extensively altered sequences. Some result from recombination events occurring at low complexity direct repeats distributed along the genome. In addition, a 33.8 kb archaeal plasmid pHA1 was characterized, encoding a possible conjugative apparatus, as well as three cryptic plasmids of thermophilic bacterial origin, pHB1 of 2.1 kb and two closely related variants pHB2a and pHB2b, of 4.8 and 5.4 kb respectively. Strategies are considered for assembling genomes of smaller genetic elements from complex environmental samples, and for establishing possible host identities on the basis of sequence similarity to host CRISPR immune systems.

Introduction

Archaeal viruses exhibit a wide variety of morphotypes and genomic properties. They have been isolated and characterized primarily from terrestrial acidic hot springs or hypersaline lakes, in many different geographical locations. Several viruses from terrestrial acidic hot springs have now been classified into new viral families while others, together with a few haloarchaeal viruses from the euryarchaeal kingdom, remain unclassified (Prangishvili et al., 2006a; Porter et al., 2007; Lawrence et al., 2009). Although some crenarchaeal and euryarchaeal virions share similar morphotypes, their genomic properties show little in common (Ortmann et al., 2006; Porter et al., 2007) nor, with the exception of a few head-tail euryarchaeal viruses, do they share many homologous genes with either bacterial or eukaryal viruses (Prangishvili et al., 2006b). Despite the broad diversity of characterized archaeal viruses, as a group they probably constitute a biased sample because most of them exclusively infect thermoacidophilic members of the order Sulfolobales or a few haloarchaeal strains.

Few studies, to date, have addressed the relative abundance of different viral morphotypes in archaea-rich environments. Electron microscopy studies of samples from terrestrial hot springs suggest that spindles, filaments, rods and spheres predominate (Rachel et al., 2002; Bize et al., 2008), while other morphotypes are much less common. In hypersaline environments spindle-shaped and spherical forms predominate (Oren et al., 1997; Diez et al., 2000; Porter et al., 2007) while head-tail virus-like particles (VLPs) are quite common and their proviruses have been detected in some sequenced genomes of halo- and methanoarchaea (Porter et al., 2007; Krupovič and Bamford, 2008; Krupovič et al., 2010).

Only four crenarchaeal viruses from extreme geothermal environments at neutral pH values have been fully characterized to date, the rod-shaped Thermoproteus tenax lipothrixvirus, TTV1 (Janekovic et al., 1983), Pyrobaculum spherical virus, PSV (Häring et al., 2004), the closely related T. tenax spherical virus 1, TTSV1 (Ahn

© 2010 Society for Applied Microbiology and Blackwell Publishing Ltd

et al., 2006), and the Aeropyrum pernix bacilliform virus 1, APBV1 (Mochizuki et al., 2010). However, electron microscopy studies of an enrichment culture from a sample collected from Obsidian Pool, Yellowstone National Park, USA, maintained at 85°C and pH 6 under anaerobic conditions, revealed five morphologically diverse VLPs (fig. 1 in Rachel et al., 2002), including spherical virions of the virus PSV which was characterized earlier (Häring et al., 2004). The enrichment culture also carried a variety of genera, including crenarchaeal Thermofilum, Thermoproteus and Thermosphaera, euryarchaeal Archaeoglobus and the bacterial genera Thermus, Geothermobacterium and Thermodesulfobacterium (Rachel et al., 2002). The enrichment culture was maintained in a bioreactor over a 2-year period and aliquots were extracted at regular intervals over this time and screened for VLPs by electron microscopy. The different VLP morphotypes observed, including the spherical PSV, varied considerably in their relative yields over time (Fig. 1).

In this study, we attempted to obtain genome sequences associated with the remaining unidentified VLPs. To this end, the samples extracted from the bioreactor at different time intervals were investigated, as well as mixtures of samples. A variety of approaches were used to generate clone libraries and to distinguish viral from plasmid DNA, and circular from linear DNA genomes and, for the VLPs, to correlate genome-type with morphotype.

Since attempts to find hosts for the VLPs were unsuccessful, we investigated potential hosts for the archaeal viruses and plasmids by matching their genome sequences to spacer sequences of the chromosomal immune CRISPR/Cas system (Van der Oost et al., 2009). These chromosomal spacers derive from infecting viruses or plasmids (Barrangou et al., 2007) and are present within all the available sequenced genomes of thermophilic neutrophiles. The spacers represent a history of invading viruses and plasmids and a close sequence match implies that the host has been infected by a similar virus or plasmid (Lillestøl et al., 2006; Andersson and Banfield, 2008; Shah et al., 2009).

Results

An enrichment culture established from a sample collected from Obsidian Pool at Yellowstone National Park, USA, was maintained in a bioreactor at 85°C and pH 6. Virus-like particles were concentrated from supernatant aliquots taken from the bioreactor and subjected to CsCl density gradient ultracentrifugation. Initially, shot-gun clone libraries were prepared from a mixture of bioreactor samples (bioreactor-mix) (Fig. 1A) which were deproteinized after density gradient centrifugation without any pre-treatment but most clones were found to derive from contaminating chromosomal DNA fragments. Therefore, DNase I treatment was introduced to remove chromosomal contamination before deproteinization. Moreover, we examined samples collected at different times over a 2-year period, for which a given VLP type was dominant in electron micrographs (Fig. 1B and C), in order to correlate viral genome types with morphotypes. Furthermore, strategies were developed for distinguishing viral from plasmid DNA, and linear from circular DNA genomes (Table 1; see also Experimental procedures).

Five main libraries were prepared to generate viral DNA and plasmid clones (Table 1). These include the larger library of supernatant DNA from the mix of bioreactor samples collected at different time intervals (4000 sequences) (Fig. 1A), and an earlier library that was used to sequence the partially purified Pyrobaculum spherical virus PSV, isolated from the same bioreactor (Häring et al., 2004). Moreover, samples enriched in two of the novel VLPs were obtained (Fig. 1B and C) and used to generate clone libraries. Thus, a shot-gun filament library was prepared from two samples rich in short filamentous VLPs (Fig. 1B), and further clone libraries were prepared from samples rich in tadpole-shaped particles (Fig. 1C) after selecting for (i) circular plasmids which were preferentially amplified (tadpole-1) and (ii) circular viral genomes after degrading chromosomal and plasmid DNA and then deproteinizing virions and treating with circular DNA-safe nucleases (tadpole-2). We also screened, unsuccessfully, for RNA viral genomes by generating cDNA libraries (data not shown).

Complete genomes from two putative archaeal viruses HAV1 and HAV2 were assembled, the former 22.7 kb and linear, and the latter 17.7 kb and circular, and, in addition, four plasmids were sequenced: pHA1 – 33.8 kb, pHB1 – 2.1 kb, and two variants, pHB2a and pHB2b, of 4.8 kb and 5.4 kb respectively. The approximate percentages of clones from the five main libraries that were incorporated into each assembled genetic element are given (Table 1), and the numbers are consistent with the strategy employed for distinguishing viral from plasmid genomes, except that the relatively high percentage of clones of plasmids pHB2a and pHB2b (20%), obtained from the tadpole-2 library, probably reflects incomplete DNase I digestion of non-viral circular DNA (Table 1). The average genome coverage was about fivefold for each element, unless otherwise stated, and all sequence ambiguities were resolved by primer walking on clones. The identities and general properties of the sequenced genetic elements are summarized in Table 2.

Filamentous VLPs

Two bioreactor samples rich in short filamentous VLPs (Fig. 1B) were pooled and treated with DNase I at 37°C


Fig. 1. Electron micrographs showing VLP morphotypes observed in the analysed bioreactor culture. A. A mixture of all preparations of VLPs collected from the bioreactor. B. Preparation enriched in filamentous VLPs. C. Preparation enriched in tadpole-shaped VLPs. The size marker corresponds to 500 nm.


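Table 1. Pre-treatment of viral samples before library construction and the approximate percentage of clone sequences assembled from each library for each genetic element.

Clone library    Treatments
Bioreactor mix   i, iii
PSV              i, iii
Filament         i, iii, v
Tadpole-1        i, iii, v
Tadpole-2        i, ii, iii, iv, v

Element     Size (bp)      Clones (%) per contributing library (left to right)
HAV1        22 743         5, 95
HAV2        17 666         5, 95
pHA1        33 795         40, 57, 3
pHB1        2 099          100
pHB2a/2b    4 780/5 370    80, 20

Bioreactor supernatant extracts were subjected to the following treatments:
i. CsCl gradient centrifugation of virions.
ii. DNase I treatment of the virion band from CsCl gradients.
iii. Deproteinization with SDS and proteinase K followed by phenol extraction of DNA.
iv. Plasmid-safe DNase treatment of DNA.
v. In vitro amplification of DNA.
The total numbers of sequenced clones that were assembled into the virus and plasmid genomes (prior to sequence polishing) were HAV1 – 956, HAV2 – 195, pHA1 – 188, pHB1 – 49, pHB2a/2b, which were co-assembled – 55.

for 15 min, to remove extraneous chromosomal and plasmid DNA before extracting DNA from VLPs by phenol treatment. A clone library was generated and DNA sequencing yielded a non-circular contig of about 20 kb, consistent with a linear genome. Since terminal sequences are invariably absent from shot-gun clone libraries of linear genomes (e.g. Vestergaard et al., 2008a), libraries were produced using the Linker Amplified Shotgun Library method (see Experimental procedures) which yielded a high sequence coverage of the DNA termini. The complete linear DNA genome consists of 22 743 bp with a 21 bp inverted terminal repeat (ITR) of sequence 5′-CGTCTCTCTGTGTGTATGGGA-3′. We infer that both termini are free, blunt and unmodified, because they were efficiently ligated with the blunt end of the adaptor during library construction (see Experimental procedures). Since only one major contig was assembled from the filament-library sequences, we inferred that it derived from the filamentous virus (Fig. 1B). Genome analyses indicated that the virus was of archaeal origin (Torarinsson et al., 2005), and all of the predicted genes lie on one strand of the genome, similar to the highly biased strand usage of the crenarchaeal thermoneutrophilic viruses PSV and TTSV1 (Häring et al., 2004; Ahn et al., 2006). A few clone sequences from the filament library assembled into the PSV genome, indicating that small amounts of that virus had co-purified with HAV1, consistent with the low levels of spherical particles observed in electron micrographs (Fig. 1B).

No genes yielded highly significant matches in public sequence databases; only weak but persistent matches were observed to a Cas4-like protein (DUF83) (ORF218), possibly a nuclease, for which matches were also observed in several crenarchaeal fuselloviruses (Redder et al., 2009) and the filamentous lipothrixvirus AFV1 (Bettstetter et al., 2003), a parB-like partition protein (ORF253) and a transcriptional regulator (ORF146) (Table 3). Several ORFs carry putative transmembrane motifs, some with predicted signal peptides, as illustrated in Fig. 2A. The very low level of gene matches to public sequence databases is a characteristic of the other sequenced thermoneutrophilic viruses PSV, TTSV1 and TTV1 (Janekovic et al., 1983; Bettstetter et al., 2003; Ahn et al., 2006), and appears to be a general feature of many crenarchaeal viral genomes (Prangishvili et al., 2006b).

HAV1 genomic variants

Although there is a low level of sequence heterogeneity throughout the genome, there are numerous local heterogeneity 'hot-spots', present in almost half of the predicted genes (17 out of 40) as indicated in Fig. 2A. In addition, several genomic variants of HAV1 were assembled with major alterations including gene insertions of up to 350 bp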
The very low level of gene matches to public infer that both termini are free, blunt and unmodified, sequence databases is a characteristic of the other because they were efficiently ligated with the blunt end of sequenced thermoneutrophilic viruses PSV, TTSV1 and the adaptor during library construction (see Experimental TTV1 (Janekovic et al., 1983; Bettstetter et al., 2003; Ahn procedures). Since only one major contig was assembled et al., 2006), and appears to be a general feature of many from the filament-library sequences, we inferred that it crenarchaeal viral genomes (Prangishvili et al., 2006b). derived from the filamentous virus (Fig. 1B). Genome analyses indicated that the virus was of archaeal origin HAV1 genomic variants (Torarinsson et al., 2005), and all of the predicted genes lie on one strand of the genome, similar to the highly Although there is a low level of sequence heterogeneity biased strand usage of the crenarchaeal thermoneutro- throughout the genome, there are numerous local hetero- philic viruses PSV and TTSV1 (Häring et al., 2004; Ahn geneity ‘hot-spots’, present in almost half of the predicted et al., 2006). A few clone sequences from the filament genes (17 out of 40) as indicated in Fig. 2A. In addition, library assembled into the PSV genome, indicating that several genomic variants of HAV1 were assembled with small amounts of that virus had co-purified with HAV1, major alterations including gene insertions of up to 350 bp

Table 2. Genomic properties of the thermoneutrophilic viruses and plasmids.

Element   ds DNA (bp)   Form       G+C content (%)   Domain     GenBank accession number
HAV1      22 743        Linear     46.2              Archaea    GU722196
HAV2      17 666        Circular   52.1              Archaea    GU722197
pHA1      33 795        Circular   45.4              Archaea    GU722198
pHB1      2 099         Circular   54.7              Bacteria   GU722199
pHB2a     4 780         Circular   61.6              Bacteria   GU722200
pHB2b     5 370         Circular   60.2              Bacteria   GU722201
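The G+C content column of Table 2 is the percentage of G and C bases in each sequence. A minimal sketch of that calculation follows; the function name `gc_content` and the choice to ignore ambiguous bases are assumptions of this example, and the test sequence is the 21 bp inverted terminal repeat reported for HAV1 rather than a full genome.

```python
def gc_content(seq):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


if __name__ == "__main__":
    # The HAV1 inverted terminal repeat quoted in the text, used as toy input.
    itr = "CGTCTCTCTGTGTGTATGGGA"
    print(f"G+C content of the ITR: {gc_content(itr):.1f}%")
```

Applied to the deposited genome sequences, the same calculation should give values comparable to those in the table.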

© 2010 Society for Applied Microbiology and Blackwell Publishing Ltd, Environmental Microbiology, 12, 2918–2930

Table 3. Significant ORF matches within public sequence databases.

Element  ORF         e-value  Match                                                  Origin

HAV1     ORF253      4e-06    parB-like partition                                    Dethiobacter alkaliphilus AHT 1
         ORF218      9e-05    Cas4-like (DUF83)                                      Sulfolobus fusellovirus SSV2
         ORF170      1e-06    Hypothetical – Tpen_1879                               Thermofilum pendens Hrk 5
         ORF146      5e-05    CopG/Arc/MetJ family transcriptional regulator         Pyrobaculum aerophilum str. IM2
HAV2     ORF1767     2e-11    Hypothetical – ATV_gp60 (ORF710 – C-terminal 375 aa)   Acidianus bicaudavirus ATV
         ORF909      1e-125   Primase/DNA polymerase                                 Sulfolobus neozealandicus pORA1
         ORF506      2e-15    AAA-ATPase, CDC48-type (ORF618 N-terminal 210 aa)      Acidianus bicaudavirus ATV
         ORF263      2e-04    ORF731 N-terminal 100 aa                               Sulfolobus bicaudavirus STSV1
         ORF122      3e-45    IS element Dka2 OrfA                                   Desulfurococcus kamchatkensis
         ORF420      2e-160   IS element Dka2 OrfB                                   Desulfurococcus kamchatkensis
pHA1     ORF575      2e-08    Phage/plasmid primase COG3378 (C-terminal 300 aa)      P4 family
         ORF396      2e-25    C5-cytosine-specific methylase                         Thermus phage P23-45
         ORF375      8e-42    Type III restriction enzyme, res subunit               Thermofilum pendens Hrk 5
         ORF337      2e-07    DEAD/DEAH box helicase                                 Thermofilum pendens Hrk 5
         ORF320      5e-18    Abortive infection protein                             Thermofilum pendens Hrk 5
         ORF282      4e-74    Integrase                                              Thermofilum pendens Hrk 5
         ORF93       6e-06    Holliday junction resolvase                            Methanocaldococcus vulcanius M7
pHB1     ORF477      8e-14    Rep protein-rolling circle                             Bacterial plasmid pAB49
pHB2a+b  ORF399/557  8e-75    TraA-like, conjugal transfer                           Polaromonas naphthalenivorans CJ2
         ORF269      2e-47    RepB protein                                           Acinetobacter baumannii ACICU
pHB2a    ORF116      3e-18    Hypothetical – Veis_1406                               Verminephrobacter eiseniae EF01-2
pHB2b    ORF115      2e-19    Hypothetical – StreC_09508                             Streptomyces sp. C

and deletions of up to 1.5 kb, genes with altered sequences, and duplications. The number of clone sequences that assembled into each of the variant regions (Table 4), relative to the number of clones in the dominant genome, indicated that the original viral population was very heterogeneous, and this was reinforced by the preparative gel electrophoresis pattern of the viral DNA, which revealed a broad heterogeneous band between DNA size markers of 19.4 and 24 kb (Fig. 3).

Ten assembled contigs of HAV1 variants showed major genomic changes with some carrying two to three independent alterations (Table 4). Most of the deletions and other major genomic changes occur at one or more of the 11 adjoining pyrimidine-rich and purine-rich sequences, most of which are intergenic (Fig. 2A; Table 5). These sites constitute partially conserved, low-complexity direct repeats along the genome, and some carry inverted repeats (Table 5). Only a quarter of the viral genes are affected by these genomic changes. Of these, ORFs 123a, 156, 284, 102, 78a and 170 appear dispensable for the virion, while ORFs 140, 174, 276 and 325 can undergo large sequence variations, and ORF585, which contains two putative recombination sites (Fig. 2A; Table 5), has undergone insertions, partial deletions and/or extensive sequence changes, and exhibits altered start codon positions.

Tadpole-shaped VLPs

DNA was extracted from a purified viral preparation that was rich in tadpole-shaped VLPs (Fig. 1C), and was amplified using the φ29 polymerase, before preparing a shot-gun clone library (Table 1; see Experimental procedures). Sequences were assembled, together with some sequences from the bioreactor mix library (Table 1), into a circular double-stranded (ds) DNA genome of 17 666 bp (Fig. 2B), where the predicted genes are preceded by archaea-specific motifs (Torarinsson et al., 2005). There was little sequence heterogeneity in the HAV2 genome which almost certainly reflects the DNA amplification step prior to cloning, such that an initial dominating component was preferentially amplified. As for HAV1, only one major contig was assembled and we inferred therefore that it derived from the tadpole-shaped VLPs (Fig. 1C).

In contrast to HAV1, a few significant matches to public sequence databases were found (Table 3). Highly significant matches occurred for an archaea-specific bifunctional DNA primase-polymerase encoded on two plasmids of Sulfolobus neozealandicus (Lipps et al., 2004; Greve et al., 2005), and for an IS element of the IS200/IS605 family present in Desulfurococcus kamchatkensis. Moreover, two matches occurred to a crenarchaeal


Fig. 2. Genome maps of the HAV1 (A) and HAV2 (B) viruses where predicted genes are indicated by arrows and denoted by their amino acid lengths. Significant predictions of gene product functions are indicated. Striated genes encode predicted transmembrane motifs. In (A) red sections indicate gene regions carrying hot-spots for single-site mutations. Putative recombination sites are indicated (•).

bicaudavirus ATV which exhibits a similar spindle-shaped morphology but with two tails (Prangishvili et al., 2006c). Thus, the C-terminal 350 amino acids of HAV2-ORF1767 showed significant sequence similarity to the corresponding region of ATV-ORF710, and HAV2-ORF506 also carries an AAA-ATPase domain of the CDC48 type, similar to that present in ATV-ORF618 (Fig. 2B).

Archaeal and bacterial plasmids

Plasmid sequences were assembled mainly from the clone libraries of samples lacking DNase I treatment before viral DNA extraction (Table 1). The 33 795 bp pHA1 (Hyperthermophilic Archaeon) is of archaeal origin and was assembled from different libraries of non-amplified DNA, including that of the bioreactor mix and the PSV virus (Häring et al., 2004) (Table 1). Minor sequence heterogeneities occurred throughout the genome but no larger genomic changes were observed. About one-third of the 59 predicted genes are homologous to genes in the 31 504 bp plasmid TPEN01 from Thermofilum pendens Hrk5 (Anderson et al., 2008), and they are clustered in the pHA1 genome (Fig. 4A). Several genes encoding hypothetical proteins carry putative


Fig. 3. Characterization of DNA isolated from the purified, VLP-enriched preparation of the filamentous virus (HAV1), after removal of plasmid and chromosomal DNA, and prior to generating the filament library (Table 1). M – DNA size markers.

transmembrane motifs, some also exhibiting predicted signal peptides, and these include a cluster of 10 tightly linked genes some of which are probably co-transcribed (Fig. 4A). Although there is no significant sequence similarity, these proteins may generate a novel conjugative apparatus, by analogy with a group of conjugative membrane proteins encoded by a conserved gene cluster of conjugative plasmids of the crenarchaeal thermoacidophiles (Greve et al., 2004).

The three smaller plasmids were assembled exclusively from the tadpole-1/2 libraries of amplified DNA (Table 1) and the sequences are relatively homogeneous. Each plasmid is of bacterial origin, as judged by promoter and ribosome binding motifs (Torarinsson et al., 2005). The 2099 bp pHB1 (Hyperthermophilic Bacterium) encodes a

large replication protein, probably of the rolling circle type (Table 3), and other predicted genes overlap on the two DNA strands (Fig. 4B). pHB2a and pHB2b, of 4780 and 5370 bp, respectively, are variants sharing 3780 bp of highly similar sequence but with two altered regions as illustrated (Fig. 4B). Whereas the shorter altered regions exhibit no sequence similarity, the larger regions of 781 bp and 1257 bp for pHB2a and pHB2b, respectively, carry about 300 bp with a low but significant level of sequence similarity. These altered sequences resulted in ORF125 being exclusive to pHB2a, and ORFs 58, 60, 67, 68 and 97 being specific to pHB2b (Fig. 4B). In addition, ORFs 399 and 157 in pHB2a are fused in pHB2b (ORF557) and

Table 4. Properties of the HAV1 genomic variants.

HAV1 variant  Number of clones  Viral position  Genome change  ORF changes

12345 2 146 67 9 4189 19 492–1827 1032 7 2 3761 8042 8050 8 14430–14699 Deleted 1337 bp, 80–143 replaced 14360–14498, 14737–14885 14995–15460 65 bp partial duplication, Altered T-C-rich gene region 17352–17899 Altered Deleted gene ORFs123a/156 No Insert ORF 92 270 bp bp Insert – duplication 49 bp 60 bp in repeat C-rich in region start Deleted 375 bp Altered gene ORF141 end extends 10 aa, ORF94 start extends 36 ORF141 aa end extends 9 aa, ORF94 Heterogeneities start ORF585 extends 10 aa No ORF ORF174 (ORF183, 36% identity/59% similarity) Truncated ORF585 ORF276, altered central 170 aa (ORF259, 61% identity/71% similarity)
1659–3188 8868–8755 9946–10757 Deleted 1529 bp Altered gene Deleted 813 bp Deleted ORFs284/102 Deleted ORF218 ORFs78a/170 (ORF218, 88% identity/92% similarity) 15000 15717–16062 Insert Altered 350 gene bp Variant ORF585, ORF585, altered C-terminal 345 half bp (ORF653) centrally
10 16399 19439–20029 Altered gene ORF325 (ORF315, 72% identity/80% similarity)


Table 5. Putative recombination sites associated with alterations in the variant HAV1 genomes.

Variant number  Genome change              Genome positions  Recombination sites

1a              1337 bp deletion           474–496           CCCTCCCCTTTTTCTATGAAGTCGAAGGTGGA
1b              Recombination              1805–1833         TTTTTTCTTTTTCCTCTTTTTTTCCCTTCGGAGAAAAG
2a              65 bp partial duplication  1014–1052         TCTTTTTTCCCCTCTTTTCCTTTCTTCATGATGAAAGGA
2b              1529 bp deletion           1650–1677         CCTCTTTTTTTCTAGCCGCACCTCCTTTGGAGAAAAA
2c              1529 bp deletion           3183–3193         TCTGACCCTTCGGAGAAAAA
5a              49 bp insertion            8060              CCCGTTCCCGGCGTCTCGGTGGAA
6b              350 bp altered             15000             CTCCTCACTCTTCTTCTCGCTGTTCAGGAGGAGGA
8b              345 bp replacement         16040–16065       CTTTGCTGTATCTATTGCGAGGAAGA

Similar sites exist also at genome positions 559–585, 2728–2748 and 2766–2778. Inverted repeats (underlined) are present in some recombination sites. Details of the variants are given in Table 4.
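The sites in Table 5 are described in the text as adjoining pyrimidine-rich and purine-rich sequences. As a hedged illustration of that property, the sketch below locates the position that best separates a pyrimidine-rich left part from a purine-rich right part. The function name and the scoring rule are assumptions made for this example and are not taken from the paper; the input is the variant 2c site from the table.

```python
PYRIMIDINES = set("CT")
PURINES = set("AG")


def pyrimidine_purine_junction(site: str) -> int:
    """Index i maximizing pyrimidines in site[:i] plus purines in site[i:].

    Ties are resolved in favour of the leftmost index.
    """
    site = site.upper()
    best_index, best_score = 0, -1
    for i in range(len(site) + 1):
        left = sum(base in PYRIMIDINES for base in site[:i])
        right = sum(base in PURINES for base in site[i:])
        if left + right > best_score:
            best_index, best_score = i, left + right
    return best_index


site_2c = "TCTGACCCTTCGGAGAAAAA"  # recombination site of variant 2c (Table 5)
cut = pyrimidine_purine_junction(site_2c)
print(site_2c[:cut], site_2c[cut:])  # prints "TCTGACCCTTC GGAGAAAAA"
```

The quadratic scan is adequate for sites of a few dozen bases; for genome-wide scans a running count would be preferable.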

Fig. 4. Genome maps of the circular plasmids (A) archaeal pHA1 and (B) three bacterial plasmids pHB1, pHB2a and pHB2b, where arrows indicate predicted genes denoted by their amino acid lengths. Striated genes encode predicted transmembrane motifs while grey shaded genes are homologous to genes in TPEN01. Shaded areas inside the circles for pHB2a and pHB2b indicate regions of different sequence.


ORF96 (pHB2a) and ORF98 (pHB2b), as well as ORF115 (pHB2a) and ORF116 (pHB2b), show limited sequence differences (Fig. 4B). Both plasmid variants encode a replication protein and a Tra-like conjugal protein and carry a high G+C-rich region which may constitute an RNA gene (Fig. 4B). As for HAV2, there was little sequence heterogeneity for the bacterial plasmids, which probably also reflects their amplification prior to cloning. Detection of variants pHB2a and pHB2b suggests that both were substantial components in the original DNA preparation.

CRISPR spacer matches

RNAs transcribed from CRISPR repeat clusters, and processed to spacer RNAs, can target and inactivate extrachromosomal elements (reviewed in Van der Oost et al., 2009). Thus, host repeat clusters maintain a record of invading genetic elements. In principle therefore it should be possible to determine a host of an isolated genetic element by comparing its genome sequence with CRISPR spacer sequences from chromosomes of potential hosts. We attempted to do this for the newly characterized viruses and plasmids by comparing their sequences, and those of other available thermoneutrophilic viruses and plasmids, with the 1321 spacer sequences in the CRISPR clusters of the 13 sequenced thermoneutrophilic genomes (see Experimental procedures). Although a sequence comparison at a nucleotide level yielded no significant matches, a few significant matches were found when searching at the more conserved amino acid sequence level, after translating the spacers into all six reading frames, essentially as described earlier (Shah et al., 2009). At an e-value cut-off of 0.12, there are seven significant matches to the viruses and plasmids which are listed in Table 6, and 35 matches to annotated crenarchaeal proteins in the 13 genomes (some of which may occur to integrated viruses or plasmids which were not removed from the data set). Given that viral/plasmid ORFs constituted only 0.8% of sequences present in the search (corresponding to 54 684 out of a total of 6 317 506 amino acids), the results show a 20-fold preference for spacers matching viral/plasmid ORFs over crenarchaeal genome ORFs which reinforces the significance of the matches (Table 6). A similar, and significant, analysis of the bacterial plasmids was not possible because of their small sizes and the paucity of available bacterial thermophile CRISPR sequences.

Of the four published genetic elements, PSV, TTSV1 and TPEN01 yielded one or more significant spacer matches to a known host genus (Table 6). Moreover, HAV1 gave a good match to a Pyrobaculum, while HAV2 yielded good matches to Desulfurococcus and

Table 6. Significant CRISPR spacer matches to crenarchaeal thermoneutrophilic viruses and plasmids.

Crenarchaeal genome  Total spacers  CRISPR  Spacer  Virus/plasmid  Host genus  ORF  e-value  Alignment

Pyrobaculum arsenaticum Desulfurococcus kamchatkensis Thermoproteus neutrophilus Thermofilum pendens Thermoproteus neutrophilus Thermoproteus neutrophilus Thermoproteus neutrophilus
94 88 10 HAV2 909 0.023 126 90 47 HAV1 253 0.012 225 26 13182 HAV2 27 17 pTPEN01 420 0.067 225225 16225 38 5 38 5 PSV 6 TTSV1 TTSV1
97 0.017 633 0.079 100a 0.041 100a 0.041
Thermofilum Pyrobaculum Thermoproteus Thermoproteus Thermoproteus
1WLHWLYIYGASHTG14 1LGRSYDTIRKYQ12 1AQYNSWLESRL11 1VVYVDETYTSATCP14 1DIWKIRWPEAIKS13 1RCDLCGRRVSYET13 1RCDLCGRRVSYET13
114 AKILGREYDTVRKYRNAA 131 568 RGKW IRWLYLYGSSKTGKTT 587 40 PDTRCD ICGRK I GYGPYMV 58 112 EVEAQYNSWLESRLAVL 128 329 GITAVYVDEAYTSSKCPIHG 348 75 RNFDIWKVKWPTALRAQIA 93 40 PDTRCD I CGRK I GYGPYMV 58

A total of 1321 CRISPR spacers were extracted from the 13 genomes which ranged from 39 repeats for Caldivirga maquilingensis IC-167 to 225 for T. neutrophilus. e-values derive from searching translated spacers against a database with a length of 6.3 million amino acids, where the viral/plasmid ORFs comprise 0.8%, and the crenarchaeal genome ORFs constitute 99.2%, of the sequence database. CRISPR repeat clusters are identified by the total number of repeats, and spacers are numbered from the leader end.
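Two computational steps underlie the spacer analysis described in this section: translating each spacer in all six reading frames, and comparing the match rate in viral/plasmid ORFs with that in genome ORFs. The sketch below illustrates both under stated assumptions. It uses the standard genetic code, hypothetical helper names and a toy spacer; it is not the authors' pipeline, which relied on a Smith-Waterman search. The hit counts and database sizes are the figures quoted in the text.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(spacer: str) -> list[str]:
    """Three forward frames followed by three reverse-complement frames."""
    spacer = spacer.upper()
    return [translate(strand[offset:])
            for strand in (spacer, reverse_complement(spacer))
            for offset in range(3)]


def fold_enrichment(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Ratio of hits per residue in database part A to that in part B."""
    return (hits_a / size_a) / (hits_b / size_b)


frames = six_frame_translations("ATGGCCTAA")  # toy spacer, not from the paper
# 7 viral/plasmid hits in 54 684 aa versus 35 genome hits in the remainder.
fold = fold_enrichment(7, 54_684, 35, 6_317_506 - 54_684)
print(frames[0], round(fold, 1))
```

With these inputs the ratio comes out at about 23, in line with the roughly 20-fold preference stated above.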


Thermoproteus, consistent with the high sequence similarity of the HAV2 genome to a Desulfurococcus IS element (Table 3; Fig. 2B). The next two most significant matches (not included in the table) were both between pHA1 ORF68 and single spacers in T. pendens and T. neutrophilus, with e-values of 0.64 and 0.67, respectively, also consistent with the extensive gene homology between pHA1 and the T. pendens plasmid TPEN01 (Fig. 4A). Thus, this approach appears to yield useful insights into possible hosts for the newly characterized archaeal genetic elements, and it should be more generally applicable for such metagenomic studies as more archaeal CRISPR repeat-cluster sequences, or whole genome sequences, become available.

Discussion

We characterized the genomic diversity of viruses and plasmids in a bioreactor established from a sample from a hot spring at Yellowstone National Park (Obsidian Pool) and maintained at 85°C and pH 6 for 2 years (Rachel et al., 2002). Using a variety of cloning strategies to select for linear or circular genomes, and to distinguish viruses from plasmids, the analyses yielded two novel viral genomes, HAV1 and HAV2, from samples highly enriched in filamentous and tadpole-shaped VLPs, respectively, where the former yielded several genomic variants. No additional longer genomic contigs were assembled, from either sample, which could correspond to the other elongated VLP that was observed in the original sample (Fig. 1) (Rachel et al., 2002). Neither viral genome shows any clear similarity to other known archaeal viruses; only HAV2 shows morphological similarities with the two-tailed bicaudavirus ATV, and limited sequence similarity between two genes (Prangishvili et al., 2006c), and they may be distantly related.

Electron microscopic visualization of bioreactor samples taken at regular intervals indicated that the levels of individual types of VLPs dramatically rose and fell over time. This was also true for the HAV1 variants which showed different yields with time as revealed by gel electrophoresis (data not shown). Presumably this reflects a reaction to: (i) the availability of receptive host cells, and (ii) the ability to overcome the archaeal cellular CRISPR immune systems (Lillestøl et al., 2006; Shah et al., 2009; Van der Oost et al., 2009). We infer that the extensive variety of HAV1 variants, which carry numerous sequence changes and major genomic structural alterations, reflects adaptation of the virus to these constraints. Moreover, the fact that they were isolated as virions (Table 1) suggests that they are all functional. These observations may be relevant to an earlier study which demonstrated a selective bias of viruses in laboratory cultures of environmental samples which contained diverse crenarchaeal fuselloviruses (Snyder et al., 2004). Possibly those that were undetectable by viral DNA amplification in the laboratory cultures had undergone genome rearrangements or were present in very low amounts, or in integrated form, at the time of testing.

Attempts were made to identify putative viral hosts by isolating strains from the bioreactor using a laser microscope and cell sorter but none of them were infected with viruses and, moreover, no crenarchaeal strains were found which were infected by the crude virus preparations (Fig. 1B and C) except for the spherical PSV, characterized earlier, which infected two Pyrobaculum and Thermoproteus strains (Häring et al., 2004). Moreover, at present no reliable practical procedures have been developed for transfecting viral DNA into neutrophilic crenarchaea.

Sequence heterogeneities occur throughout the HAV1 genome, such that the final sequence is necessarily a consensus, where the dominant nucleotide is taken at each position. Moreover, nearly half the predicted genes carry regions that were particularly susceptible to sequence change (Fig. 2A), and some of these also incur deletions or insertions. We infer that their gene products are most likely to be involved in virus–host interactions, cell adhesion or viral extrusion mechanisms. These genes include ORFs 140, 174, 276, 325 and 585 (Fig. 2A) and ORF585 is by far the most susceptible to change and is therefore a strong candidate for recognition of cellular receptors. The latter is reminiscent of the hypervariable ORFTPX of the Thermoproteus virus TTV1, although ORFTPX sequence changes occurred by a different mechanism (Neumann and Zillig, 1990a,b).

The most conserved genes of HAV1 (Fig. 2A) are strong candidates for participation in the basic viral mechanisms of DNA replication, transcriptional regulation and virion packaging. Earlier studies on the filamentous and rod-shaped viruses of crenarchaeal thermoacidophiles concluded that the conserved core viral genes tend to be concentrated at the centre of linear genomes (Vestergaard et al., 2008a,b) and this is consistent with the variants carrying deletions of combinations of the four genes at the left end of the genome (Fig. 2A).

There is a precedent for the formation of multiple genomic variants of a crenarchaeal virus. Earlier, the crenarchaeal rudivirus SIRV1 was isolated and passed through a series of closely related Sulfolobus islandicus strains, before reisolating the virions and sequencing their genomes. Several SIRV1 variants were detected which also exhibited localized regions of insertions, deletions, duplications and extensive gene sequence changes (Peng et al., 2004). However, at least some of the underlying mechanisms of genomic change appear to be different. For example, HAV1 carries recombination sites constituting low-complexity direct repeats, some of which

can generate hairpin structures (Table 5) and are possibly related to the recombination sites characterized for plasmids of Sulfolobus which can generate regular hairpin structures (Peng et al., 2000; Greve et al., 2004). In contrast, SIRV1 variants incurred multiple 12 bp indels, mainly within genes (Peng et al., 2004) and they were not observed for HAV1. Despite some mechanistic differences, the overall genomic changes in the viral variants are quite similar, with some genes being conserved, others dispensable and deleted, and a few genes radically changed in sequence.

In contrast to classical studies on virus characterization, a degree of uncertainty necessarily exists in the interpretation of metagenomic data. It is difficult to confirm unambiguously a genome-type–morphotype relationship, especially when so few archaeal viral families are characterized although, as shown here, the uncertainty can be minimized by first enriching VLPs. Moreover, attempts to identify potential archaeal hosts, on the basis of the CRISPR immune system, will become more robust as more archaeal host chromosomes are sequenced but it will always be limited by the ability of some crenarchaeal viruses to infect a broader range of host species (Lillestøl et al., 2006; 2009; Vestergaard et al., 2008b).

Experimental procedures

DNA isolation and sequencing

All virion preparations from CsCl density gradients were dialysed against 10 mM Tris-acetate, pH 6 overnight. For some libraries (Table 1) chromosomal and plasmid DNA contamination was removed from viral samples (tadpole-2 and filament) by treating first with DNase I (50 units ml⁻¹) at 37°C for 15 min, followed by heat inactivation of the DNase I at 85°C for 15 min. Nucleic acid was isolated from virions as described earlier (Peng et al., 2004); briefly, virions were disrupted by incubation with 1% SDS and 0.5 mg ml⁻¹ proteinase K at 50°C for 1 h, and DNA was extracted by phenol and phenol-chloroform treatment before precipitating with 0.1 vol. of 3 M sodium acetate, pH 5.3, and 0.8 vol. of isopropanol. The DNA pellet was washed with 70% ethanol, air-dried and resuspended in an appropriate volume of 10 mM Tris-HCl, pH 8.0, 1 mM EDTA. Clone libraries were prepared by sonicating DNA to produce fragments of 2–3 kb and then constructing shot-gun libraries using SmaI-digested pUC18 as cloning vector (Peng, 2008) and, also, using the Linker Amplified Shotgun Library method described at http://www.sci.sdsu.edu/PHAGE/LASL/. DNA was extracted using a Model 8000 Biobot (Qiagen, Westburg, Germany) and sequenced in MegaBACE 1000 sequenators (Amersham Biotech, Amersham, UK). Viral and plasmid sequences were assembled using Sequencher 4.9 (http://www.genecodes.com/). Genome analyses and gene annotations were performed using Artemis (http://www.sanger.ac.uk/Software/Artemis/). Gene sequence searches were made in GenBank/EMBL (http://www.ncbi.nlm.nih.gov/blast) and motifs were identified using the SMART facility (http://smart.embl-heidelberg.de/).

Identifying spacer matches

CRISPR spacer sequences were extracted from the available crenarchaeal thermoneutrophilic genomes: A. pernix K1 (NC_000854), Caldivirga maquilingensis IC-167 (NC_009954), D. kamchatkensis 1221n (NC_011766), Hyperthermus butylicus DSM 5456 (NC_008818), Ignicoccus hospitalis KIN4/I (NC_009776), Nitrosopumilus maritimus SCM1 (NC_010085), Pyrobaculum aerophilum IM2 (NC_003364), Pyrobaculum arsenaticum DSM 13514 (NC_009376), Pyrobaculum calidifontis JCM 11548 (NC_009073), Pyrobaculum islandicum DSM 4184 (NC_008701), Staphylothermus marinus F1 (NC_009033), T. pendens Hrk 5 (NC_008698) and T. neutrophilus V24Sta (NC_010525). They were aligned against HAV1, HAV2, pHA1, pHB1, pHB2a and pHB2b, and published genomes of the viruses PSV (AJ635161), TTSV1 (AY722806) and TTV1 (X14855), and the plasmid pTPEN01 (NC_008696), using an MMX-optimized Smith-Waterman implementation (Saebø et al., 2005). Alignments were performed at both a nucleotide level and an amino acid sequence level by translating the spacers in all six reading frames essentially as described earlier (Shah et al., 2009), where the false positive level was estimated by aligning the spacers against all the above crenarchaeal genomes (minus CRISPR repeat regions) and using this as a negative control.

Acknowledgements

We thank Ariane Bize, Lanming Chen, Hien Phan, John Smyth, Gisle Vestergaard and Kim Brügger for much help in the early stages of this work. The research in Copenhagen was supported by the Natural Science Research Council.

References

Ahn, D.G., Kim, S.I., Rhee, J.K., Kim, K.P., Pan, J.G., and Oh, J.W. (2006) TTSV1, a new virus-like particle isolated from the hyperthermophilic crenarchaeote Thermoproteus tenax. Virology 351: 280–290.
Anderson, I., Rodriguez, J., Susanti, D., Porat, I., Reich, C., Ulrich, L.E., et al. (2008) Genome sequence of Thermofilum pendens reveals an exceptional loss of biosynthetic pathways without genome reduction. J Bacteriol 190: 2957–2965.
Andersson, A.F., and Banfield, J.F. (2008) Virus population dynamics and acquired virus resistance in natural microbial communities. Science 320: 1047–1050.
Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., Romero, D.A., and Horvath, P. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315: 1709–1712.
Bettstetter, M., Peng, X., Garrett, R.A., and Prangishvili, D. (2003) AFV1, a novel virus infecting hyperthermophilic archaea of the genus Acidianus. Virology 315: 68–79.
Bize, A., Peng, X., Prokofeva, M., Maclellan, K., Lucas, S., Forterre, P., et al. (2008) Viruses in acidic geothermal environments of the Kamchatka peninsula. Res Microbiol 159: 358–366.
Diez, B., Anton, J., Guixa-Boixereu, N., Pedros-Alio, C., and Rodriguez-Valera, F. (2000) Pulsed-field gel electrophoresis

of virus assemblages present in a hypersaline environment. Int Microbiol 3: 159–164.
Greve, B., Jensen, S., Brügger, K., Zillig, W., and Garrett, R.A. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1: 231–239.
Greve, B., Jensen, S., Phan, H., Brügger, K., Zillig, W., She, Q., and Garrett, R.A. (2005) Novel plasmids pTAU4, pORA1 and pTIK4 from Sulfolobus neozealandicus. Archaea 1: 319–325.
Häring, M., Peng, X., Brügger, K., Rachel, R., Stetter, K.O., Garrett, R.A., and Prangishvili, D. (2004) Morphology and genome organisation of the virus PSV of the hyperthermophilic archaeal genera Pyrobaculum and Thermoproteus: a novel virus family, the Globuloviridae. Virology 323: 233–242.
Janekovic, D., Wunderl, S., Holz, I., Zillig, W., Gierl, A., and Neumann, H. (1983) TTV1, TTV2 and TTV3, a family of viruses of the extremely thermophilic anaerobic, sulphur reducing, archaebacterium Thermoproteus tenax. Mol Gen Genet 192: 39–45.
Krupovic, M., and Bamford, D.H. (2008) Archaeal proviruses TKV4 and MVV extend the PRD1-adenovirus lineage to the phylum Euryarchaeota. Virology 375: 292–300.
Krupovic, M., Forterre, P., and Bamford, D.H. (2010) Comparative analysis of the mosaic genomes of tailed archaeal viruses and proviruses suggests common themes for virion architecture and assembly with tailed viruses of bacteria. J Mol Biol 397: 144–160.
Lawrence, C.M., Menon, S., Eilers, B.J., Bothner, B., Khayat, R., Douglas, T., and Young, M.J. (2009) Structural and functional studies of archaeal viruses. J Biol Chem 284: 12599–12603.
Lillestøl, R.K., Redder, P., Garrett, R.A., and Brügger, K. (2006) A putative viral defence mechanism in archaeal cells. Archaea 2: 59–72.
Lillestøl, R.K., Shah, S.A., Brügger, K., Redder, P., Phan, H., Christiansen, J., and Garrett, R.A. (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol Microbiol 72: 259–272.
Lipps, G., Weinzierl, A.O., von Scheven, G., Buchen, C., and Cramer, P. (2004) Structure of a bifunctional DNA primase-polymerase. Nat Struct Mol Biol 11: 157–162.
Mochizuki, T., Yoshida, T., Tanaka, R., Forterre, P., Sako, Y., and Prangishvili, D. (2010) Diversity of viruses of the hyperthermophilic archaeal genus Aeropyrum, and isolation of the Aeropyrum pernix bacilliform virus 1, APBV1, the first representative of the family ‘Clavaviridae’. Virology 402: 347–352.
Neumann, H., and Zillig, W. (1990a) Structural variability in the genome of Thermoproteus tenax virus TTV1. Mol Gen Genet 222: 435–437.
Neumann, H., and Zillig, W. (1990b) The TTV1-encoded viral protein TPX: primary structure of the gene and the protein. Nucleic Acids Res 18: 195.
Oren, A., Bratbak, G., and Heldal, M. (1997) Occurrence of virus-like particles in the Dead Sea. Extremophiles 1: 143–149.
Ortmann, A.C., Wiedenheft, B., Douglas, T., and Young, M. (2006) Hot crenarchaeal viruses reveal deep evolutionary connections. Nat Rev Microbiol 4: 520–528.
Peng, X. (2008) Evidence for the horizontal transfer of an integrase gene from a fusellovirus to a pRN-like plasmid within a single strain of Sulfolobus and the implications for plasmid survival. Microbiology 154: 383–391.
Peng, X., Holz, I., Zillig, W., Garrett, R.A., and She, Q. (2000) Evolution of the family of pRN plasmids and their integrase-mediated insertion into the chromosome of the Crenarchaeon Sulfolobus solfataricus. J Mol Biol 303: 449–454.
Peng, X., Kessler, A., Phan, H., Garrett, R.A., and Prangishvili, D. (2004) Multiple variants of the archaeal DNA rudivirus SIRV1 in a single host and a novel mechanism of genome variation. Mol Microbiol 54: 366–375.
Porter, K., Russ, B.E., and Dyall-Smith, M.L. (2007) Virus–host interactions in salt lakes. Curr Opin Microbiol 10: 418–424.
Prangishvili, D., Forterre, P., and Garrett, R.A. (2006a) Viruses of the archaea: a unifying view. Nat Rev Microbiol 4: 837–838.
Prangishvili, D., Garrett, R.A., and Koonin, E.V. (2006b) Evolutionary genomics of archaeal viruses: unique viral genomes in the third domain of life. Virus Res 117: 52–67.
Prangishvili, D., Vestergaard, G., Häring, M., Aramayo, R., Basta, T., Rachel, R., and Garrett, R.A. (2006c) Structural and genomic properties of the hyperthermophilic archaeal virus ATV with an extracellular stage of the reproductive cycle. J Mol Biol 359: 1203–1216.
Rachel, R., Bettstetter, M., Hedlund, B.P., Häring, M., Kessler, A., Stetter, K.O., and Prangishvili, D. (2002) Remarkable morphological diversity of viruses and virus-like particles in terrestrial hot environments. Arch Virol 147: 2419–2429.
Redder, P., Peng, X., Brügger, K., Shah, S.A., Roesch, F., Greve, B., She, Q., Schleper, C., Forterre, P., Garrett, R.A., and Prangishvili, D. (2009) Four newly isolated fuselloviruses from extreme geothermal environments reveal unusual morphologies and a possible interviral recombination mechanism. Environ Microbiol 11: 2849–2862.
Saebø, P.E., Andersen, S.M., Myrseth, J., Laerdahl, J.K., and Rognes, T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res 33: 535–539.
Shah, S.A., Hansen, N.R., and Garrett, R.A. (2009) Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem Soc Trans 37: 23–28.
Snyder, J.C., Spuhler, J., Wiedenheft, B., Roberto, F.F., Douglas, T., and Young, M.J. (2004) Effects of culturing on the population structure of a hyperthermophilic virus. Microbiol Ecol 48: 561–566.
Torarinsson, E., Klenk, H.-P., and Garrett, R.A. (2005) Divergent transcriptional and translational signals in Archaea. Environ Microbiol 7: 47–54.
Van der Oost, J., Jore, M.M., Westra, E.R., Lundgren, M., and Brouns, S.J. (2009) CRISPR-based adaptive and heritable immunity in prokaryotes. Trends Biochem Sci 34: 401–407.
Vestergaard, G., Aramayo, R., Basta, T., Häring, M., Peng, X., Brügger, K., Chen, L., Rachel, R., Boisset, N., Garrett, R.A., and Prangishvili, D. (2008a) Structure of the Acidianus filamentous virus 3 and comparative genomics of related archaeal lipothrixviruses. J Virol 82: 371–381.
Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., Briegel, A., Rachel, R., Garrett, R.A., and Prangishvili, D. (2008b) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J Bacteriol 190: 6837–6845.


￿.￿ ￿￿￿￿￿ ￿

Contribution: substantial

All bioinformatics and the figures including the data behind them were prepared by myself. The manuscript was written by Professor Roger A. Garrett.

Research in Microbiology 162 (2011) 27–38
www.elsevier.com/locate/resmic

CRISPR/Cas and Cmr modules, mobility and evolution of adaptive immune systems

Shiraz A. Shah1, Roger A. Garrett*,1

Archaea Centre, Department of Biology, Copenhagen University, DK2200 Copenhagen N, Denmark

Received 17 May 2010; accepted 22 July 2010
Available online 21 September 2010

Abstract

CRISPR/Cas and CRISPR/Cmr immune machineries of archaea and bacteria provide an adaptive and effective defence mechanism directed specifically against viruses and plasmids. Present data suggest that both CRISPR/Cas and Cmr modules can behave like integral genetic elements. They tend to be located in the more variable regions of chromosomes and are displaced by genome shuffling mechanisms including transposition. CRISPR loci may be broken up and dispersed in chromosomes by transposons with the potential for creating genetic novelty. Both CRISPR/Cas and Cmr modules appear to exchange readily between closely related organisms where they may be subjected to strong selective pressure. It is likely that this process occurs primarily via conjugative plasmids or chromosomal conjugation. It is inferred that interdomain transfer between archaea and bacteria has occurred, albeit very rarely, despite the significant barriers imposed by their differing conjugative, transcriptional and translational mechanisms. There are parallels between the CRISPR crRNAs and eukaryal siRNAs, most notably to germ cell piRNAs which are directed, with the help of effector proteins, to silence or destroy transposons. No homologous proteins are identifiable at a sequence level between eukaryal siRNA proteins and those of archaeal or bacterial CRISPR/Cas and Cmr modules.
© 2010 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved.

Keywords: CRISPR/Cas; CRISPR/Cmr; crRNA; Evolution; Mobile elements; siRNA

* Corresponding author. Tel.: +45 35322010. E-mail address: [email protected] (R.A. Garrett).
1 The two authors contributed equally to the work.
0923-2508/$ - see front matter © 2010 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved. doi:10.1016/j.resmic.2010.09.001

1. Introduction

The CRISPR/Cas and CRISPR/Cmr systems provide the basis for an adaptive and a heritable immune system directed against the DNA and RNA, respectively, of invading elements. The former consists of CRISPR loci and physically linked cassettes of cas genes which together appear to constitute integral genetic modules. The cmr genes of Cmr modules are also clustered and are sometimes linked directly to the CRISPR/Cas modules. The CRISPR/Cas immune system occurs in most archaea and about 70% of these also carry Cmr modules, whereas only about 40% of bacteria contain CRISPR/Cas modules and about 30% of these exhibit Cmr modules. Moreover, the archaeal CRISPR loci consist of clusters of spacer-repeat units and can vary in size from one to more than a hundred spacer-repeat units where each unit is about 60–90 bp with repeats and spacers of, on average, 30 bp and 40 bp, respectively (reviewed in Karginov and Hannon, 2010). The CRISPR loci are preceded by non-protein coding leader regions of about 150–550 bp (Tang et al., 2002; Jansen et al., 2002; Lillestøl et al., 2006, 2009), and they are generally physically linked to a group of cas genes encoding Cas proteins of diverse functions (Jansen et al., 2002; Haft et al., 2005; Makarova et al., 2006).

Critical for the functioning of the immune systems are the spacer sequences which derive from foreign invading elements (Mojica et al., 2005; Pourcel et al., 2005; Bolotin et al., 2005; Barrangou et al., 2007). Whole transcripts are produced from CRISPR loci which initiate within the leader sequence adjacent to the first repeat (Lillestøl et al., 2009), and they are subsequently processed in the repeat regions to yield end-products corresponding to single spacer crRNAs (Tang et al., 2002, 2005; Lillestøl et al., 2006). Regulation of formation of the whole CRISPR transcript is probably required to prevent interference from promoter and terminator regions which are randomly taken up in the spacers (Shah et al., 2009). The processing is effected by specific Cas or Cmr proteins which, at least for the latter, generate two discrete crRNAs each carrying 8 bp of repeat at the 5′-end and lacking 2 nt and 8 nt, from the 3′-end of each spacer (Hale et al., 2009). Combinations of proteins then transport the processed crRNAs to target and inactivate invading genetic elements for both CRISPR/Cas and CRISPR/Cmr systems (Brouns et al., 2008; Hale et al., 2008, 2009; Carte et al., 2008). Base pairing mismatches occurring between the 5′ 8 nt repeat sequence of the crRNA and the sequence adjacent to the targeted protospacer of the invading DNA are essential for subsequent degradation of the latter and for ensuring that the chromosomal CRISPR locus, itself, is not targeted (Marraffini and Sontheimer, 2010).

Cas and Cmr proteins are phylogenetically and functionally very diverse and are involved in at least two mechanistic pathways which target invading genetic elements via the crRNAs. The CRISPR/Cas system specifically targets DNA (Marraffini and Sontheimer, 2008; Shah et al., 2009), while the CRISPR/Cmr system targets RNA (Hale et al., 2009). The two pathways require the products of the cas gene cassette adjoining a CRISPR locus or the products of the Cmr module which is either directly linked to a CRISPR/Cas module or lies separately on the chromosome (Fig. 1) (Jansen et al., 2002; Makarova et al., 2006). Although most bacterial CRISPR/Cas modules are unpaired, different combinations of CRISPR/Cas and Cmr modules, including paired CRISPR loci, are common amongst the crenarchaea (Fig. 1D and E). Phylogenetic studies have demonstrated that homologs of a few Cas proteins occur widely throughout the archaeal and bacterial domains, while others are predominantly archaeal or bacterial in character (Haft et al., 2005; Makarova et al., 2006).

This article will consider the following issues relating to the mobility and evolution of the CRISPR/Cas system, generally using crenarchaeal CRISPR systems as representative examples: (1) Whether CRISPR/Cas modules constitute integral genetic units. (2) Phylogenetic relationships between CRISPR/Cas and Cmr modules. (3) Diversification and degeneration of CRISPR/Cas modules. (4) Mobilisation and loss of CRISPR/Cas modules. (5) Transfer of CRISPR/Cas modules between organisms. (6) Co-evolution of the CRISPR/Cas system in the archaeal and bacterial domains. (7) A possible common ancestry with the diverse eukaryal siRNA systems.

2. Methods

Amino acid sequences of Cas1 proteins were collected from all publicly available archaeal and bacterial genomes by running an in-house-constructed Cas1-specific HMM against NCBI’s “non-redundant” protein database. All sequences were extracted and an all-against-all Smith–Waterman sequence comparison was made using the FASTA package (Pearson, 2000). After taking into account the distribution of the resulting Smith–Waterman scores, all matching sequence pairs were assigned weights between 0 and 1 with 0 corresponding to a Smith–Waterman score of 200 or less, and 1 corresponding to 1200 or more. This was used as an input for Markov clustering (MCL) (Enright et al., 2002) with the default options (inflation factor = 2) as an input for BioLayout (Goldovsky et al., 2005). Repeat sequences were clustered by a similar approach, but using Smith–Waterman DNA sequence alignments. Leader sequences were clustered using the same approach but with an MCL inflation factor of 1.2 due to their very low sequence conservation.

With the exception of the genomes of Sulfolobus islandicus strains HVE10/4 and REY15A and Acidianus brierleyi from our own lab, all other genomes are publicly available with the accession numbers NC_009135, NC_009975, NC_009637, NC_005791, NC_013769, NC_012589, NC_012588, NC_012632, NC_012726, NC_012622, NC_012623, 4023466

Fig. 1. Scheme showing different arrangements of CRISPR/Cas and Cmr modules. A. Typical monomeric CRISPR/Cas structure. B. Linked Cmr and CRISPR/Cas modules. C. Separated Cmr and CRISPR/Cas modules. D. Paired family I CRISPR/Cas modules carrying inverted CRISPR loci. Typical gene contents and order for E. a paired crenarchaeal family I CRISPR/Cas module, and F. a crenarchaeal Cmr module.

(JGI project) CP001800, NC_002754. Dot-plots were constructed using the MUMmer package (Kurtz et al., 2004).

CRISPR clusters were found using publicly available software (Bland et al., 2007) and Cmr modules were found using HMMs constructed in-house. The core genomes of Sulfolobus solfataricus and S. islandicus strains were determined by finding all orthologous genes occurring only once in all the genomes. Orthologs were found by performing an all-against-all sequence similarity search for all the encoded proteins with subsequent clustering using MCL (Enright et al., 2002). A multiple alignment was made of the DNA sequence corresponding to each ortholog (Edgar, 2004). All multiple alignments, with gaps removed, were concatenated and the resulting alignment was used to build a phylogenetic tree (Thompson et al., 1994). The length of each family I leader was determined using sequence alignments of different leaders before constructing the leader tree.

3. Results and discussion

3.1. Do CRISPR/Cas modules constitute integral genetic units?

Several studies have detected a broad phylogenetic correlation between selected Cas proteins and repeat sequences of CRISPR loci, with the reservation that the repeats are of limited and variable size (Haft et al., 2005; Kunin et al., 2007). For the Sulfolobales, phylogenetic analyses of sequences of repeats, leaders and Cas1 proteins demonstrated that the CRISPR/Cas modules could be classified into at least three distinct families (Lillestøl et al., 2009). Here we extend this analysis and present comparative results for the Cas1 protein, the leader and the repeat sequences using unsupervised clustering. MCL classifies nodes into clusters based on pairwise distances to other nodes (Enright et al., 2002). Here, the nodes comprise the sequences of Cas1, the repeat and the leader and the distances correspond to the sequence alignment scores between them. This approach is preferable to the use of phylogenetic trees for the following reasons. Firstly, the problem of delineating boundaries between neighbouring families is determined by the algorithm itself, avoiding the potential error and bias of manual definition. Moreover, more than 1000 Cas1 sequences are available in public sequence databases and they cannot be readily presented in phylogenetic trees, whereas they can be visualised in a two- or three-dimensional space using the BioLayout program (Goldovsky et al., 2005). Finally, leader sequences share significant sequence similarity within, but not across, families such that all leaders cannot be represented in one phylogenetic tree. Thus, MCL clustering is the best approach for automated classification of CRISPR leader sequences, and by using the same method for Cas1 and repeat sequences, potential inconsistencies arising from using different methodologies are avoided.

The results are illustrated in Fig. 2 for the Sulfolobales and they show closely similar clustering patterns for the Cas1, leader and repeat sequences of the CRISPR/Cas families I to IV (Fig. 2B–D), consistent with earlier results (Lillestøl et al., 2009), and they strongly suggest that the four CRISPR/Cas families have evolved independently and that they do indeed constitute discrete genetic modules. The results in Fig. 2A reveal that each of the Sulfolobales families I–IV are components of an earlier defined group of families, CASS1 + 5 + 6 + 7 (Haft et al., 2005; Makarova et al., 2006) that in Fig. 2A can be seen to merge into a superfamily.

For bacteria, a comparative genomic analysis of strains of Streptococcus thermophilus also revealed a putative co-evolution of Cas proteins and CRISPR loci within the CRISPR/Cas modules (Horvath et al., 2008), and a more extensive study of CRISPR loci in 47 genomes of a variety of genera and species of lactic acid bacteria revealed 8 different classes of CRISPR/Cas modules with evidence for a phylogenetic congruence between Cas1 protein sequences, the repeat sequences, and the cas gene content and synteny but, with one partial exception, no phylogenetic link was detected between the leader regions and the rest of the CRISPR/Cas modules (Horvath et al., 2009). Whether the latter reflects a real difference in the significance of the leader between these bacteria and the crenarchaea requires further clarification.

Amongst crenarchaea, there is a preference for paired CRISPR loci which are inverted with respect to one another, generally (see below) resulting in internalised leader regions and some cas genes located between the leaders (Fig. 1D and E) (Lillestøl et al., 2009). Moreover, for the Sulfolobales, at least, the paired modules belong to the same family and share a single set of cas genes. Family I CRISPR/Cas modules are the most common amongst the Sulfolobales and other crenarchaea and they are also the most conserved in structure. The cas genes are partitioned, with one group located between the leaders and another lying externally at one end of the module (Fig. 1D and E). This separation may be functionally significant with the internal cas genes adjacent to both leader regions and encoding proteins involved in processing and insertion of DNA spacer-repeat units, while the external cas genes encode RNA processing and guiding proteins. There are fewer identified examples of paired family II and III CRISPR/Cas modules and they appear to be less conserved in their genetic organisation than the family I modules and, at this stage, it is premature to propose a consensus structure. A similar family-specific cas gene content and synteny has also been observed for CRISPR/Cas modules of lactic acid bacteria (Horvath et al., 2009). Presumably, the pairing of the CRISPR/Cas modules reflects a compromise between limiting the sizes of individual CRISPR loci and avoiding the necessity of producing very long transcripts while using only one set of cas genes. Moreover, if one CRISPR locus becomes inactivated as a result of, for example, mutations at the leader-repeat junction, the other locus will still be active.

3.2. Phylogenetic relationships between CRISPR/Cas and Cmr modules

The Cmr module has been implicated in directing processed crRNAs to target the RNA of invading genetic

Fig. 2. Results of MCL clustering of components of CRISPR/Cas modules visualised using BioLayout (Goldovsky et al., 2005). A. Clustering of all Cas1 proteins found in public databases where 5 large clusters and 11 smaller ones emerge and are colour-coded. Sequences within a given cluster show as little as 25% amino acid sequence identity. Three of the large clusters correspond directly to previously defined families, labelled CASS2 to 4 (Haft et al., 2005; Makarova et al., 2006). B to D. Clustering of Sulfolobales CRISPR/Cas families I, II, III and IV: B - Cas1 proteins; C - leaders, where leaders from the same family share about 70% nucleotide sequence identity and little or no nucleotide sequence conservation occurs between different families; D - repeats, which show about 80% sequence identity within a given family. The results for the four families in B to D show similar patterns. Colour-coding for the Sulfolobales CRISPR/Cas families: I - blue, II - purple, III - yellow, and IV - green. Family IV represents the CRISPR/Cas modules in Metallosphaera sedula and Acidianus brierleyi which were previously unclassified (Lillestøl et al., 2009).
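The weighting and clustering steps behind Fig. 2 can be illustrated with a short Python sketch. It is not the pipeline used for the paper. The Methods give only the two end points of the weight scale (0 at a Smith–Waterman score of 200 or less, 1 at 1200 or more), so the linear ramp in `sw_weight` is an assumption, and `mcl_clusters` is a bare-bones re-implementation of the expansion and inflation steps of Markov clustering rather than the MCL program itself. All function names and the example scores are hypothetical.

```python
def sw_weight(score, low=200.0, high=1200.0):
    """Map a Smith-Waterman score to an edge weight in [0, 1].

    Scores <= low give 0 and scores >= high give 1, as stated in the Methods.
    The linear ramp in between is an assumption made for illustration.
    """
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)


def _normalise_columns(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl_clusters(n, edges, inflation=2.0, iterations=30):
    """Toy Markov clustering of n nodes (simplified, fixed iteration count).

    edges maps (i, j) node pairs to weights; the graph is treated as undirected.
    """
    m = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        m[i][j] = m[j][i] = w
    for i in range(n):
        m[i][i] = 1.0  # self-loops keep every column sum positive
    m = _normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise each entry to a power, then renormalise columns.
        m = [[x ** inflation for x in row] for row in m]
        m = _normalise_columns(m)
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j, x in enumerate(row) if x > 1e-6) for row in m}
    clusters.discard(frozenset())
    return clusters


# Hypothetical scores for five sequences: two tight groups, one weak cross-hit.
scores = {(0, 1): 1500, (0, 2): 1300, (1, 2): 1250, (3, 4): 1400, (2, 3): 150}
edges = {pair: sw_weight(s) for pair, s in scores.items() if sw_weight(s) > 0.0}
clusters = mcl_clusters(5, edges)
```

In this toy input the cross-hit between sequences 2 and 3 scores below 200, so it receives weight 0 and the two groups separate. A lower inflation value, such as the 1.2 used for the leaders, makes the inflation step gentler and therefore tends to give coarser clusters.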

elements, whether RNA genomes, transcripts, or both, remains unclear (Hale et al., 2009). The cmr genes are apparently co-transcribed in a distinct cassette which is sometimes physically linked to the CRISPR/Cas module (Fig. 1). It occurs less widely than CRISPR/Cas modules, and is particularly prevalent in thermophilic archaea and bacteria. Comparison of phylogenetic trees for the CRISPR/Cas and Cmr modules, based on sequences of Cas1 or Cas3 proteins (the former is not present in all CRISPR/Cas modules) and a predicted polymerase, respectively, revealed two major branches for the Cmr modules, carrying distinctive gene syntenies, but showing little congruence with the Cas1/Cas3-based tree (Makarova et al., 2006). This suggests that despite their being interdependent mechanistically and sometimes physically coupled, the DNA- and RNA-directed systems have evolved independently. Both module types tend to be located in variable genomic regions and their positions, and copy numbers, vary even for the closely related Sulfolobus species (see below).

3.3. Diversification and degeneration of CRISPR/Cas modules

CRISPR loci vary considerably in size extending from a single spacer bordered by repeats to a maximum, to date, of 375 spacers (Lillestøl et al., 2006; Grissa et al., 2008). All such CRISPR loci that have been tested, including those lacking leader regions, have been shown to produce transcripts which are processed (Tang et al., 2002, 2005; Brouns et al., 2008; Carte et al., 2008; Lillestøl et al., 2006, 2009). There is evidence from studies of both archaea and bacteria that CRISPR loci commonly undergo deletions without impairing overall CRISPR/Cas functionality, and that the deletions can range in size from single to several repeat-spacer units, presumably resulting from recombination at the identical direct repeats. There is a tendency to lose the central and downstream regions of the CRISPR loci farthest from the leader region, where the earliest spacer inserts are located, and which are likely to be less important for the immune system, on average, than the more recently inserted spacers (Lillestøl et al., 2006, 2009; Tyson and Banfield, 2007; Deveau et al., 2008; Horvath et al., 2008). However, in addition to the spacer-repeat units added at the leader-repeat junction (Pourcel et al., 2005; Lillestøl et al., 2006, 2009), there are a few putative examples of duplications of spacer-repeat units, or small groups thereof, occurring in mycobacteria and methanoarchaea (Van Embden et al., 2000; Lillestøl et al., 2006). Moreover, it has also been claimed, for two out of four derivatives of S. thermophilus strain SMQ-301, that a single new spacer-repeat unit was inserted internally within the CRISPR locus at the exact position where seven spacer-repeat units had been deleted, suggesting that the insertion-deletion events had occurred concurrently (Deveau et al., 2008). A related phenomenon occurs in the CRISPR loci of S. solfataricus strains. Pairwise alignments of CRISPR locus A of strains P1, P2 and 98/2 in Fig. 3 show shared spacers (shaded), as well as different spacers adjoining the leader region and considered to have been added after the strains diverged. Deletions are apparent when pairs of CRISPR locus A are compared, but there is one site in the P1 locus where six spacer-repeat units (a) have been replaced by four (b) from the CRISPR locus B, presumably in a single recombination event (Fig. 3).

Earlier studies suggested that mobile elements or integrative elements rarely target CRISPR/Cas modules in either archaea or bacteria (Van Embden et al., 2000; Haft et al., 2005; Lillestøl et al., 2006). Moreover, in the three closely related strains of S. solfataricus P1, P2 and 98/2, which are rich in active transposable elements and where extensive genomic shuffling has been observed (Brügger et al., 2004; Redder and Garrett, 2006), no IS insertions were detected in their extensive CRISPR loci (350–450 spacer-repeat units) (Fig. 3). Thus, although they do occur occasionally intergenically in the cas and cmr gene clusters, there appears to be a strong selective pressure to maintain the integrity of CRISPR loci in crenarchaea. Nevertheless, recent studies of environmental bacterial samples suggest that transpositions occur commonly in some systems. In a study of two biofilms carrying acidophilic Leptospirillum group II bacteria, for one biofilm about 20% of the partially sequenced CRISPR loci carried IS elements (Tyson and Banfield, 2007) and in a recent study of many lactic acid bacterial strains, several CRISPR loci and cas gene cassettes were found to be interrupted by IS elements, with some flanking either end of a CRISPR locus (Horvath et al., 2009). Therefore, IS elements are likely to generate changes in active CRISPR loci possibly particularly in biofilms, or environments with low virus and/or plasmid levels.

Thus, IS elements, or other transposable elements, may induce shortening and/or degeneration of CRISPR loci by inserting into CRISPR loci and causing transposition of spacer-repeat clusters to other chromosomal sites. Many chromosomes, with or without CRISPR/Cas modules, carry short CRISPR-like clusters lacking associated leader regions and cas genes (Grissa et al., 2008). Their repeats are often phylogenetically divergent from the CRISPR loci in a given genome, or in closely related genomes. Although there is no consensus view as to their origin(s) or function(s), if they are preceded by promoters, their transcripts can, in principle, be processed and activated by Cas and/or Cmr proteins, if present. For example, Sulfolobus conjugative plasmids carrying CRISPR-like loci lack cas genes and leader sequences (She et al., 1998; Greve et al., 2004) but for at least one of them, pKEF9, the repeat cluster is transcribed and the RNA is processed, which indicates that the active crRNAs can be produced intracellularly if a complementary set of cas genes (or cmr genes) is present in the host (Lillestøl et al., 2009). Since three of the six spacers in the pKEF9 repeat cluster have good sequence matches to archaeal fuselloviruses (2) and a rudivirus (1), it was proposed that the genetic elements may also exploit the host’s CRISPR/Cas (or CRISPR/Cmr) immune system, to compete with co-invading foreign elements (Lillestøl et al., 2009). This hypothesis is consistent with the demonstration that infection of an Acidianus strain, carrying the conjugative plasmid pAH1 (lacking a CRISPR locus), with the lipothrixvirus AFV1, led to inhibition of plasmid replication (Basta et al., 2009).

3.4. Mobilisation and loss of CRISPR/Cas modules

Genome analyses of closely related members of the Sulfolobales revealed CRISPR/Cas modules at different positions in genomes which show high levels of gene synteny, raising the question as to whether they have moved within the genome, or been lost and/or gained. There are also differences in the contents of the CRISPR/Cas module families in the sequenced genomes. For example, S. solfataricus carries

family I and II modules while Sulfolobus acidocaldarius carries modules of family II and III (Lillestøl et al., 2009).

Fig. 3. Pairwise comparison of repeat-spacer units of CRISPR locus A of three strains of S. solfataricus P1, P2 and 98/2. Shaded spacer-repeat units which are linked are identical in sequence between pairs of CRISPR loci. Six spacer-repeat units, indicated by a, are deleted from strain P1, while four spacer-repeat units, denoted b, have apparently been acquired from CRISPR locus B (Lillestøl et al., 2009). Leader regions are indicated by L.

To determine whether CRISPR/Cas modules are readily mobilised, we investigated the presence or absence of CRISPR/Cas modules in genomes of pairs of closely related Sulfolobus, showing >99% DNA sequence identity, respectively (Fig. 4). The Sulfolobus strains (She et al., 2001a; Reno et al., 2009) exhibit differences in the numbers of modules, for example, for the pair of closely related S. islandicus strains HVE10/4 and REY15A; the former carries two CRISPR/Cas modules and one Cmr module, whereas the latter has one CRISPR/Cas module and two Cmr modules. CRISPR loci of the five pairs of S. islandicus strains, in contrast to those of the S. solfataricus strains (Fig. 3), share no common spacers. However each strain carries one paired family I CRISPR/Cas module so that it was possible to test whether the module had persisted since the 10 strains diverged. The genomic position of the module was compared between each strain pair and was shown not to be conserved in position for 4 out of 5 pairs (Fig. 4). However, the displacements, even for the most closely related strains, could be attributed to the genomic region carrying the module being variable and having undergone complex rearrangements, rather than the module itself having been mobilised. At present, there are insufficient closely related genomes available, for archaea and bacteria, which carry CRISPR/Cas modules to test for the generality of these results, although in an earlier study of two Thermotoga genomes, CRISPR loci were found to be located close to variable sites where chromosomal inversions had occurred (DeBoy et al., 2006).

There are examples of CRISPR/Cas modules being lost from genomes. For example, a variant strain of S. solfataricus P2 (P2A) was characterised that had lost four of the six CRISPR/Cas modules (A, B, C and D) which were physically linked, in total 124 kb, apparently via a single recombination event between two bordering IS elements (Redder and Garrett, 2006), and S. solfataricus 98/2 lacks two whole clusters (C and F) (Lillestøl et al., 2009). Bordering IS elements also have the potential to generate transposons carrying whole CRISPR/Cas or Cmr modules and, rarely, paired family II CRISPR/Cas modules are bordered by identical inverted leaders (e.g., CRISPR loci A and B of S. solfataricus) which could recombine, leading to loss of the whole module. Examples of closely related strains apparently losing CRISPR/Cas modules have also been reported for some bacteria (e.g., Godde and Bickerton, 2006; Horvath et al., 2008).

For S. solfataricus P2A, loss of CRISPR/Cas modules was attributed to its being a laboratory strain where the CRISPR/Cas immune system had become an unnecessary burden on the cell’s energy resources in the absence of invading genetic elements. Possibly in niches relatively poor in viruses and plasmids, including numerous bacterial endosymbionts, there

is a tendency to offload the CRISPR/Cas system. For example, many human/animal pathogens including Borrelia, Brucella, Buchnera, Burkholderia, Chlamydia and Rickettsia lack CRISPR loci while others, including Pseudomonas strains and Staphylococcus aureus, either lack CRISPR loci or carry apparently degenerate copies. This may partly explain why about 60% of bacteria lack the CRISPR/Cas and CRISPR/Cmr immune systems (Grissa et al., 2008; Mojica et al., 2009).

Fig. 4. Dot-plots showing the degree of variability in gene synteny at the genomic sites of the CRISPR loci for closely related pairs of Sulfolobus strains (A–E). At the top and right sides of each plot, I, II, III indicates the position of a CRISPR/Cas family and C denotes a Cmr module. For the Sulfolobus strains, CRISPR/Cas modules and Cmr modules are invariably located within an approximately 0.75 Mb variable region containing many IS elements. In general, the gene synteny bordering the modules, and the genomic locations of the modules, have changed, possibly due to transpositional activity.

3.5. Transfer of CRISPR/Cas modules between organisms

Various lines of evidence suggest that CRISPR loci have been transferred between organisms. One example is the variety and combinations of different families of CRISPR/Cas modules that occur in closely related crenarchaeal genomes, with a similar pattern for the lactic acid bacterial genomes (Horvath et al., 2009; Lillestøl et al., 2009; Shah et al., 2009). This underlines that exchange does occur between closely related organisms. Other evidence derives from an analysis of the euryarchaeon Pyrococcus furiosus, where a 155 kb fragment bordered by a CRISPR locus and a repeat shows significantly different properties of G + C content, third codon position and codon usage from the rest of the genome (Portillo and Gonzalez, 2009). Similarly, the lactic acid bacterium Bifidobacterium adolescentis was shown to carry a cas gene cassette with a much lower G + C content (47%) than the average chromosomal G + C content (59.2%) (Horvath et al., 2009).

In order to examine the degree to which CRISPR/Cas modules are subject to structural changes, we examined paired family I CRISPR/Cas modules in several closely related Sulfolobus strains. Phylogenetic trees were constructed for the external and internal cas gene cassettes and the leader region (Fig. 1E) and they were compared with a tree of the core genomes (Fig. 5A). The tree of the external cas gene cassette is similar to the core genome tree, suggesting that these genes were retained in the genome after divergence of the strains (Fig. 5C). However, the trees for the internal cas gene cassette located between the two leaders (Fig. 5D) and the leader regions (Fig. 5B), match one another fairly closely, and they also match a tree derived from the repeat sequences (Lillestøl et al., 2009). This indicates that the external cas genes, putatively involved in RNA processing and crRNA mobility, have been retained within the strains, whereas the internal cas gene cassettes, which are functionally implicated in spacer addition at the leader-repeat junction, seem to co-evolve, and be mobilised with, the CRISPR loci.

Mechanisms of transfer of CRISPR/Cas modules are less clear and may be diverse. CRISPR/Cas loci can vary in size from about 7 kb for a cas gene cassette, a leader region and a small CRISPR locus, to 25 kb or more for the paired family I crenarchaeal CRISPR/Cas modules. Indirect evidence for the transfer of CRISPR/Cas modules on conjugative plasmids arose from the observation that a few bacterial conjugative plasmids from Thermus thermophilus, Synechocystis and Shewanella, carried CRISPR loci, sometimes associated with a few cas genes (Godde and Bickerton, 2006) and, moreover, small CRISPR loci have been detected in two crenarchaeal conjugative plasmids (She et al., 1998; Peng et al., 2003; Greve et al., 2004). For the latter, however, no physical proximity of integrated conjugative plasmids and CRISPR loci occurs within Sulfolobus chromosomes (Chen et al., 2005; Kawarabayashi et al., 2001). To date, CRISPR loci have not been detected in viral genomes, although they do occur within prophages of the human pathogen Clostridium difficile (Sebaihia et al., 2006).

At least for the paired crenarchaeal CRISPR/Cas modules, they were considered to be too large to be borne on extrachromosomal elements (Lillestøl et al., 2009). Another more likely mechanism for transferring such large CRISPR/Cas modules between closely related organisms is via chromosomal conjugation. The archaea-specific integration mechanism, generating a partitioned integrase gene, provides a mechanism favouring encapturing genetic elements in chromosomes (Muskhelishvili et al., 1993; She et al., 2001b) and some Sulfolobus species that carry encaptured integrated conjugative plasmids are also capable of conjugating their chromosomal DNA (Aagaard et al., 1995; Grogan, 1996). Possibly unknown transmission mechanisms may operate, for example, within biofilms.

3.6. Co-evolution of the CRISPR/Cas system in the archaeal and bacterial domains

Ever since the earliest studies on the CRISPR/Cas system, the prevailing view has been that the archaeal and bacterial systems are closely related. This view was underpinned by the similar ordering of repeat-spacer units in the CRISPR loci and by extensive comparative sequence studies of selected Cas proteins (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006). Moreover, it has been further reinforced by the mechanism of elongation of CRISPR loci at the leader-repeat junction as well as by processing and maturation mechanisms of crRNAs in both domains (Tang et al., 2002, 2005; Brouns et al., 2008; Hale et al., 2008, 2009).

Nevertheless, there are distinctive features of the two systems. CRISPR loci are much more common amongst archaea and tend to be larger, more complex and more labile (Lillestøl et al., 2006; Grissa et al., 2008). In addition, most repeat sequences of bacterial CRISPR loci carry inverted repeat motifs which can generate hairpin structures in transcripts; these are less common amongst archaeal repeats which, in turn, suggests that different RNA processing signals occur within repeat regions of the transcripts (Lillestøl et al., 2006; Kunin et al., 2007). Moreover, phylogenetic relationships based on Smith–Waterman alignments show that most families of archaeal and bacterial repeat sequences exhibit minimal overlap (Kunin et al., 2007). A similar pattern arises from sequence alignments of Cas proteins where phylogenetic trees of Cas proteins show many archaeal genes clustering in separate groups (Fig. 2A) (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006). In addition, the average synteny of the cas and cmr genes is quite conserved within, but not between, major phyla (Haft et al., 2005). There

is also a CRISPR repeat binding protein, of elusive function, that has only been detected amongst the crenarchaea (Peng et al., 2003). Other mechanistic differences may surface as the systems are studied more widely and in more depth. Importantly, however, crenarchaeal viruses have radically different virus–host relationships from those of bacteria and eukarya (Prangishvili et al., 2006; Bize et al., 2009). Consistent with this, there are putative archaeal virus-specific anti-CRISPR systems (Peng et al., 2004; Vestergaard et al., 2008; Garrett et al., 2010) and bacteria-specific CRISPR regulating systems (Pühl et al., 2010). Therefore, it is likely that the CRISPR/Cas and Cmr systems have maintained and/or undergone domain-specific adaptations during evolution.

Fig. 5. Phylogenetic trees of S. solfataricus and S. islandicus strains based on: (A) nucleotide sequence alignments of core genomes of the host organisms, (B) leader sequences, (C) concatenated external cas genes, and (D) concatenated internal cas genes for paired family I CRISPR loci. Only bootstrap values less than 100% are given. All 12 strains were too closely related to be distinguished on the basis of 16S rDNA sequences. For S. islandicus strains LS, LD, YG, YN and M14 (Reno et al., 2009), it is evident that since they diverged from their closest relative, identical changes have occurred in both copies of the leader, possibly due to the whole CRISPR/Cas module, or a part of it, having been replaced. In contrast, the similarity of trees A and C indicates that the external cas cassette has resided on the genome since all strains diverged. The internal cas cassette tree (D) follows that of the leader, consistent with tight functional coupling.

Assuming that a CRISPR/Cas-like system evolved prior to the separation of the archaeal and bacterial lineages, at a time when one assumes the activity of exchange of genetic elements was rife, we are left with two main scenarios for their subsequent development: (1) that the systems have remained relatively conserved, and separated, and have gradually developed specific archaeal, or bacterial, characteristics; or (2) there has been periodic interdomain exchange, and thereby co-evolution of the archaeal and bacterial systems.

Clearly, crossing domain boundaries would be a very complex process given the basic differences between archaea and bacteria in their transcriptional initiation, elongation and termination mechanisms, and their translational initiation mechanisms (Torarinsson et al., 2005; Santangelo et al., 2009) and would be very unlikely to occur for modern cells. Transfer by conjugation would also be unlikely given the differing conjugative systems and the different membrane and cell wall structures of archaea and bacteria (Greve et al., 2004; Veith et al., 2009). For the CRISPR/Cas and Cmr modules, interdomain transfer would seriously compromise both expression of the numerous essential cas and cmr genes as well as transcription of the CRISPR loci. In this context, the influential claim of the large uptake of functioning archaeal genes in the genome of the bacterium Thermotoga maritima (24% of the total including a CRISPR locus) (Nelson et al., 1999) was always highly controversial, not least because it would have required the wholesale reprogramming of a large part of the chimeric genome for transcriptional and translational signals. A recent reevaluation of this genome, together with those of four other members of the Thermotogales, has provided a much more nuanced and cautious view of the phylogenetic origins of these bacteria (Zhaxybayeva et al., 2009), thereby underlining the perils of inferring phylogeny from BLAST sequence searches. On the other hand, co-evolution of the archaeal and bacterial CRISPR/Cas systems would only require cross-domain events to succeed very rarely, after which the transferred system could be under strong selective pressure. Some limited interdomain transfer would be consistent with the phylogenetic trees produced for Cas1 or the Cas3 proteins of the CRISPR/Cas modules and Cmr2 of the Cmr module (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006). Archaea-specific Cas proteins (Haft et al., 2005) may be associated with CRISPR/Cas or Cmr systems that have evolved more independently in environments of high temperature, extremes of pH or hypersaline conditions where bacterial levels tend to be relatively low, and gradually become functionally incompatible with those living in less extreme, bacteria-rich environments where limited genetic exchange between archaea and bacteria is more likely.

3.7. A common ancestry with the diverse eukaryal siRNA systems?

Diverse small interference RNA systems (siRNA) are widespread in eukarya. Thus, in plants, small RNAs are important for antiviral defence and regulation of transposons and similar functions are common amongst invertebrates (Hannon, 2002; Jinek and Doudna, 2009). Moreover, they have been implicated in controlling repeat and transposon contents of somatic nuclei in protozoa (Mochizuki and Gorovsky, 2004). Although some mechanisms are confined to certain eukaryal lineages, they all essentially provide a mechanism for discriminating and targeting “foreign” genetic elements or transposons. Moreover, there are broad mechanistic similarities between the eukaryal siRNA systems and the DNA- and RNA-targeting CRISPR systems. They all have to discriminate foreign DNA from self-DNA, and target nucleic acids which both show little sequence similarity and can undergo continual sequence change.

There is a limited parallel between the CRISPR/Cmr RNA-targeting and eukaryal antiviral systems. The latter cut and process invading dsRNA viruses into small 21–22 bp dsRNAs by an endonuclease (Dicer), and these are converted into ssRNAs by the Argonaute protein–RISC complex. The protein–RNA complex locates and anneals to a viral mRNA carrying a complementary sequence which is then inactivated by another endonuclease (Slicer). However, the initial processing step involving the Dicer endonuclease seems to be quite different in the CRISPR/Cas system.

The closest parallel to the crRNAs and CRISPR loci amongst the eukaryal siRNA systems are the Argonaute Piwi-interacting RNAs (piRNAs) processed from piRNA cluster transcripts which also do not require a Dicer-like endonuclease (Lillestøl et al., 2009; Karginov and Hannon, 2010). This eukaryal system has been studied primarily in insects, fish and mammals and strong evidence has been provided for its involvement in maintaining germline integrity and development (Aravin et al., 2008; Klattenhoff and Theurkauf, 2008). The piRNA clusters are rich in transposons and repeat-sequence elements and occur at specific chromosomal sites, as for the CRISPR loci. The piRNA clusters increase their informational capacity by the insertion of transposon sequences which provide novel sequence content and become fixed in the piRNA clusters by selection. Thus, continual expansion of piRNA clusters occurs, as for CRISPR loci, but the process is passive rather than directed. Moreover, as for the CRISPR/Cas system, the newly incorporated DNA derives exclusively from genetic elements that are to be targeted. Both piRNA clusters and CRISPR loci yield large transcripts prior to processing into smaller RNAs. The processed piRNAs are 24–30 nt in length while the crRNAs lie in the range 39–45 nt. piRNAs complex with the Argonaute Piwi/RISC protein complex, similarly to crRNAs assembling in Cas or Cmr protein complexes, and they target and control mobile endogenous genetic elements primarily in germ cells. To date, piRNA complexes have been exclusively associated with targeting RNAs but this may reflect the fact that retrotransposons predominate in those germ cells under study.

No homologous proteins have been detected from sequence analyses between proteins of the eukaryal siRNA systems and those of the CRISPR system, although similarities may appear at a tertiary structural level. Moreover, despite Argonaute Piwi-like domain proteins occurring in many archaea and bacteria (Cerutti et al., 2000), they have not been implicated in crRNA-targeting. There is also very limited evidence for a functional targeting overlap between the two systems. A few sequence matches have been observed between archaeal and bacterial CRISPR spacers and transposons, consistent with the CRISPR/Cas system targeting mobile elements (Lillestøl et al., 2006; Held and Whitaker, 2009; Mojica et al., 2009; Shah et al., 2009). However, those reported have generally been carried on virus or plasmid genomes including, for example, spacer matches to each of the four transposase genes carried by the bicaudavirus ATV (Shah et al., 2009), but these transposase genes/IS elements are presumably indistinguishable from any other viral/plasmid genomic target if they carry appropriate sequence motifs adjacent to protospacer sites. Moreover, in the archaeon S. solfataricus P2, which carries about 350 putative mobile elements (Brügger et al., 2002), there is evidence that chromosomal transpositional activity is regulated, at least partly, by antisense RNAs (Tang et al.,
2005), and very few close sequence matches were found to any of the 417 CRISPR spacers (Lillestøl et al., 2006; Shah et al., 2009). Finally, the piRNA system, like the CRISPR/Cas and CRISPR/Cmr systems, may be very ancient. Evolution of genomic parasites occurred concurrently with the emergence of self-replicating genomes. Thus, the development of adaptive and heritable systems would be important for maintaining fitness.

4. Conclusion

The CRISPR/Cas and CRISPR/Cmr immune machineries provide an effective defence mechanism in most archaea and some bacteria. The system is dynamic and hereditable, although the benefit for the cell in evolutionary terms is transitional because DNA from extrachromosomal elements taken up as spacers in CRISPR loci has a rapid turnover and is lost again via recombination at repeats and/or transpositional events.

Current evidence suggests that CRISPR/Cas and Cmr modules behave like integral genetic elements. They tend to be located in the most variable regions of chromosomes and are frequently displaced as a result of genome shuffling, including possibly transposition of whole modules. CRISPR loci may be broken up and dispersed in chromosomes with the potential for creating genetic novelty. Small leaderless CRISPR-like loci are commonly found in chromosomes and in plasmids, and some can be transcribed, but it remains unclear whether they derive from CRISPR loci or whether they have other origins and/or other functions. The CRISPR/Cas and Cmr modules appear to exchange readily between closely related organisms where they may be subjected to strong selective pressure. It is likely that this can occur via conjugative plasmids or chromosomal conjugation. While universal phylogenetic trees for Cas1/Cas3 proteins of the CRISPR/Cas module and Cmr2 of the Cmr module suggest that interdomain transfers between archaea and bacteria have occurred, the relatively large number of archaea-specific Cas/Cmr proteins suggests that these may have been very rare events, consistent with the incompatibility of the transcription, translation and conjugative systems.

There are parallels to the eukaryal siRNAs, most notably for the germ cell piRNAs, which are also directed by effector proteins to silence or destroy invading foreign DNA and transposons. While some common effector proteins are utilized in different eukaryal siRNA systems, no homologous proteins are identifiable between the eukaryal siRNA proteins and those of the archaeal and bacterial CRISPR/Cas and Cmr modules. Possibly very distant phylogenetic relationships will appear as more crystal structures of the siRNA and crRNA effector proteins are determined.

Acknowledgements

Research at the Archaea Centre was supported by grants from the Danish Natural Science Research Council and the Danish Foundation for Basic Research. We appreciate helpful discussions with Qunxin She, Soley Gudbergsdottir, Ling Deng, Guo Li and Xu Peng.

References

Aagaard, C., Dalgaard, J., Garrett, R.A., 1995. Inter-cellular mobility and homing of an archaeal rDNA intron confers selective advantage over intron-cells of Sulfolobus acidocaldarius. Proc. Natl. Acad. Sci. U.S.A. 92, 12285–12289.
Aravin, A.A., Hannon, G.J., Brennecke, J., 2008. The Piwi-piRNA pathway provides an adaptive defense in the transposon arms race. Science 318, 761–764.
Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., Romero, D.A., Horvath, P., 2007. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712.
Basta, T., Smyth, J., Forterre, P., Prangishvili, D., Peng, X., 2009. Novel archaeal plasmid pAH1 and its interaction with the lipothrixvirus AFV1. Mol. Microbiol. 71, 23–34.
Bize, A., Karlsson, E.A., Ekefjärd, K., Quax, T.E., Pina, M., Prevost, M.C., Forterre, P., Tenaillon, O., Bernander, R., Prangishvili, D., 2009. A unique virus release mechanism in the Archaea. Proc. Natl. Acad. Sci. U.S.A. 106, 11306–11311.
Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C., Hugenholtz, P., 2007. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinform. 8, 209.
Bolotin, A., Quinquis, B., Sorokin, A., Ehrlich, S.D., 2005. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151, 2551–2561.
Brouns, S.J., Jore, M.M., Lundgren, M., Westra, E.R., Slijkhuis, R.J., Snijders, A.P., Dickman, M.J., Makarova, K.S., Koonin, E.V., van der Oost, J., 2008. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, 960–964.
Brügger, K., Redder, P., She, Q., Confalonieri, F., Zivanovic, Y., Garrett, R.A., 2002. Mobile elements in archaeal genomes. FEMS Microbiol. Lett. 206, 131–141.
Brügger, K., Torarinsson, E., Chen, L., Garrett, R.A., 2004. Shuffling of Sulfolobus genomes by autonomous and non-autonomous mobile elements. Biochem. Soc. Trans. 32, 179–183.
Carte, J., Wang, R., Li, H., Terns, R.M., Terns, M.P., 2008. Cas6 is an endoribonuclease that generates guide RNAs for invader defense in prokaryotes. Genes Dev. 22, 3489–3496.
Cerutti, L., Mian, N., Bateman, A., 2000. Domains in gene silencing and cell differentiation proteins: the novel PAZ domain and redefinition of the Piwi domain. Trends Biochem. Sci. 25, 481–482.
Chen, L., Brügger, K., Skovgaard, M., Redder, P., She, Q., Torarinsson, E., Greve, B., Awayez, M., Zibat, A., Klenk, H.F., Garrett, R.A., 2005. The genome of Sulfolobus acidocaldarius, a model organism of the Crenarchaeota. J. Bacteriol. 187, 4992–4999.
DeBoy, R.T., Mongodin, E.F., Emerson, J.B., Nelson, K.E., 2006. Chromosome evolution in the Thermotogales: large-scale inversions and strain diversification of CRISPR sequences. J. Bacteriol. 188, 2364–2374.
Deveau, H., Barrangou, R., Garneau, J.E., Labonté, J., Fremaux, C., Boyaval, P., Romero, D.A., Horvath, P., Moineau, S., 2008. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, 1390–1400.
Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797.
Enright, A.J., Van Dongen, S., Ouzounis, C.A., 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584.
Garrett, R.A., Prangishvili, D., Shah, S.A., Reuter, M., Stetter, K., Peng, X., 2010. Metagenomic analyses of novel viruses, plasmids, and their variants, from an environmental sample of hyperthermophilic neutrophiles cultured in a bioreactor. Environ. Microbiol., doi:10.1111/j.1462-2920.2010.02266.x.
Godde, J.S., Bickerton, A., 2006. The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J. Mol. Evol. 62, 718–729.

Goldovsky, L., Cases, I., Enright, A.J., Ouzounis, C.A., 2005. BioLayout (Java): versatile network visualisation of structural and functional relationships. Appl. Bioinform. 4, 71–74.
Greve, B., Jensen, S., Brügger, K., Zillig, W., Garrett, R.A., 2004. Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1, 231–239.
Grissa, I., Vergnaud, G., Pourcel, C., 2008. CRISPRcompar: a website to compare clustered regularly interspaced short palindromic repeats. Nucleic Acids Res. 36, 145–148.
Grogan, D.W., 1996. Exchange of genetic markers at extremely high temperatures in the archaeon Sulfolobus acidocaldarius. J. Bacteriol. 178, 3207–3211.
Haft, D.H., Selengut, J., Mongodin, E.F., Nelson, K.E., 2005. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput. Biol. 1, 474–483.
Hale, C., Kleppe, K., Terns, R.M., Terns, M.P., 2008. Prokaryotic silencing (psi)RNAs in Pyrococcus furiosus. RNA 14, 1–8.
Hale, C.R., Zhao, P., Olson, S., Duff, M.O., Graveley, B.R., Wells, L., Terns, R.M., Terns, M.P., 2009. RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell 139, 945–956.
Hannon, G.J., 2002. RNA interference. Nature 418, 244–251.
Held, N.L., Whitaker, R.J., 2009. Viral biogeography revealed by signatures in Sulfolobus islandicus genomes. Environ. Microbiol. 11, 457–466.
Horvath, P., Romero, D.A., Coûté-Monvoisin, A.-C., Richards, M., Deveau, H., Moineau, S., Boyaval, P., Fremaux, C., Barrangou, R., 2008. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J. Bacteriol. 190, 1401–1412.
Horvath, P., Coûté-Monvoisin, A.-C., Romero, D.A., Boyaval, P., Fremaux, C., Barrangou, R., 2009. Comparative analysis of CRISPR loci in lactic acid bacteria genomes. Int. J. Food Microbiol. 131, 62–70.
Jansen, R., Embden, J.D., Gaastra, W., Schouls, L.M., 2002. Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol. 43, 1565–1575.
Jinek, M., Doudna, J.A., 2009. A three dimensional view of the molecular machinery of RNA interference. Nature 457, 405–412.
Karginov, F.V., Hannon, G.J., 2010. The CRISPR system: small RNA-guided defense in bacteria and archaea. Mol. Cell 37, 7–19.
Kawarabayashi, Y., Hino, Y., Horikawa, H., Jin-no, K., Takahashi, M., Sekine, M., Baba, S., Ankai, A., Kosugi, H., Hosoyama, A., Fukui, S., Nagai, Y., Nishijima, K., Otsuka, R., Nakazawa, H., Takamiya, M., Kato, Y., Yoshizawa, T., Tanaka, T., Kudoh, Y., Yamazaki, J., Kushida, N., Oguchi, A., Aoki, K., Masuda, S., Yanagii, M., Nishimura, M., Yamagishi, A., Oshima, T., Kikuchi, H., 2001. Complete genome sequence of an aerobic thermoacidophilic crenarchaeon, Sulfolobus tokodaii strain 7. DNA Res. 8, 123–140.
Klattenhoff, C., Theurkauf, W., 2008. Biogenesis and germline functions of piRNAs. Development 135, 3–9.
Kunin, V., Sorek, R., Hugenholtz, P., 2007. Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol. 8, R611–R617.
Kurtz, S., Phillippy, A., Delcher, A.L., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.L., 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12.
Lillestøl, R.K., Redder, P., Garrett, R.A., Brügger, K., 2006. A putative viral defence mechanism in archaeal cells. Archaea 2, 59–72.
Lillestøl, R.K., Shah, S.A., Brügger, K., Redder, P., Phan, H., Christiansen, J., Garrett, R.A., 2009. CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol. Microbiol. 72, 259–272.
Makarova, K.S., Grishin, N.V., Shabalina, S.A., Wolf, Y.I., Koonin, E.V., 2006. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1, 7.
Marraffini, L.A., Sontheimer, E.J., 2008. CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322, 1843–1845.
Marraffini, L.A., Sontheimer, E.J., 2010. Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, 568–571.
Mochizuki, K., Gorovsky, M.A., 2004. Small RNAs in genome rearrangements in Tetrahymena. Curr. Opin. Genet. Dev. 14, 181–187.
Mojica, F.J., Diez-Villasenor, C., Garcia-Martinez, J., Soria, E., 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J. Mol. Evol. 60, 174–182.
Mojica, F.J., Diez-Villasenor, C., Garcia-Martinez, J., Almendros, C., 2009. Short motif sequences determine the targets of the prokaryotic CRISPR system. Microbiology 155, 733–740.
Muskhelishvili, G., Palm, P., Zillig, W., 1993. SSV1-encoded site-specific recombination system in Sulfolobus shibatae. Mol. Gen. Genet. 237, 334–342.
Nelson, K.E., Clayton, E., Gill, S.R., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson, J.D., Nelson, W.C., Ketchum, K.A., et al., 1999. Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329.
Pearson, W.R., 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185–219.
Peng, X., Brügger, K., Shen, B., Chen, L., She, Q., Garrett, R.A., 2003. Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in Sulfolobus genomes. J. Bacteriol. 185, 2410–2417.
Peng, X., Kessler, A., Phan, H., Garrett, R.A., Prangishvili, D., 2004. Multiple variants of the archaeal DNA rudivirus SIRV1 in a single host and a novel mechanism of genomic variation. Mol. Microbiol. 54, 366–375.
Portillo, M.C., Gonzalez, J.M., 2009. CRISPR elements in the thermococcales: evidence for associated horizontal gene transfer in Pyrococcus furiosus. J. Appl. Genet. 50, 421–430.
Pourcel, C., Salvignol, G., Vergnaud, G., 2005. CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151, 653–663.
Prangishvili, D., Forterre, P., Garrett, R.A., 2006. Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 11, 837–848.
Pühl, Ü., Wurm, R., Arslan, Z., Geissen, R., Hofmann, N., Wagner, R., 2010. Identification and characterisation of E. coli CRISPR-cas promoters and their silencing by H-NS. Mol. Microbiol. 75, 1495–1512.
Redder, P., Garrett, R.A., 2006. Mutations and rearrangements in the genome of Sulfolobus solfataricus P2. J. Bacteriol. 188, 4198–4206.
Reno, M.L., Held, N.L., Fields, C.J., Burke, P.V., Whitaker, R.J., 2009. Biogeography of the Sulfolobus islandicus pan-genome. Proc. Natl. Acad. Sci. U.S.A. 106, 8605–8610.
Santangelo, T.J., Cubonová, L., Skinner, K.M., Reeve, J.N., 2009. Archaeal intrinsic transcription termination in vivo. J. Bacteriol. 191, 7102–7108.
Sebaihia, M., Wren, B.W., Mullany, P., Fairweather, N.F., Minton, N., Stabler, R., Thomson, N.R., Roberts, A.P., Cerdeño-Tárraga, A.M., Wang, H., et al., 2006. The multidrug resistant human pathogen Clostridium difficile has a highly mobile mosaic genome. Nat. Genet. 38, 779–786.
Shah, S.A., Hansen, N.R., Garrett, R.A., 2009. Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem. Soc. Trans. 37, 23–28.
She, Q., Phan, H., Garrett, R.A., Albers, S.-V., Stedman, K.M., Zillig, W., 1998. Genetic profile of pNOB8 from Sulfolobus: the first conjugative plasmid from an archaeon. Extremophiles 2, 417–425.
She, Q., Singh, R.K., Confalonieri, F., Zivanovic, Y., Gordon, P., Allard, G., Awayez, M.J., Chan-Weiher, C.C., Clausen, I.G., Curtis, B.A., et al., 2001a. The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc. Natl. Acad. Sci. U.S.A. 98, 7835–7840.
She, Q., Peng, X., Zillig, W., Garrett, R.A., 2001b. Gene capture events in archaeal chromosomes. Nature 409, 478.
Tang, T.-H., Bachellerie, J.-P., Rozhdestvensky, T., Bortolin, M.-L., Huber, H., Drungowski, M., Elge, T., Brosius, J., Hüttenhofer, A., 2002. Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl. Acad. Sci. U.S.A. 99, 7536–7541.
Tang, T.-H., Polacek, N., Zywicki, M., Huber, H., Brügger, K., Garrett, R., Bachellerie, J.P., Hüttenhofer, A., et al., 2005. Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55, 469–481.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
Torarinsson, E., Klenk, H.P., Garrett, R.A., 2005. Divergent transcriptional and translational signals in Archaea. Environ. Microbiol. 7, 47–54.
Tyson, G.W., Banfield, J.F., 2007. Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environ. Microbiol. 10, 200–208.
Van Embden, J.D.A., Van Gorkom, T., Kremer, K., Jansen, R., Van Der Zeijst, B.A.M., Schouls, L.M., 2000. Genetic variation and evolutionary origin of the direct repeat locus of Mycobacterium tuberculosis complex bacteria. J. Bacteriol. 182, 2393–2401.
Veith, A., Klingl, A., Zolghadr, B., Lauber, K., Mentele, R., Lottspeich, F., Rachel, R., Albers, S.V., Kletzin, A., 2009. Acidianus, Sulfolobus and Metallosphaera surface layers: structure, composition and gene expression. Mol. Microbiol. 73, 58–72.
Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., Briegel, A., Rachel, R., Garrett, R.A., Prangishvili, D., 2008. SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J. Bacteriol. 190, 6837–6845.
Zhaxybayeva, O., Swithers, K.S., Lapierre, P., Fournier, G.P., Bickhart, D.M., DeBoy, R.T., Nelson, K.E., Nesbø, C.L., Doolittle, W.F., Gogarten, J.P., Noll, K.M., 2009. On the chimeric nature, thermophilic origin, and phylogenetic placement of the Thermotogales. Proc. Natl. Acad. Sci. U.S.A. 106, 5865–5870.
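
The compositional evidence for horizontal transfer discussed in the review above (a cas gene cassette at 47% G+C against a chromosomal average of 59.2%, and deviating third-codon-position composition in the P. furiosus fragment) rests on a simple calculation. The following is a minimal sketch in Python, assuming plain nucleotide strings as input; the function names are illustrative and are not taken from the cited studies.

```python
def gc_content(seq: str) -> float:
    """Return the G+C percentage of a nucleotide sequence.

    Symbols other than A, C, G and T (for example N) are ignored.
    An empty or fully ambiguous sequence yields 0.0.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def gc3_content(cds: str) -> float:
    """G+C percentage at third codon positions of an in-frame coding sequence."""
    return gc_content(cds[2::3])


def gc_deviation(region: str, genome: str) -> float:
    """Difference in G+C percentage points between a region and the whole genome.

    A strongly negative or positive value flags the region as compositionally
    atypical, which is one (indirect) indicator of recent acquisition.
    """
    return gc_content(region) - gc_content(genome)
```

A deviation of roughly -12 percentage points, as in the B. adolescentis example, would be reported by `gc_deviation(cassette, chromosome)`; whether such a value is significant still has to be judged against the variation seen across the rest of the genome.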

All CRISPR related bioinformatics, including Figure 3, was prepared by myself. I also contributed to other parts of the paper, including Figures 1 and 2 and Table 1, as well as the work related to toxin/anti-toxin systems.
Contribution: substantial

JOURNAL OF BACTERIOLOGY, Apr. 2011, p. 1672–1680, Vol. 193, No. 7
0021-9193/11/$12.00 doi:10.1128/JB.01487-10
Copyright © 2011, American Society for Microbiology. All Rights Reserved.

Genome Analyses of Icelandic Strains of Sulfolobus islandicus, Model Organisms for Genetic and Virus-Host Interaction Studies▿

Li Guo,1† Kim Brügger,2† Chao Liu,2† Shiraz A. Shah,2† Huajun Zheng,3 Yongqiang Zhu,3 Shengyue Wang,3 Reidun K. Lillestøl,2 Lanming Chen,2 Jeremy Frank,2 David Prangishvili,4 Lars Paulin,5 Qunxin She,2‡ Li Huang,1‡* and Roger A. Garrett2‡*

State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, No. 1 West Beichen Road, Chaoyang District, Beijing 100101, China1; Archaea Centre, Department of Biology, Copenhagen University, Ole Maaløes Vej 5, DK-2200N Copenhagen, Denmark2; Shanghai MOST Key Laboratory of Disease and Health Genomics, Chinese National Human Genome Center at Shanghai, Shanghai 201203, China3; Molecular Biology of the Gene in Extremophiles Unit, Institut Pasteur, rue Dr Roux 25, 75724 Paris Cedex, France4; and DNA Sequencing and Genomics Laboratory, Institute of Biotechnology, University of Helsinki, 00790 Helsinki, Finland5

Received 10 December 2010/Accepted 16 January 2011

The genomes of two Sulfolobus islandicus strains obtained from Icelandic solfataras were sequenced and analyzed. Strain REY15A is a host for a versatile genetic toolbox. It exhibits a genome of minimal size, is stable genetically, and is easy to grow and manipulate. Strain HVE10/4 shows a broad host range for exceptional crenarchaeal viruses and conjugative plasmids and was selected for studying their life cycles and host interactions. The genomes of strains REY15A and HVE10/4 are 2.5 and 2.7 Mb, respectively, and each genome carries a variable region of 0.5 to 0.7 Mb where major differences in gene content and gene order occur. These include gene clusters involved in specific metabolic pathways, multiple copies of VapBC antitoxin-toxin gene pairs, and in strain HVE10/4, a 50-kb region rich in glycosyl transferase genes. The variable region also contains most of the insertion sequence (IS) elements and high proportions of the orphan orfB elements and SMN1 miniature inverted-repeat transposable elements (MITEs), as well as the clustered regular interspaced short palindromic repeat (CRISPR)-based immune systems, which are complex and diverse in both strains, consistent with them having been mobilized both intra- and intercellularly. In contrast, the remainder of the genomes are highly conserved in their protein and RNA gene syntenies, closely resembling those of other S. islandicus and Sulfolobus solfataricus strains, and they exhibit only minor remnants of a few genetic elements, mainly conjugative plasmids, which have integrated at a few tRNA genes lacking introns. This provides a possible rationale for the presence of the introns.

* Corresponding author. Mailing address for R. A. Garrett: Archaea Centre, Department of Biology, Copenhagen University, Ole Maaløes Vej 5, DK-2200N Copenhagen, Denmark. Phone: 045-353-22010. Fax: 045-353-22128. E-mail: [email protected]. Mailing address for L. Huang: State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, No. 1 West Beichen Road, Chaoyang District, Beijing 100101, China. Phone: 086-10-64807430. Fax: 086-10-64807429. E-mail: [email protected].
† These authors contributed equally.
‡ The last three authors are joint senior authors.
▿ Published ahead of print on 28 January 2011.

Iceland has been a rich source of hyperthermophilic crenarchaea over the past 3 decades and especially of acidothermophilic members of the order Sulfolobales. Many Sulfolobus islandicus strains (“Island” is German for “Iceland”) have also yielded many novel viruses showing varied and sometimes unique morphologies and exceptional genome contents. These properties are consistent with these viruses constituting an archaeal lineage distinct from those of bacteria and eukarya, and they have now been classified into several new viral families (38, 63). In addition, a family of conjugative plasmids has been characterized, with most members deriving from Iceland, which appear to conjugate by a mechanism unique to the archaeal domain (18, 37).

Although the availability of genome sequences of Sulfolobus strains and their genetic elements has yielded important insights into the biology of these model crenarchaea, a major impediment to more detailed insights has been the paucity of robust and versatile vector-host systems for genetic studies. A few Sulfolobus species have been successfully employed as hosts for such systems, including Sulfolobus solfataricus strains P1 and 98/2 (22, 58), Sulfolobus acidocaldarius (57), and S. islandicus strain REY15A (54). To date, the genetic tools developed for the latter host are the most versatile and include the following: (i) Sulfolobus-Escherichia coli shuttle vectors carrying either viral or plasmid replication origins (50); (ii) conventional and novel gene knockout methodologies (14, 62); and (iii) a D-arabinose-inducible expression system with a lacS reporter gene system (35). The S. islandicus system has also been employed successfully to demonstrate the dynamic character of the clustered regular interspaced short palindromic repeat (CRISPR)-based immune systems of Sulfolobus when challenged with genetic elements carrying matching viral gene and protospacers maintained under selection (20). These developments necessitated the determination of the genome sequence of S. islandicus strain REY15A as a prerequisite for successful exploitation of the genetic systems.

A second Icelandic strain, S. islandicus strain HVE10/4, has been employed as a broad laboratory host for propagating diverse Sulfolobus viruses and conjugative plasmids (63) and

1672 VOL. 193, 2011 GENOME ANALYSES OF ICELANDIC STRAINS OF S. ISLANDICUS 1673

FIG. 1. (A) Dot plot of the two Icelandic genomes showing the approximate levels of sequence synteny. The large variable regions extend from about 0.35 to 1.0 Mb. Transposase genes are denoted by black lines along the axes. Putative origins of replication adjacent to the cdc6 and whiP genes are indicated with red circles, while the families of the CRISPR/Cas (I and III) and Cmr (B) modules are indicated by blue squares. (B) Dot plot of the S. islandicus REY15A and S. solfataricus P2 genomes.
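The dot plots in Fig. 1 compare two genomes position by position. As a rough illustration only, and not the procedure used for the figure (the software and parameters are not stated here), the following Python sketch marks every position pair at which two sequences share an exact k-mer on either strand. The function names, the value of k, and the toy sequences are assumptions made for this example.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def dot_plot_points(seq_x, seq_y, k=4):
    """List (x, y, strand) for every exact k-mer shared by seq_x and seq_y.

    strand is "+" when the k-mer matches directly and "-" when the k-mer in
    seq_y matches the reverse complement of a k-mer in seq_x.
    """
    # Index every k-mer of seq_x by its start position.
    index = defaultdict(list)
    for i in range(len(seq_x) - k + 1):
        index[seq_x[i:i + k]].append(i)

    points = []
    for j in range(len(seq_y) - k + 1):
        word = seq_y[j:j + k]
        for i in index.get(word, []):
            points.append((i, j, "+"))
        for i in index.get(reverse_complement(word), []):
            points.append((i, j, "-"))
    return sorted(points)


if __name__ == "__main__":
    toy = "GATTACA"  # hypothetical toy sequence, not genome data
    print(dot_plot_points(toy, toy, k=7))
    print(dot_plot_points(toy, reverse_complement(toy), k=7))
```

When such points are plotted, runs of "+" points along a diagonal indicate conserved gene order, and runs of "-" points indicate a segment that is inverted in one sequence relative to the other.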

was selected for in-depth studies of their life cycles and host interactions. This effort received added impetus with the demonstration that some genetic elements show exceptional and sometimes unique properties of their viral life cycles or conjugative mechanisms (3, 8, 18, 40). Therefore, the genome sequence of S. islandicus strain HVE10/4 was also determined.

The genome sequences of two Icelandic strains, REY15A and HVE10/4, were analyzed and compared and contrasted with one another and with genomes of other S. solfataricus and S. islandicus strains isolated from different geographical locations, including Naples, Italy; Kamchatka, Russia; Lassen Volcanic National Park; and Yellowstone National Park (44, 53).

MATERIALS AND METHODS

Genome sequencing. S. islandicus strains REY15A and HVE10/4 were colony purified three times and cultured essentially as described earlier (11). Total DNA was extracted from the cells using phenol-chloroform and further purified by CsCl density-gradient centrifugation. For strain REY15A, sequencing of shotgun libraries with a 454 GS FLX sequenator yielded 324,123 reads with 31-fold genome coverage. For strain HVE10/4, DNA was sonicated to yield fragments in the size range of 1.5 to 4.0 kb, and clone libraries were generated in pUC18 using the SmaI site. Sequencing was performed on MegaBace 1000 sequenators to yield approximately 3-fold sequence coverage, and the sequencing data were combined with a sequencing run using a 454 FLX sequenator to yield approximately 10- to 15-fold coverage. The genome sequences were assembled using the phred/phrap/consed package, contigs were linked by combinatorial PCR using primers matching to each contig end, and the PCR products were sequenced to close the gaps. Remaining ambiguous sequence regions in the genome were identified and resolved by generating and sequencing PCR products. Both genomes were annotated automatically and refined manually.

Sequence analyses. Open reading frames (ORFs) were predicted with Glimmer (13). Frameshifts were detected and checked by sequencing after manual annotation, and the remaining frameshifts were considered to be authentic. Functional assignments of ORFs are based on searches against GenBank (http://www.ncbi.nlm.nih.gov/) and the Conserved Domain Database (CDD) (www.ncbi.nlm.nih.gov/cdd/). tRNA genes were located with tRNAscan-SE (26). Potential noncoding RNAs were predicted by comparison with the untranslated RNAs characterized for S. solfataricus and S. acidocaldarius, in terms of sequence similarity and gene context (see Results). Putative insertion sequence (IS) elements were identified by BLASTN search against the IS Finder database (http://www-is.biotoul.fr/). All annotations were manually curated using Artemis software (47).

RESULTS

Genome general properties. Genomes of the two Icelandic strains were sequenced using a combination of sequencing strategies. S. islandicus REY15A was determined primarily by 454 sequencing, while strain HVE10/4 was obtained by a combination of Sanger and 454 sequencing at approximately 30-fold and 10-fold coverage, respectively. Protein-coding genes were annotated in Artemis (47), where start codons for single genes and first genes of Sulfolobus operons were generally located 25 to 30 bp downstream from the archaeal hexameric TATA-like box and only genes within operons were preceded by Shine-Dalgarno motifs, of which GGUG predominates (56). Where alternative start codons were juxtapositioned, we selected the most probable on the basis of its position relative to the putative promoter and/or Shine-Dalgarno motifs or experimental data from closely related organisms.

Dot plots of the two genomes demonstrate long sections of gene synteny. One region of about 0.5 to 0.7 Mb exhibits extensive gene shuffling, and there is a smaller region with a 200-kb inversion bordered by shuffled genes (Fig. 1A). Some of the minor irregularities in the dot plot were attributable to insertion or integration events. The synteny is maintained, to a large degree, when each genome is compared to that of S. solfataricus P2, despite the occurrence of a large inversion in the latter, and this is illustrated in a dot plot for the genomes of strain REY15A and S. solfataricus P2 (Fig. 1B). This extensive gene synteny is surprising, given the high level of transpositional activity occurring in S. solfataricus (Table 1) (7, 30, 41). A similar pattern was also observed when other pairs of S. islandicus genomes from different geographical locations were compared (48), consistent with a high level of conservation of gene synteny for all the S. solfataricus and S. islandicus genomes.

A phylogenetic tree derived from the available genomes clusters together S. islandicus strains from different geographical locations (44), with S. solfataricus strains P2 and 98/2 being more distantly related (Fig. 2 and Table 1). The nucleotide sequence identity for the concatenated core genes of the two S. islandicus genomes (Fig. 1A) is 99.6%, and between all the S. islandicus genomes, it is about 99%. The relatively long branches for individual strains (Fig. 2) arise mainly from differences in gene content of the large variable regions (Fig. 1A). The degree of sequence identity between the concatenated core genes of the S. islandicus and S. solfataricus genomes is about 90% (Fig. 2).

1674 GUO ET AL. J. BACTERIOL.

TABLE 1. Summary of genetic properties obtained from genomes of two Icelandic S. islandicus strains and other available S. solfataricus and S. islandicus strains^a

Strain | GenBank accession no. | Origin^b | Genome size (Mb)
SsolP2 | AE006641 | Naples, Italy | 3.0
Ssol98/2 | CP001402 | Unknown | 2.7
REY15A | CP002425 | Reykjanes, Iceland | 2.5
HVE10/4 | CP002426 | Hvergaardi, Iceland | 2.7
LD8.5 | CP001731 | Lassen, USA | 2.7
LS2.15 | CP001399 | Lassen, USA | 2.7
M16.4 | CP001402 | Kamchatka, Russia | 2.6
M16.27 | CP001401 | Kamchatka, Russia | 2.7
M14.25 | CP001400 | Kamchatka, Russia | 2.6
YG57.14 | CP001403 | Yellowstone, USA | 2.7
YN15.51 | CP001404 | Yellowstone, USA | 2.8

[Table 1 rows for the numbers of conserved genes (total, 1,679), unique single genes (total, 1,346), transporters (total, 15), VapBC antitoxin-toxin pairs, glycosyl transferases (50-kb region), conserved noncoding RNAs, transposases/IS elements, and MITEs (families), and for the Cmr and CRISPR/Cas families: values not recoverable from this copy.]

a Letters and numbers in parentheses for the Cmr and CRISPR/Cas modules families (25, 48) denote the numbers and families of putatively defective modules generally lacking essential genes.
b Lassen, Lassen Volcanic National Park; Yellowstone, Yellowstone National Park.

FIG. 2. Neighbor-joining tree based on a gene content matrix, including the conserved, core, and unique genes for each available S. islandicus and S. solfataricus genome (Table 1). The branch lengths represent the number of differences between the strains in terms of the presence or absence of individual genes. The data for the tree were prepared using methods described earlier (44, 48). Only bootstrap values below 100% for the individual branches are given.

Three origins of chromosome replication, demonstrated experimentally for S. solfataricus and S. acidocaldarius (27, 46), are well conserved with respect to both the DNA sequence and flanking gene organization in both of the genomes, albeit with the origin oriC2 being inverted relative to the genomes of S. solfataricus P2 and S. islandicus strain YN1551 (Fig. 1B). Origin oriC1 lies immediately upstream of cdc6-1, oriC2 is close to cdc6-3, while oriC3 is positioned downstream of the whiP gene (Fig. 1A). The two cdc6 genes and the whiP gene encode putative replication initiators (45).

Large variable region. The genomes carry two types of variable regions. The large region, constituting 20 to 25% of each

TABLE 2. Integration events at tRNA genes showing the sizes and origins of the residual integrated genes

tRNA | Intron present | Integration event^a in REY15A | Integration event^a in HVE10/4 | Conserved?
Val—TAC | No | SiRe1242-1247 conj plasmid | No insert | No
Phe—GAA | No | SiRe1321-1323 conj plasmid | SiH1399-1402 conj plasmid | Yes
Met—CAT | Yes | SiRe1465-1479, 12 kb + IS, pNOB8 integrase, unknown | No insert | No
Glu—TTC | No | SiRe1484-1490 conj plasmid | SiH1561-1574 conj plasmid | Yes
Ala—GGC | No | intN fragment | intN fragment | Yes
Thr—GGT | No | SiRe2413-2417 IS/MITEs SSV | SiH2464-24672 SSV | Partly
Pro—GGG | Yes | intN fragment | intN fragment | Yes
His—GTG | No | SiRe1787-1792, 7 kb + IS | No insert | No

a SSV, spindle-shaped fusellovirus; conj, conjugative; int, integrase.

genome, extends approximately from positions encompassing 0.3 to 0.8 Mb and 0.3 to 1.0 Mb for strains REY15A and HVE10/4, respectively (Fig. 1A). The other class is represented mainly by regions downstream from tRNA genes, where integration events have occurred (Table 1; also, see below). The large variable region contains about 60% of the potentially transposable IS elements and most of the nonautonomous mobile elements, as well as many degenerate copies of the former (Fig. 1A). It carries some gene clusters, which are present in one or more of the Sulfolobus genomes, including operons and gene cassettes associated with metabolic pathways, and it contains the diverse CRISPR/Cas and Cmr modules (Table 1; also, see below). It generally lacks essential genes; for example, no tRNA genes or replication origins are present, and thus, it appears to constitute a region where nonessential genes are collected, interchanged, and exchanged intercellularly and where genetic innovation occurs.

Integration sites. tRNA gene integration events in Sulfolobus genomes predominantly involve conjugative plasmids and fuselloviruses, and these were also the genetic elements most commonly isolated from acidic hot springs in Iceland (63). Most integration events occur via an archaea-specific mechanism, whereby a viral/plasmid integrase gene recombines into a host tRNA gene and partitions (32). The capture of a genetic element in a chromosome leaves a trace because the intN fragment overlapping the tRNA gene is generally maintained, even if the remainder of the genetic element degenerates or is deleted (51, 52) (Table 2).

For strains REY15A and HVE10/4, remnants of integrated elements adjoin eight and five tRNA genes, respectively (Table 2). Most of the integrated genes derive from conjugative plasmids, and fuselloviral genes were detected only at tRNAThr[GGT] in each strain, with an integrated region of unknown origin at tRNAMet[CAT] in strain REY15A. All of the integrated elements are highly degenerate, with IS elements or miniature inverted-repeat transposable elements (MITEs) inserted downstream from the tRNA genes (Table 2). Given the possibility of multiple integrations of genetic elements occurring at a given tRNA gene, it is difficult to analyze unambiguously the origins of residual integrated genes (42).

In contrast to the two Icelandic strains, the other S. solfataricus and S. islandicus genomes carry intact genetic elements bordered by intN and intC fragments that are all potentially excisable (44, 52). They each show evidence of 2 to 7 tRNA gene integration events, in which the most conserved sites are tRNAPro[GGG] and tRNAAla[GGC], with less common events at tRNALeu[GAG] and different alleles of tRNAArg (Table 2). For the integrated tRNA genes of the Icelandic strains, there was no significant correlation between the identity of the tRNA anticodon and the frequency of codon usage or between the encoded amino acid and the average number of amino acids in the genome-encoded proteins.

Anti-integration role for tRNA introns. Each genome carries 45 tRNA genes and 2 to 3 pseudo-tRNA genes all located in conserved regions. Sixteen of the tRNA genes contain introns immediately 3′ to the anticodon, varying in size from 12 to 65 bp, and in contrast to many archaeal tRNA genes, none were detected at other sites (29), although putatively degenerate introns, lacking the capacity to form splicing sites, occur in D-loop regions of tRNAGlu[CTC] and tRNAGlu[TTC]. Moreover, the tRNA genes and introns are highly conserved in sequence between the two genomes, and also with the other six S. islandicus genomes, with very few base changes occurring between the introns of a given tRNA. This high level of tRNA and intron sequence conservation extends to S. solfataricus P2, with only very minor differences observed for about one-third of the genes, and it reinforces the concept that the RNA introns are functionally important (5).

A possible function for the tRNA introns, suggested by the above-described analyses, is that they provide protection against integration of genetic elements into tRNA genes. Integration can be disadvantageous in that pre-tRNA transcription can be impaired. Only two intron-carrying tRNA genes showed evidence of integration events (Table 2). For the tRNAMet[CAT] gene copies, an intact integrase gene is located downstream from the tRNA gene, while for the tRNAPro[GGG], an overlapping intN fragment is present, but the overlapping sequence does not extend to the intron, suggesting that the intron entered after the integration event. This is consistent with the latter integration event being the most conserved, and probably the most ancient, among Sulfolobus species.

IS elements and the versatile orfB element. Each genome carries a limited range of IS element types, with some in multiple copies (Table 3). The IS elements are clustered in the variable genomic region and also downstream from tRNA genes that have undergone integration events (Fig. 1A). Many of these elements appear to be intact, carrying the inverted terminal repeats (ITRs) required for transposition, but exhibit fragmented transposase genes, which are unlikely to be restored by programmed translational frameshifting, as was observed for some bacterial transposases of the IS1 and IS3

TABLE 3. Properties of the IS elements, transposases, and MITEs in the Icelandic genomes^a

No. of:

Element | Family | ORFs | REY15A copies | Intact TPases in REY15A | HVE10/4 copies | Intact TPases in HVE10/4 | Conserved genome positions
ISC796 | IS1 | 1 | 5 | 0 | 4 | 1 | 1
ISC1043 | ISL3 | 1 | 1 | 0 | 1 | 0 | 1
ISC1048 | IS630 | 11001203
ISC1058 | IS5 | 1 | 2 | 0 | 1 | 0 | 1
ISC1078 | IS630 | 1 | 1 | 0 | 1 | 0 | 0
ISC1190 | IS110 | 1 | 3 | 1 | 1 | 0 | 0
ISC1200 | ISH3 | 122117 | 3 | 0
ISC1205 | ISCNY | 2 | 3 | 0 | 4 | 2 | 1
ISC1229 | IS110 | 1 | 4 | 2 | 10 | 9 | 1
ISC1234 | IS5 | 1 | 8 | 6 | 7 | 5 | 1
ISC1332 | IS256 | 2 | 1 | 1 | 1 | 1 | 0
ISC1395 | IS630 | 1/2 | 2 | 1 | 0 | 0 | 0
ISC1733 | IS200/IS605 | 2 | 8 | 8 | 2 | 2 | 1
ISC1921 | IS607 | 2 | 1 | 1 | 0 | 0 | 0
ISSis1 (pARN4) | IS6 | 1 | 4 | 2 | 5 | 4 | 0
ISSto2 | IS6 | 1 | 2 | 1 | 2 | 2 | 0
OrfB | IS605 | 1 | 19 (8) | 18 | 18 (2) | 17 | 0
SM3A | | 0 | 2 | 0 | 2 | 0 | 2
SMN1 | | 0 | 7 | 7 | 9 | 9 | 0

a The nomenclature used for IS elements and MITEs follows that which was used previously (2, 6, 16). For the OrfB elements, the numbers in parentheses indicate the numbers of copies that are physically linked to ISC1200 elements. TPase, transposase.
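One way to read Table 3 is to compare, for each element, how many copies still carry an intact transposase gene. The short Python sketch below does this arithmetic for three rows. It assumes that the copy and intact-transposase counts for ISC796, ISC1229, and ISC1234 are read from the table as shown in the comments, and the ranking criterion is an illustration only, not the authors' stated definition of an active element.

```python
# (copies, intact transposase genes) per strain, as read from Table 3.
# Assumption: the digit runs in the table split as shown here.
TABLE3_COUNTS = {
    "ISC796": {"REY15A": (5, 0), "HVE10/4": (4, 1)},
    "ISC1229": {"REY15A": (4, 2), "HVE10/4": (10, 9)},
    "ISC1234": {"REY15A": (8, 6), "HVE10/4": (7, 5)},
}


def intact_fraction(copies, intact):
    """Fraction of copies with an intact transposase gene (0.0 if no copies)."""
    return intact / copies if copies else 0.0


def rank_by_intact_fraction(counts, strain):
    """Return (element, fraction) pairs for one strain, highest fraction first."""
    rows = [
        (element, intact_fraction(*per_strain[strain]))
        for element, per_strain in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for strain in ("REY15A", "HVE10/4"):
        for element, fraction in rank_by_intact_fraction(TABLE3_COUNTS, strain):
            print(f"{strain}\t{element}\t{fraction:.2f}")
```

Under these assumed readings, ISC1234 has the highest intact fraction of the three in strain REY15A and ISC1229 has the highest in strain HVE10/4.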

families (28). Although some of these elements may be mobilizable by transposases acting in trans, for over one-third of the IS families present, there is no encoded transposase (Table 3). Potentially, the most active elements are ISC1200 and ISC1234 in both genomes and ISC1229 in strain HVE10/4 (Table 3). The two Icelandic S. islandicus strains, together with those from Kamchatka, Russia, carry the lowest number of IS elements (Table 1), many of which are inactive.

orfB elements of family IS605, together with elements of the IS6 family (Table 3), are considered to represent the few classes of transposable elements that are ancestral to the archaeal domain (16). orfB occurs alone, or together with a transposase gene, orfA, in the IS200/605 family of transposable elements. They lack ITRs, and both element types occur commonly in viruses and conjugative plasmids of the Sulfolobales (18, 40) (Table 3). Exceptionally, strain REY15A and HVE10/4 genomes carry 11 and 16 nearly identical copies of the single orfB elements in unconserved genomic positions, respectively. This is consistent with these being the most active transposable elements in each genome (Table 3), although it remains uncertain whether they are autonomous or require an OrfA in trans for mobility (16). In addition, the orfB elements are exceptionally adaptable, because a further 8 and 2 copies are physically coupled to copies of ISC1200 for strains REY15A and HVE10/4, respectively (Table 3), and are potentially cotransposable.

Sulfolobus MITEs. Only two MITE types were detected in multiple copies in each genome, SMN1 (320 bp) and SM3A (164 bp) (Table 3), and both of which are capable of nonautonomous transposition in different S. islandicus strains, facilitated by transposases of ISC1733 and ISC1058, respectively (2, 4, 43). All SMN1 copies are located immediately downstream from the sequence TTTAA, but none occur at conserved positions within the two genomes. Clearly, the SMN1 MITEs are active in both of the genomes, as is ISC1733, which encodes the mobilizing transposase (Table 3), and they appear to be cleanly excised when mobilized, in agreement with the results of an earlier induced excision in the S. islandicus strain REN1H1 (2). Although most SMN1 copies lie in intergenic regions, and may or may not affect regulatory signals, some appear to inactivate or alter genes. Thus, in strain REY15A, an AAA+ ATPase (SiRe0883) and a hypothetical gene (SiRe0925) have incurred insertions in their promoters, and in strain HVE10/4, SMN1 copies partially overlap with two genes (SiH0773/2472), generating altered ORF sequences.

In contrast, the two SM3A copies are conserved in position in each genome, consistent with the mobilizing transposase encoded in ISC1058 being degenerate in both genomes. Nevertheless, each SM3A copy retains the conserved 8-bp inverted terminal repeat of the ISC1058 element (and unconserved 9-bp direct repeats resulting from the transposition event) and can potentially be mobilized if a transposase-encoding ISC1058 element enters the cell. Their maintenance as intact elements may result from one SM3A copy overlapping with the start of a conserved C/D box RNA gene (3), which may alter its transcriptional properties, while the other lies between promoters of two conserved protein genes and may influence their relative transcriptional levels. SM3A occurs in a few copies in each of the sequenced S. islandicus genomes, whereas SMN1 is limited to the Icelandic and three Kamchatka strains, where it occurs in 1 to 5 copies (Table 1).

Strain-specific metabolic pathways. Each Icelandic strain shows a few specific metabolic properties. Thus, the REY15/A strain carries an operon (SiRe0441-0445) encoding enzymes implicated in nitrate reduction and nitrite extrusion, suggesting that it can use nitrate as a terminal electron acceptor for anaerobic respiration. The operon is located in the variable region and has been observed previously only for two other archaea, S. islandicus strains M.14.25 and M.16.27. The larger genome of strain HVE10/4 exclusively carries a urease operon (SiH0978-0983) predicted to encode enzymes involved in the hydrolysis of urea to NH4 and CO2 and previously found only in the archaea Sulfolobus tokodaii, Metallosphaera sedula, and Cenarchaeum symbiosum. Moreover, uniquely for a Sulfolobus species, strain HVE10/4 also carries several genes predicted to encode hydrogenases and hydrogenase maturation enzymes (SiH0883-0892) in the variable region, which suggests that the strain may be able to grow anaerobically.

A 50-kb region of strain HVE10/4 in the variable region (SiH0447-0489) is bordered by IS elements and carries 15 predicted glycosyl transferase genes (group 1 and family 2), constituting about half of the genome copies, interspersed almost exclusively with genes of unknown function and a gene encoding a predicted polysaccharide biosynthesis enzyme. It is well established that Sulfolobus S-layer proteins SlaA and SlaB (SiRe1612/1 and SiH1691/0, respectively) are heavily glycosylated (36), but the relatively low G+C content of the region suggests that it has been inserted and has an alternative unknown function. The genome region is absent from strain REY15A and from some of the other S. islandicus strains (Table 1).

Transporters. Sulfolobus strains utilize different sugars and carbohydrates as carbon and energy sources (19), consistent with their coding capacity for solute ABC transporters. A total of 15 different ABC transporters were identified, of which strain REY15A carries 12 and strain HVE10/4 contains 14. Of these, 11 ABC transporters are present in S. solfataricus P2 (53), 6 in S. tokodaii (23), but only 3 in S. acidocaldarius (9). The other S. islandicus genomes each carry 10 to 14 ABC transporters (44) (Table 1). In both of the Icelandic genomes, many ABC transporter genes are located in the variable region (Fig. 1A) and are often flanked by transposons, consistent with their being subjected to loss or gain events.

The ABC transporters are diverse, and some of their solute specificities have been identified for other Sulfolobus strains (15, 24). Cellobiose, maltose, and arabinose transporters are present in both of the Icelandic genomes and most other sequenced S. solfataricus and S. islandicus genomes, although a few S. islandicus strains lack one of the systems, as follows: the arabinose system is absent from strain YG5714, while the maltose system is not present in strains YN1551 and LD215. Strikingly, the transporter of glucose, the preferred carbon source for many microbes, is present only in the Icelandic strains, S. islandicus strains M1415 and YG5714, and in S. solfataricus P2. The lack of specific ABC transporters suggests either that glucose is an uncommon nutrient in hot environments or that another ABC transporter can facilitate glucose transport. One ABC transporter encoded in the variable region of strain HVE10/4 (SiH0899-0903), flanked by IS elements, appears to be unique in public sequence databases.

Toxin-antitoxin systems. Four of the eight families of antitoxin-toxin complexes characterized for free-living bacteria also occur in archaea, of which the VapBC family is by far the most abundant (34) and is the main antitoxin-toxin family that we detected in the Sulfolobus strains. The Icelandic strains REY15A and HVE10/4 carry 17 and 18 vapBC gene pairs, respectively (Table 1), as well as 2 vapC-like gene copies coupled to other genes. They are distributed throughout the genomes, with several located in the variable region, and only five gene pairs are conserved in sequence and gene contexts in both strains (SiRe0698/SiH0636, SiRe2073/SiH2137, SiRe2171/SiH2227, SiRe2294/SiH2344, and SiRe2626/SiH2689). Sequence alignments and tree-building exercises demonstrated that the sequences of both antitoxins and toxins within each genome are very diverse and can be classified into subtypes (data not shown), consistent with their functional diversity and targeting of different cellular sites. These data also indicate, for given gene pairs, that the subtypes of VapB and VapC do not always correspond, implying that some gene pairs may have exchanged partners.

Reading frame shifts and mRNA intron splicing. Examples of translational reading frame shifts yielding single polypeptides have been demonstrated experimentally for S. solfataricus P2 (10). For two of these, a predicted transketolase (SiRe1696/8 and SiH1776/8) and a putative O-sialoglycoprotein endopeptidase (SiRe1569/70 and SiH1648/9), the S. islandicus genes overlap in a similar way and are likely to undergo reading frame shifts. In contrast to S. solfataricus P2, α-fucosidase (SiRe2185 and SiH2241) is a single gene, as is the predicted dihydrolipoamide acyltransferase gene (SiH0582), located only in strain HVE10/4. Very few transposase genes present in IS elements (Table 3) carry a single reading frame shift that could be expressed as a single protein via translational reading frame shifts (28).

Transcripts of the intron-carrying cbf5 genes (SiRe1607/8 and SiH1686/7) have been demonstrated to be spliced by the archaeal splicing enzyme at the mRNA level in some crenarchaea (60). Other mRNAs, including those encoding the XPD helicase (SiRe1685/SiH1765), have been predicted to undergo splicing, but experimental support is lacking (5).

Noncoding RNAs. Many untranslated RNAs have been characterized for S. solfataricus and S. acidocaldarius using a variety of techniques, including probing cell extracts for RNA with K-turn binding motifs and generating cDNA libraries of total cellular RNA extracts, as well as numerous antisense RNAs (33, 55, 59, 61). Most of these RNAs were characterized for nucleotide length and partial sequence, and several were detected by more than one experimental approach. We have reanalyzed all these different RNA entities and have annotated the S. islandicus RNA homologs which are conserved in both sequence and gene contexts. The total number of RNA genes and their putative functions are given (Table 4).

TABLE 4. Summary of Sulfolobus conserved noncoding RNA genes located in the two Icelandic genomes

RNA | Function/modification | No. of indicated RNAs in genome of strain REY15A | No. of indicated RNAs in genome of strain HVE10/4
C/D box | rRNA | 18 | 16
C/D box | tRNA | 4 | 4
C/D box | Unknown | 3 | 3
H/ACA box | tRNA | 2 | 2
Noncoding | Unknown | 31 | 27
Total | | 58 | 53

As for other archaeal hyperthermophiles, each genome carries many C/D box RNAs that methylate primarily rRNAs and tRNAs (Table 4). In strains REY15A and HVE10/4, 18 and 16

FIG. 3. (A) Phylogenetic tree of Cmr2 and its homologues in all sequenced archaeal genomes generates 5 families, A to E. The two Icelandic strains carry family B Cmr modules, for which the gene order is shown. Other sequenced S. islandicus and S. solfataricus strains also carry Cmr modules of family B and, less frequently, families D or E, as indicated on the tree. (B) Schematic representations of the CRISPR/Cas cassettes in the two Icelandic strains, together with the contents of their CRISPR loci. Strain REY15A carries a single family I CRISPR/Cas cassette (blue), whereas HVE10/4 carries cassettes from families III and I (orange and blue, respectively). Compositions of the individual CRISPR loci are shown, where each triangle represents a spacer-repeat unit. Significant spacer matches to sequenced viruses and plasmids are color coded (red, rudiviruses; orange, lipothrixviruses; yellow, fuselloviruses; green, bicaudaviruses; turquoise, turreted icosahedral viruses; blue, conjugative plasmids; and violet, cryptic plasmids).
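The spacer matches color coded in Fig. 3B come from comparing CRISPR spacers with sequenced viruses and plasmids. The study's own approach examined both nucleotide and translated sequences (25, 49); the Python sketch below is a deliberately simplified stand-in that only slides a spacer along a target on both strands and reports windows within a mismatch limit. The function names, the mismatch threshold, and the toy sequences are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_spacer_matches(spacer, target, max_mismatches=2):
    """Return (start, strand, mismatches) for each near-match of spacer in target.

    Both the spacer ("+") and its reverse complement ("-") are compared with
    every window of the target; no gaps are allowed.
    """
    hits = []
    n = len(spacer)
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        for start in range(len(target) - n + 1):
            mismatches = count_mismatches(query, target[start:start + n])
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return hits


if __name__ == "__main__":
    spacer = "GATTACACCG"            # hypothetical spacer, not from the genomes
    virus = "TTTTGATTACTCCGTTTT"     # hypothetical target with one mismatch
    print(find_spacer_matches(spacer, virus))
```

A real comparison would also need a significance criterion, since short spacers can match long genomes by chance; the fixed mismatch limit here is only a placeholder for that step.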

C/D box RNAs target rRNAs, respectively, while 4 modify tRNAs and a further 3 have unknown targets. Two copies of H/ACA RNA genes are present in each genome which, together with the aPus7 protein (SiRe1836 and SiH1908), generate pseudouridine-35 in pre-tRNATyr transcripts (31). Each of these C/D box and H/ACA box RNA genes can be detected in the other available S. islandicus genomes, which underlines their functional importance. Of these, only three RNA genes characterized for other Sulfolobus strains, Sso-sR4, Sso-sR8, and Sso-92, were not located in any S. islandicus genomes (33, 55). For the numerous noncoding RNAs of unknown function, similar contents were found for the two Icelandic strains (Table 4) and for the other S. islandicus strains, with only a few variations (Table 1), thereby underlining their functional importance.

Diversity of the CRISPR-based immune systems. The CRISPR/Cas and Cmr modules all lie within the large variable regions. They show marked heterogeneity in the number and family (25, 48) and are unconserved in position between the genomes (Fig. 1A). Whereas REY15A carries one paired CRISPR/Cas module of the family I type and two family B Cmr modules, HVE10/4 contains two paired CRISPR/Cas modules of family I and III types and a single family B Cmr module (48) (Fig. 3A and B). This diversity of CRISPR-based systems also extends to the other S. solfataricus and S. islandicus genomes (Table 1). Although the gene content and organization of the paired family I CRISPR/Cas modules are quite conserved among crenarchaea (48), exceptionally, for strain HVE10/4, the internal group of cas genes located between the two leader regions is inverted (Fig. 3B), indicative of a rearrangement having occurred within the module, possibly via the identical inverted repeat sequences of the bordering leader regions (Fig. 3B).

The CRISPR loci of strain REY15A carry 115 and 93 spacer-repeat units centered at position 733,000, while those of HVE10/4 contain 116 and 101 repeat-spacer units and 35 and 14 repeat-spacer units centered at positions 364000 and 745000, respectively (Fig. 1A). No spacer sequence identity was detected within, or between, the two Icelandic strains or with the other S. solfataricus and S. islandicus genomes. None of the available fully sequenced S. islandicus genomes (Table 1) have any spacers in common, in contrast to the S. solfataricus strains P1, P2, and 98/2, which all share many identical spacers (17, 25) despite their being as distant from one another, phylogenetically, as the S. islandicus strains (Fig. 2). Thus, it seems that diversification of genomic CRISPR loci can occur either by simple spacer turnover or by horizontal transfer of whole or partial CRISPR/cas cassettes. There is increasing evidence for the latter mechanism being the most common one in S. islandicus strains (17, 21).

Since many of the characterized viruses and plasmids of Sulfolobus derive from Iceland, we analyzed the degree to which CRISPR spacer sequences of the Icelandic strains yielded significant matches to genetic element sequences using an earlier approach examining nucleotide and translated sequences of the spacers (25, 49). Several significant sequence matches were detected for both of the genomes, primarily to rudiviruses, fuselloviruses, and conjugative plasmids, all of which are abundant in Icelandic hot springs (63), but also were detected in smaller numbers to other viruses and cryptic plasmids (Fig. 3B).

DISCUSSION

The genome analyses underline the potential importance of S. islandicus strain REY15A as a model organism for molecular genetic studies of the Sulfolobales, and crenarchaea in general, for a variety of reasons. The genome size of 2.5 Mb is minimal for a Sulfolobus species; moreover, the incidence of mobile elements is relatively low (Table 1), and stable deletion mutants can be readily isolated (14, 20). Furthermore, the high incidence of diverse ABC transporter systems (Table 1) may explain why S. islandicus (and S. solfataricus) is most commonly isolated from enrichment cultures obtained from terrestrial acidic hot springs, which is in contrast to, for example, S. acidocaldarius, which carries only three ABC transporters (9, 44, 63).

The relatively high incidence of deletion mutants obtained from strain REY15A occurs despite the presence of several transposable elements. However, in both of the Icelandic strains, many of the IS elements are degenerate or carry disrupted transposase genes (Table 3), consistent with the “copy-and-paste” transpositional mechanism of most classes of Sulfolobus IS elements and their undetectably low reversibility rate (4, 41). The inability to remove the elements by spontaneous deletion, which does occur in many bacteria (16), may also explain the presence of antisense RNAs in Sulfolobus species to regulate transposase activity (55). The Icelandic strains do, however, carry many copies of orphan orfB elements and SMN1 MITEs, which are mobilized by a “cut-and-paste” mechanism presumably through OrfA encoded in IS element ISC1733 (2, 16). The SMN1 MITEs appear to be specific to the Icelandic and Kamchatka strains (Table 1), and they can generate genetic novelty, reversibly, by extending open reading frames, in contrast to the other Sulfolobus MITEs, which carry many potential stop codons in all reading frames (43). The absence of most of the known Sulfolobus MITEs, except SM3A, probably reflects the much lower diversity of the mobilizing transposases present (Table 3). Many of these elements are located in the large variable

[...]

tem (SiH1435 to SiH1437). Moreover, the CRISPR/Cas and CRISPR/Cmr modules of strain HVE10/4 are relatively complex, as they also are for strain REY15A and other Sulfolobus strains. Their activities have also been demonstrated, at least for strain REY15A, by challenging the CRISPR/Cas systems with vector-borne matching protospacers maintained under selection, which produced deletions of the matching spacers (20). The puzzle remains as to why the Sulfolobus CRISPR-based systems are so complex, given that many of the viruses and plasmids coexist at low copy numbers and are nonlytic. One possibility is that the CRISPR/Cmr system primarily has a regulatory role, with antisense crRNAs (CRISPR RNAs) targeting viral mRNAs. Whatever the reason, the genetic closeness of strains REY15A and HVE10/4 suggests that the former may also be a broad host for viruses and plasmids, with the added advantage that genetic manipulation systems are now available, and our preliminary studies with fuselloviruses and conjugative plasmids support this supposition.

ACKNOWLEDGMENTS

This research was supported by grants from the National Natural Science Foundation of China (grants 306210165, 30730003, and 30870058) to L.H., a grant from the Danish Research Council for Technology and Production (grant 09-062932) to Q.S., and grants from the Danish Natural Science Research Council (grant 272-08-0391) and Danish National Research Foundation to R.A.G.

REFERENCES

1. Arcus, V. L., K. Bäckbro, A. Roost, E. L. Daniel, and E. N. Baker. 2004. Distant structural homology leads to the functional characterisation of an archaeal PIN domain as an exonuclease. J. Biol. Chem. 279:16471–16478.
2. Berkner, S., and G. Lipps. 2007. An active nonautonomous mobile element in Sulfolobus islandicus REN1H1. J. Bacteriol. 189:2145–2149.
3. Bize, A., et al. 2009. A unique virus release mechanism in archaea. Proc. Natl. Acad. Sci. U. S. A. 106:11306–11311.
4. Blount, Z. D., and D. W. Grogan. 2005. New insertion sequences of Sulfolobus: functional properties and implications for genome evolution in hyperthermophilic archaea. Mol. Microbiol.
55:312–325. region where genetic diversification occurs, including the 5. Bru¨gger, K., X. Peng, and R. A. Garrett. 2007. Sulfolobus genomes: mecha- uptake and loss of operons and gene cassettes and rear- nisms of rearrangement and charge, p. 95–104. In R. A. Garrett and H.-P. rangements of mainly nonessential genes. A similar variable Klenk (ed.), Archaea: evolution, physiology, and molecular biology. Black- well Publishing, Oxford, United Kingdom. genetic region in many genetic elements of Sulfolobus has 6. Bru¨gger, K., et al. 2002. Mobile elements in archaeal genomes. FEMS also been observed (e.g., see reference 18). Microbiol. Lett. 206:131–141. Many questions concerning the exceptional molecular and 7. Bru¨gger, K., E. Torarinsson, P. Redder, L. Chen, and R. A. Garrett. 2004. Shuffling of Sulfolobus genomes by autonomous and non-autonomous mo- cellular properties of crenarchaeal organisms remain to be bile elements. Biochem. Soc. Trans. 32:179–183. resolved. They include the functions of the multiple and highly 8. Brumfield, S. K., et al. 2009. Particle assembly and ultrastructural features associated with the replication of the lytic archaeal virus Sulfolobus turreted diverse gene pairs encoding VapBC antitoxin-toxins. For hy- icosahedral virus. J. Virol. 83:5964–5970. perthermophilic Sulfolobus species, in particular, their pres- 9. Chen, L., et al. 2005. The genome of Sulfolobus acidocaldarius, a model ence and variety could be a prerequisite for adaptation to life organism of the Crenarchaeota. J. Bacteriol. 187:4992–4999. 10. Cobucci-Ponzano, B., et al. 2010. Functional characterisation and high- under extreme, and sometimes rapidly varying, temperature throughput proteomic analysis of interrupted genes in the archaeon Sulfolo- and pH conditions, as well as to survival in nutrient-poor en- bus solfataricus. J. Proteome Res. 9:2496–2507. vironments possibly by optimizing the quality control of gene 11. Contursi, P., et al. 2006. 
Characterisation of the Sulfolobus host-SSV2 virus interaction. Extremophiles 10:615–627. expression (12, 34). They may also be related to the sulfolo- 12. Cooper, C. R., A. J. Daugherty, S. Tachdjian, P. H. Blum, and R. M. Kelly. bicins implicated in killing competitor Sulfolobus cells (39). 2009. Role of vapBC toxin-antitoxin loci in the thermal stress response of Sulfolobus solfataricus. Biochem. Soc. Trans. 37:123–126. The crystal structure of a VapC toxin from the crenarchaeal 13. Delcher, A. L., K. A. Bratke, E. C. Powers, and S. L. Salzberg. 2007. Iden- hyperthermophile Pyrobaculum aerophilum implicated the pro- tifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformat- tein in exonuclease activity (1), but the multiplicity and wide ics 23:673–679. 14. Deng, L., H. Zhu, Z. Chen, Y. X. Liang, and Q. She. 2009. Unmarked gene sequence diversity of the vapBC genes suggest that the toxins deletion and host-vector system for the hyperthermophilic crenarchaeon target different cellular or molecular sites. Sulfolobus islandicus. Extremophiles 13:735–746. Strain HVE10/4 has been used as a host for a variety of 15. Elferink, M. G., S. V. Albers, W. N. Konings, and A. J. Driessen. 2001. Sugar transport in Sulfolobus solfataricus is mediated by two families of binding genetic elements, mainly from Iceland, which were likely to be protein-dependent ABC transporters. Mol. Microbiol. 39:1494–1503. genetically close to the Icelandic host (63). The genome anal- 16. File´e, J., P. Siguier, and M. Chandler. 2007. Insertion sequence diversity in archaea. Microbiol. Mol. Biol. Rev. 71:121–157. yses provide few insights into why it is a good host, especially 17. Garrett, R. A., et al. 2011. CRISPR-based immune systems of the Sulfolo- since it appears to carry a type 1 restriction-modification sys- bales: complexity and diversity. Biochem. Soc. Trans. 39:51–57. 1680 GUO ET AL. J. BACTERIOL.

18. Greve, B., S. Jensen, K. Brügger, W. Zillig, and R. A. Garrett. 2004. Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1:231–239.
19. Grogan, D. W. 1989. Phenotypic characterization of the archaebacterial genus Sulfolobus: comparison of five wild-type strains. J. Bacteriol. 171:6710–6719.
20. Gudbergsdottir, S., et al. 2011. Dynamic properties of the Sulfolobus CRISPR/Cas and CRISPR/Cmr systems when challenged with vector-borne viral and plasmid genes and protospacers. Mol. Microbiol. 79:35–49.
21. Held, N. L., A. Herrera, H. Cadillo-Quiroz, and R. J. Whitaker. 2010. CRISPR associated diversity within a population of Sulfolobus islandicus. PLoS One 5:e12988.
22. Jonuscheit, M., E. Martusewitsch, K. M. Stedman, and C. Schleper. 2003. A reporter gene system for the hyperthermophilic archaeon Sulfolobus solfataricus based on a selectable and integrative shuttle vector. Mol. Microbiol. 48:1241–1252.
23. Kawarabayasi, Y., et al. 2001. Complete genome sequence of an aerobic thermoacidophilic crenarchaeon, Sulfolobus tokodaii strain 7. DNA Res. 8:123–140.
24. Koning, S. M., S. V. Albers, W. N. Konings, and A. J. Driessen. 2002. Sugar transport in (hyper)thermophilic archaea. Res. Microbiol. 153:61–67.
25. Lillestøl, R. K., et al. 2009. CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol. Microbiol. 72:259–272.
26. Lowe, T. M., and S. R. Eddy. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955–964.
27. Lundgren, M., A. Andersson, L. Chen, P. Nilsson, and R. Bernander. 2004. Three replication origins in Sulfolobus species: synchronous initiation of chromosome replication and asynchronous termination. Proc. Natl. Acad. Sci. U. S. A. 101:7046–7051.
28. Mahillon, J., and M. Chandler. 1998. Insertion sequences. Microbiol. Mol. Biol. Rev. 62:725–774.
29. Marck, C., and H. Grosjean. 2003. Identification of BHB splicing motifs in intron-containing tRNAs from 18 archaea: evolutionary implications. RNA 9:1516–1531.
30. Martusewitsch, E., C. W. Sensen, and C. Schleper. 2000. High spontaneous mutation rate in the hyperthermophilic archaeon Sulfolobus solfataricus is mediated by transposable elements. J. Bacteriol. 182:2574–2581.
31. Muller, S., et al. 2009. Deficiency of the tRNATyr:Ψ35-synthase aPus7 in Archaea of the Sulfolobales order might be rescued by the H/ACA sRNA-guided machinery. Nucleic Acids Res. 37:1308–1322.
32. Muskhelishvili, G., P. Palm, and W. Zillig. 1993. SSV1-encoded site-specific recombination system in Sulfolobus shibatae. Mol. Gen. Genet. 273:334–342.
33. Omer, A. D., M. Zago, A. Chang, and P. P. Dennis. 2006. Probing the structure and function of an archaeal C/D-box methylation guide sRNA. RNA 12:1708–1720.
34. Pandey, D. P., and K. Gerdes. 2005. Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic Acids Res. 33:966–976.
35. Peng, N., Q. Xia, Z. Chen, Y. X. Liang, and Q. She. 2009. An upstream activation element exerting differential transcription activation on an archaeal promoter. Mol. Microbiol. 74:928–939.
36. Peyfoon, E., et al. 2010. The S-layer glycoprotein of the crenarchaeote Sulfolobus acidocaldarius is glycosylated at multiple sites with chitobiose-linked N-glycans. Archaea pii:754101.
37. Prangishvili, D., et al. 1998. Conjugation in archaea: frequent occurrence of conjugative plasmids in Sulfolobus. Plasmid 40:190–202.
38. Prangishvili, D., P. P. Forterre, and R. A. Garrett. 2006. Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 4:837–848.
39. Prangishvili, D., et al. 2000. Sulfolobicins, specific proteinaceous toxins produced by strains of the extremely thermophilic archaeal genus Sulfolobus. J. Bacteriol. 182:2985–2988.
40. Prangishvili, D., et al. 2006. Structural and genomic properties of the hyperthermophilic archaeal virus ATV with an extracellular stage of the reproductive cycle. J. Mol. Biol. 359:1203–1216.
41. Redder, P., and R. A. Garrett. 2006. Mutations and rearrangements in the genome of Sulfolobus solfataricus P2. J. Bacteriol. 188:4198–4206.
42. Redder, P., et al. 2009. Four newly isolated fuselloviruses from extreme geothermal environments reveal unusual morphologies and a possible interviral recombination mechanism. Environ. Microbiol. 11:2849–2862.
43. Redder, P., Q. She, and R. A. Garrett. 2001. Non-autonomous elements in the crenarchaeon Sulfolobus solfataricus. J. Mol. Biol. 306:1–6.
44. Reno, M. L., N. L. Held, C. J. Fields, P. V. Burke, and R. J. Whitaker. 2009. Sulfolobus islandicus pan-genome. Proc. Natl. Acad. Sci. U. S. A. 106:8605–8610. (Erratum, 106:18873.)
45. Robinson, N. P., and S. D. Bell. 2007. Extrachromosomal element capture and the evolution of multiple replication origins in archaeal chromosomes. Proc. Natl. Acad. Sci. U. S. A. 104:5806–5811.
46. Robinson, N. P., et al. 2004. Identification of two origins of replication in the single chromosome of the archaeon Sulfolobus solfataricus. Cell 116:25–38.
47. Rutherford, K., et al. 2000. Artemis: sequence visualization and annotation. Bioinformatics 16:944–945.
48. Shah, S. A., and R. A. Garrett. 2011. CRISPR/Cas and Cmr modules, mobility and evolution of adaptive immune systems. Res. Microbiol. 162:27–38.
49. Shah, S. A., N. R. Hansen, and R. A. Garrett. 2009. Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Trans. Biochem. Soc. 37:23–28.
50. She, Q., et al. 2008. Host-vector systems for hyperthermophilic archaeon Sulfolobus, p. 151–156. In S.-J. Liu and H. L. Drake (ed.), Microbes and the environment: perspective and challenges. Science Press, Beijing, China.
51. She, Q., X. Peng, W. Zillig, and R. A. Garrett. 2001. Gene capture in archaeal chromosomes. Nature 409:478.
52. She, Q., B. Shen, and L. Chen. 2004. Archaeal integrases and mechanisms of gene capture. Biochem. Soc. Trans. 22:222–226.
53. She, Q., et al. 2001. The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc. Natl. Acad. Sci. U. S. A. 98:7835–7840.
54. She, Q., et al. 2009. Genetic analyses in the hyperthermophilic archaeon Sulfolobus islandicus. Biochem. Soc. Trans. 37:92–96.
55. Tang, T.-H., et al. 2005. Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55:469–481.
56. Torarinsson, E., H.-P. Klenk, and R. A. Garrett. 2005. Divergent transcriptional and translational signals in Archaea. Environ. Microbiol. 7:47–54.
57. Wagner, M., et al. 2009. Expanding and understanding the genetic toolbox of the hyperthermophilic genus Sulfolobus. Biochem. Soc. Trans. 37:97–101.
58. Worthington, P., V. Hoang, F. Perez-Pomares, and P. Blum. 2003. Targeted disruption of the alpha-amylase gene in the hyperthermophilic archaeon Sulfolobus solfataricus. J. Bacteriol. 185:482–488.
59. Wurtzel, O., et al. 2010. A single-base resolution map of an archaeal transcriptome. Genome Res. 20:133–141.
60. Yokobori, S., et al. 2009. Gain and loss of an intron in a protein-coding gene in Archaea: the case of an archaeal RNA pseudouridine synthase gene. BMC Evol. Biol. 9:198.
61. Zago, M. A., P. P. Dennis, and A. D. Omer. 2005. The expanding world of small RNAs in the hyperthermophilic archaeon Sulfolobus solfataricus. Mol. Microbiol. 55:1812–1828.
62. Zhang, C., et al. 2010. Revealing the essentiality of multiple archaeal pcna genes using a mutant propagation assay based on an improved knockout method. Microbiology 156:3386–3397.
63. Zillig, W., et al. 1998. Genetic elements in the extremely thermophilic archaeon Sulfolobus. Extremophiles 2:131–140.
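The comparison of spacer contents between strains reported under "Diversity of the CRISPR-based immune systems" (some strains share many identical spacers, others have none in common) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, strain labels and sequences below are invented for illustration, and treating a spacer and its reverse complement as the same spacer is an assumption made here.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def canonical(spacer: str) -> str:
    """Strand-independent key: the smaller of the spacer and its reverse complement."""
    forward = spacer.upper()
    return min(forward, reverse_complement(forward))


def shared_spacers(spacers_by_strain):
    """Map every pair of strains to the set of canonical spacers found in both."""
    keys = {
        strain: {canonical(s) for s in spacers}
        for strain, spacers in spacers_by_strain.items()
    }
    return {
        (a, b): keys[a] & keys[b]
        for a, b in combinations(sorted(keys), 2)
    }


# Toy input: invented sequences, not real spacers (assumption for illustration).
first = "ACGTTGCAAGGCTTAACCGGTTAAC"
example = {
    "strain_A": [first, "TTGACCGGTAAACCGTTAGGCCATA"],
    # strain_B carries the first spacer of strain_A on the opposite strand.
    "strain_B": [reverse_complement(first), "CCCGGGAAATTTCCCGGGAAATTTC"],
    "strain_C": ["GGATCCATTGCAGGTACCTTAGCAA"],
}

result = shared_spacers(example)
for pair, common in result.items():
    print(pair, len(common))
```

Exact identity is the strictest possible criterion. Matching spacers against viruses and plasmids at the nucleotide and translated level, as described in the text, would need an alignment-based search instead.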

￿.￿ ￿￿￿￿￿ ￿
Contribution: substantial
All of the bioinformatics behind the section on CRISPR/Cas families was done by myself and Dr. Gisle A. Vestergaard as part of our collaboration on genomic characterisation of archaeal CRISPR systems.

Molecular Biology of Archaea II

CRISPR-based immune systems of the Sulfolobales: complexity and diversity

Roger A. Garrett1, Shiraz A. Shah, Gisle Vestergaard, Ling Deng, Soley Gudbergsdottir, Chandra S. Kenchappa, Susanne Erdmann and Qunxin She
Archaea Centre, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200N Copenhagen K, Denmark

Abstract
CRISPR (cluster of regularly interspaced palindromic repeats)/Cas and CRISPR/Cmr systems of Sulfolobus, targeting DNA and RNA respectively of invading viruses or plasmids, are complex and diverse. We address their classification and functional diversity, and the wide sequence diversity of RAMP (repeat-associated mysterious protein)-motif containing proteins encoded in Cmr modules. Factors influencing maintenance of partially impaired CRISPR-based systems are discussed. The capacity for whole CRISPR transcripts to be generated despite the uptake of transcription signals within spacer sequences is considered. Targeting of protospacer regions of invading elements by Cas protein–crRNA (CRISPR RNA) complexes exhibits relatively low sequence stringency, but the integrity of protospacer-associated motifs appears to be important. Different mechanisms for circumventing or inactivating the immune systems are presented.

Key words: archaeal virus, cluster of regularly interspaced palindromic repeats/Cas module (CRISPR/Cas module), Cmr module, CRISPR RNA (crRNA), protospacer-associated motif (PAM).
Abbreviations used: CRISPR, cluster of regularly interspaced palindromic repeats; crRNA, CRISPR RNA; IS, insertion sequence; PAM, protospacer-associated motif; RAMP, repeat-associated mysterious protein; SIRV1, Sulfolobus islandicus rod-shaped virus 1.
1 To whom correspondence should be addressed (email [email protected]).

Introduction
The discovery of the widespread occurrence of CRISPR (cluster of regularly interspaced palindromic repeat)-based immune systems in archaea and bacteria has provided important insights into how hosts can inactivate and/or regulate invading foreign DNA and, probably, RNA genetic elements. In addition, these systems are likely to influence how co-invading genetic elements can influence one another [1,2]. The two main molecular apparatus involved are structurally complex, partially independent and have diversified functionally. Moreover, their capacity to facilitate the continual uptake of foreign DNA into host chromosomes, and their propensity for transfer between organisms, has important implications for cellular evolution.

The genus Sulfolobus provides an important model system for studying these immune systems. Most Sulfolobus species carry complex and diverse CRISPR-based systems and appear to be particularly active in the uptake of foreign DNA inserts into their CRISPR loci. Furthermore, a broad collection of Sulfolobus genetic elements is available that can be used to challenge the CRISPR-based systems [3]. It includes numerous diverse viruses, many of which have been classified into eight new viral families [4,5], as well as a family of plasmids encoding an archaeal-specific conjugative apparatus [6,7].

Many insights into the complexity of the CRISPR-based immune systems, and their mechanistic diversity, have emerged from detailed experimental studies of CRISPR/Cas and CRISPR/Cmr systems of the archaeal genera Sulfolobus and Pyrococcus respectively, and from investigation of bacterial CRISPR/Cas systems of Streptococcus thermophilus [8,9], Staphylococcus epidermidis [10,11] and Escherichia coli [12]. In the present article, we focus primarily on current knowledge and ideas deriving from, and relating to, the Sulfolobus immune systems.

CRISPR/Cas families: complexity, classification and versatility
At an early stage, it was clear that the CRISPR/Cas and Cmr systems were highly complex when approx. 45 different proteins were implicated in their function [13], and the number has continued to rise [14]. Genes of the two systems are clustered into cas and cmr cassettes which are sometimes linked physically. These cassettes encode a few core proteins, but they also carry different combinations of other genes, some occurring more commonly than others. Thus cassettes vary markedly in their overall gene contents. To illustrate this, core gene structures of the archaeal cas cassettes are shown together with a more complex family I cas cassette from Sulfolobus islandicus HVE10/4 (Figures 1A and 1B). The core cas genes classify into cas group 1, implicated in CRISPR acquisition of foreign DNA and insertion into CRISPR loci, and cas group 2 associated with crRNA (CRISPR RNA) processing and guidance (Figure 1A).

Families of CRISPR/Cas modules have been classified on the basis of gene content and gene order within cas cassettes, and on the basis of conserved sequences of cas genes, leader regions and repeats within CRISPR/Cas modules. For archaea, about eight families have been proposed, whereas among the Sulfolobales, three are common (I–III) and one less so (IV) [2,15,16,17].

Cmr modules carry two conserved core genes, cmr2 and cmr5 (Figure 2A), and a variable number of genes

Biochemical Society Transactions, www.biochemsoctrans.org
Biochem. Soc. Trans. (2011) 39, 51–57; doi:10.1042/BST0390051
©The Authors Journal compilation ©2011 Biochemical Society
Biochemical Society Transactions (2011) Volume 39, part 1

Figure 1 Core genes of archaeal cas cassettes (A) Core genes are divided into putative functional cas groups 1 and 2 (see the text) and the cas6 gene, which encodes an RNA-processing enzyme [18]. (B) Genetic map of a family I CRISPR/Cas module of S. islandicus strain HVE10/4 carrying several non-core cas genes.

encoding diverse proteins which carry RAMP (repeat-associated mysterious protein) motifs. The Cmr modules can be classified into five main families A, B, C, D and E for archaea on the basis of phylogenetic tree building using sequences of Cmr2 and its homologues Csm1 and Csx11 (Figure 2B), where most Sulfolobus Cmr modules fall within families B or D. Further classification is complicated by the presence of multiple diverse copies of genes coding for RAMP-motif-containing proteins. Although these proteins can be classified into families on the basis of these motifs, the remainder of the protein sequences tend to be highly divergent, as illustrated for four proteins encoded in a Cmr family B module of Sulfolobus solfataricus P2 (Figure 2C).

Most Sulfolobus species carry multiple CRISPR/Cas and/or Cmr modules and, given the high energy cost of maintaining and expressing them, they must confer major advantages on to the cell. Clearly, given the molecular and mechanistic complexities of the systems, they can be inactivated readily by incurring a defect in a component or critical sequence motif. Moreover, the systems are potential targets for incoming genetic elements which may attempt to integrate into essential cas or cmr genes, as has been observed for a viral integration in a csa3 gene of S. islandicus strain M.16.4 (see below), or modify their protein products or otherwise interfere with transcription or maturation of crRNAs. Therefore multiple systems will provide added security against unwanted invasion. The pairing of many family I CRISPR/Cas modules may reflect a compromise between providing added security and generating more compact and efficient systems which can potentially be mobilized and transferred between organisms as single units [2].

A further advantage may arise from the presence of different families of CRISPR/Cas modules which is commonly observed for Sulfolobus (e.g. S. solfataricus carries family I and II modules, whereas Sulfolobus acidocaldarius carries those of family II and III) [16]. Their presence may increase versatility in both the uptake of spacers and targeting of protospacers with different PAMs (protospacer-associated motifs).

The presence of multiple Cmr modules is also likely to confer functional versatility, although they are subject to the constraint that some encoded proteins must be able to recognize part of the repeat sequence of the co-inhabiting CRISPR/Cas module [18,19]. Cmr modules are sometimes linked directly to CRISPR/Cas modules on chromosomes and, given their functional interdependence, there is likely to have been some co-evolution of the coupled systems. Consistent with this view, analysis of the Sulfolobales suggests that Cmr family D modules (Figure 2B) are commonly, but not exclusively, found together with family II CRISPR/Cas modules.

CRISPR loci: structural and functional complexity
CRISPR loci consist of regularly spaced direct repeat sequences with intervening spacers deriving from invading foreign DNA elements. Archaeal repeats fall in the size range 23–37 bp and most spacers are 25–50 bp long [20]. CRISPR loci are preceded by a leader region which varies in size from approx. 150 to 550 bp and shows levels of sequence conservation which are only considered significant within specific families of CRISPR/Cas modules. CRISPR locus sizes can also vary considerably, suggesting that rates of spacer turnover differ markedly for different CRISPR loci within a given archaeon. But there is no support for differences occurring between the CRISPR/Cas families of the Sulfolobales, since large and small clusters exist for the most common families I, II and III.

In organisms carrying several CRISPR/Cas modules, including S. solfataricus strains P1 and P2 with six, and S. acidocaldarius with five, they may not all be fully functional. The CRISPR/Cas system exhibits two partially independent functions with one group of Cas proteins responsible for uptake of invader DNA into CRISPR loci and the other for generating crRNAs and guiding them to the invading genetic element (Figure 1). Only the latter proteins are essential for the CRISPR/Cas system to function. Thus non-extending CRISPR loci may still be useful to cells as long as crRNAs are generated. S. acidocaldarius carries two large loci and three smaller ones of 11, five and two spacer-repeat units. All five clusters were transcribed and processed to mature crRNAs [16], but possibly the spacer addition functions are defective for the small clusters. Similarly, for S. solfataricus P1 and P2, of the six CRISPR loci, only four appear to be active in elongation. Of the other two, the smallest (locus E) carries six spacer-repeat units with a leader and no cas genes [16] and does not appear to be transcribed [21]. It carries spacers matching rudiviruses and a conjugative plasmid and is conserved in three S. solfataricus strains (two


Figure 2 Classification of archaeal Cmr modules
(A) Gene map of an archaeal Cmr module showing the conserved core proteins Cmr2 and Cmr5, and the grey boxes represent genes encoding different proteins which carry RAMP motifs. (B) Phylogenetic tree for archaeal Cmr modules based on the Cmr2 protein sequence showing five main families: A, B, C, D and E. The total number of different proteins in each family carrying RAMP motifs is given in parentheses. Trees were prepared using the MUSCLE and ClustalW programs as described previously [17]. (C) Maps of four RAMP motif-containing proteins within a single Cmr family B module of S. solfataricus P2. They illustrate the diverse locations of the two conserved amino acid sequence regions (1 and 2), determined using the MEME program [45]. The remaining sequence regions show very low levels of sequence similarity.
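The section on transcription of CRISPR loci counts spacers that carry potential promoter or terminator motifs, using two criteria: archaeal-type hexameric TATA boxes (at least six consecutive A and Ts with at least two As) and T-rich pyrimidine motifs (at least six consecutive T and Cs with at least five Ts). A minimal Python sketch of one possible reading of those criteria follows. It assumes a sliding 6-nt window and spacers already given on the leader (crRNA) strand. The function names and the toy sequences are hypothetical and are not taken from the article.

```python
def has_window(seq, allowed, counted_base, min_count, width=6):
    """True if some `width`-nt window uses only `allowed` bases and
    contains at least `min_count` copies of `counted_base`."""
    seq = seq.upper()
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) <= allowed and window.count(counted_base) >= min_count:
            return True
    return False


def has_tata_like(spacer):
    """Six consecutive A/T positions with at least two As (window reading)."""
    return has_window(spacer, {"A", "T"}, "A", 2)


def has_t_rich(spacer):
    """Six consecutive T/C positions with at least five Ts (window reading)."""
    return has_window(spacer, {"C", "T"}, "T", 5)


def count_motifs(spacers):
    """Count spacers carrying each motif type."""
    return {
        "total": len(spacers),
        "tata_like": sum(has_tata_like(s) for s in spacers),
        "t_rich": sum(has_t_rich(s) for s in spacers),
    }


# Invented toy sequences, not real spacers (assumption for illustration).
toy_spacers = [
    "GGCTATAAAGGC",  # contains the A/T hexamer TATAAA
    "GGCTTTTTTGGC",  # contains TTTTTT: T-rich, but no As
    "GGCTTCTCCGGC",  # pyrimidine run with too few Ts
    "ACGTACGTACGT",  # no qualifying run
]
counts = count_motifs(toy_spacers)
print(counts)
```

The window reading is stricter than a reading in which the two As may lie anywhere in a longer A/T run. The article does not say which was used, so counts from this sketch need not reproduce the published figures.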


from Naples, Italy) with only the final downstream spacer differing between the P1/P2 strains and strain 98/2 (Figure 3). Moreover, it is also found on a highly conserved 36 kb chromosomal fragment (99% sequence identity) in the S. islandicus strain L.D.8.5 (from Lassen, CA, U.S.A.) [22], with an almost identical leader region (one mismatch) and identical repeat sequence but different spacers (Figure 3). The maintenance and spreading of locus E, lacking a cas cassette, would suggest that the CRISPR module can be activated and generate crRNAs. The inference that Cas proteins encoded in one CRISPR/Cas module can activate other CRISPR loci would also be consistent with the inference that the group 1 cas genes (Figure 1A) can exchange between CRISPR/Cas modules [2].

Figure 3 A map of the CRISPR locus E
Locus E is found in S. solfataricus strains P1, P2 and 98/2 and the S. islandicus strain L.D.8.5 [22]. Triangles represent spacer-repeat units that are colour-coded for matching sequences: red, rudivirus and blue, conjugative plasmid. The shaded spacer-repeat units carry identical sequences. L represents the leader region. The 36 kb genomic region flanking the locus (grey region) is conserved at >99% sequence identity in all four strains.

The large inactive locus F with 88 spacer-repeat units is completely conserved in sequence between S. solfataricus strains P1 and P2, but it lacks a leader region, and, although transcription occurs internally within the CRISPR locus, mature crRNAs are not generated [21,23]. Thus the latter, which has been lost from S. solfataricus strain 98/2, may be of little use when a viral infection occurred.

Generally for Sulfolobus species, loss of mobile DNA elements is difficult, thus IS (insertion sequence) elements tend to degenerate rather than be deleted [24], and this may also apply to CRISPR/Cas and Cmr modules, and explain the maintenance of defective CRISPR systems over long periods, although in a variant strain of S. solfataricus P2 (P2A), four physically linked CRISPR/Cas modules (A–D) were apparently lost via a single recombination event between bordering IS elements [25].

Transcription of CRISPR loci and processing
Processed CRISPR transcripts were first observed for the euryarchaeon Archaeoglobus fulgidus and crenarchaeon S. solfataricus, and these studies revealed the regular pattern of the RNA processing, using probes specific for repeat sequences [26,27]. Subsequently, the smallest Sulfolobus RNA product of approx. 40 bp was identified covering primarily a single spacer sequence [20]. S. acidocaldarius CRISPR loci are transcribed upstream from the first repeat within the leader region and termination occurs downstream from the final repeat. Even for the locus carrying 78 spacer-repeat units (4930 bp), a substantial proportion of transcripts were approx. 5000 nt long with another large portion in the size range 3000–3500 nt [16].

This raised an important question as to how transcription continues throughout CRISPR loci apparently unimpeded by the presence of spacers carrying archaea-specific promoter or terminator motifs, given that the DNA uptake mechanism is essentially statistically random [15]. A compilation of potential promoter and terminator motifs on the leader (crRNA) strand of the available Sulfolobus genomes revealed, for a total of 4505 spacers, 2560 carrying archaeal-type hexameric TATA boxes (at least six consecutive A and Ts with at least two As) and 725 with T-rich pyrimidine motifs (at least six consecutive T and Cs with at least five Ts) [28,29]. Although many of these may at best be weakly effective, nevertheless, given the high gene density in the Sulfolobus viral and plasmid genomes and the low frequency of operon structures, the probability of taking up such active motifs is significant. The conclusion that transcripts do not normally start within CRISPR loci is also supported by examination of CRISPR transcripts from S. solfataricus P2 transcriptome data [21], which indicate that most of the detectable 5′-ends are attributable to processing sites within repeats [21]. A possible explanation for the unimpeded transcription through the CRISPR loci could be the presence of the CRISPR-binding protein of Sulfolobus and other crenarchaea [30]; it could act as a transcription factor inhibiting transcriptional starts and stops within the spacer sequences, and repeats.

Full-length transcripts are also produced from the opposite DNA strand of CRISPR loci of S. acidocaldarius which yield discrete 50–60 bp fragments carrying spacer sequences, albeit at lower molar levels than for the crRNAs [16], and antisense RNA transcripts also were detected for CRISPR loci of S. solfataricus P2 [21]. Failure to detect similar transcripts in the euryarchaeon Pyrococcus and bacterium E. coli [12,19] suggests that this may be a specific property of Sulfolobus or crenarchaea. Analyses of cDNA libraries of S. solfataricus demonstrated previously that antisense RNAs are commonly produced especially against transposase mRNAs [27], and several other antisense RNAs have been detected for this organism [21]. Given that mature crRNAs are produced in the absence of infecting genetic elements in different Sulfolobus species [16,20,23], one possible explanation is that these antisense RNAs protect at least a fraction of the crRNAs against degradation before their activation.

Maturation of crRNAs and stringency of targeting mechanisms
Details of RNA-processing mechanism have been elucidated for a euryarchaeal CRISPR/Cmr system and an E. coli CRISPR/Cas system where Cas6 homologues cut in the repeat, 8 nt 5′ from the start of the spacer sequence, whereas the 3′-processing sites differ [12,18]. For S. solfataricus, many 5′-ends, and putative processing sites, are detectable 6–8 nt from the spacer start [21], suggesting that a similar mechanism

© The Authors Journal compilation © 2011 Biochemical Society. Molecular Biology of Archaea II

operates. Processing at the 3′-end of the crRNA is less clearly defined, but for the CRISPR/Cmr system of Pyrococcus, a 14 nt ruler mechanism enables the processing ribonuclease to generate dual cuts at 5 and 11 nt into the spacer sequence [31]. Presumably, crRNA-binding Cas and Cmr proteins distinguish between the different crRNA products before targeting the foreign DNA or RNA respectively.

Until recently, attention focused on targeting of double-stranded DNA elements, but probably single-stranded DNA will also be targeted by the CRISPR/Cas system. It remains an open question whether the CRISPR/Cmr system targets both mRNA and viral RNA, and incorporation of viral RNA into CRISPR loci would require reverse transcriptase activity. Nevertheless, all evidence suggests that the primary targets of the Sulfolobus immune systems are viruses and plasmids and, probably, their mRNAs. There is no support for a general targeting of transposable elements. Spacers matching transposase genes are occasionally found in CRISPR loci [16,20,32], but they can generally be attributed to transposase genes present in viruses or plasmids, in particular orphan orfB elements (family IS605/200) for Sulfolobus [2,15].

Effective targeting of genetic elements requires that the mature crRNA anneals to the protospacer DNA region. Although, for the bacterium S. thermophilus, a perfect sequence match was required to elicit a response from the CRISPR/Cas system [9], studies on different Sulfolobus strains have shown that a less stringent recognition system prevails. Challenging Sulfolobus cells with viral genes carrying one to three mismatches still produced a strong response from the CRISPR/Cas system [23]. Another important factor is the motif known as PAM. Targeted genetic elements carry this short sequence motif which creates a mismatch with the 5′-end of the crRNA [16,33,34]. For Sulfolobus, this was defined as a family-specific dinucleotide, displaced 1 nt from the spacer sequence [15,16]. Potentially, this can be involved in both selection of protospacers for excision by Cas proteins and crRNA targeting. Whereas a study of the bacterium S. epidermidis concluded that the PAM was not important for protospacer targeting and that any mismatched base pairing would suffice [11], for S. islandicus strain REY15A, altering the PAM led to a loss of crRNA targeting [23].

Anti-immune systems

Although a few archaeal viruses have been shown to be lytic and to elicit strong immune responses, many Sulfolobus viruses and plasmids coexist in a stable relationship, at low copy numbers, over longer periods. Although these genetic elements do not appear to be targeted by the host CRISPR systems, the latter could nevertheless have a regulatory role possibly by targeting mRNAs.

Another special feature of archaeal genetic elements is that they often carry an integrase gene which partitions on chromosomal integration. Consequently, the integrated element can only be excised when the free element is present to generate an intact integrase/excision enzyme [35]. Thus targeting and degradation of the free genetic element by the host CRISPR/Cas system could actually favour entrapment of the integrated element, and such a process could enhance viral and plasmid evolution in archaea. The Redder Model [36] for archaeal viral evolution hypothesized that, since more than one type of fusellovirus can integrate at a given att site within a tRNA gene, the encaptured concatenated viruses would tend to recombine thereby generating, and subsequently releasing, hybrid fuselloviruses [36]. A similar process may occur for Sulfolobus-specific conjugative plasmids. They are also integrative, and their DNA is regularly incorporated into CRISPR loci as spacers [16,20]. Moreover, this could explain why some of the different Icelandic conjugative plasmids cultivated in Wolfram Zillig’s laboratory [37] often carry large regions of almost identical nucleotide sequence [6,7]. Thus, indirectly, the CRISPR/Cas systems could be fuelling production of new viral and plasmid variants which they may subsequently be required to inactivate.

Some insights into how genetic elements undermine or avoid the CRISPR immune systems were gained by passing the rudivirus SIRV1 (Sulfolobus islandicus rod-shaped virus 1) through a series of closely related S. islandicus strains. This generated many sequence changes in the viral genes, but striking was the frequent occurrence of genes that were altered by 12 bp indels, probably deletions [38]. When similar 12 bp indels were observed among related lipothrixviruses, it was inferred that these might occur at crRNA-targeting protospacers on the viral genomes [39]. In another study of a hyperthermophilic archaeal virus, HAV1 (hyperthermophilic archaeal virus 1), cultured in a bioreactor over a 2-year period, samples taken at different times showed genome sequence changes, not unlike those observed earlier for SIRV1, but also a series of recombination sites were detected along the linear genome at which frequent rearrangements had occurred to generate viral variants with altered sequences [40].

Although accumulating specific sequence changes in genetic elements is an effective way of avoiding, at least temporarily, crRNA targeting, more direct methods must also have evolved. Thus, for the S. islandicus strain M.16.4, an M164 provirus 1 has inserted into, and disrupted, the csa3 gene considered to encode the transcriptional regulator of the group 1 cas genes (Figure 1A) associated with new spacer uptake [17]. This has the advantage for the virus that other infecting viruses will still be attacked by crRNAs if matching spacers are already present in the CRISPR locus, but new spacers cannot be generated from M164 provirus itself.

Other possible mechanisms were discerned from a study in which CRISPR systems of Sulfolobus were challenged directly by vectors carrying viral genes or protospacers showing various degrees of matching to host CRISPR spacers which mimicked, to a degree, the continual infection of a host cell with a given virus [23]. In many viable transformants, CRISPR locus deletions, including the matching spacer, had occurred, whereas in others, whole CRISPR/Cas cassettes were lost. However, several transformants revealed no changes in either CRISPR/Cas modules or vector constructs,

Biochemical Society Transactions (2011) Volume 39, part 1

suggesting that other unknown regulatory mechanisms can inactivate the immune system [23].

CRISPR/Cas and Cmr module mobility

Sulfolobus CRISPR/Cas and Cmr modules generally occur within variable chromosomal regions where extensive gene shuffling has occurred [2,41], often attributable to high levels of transposable elements. Recombination at bordering IS elements can also lead to loss of CRISPR/Cas or Cmr modules [25]. There is also strong evidence in support of the transfer of whole modules between organisms based on comparative studies of CRISPR/Cas module families and their locations, although the transfer mechanisms remain unclear [2]. For bacteria, evidence was provided for transfer of these modules on large plasmids [42], but many archaeal CRISPR/Cas modules are large, up to 25 kb, and the largest conjugative plasmids are only approx. 40 kb [6]. Chromosomal conjugation may provide a vehicle, possibly facilitated by encaptured Sulfolobus conjugative plasmids [43,44] or presently unknown mechanisms may operate, possibly within biofilms. Finally, although phylogenetic analyses support the transfer of CRISPR/Cas and Cmr modules between archaea and bacteria, the basic differences in archaeal and bacterial transcriptional and translational mechanisms and in the unique cell wall, membrane structures and conjugative system of archaea provide formidable barriers to transfer between domains [2].

Funding

Research was supported by grants from the Danish Natural Science Research Council [grant number 272-08-0391], the Danish Research Council for Technology and Production [grant number 274-07-0116] and the Danish National Research Foundation.

References

1 Karginov, F.V. and Hannon, G.J. (2010) The CRISPR system: small RNA-guided defense in bacteria and archaea. Mol. Cell 37, 7–19
2 Shah, S.A. and Garrett, R.A. (2010) CRISPR/Cas and Cmr modules, mobility and evolution of an adaptive immune system. Res. Microbiol., doi:10.1016/j.resmic.2010.09.001
3 Zillig, W., Arnold, H.P., Holz, I., Prangishvili, D., Schweier, A., Stedman, K., She, Q., Phan, H., Garrett, R. and Kristjansson, J.K. (1998) Genetic elements in the extremely thermophilic archaeon Sulfolobus. Extremophiles 2, 131–140
4 Prangishvili, D., Forterre, P. and Garrett, R.A. (2006) Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 11, 837–848
5 Lawrence, C.M., Menon, S., Eilers, B.J., Bothner, B., Khayat, R., Douglas, T. and Young, M.J. (2009) Structural and functional studies of archaeal viruses. J. Biol. Chem. 284, 12599–12603
6 Greve, B., Jensen, S., Brügger, K., Zillig, W. and Garrett, R.A. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1, 231–23
7 Erauso, G., Stedman, K.M., van de Werken, H.J.G., Zillig, W. and van der Oost, J. (2006) Two novel conjugative plasmids from a single strain of Sulfolobus. Microbiology 152, 1951–1968
8 Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., Romero, D.A. and Horvath, P. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712
9 Horvath, P., Romero, D.A., Coûté-Monvoisin, A.-C., Richards, M., Deveau, H., Moineau, S., Boyaval, P., Fremaux, C. and Barrangou, R. (2008) Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J. Bacteriol. 190, 1401–1412
10 Marraffini, L.A. and Sontheimer, E.J. (2008) CRISPR interference limits horizontal gene transfer in staphylococci by targeting DNA. Science 322, 1843–1845
11 Marraffini, L.A. and Sontheimer, E.J. (2010) Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, 568–571
12 Brouns, S.J., Jore, M.M., Lundgren, M., Westra, E.R., Slijkhuis, R.J., Snijders, A.P., Dickman, M.J., Makarova, K.S., Koonin, E.V. and van der Oost, J. (2008) Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, 960–964
13 Haft, D.H., Selengut, J., Mongodin, E.F. and Nelson, K.E. (2005) A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput. Biol. 1, 474–483
14 Makarova, K.S., Grishin, N.V., Shabalina, S.A., Wolf, Y.I. and Koonin, E.V. (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1, 7
15 Shah, S.A., Hansen, N.R. and Garrett, R.A. (2009) Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem. Soc. Trans. 37, 23–28
16 Lillestøl, R.K., Shah, S.A., Brügger, K., Redder, P., Phan, H., Christiansen, J. and Garrett, R.A. (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol. Microbiol. 72, 259–272
17 Shah, S.A., Vestergaard, G. and Garrett, R.A. (2011) CRISPR/Cas and CRISPR/Cmr immune systems of archaea. In Regulatory RNAs in Prokaryotes (Marchfelder, A. and Hess, W., eds), Springer, Berlin, in the press
18 Carte, J., Wang, R., Li, H., Terns, R.M. and Terns, M.P. (2008) Cas6 is an endoribonuclease that generates guide RNAs for invader defense in prokaryotes. Genes Dev. 22, 3489–3496
19 Hale, C., Kleppe, K., Terns, R.M. and Terns, M.P. (2008) Prokaryotic silencing (psi)RNAs in Pyrococcus furiosus. RNA 14, 1–8
20 Lillestøl, R.K., Redder, P., Garrett, R.A. and Brügger, K. (2006) A putative viral defence mechanism in archaeal cells. Archaea 2, 59–72
21 Wurtzel, O., Sapra, R., Chen, F., Zhu, Z.Y., Simmons, B.A. and Sorek, R. (2010) A single-base resolution map of an archaeal transcriptome. Genome Res. 20, 133–141
22 Reno, M.L., Held, N.L., Fields, C.J., Burke, P.V. and Whitaker, R.J. (2009) Biogeography of the Sulfolobus islandicus pan-genome. Proc. Natl. Acad. Sci. U.S.A. 106, 8605–8610
23 Gudbergsdottir, S., Deng, L., Chen, Z., Jensen, J.V.K., Jensen, L.R., She, Q. and Garrett, R.A. (2011) Dynamic properties of the Sulfolobus CRISPR/Cas and CRISPR/Cmr systems when challenged with vector-borne viral and plasmid genes and protospacers. Mol. Microbiol. 79, 35–49
24 Blount, Z.D. and Grogan, D.W. (2005) New insertion sequences of Sulfolobus: functional properties and implications for genome evolution in hyperthermophilic archaea. Mol. Microbiol. 55, 312–325
25 Redder, P. and Garrett, R.A. (2006) Mutations and rearrangements in the genome of Sulfolobus solfataricus P2. J. Bacteriol. 188, 4198–4206
26 Tang, T.-H., Bachellerie, J.-P., Rozhdestvensky, T., Bortolin, M.-L., Huber, H., Drungowski, M., Elge, T., Brosius, J. and Hüttenhofer, A. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl. Acad. Sci. U.S.A. 99, 7536–7541
27 Tang, T.-H., Polacek, N., Zywicki, M., Huber, H., Brügger, K., Garrett, R.A., Bachellerie, J.P. and Hüttenhofer, A. (2005) Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55, 469–481


28 Torarinsson, E., Klenk, H.P. and Garrett, R.A. (2005) Divergent transcriptional and translational signals in Archaea. Environ. Microbiol. 7, 47–54
29 Santangelo, T.J., Cubonová, L., Skinner, K.M. and Reeve, J.N. (2009) Archaeal intrinsic transcription termination in vivo. J. Bacteriol. 191, 7102–7108
30 Peng, X., Brügger, K., Shen, B., Chen, L., She, Q. and Garrett, R.A. (2003) Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in Sulfolobus genomes. J. Bacteriol. 185, 2410–2417
31 Hale, C.R., Zhao, P., Olson, S., Duff, M.O., Graveley, B.R., Wells, L., Terns, R.M. and Terns, M.P. (2009) RNA-guided RNA cleavage by a CRISPR RNA–Cas protein complex. Cell 139, 945–956
32 Held, N.L. and Whitaker, R.J. (2009) Viral biogeography revealed by signatures in Sulfolobus islandicus genomes. Environ. Microbiol. 11, 457–466
33 Deveau, H., Barrangou, R., Garneau, J.E., Labonté, J., Fremaux, C., Boyaval, P., Romero, D.A., Horvath, P. and Moineau, S. (2008) Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, 1390–1400
34 Mojica, F.J., Diez-Villasenor, C., Garcia-Martinez, J. and Almendros, C. (2009) Short motif sequences determine the targets of the prokaryotic CRISPR system. Microbiology 155, 733–740
35 She, Q., Peng, X., Zillig, W. and Garrett, R.A. (2001) Gene capture events in archaeal chromosomes. Nature 409, 478
36 Redder, P., Peng, X., Brügger, K., Shah, S.A., Roesch, F., Greve, B., She, Q., Schleper, C., Forterre, P., Garrett, R.A. and Prangishvili, D. (2009) Four newly isolated fuselloviruses from extreme geothermal environments reveal unusual morphologies and a possible interviral recombination mechanism. Environ. Microbiol. 11, 2849–2862
37 Prangishvili, D., Albers, S.V., Holz, I., Arnold, H.P., Stedman, K., Klein, T., Singh, H., Hiort, J., Schweier, A., Kristjansson, J.K. and Zillig, W. (1998) Conjugation in archaea: frequent occurrence of conjugative plasmids in Sulfolobus. Plasmid 40, 190–202
38 Peng, X., Kessler, A., Phan, H., Garrett, R.A. and Prangishvili, D. (2004) Multiple variants of the archaeal DNA rudivirus SIRV1 in a single host and a novel mechanism of genomic variation. Mol. Microbiol. 54, 366–375
39 Vestergaard, G., Shah, S.A., Bize, A., Reitberger, W., Reuter, M., Phan, H., Briegel, A., Rachel, R., Garrett, R.A. and Prangishvili, D. (2008) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J. Bacteriol. 190, 6837–6845
40 Garrett, R.A., Prangishvili, D., Shah, S.A., Reuter, M., Stetter, K. and Peng, X. (2010) Metagenomic analyses of novel viruses, plasmids, and their variants, from an environmental sample of hyperthermophilic neutrophiles cultured in a bioreactor. Environ. Microbiol. 12, 2918–2930
41 Brügger, K., Torarinsson, E., Chen, L. and Garrett, R.A. (2004) Shuffling of Sulfolobus genomes by autonomous and non-autonomous mobile elements. Biochem. Soc. Trans. 32, 179–183
42 Godde, J.S. and Bickerton, A. (2006) The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J. Mol. Evol. 62, 718–729
43 Aagaard, C., Dalgaard, J. and Garrett, R.A. (1995) Inter-cellular mobility and homing of an archaeal rDNA intron confers selective advantage over intron-cells of Sulfolobus acidocaldarius. Proc. Natl. Acad. Sci. U.S.A. 92, 12285–12289
44 Grogan, D.W. (1996) Exchange of genetic markers at extremely high temperatures in the archaeon Sulfolobus acidocaldarius. J. Bacteriol. 178, 3207–3211
45 Bailey, T.L., Williams, N., Misleh, C. and Li, W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 34, 369–373

Received 30 September 2010
doi:10.1042/BST0390051


Contribution: substantial

My contribution to this manuscript comprised most of the bioinformatical analyses done following the full annotation of the genome. This work includes figures 1, 3, 4 and 5 and interpretations of the data these figures are based on.

Extremophiles (2011) 15:487–497
DOI 10.1007/s00792-011-0379-y

ORIGINAL PAPER

Genomic analysis of Acidianus hospitalis W1 a host for studying crenarchaeal virus and plasmid life cycles

Xiao-Yan You • Chao Liu • Sheng-Yue Wang • Cheng-Ying Jiang • Shiraz A. Shah • David Prangishvili • Qunxin She • Shuang-Jiang Liu • Roger A. Garrett

Received: 4 March 2011 / Accepted: 26 April 2011 / Published online: 24 May 2011
© The Author(s) 2011. This article is published with open access at Springerlink.com

Abstract The Acidianus hospitalis W1 genome consists of a minimally sized chromosome of about 2.13 Mb and a conjugative plasmid pAH1 and it is a host for the model filamentous lipothrixvirus AFV1. The chromosome carries three putative replication origins in conserved genomic regions and two large regions where non-essential genes are clustered. Within these variable regions, a few orphan orfB and other elements of the IS200/607/605 family are concentrated with a novel class of MITE-like repeat elements. There are also 26 highly diverse vapBC antitoxin–toxin gene pairs proposed to facilitate maintenance of local chromosomal regions and to minimise the impact of environmental stress. Complex and partially defective CRISPR/Cas/Cmr immune systems are present and interspersed with five vapBC gene pairs. Remnants of integrated viral genomes and plasmids are located at five intron-less tRNA genes and several non-coding RNA genes are predicted that are conserved in other Sulfolobus genomes. The putative metabolic pathways for sulphur metabolism show some significant differences from those proposed for other Acidianus and Sulfolobus species. The small and relatively stable genome of A. hospitalis W1 renders it a promising candidate for developing the first Acidianus genetic systems.

Keywords Toxin–antitoxin · VapBC · CRISPR · Sulphur metabolism · OrfB element · MITE

Communicated by L. Huang.

X.-Y. You and C. Liu contributed equally to this work.

X.-Y. You · C.-Y. Jiang · S.-J. Liu (✉)
State Key Laboratory of Microbial Resources and Center for Environmental Microbiology, Institute of Microbiology, Chinese Academy of Sciences, Bei-Chen-Xi-Lu No. 1 Chao-Yang District, Beijing 100101, People’s Republic of China
e-mail: [email protected]

C. Liu · S. A. Shah · Q. She · R. A. Garrett (✉)
Archaea Centre, Department of Biology, Copenhagen University, Ole Maaløes Vej 5, 2200 N Copenhagen, Denmark
e-mail: [email protected]

S.-Y. Wang
Shanghai-MOST Key Laboratory of Health and Disease Genomics, Chinese National Human Genome Center, Shanghai, People’s Republic of China

D. Prangishvili
Molecular Biology of the Gene in Extremophiles Unit, Institut Pasteur, rue Dr Roux 25, 75724 Paris Cedex, France

Introduction

The Acidianus genus consists of acidothermophiles which grow optimally and slowly in the temperature range 65–95 °C and at pH 2–4 and belongs to the order Sulfolobales. Acidianus species are chemolithoautotrophic and facultatively anaerobic and are generally versatile physiologically. Depending on the culturing conditions, they can either reduce S° to H2S, catalysed by a sulphur reductase and hydrogenase, or oxidise S° to H2SO4 utilising the sulphur oxygenase-reductase holoenzyme (Kletzin 1992, 2007). In contrast to several Sulfolobus species, the genomic properties of an Acidianus species have not been analysed. The Sulfolobales have been a rich source of genetic elements, including novel conjugative plasmids (Prangishvili et al. 1998; Greve et al. 2004) and several exceptional and diverse viruses many of which have now been classified into eight new viral families (Rachel et al. 2002; Prangishvili et al. 2006; Lawrence et al. 2009).


Acidianus hospitalis W1 is the first Acidianus strain to be isolated carrying a conjugative plasmid pAH1 which is a member of the plasmid family predicted to generate an archaea-specific conjugative apparatus (Greve et al. 2004; Basta et al. 2009). These plasmids are also integrative elements and in an encaptured state have been implicated in facilitating chromosomal DNA conjugation for some Sulfolobus species (Chen et al. 2005b). A. hospitalis is also a viable host for the model Acidianus alpha lipothrixvirus AFV1, a filamentous virus carrying exceptional claw-like structures at its termini which is currently the subject of detailed structural studies (Bettstetter et al. 2003; Goulet et al. 2009). Infection of A. hospitalis with AFV1 was shown to lead to a loss of the plasmid pAH1 and this contrasts with observations in bacteria where endogenous plasmids tend to determine the fate of an incoming phage (Basta et al. 2009).

In order to study further the metabolic capability of an Acidianus species and to examine the molecular mechanisms involved in virus–plasmid–host interactions, it was important to sequence and annotate the A. hospitalis genome. To date, most genomic studies of the Sulfolobales have concentrated on Sulfolobus species that have revealed relatively large genomes generally exhibiting high levels of transposable and integrated genetic elements, as well as considerable genetic diversity (Guo et al. 2011). Analysis of the A. hospitalis genome revealed a minimally sized chromosome that appeared relatively stable with few transposable elements and no evidence of recent integration events, apart from the reversible integration of pAH1 into a tRNAArg gene (Basta et al. 2009). Potentially, therefore, A. hospitalis W1 could provide a suitable host for developing genetic systems for the Acidianus genus.

Materials and methods

Genome sequencing and gap closure

Genomic DNA of A. hospitalis was sequenced using a Roche 454 Genome Sequencer FLX instrument (Titanium) with an average 19-fold coverage. All useful reads were initially assembled into seven contigs (>500 bp) using the Newbler assembler software (http://www.454.com/). Gaps were closed by a Multiplex PCR strategy and PCR products were gel purified and sequenced using an ABI3730 DNA sequenator. Raw sequence data were assembled into contigs using phred/phrap/consed software and the final consensus quality for each base was above 30 (http://www.phrap.org).

Sequence analysis and gene annotation

Initially, ORFs were predicted using the programmes Glimmer and FgeneSB and protein function predictions were obtained from the following searches: (1) homology searches in the GenBank (http://www.ncbi.nlm.nih.gov/) and UniProt protein (http://www.ebi.ac.uk/uniprot/) databases, (2) function assignment searches in the Sulfolobus database (http://www.Sulfolobus.org/), and (3) domain or motif searches in the local CDD database (http://www.ncbi.nlm.nih.gov/cdd/), the InterPro and the Pfam databases. The KEGG database (http://www.genome.jp/kegg/) was used to reconstruct metabolic pathways in silico. Membrane proteins were predicted by Phobius, TMHMM and ConPred II programmes. Secretory proteins were divided into two groups; those with a signal peptide were predicted using the SignalP 3.0 (http://www.cbs.dtu.dk/services/SignalP/) and non-classical secretory proteins, lacking a signal peptide, were predicted by the SecretomeP 2.0 programme (http://www.cbS.dtu.dk/services/SecretomeP/). Transporters were predicted by searching the TCDB database (http://www.tcdp.org) using BLASTP with E values lower than 1e-05. Insertion sequence (IS) elements and transposases were identified by BLASTN searches against the IS Finder database (http://www-is.biotoul.fr/). The MITE-like elements were detected using the programme LUNA (Brügger K, unpublished). Potential frameshifts were checked by sequencing after manual annotation and any remaining frameshifts were considered to be authentic. tRNA genes and their introns were identified using tRNAScan-SE (Lowe and Eddy 1997). All annotations were manually curated using Artemis software (Rutherford et al. 2000). Start codons for single genes and first genes of Sulfolobus operons were generally located 25–30 bp downstream from the archaeal hexameric TATA-like box. Only genes within operons were preceded by Shine–Dalgarno motifs, where GGUG dominated (Torarinsson et al. 2005). Where alternative start codons occur, a selection was made on the basis of experimental data when available or on its location relative to a putative promoter and/or Shine–Dalgarno motif. The genome sequence accession number at Genbank/EMBL is CP002535.

Results

Genomic properties

The A. hospitalis genome consists of a circular chromosome of 2,137,654 bp and a circular conjugative plasmid pAH1 of 28,644 bp. The chromosome has a GC content of 34.2% and carries 2,389 predicted open reading frames (ORFs), of which about half are assigned putative functions with many of the conserved hypothetical proteins being archaea-specific or specific to the Sulfolobales. About 320 of the encoded proteins are putative membrane proteins and a further 182 are predicted to be secretory proteins.
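As a rough illustration of the summary statistics quoted for the chromosome, the following sketch computes the GC content of a sequence and the average spacing of ORFs. The function name `gc_content` and the toy sequence are assumptions made for illustration only; this is not the procedure used for the A. hospitalis annotation. The chromosome length and ORF count are the values given in the text.

```python
def gc_content(seq: str) -> float:
    """Return the GC fraction of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / (gc + at)


# Hypothetical toy sequence, not taken from the A. hospitalis genome.
toy = "ATGCATATTAGC"
print(f"GC content of toy sequence: {gc_content(toy):.1%}")

# Values quoted in the text: chromosome length and predicted ORF count.
chromosome_bp = 2_137_654
predicted_orfs = 2_389
print(f"Average spacing: {chromosome_bp / predicted_orfs:.0f} bp per ORF")
```

Applied to the full chromosome sequence (accession CP002535), a function of this kind would be expected to give a value close to the 34.2% reported, although ambiguous bases may be handled differently by other tools.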


The plasmid sequence is identical to that of the conjugative plasmid pAH1 isolated earlier from the A. hospitalis strain W1, except that it is 4 bp shorter (Basta et al. 2009).

Comparison of the A. hospitalis genome with those of other members of the Sulfolobales provided no evidence of extensive conservation of gene synteny, in contrast to that observed for large regions of several Sulfolobus genomes (Guo et al. 2011), and consistent with A. hospitalis being relatively distant phylogenetically from these strains (Basta et al. 2009). Nevertheless, the genome carries two major regions that are predicted to be relatively labile. They extend approximately from positions 75,000–444,500 and from 1,300,000–1,870,000 and carry most of the transposable elements, all of the CRISPR loci and cas and cmr family genes, most of the vapBC toxin–antitoxin gene pairs, and many genes involved in transport-related functions and metabolism, as well as a degenerate fuselloviral genome (Fig. 1). These two regions lack genes essential for informational processes including DNA replication, transcription and translation and they appear to constitute sites where non-essential genes are collected, interchanged, exchanged intercellularly and where genetic innovation may occur, similarly to a single variable region observed in several Sulfolobus genomes (Guo et al. 2011).

Three origins of chromosomal replication, demonstrated experimentally for Sulfolobus species (Robinson et al. 2004; Lundgren et al. 2004), were also predicted to occur in the Acidianus genome. The Y component of a Z curve analysis (Zhang and Zhang 2003) revealed two major peaks corresponding to the cdc6-3 gene (Ahos0001), and the whiP/cdt1 gene (Ahos1370) and a broader peak coinciding with the cdc6-1 gene (Ahos0780) (Fig. 1), where the three genes encode putative replication initiators (Robinson and Bell 2007). The sequences of the cdc6 genes and whiP gene are quite conserved relative to the S. solfataricus and S. islandicus genomes, as is the synteny of the flanking genes except for the region immediately downstream from cdc6-3.

Integrated genetic elements

Integration of genetic elements, generally fuselloviruses or conjugative plasmids at tRNA genes, occurs commonly for genomes of the Sulfolobales (She et al. 1998; Guo et al. 2011). Most integration events occur via a reversible archaea-specific mechanism whereby the integrase gene partitions into two sections which border the integrated element and the N-terminal-encoding region carrying the intN sequence overlaps with the tRNA gene (Muskhelishvili et al. 1993). Elements that become encaptured within the chromosome subsequently degenerate and are gradually lost, but will nevertheless leave a trace because the intN fragment overlapping the tRNA gene is generally retained (She et al. 1998) (Table 1).

Earlier plasmid pAH1 was sequenced and shown to integrate reversibly into a tRNAArg gene (Basta et al. 2009). Genome sequencing of A. hospitalis revealed that a low fraction of reads matched to the junctions of the integrated plasmid whilst the majority matched the unpartitioned integrase gene of pAH1, consistent with both integrated and free forms being present in the culture. The integration site of pAH1 was located at genome positions 1,075,876–1,075,946 bp within the gene of tRNAArg [TCG] (Table 1). In addition, the chromosome carries remnants of integrated elements adjoining another five intron-less tRNA genes, each consisting of a few genes or pseudogenes (Table 1). Three derive from fuselloviruses, one from a pDL10-like plasmid of the pRN family of cryptic plasmids (Kletzin et al. 1999) and another originates from an unknown element (Table 1). Whether these all derive from single integration events remains unclear

[Figure 1: plot of the y-component (vertical axis, -4K to 10K) against genome length (horizontal axis, 0.5M to 2M). Marked on the plot: origins 1, 2 and 3; the 5S, 16S and 23S rRNA genes; Family I and Family II CRISPRs; transposable elements; toxin/antitoxin systems.]

Fig. 1 The Y component of a Z curve plot for the A. hospitalis chromosome showing the three putative replication origins. The positions of the cdc6-3 gene (origin 2), cdc6-1 gene (origin 3) and the whiP/cdt1 gene (origin 1) are indicated as well as locations of the ribosomal RNA genes, the CRISPR-based systems, transposable elements of the IS200/605/607 family, and vapBC antitoxin–toxin gene pairs
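The quantity plotted in Fig. 1 can be sketched as follows, assuming the standard Z curve definition in which the Y component at position n is the cumulative excess of amino bases (A, C) over keto bases (G, T) in the first n bases. The function names and the toy sequence are hypothetical, and this is a minimal sketch rather than the implementation used to produce the figure.

```python
from itertools import accumulate

# Assumed convention: amino bases (A, C) step up, keto bases (G, T) step down.
_STEP = {"A": 1, "C": 1, "G": -1, "T": -1}


def z_curve_y(seq: str) -> list[int]:
    """Return y_n = (A_n + C_n) - (G_n + T_n) for every prefix of seq.

    Symbols other than A, C, G and T (for example N) leave the curve unchanged.
    """
    return list(accumulate(_STEP.get(base, 0) for base in seq.upper()))


def extreme_positions(y: list[int]) -> tuple[int, int]:
    """Return the 1-based positions of the global maximum and minimum of y."""
    idx = range(len(y))
    return max(idx, key=y.__getitem__) + 1, min(idx, key=y.__getitem__) + 1


toy = "AACCAGGTTGTT"  # hypothetical toy sequence
y = z_curve_y(toy)
print(y)
print(extreme_positions(y))
```

On a real chromosome the curve would normally be inspected for several local extrema rather than a single global one, and candidate origins would be judged together with the positions of genes such as cdc6 and whiP, as is done in the text.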

123 490 Extremophiles (2011) 15:487–497

Table 1 Integration events at tRNA genes showing the numbers of residual integrated genes

tRNA                      Intron   Ahos W1
Arg–TCG                   No       pAH1
Pro–TGG                   No       intN fragment
Glu–CTC                   No       0986a–0988 fusellovirus
Arg–TCT                   No       1232–1238 unknown element
Cys–GCA                   No       1550–1558 plasmid pDL10
Leu–GAG                   No       2147–2151 fusellovirus ASV1
1604–1609 kb (no tRNA)    –        1778–1786 fusellovirus SSV

ASV1 Acidianus spindle-shaped virus, SSV Sulfolobus spindle-shaped virus

because, in principle, successive integrations can occur at a given tRNA gene (Redder et al. 2009). An additional 8–10 genes and pseudogenes, most of which are fusellovirus-related, are clustered distantly from a tRNA gene and they may have become displaced from one of the three tRNA-integrated elements.

Transposable elements

The A. hospitalis genome carries five IS elements belonging to the IS200/607 family, only three of which carry intact transposase genes, and there are 11 copies of orphan orfB elements of the IS605 family, 10 of which carry intact orfB genes. None of these elements carry inverted terminal repeats and they all appear to be transposed by "cut-and-paste" mechanisms, with the orfB elements, at least, transposing via circular single stranded intermediates and inserting after TTAC sequences (Filée et al. 2007; Ton-Hoang et al. 2010).

Sulfolobus genomes generally carry IS elements from a wide variety of families most of which carry inverted terminal repeats and are mobilised by "copy-and-paste" mechanisms, and tend to be lost by gradual degeneration and not by deletion (Blount and Grogan 2005; Redder and Garrett 2006). None of these IS element classes were detected in the A. hospitalis genome and this suggests that the genome has rarely, if ever, taken up any of these IS element classes.

A new class of MITE-like elements

Although none of the MITE elements that are common to other Sulfolobus genomes were detected (Redder et al. 2001; Guo et al. 2011), the A. hospitalis genome carries 10 copies of a repeat sequence resembling a MITE-like element (Fig. 3). At one end, it carries a short open reading frame corresponding in amino acid sequence to the downstream end of an OrfB protein (Fig. 3). The conserved terminal sequence and the internal similarity to the orfB element suggests that it could be a transposable element. This supposition is reinforced by the presence of 10 full copies in the genome (and a few degenerate copies), and also by the presence of multiple copies in some Sulfolobus and other crenarchaeal genomes (unpublished data).

Non-coding RNAs

Many untranslated RNAs have been characterised experimentally for different Sulfolobus species using a variety of techniques including probing cellular RNA extracts for K-turn-binding motifs and generating cDNA libraries of total cellular RNA extracts, as well as numerous antisense RNAs (Tang et al. 2005; Omer et al. 2006; Wurtzel et al. 2010). Most of these RNAs were characterised for partial sequence and nucleotide length, and several were detected by more than one experimental approach. Based on the genome sequence comparisons and gene contexts, 23 putative conserved non-coding RNAs were annotated in the A. hospitalis genome. Genes for 12 C/D box RNAs were localised of which 7 were predicted to modify rRNAs, 2 to target tRNAs and a further 2 to modify unknown RNAs. In addition, a single copy of a gene for an H/ACA box RNA was located which together with aPus7 should generate pseudouridine-35 in Sulfolobus pre-tRNATyr transcripts (Muller et al. 2009). However, in A. hospitalis, the aPus7 gene (Ahos0631) is degenerate. A further 10 genes were assigned to encode RNAs of unknown function. The relatively high conservation of sequence and gene synteny for these RNAs between Sulfolobus and Acidianus species underlines their potential functional importance.

Reading frame shifts and mRNA intron splicing

Examples of translational reading frame shifts yielding single polypeptides have been demonstrated experimentally for S. solfataricus P2 (Cobucci-Ponzano et al. 2010). For two of these, a transketolase (Ahos1219/1218) and a putative O-sialoglycoprotein endopeptidase (Ahos0695/0696), the A. hospitalis genes overlap in a similar way, and are likely to undergo translational frame shifts. Moreover, transcripts of the intron-carrying cbf5 gene (Ahos0734/0735) are likely to undergo splicing at the mRNA level by the archaeal splicing enzyme complex (Ahos0689/0798/1417) as has been demonstrated experimentally for different crenarchaea (Yokobori et al. 2009).
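The idea of two adjacent genes being joined by a translational frame shift can be made concrete with a small sketch. It assumes both open reading frames lie on the forward strand and that only their start coordinates are known; the function names and all coordinates are hypothetical illustrations, not positions of the Ahos gene pairs named above. An offset of 2 is reported as a -1 shift because moving two nucleotides forward is equivalent, modulo 3, to slipping one nucleotide back.

```python
def frame_offset(upstream_start: int, downstream_start: int) -> int:
    """Frame of the downstream ORF relative to the upstream ORF (0, 1 or 2).

    Both arguments are start coordinates on the same (forward) strand.
    """
    return (downstream_start - upstream_start) % 3


def describe_shift(offset: int) -> str:
    """Name the ribosomal shift needed to pass from the upstream frame to the downstream one."""
    return {0: "in frame", 1: "+1 frameshift", 2: "-1 frameshift"}[offset]


# Hypothetical start coordinates; NOT real positions of any A. hospitalis gene.
example_pairs = {
    "pair_A": (1000, 1900),
    "pair_B": (200, 501),
    "pair_C": (5000, 5602),
}

for name, (up, down) in example_pairs.items():
    print(name, describe_shift(frame_offset(up, down)))
```

Genes on the reverse strand would need their coordinates mirrored before applying the same arithmetic.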


Metabolic pathways

Genome analyses indicate the presence of versatile metabolic pathways in A. hospitalis. They suggest that it can grow autotrophically by fixing CO2 or heterotrophically using yeast extract, as has been demonstrated experimentally (Basta et al. 2009). Genome analyses also revealed genes encoding sugar transporters and glycosidases suggesting that A. hospitalis can assimilate carbohydrates, such as starch, glucose, mannose and galactose. Moreover, enzymes are encoded that are implicated in energy generation from oxidising elemental sulphur, hydrogen sulphides and other reduced inorganic sulphide compounds, but not ferrous ions. However, no hydrogenase genes were detected suggesting that A. hospitalis cannot use H2 as electron donor for growth.

Enzymes were identified for a complete TCA cycle that is important for generating different intermediates for the biosynthesis of many cellular components, as well as producing reduced electron carriers, such as NAD(P)H, reduced ferredoxin (FdR) and FADH2. Formation of acetyl-CoA from pyruvate and the formation of succinyl-CoA from 2-oxoglutarate were predicted to be catalysed, respectively, by pyruvate ferredoxin oxidoreductase (Ahos1949-1952) and 2-oxoglutarate ferredoxin oxidoreductase (Ahos0089/0090/0300/0301). Moreover, both enzymes were predicted to use ferredoxin instead of NAD+ as a cofactor.

Genes encoding enzymes involved in pathways for fixing atmospheric N2, or reducing nitrate and nitrite, as nitrogen sources were absent, as observed for other Acidianus species, and the genome analyses suggest that ammonium is an exclusive source of nitrogen that is assimilated via formation of carbamoyl phosphate, glutamine and glutamate. Genes encoding putative carbamoyl phosphate synthetase (Ahos1106/1107), glutamine synthetase (Ahos0460, Ahos1272, Ahos2233) and glutamate dehydrogenase (Ahos0494) are present.

Sulphur metabolism

A. hospitalis encodes several enzymes involved in sulphur metabolism, including the oxidation and reduction of sulphur, the thiosulphate–tetrathionate cycle which generates sulphate, and the participation of sulphur in electron transport. However, genes for some sulphur metabolism enzymes, including sulphite-acceptor oxidoreductase, adenosine phosphosulphate reductase, sulphate adenylyl transferase and adenylylsulphate phosphate adenyltransferase were not found, which suggested that A. hospitalis has some pathways differing from those of other Acidianus and Sulfolobus species (Kletzin 2007). Therefore, based on the gene annotations, a model is presented for the proposed sulphur oxidation and reduction pathways in A. hospitalis (Fig. 2). Extracellular H2S is oxidised by a secretory-type sulphide:quinone oxidoreductase (Ahos0513) and flavocytochrome c sulphide dehydrogenase (Ahos0188) to produce a surface layer of sulphur on the outer membrane. Elemental sulphur is then transported into the cell by putative -SH radical transporter(s) using an unknown mechanism. Subsequently, sulphur is oxidised by sulphur oxygenase-reductase (Ahos0131) to yield sulphite, thiosulphate and hydrogen sulphide. Sulphite and elemental sulphur convert spontaneously and non-enzymatically to thiosulphate and elemental sulphur and, consistent with this mechanism, no candidate gene encoding sulphite:acceptor

Fig. 2 Model of pathways for oxidation and reduction of sulphur in A. hospitalis indicating the predicted functions of genes in the A. hospitalis genome and corresponding gene numbers are given for each step. The following abbreviations are used: OM outer membrane, IM inner membrane, SQR sulphide:quinone oxidoreductase, Fcc flavocytochrome c sulphide dehydrogenase, SOR sulphur oxygenase-reductase, TetH tetrathionate hydrolase, TQO thiosulphate–quinone oxidoreductase; SulP sulphate transporter permease, QH2 quinol pool

oxidoreductase was identified in the A. hospitalis genome. Thiosulphate enters the putative thiosulphate/tetrathionate cycle and is finally oxidised to sulphate. The enzymes involved in this cycle were all annotated: thiosulphate:quinone oxidoreductase (Ahos0112-0113 and Ahos0238-0239) and tetrathionate hydrolase (Ahos1670). H2S is either oxidised by the sulphide:quinone oxidoreductase (Ahos1014) in the with quinone-cytochrome as electron acceptor or it reacts with tetrathionate spontaneously under the high temperature growth conditions. Finally, sulphate generated from sulphur oxidation is effluxed from the cell by a putative sulphate transport permease (Ahos1256). Electrons generated from sulphur oxidation enter the electron transport chain via quinone. Terminal quinol oxidase receives electrons from quinone and transfers them to O2 coupled with ATP generation. Some electrons may be transmitted to the NADH complex to produce NADH for use in other pathways.

Transporters and proteolytic enzymes

Twenty-eight gene products were predicted to be involved in the transport of amino acids, oligopeptide/dipeptides and ammonium. Of these, 19 are implicated in amino acid transport, including 5 amino acid transporters (Ahos0100/0163/0197/0986/1721), three amino acid permeases (Ahos0328/0439/1725) and 11 amino acid permease-like proteins (Ahos0272/0276/0958/1040/1086/1868/1891/1907/1953/2065/2251) of unknown specificity for amino acid uptake. Genes encoding an ammonium transporter (Ahos1467) and two oligopeptide/dipeptide ABC transporter gene clusters (Ahos0337-0342 and Ahos0170-0175) are present. In addition, 21 genes were predicted to encode proteolytic enzymes, including 20 peptidases. Of these, four are endopeptidases (Ahos0428/0516/0695-6/0800), three are aminopeptidases (Ahos0013/0588/1941), two are pepsins (Ahos1929/2087) and one is a carboxypeptidase (Ahos0991). Five of the proteolytic enzymes are predicted to be membrane-bound and are designated secretory proteins. These results suggest that A. hospitalis, like Acidianus brierleyi (Segerer et al. 1986), Acidianus tengchongensis (He and Li 2004) and Acidianus manzaensis (Yoshida et al. 2006), can grow on organic compounds, such as yeast extract, peptone, tryptone and casamino acids.

Toxin–antitoxin systems

VapBC complexes constitute the main family of antitoxin–toxins that are encoded by members of the Sulfolobales (Pandey and Gerdes 2005; Guo et al. 2011), and they occur mainly in variable genomic regions where they may undergo loss or gain events (Guo et al. 2011). The A. hospitalis genome carries 26 vapBC gene pairs that are concentrated in the genomic regions 350–410 and 1,374–1,912 kb with a single vapC-like gene lying in an operon (Fig. 1). The VapB antitoxins, in contrast to VapC toxins, could be classified into three families of transcriptional regulators, AbrB, CcdA/CopG and DUF217 (Fig. 4a), whilst no subclassification was observed for the VapC proteins (Fig. 4b). Tree building based on the sequence alignments demonstrated that the sequences of these antitoxins and toxins are highly diverse, with sequence identities between them rarely exceeding 30%, as indicated by all the proteins exhibiting long branches (Fig. 4). This result contrasted with the finding that VapBC complexes with closely similar sequences are commonly found when comparing different genomes of the Sulfolobales. For example, 11 of the 26 VapBC protein pairs have closely similar homologs encoded in at least 7 of the 13 available Sulfolobus genomes (Fig. 4b). This indicates that there is likely to be a selection against the uptake of closely similar vapBC gene pairs in a given genome, despite the abundance of such gene pairs in the environment.

The A. hospitalis genome also encodes six copies of RelE-related toxin proteins, in common with other Sulfolobus genomes (Pandey and Gerdes 2005, unpublished results). At least three of the relE genes occur in integrated regions carrying degenerated conjugative plasmids, and they show sequence similarity to proteins encoded in Sulfolobus conjugative plasmids pKEF9 (ORF69b), pING1 (ORF98) and pL085 (gene no. 3195) (Greve et al. 2004; Stedman et al. 2000; Reno et al. 2009). However, none of the putative toxin genes are linked physically to antitoxin relB genes and their function remains unknown.

Diverse CRISPR-based immune systems

The CRISPR-based immune systems of A. hospitalis can be classified into two main types based on analyses of their Cas1 protein, leader and repeat sequences (Shah et al. 2009; Lillestøl et al. 2009). In total, there are six CRISPR loci, carrying 129 spacer-repeat units none of which are identical (Fig. 5). The first three loci in the genome (Ahos-53, -13 and -9a) are physically linked by cassettes of cmr and cas family genes, each of which contains a vapBC antitoxin–toxin gene pair, and they constitute a family II CRISPR/Cas system (Fig. 5a). The last two CRISPR loci (Ahos-9b and 5) are coupled into a typical family I paired CRISPR/Cas module (Fig. 5b) and there is a vapBC gene pair immediately upstream. Preceding the latter CRISPR/Cas module, there is a single unclassified locus (Ahos-40) that lacks both cas genes and a leader region (Fig. 5c) (Shah and Garrett 2011).

We analysed the degree to which CRISPR spacers exhibited sequence matches to the many diverse genetic elements available from Acidianus and Sulfolobus species

using an earlier approach examining nucleotide and translated sequences of the spacers (Shah et al. 2009; Lillestøl et al. 2009). Relatively few significant sequence matches were found and most of these were to conjugative plasmids, with a few matches to members of five different viral families (Fig. 5).

Discussion

At about 2.1 Mbp, the genome of A. hospitalis is much smaller than other sequenced genomes of members of the Sulfolobales. Although this partly reflects the presence of low levels of transposable elements and few genes deriving from integrated elements, it also results from a lower diversity of metabolic and transporter genes (Guo et al. 2011). The Z curve analysis suggests that the chromosome carries three replication origins as for Sulfolobus species (Fig. 1), although in contrast to the sequenced strains of S. solfataricus and S. islandicus, the whiP/cdt1 and cdc6-2 genes are widely separated.

Although no systematic analysis has been performed experimentally on the metabolic capacity of A. hospitalis, genome analyses revealed that A. hospitalis possesses the capacity to assimilate a broad range of organic compounds, including different amino acids and proteolytic products, which is similar to some other Acidianus and Sulfolobus species (Segerer et al. 1986; Grogan 1989; He et al. 2004; Yoshida et al. 2006; Plumb et al. 2007). The analyses also indicate that A. hospitalis can assimilate various carbohydrates, similarly to several Sulfolobus species (Grogan 1989) but in contrast to some Acidianus species (Yoshida et al. 2006; Plumb et al. 2007).

A. hospitalis, like other Acidianus and Sulfolobus species, obtains energy for growth mainly via oxidation of reduced inorganic sulphur compounds (RISCs), and the enzymes involved were predicted from the genome analyses (Fig. 2). A sulphur oxygenase-reductase was identified showing amino acid sequence similarity to other Acidianus and Sulfolobus SORs of 67–99%, and we inferred that it is important for elemental sulphur oxidation and reduction, as occurs in both Acidianus and Sulfolobus species (Kletzin 1989, 1992; Sun et al. 2003; Chen et al. 2005a). One product of sulphur oxygenase-reductase catalysis is sulphite. Owing to the apparent lack of the four enzymes, sulphite-acceptor oxidoreductase, adenosine phosphosulphate reductase, sulphate adenylyl transferase and adenylylsulphate phosphate adenyltransferase, A. hospitalis must have adopted a strategy for sulphite oxidation that differs from the currently known pathway (Kletzin 2007). Here, we propose that sulphite is channelled to thiosulphate in A. hospitalis via a spontaneous reaction with elemental sulphur, but this remains to be tested experimentally. Some Acidianus species, such as A. manzaensis (Yoshida et al. 2006) and A. sulfidivorans (Plumb et al. 2007) grow chemolithoautotrophically with oxidation of molecular hydrogen, but this cannot occur in A. hospitalis because it apparently lacks an encoded hydrogen dehydrogenase.

Transposable elements include a few IS200/607 elements and several orphan orfB elements which all belong to the IS200/605/607 family. They lack inverted terminal repeats and are mobilised by "cut-and-paste" mechanisms (Filée et al. 2007; Ton-Hoang et al. 2010). No representatives were found of the other transposable element families that are common to other Sulfolobus genomes, which carry inverted terminal repeats and are mobilised by "copy-and-paste" mechanisms (Blount and Grogan 2005; Redder and Garrett 2006). It remains uncertain whether the OrfB protein is responsible for transposition of the orfB elements or whether they are mobilised in trans by the TnpA transposase encoded by the IS200/607 elements (Filée et al. 2007; Guo et al. 2011). The IS200/607 and orfB elements have been detected in Sulfolobus conjugative plasmids and orfB elements also occur in a few viruses of the Sulfolobales including four copies in the Acidianus two-tailed bicaudavirus ATV (She et al. 1998; Greve et al. 2004; Prangishvili et al. 2006). Thus, they are likely to be transmitted intercellularly, and enter chromosomes, via such genetic elements.

MITEs are common in Sulfolobus species and have been predicted to be mobilised by transposases encoded in different IS element families (Redder et al. 2001). The novel MITE-like elements in the A. hospitalis genome (Fig. 3) may derive from orfB elements and be mobilised by a similar mechanism but at present we can provide no evidence for their mobility. In this respect, they may be similar to other Sulfolobus MITEs which show a low level of transpositional activity (Redder and Garrett 2006). This is consistent with the hypothesis that MITEs drive the evolutionary diversification of their mobilising transposases to the point that they are no longer recognised, which leads to their immobilisation and subsequent degeneration (Feschotte and Pritham 2007).

All of the integrated elements, except one, could be identified as originating from fuselloviruses or a pDL10-like member of the pRN family of cryptic plasmids (Kletzin et al. 1999), and the conjugative plasmid pAH1 was already shown to reversibly integrate at a tRNAArg [TCG] gene (Basta et al. 2009). None of these events occurred within any of the 15 tRNA genes carrying introns and this observation is consistent with the hypothesis that archaeal introns protect tRNA genes against integration events (Guo et al. 2011).

VapBC constitutes the predominant antitoxin–toxin family found amongst the Sulfolobales and the A. hospitalis genome carries 26 vapBC gene pairs, more than occur


Fig. 3 Alignment of 10 MITE-like repeat elements present in the genome of A. hospitalis. The shaded area denotes a small open reading frame corresponding to the downstream part of the OrfB found within transposable orfB elements

[Figure 4: phylogenetic trees, each with a 5% scale bar. Panel A, antitoxins [VapB]: family labels AbrB, CcdA/CopG and DUF217; leaf labels (Ahos ORF numbers) 0374, 2101, 0399, 0394, 0264, 0209, 0412, 1712, 1520, 1610, 0183*, 0356, 0361, 1738, 1728, 0354, 1673, 1997, 1644, 1587, 2059, 0206, 1524, 1582, 1978. Panel B, toxins [VapC]: leaf labels with the columns tabulated below.]

ORF    Antitoxin class   Other genomes
0712   N/A               12
1521   AbrB              3
0210   AbrB              1
0355   AbrB              6
0353   CcdA              8
1674   CcdA              1
1737   CcdA              12
0362   CcdA              7
1729   CcdA              0
1996   CcdA              0
0400   AbrB              4
0265   AbrB              13
0183   AbrB              7
1583   DUF217            7
1979   DUF217            2
0375   AbrB              4
1611   AbrB              0
0205   DUF217            4
1713   AbrB              0
2058   CcdA              13
0395   AbrB              7
0413   AbrB              7
1645   CcdA              0
2102   AbrB              1
1586   CcdA              8
1663   unknown           4
1525   DUF217            0

Fig. 4 VapBC trees. Phylogenetic trees for a VapB antitoxins and b VapC toxins. They demonstrate that VapBs, despite their high sequence diversity, can be classified into three main families AbrB, CcdA/CopG and DUF217, whereas the VapCs are highly diverse in their sequences but cannot be classified into major subgroups. The Ahos gene numbers are given for each protein. Moreover, the class of the VapB corresponding to each VapC is given in b. The degree of conservation of the VapC proteins in the available 13 Sulfolobus genomes is indicated in b where 0 indicates it is absent from all the genomes whilst 13 indicates that it is present in all. The antitoxin corresponding to VapC-0183 is not annotated in the genome because it lacks a start codon but it is included in the figure. The VapC-like protein (Ahos0712) is part of the operon with a translation-related protein and lacks a VapB. The Ahos1664/1663 pair are variant ORFs where both VapB and VapC are longer than usual and the VapB does not cluster with the families in a

in more rapidly growing Sulfolobus species (Pandey and Gerdes 2005; Guo et al. 2011). Moreover, the groups of VapB and VapC proteins are highly diverse in sequence (Fig. 4). Antitoxin–toxins were originally shown to enhance plasmid maintenance as a consequence of the growth of plasmid-free cells being preferentially inhibited by free toxins which are inherently more stable than the antitoxins (Gerdes 2000). By analogy with this mechanism, it was proposed that chromosomally encoded toxins may facilitate maintenance of local DNA regions where vapBC gene pairs are located that might otherwise be prone to loss (Magnuson 2007; Van Melderen 2010). This hypothesis is consistent with the observation that most of the A. hospitalis vapBC gene pairs lie within the two variable genomic regions where DNA regions are exchanged (Fig. 1). Moreover, it receives strong support from both the high


[Figure 5: schematic of the CRISPR loci. Panel A, Family II: CRISPR loci 53, 13 and 9a with the gene labels vapBC, cas4, csx1, csm1, csm2, csm3, csa1, PaREP, cas2, cas1, cas6 and "+ RAMP" (gene numbers 353–379); numbers beside the loci: 52, 12, 8. Panel B, Family I: CRISPR loci 9b and 5 with the gene labels vapBC, cas6, csaX, casHD, cas3, cas5, csa2, csa5, csa3, csa1, cas1, cas2*, cas4, csa3 (gene numbers 1737–1752); numbers beside the loci: 8, 4. Panel C, unknown family: a single CRISPR locus; number beside the locus: 39.]

Fig. 5 Schematic representations of the CRISPR loci of A. hospitalis. a Family II CRISPR module carrying three CRISPR loci and Cmr and Cas family gene cassettes which are both interrupted by, or bordered by, four vapBC gene pairs (orange). b Paired family I CRISPR/Cas system flanked by one vapBC gene pair, and c an unclassified CRISPR locus lacking a leader region and adjacent cas genes. csm1 is a homolog of cmr2, csm2 is a homolog of cmr5 and csm3 is a homolog of cmr4. The light blue genes each carry two short RAMP motifs. a–c Structures of the individual CRISPR loci are shown together with the leader region (L) where each triangle represents a spacer-repeat unit. Significant spacer matches to sequenced viruses and plasmids are colour coded: red rudivirus, orange lipothrixvirus, yellow fusellovirus, green bicaudavirus, turquoise turreted icosahedral virus, blue conjugative plasmid and violet cryptic plasmid

diversity, and the uniqueness of all the VapC proteins encoded within the A. hospitalis chromosome (Fig. 4b), because any similar VapBC complexes would compensate for the loss of one another, thereby undermining any DNA maintenance activity.

In slowly growing organisms, from nutrient poor environments, multiple toxins are also assumed to be involved in stress response and/or quality control (Gerdes 2000; Pandey and Gerdes 2005). Involvement in stress response entails that the more stable toxins inhibit growth and allow the host to lie in a dormant state during the period of environmental stress (Gerdes 2000). However, there may also be a negative effect on host growth due to the continuous presence of low levels of free toxin (Wilbur et al. 2005). Thus, the presence of many vapBC gene pairs in A. hospitalis could reflect a compromise between the ability to survive different environmental stresses and maintaining an adequate growth rate under normal conditions. This would also be consistent with the presence of three families of VapB proteins and high sequence diversity of the VapC proteins, since functionally overlapping systems would be redundant for stress responses and they would confer an unnecessary burden on host growth. The proposed dual roles of maintenance of local chromosomal DNA regions and providing resistance to stress are not mutually exclusive.

Although the mechanism of action of VapC toxins remains unknown (Arcus et al. 2011), in A. hospitalis, a single vapC-like gene (Ahos0712) is directly coupled to genes encoding proteins involved in transcription and initiator tRNA binding to the ribosome, and this gene cassette is highly conserved in gene content and sequence in other Sulfolobus genomes (Guo et al. 2011). This suggests that this VapC protein, at least, may also regulate or inhibit translational initiation by binding at the ribosomal A-site, as demonstrated recently for a RelE type toxin (Neubauer et al. 2009). A similar inactivation mechanism would be plausible for the VapC toxins, if one assumes that expression of the individual VapBC complexes is stimulated by either the requirement to maintain different local regions of chromosomal DNA or different environmental stresses.

Despite the complexity of the CRISPR-based immune systems present in the genome, they appear to be, at best, only partially functional. Thus, the family II CRISPR/Cas system is coupled with an archaeal family D Cmr module in A. hospitalis, but is apparently defective, retaining only its putative RNA, but not DNA, targeting function. The system lacks the group 2 cas genes (cas3, cas5, csa2, csa5, csaX) which encode proteins implicated in targeting and inactivating foreign DNA elements (Fig. 5). However, the cas group 1 genes (cas1, cas2, cas4, csa1), putatively involved in integrating new spacers from invading DNA elements, are present, and the Cmr module implicated in RNA targeting is also present (Garrett et al. 2011; Shah et al. 2011). The family I system exhibits small CRISPR loci, with intact leader regions and group 2 cas genes. However, the cas2 gene in the group 1 cas gene cassette is truncated, having incurred a point mutation which produces a premature stop codon. Thus, this system has apparently lost the ability to integrate new spacers. This suggests that neither CRISPR-based system is fully functional, despite

their apparent complexity. The presence of five vapBC gene pairs located either within the cmr and cas gene cassettes of the family II CRISPR/Cas module, or immediately upstream from the modules of both families, may reflect that they help to maintain these gene cassettes on the chromosome (see above).

Although a range of genetic systems have been developed for Sulfolobus species, at present no genetic systems are available for the Acidianus genus and A. hospitalis provides a promising candidate for such studies. It has a minimal size and the relative stability of its chromosome suggests that it is likely to generate stable deletion mutants. This, combined with its ability to host different plasmids and viruses, provides a promising starting point for developing a genetic system.

Acknowledgments We thank Mery Pina and Tamara Basta for help with the DNA preparation. The work was supported by the National Nature Science Foundation of China (30621005) and the Ministry of Science and Technology (2010CB630903), and by the Danish Natural Science Research Council (Grant no. 272-08-0391) and Danish National Research Foundation.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

Arcus VL, McKenzie JL, Robson J, Cook GM (2011) The PIN-domain ribonucleases and the prokaryotic VapBC toxin–antitoxin array. Prot Engin Design Select 24:33–40

Basta T, Smyth J, Forterre P, Prangishvili D, Peng X (2009) Novel archaeal plasmid pAH1 and its interactions with the lipothrixvirus AFV1. Mol Microbiol 71:23–34

Bettstetter M, Peng X, Garrett RA, Prangishvili D (2003) AFV1, a novel virus infecting hyperthermophilic archaea of the genus Acidianus. Virology 315:68–79

Blount ZD, Grogan DW (2005) New insertion sequences of Sulfolobus: functional properties and implications for genome evolution in hyperthermophilic archaea. Mol Microbiol 55:312–325

Chen Z-W, Jiang C-Y, She Q, Liu S-J, Zhou P-J (2005a) Key role of cysteine residues in catalysis and subcellular localization of sulfur oxygenase reductase of Acidianus tengchongensis. Appl Environ Microbiol 71:621–628

Chen L, Brügger K, Skovgaard M, Redder P, She Q, Torarinsson E, Greve B, Awayez M, Zibat A, Klenk HP, Garrett RA (2005b) The genome of Sulfolobus acidocaldarius, a model organism of the Crenarchaeota. J Bacteriol 187:4992–4999

Cobucci-Ponzano B, Guzzini L, Benelli D, Londei P, Perrodou E, Lecompte O, Tran D, Sun J, Wei J, Mathur EJ, Rossi M, Moracci M (2010) Functional characterisation and high-throughput proteomic analysis of interrupted genes in the archaeon Sulfolobus solfataricus. J Proteome Res 9:2496–2507

Feschotte C, Pritham EJ (2007) DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet 41:331–368

Filée J, Siguier P, Chandler M (2007) Insertion sequence diversity in archaea. Microbiol Mol Biol Revs 71:121–157

Garrett RA, Shah SA, Vestergaard G, Deng L, Gudbergsdottir S, Kenchappa CS, Erdmann S, She Q (2011) CRISPR-based immune systems of the Sulfolobales: complexity and diversity. Biochem Soc Trans 39:51–57

Gerdes K (2000) Toxin-antitoxin modules may regulate synthesis of macromolecules during nutritional stress. J Bacteriol 182:561–572

Goulet A, Blangy S, Redder P, Prangishvili D, Felisberto-Rodrigues C, Forterre P, Campanacci V, Cambillau C (2009) Acidianus filamentous virus 1 coat proteins display a helical fold spanning the filamentous archaeal viruses lineage. Proc Natl Acad Sci USA 106:21155–21160

Greve B, Jensen S, Brügger K, Zillig W, Garrett RA (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1:231–239

Grogan DW (1989) Phenotypic characterization of the archaebacterial genus Sulfolobus: comparison of five wild-type strains. J Bacteriol 171:6710–6719

Guo L, Brügger K, Liu C, Shah SA, Zheng H, Zhu Y, Wang S, Lillestøl RK, Chen L, Frank J, Prangishvili D, Paulin L, She Q, Huang L, Garrett RA (2011) Genome analyses of Icelandic strains of Sulfolobus islandicus: model organisms for genetic and virus-host interaction studies. J Bacteriol 193:1672–1680

He Z-G, Zhong H, Li Y (2004) Acidianus tengchongensis sp. nov., a new species of acidothermophilic archaeon isolated from an acidothermal spring. Curr Microbiol 48:156–193

Kletzin A (1989) Coupled enzymatic production of sulfite, thiosulfate, and hydrogen sulfide from sulfur: purification and properties of a sulfur oxygenase reductase from the facultatively anaerobic archaebacterium Desulfurolobus ambivalens. J Bacteriol 171:1638–1643

Kletzin A (1992) Molecular characterization of the sor gene, which encodes the sulfur oxygenase/reductase of the thermoacidophilic Archaeum Desulfurolobus ambivalens. J Bacteriol 174:5854–5859

Kletzin A (2007) Oxidation of sulfur and inorganic sulfur compounds in Acidianus ambivalens. In: Dahl C, Friedrich CG (eds) Microbial sulfur metabolism. Springer, Heidelberg, pp 184–199

Kletzin A, Lieke A, Urich T, Charlebois RL, Sensen CW (1999) Molecular analysis of pDL10 from Acidianus ambivalens reveals a family of related plasmids from extremely thermophilic and acidophilic archaea. Genetics 152:1307–1314

Lawrence CM, Menon S, Eilers BJ, Bothner B, Khayat R, Douglas T, Young MJ (2009) Structural and functional studies of archaeal viruses. J Biol Chem 284:12599–12603

Lillestøl RK, Shah SA, Brügger K, Redder P, Phan H, Christiansen J, Garrett RA (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol Microbiol 72:259–272

Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955–964

Lundgren M, Andersson A, Chen L, Nilsson P, Bernander R (2004) Three replication origins in Sulfolobus species: synchronous initiation of chromosome replication and asynchronous termination. Proc Natl Acad Sci USA 101:7046–7051

Magnuson RD (2007) Hypothetical functions of toxin–antitoxin systems. J Bacteriol 189:6089–6092

Melderen LV (2010) Toxin-antitoxin systems: why so many, what for? Curr Opin Microbiol 13:781–785

Muller S, Urban A, Hecker A, Leclerc A, Branlant C, Motorin Y (2009) Deficiency of the tRNATyr:Ψ35-synthase aPus7 in archaea of the Sulfolobales order might be rescued by the

123 Extremophiles (2011) 15:487–497 497

H/ACA sRNA-guided machinery. Nucleic Acids Res 37:1308– Shah SA, Garrett RA (2011) CRISPR/Cas and Cmr modules, mobility 1322 and evolution of adaptive immune systems. Res Microbiol Muskhelishvili G, Palm P, Zillig W (1993) SSV1-encoded site- 162:27–38 specific recombination system in Sulfolobus shibatae. Mol Gen Shah SA, Hansen NR, Garrett RA (2009) Distributions of CRISPR Genet 273:334–342 spacer matches in viruses and plasmids of crenarchaeal acido- Neubauer C, Gao YG, Andersen KR, Dunham CM, Kelley AC, thermophiles and implications for their inhibitory mechanism. Hentschel J, Gerdes K, Ramakrishnan V, Brodersen DE (2009) Trans Biochem Soc 37:23–28 The structural basis for mRNA recognition and cleavage by the Shah SA, Vestergaard G, Garrett RA (2011) CRISPR/Cas and ribosome-dependent endonuclease RelE. Cell 139:1084–1095 CRISPR/Cmr immune systems of archaea. In: Marchfelder A, Omer AD, Zago M, Chang A, Dennis PP (2006) Probing the structure Hess W (eds) Regulatory RNAs in prokaryotes. Springer, Berlin and function of an archaeal C/D-box methylation guide sRNA. She Q, Phan H, Garrett RA, Albers S-V, Stedman KM, Zillig W RNA 12:1708–1720 (1998) Genetic profile of pNOB8 from Sulfolobus: the first Pandey DP, Gerdes K (2005) Toxin-antitoxin loci are highly abundant conjugative plasmid from an archaeon. Extremophiles 2:417– in free-living but lost from host-asscoiated prokaryotes. Nucleic 425 Acids Res 33:966–976 Stedman KM, She Q, Phan H, Holz I, Singh H, Prangishvili D, Garrett Plumb JJ, Haddad CM, Gibson JAE, Franzmann PD (2007) Acidianus RA, Zillig W (2000) The pING family of conjugative plasmids sulfidivorans sp nov., an extremely acidophilic, thermophilic from the extremely thermophilic archaeon Sulfolobus islandicus: archaeon isolated from a solfatara on Lihir Island, Papua New insights into recombination and conjugation in Crenarchaeota. Guinea, and amendation of the genus description. 
Int J Syst Evol J Bacteriol 182:7014–7020 Microbiol 57:1418–1423 Sun CW, Chen ZW, He ZG, Zhou PJ, Liu SJ (2003) Purification and Prangishvili D, Albers SV, Holz I, Arnold HP, Stedman K, Klein T, properties of the sulphur oxygenase/reductase from the acido- Singh H, Hiort J, Schweier A, Kristjansson JK, Zillig W (1998) thermophilic archaeon, Acidianus strain S5. Extremophiles Conjugation in archaea: frequent occurrence of conjugative 7:131–134 plasmids in Sulfolobus. Plasmid 40:190–202 Tang TH, Polacek N, Zywicki M, Huber H, Bru¨gger K, Garrett R, Prangishvili D, Forterre P, Garrett RA (2006) Viruses of the Archaea: Bachellerie JP, Hu¨ttenhofer A (2005) Identification of novel a unifying view. Nat Rev Microbiol 4:837–848 non-coding RNAs as potential antisense regulators in the Rachel R, Bettstetter M, Hedlund BP, Ha¨ring M, Kessler A, Stetter archaeon Sulfolobus solfataricus. Mol Microbiol 55:469–481 KO, Prangishvili D (2002) Arch Virol 147:2419–2429 Ton-Hoang B, Pasternak C, Siguier P, Guynet C, Hickman AB, Dyda Redder P, Garrett RA (2006) Mutations and rearrangements in the F, Sommer S, Chandler M (2010) Single-stranded DNA genome of Sulfolobus solfataricus P2. J Bacteriol 188:4198–4206 transposition is coupled to host replication. Cell 142:398–408 Redder P, She Q, Garrett RA (2001) Non-autonomous elements in the Torarinsson E, Klenk H-P, Garrett RA (2005) Divergent transcrip- crenarchaeon Sulfolobus solfataricus. J Mol Biol 306:1–6 tional and translational signals in Archaea. Environ Microbiol Redder P, Peng X, Bru¨gger K, Shah SA, Roesch F, Greve B, She Q, 7:47–54 Schleper C, Forterre P, Garrett RA, Prangishvili D (2009) Four Wilbur JS, Chivers PT, Mattison K, Potter L, Brennan RG, So M newly isolated fuselloviruses from extreme geothermal environ- (2005) Neisseria gonorrheae FitA interacts with FitB to bind ments reveal unusual morphologies and a possible interviral DNA through its ribbon–helix–helix motif. Biochemistry 44: recombination mechanism. 
Environ Microbiol 11:2849–2862 12515–12524 Reno ML, Held NL, Fields CJ, Burke PV, Whitaker RJ (2009) Wurtzel O, Sapra R, Chen F, Zhu ZY, Simmons BA, Sorek R (2010) Sulfolobus islandicus pan-genome. Proc Natl Acad Sci USA A single-base resolution map of an archaeal transcriptome. 106:8605–8610 Genome Res 20:133–141 Robinson NP, Bell SD (2007) Extrachromosomal element capture and Yokobori S, Itoh T, Yoshinari S, Nomura N, Sako Y, Yamagishi A, the evolution of multiple replication origins in archaeal Oshima T, Kita K, Watanabe Y (2009) Gain and loss of an intron chromosomes. Proc Natl Acad Sci USA 104:5806–5811 in a protein-coding gene in Archaea: the case of an archaeal Robinson NP, Dionne I, Lundgren M, Marsh VL, Bernander R, Bell RNA pseudouridine synthase gene. BMC Evol Biol 9:198 SD (2004) Identification of two origins of replication in the Yoshida N, Nakasato M, Ohmura N, Ando A, Saolo J, Ishii M, single chromosome of the archaeon Sulfolobus solfataricus. Cell Igarashi Y (2006) Acidianus manzaensis sp. nov., a novel 116:25–38 thermoacidophilic Archaeon growing autotrophicallly by the 3? Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream oxidation of H2 with the reduction of Fe . Curr Microbiol MA, Barrell B (2000) Artemis: sequence visualization and 53:406–411 annotation. Bioinformatics 16:944–945 Zhang R, Zhang CT (2003) Multiple replication origins of the Segerer A, Neuner A, Kristjansson JK, Stetter KO (1986) Acidanus archaeon Halobacterium species NRC-1. Biochem Biophys Res infernus gen. nov., sp. nov., and Acidianus brierleyi comb. nov.: Comm 302:728–734 facultatively aerobic, extremely acidophilic thermophilic sulfur- metabolizing archaebacteria. Int J Syst Bacteriol 36:559–564


Contribution: This review was written by Professor Roger A. Garrett. The substantial underlying bioinformatical analyses were carried out by myself and Dr. Gisle A. Vestergaard.

Review

Archaeal CRISPR-based immune systems: exchangeable functional modules

Roger A. Garrett, Gisle Vestergaard and Shiraz A. Shah

Archaea Centre, Department of Biology, Ole Maaløes Vej 5, University of Copenhagen, DK2200 Copenhagen N, Denmark

CRISPR (clustered regularly interspaced short palindromic repeats)-based immune systems are essentially modular with three primary functions: the excision and integration of new spacers, the processing of CRISPR transcripts to yield mature CRISPR RNAs (crRNAs), and the targeting and cleavage of foreign nucleic acid. The primary target appears to be the DNA of foreign genetic elements, but the CRISPR/Cmr system that is widespread amongst archaea also specifically targets and cleaves RNA in vitro. The archaeal CRISPR systems tend to be both diverse and complex. Here we examine evidence for exchange of functional modules between archaeal systems that is likely to contribute to their diversity, particularly of their nucleic acid targeting and cleavage functions. The molecular constraints that limit such exchange are considered. We also summarize mechanisms underlying the dynamic nature of CRISPR loci and the evidence for intergenomic exchange of CRISPR systems.

Corresponding author: Garrett, R.A. ([email protected]).

Archaea and CRISPR immunity

The early evolutionary history of archaea remains unresolved. Archaea could have descended directly from a universal common ancestor, undergone a shared period of descent with eukarya, or have been streamlined from a more complex (and eukaryal-like) ancestor [1,2]. Although many cellular processes of archaea and eukarya share common features that are absent from bacteria [1], the uniqueness of archaea appears to lie in their successful adaptation to extreme environmental conditions including high temperature, extremes of pH, high salt, high pressures, and strictly anaerobic conditions. These environments tend to be low in sources of energy, consistent with the hypothesis that some unique archaeal properties were maintained through adaptation to chronic energy stress via, for example, their catabolic pathways and mechanisms of energy conservation facilitated by low permeability ether-linked lipid membranes [3].

This exceptional biology is reflected in the properties of the archaeal viruses. Most of those characterized, especially from extreme thermophilic and halophilic environments, show morphotypes and genomic properties distinct from viruses of bacteria and eukarya [4–6]. There are also preliminary indications that levels of free viruses, at least in extreme thermoacidophilic environments, tend to be low relative to cellular levels, suggesting that these viruses prefer to remain ‘inside’ cells [7]. Moreover, archaeal viruses generally exist in stable relationships with their hosts at low copy-numbers and rarely cause cell lysis [4,6,8,9].

CRISPR systems (Box 1) provide immunity against invasion by viruses and conjugative plasmids, are present in most studied archaea and in about 40% of bacteria, and have a common evolutionary origin [10,11]. The CRISPR systems in many archaea are unusual in that they tend to be both diverse and complex, suggesting that they have the potential to be more versatile functionally and with more possibilities for regulation than in many bacteria [11,12]. Given the tendency of many archaeal viruses and conjugative plasmids to maintain stable relationships with their hosts, and to avoid targeting by the CRISPR system, different regulatory systems might play an important role [4,6,13]. For example, the immune response may only be activated at certain levels of viral DNA replication or transcription.

All CRISPR systems have three basic functions. First, the excision of protospacer DNA from invading genetic elements and insertion into CRISPR loci, a process termed adaptation. Second, transcripts from complete CRISPR loci are processed to yield crRNAs that are then assembled into protein complexes. Third, these complexes target and cleave the DNA or RNA of invading genetic elements, termed interference. These steps are illustrated in Figure 1 and the main components are defined in Box 1.

CRISPR-based systems have recently been reclassified into three main types, of which only types I and III occur in archaea (Box 2). Protein components of CRISPR systems are manifold and highly diverse. Several core protein functions have been predicted from sequence analyses or crystal structures [14,15] but with few exceptions their detailed mechanistic roles remain to be determined experimentally (Box 1). Similarities of essential components and core mechanisms of archaeal and bacterial CRISPR systems are consistent with their having a common evolutionary origin [10,11].

Attempts to classify CRISPR systems phylogenetically have previously involved sequence alignments of the most conserved Cas1 protein [14,16]. This protein is almost ubiquitous and is associated with the adaptation step (Figure 1). Phylogenetic studies on crenarchaeal systems

0966-842X/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.tim.2011.08.002. Trends in Microbiology, November 2011, Vol. 19, No. 11

Review Trends in Microbiology November 2011, Vol. 19, No. 11

Box 1. Core components of CRISPR systems

Here we summarize the main components of CRISPR systems.

Leaders: all active CRISPR loci to date are preceded by a leader of about 300–400 bp, carrying some low complexity sequence and conserved regions, that is likely to be involved in the adaptation step at or near the first repeat [16,17]. The CRISPR proximal region of the leader also carries the main promoter for CRISPR transcription [16].

CRISPR loci: these consist of arrays of identical direct repeats of 24–37 bp in size and, in archaea, often contain up to 100 repeat units. These are interspaced with similarly sized spacers (35–44 bp) carrying unique sequences that derive from invading DNA genetic elements. They are dynamic structures that undergo loss and exchange of spacer-repeat units, probably via recombination events at repeats [16,35]. Thus they provide a record, albeit incomplete, of previous invading genetic elements, although if CRISPR loci have recently exchanged between related organisms, as occurs for S. islandicus [17], the record will be erroneous. There is currently no evidence to indicate whether spacers can originate from RNA viruses.

Protospacer: a segment of the invading DNA genetic element that is incorporated into a CRISPR locus at or near the first repeat, and in a direction predetermined by the location of the adjacent protospacer-associated motif.

Protospacer-associated motif (PAM): this motif is essential for the immune response [19]. It corresponds to a short sequence, positioned at approximately –2 to –4 bp from the end of the protospacer that becomes leader-proximal on insertion into a CRISPR locus. This suggests that the base-paired motif influences protospacer selection from genetic elements [16,19,31]. Another proposed function of the PAM motif is that it ensures the presence of mismatched base-pairs between 5′ ends of crRNAs and targeted DNA as a prerequisite for avoiding self-interference of CRISPR loci [46]. The PAM motif may also play a more specific role in DNA interference, although how it is recognized and the degree of PAM sequence stringency required remain unknown [35,41].

crRNAs: the final products of processing of pre-CRISPR RNAs, many of which exhibit short inverted repeats [58]. They are produced for DNA targeting by introducing single cuts in adjacent repeats, and each crRNA carries a spacer sequence flanked by repeat sequence fragments [20,32]. In Cmr-based RNA targeting, crRNAs are further processed at the 3′ end by an unknown enzyme [21,22]. crRNA complexes with Cas, Csm or Cmr proteins target invading nucleic acids by base-pairing to highly similar sequences, where perfect matching of the 5′ terminal spacer sequence of the crRNA can be especially important for DNA targeting [35,40,43].

CRISPR-associated proteins (Cas): although many functions have been predicted bioinformatically for core Cas proteins, few have been tested experimentally [14,15]. Cas1 and Cas2 are universally involved in adaptation and the proteins exhibit metal-dependent DNA and RNA endonuclease activity, respectively [59,60]. Cas4 carries a predicted RecB nuclease domain and is sometimes fused to Cas1, and is thereby implicated in adaptation. DNA interference by the CRISPR/Cas system requires at least three core proteins (Cas5, Cas7, and Cas3), which carry helicase and single-stranded DNA nuclease activities and are associated with invader DNA cleavage [61]. A large group of RNA recognition motif-containing proteins (RAMPs) also carry small glycine-rich motifs, including the diverse Cas6 proteins involved in CRISPR RNA processing and many of the proteins making up the Csm and Cmr protein targeting complexes for DNA and RNA, respectively [23,46].

CASCADE (CRISPR-associated complex for antiviral defense): first characterized for the E. coli CRISPR/Cas system, this constitutes a protein complex of Cas5e, Cas6, Cas7 (six copies) and two subtype specific proteins Cse1 and Cse2 (two copies) [20,45]. It generates a seahorse-shaped structure encompassing the crRNA and specifically targets the complementary strand of protospacer-like DNA (and unspecifically ssRNA) but does not cleave it. The presence of a Cas6 homolog underlines an additional link to processing [45]. A similar structure was modeled for a Sulfolobus complex containing Cas5e, multiple copies of Cas7 and crRNA, that also targeted DNA but only interacted weakly with Cas6 and other Cas proteins [42]. The similarity of the two structures suggests that this may be a universal structure for DNA targeting.
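The repeat-spacer layout described for CRISPR loci in Box 1 can be illustrated with a minimal Python sketch. It assumes that the repeat sequence is already known and that all repeat copies are exact; the function name `split_crispr_array` and the toy sequences are hypothetical and serve only as an illustration, not as a published method.

```python
def split_crispr_array(array: str, repeat: str) -> list[str]:
    """Return the spacers lying between consecutive exact copies of `repeat`.

    Assumption: repeats are identical direct repeats, so plain string
    search is enough. Real loci may contain degenerate repeat copies.
    """
    array = array.upper()
    repeat = repeat.upper()
    spacers = []
    start = array.find(repeat)
    while start != -1:
        nxt = array.find(repeat, start + len(repeat))
        if nxt == -1:
            break  # no further repeat: the remaining sequence is not a spacer
        spacers.append(array[start + len(repeat):nxt])
        start = nxt
    return spacers


# Toy locus (invented sequences): leader fragment, then repeat-spacer units.
toy_repeat = "GATTAATCCCAAAAGGAATTGAAAG"   # 25 bp, within the 24-37 bp range
toy_spacer_1 = "ACGT" * 9                  # 36 bp
toy_spacer_2 = "TTGCA" * 8                 # 40 bp
toy_locus = ("TTTTTT" + toy_repeat + toy_spacer_1
             + toy_repeat + toy_spacer_2 + toy_repeat)

for spacer in split_crispr_array(toy_locus, toy_repeat):
    print(len(spacer), spacer)
```

Because spacers are listed in array order, the first element is the one closest to the start of the input, which corresponds to the leader-proximal spacer only if the sequence is supplied in leader-to-distal orientation.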

provided evidence for coevolution of Cas1 protein and the leader and repeat sequences, strongly suggesting that these structural components are functionally interdependent in adaptation [16,17]. However, when attempts were made to extend these analyses to conserved crenarchaeal CRISPR components implicated in RNA processing or nucleic acid interference, divergent trees were obtained, suggesting that CRISPR systems are non-integral and that modular exchange can occur [17,18].

In this review we focus primarily on archaeal CRISPR systems. The degree of functional and structural interdependence of the functional modules is summarized and evidence is provided for modular exchange. Further, molecular and sequence constraints that limit the capacity for exchange are considered and it is inferred that advantages of exchange lie primarily in generating interference diversity. Further, we summarize the evidence for CRISPR loci being dynamic structures and describe factors that

[Figure 1 graphic: schematic of adaptation (aCas; excision of viral/plasmid DNA and insertion of a new spacer next to the leader), processing (pCas; poly-crRNA) and interference (iCas–crRNA complex, cleaved DNA; iCmr–crRNA complex, cleaved viral RNA or mRNA). TRENDS in Microbiology]

Figure 1. Scheme for the three primary functions of CRISPR systems. In the adaptation step, Cas proteins excise the protospacer sequence from a foreign DNA genetic

element and insert it into the repeat adjacent to the leader of the CRISPR locus. Pre-CRISPR RNAs are then transcribed from within the leader and are subsequently

processed into crRNAs each carrying a single spacer sequence and part of the adjoining repeat sequence. At the interference stage, crRNAs are assembled into protein

targeting complexes that anneal to, and cleave, matching spacer sequences on either invading elements or their transcripts.
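The processing step in this scheme can be sketched in a few lines of Python. The sketch assumes one cut at a fixed position inside every repeat, so that each product keeps a short repeat-derived segment at its 5′ end (the review reports 8 nt for characterized crRNAs) and the remainder of the next repeat at its 3′ end. The function name `process_pre_crrna`, the parameter `handle_len` and the toy sequences are hypothetical; the real cleavage position depends on the Cas6 enzyme involved.

```python
def process_pre_crrna(transcript: str, repeat: str, handle_len: int = 8) -> list[str]:
    """Cut a pre-CRISPR RNA once inside each repeat and return the crRNAs.

    Assumption: the cut falls `handle_len` nucleotides before the 3' end
    of every repeat, so each crRNA starts with that repeat fragment.
    Terminal fragments (before the first cut, after the last) are dropped.
    """
    if not 0 < handle_len <= len(repeat):
        raise ValueError("handle_len must be between 1 and len(repeat)")
    cut_offset = len(repeat) - handle_len
    cuts = []
    pos = transcript.find(repeat)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = transcript.find(repeat, pos + len(repeat))
    # Each crRNA spans from one cut site to the next.
    return [transcript[a:b] for a, b in zip(cuts, cuts[1:])]


# Toy transcript (invented sequences) with three repeats and two spacers.
toy_repeat = "GAUUAAUCCCAAAAGGAAUUGAAAG"
toy_spacers = ["ACGU" * 9, "UUGCA" * 8]
toy_transcript = ("AAAA" + toy_repeat + toy_spacers[0]
                  + toy_repeat + toy_spacers[1] + toy_repeat + "UUUU")

for crrna in process_pre_crrna(toy_transcript, toy_repeat):
    print(len(crrna), crrna)
```

Further trimming at the 3′ end, as described for the smaller crRNAs associated with Cmr complexes, is not modeled here.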


Box 2. Classification and nomenclature

CRISPR-related proteins have been classified into eight types of CRISPR systems and up to 45 families of associated proteins [14,61]. An attempt was recently made to simplify both the CRISPR classification and protein nomenclature and the results pertaining especially to archaeal systems (summarized below) are presented together with a suggested terminology that we use for labeling the diverse functional modules present in archaea [16].

CRISPR systems: these are now grouped into three major classes – types I to III (with a few subtypes) – based primarily on sequences of the Cas1 and Cas2 proteins implicated in adaptation, but also taking into account gene cassette contents [15]. Type I systems have been implicated in DNA targeting (exemplified in Figure 2a) and are generally characterized by a Cas3 endonuclease considered to cleave invading foreign DNA [61]. Type II are bacteria-specific and require a CRISPR-associated trans-encoded small RNA (tracrRNA) and host-encoded RNase III for processing. The large multifunctional Cas9 protein alone appears to facilitate the final processing and interference steps [36]. Type III systems are over-represented in archaea and include all CRISPR systems carrying Cmr or Csm proteins, illustrated for archaea in Figure 2b–f. Some of these proteins (Cmr2/Csm1 and Cmr4/Csm3) are homologs, whereas others show minimal sequence conservation but carry RNA recognition and glycine-rich motifs (RAMP proteins) [12,14]. The Csm and Cmr protein complexes are implicated in targeting and cleavage of DNA and RNA, respectively [23,25]. In archaea the type I and type III systems are often functionally interdependent [17,44].

Protein nomenclature: the names of proteins Cas1 to Cas6 are retained but they are extended to include many disparate homologs in different organisms. Cas7 to Cas10 represent new categories, each of which brings together a group of differently named homologs. The changes especially relevant to archaeal CRISPR systems are: Cas7 for Csa2, Cas8 for Csa4, and Cas10 is proposed for homologs Cmr2 and Csm1 of type III systems (Figure 3). Cas9 is exclusive to the bacteria-specific type II system.

Functional module nomenclature: the following terms are introduced for the central mechanistic steps: adaptation, aCas; processing, pCas; and interference, iCas, iCmr, and iCsm, which are generally genetically discrete units but are also functionally interdependent (Figure 2). These terms are considered to provide a useful label for all components of the genetically diverse archaeal functional modules. Gene cassettes of all the functional modules often carry additional proteins that are conserved for different CRISPR subtypes, and gene cassettes for the three types of interference module are particularly diverse (Figures 2 and 4). The terms are applied to components that are specifically involved in the main functional steps of different CRISPR systems, but exclude transcriptional regulators (Figures 2 and 4).
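The marker-based part of this classification can be written down as a small Python sketch. It is a deliberate simplification: it uses only the signature proteins named in Box 2 (Cas3 for type I, Cas9 for type II, Cmr or Csm proteins, including Cas10, for type III) and ignores the Cas1 and Cas2 sequence comparisons on which the actual classification is primarily based. The function name `crispr_types` and the example gene lists are hypothetical.

```python
def crispr_types(genes: list[str]) -> list[str]:
    """Guess CRISPR system types from signature genes in a cassette.

    Assumption: presence of a marker gene is treated as sufficient,
    which is a simplification of the scheme summarized in Box 2.
    """
    names = {g.lower() for g in genes}
    types = set()
    if any(n.startswith("cas3") for n in names):           # cas3, cas3', cas3"
        types.add("I")
    if "cas9" in names:
        types.add("II")
    if "cas10" in names or any(n.startswith(("cmr", "csm")) for n in names):
        types.add("III")
    return sorted(types)


# Invented example cassettes.
print(crispr_types(["cas1", "cas2", "cas4", "cas5", "cas7", "cas3'", 'cas3"', "cas6"]))
print(crispr_types(["cas1", "cas2", "cas6", "cas3", "cmr2", "cmr3", "cmr5"]))
print(crispr_types(["cas1", "cas2"]))
```

A cassette that returns more than one type corresponds to the mixed type I and type III arrangements that the review describes for some archaeal genomes; an empty result means that no signature gene was found, not that the locus is inactive.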

contribute to their structural changes and, finally, evidence for intergenomic exchange of CRISPR systems is discussed. Detailed experimental data pertaining to the mechanisms involved in the core functional steps in archaea and bacteria have recently been reviewed [11] and will not be covered in depth here.

Functional modules

CRISPR systems all exhibit three basic functional steps illustrated in Figure 1. (i) Adaptation involves recognition and degradation of foreign DNA by Cas proteins and incorporation of a DNA fragment into the CRISPR locus as a new spacer presumed to occur at the repeat adjacent to the leader [16,19]. (ii) In the second step, the complete CRISPR locus is transcribed from within the leader and processed into multiple CRISPR RNAs (crRNAs) each carrying a single spacer sequence and one or more adjoining repeat regions. Proteins implicated in the archaeal RNA processing are the core protein Cas6 and at least one other unidentified protein [20–22]. (iii) Interference (or invader silencing) of DNA or RNA occurs when a protein–crRNA complex targets and cleaves a highly similar sequence of the genetic element [23–25]. At present three interference systems have been identified based on Cas and Csm protein complexes each targeting DNA in vivo and Cmr proteins targeting RNA in vitro. Here we introduce terms for the main molecular components involved in each functional step to simplify the discussion of functional module exchange as follows: aCas for adaptation; pCas for processing, and iCas, iCsm and iCmr for nucleic acid interference (Box 2).

Currently about 165 CRISPR systems from 110 archaeal genomes are available in public sequence databases and have provided a basis for analyzing gene organization patterns of different functional modules [12,15,26]. They reveal six major combinations of gene cassettes illustrated with color-coded functional modules in Figure 2. Whereas the aCas cassette is relatively conserved in the first four combinations, the interference modules are diverse in both their gene contents and in their combinations (Figure 2a–d). About half of the archaeal iCmr and iCsm gene cassettes are physically separated on genomes from CRISPR loci and aCas genes (Figure 2e,f).

Adaptation

New spacer uptake involves excision of a protospacer from an invading DNA genetic element and its integration as a new spacer at the repeat sequence adjacent to the leader, resulting in duplication of the repeat. It has only been observed under laboratory conditions for Streptococcus thermophilus [27]. For archaea, evidence is limited to comparative genomic studies of closely related Sulfolobus strains where more recently incorporated spacers are clustered adjacent to the leader [16,28–30]. The short PAM motif adjacent to the protospacer (Box 1) has been implicated in determining the orientation of inserted spacers [16,26,31]. Most aCas modules are relatively conserved in content, generally carrying proteins Cas1, Cas2 and Cas4 (Figure 2), of which the first two appear to be essential. Moreover, spacer integration at the first repeat, combined with phylogenetic evidence for coevolution of cas1, leader and repeat sequences, suggest that the leader is cofunctional [16,17].

RNA processing

Transcripts initiate within leaders and terminate downstream from CRISPR loci [16]; early work on Archaeoglobus fulgidus indicated that processing occurs within adjacent repeats [32]. The primary processing enzyme is the ubiquitous and diverse Cas6 protein and, at least in Pyrococcus furiosus, the CRISPR transcript wraps around the Cas6 endonuclease and is cut once in each adjacent repeat [23,33]. The order and direction of processing remains unclear. Early work on Sulfolobus solfataricus suggested that, in contrast to A. fulgidus, initially every third repeat is cut and that processing occurs primarily from the 3′ end of the CRISPR transcript, but this remains to be confirmed [16,34]. Moreover, processing levels were


[Figure 2 graphic: gene maps for (a) S. islandicus, (b) P. furiosus, (c) C. subterranum, (d) T. volcanium, (e) H. butylicus and (f) M. vulcanius. Key: aCas; pCas; iCas; iCsm or iCmr; CRISPR / no. of repeats. R marks RAMP proteins. TRENDS in Microbiology]

Figure 2. Representative gene maps of six main classes of archaeal CRISPR systems. (a) CRISPR/aCas-pCas-iCas, common in archaea; in this example those of S. islandicus

are shown. (b) CRISPR/aCas-pCas-iCas-iCmr; studied experimentally in P. furiosus. (c) CRISPR/aCas-pCas-iCas-iCsm; from Caldiarchaeum subterranum. (d) CRISPR/aCas-

iCsm; shown for Thermoplasma volcanium. (e) iCmr from Hyperthermus butylicus. (f) iCsm from Methanocaldococcus vulcanius. Genes encoding the functional domains

are color-coded: aCas module, light blue; pCas gene, orange; iCas module, yellow; iCsm and iCmr modules, red. t.r. genes in green encode putative transcriptional regulator

genes that are not considered to be part of the functional modules. R indicates proteins carrying RNA-recognition motifs (RAMPs). (a) belongs to the type I CRISPR system;

(b) and (c) are mixtures of type I and type III, whereas (d–f) are classified as type III [15].
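The module labels used in this caption (for example CRISPR/aCas-pCas-iCas-iCmr) can be derived mechanically from a gene list, as the following Python sketch shows. The gene-to-module table is an illustrative assumption assembled from the genes named in the text (Cas1, Cas2 and Cas4 for adaptation, Cas6 for processing, Cas3, Cas5, Cas7 and Cas8 for interference); it is not the authors' curated assignment, and genes absent from the table, such as transcriptional regulators, are simply ignored. The names `MODULE_OF`, `module_of` and `cassette_label` are hypothetical.

```python
from typing import Optional

# Illustrative gene-to-module table (an assumption, see lead-in).
MODULE_OF = {
    "cas1": "aCas", "cas2": "aCas", "cas4": "aCas",
    "cas6": "pCas",
    "cas3": "iCas", "cas3'": "iCas", 'cas3"': "iCas",
    "cas5": "iCas", "cas7": "iCas", "cas8": "iCas",
}
# Fixed order in which modules appear in a label.
MODULE_ORDER = ["aCas", "pCas", "iCas", "iCmr", "iCsm"]


def module_of(gene: str) -> Optional[str]:
    """Map a gene name to a functional module, or None if unassigned."""
    name = gene.lower()
    if name.startswith("cmr"):
        return "iCmr"
    if name.startswith("csm"):
        return "iCsm"
    return MODULE_OF.get(name)


def cassette_label(genes: list[str], has_crispr: bool = True) -> str:
    """Build a label such as 'CRISPR/aCas-pCas-iCas' from a gene list."""
    present = {module_of(g) for g in genes}
    label = "-".join(m for m in MODULE_ORDER if m in present)
    return f"CRISPR/{label}" if has_crispr else label


# Gene lists loosely modeled on panels (b) and (e); treat them as examples only.
print(cassette_label(["cas6", "cmr2", "cmr3", "csx1", "cmr5", "cas8",
                      "cas7", "cas5", "cas3", "cas4", "cas1", "cas2"]))
print(cassette_label(["csx1", "cmr5", "cmr2", "cmr3"], has_crispr=False))
```

A presence-based rule like this labels any cassette that contains cas6 with pCas, so it will not always reproduce the labels chosen by the authors for individual panels.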

higher in Sulfolobus during stationary phase when the cells are more vulnerable to viral attack [16,28]. Patterns of archaeal crRNAs are often complex, extending over the approximate size range 35–60 nt, and this probably reflects the diversity of CRISPR systems present [28,35]. To date, all the characterized crRNAs carry an 8 nt repeat sequence at the 5′ end. Larger crRNAs implicated in DNA targeting in vivo are 60–65 nt in length and carry partial repeat sequences at each end, whereas smaller crRNAs which can target RNA in vitro are 37–45 nt in length and lack repeat and partial spacer sequences at the 3′ end [20–22]. Processing at the 3′ end of these RNAs is performed by an unknown enzyme [21,22]. Processing within repeats in Streptococcus pyogenes is effected by a trans-encoded RNA and host-encoded RNase III [36]. This type II CRISPR/Cas system does not occur in archaea [15] where the cellular functions of RNase III appear to be performed by a general intron-splicing enzyme with a different substrate specificity [37].

In most studies archaeal CRISPR loci are constitutively expressed and processed into mature crRNAs in the absence of invading DNA elements, but it remains unclear whether the CRISPR systems require further activation [16,21]. Bacterial studies have revealed diverse CRISPR regulatory mechanisms which can be activated on viral infection producing elevated expression [38,39].

DNA interference

Independent lines of evidence support that DNA is the primary target for most CRISPR systems. Putative protospacer sequences are essentially distributed randomly on archaeal virus and plasmid DNA with no significant bias of matching crRNAs to either genes relative to intergenic regions or to coding versus non-coding strands [26,28]. Moreover, genetic studies on different Sulfolobus species have provided strong evidence for DNA targeting in vivo, presumably involving iCas rather than iCmr modules [35,40]. For bacteria, experimental evidence for DNA targeting in vivo was provided for the CRISPR/Csm system of Staphylococcus epidermidis [41] (equivalent to Figure 2d), the CRISPR/Csn (bacterial type II) system of S. thermophilus [25] and for the CRISPR/Cas system of Escherichia coli [20], although none of these studies precluded additional RNA targeting.

A large protein complex, containing multiple protein components, was first characterized for an E. coli CRISPR/Cas system that participates in crRNA maturation and that facilitates annealing of the crRNA to the DNA target, but not cleavage [41]. It generates a seahorse form and is defined as a CASCADE complex (Box 1). A related structure is produced for a S. solfataricus CRISPR/Cas system made up of only Cas5e and multiple copies of Cas7, and which appears to be involved primarily in DNA targeting [42]. This is therefore likely to be a universal structure, at least for iCas targeting systems.

Studies on S. thermophilus demonstrated that effective interference requires perfect matches between crRNA and protospacers [19]. However, recent work on Sulfolobus species has demonstrated that three or more mismatches located near the centre of the protospacer or at the distal end from the PAM motif do not prevent interference [35,40]. Moreover, a systematic study of the E. coli CRISPR/Cas system


has shown that only six of the seven nucleotides of the targeted protospacer strand proximal to the PAM motif must match the crRNA perfectly, and this was proposed to act as a recognition site, or seed, for the interference reaction [43]. Whether this is a general property of the CRISPR DNA targeting systems remains to be determined.

RNA interference

In the CRISPR/Cmr system of P. furiosus (Figure 2b), a complex of Cmr proteins encompassing a small crRNA, lacking the 3′ end of the spacer sequence, targets and cleaves complementary single-stranded RNA (ssRNA) in vitro [24]. To date there is no evidence for or against in vivo RNA targeting, and it is too early to establish whether mRNAs, non-coding RNAs (ncRNAs), and/or RNA viruses can be targets. Nevertheless, iCmr modules are common in archaea and are encoded either together with aCas modules and CRISPR loci or as separate genetic entities (Figure 2c,e). Paradoxically, some Cmr proteins show significant sequence similarity to Csm proteins implicated in DNA targeting in S. epidermidis [41], and both are common in archaea. A phylogenetic tree of archaeal Cmr2 (Cas10) homologs shows five main subfamilies, four of which represent iCmr and iCsm modules (Figure 3) [12,44]. The fifth subfamily, A (represented by Csx11), is present in a few bacteria and methanoarchaea but has not been studied experimentally, and is therefore not considered further. Other components of iCmr and iCsm modules include the small conserved Cmr5/Csm2 protein and three to seven copies of highly diverse RNA binding motif-containing proteins (RAMP proteins denoted R in Figure 2) [12,14,44]. In summary, the degree of cofunctionality of the partly homologous iCmr and iCsm modules remains unclear.

Figure 3. Phylogenetic tree of the archaeal Cas10 subtypes Cmr2, Csm1 and Csx11. These are the largest and most conserved subcomponents of the interference modules of type III CRISPR systems, where the iCmr module has been implicated in RNA targeting [23] and the iCsm system in DNA targeting [41]. The deep branching reflects the very divergent sequences. Analysis of the five subfamilies A–E indicates strong biases in their distributions among crenarchaea and euryarchaea, and family D is archaea-specific and is present in crenarchaea, euryarchaea and unclassified archaea. The Figure is reproduced with permission from [44]. 10% indicates the amount of amino acid sequence change for the given length on the tree branches.

Module exchange

Attempts to classify archaeal CRISPR/Cas systems of the Sulfolobales on the basis of the cas1, leader and repeat sequences provided evidence for four families that were conserved in gene content and synteny and they appeared to constitute integral genetic units [16,26]. However, more detailed phylogenetic analysis of the aCas and iCas genes of family I CRISPR/Cas systems of different Sulfolobus islandicus strains (Figure 2a) revealed that the aCas tree diverges from the iCas tree as well as from trees generated from all the concatenated genes of each host genome, consistent with exchange of aCas modules having occurred [17]. The results of this analysis are illustrated in Figure 4a for two divergent pairs of CRISPR/Cas systems from four selected S. islandicus strains [17]. For each similar pair the concatenated homologous Cas proteins showed about 99% amino acid sequence identity. However, when the protein sequences of the two pairs were compared, whereas the iCas modules maintained their high sequence identity (99%), the aCas identity was reduced to 74% (Figure 4a), consistent with the aCas module having been exchanged [17]. Using the same approach, similar CRISPR/Cas systems of two divergent pairs of strains of the thermoneutrophile Pyrobaculum were compared. All the concatenated homologous Cas protein components of each similar pair showed 70% amino acid sequence identity. However, when the two pairs were compared aCas

Figure 4. Examples of genetic exchange of functional modules where amino acid sequences from shared genes in each functional module are compared [17]. (a) Comparison of the aCas and iCas modules for type I CRISPR/Cas systems of four closely related S. islandicus strains. Pairwise they show a high sequence identity of 99% for two modules, but when the two pairs are compared the combined iCas proteins remain almost identical in sequence, whereas the aCas modules show only 74% sequence similarity between the pairs, consistent with the aCas module having been exchanged for one of the group of strains [17]. (b) A similar study was performed for shared genes of four thermoneutrophilic Pyrobaculum strains, where two pairs each show similar levels of amino acid sequence similarity for their aCas and iCas modules (about 70%), but when the two pairs are compared the aCas sequences remain constant at about 70% whereas the iCas module yields only 28% similarity, indicative of the iCas modules having been exchanged. Gene contents of the two pairs of iCas modules also indicate that they belong to different subtypes. Gene modules are color-coded as in Figure 2. Abbreviations: C, CRISPR locus; t.r., transcriptional regulator (in green); n.d., gene identity not determined.
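The module-wise comparison described for Figure 4 can be illustrated with a minimal Python sketch. It assumes the concatenated protein sequences of each module have already been aligned to equal length; the strain names, sequences and function names below are hypothetical and are not data or code from the cited study [17].

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity of two pre-aligned protein sequences.

    Columns in which either sequence carries a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(a == b for a, b in columns) / len(columns)


def between_pair_identity(strains, pair_a, pair_b, module):
    """Mean identity of one module over all strain combinations across two pairs."""
    values = [
        percent_identity(strains[x][module], strains[y][module])
        for x in pair_a
        for y in pair_b
    ]
    return sum(values) / len(values)


# Hypothetical, made-up aligned sequences (not data from the study).
TOY_STRAINS = {
    "A1": {"aCas": "MKLVTAGQRS", "iCas": "MSTPLLKDEW"},
    "A2": {"aCas": "MKLVTAGQRS", "iCas": "MSTPLLKDEW"},
    "B1": {"aCas": "MRLITSGQKA", "iCas": "MSTPLLKDEW"},
    "B2": {"aCas": "MRLITSGQKA", "iCas": "MSTPLLKDEW"},
}

if __name__ == "__main__":
    for module in ("aCas", "iCas"):
        within = percent_identity(
            TOY_STRAINS["A1"][module], TOY_STRAINS["A2"][module]
        )
        between = between_pair_identity(
            TOY_STRAINS, ("A1", "A2"), ("B1", "B2"), module
        )
        print(f"{module}: within pair {within:.0f}%, between pairs {between:.0f}%")
```

In this toy data set the iCas module keeps the same identity within and between pairs while the aCas identity drops between pairs, which is the pattern the text interprets as being consistent with exchange of the aCas module.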


protein sequence identity remained at 70%, but a much lower value of 28% was observed for the iCas proteins, indicative of exchange of the latter (Figure 4b).

Constraints on modular exchange

Specific interactions with the repeat sequence, either at the DNA or RNA level, are crucial for the function of the aCas, pCas and interference modules, and the capacity of some protein components to interact specifically with the repeat sequence might be a major constraint on modular exchange. Integration of new spacers thus probably depends on Cas protein recognition of the first repeat and adjoining leader region [16,19]. Cas6 associates specifically with, and cleaves, the repeat during processing [20,22] and is sometimes cofunctional with different interference modules. The iCas complex recognizes repeat sequence elements at the ends of crRNA for DNA targeting, and the iCmr complex binds to the repeat sequence at the 5′ end of crRNAs targeting RNA [23,45,42]. The small PAM motif also differs in sequence for different CRISPR/Cas systems, and the motif is likely to be important for protospacer selection, for determining its orientation on insertion in CRISPR loci [16,19,31] and, at some level, to be important for DNA targeting [19,35,46]. Moreover, the length of the crRNA spacer sequence may influence the targeting and cleavage by the iCas module [42]. Taken together, there appear to be multiple sequence and structural constraints on modular exchange that are likely to be offset partly by the relatively conserved sequence at the leader-distal end of repeats. In support, putative examples of modular exchange, including those shown in Figure 4, exhibit fairly conserved repeat sequences, spacer sizes and predicted PAM motifs. These examples also show that on modular exchange the repeat invariably follows the aCas and not the iCas modules [17].

Natural dynamics of CRISPR loci

Changes can occur in CRISPR loci by a variety of mechanisms without compromising their overall viability. New spacer-repeat units are added, intermittently, at or near the repeat adjacent to the leader [16,19,28,29]. Moreover, comparative analyses of closely related archaeal species support: (i) the occurrence of large indels, generally deletions; (ii) duplication of sets of spacer-repeat units, and (iii) intracellular exchange of spacer-repeat units between CRISPR loci [12,16,28]. Changes can also be induced in CRISPR loci by invading genetic elements carrying, for example, essential metabolic genes or, possibly, toxin–antitoxin maintenance systems [35,47]. Such changes were demonstrated by challenging CRISPR loci of different Sulfolobus species with plasmids carrying matching protospacers and appropriate PAM motifs maintained under selection [35]. This resulted in loss of either CRISPR regions containing matching spacers or complete CRISPR/Cas systems. In S. islandicus, 50% of viable transformants had specifically lost the matching spacer-repeat unit, suggesting that feedback and interference of matching spacers might occur rarely, followed by recombinational repair via adjacent repeats or by slippage occurring during DNA replication [35]. Furthermore, some challenged spacers of S. solfataricus were inactivated by the direct insertion of insertion sequence (IS) elements [35]. Bioinformatic analyses have also provided support for spacers being inactivated by mutation of the bordering repeats, and this could generate defective crRNAs [48]. Thus the integrity of CRISPR loci can be compromised by many different mechanisms.

Anti-CRISPR mechanisms and defective CRISPR/Cas and Cmr modules

Specific ways in which archaeal viruses and plasmids might circumvent CRISPR systems remain speculative, including the observation that genomes of crenarchaeal rudiviruses and lipothrixviruses accrue 12 bp indels, probably deletions, when passed through different hosts [49]. However, given the complexity of many archaeal CRISPR systems, they are also vulnerable to mutation, rearrangements or transposition events [30,50]. The multiple transcriptional regulators present in many archaeal CRISPR systems (Figures 2 and 4) are obvious targets. For example, in an S. islandicus strain the putative provirus M164 is integrated into the gene for Csa3, the putative transcriptional regulator of the aCas gene cassette, but apparently leaves the pCas and iCas modules unaffected [12,17]. Moreover, bacteriophage EPV1 characterized in a metagenomics study encodes the proteobacterial transcriptional repressor H-NS [51] that can inactivate the entire E. coli CRISPR system [38]. Many archaeal systems lack core genes, and CRISPR loci sometimes lack leaders [16,30,50]. It remains unclear whether these defective modules can be complemented by Cas proteins of another module of a similar type within a given organism. S. solfataricus strains P1 and P2 carry a CRISPR locus E lacking an aCas module that is not complemented by aCas proteins associated with the phylogenetically similar CRISPR loci C and D, but this could reflect sequence differences in the leaders [16]. There might also be advantages, at least temporarily, in maintaining cofunctional processing and interference modules despite defective adaptation [26,28].

Genomic mobility

Comparative studies of the Sulfolobales indicated that CRISPR systems are invariably located in genomic regions variable in gene content and often rich in transposable elements [44,52]. Furthermore, nine genomes of closely related S. islandicus strains from different geographical locations carried two to four apparently viable combinations of different subfamilies of both CRISPR/Cas, and independent iCmr and iCsm modules, indicative of their having been transferred between strains [44]. Strong evidence for specific intergenomic transfer of CRISPR loci carried on larger chromosomal fragments is available for Pyrococcus and Sulfolobus strains [12,53] and for lactic acid bacteria [50]. Whether such exchange is common for all archaea remains unclear because for strains of S. solfataricus, more distantly related than those of S. islandicus, CRISPR/Cas systems have been largely retained and share many identical spacer sequences [12,16].

Given the potential for mobility of CRISPR systems, it was speculated that toxin–antitoxin systems, encoded near CRISPR loci, could help to stabilize the CRISPR genetic


Figure 5. A type III CRISPR system of the acidothermophile A. hospitalis carrying four interwoven antitoxin–toxin vapBC gene pairs that are highly divergent in sequence [52]. Functional module genes are color-coded as in Figure 2, and include genes of unknown function (grey). Numbers of repeats are indicated for each CRISPR locus (53, 13 and 9 in the figure).

systems within chromosomes [52]. An extreme example of this occurs in Acidianus hospitalis, a slowly growing organism carrying 26 vapBC antitoxin–toxin gene pairs, four of which are interwoven with the CRISPR/Cas/Csm system (Figure 5) and the fifth is associated with a separate CRISPR/Cas system [52]. The absence of any encoded VapB or VapC proteins with similar sequences in this organism is essential for the proposed capacity to maintain a CRISPR/Cas system when loss of the DNA region could lead to VapC-induced cell death [52].

Interdomain mobility

Genetic exchange between archaea and bacteria is restricted by many factors, including basic incompatibility of their virus–host interactions and radically different conjugative mechanisms [4,6,54]. Moreover, even after successful DNA exchange, basic differences in the mechanisms of transcriptional initiation and termination, and of translational initiation, would present formidable barriers to viable gene expression [55,56]. Furthermore, as argued above, many archaea have adapted to extreme low-energy environments where levels of bacterial cells are low or nonexistent. In an attempt to interpret the extent to which interdomain exchange has influenced the evolution of archaeal CRISPR systems, Markov clustering algorithm (MCL) techniques based on Cas1 sequences were used to compare phylogenetically the CRISPR/Cas systems of archaea and bacteria. The results support the absence of type II CRISPR/Cas systems in archaea and revealed clusters specific to, or strongly biased to, archaea or bacteria, with one cluster carrying multiple archaeal, predominantly methanoarchaeal, and bacterial members [17,18,57]. Qualitatively, the analysis suggests that interdomain exchange of aCas modules occurs rarely and then predominantly in environments where archaea and bacteria are both abundant.

There is limited evidence for homologous CRISPR-like mechanisms operating in eukaryotes. The RNA-targeting CRISPR/Cmr system shows some mechanistic similarity to RNAi, the viral RNA interference system of eukarya [14,23]. Moreover, DNA-targeting CRISPR/Cas systems, in general, share features of the Piwi/Argonaute-interacting (piRNA) system where RNA-encoding DNA accumulates passively in a small number of chromosomal loci. Transcripts from these loci are processed into small ssRNAs that complex with Piwi/Argonaute proteins and can inhibit DNA transposition activity [10,16,31]. Although early in evolution there may have been limited coevolution of these interference systems for all three domains, the archaeal and bacterial systems have clearly coevolved and interchanged to a significant degree, with the exception of the type II CRISPR/Cas system dependent on the bacteria-specific RNase III enzyme for processing [36].

Concluding remarks

One of the puzzles concerning archaeal CRISPR systems is why they are so diverse and complex. There are often multiple CRISPR loci within a given archaeon carrying hundreds of unique spacer sequences with multiple significant spacer matches to a given type of virus or conjugative plasmid [26,28,30,44]. Possibly the diversity and complexity reflects the large variety of different virus families characterized for extreme thermophiles, and to a lesser extent haloarchaea [4–6]. Another possibility is that, given their modular structures, and the diversity of their putative transcriptional regulators, the CRISPR systems may not necessarily eliminate genetic elements. For example, the immune systems might only be activated when replication or transcription of genetic elements reaches a certain level, consistent with many viruses being stably maintained at low copy-numbers within cells [4–6].

In addition to determining the detailed mechanistic roles of most of the core proteins, many uncharacterized CRISPR-related proteins remain, some of which are archaea-specific and are commonly associated with interference modules (Figure 4a), and these might generate diversity in targeting or cleavage mechanisms. Some unclassified CRISPR-related proteins are likely to have secondary roles, as suggested for the antitoxin–toxin system of A. hospitalis helping to stabilize CRISPR/Cas systems on chromosomes [52]. Function(s) of the iCmr and iCsm modules need to be examined more extensively in vivo to establish whether RNA viruses and/or transcripts of DNA viruses are targeted. Targeting of transcripts could be a means of regulating and stabilizing DNA viruses in vivo. At least for Sulfolobus species, robust genetic systems are now available to resolve these questions [35,40]. Questions remain as to whether crRNAs are selected for DNA or RNA targeting or whether any spacer RNA potentially can be used for either system, and to what extent Cas6 proteins are interchangeable between the different interference systems within a given organism [42]. Another pressing question is the extent to which defective functional modules are complemented by components of other CRISPR systems; a high priority will be to experimentally test a broad range of phylogenetically diverse CRISPR systems to establish the extent of their structural and functional diversity.

Acknowledgments

We thank Luciano Marraffini, Mark Young, Qunxin She and Malcolm White for helpful discussions and the referees for their constructive input. Research was supported by the Danish Natural Science Research Council.

References

1 Gribaldo, S. et al. (2010) The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse? Nat. Rev. Microbiol. 8, 743–752


2 Kurland, C.G. et al. (2006) Genomics and the irreducible nature of eukaryotic cells. Science 312, 1011–1014
3 Valentine, D.L. (2007) Adaptations to energy stress dictate the ecology and evolution of archaea. Nat. Rev. Microbiol. 5, 316–323
4 Prangishvili, D. et al. (2006) Viruses of the Archaea: a unifying view. Nat. Rev. Microbiol. 11, 837–848
5 Porter, K. et al. (2007) Virus–host interactions in salt lakes. Curr. Opin. Microbiol. 10, 418–424
6 Lawrence, C.M. et al. (2009) Structural and functional studies of archaeal viruses. J. Biol. Chem. 284, 12599–12603
7 Snyder, J.C. et al. (2010) Use of cellular CRISPR (clusters of regularly interspaced short palindromic repeats) spacer-based microarrays for detection of viruses in environmental samples. Appl. Environ. Microbiol. 76, 7251–7258
8 Brumfield, S.K. et al. (2009) Particle assembly and ultrastructural features associated with replication of the lytic archaeal virus Sulfolobus turreted icosahedral virus. J. Virol. 83, 5964–5970
9 Bize, A. et al. (2009) A unique virus release mechanism in the archaea. Proc. Natl. Acad. Sci. U.S.A. 106, 11306–11311
10 Karginov, F.V. and Hannon, G.J. (2010) The CRISPR system: small RNA-guided defense in bacteria and archaea. Mol. Cell 37, 7–19
11 Terns, M.P. and Terns, R.M. (2011) CRISPR-based adaptive immune systems. Curr. Opin. Microbiol. 14, 1–7
12 Garrett, R.A. et al. (2011) CRISPR-based immune systems of the Sulfolobales: complexity and diversity. Biochem. Soc. Trans. 39, 51–57
13 Prangishvili, D. et al. (1998) Conjugation in archaea: frequent occurrence of conjugative plasmids in Sulfolobus. Plasmid 40, 190–202
14 Makarova, K.S. et al. (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol. Direct 1, 7
15 Makarova, K.S. et al. (2011) Evolution and classification of the CRISPR-Cas systems. Nat. Rev. Microbiol. 9, 467–477
16 Lillestøl, R.K. et al. (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol. Microbiol. 72, 259–272
17 Shah, S.A. and Garrett, R.A. (2011) CRISPR/Cas and Cmr modules, mobility and evolution of adaptive immune systems. Res. Microbiol. 162, 27–38
18 Shah, S.A. et al. (2011) CRISPR/Cas and CRISPR/Cmr immune systems of archaea. In Regulatory RNAs in Prokaryotes (Marchfelder, A. and Hess, W., eds), pp. 163–181, Springer Press
19 Deveau, H. et al. (2008) Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. J. Bacteriol. 190, 1390–1400
20 Brouns, S.J. et al. (2008) Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, 960–964
21 Hale, C. et al. (2008) Prokaryotic silencing (psi)RNAs in Pyrococcus furiosus. RNA 14, 1–8
22 Carte, J. et al. (2010) Binding and cleavage of CRISPR RNA by Cas6. RNA 16, 2181–2188
23 Hale, C.R. et al. (2009) RNA-guided RNA cleavage by a CRISPR RNA–Cas protein complex. Cell 139, 945–956
24 Wang, R. et al. (2011) Interaction of Cas6 riboendonuclease with CRISPR RNAs: recognition and cleavage. Structure 19, 257–264
25 Garneau, J.E. et al. (2010) The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 468, 67–71
26 Shah, S.A. et al. (2009) Distributions of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem. Soc. Trans. 37, 23–28
27 Barrangou, R. et al. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712
28 Lillestøl, R.K. et al. (2006) A putative viral defence mechanism in archaeal cells. Archaea 2, 59–72
29 Held, N.L. et al. (2010) CRISPR associated diversity within a population of Sulfolobus islandicus. PLoS ONE 5, e12988
30 Andersson, A.F. and Banfield, J.F. (2008) Virus population dynamics and acquired resistance in natural microbial communities. Science 320, 1047–1049
31 Mojica, F.J. et al. (2009) Short motif sequences determine the targets of the prokaryotic CRISPR system. Microbiology 155, 733–740
32 Tang, T-H. et al. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc. Natl. Acad. Sci. U.S.A. 99, 7536–7541
33 Haurwitz, R.E. et al. (2010) Sequence and structure-specific RNA processing by a CRISPR endonuclease. Science 10, 1355–1358
34 Tang, T-H. et al. (2005) Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol. Microbiol. 55, 469–481
35 Gudbergsdottir, S. et al. (2011) Dynamic properties of the Sulfolobus CRISPR/Cas and CRISPR/Cmr systems when challenged with vector-borne viral and plasmid genes and protospacers. Mol. Microbiol. 7, 35–49
36 Deltcheva, E. et al. (2011) CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 471, 602–607
37 Lykke-Andersen, J. et al. (1997) Archaeal introns: splicing, intercellular mobility and evolution. Trends Biochem. Sci. 22, 326–331
38 Pul, U. et al. (2010) Identification and characterisation of E. coli CRISPR-cas promoters and their silencing by H-NS. Mol. Microbiol. 75, 1495–1512
39 Agari, Y. et al. (2011) Transcription profile of Thermus thermophilus CRISPR systems after phage infection. J. Mol. Biol. 395, 270–281
40 Manica, A. et al. (2011) In vitro activity of CRISPR-mediated virus defence in a hyperthermophilic archaeon. Mol. Microbiol. 80, 481–491
41 Marraffini, L.A. and Sontheimer, E.J. (2008) CRISPR interference limits horizontal gene transfer in Staphylococci by targeting DNA. Science 322, 1843–1845
42 Lindtner, N.G. et al. (2011) Structural and functional characterisation of an archaeal CASCADE complex for CRISPR-mediated viral defense. J. Biol. Chem. 85, 6287–6292
43 Semenova, E. et al. (2011) Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc. Natl. Acad. Sci. U.S.A. 108, 10098–10103
44 Guo, L. et al. (2011) Genome analyses of Icelandic strains of Sulfolobus islandicus, model organisms for genetic and virus–host interaction studies. J. Bacteriol. 193, 1672–1680
45 Jore, M.M. et al. (2011) Structural basis of CRISPR-guided RNA recognition. Nat. Struct. Mol. Biol. 18, 529–537
46 Marraffini, L.A. and Sontheimer, E.J. (2010) Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, 568–571
47 Dyall-Smith, M. (2011) Dangerous weapons: a cautionary tale of CRISPR defence. Mol. Microbiol. 79, 3–6
48 Stern, A. et al. (2010) Self-targeting by CRISPR: gene regulation or autoimmunity? Trends Genet. 26, 335–340
49 Vestergaard, G. et al. (2008) SRV, a new rudiviral isolate from Stygiolobus and the interplay of crenarchaeal rudiviruses with the host viral-defence CRISPR system. J. Bacteriol. 190, 6837–6845
50 Horvath et al. (2009) Comparative analysis of CRISPR loci in lactic acid bacteria genomes. Int. J. Food Microbiol. 131, 62–70
51 Skennerton, C.T. et al. (2011) Phage encoded H-NS: a potential achilles heel in the bacterial defence system. PLoS ONE 6, e20095
52 You, X-Y. et al. (2011) Genomic studies of Acidianus hospitalis W1 a crenarchaeal host for studying virus and plasmid life cycles. Extremophiles 15, 487–497
53 Portillo, M.C. and Gonzalez, J.M. (2009) CRISPR elements in the Thermococcales: evidence for associated horizontal gene transfer in Pyrococcus furiosus. J. Appl. Genet. 50, 421–430
54 Greve et al. (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1, 231–239
55 Torarinsson, E. et al. (2005) Divergent transcriptional and translational signals in Archaea. Environ. Microbiol. 7, 47–54
56 Santangelo, T.J. et al. (2009) Archaeal intrinsic transcription termination in vivo. J. Bacteriol. 191, 7102–7108
57 Haft, D.H. et al. (2005) A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput. Biol. 1, 474–483
58 Kunin, V. et al. (2007) Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol. 8, R611–R617
59 Wiedenheft, B. et al. (2009) Structural base for DNase activity of a conserved protein implicated in CRISPR-mediated genome defense. Structure 17, 904–912
60 Beloglazova, N. et al. (2008) A novel family of sequence-specific endoribonucleases associated with clustered regularly interspaced short palindromic repeats. J. Biol. Chem. 283, 20361–20371
61 Sinkunas, T. et al. (2011) Cas3 is a single-stranded DNA nuclease and ATP-dependent helicase in the CRISPR/Cas immune system. EMBO J. 30, 1335–1342

Contribution: This chapter was written by Professor Roger A. Garrett. The substantial bioinformatical analyses pertaining to the parts of the chapter containing original data were conducted by myself and Dr. Gisle A. Vestergaard.

Chapter 10

CRISPR/Cas and CRISPR/Cmr Immune Systems of Archaea

Shiraz A. Shah, Gisle Vestergaard and Roger A. Garrett*

1 Introduction

The CRISPR/Cas (Clustered Regularly Interspaced Short Palindromic Repeats/CRISPR-Associated Genes) and CRISPR/Cmr systems (Cmr: Cas module-RAMP (Repeat-Associated Mysterious Proteins)) provide the basis for adaptive and hereditable immune responses directed against the DNA and RNA, respectively, of invading elements. The former consists of CRISPR loci physically linked to a cassette of cas genes which together appear to constitute integral genetic modules. cmr genes, clustered in Cmr modules, are sometimes physically linked to CRISPR/Cas modules. The CRISPR/Cas immune system occurs in almost all archaea and about 40% of bacteria. Cmr modules are less common, occurring in only about one third of genomes carrying CRISPR/Cas modules. An outline of how the CRISPR/Cas and CRISPR/Cmr systems function is indicated in Figure 1 where the former targets DNA and the latter RNA (mRNA and/or viral RNA) of the genetic elements.

Archaeal CRISPR loci consist of clusters of spacer-repeat units varying in size from one to more than one hundred spacer-repeat units where each unit is about 60–90 bp with repeats and spacers of, on average, 30 bp and 40 bp, respectively (Lillestøl et al., 2006; Grissa et al., 2008). CRISPR loci are preceded by a non-protein-coding leader region which varies in size from about 150 to 550 bp and is invariably physically linked to a cas gene cassette (Jansen et al., 2002; Haft et al., 2005; Makarova et al., 2006; Lillestøl et al., 2006; Lillestøl et al., 2009). Cas and Cmr proteins, involved in the two different targeting pathways, are functionally and phylogenetically diverse. The CRISPR/Cas system specifically targets DNA elements (Marraffini and Sontheimer, 2008; Shah et al., 2009) while the CRISPR/Cmr system targets RNA, although whether mRNA and/or viral RNA remains unclear (Hale et al., 2009). CRISPR/Cas modules have been classified into families on the basis of sequences of their cas genes, leaders and repeats. Although these modules show a capacity for transfer between phyla of the archaeal and bacterial Domains, and supposedly rarely across Domain boundaries, archaea-specific features are nevertheless apparent.

Fig. 1. Diagram illustrating how CRISPR/Cas and CRISPR/Cmr systems target genetic elements invading a host cell. crRNAs are processed from whole transcripts of CRISPR loci. For the CRISPR/Cas system Cas proteins complex with the crRNA and guide it to the complementary protospacer sequence in the invading DNA element where they anneal prior to DNA degradation. Cmr proteins also complex with crRNA and guide them to either mRNA or viral RNA, targeting them for degradation.

Crucial for the functioning of the immune systems are the spacer sequences which derive from foreign invading elements (Mojica et al., 2005; Pourcel et al., 2005; Bolotin et al., 2005; Lillestøl et al., 2006; Barrangou et al., 2007). The CRISPR loci generate whole transcripts which initiate within the leader sequence adjacent to the first repeat (Lillestøl et al., 2009). These are subsequently processed in their repeat regions yielding end-products that constitute single spacer-containing crRNAs (Tang et al., 2002; Tang et al., 2005; Lillestøl et al., 2006; Lillestøl et al., 2009). Processing is effected by specific Cas or Cmr proteins and, at least for the latter, two discrete archaeal crRNAs are produced each carrying 8 nt of the repeat at the 5’-end and lacking 5 nt or 11 nt from the 3’-end of each spacer (Carte et al., 2008; Hale et al., 2009). Complexes of Cas or Cmr proteins transport the processed crRNAs to target, and inactivate, DNA or RNA, respectively, of invading genetic elements (Brouns et al., 2008; Hale et al., 2008; Carte et al., 2008; Hale et al., 2009). Base pairing mismatches occurring between the 5’ 8 nt repeat sequence of the crRNA and the Protospacer-Associated Motif (PAM) sequence adjacent to the targeted protospacer of the invading DNA are essential for subsequent degradation of the latter and for ensuring that the chromosomal CRISPR locus, itself, is not targeted (Horvath and Barrangou, 2010; Marraffini and Sontheimer, 2010; Lillestøl et al., 2009; Gudbergsdottir et al., 2010).
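The self versus non-self rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8 nt handle and flanking sequences are made up, the flank is written 3'→5' so that it faces the handle position by position, and the threshold of one mismatch is a hypothetical parameter, since the text says only that mismatches are required.

```python
# Watson-Crick partner on a DNA strand for each RNA base of the crRNA handle.
RNA_TO_DNA_PARTNER = {"A": "T", "U": "A", "G": "C", "C": "G"}


def count_handle_mismatches(handle_rna: str, flank_dna: str) -> int:
    """Count positions where the crRNA 5' handle cannot pair with the flank.

    handle_rna: repeat-derived 5' handle of the crRNA, written 5'->3'.
    flank_dna:  DNA flanking the protospacer on the strand that pairs with
                the crRNA, written 3'->5' (assumed orientation).
    """
    if len(handle_rna) != len(flank_dna):
        raise ValueError("handle and flank must have the same length")
    return sum(
        RNA_TO_DNA_PARTNER[r] != d
        for r, d in zip(handle_rna.upper(), flank_dna.upper())
    )


def is_potential_target(handle_rna: str, flank_dna: str,
                        min_mismatches: int = 1) -> bool:
    """True if the flank mismatches the handle, as expected for invading DNA.

    min_mismatches is a hypothetical threshold chosen for illustration.
    """
    return count_handle_mismatches(handle_rna, flank_dna) >= min_mismatches


if __name__ == "__main__":
    handle = "AUUGAAAG"        # made-up 8 nt handle
    self_flank = "TAACTTTC"    # repeat in the CRISPR locus: pairs fully
    foreign_flank = "TAACTTGG" # PAM-side flank in invading DNA: two mismatches
    print(is_potential_target(handle, self_flank))     # fully paired flank
    print(is_potential_target(handle, foreign_flank))  # mismatched flank
```

In the host CRISPR locus the spacer is flanked by the repeat itself, so the handle pairs along its full length and the locus is spared, whereas a flank that mismatches the handle marks the DNA as a potential target.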

2 Archaeal Viruses and Plasmids and Chromosomal Evolution

Although few comprehensive studies have been performed on the relative abundance of different virus-like particle (VLP) morphotypes in archaea-rich environments, available results indicate that spindles, filaments, rods and spheres predominate in terrestrial hot springs and hydrothermal vents, while spindle-shaped and spherical VLPs prevail in hypersaline environments (Rachel et al., 2002; Porter et al., 2007; Bize et al., 2008). Bacteriophage-like head-tail VLPs are found infrequently, although their proviruses have been detected in a few halo- and methanoarchaeal genomes (Porter et al., 2007; Krupovic et al., 2010).

Several viruses, mainly from terrestrial hot springs, have been classified into eight new archaeal viral families and examples of their diverse morphotypes are illustrated in Figure 2. Other viruses, including several haloarchaeal viruses, remain to be classified (Porter et al., 2007). The latter process is complicated by the absence of a consistent relationship between morphology and genomic properties for euryarchaeal and crenarchaeal viruses. Overall these discoveries underline the major differences between the archaeal and bacterial virospheres (Prangishvili et al., 2006a; Lawrence et al., 2009).

Archaeal viral genomes fall in the size range 15 to 75 kb dsDNA and are circular or linear. Some linear genomes have free ends whereas others, including those of rudiviruses and some lipothrixviruses, have modified ends or are covalently closed and some genomes carry base-specific modifications (Zillig et al., 1998; Peng et al., 2001). Consistent with the unusual and sometimes unique viral morphologies (Figure 2), the viral genomes yielded very few significant sequence matches with genes in public sequence databases (Prangishvili et al., 2006b). These results are summarised in histograms of the major hyperthermophilic crenarchaeal viruses in Figure 3 where a large percentage of the genes are classified as unique for each virus. The most extreme case was for genes of the thermoneutrophilic virus PSV which yielded almost no significant sequence matches in the original study (Bettstetter et al., 2003).

Fig. 2. Typical morphologies of representatives of different archaeal viral families. a, SNDV; b, STSV1; c, ATV; d, SIFV; e, AFV1; f, PSV; g, SSV4; h, ARV1. Bars are 100 nm.

With the availability of an increasing number of archaeal genome sequences, it has become clear that archaeal viruses and plasmids have played a major role in the evolution of host genomes. This process has apparently been fuelled by the entrapment of foreign DNA elements in host chromosomes via an archaea-specific integrative process. Many archaeal integrase genes partition on integration such that, if the free form of the element is lost, the integrase will not be expressed and cannot effect excision of the genetic element from the chromosome (She et al., 2001). Many of the captured elements are recognisable as intact or degenerate genetic entities, and Markov-model analyses of whole archaeal genomes suggest that such genes of viral or plasmid origin contribute disproportionately to the genes of unknown function in archaeal chromosomes (Cortez et al., 2009).

Archaeal viruses and plasmids have also evolved complex relationships as dependents or antagonists. Thus, in the presence of a fusellovirus, the pRN family plasmids pSSVx and pSSVi are packaged into fusellovirus-like particles and spread through Sulfolobus host cultures as satellite viruses (Arnold et al., 1999; Wang et al., 2007). In contrast, when a strain of Acidianus hospitalis carrying the conjugative plasmid pAH1 was infected with the lipothrixvirus AFV1, plasmid replication appeared to be inhibited (Basta et al., 2009). Moreover, as mentioned below, Sulfolobus conjugative plasmids pNOB8 and pKEF9 carry CRISPR loci which may directly target and inactivate archaeal viruses (She et al., 1998; Greve et al., 2004).

Fig. 3. Histogram showing a summary of archaeal viral gene homologies to other viruses (virus only genes) and cellular chromosomes (cellular); unique indicates no detectable homologs. Homologs in closely related viruses, including the rudiviruses ARV1 and SIRV1 and the spherical viruses PSV and TTSV1, are not included (Prangishvili et al., 2006b).

3 Diversity of Archaeal CRISPR/Cas and CRISPR/Cmr Immune Systems

Bioinformatic analyses have demonstrated that homologs of a few core Cas proteins occur widely throughout the archaeal and bacterial domains while others occur less commonly and some are predominantly archaeal or bacterial in character. Core gene sets typify the cas and cmr gene cassettes (Figure 4). For the former, the cas genes fall into groups 1 and 2. This division is based on different factors including co-occurrence, co-regulation and synteny of the genes and, possibly, functional differences for the groups of proteins (see below). The cas6 gene can occur in either group and is likely to be cofunctional with both CRISPR/Cas and CRISPR/Cmr systems (Hale et al., 2009). For the cmr cassette, the two most conserved genes cmr2 and cmr5 are interspersed with genes encoding diverse RAMP motif-containing proteins (Figure 4B).

It has also been shown that there is a consistent phylogenetic linkage between sequences of selected Cas proteins and CRISPR locus repeats for archaea and bacteria (Haft et al., 2005; Makarova et al., 2006; Kunin et al., 2007; Shah et al., 2009). Furthermore, for the Sulfolobales, a broader analysis of sequences of repeats, leader regions and Cas1 proteins demonstrated that the CRISPR/Cas modules could be classified into distinct CRISPR/Cas families I to IV (Lillestøl et al., 2009; Shah and Garrett, 2010), which are components of an earlier, more broadly defined group of families CASS1 + 5 + 6 + 7 from archaea and bacteria (Haft et al., 2005; Makarova et al., 2006). Spatial distributions of all the archaeal and bacterial families are illustrated in Figure 5A using a Markov clustering approach based on Cas1 protein sequences. Whereas the crenarchaeal families I, II and III tend to cluster separately, the archaeal family IV sequences, which derive mainly from mesophilic euryarchaea, fall together with a family of bacterial sequences (in green).
A closely similar spatial distribution is also observed when crenarchaeal families I to IV are clustered on the basis of their leader sequences (Figure 5B). Clearly, there are other archaea-specific families (strongly coloured in Figure 5A) which remain to be analysed and classified.

Family I CRISPR/Cas modules are the most common amongst the Sulfolobales and other crenarchaea, and the most conserved in structural organisation. The two conserved groups of cas genes are located between the leaders and externally at one end of the module. The separation may be functionally significant, with the former involved in processing and insertion of DNA spacer-repeat units and the latter encoding RNA processing and effector proteins (Shah and Garrett, 2010).

Fig. 4. Gene maps of A. cas cassettes and B. a Cmr module showing only conserved core genes. Many other genes that occur less frequently are not included. The cas genes are divided into two groups 1 and 2 (see text). The Cmr module contains the highly conserved cmr2 and cmr5 genes and genes a to e, shaded grey, which correspond to different genes encoding RAMP motif-containing proteins which are present in 3 to 5 copies in the different Cmr module families (Garrett et al., 2010b).

Fig. 5. CRISPR/Cas modules can be divided into families based on their unique characteristics, including the Cas1 protein sequence and nucleotide sequences of the repeat and leader regions. a) Spheres represent Cas1 protein sequences from different organisms. Small distances between spheres reflect higher sequence similarity between them. All Cas1 sequences that are currently publicly available are represented. Markov clustering reveals that all the sequences fall within about 20 families (each coloured differently), 5 of which are very large. Strongly coloured spheres represent archaeal Cas1 sequences while bacterial sequences are shown in faded colours. It is evident that some families are specific to bacteria, whereas others are archaea-specific. A few CRISPR/Cas families are shared between both archaea and bacteria. Sulfolobales families I–IV are marked (Lillestøl et al., 2009) and others remain to be formally classified. An earlier broader classification, CASS1 to 7, is also included (Haft et al., 2005; Makarova et al., 2006). b) Leaders from the Sulfolobales are clustered based on their sequence similarities and they fall into the same group of families (I–IV) as those found for the Cas1 proteins, and a similar result is obtained when repeat sequences are clustered (Lillestøl et al., 2009).
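The Markov clustering used for Figure 5A can be illustrated with a minimal sketch. It assumes a small, hypothetical matrix of pairwise Cas1 similarity scores (the real analysis used sequence comparisons of all publicly available Cas1 proteins) and implements the generic expansion and inflation steps of Markov clustering, not the specific software or parameters used in the original study. The function name `markov_cluster` and the values of `inflation`, `iterations` and `threshold` are illustrative choices.

```python
def markov_cluster(similarity, inflation=2.0, iterations=30, threshold=1e-3):
    """Group items by Markov clustering of a symmetric similarity matrix."""
    n = len(similarity)
    # Add self-loops so that every column has a non-zero sum.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def normalise_columns(matrix):
        # Rescale each column to sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= total

    normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, which favours strong flows.
        m = [[value ** inflation for value in row] for row in m]
        normalise_columns(m)

    # Items joined by any remaining flow end up in the same cluster.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical similarity scores for five Cas1 sequences (0 = no similarity).
SIMILARITY = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 0.0, 0.85, 0.0, 0.0],
    [0.8, 0.85, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.7, 0.0],
]
families = markov_cluster(SIMILARITY)
```

In this toy input the first three sequences form one family and the last two another; with real data, the inflation value controls how finely families are split.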

The methanoarchaea and haloarchaea, which carry the majority of the family IV CRISPR/Cas modules, show the least conservation in their cas gene contents. In particular, their group 2 cas genes range from those typical of crenarchaea to those common amongst bacteria. Putative genetic exchange between archaea and bacteria has generally been attributed to the methanoarchaea and haloarchaea thriving in environments rich in bacteria.

Cmr modules invariably coexist with, and are sometimes physically linked to, CRISPR/Cas modules but they occur less widely than the latter. For archaea they are found in about 70 % of genomes carrying CRISPR/Cas modules, a higher prevalence than for bacterial genomes carrying CRISPR/Cas modules (about 30 %). Both CRISPR/Cas modules and Cmr modules frequently occur in multiple copies in a given archaeal genome. The cmr genes are mainly co-transcribed and their protein products have been implicated in processing of crRNAs and in the guiding of crRNAs to target RNA of invading genetic elements; whether the targets are viral RNA, transcripts, or both remains unclear (Hale et al., 2009).

Comparison of phylogenetic trees for the CRISPR/Cas and Cmr modules, based on archaeal and bacterial sequences of Cas1 and the Cmr2 protein, and its homologs Csm1 and Csx11, revealed five major families of Cmr modules, named A to E, showing distinctive gene syntenies (Garrett et al., 2010b). Given that Cmr and CRISPR/Cas modules are sometimes physically linked and can potentially be mobilised as a unit, and that they have to recognise similar CRISPR repeat sequences, it is likely that some degree of coevolution has occurred. In support of this idea, there are many examples of family II CRISPR/Cas modules coexisting with family D Cmr modules amongst the Sulfolobales and this relationship extends to other archaea including, for example, the euryarchaeon Methanospirillum hungatei.

Sizes of CRISPR loci vary from a single spacer bordered by repeats to more than 100 spacer-repeat units (Lillestøl et al., 2006; Grissa et al., 2008). New spacer-repeat units are added at the leader-repeat junction. The CRISPR loci also undergo deletions of one to several spacer-repeat units, probably via recombination at the direct repeats, without impairing the overall CRISPR/Cas functionality. Moreover, there are also putative examples of duplications of spacer-repeat units, or small groups thereof, and of exchange between CRISPR loci within a genome (Lillestøl et al., 2006; Lillestøl et al., 2009; Shah and Garrett, 2010; Gudbergsdottir et al., 2010).
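The organisation of a CRISPR locus described above (leader, then alternating repeats and spacers, with additions at the leader end and internal deletions) can be sketched as a simple data model. The repeat, leader and spacer sequences below are placeholders invented for illustration, and the function names are hypothetical; the sketch assumes exact, identical repeat copies, which real loci do not always have.

```python
# Placeholder sequences, invented for illustration only.
REPEAT = "GATAATCTCTTATAGAATTGAAAG"
LEADER = "TTTTTTTT"


def build_locus(leader, repeat, spacers):
    """Leader, first repeat, then one spacer-repeat unit per spacer."""
    return leader + repeat + "".join(spacer + repeat for spacer in spacers)


def extract_spacers(locus, repeat):
    """Return the sequences lying between consecutive exact repeat copies."""
    parts = locus.split(repeat)
    # parts[0] is the leader-side flank and parts[-1] the downstream flank.
    return parts[1:-1]


def add_spacer(spacers, new_spacer):
    """New units are inserted at the leader-repeat junction (position 0)."""
    return [new_spacer] + spacers


def delete_units(spacers, start, count):
    """Model loss of `count` units by recombination between direct repeats."""
    return spacers[:start] + spacers[start + count:]


spacers = ["ACCGTACC", "CCATGCAA", "AACCGGTT"]
locus = build_locus(LEADER, REPEAT, spacers)
recovered = extract_spacers(locus, REPEAT)
after_uptake = add_spacer(recovered, "CCTTAACC")
after_deletion = delete_units(after_uptake, 2, 1)
```

Because deletion happens between identical repeats, removing a unit leaves a locus that is still a valid alternation of repeats and spacers, which is consistent with the statement that deletions need not impair function.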

4 Development and Stability of CRISPR Loci

CRISPR loci generally appear to be quite stable, gradually adding spacer-repeat units at the junction with the leader, albeit at different rates for different loci within an organism. There is also a compensatory mechanism for gradual loss of internal spacers which probably involves recombination between the identical direct repeats of a given locus, and occasionally between loci carrying identical repeats (Lillestøl et al., 2009; Shah and Garrett, 2010). A specific example of such changes is illustrated in Figure 6, showing the pairwise alignments of CRISPR locus A of Sulfolobus solfataricus strains P1, P2 and 98/2 where shared spacers are shaded, as well as spacers added adjacent to the leader region after these strains diverged. The pattern of shared spacers for each pair of organisms demonstrates that strain 98/2 separated prior to the divergence of strains P1 and P2, which carry more spacers in common. Those spacers which show significant matches to known genetic elements are also colour-coded (Figure 6A,B), indicating a wide variety of matches especially to rudiviruses, bicaudaviruses and conjugative plasmids (Lillestøl et al., 2009).

Earlier evidence suggested that CRISPR loci were strongly resistant to integrative events (Lillestøl et al., 2006). For example, the three S. solfataricus strains P1, P2 and 98/2 carry multiple large CRISPR loci in addition to locus A in Figure 6. They are also extremely rich in active transposable elements (about 350 in strain P2) which have contributed to extensive genome shuffling (Brügger et al., 2004), but no IS insertions were detected in the extensive CRISPR loci (Lillestøl et al., 2009; Shah and Garrett, 2010). Thus, although IS elements do occasionally occur intergenically in the cas and cmr gene clusters, there appears to be a strong selective pressure to maintain the integrity of CRISPR loci, which are essential for the function of both CRISPR/Cas and CRISPR/Cmr systems. Whether this is a general rule for archaea or is dependent on environmental conditions, including the levels of viruses and plasmids present, is unclear. A different picture has emerged from bacterial studies. For example, in a biofilm carrying acidophilic Leptospirillum group II bacteria, about 20 % of the partially sequenced CRISPR loci contained IS elements (Tyson and Banfield, 2008).

Many archaeal and bacterial chromosomes, with or without CRISPR/Cas modules, carry short CRISPR-like clusters lacking associated leader regions and cas genes (Grissa et al., 2008). Although their origin(s) remain unknown, they may have separated from intact CRISPR/Cas modules, possibly via transposable elements. If preceded by promoters their transcripts can, in principle, be processed and activated. Such CRISPR loci are present in Sulfolobus conjugative plasmids pNOB8 and pKEF9 (She et al., 1998; Greve et al., 2004) and, at least for the latter, the spacer-repeat cluster is transcribed and the RNA processed in a S. solfataricus host, suggesting that at least some of these small clusters can be activated and functional if complementary Cas or Cmr proteins are present (Lillestøl et al., 2009; Shah and Garrett, 2010).

Fig. 6. Pairwise comparison of the spacer-repeat units of CRISPR A locus of three closely related strains of S. solfataricus P1, P2 and 98/2. Shaded regions indicate identical spacer-repeat units shared by two CRISPR loci. Colour-coded spacer-repeat units indicate that spacers have significant sequence matches to the viruses or plasmid families indicated on the Figure.
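The reasoning applied to Figure 6, that strains sharing more spacers diverged more recently and that leader-proximal unshared spacers were added after divergence, can be sketched as follows. The spacer identifiers and their order are invented for illustration and do not reproduce the actual locus A content of strains P1, P2 and 98/2; the function names are hypothetical.

```python
from itertools import combinations

# Spacer identifiers, listed from the leader end; invented for illustration.
LOCI = {
    "P1": ["p1_new1", "s5", "s4", "s3", "s2", "s1"],
    "P2": ["p2_new1", "p2_new2", "s5", "s4", "s2", "s1"],  # s3 lost internally
    "98/2": ["x1", "x2", "s2", "s1"],
}


def shared_spacers(locus_a, locus_b):
    """Spacers of locus_a that also occur in locus_b, in locus_a order."""
    in_b = set(locus_b)
    return [spacer for spacer in locus_a if spacer in in_b]


def added_since_divergence(locus_a, locus_b):
    """Leader-proximal spacers of locus_a preceding the first shared spacer."""
    in_b = set(locus_b)
    added = []
    for spacer in locus_a:
        if spacer in in_b:
            break
        added.append(spacer)
    return added


# Count shared spacers for every pair of strains.
pair_counts = {
    (a, b): len(shared_spacers(LOCI[a], LOCI[b]))
    for a, b in combinations(LOCI, 2)
}
# The pair sharing the most spacers is taken as the most recently diverged.
closest_pair = max(pair_counts, key=pair_counts.get)
```

With these toy loci, P1 and P2 share four spacers while each shares only two with 98/2, mirroring the inference in the text that 98/2 separated first.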

5 Mobility of CRISPR/Cas and Cmr Modules

Genomic analyses of closely related Sulfolobus species have provided strong evidence for CRISPR/Cas modules being mobilised, given that they occur at different genomic positions even when there is a high level of gene synteny and they are generally confined to the variable genetic regions (Shah and Garrett, 2010). Their ability to transfer between organisms is also supported by the different combinations of CRISPR/Cas families found in closely related organisms (Lillestøl et al., 2009; Shah and Garrett, 2010). For example, in S. islandicus strains HVE10/4 and REY15A, the former carries family I and III CRISPR/Cas modules and one Cmr module, while the latter exhibits a family I CRISPR/Cas module and two family B Cmr modules (Shah and Garrett, 2010; Guo et al., 2010). Further support for such transfer was provided by analysis of the Pyrococcus furiosus genome, where a 155 kb fragment, bordered by a CRISPR locus and a repeat, showed significantly different properties of G+C content, third codon position and codon usage from the rest of the genome (Portillo and Gonzalez, 2009).

Evidence for gene exchange within the CRISPR/Cas modules derived from examination of the structural integrities of the paired family I CRISPR/Cas modules of several closely related Sulfolobus strains. The results indicated that the internal group 1 cas genes, which are functionally implicated in spacer addition at the leader-repeat junction (Figure 4), seem to coevolve, and be mobilised, with the CRISPR locus whereas the group 2 cas genes, putatively involved in RNA processing and crRNA mobility (Figure 4), were retained within the strains, suggesting that some exchange within cas gene cassettes can occur (Shah and Garrett, 2010).

The mechanism(s) of transfer of CRISPR/Cas modules, varying in size from about 7 kb to 25 kb, remains unclear. The larger CRISPR/Cas modules, at least, may be too large to be borne on plasmids, as has been proposed for bacteria (Godde and Bickerton, 2006). At least for the crenarchaea, genetic elements are relatively small and, although small CRISPR loci have been detected in crenarchaeal conjugative plasmids, transfer is more likely to result from chromosomal conjugation, which may well be facilitated by integrated conjugative plasmids (Lillestøl et al., 2009).
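Compositional evidence of the kind cited for the P. furiosus fragment (atypical G+C content and third codon position composition) can be illustrated with a short sketch. The sequences, window size and tolerance are invented for illustration, the function names are hypothetical, and this is not the method of Portillo and Gonzalez (2009), only a generic window scan.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc3(coding_sequence):
    """G+C fraction at third codon positions of an in-frame coding sequence."""
    return gc_content(coding_sequence[2::3])


def atypical_windows(genome, size, step, tolerance):
    """Start positions of windows whose G+C differs from the genome average."""
    overall = gc_content(genome)
    starts = range(0, len(genome) - size + 1, step)
    return [
        start for start in starts
        if abs(gc_content(genome[start:start + size]) - overall) > tolerance
    ]


# Toy genome: an AT-rich background with a GC-rich insert at positions 100-139.
toy_genome = "AT" * 50 + "GC" * 20 + "AT" * 50
flagged = atypical_windows(toy_genome, size=40, step=20, tolerance=0.5)
```

A real analysis would also compare codon usage tables and would need a statistical criterion for the tolerance rather than the fixed value assumed here.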

6 Targets of the CRISPR/Cas and CRISPR/Cmr Systems

Bioinformatic evidence indicated that the spacer crRNAs carrying significant sequence matches to the protospacer sequence were complementary to either strand of genes, implying that they were not exclusively targeting mRNAs (Lillestøl et al., 2006). Moreover, extensive analyses of significant matches to the many known viruses and plasmids of the Sulfolobales revealed several matches to protospacers lying between genes. They demonstrated, further, that the locations of the protospacers were randomly distributed along, and on either strand of, the genetic elements. This is illustrated in Figure 7 for five crenarchaeal viruses and two plasmids, where the positions of the significant matches are shown in relation to the annotated gene locations. A similar conclusion that DNA, and not mRNA, was targeted by the CRISPR/Cas system of the bacterium Staphylococcus epidermidis was reached experimentally (Marraffini and Sontheimer, 2008). However, more recently it was demonstrated that crRNAs complexed with Cmr proteins target RNA carrying matching protospacers (Hale et al., 2009), but it is still unclear whether this includes both mRNAs and viral RNAs. For archaea, this will only be resolved when the first archaeal RNA viruses have been characterised.

A few sequence matches have been detected between archaeal CRISPR spacers and IS elements, suggesting that the CRISPR/Cas system can target transposable elements (Lillestøl et al., 2006; Held and Whitaker, 2009; Mojica et al., 2009; Shah et al., 2009). However, most of those reported can be attributed to transposase genes carried on viral genomes or plasmids, including, for example, spacer matches to each of the four transposase genes of the bicaudavirus ATV (Figure 7). These transposase genes/IS elements are presumably indistinguishable from any other viral/plasmid genomic target if they carry appropriate PAM motifs adjacent to protospacer sites.

Fig. 7. Significant CRISPR spacer matches to protospacer sequences are superimposed on genomes of the following representative viruses and plasmids: SIRV1 – rudiviruses, AFV9 – betalipothrixviruses, SSV2 – fuselloviruses, STIV – turreted icosahedral viruses, ATV – bicaudavirus, pNOB8 – conjugative plasmids, and pHEN7 – cryptic plasmids, where circular genomes (SSV2, STIV, ATV, pNOB8 and pHEN7) are presented in a linear form. Protein coding regions are boxed and shaded, as indicated on the Figure, according to their levels of conservation for those genomes. No comparative genomic data were used for ATV. Spacer sequence matches are indicated by lines above and below the genomes for the two DNA strands and they are colour-coded according to whether they occur exclusively at a nucleotide level (red) or additionally at an amino acid level (green). Significant spacer matches were found by setting an e-value cut off corresponding to a 10 % false positive ratio, which was estimated by using the genome of S. acidocaldarius as a negative control (Chen et al., 2005). These data are updated from an earlier study (Shah et al., 2009).
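The strand and location argument of this section can be made concrete with a minimal sketch: a spacer is searched on both strands of a genetic element and each hit is checked against annotated gene coordinates. Exact string matching is used for simplicity, whereas the analysis behind Figure 7 used e-value thresholds on sequence alignments; the sequences, coordinates and function names are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_protospacers(spacer, element):
    """Exact matches of a spacer on either strand as (position, strand)."""
    hits = []
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        start = element.find(query)
        while start != -1:
            hits.append((start, strand))
            start = element.find(query, start + 1)
    return sorted(hits)


def lies_within_gene(position, length, genes):
    """True if the match is fully contained in an annotated gene interval."""
    return any(start <= position and position + length <= end
               for start, end in genes)


# Toy element with one match on each strand; gene annotated at [0, 10).
spacer = "ACGTTG"
element = "TTT" + "ACGTTG" + "TTTTT" + "CAACGT" + "TTT"
genes = [(0, 10)]
hits = find_protospacers(spacer, element)
locations = [lies_within_gene(pos, len(spacer), genes) for pos, _ in hits]
```

Matches on both strands, and matches outside genes, are the kind of observation that argues against exclusive mRNA targeting, since an mRNA target would require complementarity to the transcribed strand of a gene.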

7 Formation of crRNAs and Targeting of Foreign Elements

The few archaeal CRISPR loci that have been tested experimentally for transcription, including some lacking intact leader regions, produced processed transcripts (Tang et al., 2002; Tang et al., 2005; Lillestøl et al., 2006; Carte et al., 2008; Lillestøl et al., 2009). Sulfolobus acidocaldarius carries five CRISPR loci with sizes of 133, 78, 11, 5 and 2 spacer-repeat units. For the four smaller clusters, whole-length transcripts were detected experimentally and for locus-78, the maximum transcript size of about 5000 nt exceeded the size of the 4930 bp CRISPR locus, consistent with the whole transcript extending from within the leader region and terminating downstream from the locus (Lillestøl et al., 2006; Lillestøl et al., 2009). However, a large fraction of the transcripts also fell in the size range 3000–3500 nt, suggesting that endogenous degradation, premature termination or processing had occurred towards the 3’-end of the transcript. Given that promoter and terminator motifs will be randomly taken up in spacers of CRISPR loci (Shah et al., 2009), there must be some form of transcriptional regulation to ensure the formation of whole CRISPR transcripts from the foreign genetic elements, possibly involving the Sulfolobus CRISPR repeat binding protein (Peng et al., 2003).

In the euryarchaeon P. furiosus and in Escherichia coli, RNA transcripts are processed within repeats, 8 nt from the spacer start, by the Cas6-type endonuclease. The processing of the 3’-end is less clear but for P. furiosus it occurs at two sites within the spacer, at 5 nt and 11 nt from the 3’-end of the spacer sequence. Complexes of Cas or Cmr proteins guide the mature crRNAs to their targets (Brouns et al., 2008; Hale et al., 2009). Annealing of the spacer sequence of the crRNA to the protospacer of the invading element is crucial for the recognition and inactivation of the target. For the bacterium Streptococcus thermophilus it was claimed that 100 % sequence matching between the crRNAs and protospacer RNAs was essential for target inactivation (Barrangou et al., 2007; Horvath and Barrangou, 2010).

However, for S. solfataricus and Sulfolobus islandicus the requirements appear to be much less stringent because even with 3 mismatches between crRNA and protospacer, targeting was still effective (Gudbergsdottir et al., 2010). There may also be differences between some archaea and bacteria in the role of the family-specific Protospacer-Associated Motif (PAM) complementary to part of the 5’-repeat sequence of the crRNA which, in Sulfolobus species, constitutes a conserved dinucleotide (Lillestøl et al., 2009). For S. islandicus it was shown that altering the PAM motif inhibited protospacer targeting (Gudbergsdottir et al., 2010) whereas for the bacterium Staphylococcus epidermidis it was concluded that any sequence mismatch with the 5’-end of the crRNA ensured protospacer targeting and that sequence complementarity to the PAM motif was not essential (Marraffini and Sontheimer, 2010).

The CRISPR-like locus of pKEF9 lacks an associated cas cassette and leader region but when transformed into S. solfataricus P2 it produced transcripts covering the whole CRISPR locus, initiating 32 bp upstream from the first repeat, and these were found to be processed. Processing sites were detected within each repeat-spacer unit but some of the sites occurred within the spacer. At the time it was presumed that some inaccurate processing had occurred, possibly reflecting mismatches occurring between the plasmid repeat sequence and the host Cas proteins (Lillestøl et al., 2009), but it was not known then that Cmr proteins process within the 3’-ends of spacers (Hale et al., 2009).

In contrast to reports on the euryarchaeal CRISPR transcripts (Carte et al., 2008) and a bacterium (Brouns et al., 2008), transcripts were detected from both DNA strands of each of the five CRISPR loci of S. acidocaldarius (Lillestøl et al., 2006; Lillestøl et al., 2009). The largest CRISPR locus Saci-133 was probed against spacer sequences distributed along the cluster and each yielded clear signals in Northern analyses. The smallest processed products, in the size range 55–60 nt, were larger than those of leader strand crRNAs and were less regularly processed. These small RNAs were observed for all five S. acidocaldarius repeat-clusters and must contain most or all of the spacer sequence because the corresponding band was not detected when the spacer probe was replaced by a repeat probe. Whether these have a role in protecting the mature crRNAs when there are no invading elements present remains unclear.
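The processing steps described for P. furiosus in this section (cleavage within the repeat 8 nt upstream of the spacer, followed by 3’-end trimming at 5 nt or 11 nt from the spacer end) can be sketched as simple string operations. The repeat and spacer sequences are placeholders written in DNA letters, the function name is hypothetical, and the sketch follows the positions as stated in the text rather than any enzyme model.

```python
def process_transcript(repeat, spacers, tag_length=8, trims=(5, 11)):
    """Return (intermediate, mature_forms) for each spacer of a CRISPR array.

    The intermediate results from cleavage within each repeat `tag_length`
    nt upstream of the spacer; mature forms lose `trims` nt from the spacer
    3'-end together with the downstream repeat portion.
    """
    tag = repeat[-tag_length:]          # repeat-derived 5' tag on each crRNA
    downstream = repeat[:-tag_length]   # repeat portion left at the 3'-end
    products = []
    for spacer in spacers:
        intermediate = tag + spacer + downstream
        mature = [tag + spacer[:len(spacer) - trim] for trim in trims]
        products.append((intermediate, mature))
    return products


# Placeholder 30-nt repeat and 37-nt spacer, for illustration only.
repeat = "GTTCCAATAAGACTAAAATAGAATTGAAAG"
spacer = "ACGT" * 9 + "A"
intermediate, mature = process_transcript(repeat, [spacer])[0]
```

Under these assumptions a 37-nt spacer gives a 67-nt intermediate and mature forms of 40 nt and 34 nt, each beginning with the 8-nt repeat tag.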

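The targeting requirements reported for Sulfolobus in Section 7 (tolerance of a few mismatches, dependence on a PAM dinucleotide) can be expressed as a small decision rule. Several details are assumptions made only for illustration: the PAM is taken to be the dinucleotide `CC` lying immediately upstream of the protospacer on the searched strand, three mismatches is used as the upper limit although the text only states that three were tolerated, and a single strand is scanned. The function names are hypothetical.

```python
def count_mismatches(site, spacer):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(site, spacer))


def is_targeted(spacer, element, pam="CC", max_mismatches=3):
    """True if the element carries a PAM-flanked site close to the spacer.

    Assumes the PAM sits directly upstream of the protospacer (illustrative).
    """
    length = len(spacer)
    for start in range(len(pam), len(element) - length + 1):
        site = element[start:start + length]
        has_pam = element[start - len(pam):start] == pam
        if has_pam and count_mismatches(site, spacer) <= max_mismatches:
            return True
    return False


def mutate(sequence, positions):
    """Replace the given positions by N to create mismatches."""
    return "".join("N" if i in positions else base
                   for i, base in enumerate(sequence))


spacer = "ACGT" * 5
perfect = "TTCC" + spacer + "TT"
three_mismatches = "TTCC" + mutate(spacer, {0, 7, 13}) + "TT"
four_mismatches = "TTCC" + mutate(spacer, {0, 5, 10, 15}) + "TT"
altered_pam = "TTCA" + spacer + "TT"
```

Setting `pam=""` would make the rule ignore the motif and depend on spacer matching alone; the S. epidermidis mechanism cited in Section 7 differs again, since there it was mismatch with the 5’-end of the crRNA that ensured targeting.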
8 Anti CRISPR/Cas and CRISPR/Cmr Systems

Examples have been recorded of archaeal CRISPR/Cas modules being lost from genomes. For example, a variant strain of S. solfataricus P2 (strain P2A) was characterised that had lost four closely linked CRISPR/Cas modules, A to D, apparently via a single recombination event between bordering IS elements (Redder and Garrett, 2006). Bordering IS elements also have the potential to generate transposons carrying whole CRISPR/Cas or Cmr modules. Possibly this loss reflects S. solfataricus P2A being a laboratory strain where the immune system had become an unnecessary burden on the cell’s energy resources in the absence of invading genetic elements, and this may be analogous to the many bacterial endosymbionts which lack functional CRISPR/Cas systems (Grissa et al., 2008; Mojica et al., 2009).

There are also examples of viruses which circumvent or interfere with the CRISPR systems. Some members of the viral families Rudiviridae and Lipothrixviridae carry 12 bp indels, probably deletions, in their genomes, often lying within, but not disrupting, open reading frames (Peng et al., 2004; Vestergaard et al., 2008). Although the function of these elements is unknown, they may be generated in response to the CRISPR-based immune systems to avoid crRNA targeting. The presence of multiple recombination sites in some archaeal viruses and conjugative plasmids may also facilitate genomic rearrangements and sequence changes (Greve et al., 2004; Garrett et al., 2010a).

Analysis of the genome of S. islandicus strain M.16.4 isolated in Kamchatka, Russia (Reno et al., 2009), revealed the presence of a more direct viral interference where an M164 provirus 1 has integrated into, and disrupted, the csa3 gene encoding a putative transcriptional regulator of the group 1 cas genes (Figure 8). The insertion event seems to be recent, since the truncated parts of the csa3 gene show high sequence similarity to genes of closely related species, and it may be reversible. The closely related strain M.16.27 carries an intact csa3 gene (Figure 8A) but also, unlike strain M.16.4, carries a CRISPR spacer sequence perfectly matching the provirus.

Fig. 8. An example of a cas gene cassette that has been inactivated in the gene for the putative transcriptional regulator csa3 of S. islandicus M.16.4 by the integration of an M164 provirus 1. a) Strain M.16.27, lacking the integrated provirus, carries a CRISPR spacer with a perfect match to the provirus whereas S. islandicus M.16.4, carrying the integrated provirus, contains no spacer sequence matching the proviral sequence. b) The integration att site in the csa3 gene. c) Gene map of the integrated provirus showing some predicted functional assignments.

Panel a, gene order. M1627: csa1, cas1, cas2, cas4, csa3. M164: csa1, cas1, cas2, cas4, csa3 interrupted by the provirus between attL and attR.
Panel b, att sequences as extracted:
attL GTAAATTTTCTTCTGCACAGAAAGAAGAT------AATCTT
attR CGAAA----CTTCTGCACAGAAAGAGTATTTGACGTCAAAACATT
Panel c, labels: M164 provirus I, 13,908 bp; sugar binding, integrase, phospholipase D, DNA primase/polymerase.
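The conserved core of the attL and attR sequences in Figure 8b can be recovered computationally as their longest common substring. The sketch below uses the two sequences as they appear in the extracted figure text (with alignment gaps removed) and the standard library `difflib.SequenceMatcher`; the function name `shared_core` is an illustrative choice, and the result says nothing about the biological extent of the att site beyond this exact match.

```python
from difflib import SequenceMatcher

# Sequences copied from panel b of Figure 8; dashes are alignment gaps.
ATT_L = "GTAAATTTTCTTCTGCACAGAAAGAAGAT------AATCTT"
ATT_R = "CGAAA----CTTCTGCACAGAAAGAGTATTTGACGTCAAAACATT"


def shared_core(first, second):
    """Longest exact substring common to two gapped sequences."""
    first = first.replace("-", "")
    second = second.replace("-", "")
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    match = matcher.find_longest_match(0, len(first), 0, len(second))
    return first[match.a:match.a + match.size]


core = shared_core(ATT_L, ATT_R)
```

The identical segment shared by both junctions is what would be expected of the direct repeats flanking an element integrated by site-specific recombination.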

9 Evolutionary Considerations

The view that archaeal and bacterial CRISPR/Cas systems are closely related has prevailed since their discovery and was underpinned by the similar ordering of spacer-repeat units in the CRISPR loci and by extensive sequence similarities between Cas proteins (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006). This view has been reinforced by the shared mechanism of elongation of CRISPR loci at the leader-repeat junction as well as similarities in the processing mechanisms of crRNAs in both Domains (Tang et al., 2002; Tang et al., 2005; Brouns et al., 2008; Hale et al., 2008; Hale et al., 2009). Nevertheless, there are distinctive features. CRISPR/Cas modules are more common amongst archaea and tend to be larger, structurally more complex and more labile (Lillestøl et al., 2006; Grissa et al., 2008; Shah and Garrett, 2010). Many repeat sequences show a bias to archaeal or bacterial CRISPR loci, and many archaeal repeats lack the inverted repeats common to those of bacteria, suggesting that different RNA processing signals occur within transcript repeats (Lillestøl et al., 2006; Kunin et al., 2007). Moreover, many crenarchaea encode the CRISPR repeat binding protein of elusive function (Peng et al., 2003).

Phylogenetic analyses imply that periodic inter-Domain exchange of CRISPR/Cas modules has occurred (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006). Clearly, crossing Domain boundaries would be a very complex process given the basic differences in the transcriptional and translational mechanisms of archaea and bacteria (Torarinsson et al., 2005; Santangelo et al., 2009). Moreover, conjugal DNA transfer would also have to overcome the major barriers of different membrane and cell wall structures, and different conjugative systems, of archaea and bacteria (Greve et al., 2004; Veith et al., 2009). Nevertheless, coevolution of archaeal and bacterial CRISPR/Cas systems would only require cross-Domain events to succeed rarely. The more archaea-specific components may be associated with systems that have evolved in environments of high temperature, extremes of pH, or hypersaline conditions where levels of bacteria are relatively low, which is also supported by the cas gene compositions of different CRISPR/Cas families. Other mechanistic differences may surface as the different systems are studied in more depth. Importantly, however, crenarchaeal viruses have radically different virus-host relationships from those of bacteria that may require altered responses from the immune systems (Prangishvili et al., 2006a; Bize et al., 2009) and it is likely that the CRISPR-based immune systems have maintained and/or undergone Domain-specific adaptations during evolution.

Small interfering RNA (siRNA) systems are widespread in eukarya where they have multiple roles including the discrimination and targeting of “foreign” genetic elements including viruses and transposons (Hannon 2002; Jinek and Doudna, 2009). There are broad mechanistic parallels between these eukaryal siRNA systems and the DNA- and RNA-targeting CRISPR systems. They all have to distinguish foreign DNA from self-DNA, and target nucleic acids which show little sequence similarity and can undergo continual sequence change. However, whereas the CRISPR systems employ ssRNAs for targeting foreign elements, the eukaryal anti-viral systems generate small 21–22 bp dsRNAs for targeting viruses which are subsequently converted to ssRNAs by an Argonaute protein-RISC complex. The closest parallel to the crRNAs and CRISPR loci amongst the eukaryal siRNA systems are the Argonaute Piwi-interacting RNAs (piRNAs), directly processed from large transcripts of piRNA clusters which are rich in transposons and repeat-sequence elements and, as for the CRISPR loci, occur at specific chromosomal sites (Lillestøl et al., 2009; Karginov and Hannon, 2010). This eukaryal system probably plays a role in maintaining germline integrity and development (Aravin et al., 2007; Klattenhoff and Theurkauf, 2008). As for CRISPR loci, the piRNA clusters increase their informational capacity by the insertion of transposon sequences which provide novel sequence content and are maintained in the piRNA clusters by selection. Thus, continual expansion of piRNA clusters occurs, as for CRISPR loci, but the process is passive rather than directed. Moreover, as for the CRISPR/Cas system, the newly incorporated DNA derives exclusively from genetic elements that are to be targeted. No homologous proteins have been detected from sequence analyses between proteins of the eukaryal siRNA systems and those of the CRISPR system, although similarities may appear at a tertiary structural level.

10 Conclusions

The CRISPR/Cas and CRISPR/Cmr immune machineries provide an effective defence against foreign genetic elements in archaea and some bacteria. The system is dynamic and heritable, although the benefit for the cell in evolutionary terms is transitional because spacers taken up into CRISPR loci from extrachromosomal elements have a rapid turnover and are lost again via recombination at repeats and/or transpositional events. Current evidence suggests that CRISPR/Cas and Cmr modules can behave like integral genetic elements. They tend to be located in the most variable regions of chromosomes, sometimes physically linked, and are frequently displaced as a result of genome shuffling, including possibly transposition of whole modules. CRISPR loci may be broken up, and dispersed, in chromosomes with the potential for creating genetic novelty. Small leaderless CRISPR-like loci are commonly found in chromosomes, and in plasmids, and some can be transcribed and processed and therefore constitute potentially functional accessories to the CRISPR-based immune systems. Both CRISPR/Cas and Cmr modules appear to exchange readily between closely related organisms, possibly via chromosomal conjugation, where they may be subjected to strong selective pressure. While universal phylogenetic trees based on the Cas1 and Cmr2 proteins of the CRISPR/Cas and Cmr modules, respectively, suggest that transfers between archaea and bacteria have occurred, the relatively large number of archaea-specific Cas/Cmr proteins suggests that these may have been very rare events, consistent with the incompatibility of the transcriptional, translational and conjugative systems of the two Domains (Shah and Garrett, 2010). Parallels to the eukaryal siRNAs exist, especially to the germ cell piRNAs, which are also directed by effector proteins to silence or destroy invading foreign DNA and transposons.

References

Aravin AA, Hannon GJ, Brennecke J (2007) The Piwi-piRNA pathway provides an adaptive defense in the transposon arms race. Science 318: 761–764
Arnold HP, She Q, Phan H, Stedman K, Prangishvili D, Holz I et al. (1999) The genetic element pSSVx of the extremely thermophilic crenarchaeon Sulfolobus is a hybrid between a plasmid and a virus. Mol Microbiol 34: 217–226
Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S et al. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315: 1709–1712
Basta T, Smyth J, Forterre P, Prangishvili D, Peng X (2009) Novel archaeal plasmid pAH1 and its interactions with the lipothrixvirus AFV1. Mol Microbiol 71: 23–34
Bettstetter M, Peng X, Garrett RA, Prangishvili D (2003) AFV1, a novel virus infecting hyperthermophilic archaea of the genus Acidianus. Virology 315: 68–79
Bize A, Karlsson EA, Ekefjard K, Quax TE, Pina M, Prevost MC et al. (2009) A unique virus release mechanism in the Archaea. Proc Natl Acad Sci U S A 106: 11306–11311
Bize A, Peng X, Prokofeva M, Maclellan K, Lucas S, Forterre P et al. (2008) Viruses in acidic geothermal environments of the Kamchatka Peninsula. Res Microbiol 159: 358–366
Bolotin A, Quinquis B, Sorokin A, Ehrlich SD (2005) Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151: 2551–2561
Brouns SJ, Jore MM, Lundgren M, Westra ER, Slijkhuis RJ, Snijders AP et al. (2008) Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321: 960–964
Brügger K, Torarinsson E, Redder P, Chen L, Garrett RA (2004) Shuffling of Sulfolobus genomes by autonomous and non-autonomous mobile elements. Biochem Soc Trans 32: 179–183
Carte J, Wang R, Li H, Terns RM, Terns MP (2008) Cas6 is an endoribonuclease that generates guide RNAs for invader defense in prokaryotes. Genes Dev 22: 3489–3496
Chen L, Brugger K, Skovgaard M, Redder P, She Q, Torarinsson E et al. (2005) The genome of Sulfolobus acidocaldarius, a model organism of the Crenarchaeota. J Bacteriol 187: 4992–4999
Cortez D, Forterre P, Gribaldo S (2009) A hidden reservoir of integrative elements is the major source of recently acquired foreign genes and ORFans in archaeal and bacterial genomes. Genome Biol 10: R65
Garrett RA, Prangishvili D, Shah SA, Reuter M, Stetter KO, Peng X (2010a) Metagenomic analyses of novel viruses and plasmids from a cultured environmental sample of hyperthermophilic neutrophiles. Environ Microbiol doi: 10.1111/j.1462-2920.2010.02266.x
Garrett RA, Shah SA, Vestergaard G, Deng L, Gudbergsdottir S, Kenchappa CS et al. (2010b) CRISPR-based immune systems of the Sulfolobales – complexity and diversity. Biochem Soc Trans in press
Godde JS and Bickerton A (2006) The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 62: 718–729
Greve B, Jensen S, Brugger K, Zillig W, Garrett RA (2004) Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 1: 231–239
Grissa I, Vergnaud G, Pourcel C (2008) CRISPRcompar: a website to compare clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 36: W145–W148
Gudbergsdottir S, Deng L, Chen Z, Jensen JVK, Jensen LR, She Q et al. (2010) Dynamic properties of the Sulfolobus CRISPR/Cas and CRISPR/Cmr systems when challenged with vector-borne viral and plasmid genes and protospacers. Mol Microbiol under review
Guo L, Brugger K, Chao Liu C, Shah SA, Zheng H, Zhu Y et al. (2010) Comparative genomics of two strains of Sulfolobus islandicus from Iceland: Hosts for studying crenarchaeal genetics and virus life cycles. submitted
Haft DH, Selengut J, Mongodin EF, Nelson KE (2005) A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 1: e60
Hale C, Kleppe K, Terns RM, Terns MP (2008) Prokaryotic silencing (psi)RNAs in Pyrococcus furiosus. RNA 14: 2572–2579
Hale CR, Zhao P, Olson S, Duff MO, Graveley BR, Wells L et al. (2009) RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell 139: 945–956
Hannon GJ (2002) RNA interference. Nature 418: 244–251
Held NL and Whitaker RJ (2009) Viral biogeography revealed by signatures in Sulfolobus islandicus genomes. Environ Microbiol 11: 457–466
Horvath P and Barrangou R (2010) CRISPR/Cas, the immune system of bacteria and archaea. Science 327: 167–170
Jansen R, Embden JD, Gaastra W, Schouls LM (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 43: 1565–1575
Jinek M and Doudna JA (2009) A three-dimensional view of the molecular machinery of RNA interference. Nature 457: 405–412
Karginov FV and Hannon GJ (2010) The CRISPR system: small RNA-guided defense in bacteria and archaea. Mol Cell 37: 7–19
Klattenhoff C and Theurkauf W (2008) Biogenesis and germline functions of piRNAs. Development 135: 3–9
Krupovic M, Forterre P, Bamford DH (2010) Comparative analysis of the mosaic genomes of tailed archaeal viruses and proviruses suggests common themes for virion architecture and assembly with tailed viruses of bacteria. J Mol Biol 397: 144–160
Kunin V, Sorek R, Hugenholtz P (2007) Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol 8: R61
Lawrence CM, Menon S, Eilers BJ, Bothner B, Khayat R, Douglas T et al. (2009) Structural and functional studies of archaeal viruses. J Biol Chem 284: 12599–12603
Lillestøl RK, Redder P, Garrett RA, Brugger K (2006) A putative viral defence mechanism in archaeal cells. Archaea 2: 59–72
Lillestøl RK, Shah SA, Brugger K, Redder P, Phan H, Christiansen J et al. (2009) CRISPR families of the crenarchaeal genus Sulfolobus: bidirectional transcription and dynamic properties. Mol Microbiol 72: 259–272
Makarova KS, Grishin NV, Shabalina SA, Wolf YI, Koonin EV (2006) A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 1: 7
Marraffini LA and Sontheimer EJ (2008) CRISPR interference limits horizontal gene transfer in staphylococci by targeting DNA. Science 322: 1843–1845
Marraffini LA and Sontheimer EJ (2010) Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463: 568–571
Mojica FJ, Diez-Villasenor C, Garcia-Martinez J, Almendros C (2009) Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155: 733–740
Mojica FJ, Diez-Villasenor C, Garcia-Martinez J, Soria E (2005) Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60: 174–182
Peng X, Blum H, She Q, Mallok S, Brugger K, Garrett RA et al. (2001) Sequences and replication of genomes of the archaeal rudiviruses SIRV1 and SIRV2: relationships to the archaeal lipothrixvirus SIFV and some eukaryal viruses. Virology 291: 226–234
Peng X, Brugger K, Shen B, Chen L, She Q, Garrett RA (2003) Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in Sulfolobus genomes. J Bacteriol 185: 2410–2417
Peng X, Kessler A, Phan H, Garrett RA, Prangishvili D (2004) Multiple variants of the archaeal DNA rudivirus SIRV1 in a single host and a novel mechanism of genomic variation. Mol Microbiol 54: 366–375
Porter K, Russ BE, Dyall-Smith ML (2007) Virus-host interactions in salt lakes. Curr Opin Microbiol 10: 418–424
Portillo MC and Gonzalez JM (2009) CRISPR elements in the Thermococcales: evidence for associated horizontal gene transfer in Pyrococcus furiosus. J Appl Genet 50: 421–430
Pourcel C, Salvignol G, Vergnaud G (2005) CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151: 653–663
Prangishvili D, Forterre P, Garrett RA (2006a) Viruses of the Archaea: a unifying view. Nat Rev Microbiol 4: 837–848
Prangishvili D, Garrett RA, Koonin EV (2006b) Evolutionary genomics of archaeal viruses: unique viral genomes in the third domain of life. Virus Res 117: 52–67
Rachel R, Bettstetter M, Hedlund BP, Haring M, Kessler A, Stetter KO et al. (2002) Remarkable morphological diversity of viruses and virus-like particles in hot terrestrial environments. Arch Virol 147: 2419–2429
Redder P and Garrett RA (2006) Mutations and rearrangements in the genome of Sulfolobus solfataricus P2. J Bacteriol 188: 4198–4206
Reno ML, Held NL, Fields CJ, Burke PV, Whitaker RJ (2009) Biogeography of the Sulfolobus islandicus pan-genome. Proc Natl Acad Sci U S A 106: 8605–8610
Santangelo TJ, Cubonova L, Skinner KM, Reeve JN (2009) Archaeal intrinsic transcription termination in vivo. J Bacteriol 191: 7102–7108
Shah SA and Garrett RA (2010) CRISPR/Cas and Cmr modules, mobility and evolution of adaptive immune systems. Res Microbiol
Shah SA, Hansen NR, Garrett RA (2009) Distribution of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem Soc Trans 37: 23–28
She Q, Peng X, Zillig W, Garrett RA (2001) Gene capture in archaeal chromosomes. Nature 409: 478
She Q, Phan H, Garrett RA, Albers SV, Stedman KM, Zillig W (1998) Genetic profile of pNOB8 from Sulfolobus: the first conjugative plasmid from an archaeon. Extremophiles 2: 417–425
Tang TH, Bachellerie JP, Rozhdestvensky T, Bortolin ML, Huber H, Drungowski M et al. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci U S A 99: 7536–7541
Tang TH, Polacek N, Zywicki M, Huber H, Brugger K, Garrett R et al. (2005) Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol Microbiol 55: 469–481
Torarinsson E, Klenk HP, Garrett RA (2005) Divergent transcriptional and translational signals in Archaea. Environ Microbiol 7: 47–54
Tyson GW and Banfield JF (2008) Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environ Microbiol 10: 200–207
Veith A, Klingl A, Zolghadr B, Lauber K, Mentele R, Lottspeich F et al. (2009) Acidianus, Sulfolobus and Metallosphaera surface layers: structure, composition and gene expression. Mol Microbiol 73: 58–72
Vestergaard G, Shah SA, Bize A, Reitberger W, Reuter M, Phan H et al. (2008) Stygiolobus rod-shaped virus and the interplay of crenarchaeal rudiviruses with the CRISPR antiviral system. J Bacteriol 190: 6837–6845
Wang Y, Duan Z, Zhu H, Guo X, Wang Z, Zhou J et al. (2007) A novel Sulfolobus non-conjugative extrachromosomal genetic element capable of integration into the host genome and spreading in the presence of a fusellovirus. Virology 363: 124–133
Zillig W, Arnold HP, Holz I, Prangishvili D, Schweier A, Stedman K et al. (1998) Genetic elements in the extremely thermophilic archaeon Sulfolobus. Extremophiles 2: 131–140


Contribution: major. The bioinformatical analyses were conducted by myself. The chapter was written by Professor Roger A. Garrett.

Chapter 7 Archaeal Type II TA Loci

Shiraz A. Shah and Roger A. Garrett Corresponding: [email protected]

Archaea Centre, Department of Biology, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark

Abstract

A few of the bacterial type II TA systems, primarily those involved in translational inhibition, occur widely throughout the archaeal domain. Using a bioinformatic approach, the frequency and distribution of these diverse TA loci were examined within the completed genomes of 124 archaea that are distributed fairly evenly throughout the major archaeal phyla. Results for the frequency and diversity of TA loci are summarised for archaea isolated from environmental niches generally characterised by extreme conditions including high temperature, high salt concentrations, high pressures, extremes of pH or strictly anaerobic conditions. No clear correlations were found between the number of TA loci present and either the genome size or particular environmental conditions. Multiple TA loci tend to be concentrated in variable genomic regions where the occurrence of intra- or inter-genomic gene transfer is most prevalent. For members of the Sulfolobales, which are uniformly rich in TA loci, a case is made for some TA systems facilitating maintenance of important genomic regions.

7.1. Introduction

Until recently, type II TA systems have received relatively little attention in comparative genomic studies of archaea. This reflects a general uncertainty regarding their functions, the significance of their structural diversity and, to some degree, their identities. Moreover, this uncertainty was compounded by the small gene sizes, especially for the antitoxins, which rendered their annotation difficult. This widespread deficiency was first highlighted by Gerdes' group who identified large numbers of non-annotated TA loci in archaeal and bacterial genomes and demonstrated the structural diversity of the protein components within different TA families (Pandey et al., 2005; Gerdes et al., 2005; Jørgensen et al., 2009). This development, combined with contemporary insights gained into molecular mechanisms of toxin inhibitory activity (reviewed in Gerdes et al., 2005), served to focus attention on the profound importance of TA systems for cellular viability and survival. Genome-based surveys of bacterial type II TA systems, carrying two genes, have identified eight major families denoted vapBC, relBE, hicBA, mazEF, phd/doc, parDE, ccdAB and higBA, with an additional system in a Streptococcus plasmid carrying three genes (ω, ε and ζ, a repressor, antitoxin and toxin, respectively). VapC, RelE, MazF and HicA toxins have all been demonstrated experimentally to inhibit translation and Doc has also been implicated, at least indirectly, in affecting translation. In contrast, ParE and CcdB target the bacterial DNA gyrase thereby blocking DNA replication (reviewed in Gerdes et al., 2005). Only three of these toxin families, VapC, RelE and HicA, each targeting translation, occur commonly amongst archaea and this chapter is mainly focussed on these three TA systems. In the bacterium Shigella flexneri VapC toxins act by cleaving initiator tRNA within the anticodon loop thereby inhibiting translational initiation (Dienemann et al. 2011; Winther and Gerdes, 2011), while RelE binds at the ribosomal A-site cutting the bound mRNA within the codon (Neubauer et al., 2009), and HicA is a translation-independent mRNA interferase (Jørgensen et al., 2009; Makarova et al., 2009a). MazF and Doc have also been implicated in targeting translation, but their homologs are rarely found amongst archaea (Pandey and Gerdes, 2005; Makarova et al., 2009b).
Most archaea do not carry a homolog of the bacterial gyrase, the target of the ParE and CcdB toxins, employing instead the archaea-specific topoisomerase VI (Gadelle et al., 2003; Yamashiro and Yamagishi, 2005). An extensive genomic survey of bacterial and archaeal type II TA systems by Makarova et al. (2009b), which did not take into account the many non-annotated genes, reinforced the considerable structural diversity of the major TA families and identified additional subtypes, especially of the antitoxin components. This study also provided bioinformatical evidence for a possible additional TA locus encoding MNT (Minimal Nucleotidyl Transferase) and HEPN (Higher Eukaryote and Prokaryote Nucleotide binding). Although there is currently no experimental support for any toxin activity (Makarova et al., 2009b), we nevertheless included MNT/HEPN gene pairs in the present analysis because they occur commonly in archaea, especially amongst the hyperthermophiles and, moreover, their frequency of genome occurrence partially mirrors that of vapBC gene pairs. A bioinformatical approach was employed to identify archaeal type II TA loci within 124 completed archaeal genomes. Exhaustive searches were made for the major families of TA gene loci, vapBC, relBE and hicAB, and for the HEPN/MNT gene pairs, and attempts were made to identify non-annotated antitoxin and toxin genes.

7.2. The archaeal perspective

Archaea differ from bacteria in their cellular biology in fundamental ways and they share many cellular processes exclusively with eukaryotes, albeit generally in less complex forms. Although the evolutionary history of archaea and their relationship to early eukarya remains enigmatic (Gribaldi et al., 2010; Kurland, 2006), the maintenance of unique cellular properties amongst archaea is likely to be due to their successful adaptation to extreme environmental conditions. These include high temperature, extremes of pH, high salt, high pressures and strictly anaerobic conditions; such environments also tend to be low in energy sources (Kletzin, 2007). It has been argued that some of the properties unique to archaea arose from adaptation to chronic energy stress through modifying catabolic pathways and by conserving energy via their low permeability ether-linked lipid membranes (Valentine, 2007). Thus, stress in bacteria and archaea cannot simply be equated when considering the modes of action of toxins. TA systems that are shared between bacteria and archaea appear primarily to inhibit translation, cleaving either mRNA bound in the ribosomal A-site (RelE), the anticodon of the initiator tRNA (VapC) or mRNA directly (HicA). The ribosomal tRNA binding sites, decoding site and peptidyl transferase centre constitute the most conserved regions of the translational apparatus, in both bacteria and archaea (and also in eukarya), as judged by their shared sensitivities to a wide range of antibiotics which specifically target these sites in both Domains (e.g. Rodriguez-Fonseca et al., 1995). Experimental studies indicate that bacterial TAs have alternative cellular targets, including the bacterial DNA gyrase, but it remains unknown whether there are unidentified archaeal toxins which bind to archaea-specific cellular sites.

7.3. A bioinformatical approach

All archaeal genomes publicly available at the beginning of 2012 were screened for the presence of type II TA loci of the superfamilies vapBC, relBE and hicAB, as well as gene pairs of the predicted HEPN/MNT TA locus. First, toxin-specific hidden Markov models (HMMs) were constructed with the jackhmmer program (Eddy, 2011) by searching the genomes with known toxin genes as queries. Subsequently, all open reading frames (ORFs) between 50 and 250 aa that did not overlap previously annotated ORFs above 250 aa in length were extracted and screened using the constructed HMMs. Every upstream or downstream ORF, depending on the TA family type, located within a fixed distance of the matching toxin ORF, was extracted and clustered according to sequence similarity. Some of these clusters were judged to comprise antitoxin gene families based on manual inspection of their genomic contexts, and they were paired with the corresponding toxin genes to generate TA loci. Subsequently, we found that significant numbers of TA loci were partially overlapping with larger annotated ORFs, particularly for members of the Thermococcales and, therefore, we extended the analyses to include these genes, which involved extensive manual inspection of the genomes.
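The ORF filtering and neighbour lookup described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the HMM screen and the clustering step are left out, the `Orf` class, the function names and the toy coordinates are hypothetical, and the `max_gap` default is an assumption because the text specifies only "a fixed distance".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """A predicted open reading frame; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"

    @property
    def length_aa(self) -> int:
        # Number of codons minus the stop codon.
        return (self.end - self.start + 1) // 3 - 1


def overlaps(a: Orf, b: Orf) -> bool:
    """True if the two intervals share at least one nucleotide (strand ignored)."""
    return a.start <= b.end and b.start <= a.end


def candidate_orfs(orfs, annotated, min_aa=50, max_aa=250):
    """Keep ORFs of min_aa..max_aa residues that do not overlap an
    annotated gene longer than max_aa residues."""
    large = [g for g in annotated if g.length_aa > max_aa]
    return [
        o for o in orfs
        if min_aa <= o.length_aa <= max_aa
        and not any(overlaps(o, g) for g in large)
    ]


def neighbours(toxin, orfs, side, max_gap=150):
    """Same-strand ORFs upstream or downstream of a toxin hit, at most
    max_gap nucleotides away (max_gap is an assumed value)."""
    # On the minus strand, "upstream" lies at higher coordinates.
    want_left = (side == "upstream") == (toxin.strand == "+")
    hits = []
    for o in orfs:
        if o == toxin or o.strand != toxin.strand:
            continue
        if want_left and o.start < toxin.start:
            gap = toxin.start - o.end - 1
        elif not want_left and o.end > toxin.end:
            gap = o.start - toxin.end - 1
        else:
            continue
        if gap <= max_gap:  # a negative gap means the two ORFs overlap
            hits.append(o)
    return hits


# Hypothetical toy locus: a toxin hit with a small ORF just upstream.
toxin = Orf("tox", 1000, 1398, "+")        # 132 aa
anti = Orf("anti", 760, 996, "+")          # 78 aa
tiny = Orf("tiny", 2000, 2089, "+")        # 29 aa, too short
shadow = Orf("shadow", 3100, 3399, "-")    # 99 aa, inside a large gene
big = Orf("big", 3000, 4199, "+")          # 399 aa annotated gene

kept = candidate_orfs([toxin, anti, tiny, shadow], annotated=[big])
upstream = neighbours(toxin, kept, side="upstream")
print([o.name for o in kept], [o.name for o in upstream])
```

In this toy case the short ORF and the ORF lying within the large annotated gene are discarded, and the small upstream ORF is reported as the antitoxin candidate. The later extension of the analysis to TA loci that partially overlap larger annotated ORFs would correspond to relaxing the overlap test.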

7.4. Phylogenetic distribution and frequency of archaeal TA loci

A phylogenetic tree based on 16S rRNA sequences was generated for 124 archaea for which genome sequences were available. The genome size and natural habitat are given for each organism, and thermophiles are distinguished from mesophiles by a threshold optimal growth temperature of 50 °C (Table 1). More details of the natural environments and optimal growth conditions for many of the organisms are given by Kletzin (2007). Some orders, including the hyperthermophilic Sulfolobales, Thermoproteales and Thermococcales, are relatively overrepresented by closely related organisms including several Sulfolobus islandicus, Pyrobaculum and Thermococcus strains, while the less well characterised Korarchaea (K) and Thaumarchaea (T) are underrepresented. This bias primarily reflects that the former group are relatively easy to isolate and culture and that some of them have been employed as model organisms for molecular, cellular and genetic studies. The total numbers of identified TA loci are given for vapBC, relBE and hicAB families and for the HEPN/MNT gene pairs in Table 1. The results reveal a wide range of type II TA contents. Several organisms carry 30 or more TA loci but many have very few or no detectable loci. vapBC loci constitute the dominant TA family and they are most prevalent amongst thermophiles, in particular in members of the thermoacidophilic Sulfolobales (Pandey and Gerdes, 2005; Guo et al., 2011) and in some Thermococcus species. In contrast, relBE or hicAB gene pairs are quite rare, especially amongst the 40 crenarchaeal genomes. For the euryarchaea, relBE gene pairs were observed in about half of the genomes and several of these carried 1 to 9 copies. Similarly, hicAB pairs were identified in about half the euryarchaeal genomes with multiple copies occurring mainly amongst the Methanomicrobiales and Methanosarcinales. MNT/HEPN gene pairs occur much more frequently but are irregularly distributed. They are most common amongst crenarchaeal thermoacidophiles and thermoneutrophiles and the euryarchaeal hyperthermophiles (Table 1).

7.5. TA loci frequency and their relationship to genome size and environmental factors

Generally, there is no simple correlation between genome size and TA locus frequency for the different archaeal phyla. For example, for most members of the Sulfolobales the estimated number of TA loci varies from 17 to 49, but the smallest genome, that of Acidianus hospitalis (2.2 Mb), carries 38 while the largest, that of Sulfolobus solfataricus P2 (3 Mb), contains 33. For other phyla, a clearer picture emerges when comparing strains within the same genus, e.g. the seven Thermococcus strains which have genome sizes ranging from 1.8 to 2.1 Mb. When ordered according to increasing approximate size (Table 1), these genomes carry 4, 10, 24, 25, 26, 47 and 58 TA loci respectively, showing that the TA frequency increases disproportionately with genome size. A similar pattern is seen with Pyrobaculum strains. These results also underline the often large differences in the TA contents of pairs of closely related organisms. There is little correlation between TA loci numbers and optimum growth temperatures. Although Hyperthermus butylicus, which can grow up to 108 °C, has a relatively high TA locus content of 18 (mainly vapBC loci) for a member of the Thermoproteales, Methanopyrus kandleri, growing up to 110 °C, has no detectable TA loci and some of the hyperthermophilic Methanocaldococcus strains also exhibit few TA loci. More difficult to assess is the impact of the natural environments and the available nutrients, although in this respect the S. islandicus strains may be informative (Reno et al., 2009; Guo et al., 2011). They were all isolated from terrestrial acidic hot springs with similar maximum growth temperatures and pH ranges but widely separated, and isolated, geographically: in Iceland, in Kamchatka, Russia, and in Yellowstone and Lassen National Parks, USA, while the related S. solfataricus P2 strain derives from Naples, Italy. Each of the strains carries 26 to 36 TA loci, which suggests that the nature of the environment is important. Moreover, active terrestrial hot springs are likely to be particularly challenging for cells because temperatures can continuously change from maxima of around 80 °C to 0 °C, if surrounded by ice, and pH values and nutrient availability can also change rapidly. A definitive answer to the effect of environmental factors on TA activity would require detailed and time-consuming experimental analyses of archaea cultivated under a wide range of conditions.
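The disproportionate increase reported for the seven Thermococcus strains can be made explicit with a short calculation. The sketch below uses only the numbers quoted in the text; it assumes, as the ordering by size implies, that the smallest genome (about 1.8 Mb) carries 4 loci and the largest (about 2.1 Mb) carries 58. The sizes of the five intermediate genomes are not stated here, so only the two extremes are compared.

```python
# TA locus counts for the seven Thermococcus strains, ordered by genome size.
ta_counts = [4, 10, 24, 25, 26, 47, 58]

# Approximate sizes of the smallest and largest genome, in Mb.
smallest_mb, largest_mb = 1.8, 2.1

size_fold = largest_mb / smallest_mb          # relative increase in genome size
count_fold = ta_counts[-1] / ta_counts[0]     # relative increase in TA loci

density_small = ta_counts[0] / smallest_mb    # TA loci per Mb, smallest genome
density_large = ta_counts[-1] / largest_mb    # TA loci per Mb, largest genome

print(f"genome size increases {size_fold:.2f}-fold")
print(f"TA locus count increases {count_fold:.1f}-fold")
print(f"density: {density_small:.1f} vs {density_large:.1f} loci per Mb")
```

A genome size increase of roughly one sixth is thus accompanied by a more than tenfold increase in TA locus density, which is what "disproportionately" refers to.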

7.6. Orphan toxin and antitoxin genes

Many orphan toxin and some orphan antitoxin genes were detected in the genomes and the numbers tend to be proportional to the numbers of type II TA loci. For example, there are many orphan toxin genes amongst the Sulfolobales. Some of these may have been classed as orphans because the adjacent antitoxin protein gene was not identified (Pandey and Gerdes, 2005) and others may be located adjacent to unidentified type III RNA antitoxin genes (see Chapter 14). Presumably, over time, antitoxins or toxins may become associated with other cellular functions by selection. One such example could be provided by a single vapC-like gene (Ahos0712) of A. hospitalis. It lies in an operon with genes encoding proteins involved in transcription and initiator tRNA binding to the ribosome (You et al., 2011). This gene cassette is highly conserved in gene content, gene synteny and sequence in other Sulfolobus genomes (Guo et al. 2011). A possible explanation is that this orphan VapC-like protein acts as a VapC competitor and may regulate or inhibit initiator tRNA cleavage.

7.7. Locations within genomes

Earlier comparative genomic analyses of closely related Sulfolobus species indicated that TA gene pairs tend to be concentrated in relatively large genomic regions (0.7 to 1 Mbp). These regions are the most variable in gene synteny and gene content (Guo et al., 2011), consistent with the extensive exchange of genes having occurred intra- and/or inter-genomically. This is illustrated for the genomes of S. islandicus REY15A and the related S. solfataricus P2, where a high level of gene synteny is maintained throughout about two thirds of the genome while the remaining one third is extensively shuffled (Figure 7.1A). Most of the TA loci of both species fall within the variable region. Although few pairs of genome sequences from closely related archaeal species are available which show extensive gene synteny, a comparable analysis was possible for the genomes of two Thermococcus species, T. kodakarensis, which carries many TA loci, and T. onnurineus, which exhibits very few TA loci (Figure 7.1B). Here the gene synteny is more limited and extends only over about one half of the genome but again the TA loci of T. kodakarensis are concentrated in the shuffled genome region. The latter example also illustrates the stark differences in the numbers of TA loci between some fairly closely related species. Although several genomes, including some Sulfolobus species, contain many transposable elements and TA loci, there is no general proportionality between the two. For example, both Thermococcus genomes carry few IS elements but one of the species, T. kodakarensis, exhibits several TA loci (Figure 7.1B). Moreover, several of the genomes carry many transposable elements but few TA loci (e.g. Pyrococcus furiosus, Halobacterium NRC1 and ) while others exhibit few transposable elements but contain many TA loci (e.g. Sulfolobus acidocaldarius, H. butylicus and Thermococcus sp. AM4) (Brügger et al., 2002; Filée et al., 2007).

Figure 7.1. Comparison of genomes from pairs of closely related archaea. Dot plots of (A) Sulfolobus species S. islandicus REY15A and S. solfataricus P2, and (B) Thermococcus species T. onnurineus and T. kodakarensis showing regions of gene synteny (red) and inverted synteny (blue). The total genome sizes are given and the large variable regions in each genome are shaded. TA loci are denoted by black lines along the corresponding genome axes.

Figure 7.2. Phylogenetic trees for VapB antitoxins and VapC toxins of the acidothermophile A. hospitalis W1. (A) The VapB tree demonstrates that the highly diverse antitoxins can be classified into three main subfamilies AbrB, CcdA/CopG and DUF217. In (B) the VapC tree shows the highly diverse toxin sequences falling into one major grouping. The VapB subfamily linked to each VapC is given. Moreover, the number of closely similar VapC proteins present in the available 13 Sulfolobales genomes (Table 1) is listed: 0 indicates that no VapC with a similar sequence is encoded in the genomes, while 13 indicates that a VapC with a closely similar sequence is encoded in each of the genomes. Ahos Genbank numbers are given for each protein. Modified from You et al. (2011).

7.8. TA sequence diversity within genomes

The A. hospitalis genome carries 24 vapBC loci concentrated within the genomic regions 350-410 kb and 1,374-1,912 kb (You et al., 2011). Whereas the VapC toxins are all PIN domain proteins (PilT N-terminal domain), the VapB antitoxins were classified into three families of transcriptional regulators, AbrB, CcdA/CopG and DUF217 (Figure 7.2A) (You et al., 2011). Tree building based on sequence alignments demonstrated that the sequences of these antitoxins and toxins are all highly diverse, with sequence identities between them rarely exceeding 30%, as indicated by all the long tree branches for each protein (Figure 7.2). A parallel tree building study of the closely related S. islandicus strains REY15A and HVE10/4, carrying 18 and 19 vapBC gene pairs, respectively, yielded a similar pattern of long branches for each VapB and VapC protein (Guo et al., 2011). Thus all antitoxins and toxins within each archaeon are highly diverse in sequence. In contrast, when intergenomic comparisons were made for other members of the Sulfolobales, isolated from both closely and distantly separated terrestrial hot springs, several VapBC complexes showed high sequence similarity. For example, 11 of the 24 VapBC protein pairs identified in A. hospitalis (Figure 7.2) exhibit closely similar sequences to homologs encoded in at least seven of the 13 available Sulfolobus genomes (You et al., 2011). A further example is illustrated for the VapC toxins of Pyrococcus species (Figure 7.3A) and for the predicted MNT toxin of the MNT/HEPN gene pairs for Pyrobaculum species (Figure 7.3B). The result shows that the VapC and MNT sequences within each cluster of short branches derive from different species. Thus, there is apparently selection against the uptake of closely similar vapBC loci or MNT/HEPN gene pairs in a given genome, despite the abundance of many similar gene pairs in the environment. The tree-building results of the analysis demonstrated further that for given gene pairs the subtypes of VapB and VapC do not always correspond, implying that gene pairs exchange partners, thereby potentially creating increased functional diversity of the TA systems (You et al., 2011), consistent with an earlier hypothesis (Gerdes et al., 2005).
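The 30% identity figure quoted above refers to pairwise comparisons of aligned protein sequences. As a hedged illustration, the Python sketch below computes percent identity over the columns in which neither sequence has a gap and lists the pairs that exceed a threshold. The choice of denominator is one of several conventions in use and is an assumption here, as are the function names and the toy alignment, which does not contain real VapC sequences.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over alignment columns without a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def similar_pairs(alignment: dict, threshold: float = 30.0):
    """Return (name1, name2, identity) for every pair above the threshold."""
    result = []
    for n1, n2 in combinations(sorted(alignment), 2):
        identity = percent_identity(alignment[n1], alignment[n2])
        if identity > threshold:
            result.append((n1, n2, identity))
    return result


# Hypothetical toy alignment; "-" marks an alignment gap.
toy_alignment = {
    "vapC_a": "MKV-LTDASA",
    "vapC_b": "MRVQLSD-SA",
    "vapC_c": "GHW-PTEKLR",
}

for n1, n2, identity in similar_pairs(toy_alignment):
    print(f"{n1} and {n2}: {identity:.1f}% identity")
```

Applied to all toxins of one genome, a pattern like that described for A. hospitalis would yield few or no pairs above the threshold, whereas comparisons across genomes would recover the clusters of closely similar sequences.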

Figure 7.3. Phylogenetic trees for the toxin VapC and the predicted toxin MNT. (A) VapC proteins encoded in different Pyrococcus species, and (B) MNT proteins encoded in diverse Pyrobaculum species. Proteins that fall within the small clusters of short branches derive from different organisms. Trees generated for proteins deriving exclusively from one organism yield long branches. Gene numbers are given for each of the genomes analysed (see Table 1).


7.9. Stress response Antitoxin-toxins were originally shown to enhance plasmid maintenance as a consequence of the growth of plasmid-free cells being preferentially inhibited, post segregation, by free toxins that are inherently more stable than antitoxins (Gerdes et al. 2005). To date, relatively few archaeal plasmids have been sequenced and there is no current evidence for type II TA loci occurring widely in plasmids. Nevertheless, the plasmid maintenance mechanism led to the hypothesis that the TA systems encoded widely in chromosomes facilitate retention of local DNA regions carrying important genes that might otherwise be prone to loss (Magnuson, 2007; Van Melderen 2010). This hypothesis receives support from the observation that vapBC loci and the HEPN/MNT gene pairs are concentrated within variable genomic regions of members of the Sulfolobales and Thermococcales where intergenomic DNA exchange appears to be most active (Figure 1). Furthermore, the hypothesis is reinforced by the high sequence diversity of each of the numerous VapC proteins encoded within these genomes, exemplified for A. hospitalis (Figure 2). For any pair of similar VapBC complexes, the loss of one would be compensated for by the presence of the other, thereby undermining any DNA maintenance capability. For bacteria which grow slowly in nutrient poor environments, multiple toxins are strongly implicated in responding to different types of nutrient deficiency and/or in enhancing quality control (Gerdes, 2000; Pandey and Gerdes, 2005). Involvement in stress response entails that the toxins inhibit growth, allowing the host to lie in a dormant state during the period of environmental stress (Pedersen et al., 2002; Gerdes et al. 2005). In this context, toxins have also been implicated in producing persistent cells which are able to remain dormant for longer periods and to withstand prolonged exposure to stress factors including antibiotics (Maisonneuve et al., 2011). 
There may well be a negative effect on host growth as a consequence of carrying large numbers of TA loci (30 to 40 TA loci for some Sulfolobus species and a few other archaea (Table 1)) because of the likelihood of the continuous presence of low levels of free toxin (Wilbur et al., 2005). Although only highly diverse vapBC loci are present, presumably in order to avoid redundancy, the total number of TA loci present per genome may reflect a compromise between the ability to maintain important genes and to survive different environmental stresses while retaining an adequate growth rate under normal conditions.

In conclusion, there is a major deficit in experimental work on archaeal TA systems, especially with regard to stress responses. Almost all research to date has focussed on bacteria. One exception was the demonstration that the modes of action of RelE toxins from M. jannaschii and from bacteria were similar in vitro (Christensen and Gerdes, 2003). Moreover, heat shock of S. solfataricus (from 80 °C to 90 °C) was shown to induce expression of some TA loci, while knockout of a single vapBC locus increased heat shock lability (Cooper et al., 2009). Clearly, however, many challenging experiments remain to be performed in this rapidly developing field.

7.10. Type II TA systems and viral defence

It has been proposed that bacterial TA systems could be involved in combating bacteriophage infection by, for example, blocking ribosomes and preventing the viruses from dominating the translational apparatus, prior to their propagating and lysing cells (see Chapter 5). The inferred result would be that only the phage-infected cells would die. In principle, archaeal TA systems which primarily target translation could act similarly. However, most archaeal viruses, and especially those from extremely thermophilic and halophilic environments, show morphotypes and genomic properties distinct from bacterial and eukaryal viruses, and they generally exist in stable relationships with their hosts at low copy numbers, infrequently, if ever, causing cell lysis (Prangishvili et al., 2006; Porter et al., 2007). Consistent with these properties, circumstantial evidence suggests that the level of free viruses, at least in extreme thermoacidophilic environments, tends to be low relative to cellular levels, suggesting that these viruses prefer to remain within cells under these challenging conditions (Snyder et al., 2010).

Another intriguing possibility arises from the juxtaposition of TA loci and CRISPR loci (Clustered Regularly Interspaced Short Palindromic Repeats) in some archaea. CRISPR-based adaptive immune systems target invading genetic elements, primarily viruses and conjugative plasmids, and they have been classified into three major types, of which only two (types I and III) occur in archaea, often with both major types present in the same archaeon (Garrett et al., 2011). The CRISPR arrays carry spacer regions taken up from invading genetic elements, and their processed transcripts are able to facilitate targeting and cleavage of genetic elements with matching sequences. An example of a complex assembly of a type III CRISPR-based system, present in the A. hospitalis genome, is shown in Figure 4. The CRISPR arrays and associated gene cassettes are interwoven with four vapBC loci for which all the antitoxins and toxins carry highly divergent sequences (Figure 2). Thus, these CRISPR-associated TA systems could play a secondary role in combating invading genetic elements by helping to maintain the functional CRISPR immune systems, which also tend to be located within the variable chromosomal regions. Another interesting aspect of this system is that one vapBC locus associated with the type III interference system in A. hospitalis (Figure 4B) shows a high level of sequence identity with vapBC loci specifically associated with a different subclass of type III interference systems found in the S. islandicus strains REY15A and HVE10/4 (Figure 4C) (Guo et al., 2011), suggesting that individual types of TA loci may coevolve with genes exhibiting specific functions.

Figure 7.4 Type III CRISPR systems linked to vapBC gene pairs. (A) CRISPR loci and genes of the acidothermophile A. hospitalis W1. CRISPR loci (black) show the numbers of repeats present. Genes encoding proteins involved in uptake of new spacers (adaptation) are labelled aCas; also shown are a gene encoding the RNA processing enzyme Cas6 and a gene cassette encoding type III interference proteins. Four vapBC gene pairs that are highly divergent in sequence are also present. (B) Expansion of the type III interference cassette of A. hospitalis, and (C) location of a highly similar vapBC gene pair next to a different class of type III CRISPR interference cassette (denoted Cmr) in S. islandicus HVE10/4. Numbers of repeats are indicated for each CRISPR locus.

7.11. Conclusions

Clearly, these are early days for studies of archaeal TA loci. Almost all of the experimental work to date has been performed on different bacterial TA systems, some of which have no equivalent amongst archaea. Support is provided here for a role in maintaining important regions of chromosomal DNA for those organisms, particularly members of the Sulfolobales and Thermococcales, which exhibit large variable genomic regions and often carry many TA loci. Involvement in response to nutrient deficiency and other stress factors is highly probable, and these potential functional roles are not mutually exclusive. A rationale is provided for parts of the highly conserved translational apparatus being the primary target for some toxins that archaea share with bacteria. Finally, it remains to be seen whether there are undiscovered archaea-specific TA systems, or possibly hybrid systems with bacterial and archaeal antitoxin-toxin components, which exclusively target archaeal cellular components.

References

Brügger,K., Redder,P., She,Q., Confalonieri,F., Zivanovic,Y., and Garrett,R.A. (2002) Mobile elements in archaeal genomes. FEMS Microbiol Letts 206: 131-141.
Christensen,S.K., and Gerdes,K. (2003) RelE toxins from bacteria and Archaea cleave mRNAs on translating ribosomes which are rescued by tmRNA. Mol Microbiol 48: 1389-1400.
Cooper,C.R., Daugherty,A.J., Tachdjian,S., Blum,P.H., and Kelly,R.M. (2009) Role of vapBC toxin-antitoxin loci in the thermal stress response of Sulfolobus solfataricus. Biochem Soc Trans 37: 123-126.
Dienemann,C., Bøggild,A., Winther,K.S., Gerdes,K., and Brodersen,D. (2011) Crystal structure of VapBC toxin-antitoxin complex from Shigella flexneri reveals a hetero-octameric DNA-binding assembly. J Mol Biol 414: 713-722.
Eddy,S.R. (2011) Accelerated profile HMM searches. PLoS Comput Biol 7: 10.
Filée,J., Siguier,P., and Chandler,M. (2007) Insertion sequence diversity in archaea. Microbiol Molec Biol Revs 71: 121-157.
Gadelle,D., Filee,J., Buhler,C., and Forterre,P. (2003) Phylogenomics of type II DNA topoisomerases. Bioessays 25: 232-242.
Garrett,R.A., Vestergaard,G., and Shah,S.A. (2011) Archaeal CRISPR-based immune systems: exchangeable functional modules. Trends Microbiol 19: 549-556.
Gerdes,K. (2000) Toxin-antitoxin modules may regulate synthesis of macromolecules during nutritional stress. J Bacteriol 182: 561-572.
Gerdes,K., Christensen,S.K., and Lobner-Olesen,A. (2005) Prokaryotic toxin-antitoxin stress response loci. Nat Rev Microbiol 3: 371-382.
Gribaldo,S., Poole,A.M., Daubin,V., Forterre,P., and Brochier-Armanet,C. (2010) The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse? Nat Rev Microbiol 8: 743-752.
Guo,L., Brügger,K., Liu,C., Shah,S.A., Zheng,H., Zhu,Y., et al. (2011) Genome analyses of Icelandic strains of Sulfolobus islandicus: Model organisms for genetic and virus-host interaction studies. J Bacteriol 193: 1672-1680.
Jørgensen,M.G., Pandey,D.P., Jaskolska,M., and Gerdes,K. (2009) HicA of Escherichia coli defines a novel family of translation-independent mRNA interferases in bacteria and archaea. J Bacteriol 191: 1191-1199.
Kletzin,A. (2007) General characteristics and important model organisms. In: Archaea Molecular and Cellular Biology (Ed. R. Cavicchioli) pp. 14-92. ASM press, Wash. DC, USA.
Kurland,C.G., Collins,L.J., and Penny,D. (2006) Genomics and the irreducible nature of eukaryotic cells. Science 312: 1011-1014.
Magnuson,R.D. (2007) Hypothetical functions of toxin-antitoxin systems. J Bacteriol 189: 6089-6092.
Maisonneuve,E., Shakespeare,L.J., Jørgensen,M.G., and Gerdes,K. (2011) Bacterial persistence by RNA endonucleases. Proc Natl Acad Sci USA 108: 13206-13211.
Makarova,K.S., Grishin,N.V., and Koonin,E.V. (2009a) The HicAB cassette, a putative novel, RNA targeting toxin-antitoxin system in archaea and bacteria. Bioinformatics 22: 2581-2584.
Makarova,K.S., Wolf,Y.I., and Koonin,E.V. (2009b) Comprehensive comparative-genomic analysis of Type 2 toxin-antitoxin systems and related mobile stress response systems in prokaryotes. Biol Direct 4: 19.
Melderen,L.V. (2010) Toxin-antitoxin systems: why so many, what for? Curr Opin Microbiol 13: 781-785.
Neubauer,C., Gao,Y.G., Andersen,K.R., Dunham,C.M., Kelley,A.C., Hentschel,J., et al. (2009) The structural basis for mRNA recognition and cleavage by the ribosome-dependent endonuclease RelE. Cell 139: 1084-1095.
Pandey,D.P., and Gerdes,K. (2005) Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic Acids Res 33: 966-976.
Pedersen,K., Christensen,S.K., and Gerdes,K. (2002) Rapid induction and reversal of a bacteriostatic condition by controlled expression of toxins and antitoxins. Mol Microbiol 45: 501-510.
Porter,K., Russ,B.E., and Dyall-Smith,M.L. (2007) Virus-host interactions in salt lakes. Curr Opin Microbiol 10: 418-424.
Prangishvili,D., Forterre,P., and Garrett,R.A. (2006) Viruses of the Archaea: a unifying view. Nat Rev Microbiol 4: 837-848.
Reno,M.L., Held,N.L., Fields,C.J., Burke,P.V., and Whitaker,R.J. (2009) Biogeography of the Sulfolobus islandicus pan-genome. Proc Natl Acad Sci USA 106: 8605-8610.
Rodriguez-Fonseca,C., Amils,R., and Garrett,R.A. (1995) Fine structure of the peptidyl transferase centre on 23 S-like rRNAs deduced from chemical probing of antibiotic-ribosome complexes. J Mol Biol 247: 224-235.
Snyder,J.C., Bateson,M.M., Lavin,M., and Young,M.J. (2010) Use of cellular CRISPR (clusters of regularly interspaced short palindromic repeats) spacer-based microarrays for detection of viruses in environmental samples. Appl Environ Microbiol 76: 7251-7258.
Valentine,D.L. (2007) Adaptations to energy stress dictate the ecology and evolution of the archaea. Nat Rev Microbiol 5: 316-323.
Wilbur,J.S., Chivers,P.T., Mattison,K., Potter,L., Brennan,R.G., and So,M. (2005) Neisseria gonorrhoeae FitA interacts with FitB to bind DNA through its ribbon-helix-helix motif. Biochem 44: 12515-12524.
Winther,K.S., and Gerdes,K. (2011) Enteric virulence associated protein VapC inhibits translation by cleavage of initiator tRNA. Proc Natl Acad Sci USA 108: 7403-7407.
Yamashiro,K., and Yamagishi,A. (2005) Characterization of the DNA gyrase from the thermoacidophilic Archaeon Thermoplasma acidophilum. J Bacteriol 8531-8536.
You,X-Y., Liu,C., Wang,S-Y., Jiang,C-Y., Shah,S.A., Prangishvili,D., et al. (2011) Genomic studies of Acidianus hospitalis W1 a host for studying crenarchaeal virus and plasmid life cycles. Extremophiles 15: 487-497.

Table 1 Phylogenetic tree of archaea for which complete genome sequences are available, together with the estimated number of TA loci of the vapBC, relBE and hicAB families, and the numbers of MNT/HEPN gene pairs. In the kingdom/phylum column (P), C denotes Crenarchaeota, E - Euryarchaeota, T - Thaumarchaeota, K - Korarchaeota and N - Nanoarchaeota. In the Order column (O), S denotes Sulfolobales, D - Desulfurococcales, O - Acidolobales, P - Thermoproteales, Y - Methanopyrales, T - Thermococcales, A - Archaeoglobales, C - Methanococcales, B - Methanobacteriales, M - Methanomicrobiales, N - Methanosarcinales, E - Methanocellales, H - Halobacteriales and L - Thermoplasmatales. The ecological niches of the different organisms are indicated together with their degree of thermophilicity, taking an optimal growth temperature of 50 °C as the boundary. The numbers of the different TA loci and MNT/HEPN gene pairs are colour-shaded: bright red (> 20), light red (20 to 11), pink (10 to 6), violet (5 to 3) and light blue (2 to 1). Approximate genome sizes and the Genbank/EMBL accession numbers are given for the genomes.
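The colour-shading scheme in the Table 1 legend amounts to a simple binning of counts, which can be sketched as follows. The function name is hypothetical, and the treatment of a count of zero as unshaded is an assumption, since the legend does not state how zero counts are displayed.

```python
def shade_for_count(n):
    """Map a count of TA loci (or MNT/HEPN gene pairs) to its Table 1 shade.

    Bin boundaries follow the table legend; returning None for zero
    (unshaded) is an assumption not stated in the legend.
    """
    if n > 20:
        return "bright red"
    if n >= 11:
        return "light red"
    if n >= 6:
        return "pink"
    if n >= 3:
        return "violet"
    if n >= 1:
        return "light blue"
    return None


if __name__ == "__main__":
    for count in (0, 2, 4, 8, 15, 35):
        print(count, shade_for_count(count))
```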
