
Metaproteomic and genomic analyses of Antarctic haloarchaea

Bernhard Tschitschko

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Biotechnology and Biomolecular Sciences Faculty of Science University of New South Wales

February 2017


THE UNIVERSITY OF NEW SOUTH WALES Thesis/Dissertation Sheet

Surname or Family name: Tschitschko

First name: Bernhard Other name/s: Josef

Abbreviation for degree as given in the University calendar: PhD

School: Biotechnology and Biomolecular Sciences Faculty: Science

Title: Metaproteomic and genomic analyses of Antarctic haloarchaea

Abstract 350 words maximum: Deep Lake is a hypersaline lake in the Vestfold Hills, Antarctica. Because of its high salinity (around ten times seawater), Deep Lake does not freeze during winter and the water temperature drops to -20°C. Environmental sequencing of Deep Lake biomass revealed that the lake harbours a low-complexity microbial community that is dominated by haloarchaea. Genomic analyses of four isolated haloarchaeal species, including the three most abundant species (Halohasta litchfieldiae, DL31 and Halorubrum lacusprofundi), revealed differences in nutrient utilization and a high level of gene exchange between them.

In this thesis, the Deep Lake microbial community was studied using metaproteomics. Through analyses of proteins that were present in lake samples, inferences were made about population structures and the functioning of the dominant members of the haloarchaeal community. The metaproteomics was complemented by metagenomic and genomic analyses, allowing an assessment of the viral community present in Deep Lake and interactions of viruses with haloarchaeal hosts. Strain variation was also assessed for Hrr. lacusprofundi by comparing the genome sequence of the Deep Lake type strain with a new strain isolated from a lake 30 km away.

The metaproteomics revealed differences in targeted substrates that were linked to the distinct physiologies of the dominant haloarchaea. Proteins derived from viruses indicated a diverse viral population in Deep Lake. Cell surface proteins with a high degree of sequence variation were detected for the haloarchaea. These proteins were derived from phylotypes that exist in the lake and represent a defence strategy of the haloarchaea to escape virus infection. Other anti-viral defence systems such as CRISPR and BREX were also determined to be active, and CRISPR-spacer analysis revealed host-virus relationships, including the identification of broad host-range viruses. The genomic comparison between the two Hrr. lacusprofundi strains revealed distinct types of strain variation, with the primary replicons highly conserved between the strains in comparison to the secondary replicons, which contained high levels of variation. The findings from this thesis significantly extend understanding of haloarchaeal community functioning in hypersaline environments, and provide unprecedented insight into the ecophysiology of Antarctic archaea.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

………………………………………… ……………………………………..………… ……….……………………...…… Signature Witness Date

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY Date of completion of requirements for Award:


COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………….

Date ……………………………………………….

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………….

Date ……………………………………………….


ORIGINALITY STATEMENT

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed ……………………………………………….

Date ……………………………………………….


Abstract

Deep Lake is a hypersaline lake in the Vestfold Hills, Antarctica. Because of its high salinity (around ten times seawater), Deep Lake does not freeze during winter and the water temperature drops to -20°C. Environmental sequencing of Deep Lake biomass revealed that the lake harbours a low-complexity microbial community that is dominated by haloarchaea. Genomic analyses of four isolated haloarchaeal species, including the three most abundant species (Halohasta litchfieldiae, DL31 and Halorubrum lacusprofundi), revealed differences in nutrient utilization and a high level of gene exchange between them.

In this thesis, the Deep Lake microbial community was studied using metaproteomics. Through analyses of proteins that were present in lake samples, inferences were made about population structures and the functioning of the dominant members of the haloarchaeal community. The metaproteomics was complemented by metagenomic and genomic analyses, allowing an assessment of the viral community present in Deep Lake and interactions of viruses with haloarchaeal hosts. Strain variation was also assessed for Hrr. lacusprofundi by comparing the genome sequence of the Deep Lake type strain with a new strain isolated from a lake 30 km away.

The metaproteomics revealed differences in targeted substrates that were linked to the distinct physiologies of the dominant haloarchaea. Proteins derived from viruses indicated a diverse viral population in Deep Lake. Cell surface proteins with a high degree of sequence variation were detected for the haloarchaea. These proteins were derived from phylotypes that exist in the lake and represent a defence strategy of the haloarchaea to escape virus infection. Other anti-viral defence systems such as CRISPR and BREX were also determined to be active, and CRISPR-spacer analysis revealed host-virus relationships, including the identification of broad host-range viruses. The genomic comparison between the two Hrr. lacusprofundi strains revealed distinct types of strain variation, with the primary replicons highly conserved between the strains in comparison to the secondary replicons, which contained high levels of variation. The findings from this thesis significantly extend understanding of haloarchaeal community functioning in hypersaline environments, and provide unprecedented insight into the ecophysiology of Antarctic archaea.


Acknowledgements

First and foremost I want to thank my supervisor Rick Cavicchioli and my co-supervisor Tim Williams. I am very grateful for the guidance and support you have provided throughout the course of the PhD. I could not have wished for better supervision; thank you for putting in so much effort and patience to help me grow as a scientist and as a person. I also want to thank all my lab and office friends in the Cavicchioli group for keeping me company throughout the journey of my PhD. Tahria, Taha, Peter, Jay, Josh, Miranda and Susanne, thanks a lot. Special thanks to Yan, who is submitting her thesis at the same time; it was great going through the challenges of thesis writing together. The PhD was a lot of work and I could only do it because I had a wonderful place to live that I shared with great people. All people from Boyce 252, Yan, Sharon, Dave, Chris, Allessandro, Philly, Luke and Haley, thanks so much for your support and for making me forget about science once in a while and enjoy life. I also want to thank Tim, Denise, Lane and Tamo for letting me be part of your family. Mama and Papa, I would not be here without you and I owe you everything, thanks so much. Thanks also to my brothers Max, Steve, and Joey and to all my friends back home, Wastl, Floda, Aichi, Nikki, Flori, Baumi, Pippo, Gege, Dappl, Michi, Pudi and Eva. Even though I was far away from you, you were very much part of my journey here.

Lastly I need to thank Mother Nature and the ocean for replacing the mountains that I left behind. Without the surf I could not have done this.


List of Publications

Tschitschko B, Williams TJ, Allen MA, Paez-Espino D, Kyrpides N, Zhong L, Raftery MJ, Cavicchioli R (2015) Antarctic archaea-virus interactions: metaproteome-led analysis of invasion, evasion and adaptation. The ISME Journal 9: 2094-2107.

Tschitschko B, Williams TJ, Allen MA, Zhong L, Raftery MJ, Cavicchioli R (2016) Ecophysiological Distinctions of Haloarchaea from a Hypersaline Antarctic Lake as Determined by Metaproteomics. Applied and Environmental Microbiology 82: 3165-3173.

Williams TJ, Allen MA, Tschitschko B, Cavicchioli R (2016) Glycerol metabolism of haloarchaea. Environmental Microbiology (submitted)

Oral presentations

6th International Conference on Polar and Alpine Microbiology. 6th to 10th of September 2015 in Budweis, Czech Republic. Title of presentation: 'Host-virus interactions in a frigid, hypersaline Antarctic lake revealed by metaproteomics'.

Joint Academic Microbiology Seminars (JAMS). 29th of July, Sydney, Australia. Title of presentation: 'Love or war? Host-virus interactions in a cold, salty Antarctic lake revealed by metaproteomics'.

School of BABS Research Symposium. 5th of November 2014, Sydney, Australia. Title of presentation: 'Love or war? Host-virus interactions in a cold, salty Antarctic lake revealed by metaproteomics'.

Poster presentation

Joint Academic Microbiology Seminars (JAMS). 5th Annual Symposium and Dinner. 16th of March 2016. Title of the poster: ‘Variation within the population of a dominant haloarchaeal species in Deep Lake, Antarctica, revealed by metaproteomics’.


Contents

Chapter 1 General introduction
1.1 Hypersaline environments and haloarchaea
1.1.1 Taxonomy of haloarchaea
1.1.2 Physiological adaptations to high salt
1.1.3 Saltern crystallizer ponds and the Dead Sea
1.1.4 Historical review of research on haloarchaea
1.1.5 Typical members of hypersaline communities
1.2 Antarctica and the Vestfold Hills
1.2.1 Deep Lake
1.2.2 Historical review of research of life in Deep Lake
1.3 Metaproteomics
1.4 Objectives

Chapter 2 Ecophysiological distinctions of Antarctic haloarchaea revealed through metaproteomics
2.1 Abstract
2.2 Introduction
2.3 Materials and Methods
2.3.1 Sample collection
2.3.2 Metaproteomics
2.3.3 Protein annotation
2.3.4 Statistical analysis
2.3.5 Epifluorescence microscopy of Deep Lake water samples
2.3.6 Growth studies
2.4 Results
2.4.1 Overview of the Deep Lake metaproteome
2.4.2 Transport proteins reveal distinctions in nutrient preferences
2.4.3 Carbohydrate metabolism of Hht. litchfieldiae
2.4.4 Nitrogen metabolism of Hht. litchfieldiae, DL31 and Hrr. lacusprofundi
2.4.5 Motility and taxis
2.4.6 Haloarchaeal responses to UV and oxidative stress
2.5 Discussion
2.5.1 The dependency of Hht. litchfieldiae on Dunaliella
2.5.2 Phosphorus as a limiting factor for Hht. litchfieldiae
2.5.3 Diverse strategies of nitrogen acquisition
2.5.4 Hht. litchfieldiae is very motile and responsive to the environment
2.5.5 Multiple responses to UV-induced damage
2.6 Conclusion

Chapter 3 Metaproteomics-led analyses of haloarchaea-virus interactions in Deep Lake
3.1 Abstract
3.2 Introduction
3.3 Materials and Methods
3.4 Results
3.4.1 Cell surface protein variation
3.4.2 HIRs
3.4.3 Viral proteins
3.4.4 Host–defence systems against viral infection
3.5 Discussion
3.5.1 Diverse and abundant viruses in Deep Lake
3.5.2 Avoiding viral infection through variation in cell surface structures
3.5.3 Virus encoded cell surface genes
3.5.4 CRISPR spacer analyses reveal haloarchaea-virus relationships in Deep Lake
3.5.5 BREX and TA systems in Deep Lake

Chapter 4 Analyses of intra species variation within Deep Lake haloarchaea
4.1 Abstract
4.2 Introduction
4.3 Materials and Methods
4.3.1 BLAST analysis of the Deep Lake metagenome protein database
4.3.2 Mapping of tADL-II contigs and creation of circular plots
4.3.3 Phylogenetic analyses of tADL-II ribosomal proteins
4.4 Results
4.4.1 Metaproteomic and metagenomic signatures of tADL-II
4.4.2 Phylogenetic analyses of detected tADL-II ribosomal proteins
4.4.3 Signatures of strain variation within the Deep Lake metagenome
4.4.4 Further variation in the Deep Lake metaproteome
4.5 Discussion
4.5.1 Distinctions between tADL-II and tADL revealed through metaproteomics
4.5.2 Phylotypes of Hht. litchfieldiae
4.5.3 Community structures of Deep Lake haloarchaea

Chapter 5 Strain variation in Hrr. lacusprofundi assessed through genomic comparison
5.1 Abstract
5.2 Introduction
5.3 Materials and Methods
5.3.1 Strain isolation and preparation for sequencing
5.3.2 Sequencing, genome assembly and further bioinformatic analyses
5.4 Results
5.4.1 Genome sequencing and manual closure of the R1S1 primary replicon
5.4.2 Overview of the R1S1 genome in comparison to ACAM34
5.4.3 Comparative analysis of primary replicons
5.4.4 Comparison of secondary replicons
5.5 Discussion
5.5.1 Highly conserved primary replicons exhibit variation related to virus infection pressure
5.5.2 Great diversity between secondary replicon content
5.5.3 Different strategies of genome organisation
5.5.4 HIRs, a haloarchaeal pan-genome in the Vestfold Hills and Rauer Islands

Chapter 6 Conclusion
6.1 The importance of the Deep Lake metagenome database for the metaproteomics
6.2 The value of genomic sequencing in the age of metagenomics
6.3 Vestfold Hills and Rauer Islands: great scope for unexplored haloarchaeal communities
6.4 Concluding remarks

References

Appendix A


List of Figures

Figure 1.1 Red pigmentation of haloarchaea.
Figure 1.2 Lakes of the Vestfold Hills, Antarctica.
Figure 1.3 Deep Lake, Antarctica.
Figure 1.4 Metaproteomic workflow.

Figure 2.1 Number of proteins detected for single filter samples.
Figure 2.2 Microscopy of Deep Lake water.
Figure 2.3 Taxonomic composition of metaproteomic samples.
Figure 2.4 Functional composition of metaproteomic samples.
Figure 2.5 Taxonomic composition of the Deep Lake metaproteome.
Figure 2.6 Functional composition of the Deep Lake metaproteome.
Figure 2.7 Relative abundance of functional categories for the three main species.
Figure 2.8 Relative abundance of transport proteins in the metaproteome.
Figure 2.9 Growth response of Hht. litchfieldiae to phosphonate.
Figure 2.10 Growth response of Hht. litchfieldiae to DHA.
Figure 2.11 Growth response of Hht. litchfieldiae to starch and pyruvate.

Figure 3.1 Proportion of cell surface protein variants.
Figure 3.2 Multiple sequence alignments of DL31 S-layer proteins.
Figure 3.3 Low coverage for Hht. litchfieldiae S-layer gene.
Figure 3.4 Alignment of Hht. litchfieldiae archaellin proteins.
Figure 3.5 Hrr. lacusprofundi archaellin variant alignment.
Figure 3.6 Comparison of selected regions of HIRs, viral, transposase and expressed genes of Deep Lake haloarchaea.
Figure 3.7 CRISPR systems in Deep Lake haloarchaea.
Figure 3.8 BREX gene clusters of DL31 and Hrr. lacusprofundi.

Figure 4.1 Initial identification of tADL-II contigs.
Figure 4.2 GC/read-depth plot of tADL-II contigs.
Figure 4.3 Variation between tADL and tADL-II.
Figure 4.4 Phylogenetic trees of ribosomal proteins.
Figure 4.5 Circular plot of Hht. litchfieldiae tADL and tADL-II genomes.

Figure 5.1 Rauer 1 Lake, Filla Island, Rauer Islands, Antarctica.
Figure 5.2 GC/coverage plot of the initial set of 89 R1S1 contigs.
Figure 5.3 NUCmer plot of R1S1 and ACAM34 primary replicons.
Figure 5.4 ACT alignment of R1S1 and ACAM34 primary replicons.
Figure 5.5 Archaellin protein alignment.
Figure 5.6 GC/coverage plot of R1S1 secondary replicon contigs.
Figure 5.7 ACT alignment of ACAM34 secondary replicons and matching R1S1 contigs.
Figure 5.8 Functional profile of ACAM34 and R1S1 unique secondary replicon regions.


List of Tables

Table 2.1 Transporter proteins in the Deep Lake metaproteome.
Table 2.2 Carbohydrate uptake and metabolism proteins detected for Hht. litchfieldiae.
Table 2.3 Proteins involved in the uptake and metabolism of nitrogen sources.
Table 2.4 Motility and taxis proteins.
Table 2.5 UV and oxidative stress repair and protection proteins.

Table 3.1 Cell surface proteins detected for the four Deep Lake isolate species.
Table 3.2 Characteristics of cell surface proteins.
Table 3.3 HIR encoded proteins from the metaproteome.
Table 3.4 Viral proteins in the Deep Lake metaproteome.
Table 3.5 Cas proteins detected in the metaproteome.
Table 3.6 CRISPR loci of the Deep Lake isolate species.
Table 3.7 Spacers from Deep Lake CRISPR loci.
Table 3.8 BREX clusters in Hrr. lacusprofundi and DL31.

Table 4.1 Proteins detected for tADL-II.
Table 4.2 Metagenome-encoded proteins matching the Deep Lake isolate species.
Table 4.3 Variant proteins detected for Hht. litchfieldiae, tADL, DL31 and Hrr. lacusprofundi.
Table 4.4 Variants of uncertain taxonomic origin.

Table 5.1 Genome characteristics of ACAM34 and R1S1.
Table 5.2 Primary replicon characteristics of ACAM34 and R1S1.
Table 5.3 Unique transposases of ACAM34 and R1S1.
Table 5.4 Unique features of ACAM34 and R1S1.
Table 5.5 Regions with low sequence similarity between ACAM34 and R1S1.
Table 5.6 Secondary replicon characteristics of ACAM34 and R1S1.
Table 5.7 Novel HIRs of R1S1.

Table A1 Complete list of proteins identified in the Deep Lake metaproteome.


List of Abbreviations

ABC ATP-binding cassette
ACT Artemis Comparison Tool
AEP 2-aminoethylphosphonic acid
ANI average nucleotide identity
ANOSIM analysis of similarity
BCAA branched-chain amino acid
BR bacteriorhodopsin
BREX bacteriophage exclusion
CAS CRISPR-associated
COG cluster of orthologues
CRISPR clustered regularly interspaced short palindromic repeats
DHA dihydroxyacetone
DHAP dihydroxyacetone phosphate
eDNA extracytoplasmic DNA
FR fragment recruitment
GDH glutamate dehydrogenases
GS glutamine synthetase
GOGAT glutamate synthase
HGT horizontal gene transfer
HIR high identity region
IMG integrated microbial genomes portal
LPS lipopolysaccharides
MCP methyl-accepting chemotaxis protein
MCP major capsid protein
MS mass spectrometry
NMDS non-metric multi-dimensional scaling
NaCl sodium chloride
ORF open reading frame
PCR polymerase chain reaction
PilA adhesion pili protein
R1 Rauer 1
ROS reactive oxygen species
RPKM reads per kilobase per 1,000,000 recruited reads
rRNA ribosomal RNA
S-layer surface layer
SNP single nucleotide polymorphism
SOD superoxide dismutase
SRII sensory rhodopsin II
SSU small subunit
TA toxin-antitoxin
TRAP tripartite ATP-independent periplasmic transporter
TTT tripartite tricarboxylate transporter
TUD tetranucleotide usage deviation
VLP virus-like particle


Chapter 1

General introduction

The main subject of this thesis was the haloarchaeal community of hypersaline Deep Lake, Antarctica. The community was assessed using a metaproteomics-led approach, integrated with metagenomic and genomic analyses. The aims of this introduction are: (1) to give an overview of research on hypersaline systems and their inhabitants, with a focus on haloarchaea, (2) to describe Deep Lake and summarize previous studies of its microbial community and (3) to describe how metaproteomics works.

It is hard to think of a natural environment on our planet that contains water but is completely devoid of life (Schaechter, 2007). In particular, microbial life has been found not only to exist but even to thrive in many environments unsuitable for humans to colonize. From a human perspective many of the environments inhabited by microorganisms appear extreme; hence the term ‘extremophiles’ was introduced to describe those organisms that ‘love life under extreme conditions’. Often it is members of the Archaea, the third domain of life beside Bacteria and Eukarya (Woese et al., 1990), that are found abundantly in environments such as terrestrial hot springs, hydrothermal vents on the ocean floor (black and white smokers), ice, and highly acidic, alkaline or anaerobic environments; they can grow at temperatures of up to 122 °C and are also abundant in cold Antarctic lakes (Kletzin, 2007; Takai et al., 2008; Cavicchioli, 2011). Only the advent of environmental sequencing techniques revealed that archaea are also widespread and diverse in environments such as soil and ocean water all over the planet (Schleper et al., 2005). Many ‘extremophilic’ archaea have an obligate requirement for these seemingly extreme conditions in order to grow; for example, the optimum growth temperature of cultivated hyperthermophiles (e.g. isolated from hot springs) generally lies within 80-106 °C and growth normally does not occur below 50 °C (Cavicchioli et al., 2011; Stetter, 2013). With this in mind it becomes apparent that there is no universal definition of the term ‘extreme’; rather, it is subjective: what is extreme and unbearable for one organism can be normal and essential for another.

1.1 Hypersaline environments and haloarchaea

Hypersaline environments are defined by their particularly high salt content. To be defined as hypersaline, an environment typically needs to contain more than twice the salt concentration of seawater (~7% or 1.2 M sodium chloride (NaCl)) (Oren, 2015), and concentrations can reach saturation levels (~35% or 5.2 M NaCl). While hypersaline habitats exist all around the world and include saline soils, tidal pools, deep-sea brine pools and even salted animal hides and foods, most research has focused on salt lakes and solar salterns (Ventosa, 2006; Ventosa et al., 2015). In addition to high salt concentrations, many hypersaline environments present further challenging conditions such as high solar irradiation, large temperature differences, low nutrient concentrations or alkaline pH (Ventosa, 2006).

Aquatic hypersaline systems can be categorized according to their salt composition as thalassohaline or athalassohaline. Thalassohaline systems are marine-derived and have a salt composition similar to that of the oceans, e.g. Lake Tyrrell in Victoria, Australia (Heidelberg et al., 2013). Athalassohaline systems can have variable salt compositions that reflect the surrounding geology and climate; examples of athalassohaline systems are the Dead Sea in the Near East and the Great Salt Lake in Utah, USA (Ventosa, 2006).

The salt-loving inhabitants of hypersaline environments are called halophiles. Halophiles are present in all three domains of life but, as in so many other extreme environments, it is typically archaea that are present in high abundance. Salt-loving archaea, also called haloarchaea, usually require > 1.5 M NaCl for growth in culture; however, most haloarchaea grow best between 3.5 and 4.5 M NaCl and can be described as extreme halophiles. Hence haloarchaea thrive in the most hypersaline environments, where they can reach cell densities of > 10⁷ cells/ml (Ventosa, 2006). High concentrations of haloarchaea are responsible for the reddish-pink colouration of some hypersaline lakes or salterns due to the C-50 carotenoid pigments (α-bacterioruberin and derivatives) in their cell membranes (Oren, 2002).


Figure 1.1 Red pigmentation of haloarchaea. The picture on the left shows the strong red colouration of a dense haloarchaeal culture. The picture on the right shows the red colouration due to high concentrations of haloarchaea in a saltern crystallizer pond in Eilat, Israel (picture adapted from Oren (2015)).

1.1.1 Taxonomy of haloarchaea

In this thesis the term haloarchaea is used colloquially for members of the Halobacteriaceae, the only family within the order Halobacteriales, part of the phylum Euryarchaeota. By February 2014 the Halobacteriaceae comprised 47 genera and 165 species (Oren, 2014c), with most members being strictly aerobic or facultatively anaerobic and red-pigmented. With the exception of members of the genera Haloquadratum and Halococcus, it is difficult to taxonomically categorize novel haloarchaeal species based on their morphology. Phenotypic differences between haloarchaeal genera are often small. Therefore, most new species and strains can only be assigned to genera through 16S ribosomal RNA (rRNA) gene sequence analysis (Oren, 2014c).

While the Halobacteriaceae can be considered the classic halophiles, it needs to be noted that they are not the only halophilic archaea. Halophilic methanogens can be found in the sediments of some salt lakes. Most of the halophilic methanogens belong to the family Methanosarcinaceae, while a few belong to the family Methanocalculaceae. Little is known about the diversity and abundance of halophilic methanogens. Their slow growth and sensitivity to molecular oxygen might have contributed to the lack of scientific attention for this group in the past (Oren, 2014c).

A further group of halophilic archaea are the Nanohaloarchaea. They represent a novel archaeal lineage and were only discovered in 2012 in a metagenomic study of Lake Tyrrell, where they were estimated to account for 10-25% of the archaeal community in surface water (Narasingarao et al., 2012). The same study identified nanohaloarchaeal signatures in metagenomes from hypersaline environments all around the world, indicating a widespread distribution. Nanohaloarchaea are small in size (~0.6 µm) and have a small estimated genome (~1.2 Mb) (Narasingarao et al., 2012). So far no member of the Nanohaloarchaea has been isolated.

1.1.2 Physiological adaptations to high salt

Life in hypersaline habitats requires special adaptations to deal with the high extracellular salt. Most halophilic bacteria and eukarya accumulate organic solutes (e.g. glycerol or simple sugars) in their cytoplasm to balance the high osmotic pressure, allowing them to retain low intracellular salt concentrations. Haloarchaea, on the other hand, have adopted a so-called ‘salt-in’ strategy whereby they accumulate intracellular KCl at concentrations at least equimolar to the external NaCl (Soppa, 2006; Oren, 2013b). In order for haloarchaeal proteins to function in the high-salt cytosol they contain a high proportion of the acidic residues glutamate and aspartate, and a lower proportion of the basic residues lysine, arginine and histidine. The high density of negative charges on their surfaces enables the proteins to stay soluble under high-salt conditions and not salt out (Soppa et al., 2008). Hence haloarchaeal proteomes are typically acidic and many haloarchaeal proteins will only function in high-salt buffers when expressed in vitro (Soppa, 2006; Oren, 2013b). Such adaptations of the proteome are not required for organisms using organic solutes for osmotic balance since these solutes are usually uncharged or zwitterionic (Oren, 2013b). In addition to data from experiments on isolated haloarchaeal species, analysis of metagenome data from environments with different salinities has shown an increase in protein acidity with increasing salinity (Rhodes et al., 2010).

Recent studies have revealed bacterial species that are similar to haloarchaea with respect to high-salt adaptations. Salinibacter ruber (Bacteroidetes) and some members of the Halanaerobiales (Firmicutes), all halophilic bacteria, contain high intracellular salt concentrations, and hence also use a salt-in strategy. The proteome of S. ruber was found to be similar in acidity to haloarchaeal proteomes, but those of the Halanaerobiales are not. How the Halanaerobiales protect their proteins from the high intracellular salt has not yet been discovered (Oren, 2013b).
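The acidic bias described above can be illustrated with a minimal Python sketch that computes, for a single protein sequence, the fraction of acidic residues (aspartate, D; glutamate, E) and of basic residues (lysine, K; arginine, R; histidine, H). The function name and the two toy sequences are hypothetical and are not taken from the data analysed in this thesis; a real comparison would be made over complete proteomes rather than single short peptides.

```python
# Illustrative sketch only: the sequences below are invented toy examples,
# not proteins from Deep Lake haloarchaea.

ACIDIC = frozenset("DE")   # aspartate, glutamate
BASIC = frozenset("KRH")   # lysine, arginine, histidine


def residue_charge_profile(sequence: str) -> dict:
    """Return the fractions of acidic and basic residues in a protein sequence
    (one-letter amino acid codes), plus their difference."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    acidic = sum(1 for aa in seq if aa in ACIDIC)
    basic = sum(1 for aa in seq if aa in BASIC)
    return {
        "acidic_fraction": acidic / n,
        "basic_fraction": basic / n,
        # Positive values indicate an excess of acidic over basic residues.
        "acidic_minus_basic": (acidic - basic) / n,
    }


# Hypothetical peptides: one rich in acidic residues, one rich in basic residues.
toy_sequences = {
    "acidic_toy_peptide": "MDEEDLAEDV",
    "basic_toy_peptide": "MKRLHAKDEV",
}

for name, seq in toy_sequences.items():
    profile = residue_charge_profile(seq)
    print(name, profile)
```

A positive `acidic_minus_basic` value is the kind of signal that, averaged over many proteins, would be expected for a salt-in organism as described in the text; a single peptide says little on its own.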


1.1.3 Saltern crystallizer ponds and the Dead Sea

Aquatic hypersaline environments of particular importance for the study of halophiles are crystallizer ponds from solar salterns and the Dead Sea. Many haloarchaeal species were first isolated and described from these habitats. Solar salterns are man-made salt-harvesting systems based on the evaporation of seawater. Seawater is concentrated and passed through a series of connected pools, whereby undesired salts such as calcium carbonate or calcium sulfate are gradually precipitated. Towards the end of the concentrating process are the so-called crystallizer ponds (Figure 1.1), which contain saturated NaCl concentrations; precipitated NaCl crystals are harvested from these ponds. Solar salterns have been the subject of numerous studies of hypersaline environments. In particular, the solar saltern in Santa Pola, near Alicante, Spain, and its crystallizer ponds have been studied extensively (Pašić and Rodríguez-Valera, 2014). Haloarchaeal species first isolated from crystallizer ponds include the type strains of Haloferax mediterranei, Halococcus saccharolyticus, Halorubrum saccharovorum, Haloarcula hispanica and many others (Oren, 2002).

The Dead Sea is a large (~632 km²) and deep (maximum depth of 300 m) salt lake in the Near East (south-west Asia). Due to the diversion of its source waters, the Dead Sea has a negative water budget, which leads to an increase in salinity (Bodaker et al., 2010; Rhodes et al., 2012). By 2015 the Dead Sea had an extreme salinity of ~35% NaCl and high concentrations of divalent cations (~2 M Mg²⁺ and 0.5 M Ca²⁺). These conditions currently do not support dense microbial communities (Oren, 2015). Microbial blooms of haloarchaea only occur through the dilution of surface water following rainy winters, with the last such bloom occurring in 1992 (Bodaker et al., 2010). Haloarchaeal species isolated from the Dead Sea include the model organisms Haloferax volcanii and Haloarcula marismortui (Mullakhanbhai and Larsen, 1975; Oren et al., 1990).

1.1.4 Historical review of research on haloarchaea

Research on haloarchaeal species has been performed since the end of the 19th century and has provided many ground-breaking insights for microbiology in general and archaea research in particular (Soppa, 2006; Soppa et al., 2008; Cavicchioli, 2011). One reason for the pioneering role of haloarchaea in archaea research is that some haloarchaeal species are among the easiest to cultivate under laboratory conditions; many of them grow aerobically and on solid complex media. Haloarchaea

were the first archaea with an available transformation system, allowing molecular tools like vectors and gene knock-out systems to be developed (Soppa, 2006; Cavicchioli, 2011). Hence haloarchaeal species like Halobacterium salinarum, Hfx. volcanii, or Har. marismortui have been valuable model organisms for studying physiological adaptation to different environmental factors, e.g. osmoadaptation to high salt concentrations or UV-stress responses (McCready and Marcello, 2003; Oren, 2013b). In this section I reflect on some of the highlights of haloarchaeal research.

The first isolated archaeal species was a haloarchaeon, a member of the genus Halococcus, isolated from salted fish in 1880 (Kocur and Hodgkiss, 1973; Cavicchioli, 2011, and references therein). The haloarchaeal model organisms Hbt. salinarum, Har. marismortui and Hfx. volcanii were first isolated ~90, 50 and 40 years ago, respectively (Mullakhanbhai and Larsen, 1975; Oren et al., 1990; Soppa, 2006).

After Woese and colleagues first realised the presence of an additional ‘ur-kingdom’ (first termed ‘archaebacteria’, which would later become the third domain Archaea) through 16S rRNA gene analyses, they searched for phenotypic characteristics unique to all archaea and distinct from bacteria (Woese, 2007). The first such feature they identified was the presence of ether-linked, branched-chain membrane lipids in archaea, as opposed to the ester-linked, straight-chain lipids in bacteria. This new class of lipids was first observed in haloarchaea.

In 1971, the purple membrane protein bacteriorhodopsin was discovered in Halobacterium (Oesterhelt and Stoeckenius, 1971). Bacteriorhodopsin contains seven characteristic transmembrane domains and, together with retinal, acts as a light-sensitive proton transporter, thus enabling the generation of energy from sunlight (phototrophy).
Until the discovery of bacteriorhodopsin, retinal was only known to occur in visual rhodopsins within the eye structures of animals and humans (Grote and O'Malley, 2011). Due to their structural similarity, bacteriorhodopsin has long served as a model for human G protein-coupled receptors, a class of proteins targeted by a large number of medicinal drugs (Cavicchioli, 2011). Since the first discovery, bacteriorhodopsin has become one of the best studied membrane proteins and its research has made significant contributions within many fields of biology (Grote and O'Malley, 2011). Following the discovery of bacteriorhodopsin, further classes of rhodopsins were identified in haloarchaea: halorhodopsin, a chloride transporter, and the phototactic

sensory rhodopsins I and II (Grote and O'Malley, 2011). Only in 2000 were the first microbial rhodopsins outside of the haloarchaea discovered: proteorhodopsin was identified in marine Proteobacteria (Beja et al., 2000) and has since been found to be widespread within many bacterial and archaeal taxa (Soppa et al., 2008).

The first archaeal virus was identified infecting Hbt. salinarum (Torsvik and Dundas, 1974; Prangishvili et al., 2006). Since then it has been recognized that viruses play a fundamental role in hypersaline community functioning and dynamics (Rodriguez-Valera et al., 2009).

In 1980, Walsby reported the presence of a dominant ‘square bacterium’ in a saturated brine pool on the Sinai Peninsula (Walsby, 1980). He discovered the unusually shaped microorganism due to its high content of gas vesicles, which are gas-filled, proteinaceous structures present in some bacteria and archaea that give them buoyancy and enable vertical migration in aquatic environments (Walsby, 1980; Pfeifer, 2012). Following their initial discovery, the square-shaped microorganisms were identified as the dominant organism in crystallizer ponds and some salt lakes (e.g. Lake Tyrrell) all around the world (Oren, 2002; Ventosa et al., 2015). The isolation of the square organisms turned out to be a challenging task: only in 2004, over 20 years after their discovery, did two research groups report the independent isolation of the square-shaped haloarchaeon Haloquadratum walsbyi (Bolhuis et al., 2004; Burns et al., 2004); strain HBSQ001 was isolated from a crystallizer pond of a solar saltern in Santa Pola, near Alicante, Spain, and strain C23 was isolated from Lake Tyrrell. While it was mentioned above that one of the advantages of haloarchaea as model organisms is their relatively simple cultivation, Hqr. walsbyi is a reminder that this cannot be generalized across all members of the haloarchaea.

The first haloarchaeal genome sequence (Hbt.
salinarum strain NRC-1) was published in 2000, and initial analysis already indicated the presence of genes that are more similar to homologues in bacteria than to those in other archaea. These genes might therefore have been acquired through horizontal gene transfer (HGT) (Ng et al., 2000). Subsequently, a number of studies have revealed a range of different genes that were subject to HGT between haloarchaea and other taxonomic groups (Fullmer et al., 2014). Most haloarchaea possess leucyl-tRNA synthetases that are phylogenetically related to those of bacteria and might originate from an ancient gene transfer event from an ancestor of the Bacteria to haloarchaea (Andam et al., 2012). Three out of the four rhodopsin genes in the halophilic bacterium S. ruber are of haloarchaeal origin


(Mongodin et al., 2005). Phylogenetic analysis of archaeal ribosomal proteins indicates that the haloarchaea have a methanogenic ancestor (Matte-Tailliez et al., 2002). Methanogens are obligate anaerobes that generate energy through the production of methane using a range of simple (organic and inorganic) substrates like H₂, CO₂ or methanol, while haloarchaea are obligate heterotrophs and mostly obligate or facultative aerobes. Hence the two groups have fundamentally different metabolisms (Fullmer et al., 2014). One study reported that the origin of the haloarchaea coincided with a massive gene-transfer event (over 1000 genes) from bacteria to a methanogen, including the genes necessary for the transformation of a methanogen into an oxidative heterotroph (Nelson-Sathi et al., 2012; Fullmer et al., 2014). While the methodology and conclusions of Nelson-Sathi et al. (2012) have since been contested (Becker et al., 2014; Groussin et al., 2016), it seems evident that HGT has played a particularly important role throughout the evolution of haloarchaea (Becker et al., 2014).

1.1.5 Typical members of hypersaline communities

1.1.5.1 Haloarchaea

In aquatic hypersaline habitats haloarchaea typically represent the group of organisms with the highest abundance, which increases with increasing salinity. They can account for up to ~90% of the total archaeal and bacterial population of some systems (Ventosa et al., 2015). Because they are relatively simple to isolate, species of the genera Halobacterium, Halorubrum, Haloferax and Haloarcula were long thought to constitute the dominant members of most haloarchaeal communities (Ventosa et al., 2015). Following Walsby’s observation of abundant square-shaped cells, it became evident that these cells represent the dominant cell type in many hypersaline environments (Oren, 2002). Finally, environmental sequencing studies have shown that most of the isolated haloarchaeal species are in fact only rare members of their respective communities (Ventosa, 2006). A characteristic of very hypersaline environments is their low species diversity, with a relatively small number of species representing the majority of the community. Hqr. walsbyi can represent up to 80% of the microbial community in environments with saturating salt concentrations (Dyall-Smith et al., 2011). Genome sequencing of two isolates together with metagenomic sequencing has revealed that a high level of intra-species diversity exists for Hqr. walsbyi (Legault et al., 2006; Cuadros-Orellana et al.,


2007; Dyall-Smith et al., 2011; Podell et al., 2013). Genomic islands were identified in the genome of Hqr. walsbyi that were not well represented in metagenomes, indicating that within the populations of this species diversity exists for genes encoded in genomic islands (Legault et al., 2006; Cuadros-Orellana et al., 2007).

1.1.5.2 Dunaliella

With increasing salinity the presence and abundance of eukaryotic species in aquatic environments usually decreases, and often the unicellular green alga Dunaliella (Chlorophyceae) represents the only eukaryote (Ventosa, 2006). Dunaliella can be found in many hypersaline systems all around the world, where it is often thought to be the main photosynthetic primary producer (Oren, 2014b). Conditions in the Dead Sea, from which many Dunaliella isolates were cultured in the past, are currently too hostile and no Dunaliella cells are present. However, the last haloarchaeal blooming events in the Dead Sea were triggered by an increase in Dunaliella numbers following heavy rainfalls (Oren, 2014b). The two species most often reported in salt lakes and salterns are D. salina and D. viridis (Oren, 2014b). Unlike haloarchaea, Dunaliella produces high amounts of intracellular glycerol for osmotic adaptation to the high salt in its environment. It has long been hypothesised that Dunaliella-derived glycerol represents the main carbon source for the heterotrophic haloarchaea in hypersaline environments (Elevi Bardavid et al., 2008). Indeed, many isolated haloarchaeal species can catabolize glycerol when grown in culture (Falb et al., 2008).

1.1.5.3 Halophilic bacteria

Based on cultivation experiments and activity measurements, for a long time the prevailing assumption was that there were no (or hardly any) bacteria present in hypersaline environments with salt concentrations near saturation level (Oren, 2002).
Only when the first 16S rRNA gene sequencing experiments were carried out in crystallizer ponds was the presence of extremely halophilic bacteria recognized (Martínez‐Murcia et al., 1995; Oren, 2002). In 2002, the red-pigmented, rod-shaped bacterium S. ruber was isolated from crystallizer ponds, where it represents 5-25% of the total microbial community (Anton et al., 2002). Hence, besides Hqr. walsbyi, S. ruber is often the only other species with high abundance in solar crystallizer ponds (Pašić and Rodríguez-Valera, 2014). Since then S. ruber and related species from the same phylum Bacteroidetes have been found in hypersaline environments all around the world (Oren, 2002; Peña et al., 2014). S. ruber was found to contain many

characteristics otherwise typical for haloarchaea, including a salt-in strategy of KCl for osmotic adaptation and an acidic proteome (Oren, 2013a). Similar to Hqr. walsbyi, genomic and metagenomic sequencing analyses revealed a high degree of variation within S. ruber populations (Pašić et al., 2009; Pena et al., 2010). Thus, it was perhaps not too surprising when a detailed analysis of a S. ruber genome revealed the presence of genes that originated through HGT from haloarchaea (Mongodin et al., 2005).

1.1.5.4 Halophilic viruses

Even though the first isolated archaeal virus was one that infected a haloarchaeal host (Torsvik and Dundas, 1974), progress in the field of haloarchaeal viruses was slow for a long time and not many haloarchaeal viruses have been studied in great detail (Dyall-Smith et al., 2003). Similar to halophilic bacteria, viruses were also long neglected in studies of hypersaline environments. In 1996, microscopic studies of a solar saltern revealed a high abundance of virus-like particles (VLPs). Both VLPs and archaeal/bacterial cells increased in abundance with increasing salinity. VLPs reached concentrations of up to ~10⁹ VLPs/ml in the crystallizer ponds, outnumbering archaeal/bacterial cells by a factor of ten (Guixa-Boixareu et al., 1996). The same study also described a lack of protozoan predators in the crystallizer ponds, leaving viruses as the only predators of archaeal and bacterial cells (Guixa-Boixareu et al., 1996). Around the same time a high number of viruses with diverse morphologies were reported in the Dead Sea, again outnumbering archaea/bacteria (Oren et al., 1997). The higher abundance of viruses compared to potential host cells is today recognized as a general feature of many saline environments, e.g. the world’s oceans (Suttle, 2007).
Since these initial studies and particularly during the last decade, a lot of progress has been made in the identification and characterization of haloarchaeal viruses. There are now around 100 isolated haloarchaeal viruses with around 40 sequenced viral genomes; and many metagenomic studies have established viruses as pivotal players that shape the microbial communities of hypersaline environments (Rodriguez-Valera et al., 2009; Atanasova et al., 2016).

1.2 Antarctica and the Vestfold Hills

Permanently low temperatures, high irradiation levels during summer and complete darkness during winter are some of the factors that make Antarctica unsuitable for many species inhabiting Earth’s lower latitudes. Nevertheless, particularly the coastal regions

of Antarctica, including the Antarctic Peninsula and nearby islands, are home to many aquatic and terrestrial species which have adapted to the unique prevailing conditions (Chown et al., 2015). While only a few flowering plant species are present in Antarctica, a comparatively rich diversity exists for lichens and bryophytes (Peat et al., 2007). Some groups of invertebrate animals, e.g. nematodes and springtails, are also present with relatively high species numbers. More than 8000 marine species are described for the Southern Ocean surrounding Antarctica, including well-known species of whales, seals, penguins and albatrosses (Chown et al., 2015). However, as in so many other ecosystems of the world, the highest diversity can be observed within the microbial life of Antarctica. This fact has only recently become apparent through the application of environmental sequencing techniques like metagenomics (Cavicchioli, 2015; Chown et al., 2015). Besides the vast water body of the Southern Ocean, Antarctica contains a number of diverse lake systems (Wilkins et al., 2013). One Antarctic region that is particularly rich in lake systems is the Vestfold Hills (Gibson, 1999), home of the Australian-run Davis Station.


Figure 1.2 Lakes of the Vestfold Hills, Antarctica. Figure adapted from Zwartz et al. (1998).

The Vestfold Hills are an ice-free region of around 410 km² on the Ingrid Christensen Coast in East Antarctica (Zwartz et al., 1998). They formed through isostatic rebound (i.e. a rise of the land mass) following the retreat of the continental ice sheet around 10,000 years ago (Gibson, 1999). This process has resulted in a landscape that is dotted with hundreds of lakes. Many of the lakes are marine-derived, with salinities ranging from freshwater (where the salt was flushed out by meltwater) to hypersaline (where the trapped seawater was concentrated through evaporation) (Gibson, 1999). The Vestfold Hills are especially rich in meromictic lakes (lakes in which the bulk of the water does not mix throughout the course of a year); many of them are permanently stratified. It is estimated that the Vestfold Hills harbour the highest concentration of meromictic lakes in the world (Gibson, 1999). The microbial communities of many lakes in the Vestfold Hills have been studied (Wright and Burton, 1981; Sawstrom et al., 2008; Wilkins et al., 2013;


Cavicchioli, 2015) and especially the application of metagenomics has revealed many unexpected and novel findings (Yau et al., 2011; DeMaere et al., 2013). While some of the studied lakes of the Vestfold Hills and other regions of Antarctica are also hypersaline, most of them do not harbour haloarchaea-dominated microbial communities (Wright and Burton, 1981; Bowman et al., 2000; Murray et al., 2012; Yau et al., 2013). In many cases permanent stratification combined with an unsuitable geochemistry (e.g. anoxic layers) would have prevented high abundance of haloarchaea.

1.2.1 Deep Lake

With > 270 g/l NaCl, Deep Lake (68°33′36.8″S, 78°11′48.7″E) is the most hypersaline lake in the Vestfold Hills (Wright and Burton, 1981). Due to this high salinity, Deep Lake never freezes and remains liquid even though water temperatures can be as low as -20°C (Ferris and Burton, 1988); the calculated freezing point of Deep Lake water is around -28°C (Hand, 1980). Unlike the many meromictic lakes in the Vestfold Hills, Deep Lake is monomictic, i.e. the bulk of the water body is mixed over the course of a year. Hence Deep Lake contains a relatively homogeneous 36 m deep water column (deep compared to many other hypersaline systems) with very stable physical and chemical conditions. Because Deep Lake is marine-derived, its relative ionic composition is thalassohaline (i.e. similar to that of seawater). Inorganic and organic nutrients are evenly distributed with no great fluctuations between seasons (Barker, 1981; Ferris and Burton, 1988). Between 1977 and 1979, the largest observed seasonal differences occurred in surface water temperatures, ranging from around -18°C during the dark winter up to 12°C for a brief period in summer due to prolonged periods of sunshine (Ferris and Burton, 1988). Temperatures above 0°C were only recorded in the upper 10-15 m during the 3-4 months of summer, whereas the bottom part of the lake stayed extremely cold throughout the year (-8 to -17°C) (Barker, 1981; Ferris and Burton, 1988).


Figure 1.3 Deep Lake, Antarctica. Picture courtesy of Rick Cavicchioli.

1.2.2 Historical review of research of life in Deep Lake

The aim of the following historical review of studies describing Deep Lake’s biota is to illustrate how scientific knowledge and perception can change over time, often advanced through the implementation of novel technologies. No invertebrates or herbivores exist in Deep Lake and for a long time the green alga Dunaliella was thought to be the only active species in the lake, although measured primary productivity was extremely low, making Deep Lake one of the least productive lake systems in the world (Campbell, 1978; Wright and Burton, 1981). In 1980, Hand reported that there was no sign of bacterial activity even though ~10⁵ bacterial cells per ml were counted at all depths in Deep Lake (Hand, 1980). Bacterial growth was only observed when Deep Lake water was diluted by three quarters. Hence Hand concluded that the bacterial cells were washed in through meltwater streams during summer but were not active in Deep Lake (Hand, 1980). Wright and Burton (1981) speculated that ‘Their (the prokaryotes) inability, contrasted with the growth of the eukaryote Dunaliella, dispels some expectations about the supremacy of prokaryotes in adaptations to harsh environments’ (Wright and Burton, 1981). However, subsequent studies and novel technologies have revealed that rather the opposite is true. In 1988, Franzmann et al. accidentally isolated the haloarchaeon Halorubrum

lacusprofundi growing in media prepared with 100% lake water, during an attempt to isolate Dunaliella (Franzmann et al., 1988). The first 16S rRNA gene sequencing study with DNA isolated from Deep Lake sediment revealed that the lake harboured a microbial community of low diversity that is dominated by haloarchaea (Bowman et al., 2000). Hrr. lacusprofundi was described as the predominant species while bacteria were found to be of low abundance (Bowman et al., 2000). In the following years further Deep Lake haloarchaea were isolated (Mou et al., 2012) and, together with Hrr. lacusprofundi, had their genomes sequenced (DeMaere et al., 2013).

In 2013, DeMaere et al. reported the results of a large-scale sequencing study including analyses of (1) the closed genome sequences of four isolated Deep Lake haloarchaea: Halohasta litchfieldiae strain tADL, Hrr. lacusprofundi, strain DL31 (unknown genus) and Halobacterium DL1; (2) small subunit (SSU) rRNA pyrotag sequencing of Deep Lake DNA; and (3) metagenomic sequencing of Deep Lake DNA (DeMaere et al., 2013). For (2) and (3) Deep Lake biomass was collected through sequential size filtration of Deep Lake water on filters representing the 20-3.0, 3.0-0.8 and 0.8-0.1 µm size fractions. These samples were taken from 5, 13, 24 and 36 m depth (DeMaere et al., 2013). The four isolated species all belong to different genera within the Halobacteriaceae. Of the four, Hht. litchfieldiae is the only species with a genome composed of a single replicon (3.33 Mb). The genomes of the three other species each comprise one large primary replicon (2.7-2.9 Mb) and one or two secondary replicons, which add up to a total genome size of 3.2-3.7 Mb. Deep Lake was found to harbour a low-complexity microbial community with similar species composition regardless of sample depth or filter size fraction. The four isolated species together represent ~72% of the cellular community. Hht.
litchfieldiae is highest in abundance with ~43%, followed by DL31 with ~18% and Hrr. lacusprofundi with ~10%. DL1 was found to represent only a minor fraction with around 0.3% (DeMaere et al., 2013). Unlike Bowman et al. (2000), who identified a phylotype corresponding to Hrr. lacusprofundi as the dominant organism, the SSU rRNA pyrotag and metagenomic sequencing data of DeMaere et al. (2013) clearly showed that at the time of sampling, Hht. litchfieldiae dominated the Deep Lake community. While it is in theory possible that the community composition of Deep Lake changed between the two studies, this is rather unlikely since haloarchaea in Deep Lake are calculated to complete only six generations per year (DeMaere et al., 2013). The study of Bowman et al.


(2000) was carried out on a sediment sample, whereas DeMaere et al. (2013) sampled planktonic biomass, allowing for speculation that community composition differs between the two habitats. However, it is more likely that improvements in primer design and sequencing technology since the earlier study have reduced primer bias during PCR and increased resolution. Nonetheless, the study of Bowman et al. (2000) greatly advanced the knowledge about Deep Lake’s microbial community at the time of its appearance.

The most striking result reported by DeMaere et al. (2013) was the identification of so-called high-identity regions (HIRs) in the genomes of the four isolated species. These are stretches of DNA of up to 35 kb in length that are virtually identical (~100% nucleotide identity) within two or more of the isolated species. With the exception of three pairs of closely related Halobacterium, Haloquadratum and Haloarcula strains, HIRs were only identified in the Deep Lake species, which are more distantly related and belong to different genera. These results indicate that gene exchange is occurring frequently within the community of Deep Lake haloarchaea (DeMaere et al., 2013). However, the mechanism by which HIRs are exchanged has yet to be discovered. The question was raised how Deep Lake could sustain different species despite the high level of observed gene exchange (DeMaere et al., 2013). Functional analysis of the genomes of the four isolated species was indicative of diverse lifestyles. Their genomes are geared for exploiting different nutrients and substrates and hence they occupy different niches within the seemingly uniform water body of Deep Lake (DeMaere et al., 2013; Williams et al., 2014).
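The notion of a high-identity region can be illustrated with a minimal sketch that reports maximal stretches shared exactly between two sequences. This is not the procedure of DeMaere et al. (2013), which is not detailed here; it assumes exact matching only (a real analysis would use alignment tools and tolerate rare mismatches), and the function name, toy sequences and length threshold are hypothetical.

```python
def shared_regions(seq_a, seq_b, min_len):
    """Return maximal exact matches between seq_a and seq_b.

    Each hit is (start_in_a, start_in_b, length) with length >= min_len.
    Uses a row-by-row dynamic programme, O(len(a) * len(b)) time, so it is
    only suitable for toy sequences, not for whole genomes.
    """
    regions = []
    prev = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        curr = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                # Length of the identical run ending at a[i-1] and b[j-1].
                curr[j] = prev[j - 1] + 1
                # The run is maximal if it cannot be extended to the right.
                at_end = i == len(seq_a) or j == len(seq_b)
                if (at_end or seq_a[i] != seq_b[j]) and curr[j] >= min_len:
                    length = curr[j]
                    regions.append((i - length, j - length, length))
        prev = curr
    return regions


if __name__ == "__main__":
    # Hypothetical toy "genomes" sharing one identical 12-nt stretch.
    shared = "ACGTTGCAAGGC"
    genome_a = "TTTTT" + shared + "TTTTT"
    genome_b = "GGGGG" + shared + "AAAAA"
    for start_a, start_b, length in shared_regions(genome_a, genome_b, 10):
        print(f"shared region: a[{start_a}:], b[{start_b}:], length {length}")
```

In this toy setting, a shared region that is long relative to the divergence of the flanking sequence is the signature that, at genome scale, points to recent exchange rather than common ancestry.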

1.3 Metaproteomics

Throughout the last two decades the field of environmental microbiology has undergone a substantial transformation. Advances in technology, most notably DNA sequencing technology, have allowed the field to expand its horizon beyond laboratory-based experiments on cultivable species, towards studying complex microbial communities directly in their natural environment. Characteristic of this transition is the entry of the prefix ‘meta’ into microbiology’s vocabulary. Metagenomics describes the sequencing of DNA isolated from environmental samples. It gives insights into community composition (which species are there) and functional potential (which genes are present). Insights into which functions are carried out by certain members of a community can be gained through metatranscriptomics and metaproteomics.


Metatranscriptomics is the study of the transcribed RNA pool of a community and metaproteomics refers to the characterisation of the protein content. While metatranscriptomics provides information about gene expression levels in the form of RNA abundance, metaproteomics informs about the final end products of gene expression, taking into consideration transcriptional and translational control, and mRNA and protein stability and turnover (Williams and Cavicchioli, 2014). An example of a metaproteomics workflow for aquatic systems is depicted in Figure 1.4. In brief, biomass is collected on filters through filtration. From this biomass, the proteins are extracted and digested into smaller peptides (e.g. tryptic digest). In a shotgun approach, the peptide solution is loaded onto a liquid chromatography (LC) column which is in-line with a mass spectrometer: separated peptides that exit the LC are directly analysed through mass spectrometry (MS). MS-generated spectra are then matched back to peptides and the corresponding proteins using a database (peptide-spectrum matching). At the end of the workflow comes the computational annotation of the identified proteins, including gathering taxonomic (which species is the protein derived from) and functional (what function is carried out by the protein) information. Two steps within this workflow require particular consideration. (1) The choice of an appropriate protein extraction method. Applying different extraction methods to the same environmental sample can result in great differences in the set of identified proteins (Leary et al., 2013). Methods for the extraction of soluble cytoplasmic proteins can miss most of the membrane-associated proteins and vice versa. Hence, depending on the question to be resolved, an appropriate protein extraction method needs to be chosen. (2) The quality of the database used for the peptide-spectrum matching.
In order for an MS spectrum to be assigned to a certain peptide, the peptide needs to be present in the database used. It is therefore of utmost importance that the database provides a good representation of the organisms present in the analysed sample. A metagenome database generated from the same sample that is analysed through metaproteomics can be considered an optimal database. While there are also database-independent peptide identification approaches, these are inherently more complex and have so far only rarely been used for metaproteomics (Seifert et al., 2013).
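The dependence of peptide-spectrum matching on database content can be illustrated with a minimal sketch. It assumes a strongly simplified search: database proteins are digested in silico following trypsin rules (cleavage after K or R, not before P, no missed cleavages) and only peptide masses are compared, whereas real search engines score fragment-ion spectra. The protein identifiers, sequences, function names and the 10 ppm tolerance are hypothetical and chosen for illustration only.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the free peptide termini


def tryptic_digest(protein, min_len=6):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    # Very short peptides are rarely informative and are discarded.
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_index(database):
    """Digest every protein and list (mass, peptide, protein_id) entries."""
    return [(peptide_mass(pep), pep, prot_id)
            for prot_id, seq in database.items()
            for pep in tryptic_digest(seq)]


def match_mass(observed_mass, index, tol_ppm=10.0):
    """Return (peptide, protein_id) pairs within tol_ppm of the observed mass."""
    return [(pep, prot_id) for mass, pep, prot_id in index
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm]


# Hypothetical two-protein database standing in for a metagenome-derived one.
DATABASE = {
    "protA": "MKTAYIAKQRPLSDEGHKAAR",
    "protB": "GGSWLVEHFKNNDTPLR",
}

if __name__ == "__main__":
    index = build_index(DATABASE)
    # A peptide that is represented in the database is recovered ...
    print(match_mass(peptide_mass("TAYIAK"), index))
    # ... whereas a peptide from an unrepresented organism yields no match.
    print(match_mass(peptide_mass("WWWWWWK"), index))
```

The second query returns an empty list regardless of spectrum quality, which is the reason a database built from a matched metagenome is preferred.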


Figure 1.4 Metaproteomic workflow. Picture of biomass filter and sampling of Deep Lake courtesy of Rick Cavicchioli; picture of mass spectrometer adapted from https://en.wikipedia.org/wiki/Orbitrap

In the past few years metaproteomics has been successfully used for studying the microbial communities of aquatic Antarctic systems, including the Southern Ocean and lakes in the Vestfold Hills (Ng et al., 2010; Lauro et al., 2011; Yau et al., 2011; Williams et al., 2012; Williams et al., 2013). In these studies metaproteomics helped link certain members of the prevailing microbial communities to key metabolic processes (reviewed in Williams and Cavicchioli, 2014).

1.4 Objectives

The overall aim of this thesis was to expand our understanding of haloarchaeal community functioning and dynamics in the hypersaline Deep Lake. With the available large-scale metagenomic and genomic sequencing data and with metaproteomic techniques adapted to aquatic Antarctic systems, it was feasible to apply metaproteomics to the Deep Lake community, making it the first ever metaproteomic study of a haloarchaea-dominated hypersaline lake system. Guided by the metaproteomics, further metagenomic and genomic analyses were performed to complement the data for a more holistic view of the system (Chapters 2, 3 and 4). Finally, prompted by the results of this ‘meta’ approach, the genome of a related haloarchaeal strain from a different but nearby hypersaline Antarctic system was sequenced and analysed. The genomic comparison of this novel strain with the type strain (isolated from Deep Lake) allowed analysis of strain-specific variation for this species (Chapter 5). Furthermore,

the genome of this novel strain made it possible to investigate for the first time whether certain characteristics, so far only described for haloarchaea in Deep Lake, are also present in haloarchaea from other Antarctic locations.

Specific objectives of the thesis were:
• Investigate differences in the metabolic functioning between Deep Lake haloarchaea and elucidate possible reasons for the dominance of Hht. litchfieldiae in the community.
• Investigate the role viruses play within the microbial community of Deep Lake.
• Learn about variation that exists within the populations of Deep Lake haloarchaea.
• Investigate strain-specific variation for the species Hrr. lacusprofundi with a focus on its genomic organisation.



Chapter 2

Ecophysiological distinctions of Antarctic haloarchaea revealed through metaproteomics

Co-authorship statement

Sections from this chapter have been published as:

Tschitschko B, Williams TJ, Allen MA, Zhong L, Raftery MJ, Cavicchioli R (2016) Ecophysiological Distinctions of Haloarchaea from a Hypersaline Antarctic Lake as Determined by Metaproteomics. Applied and Environmental Microbiology 82: 3165-3173

The contributions to the manuscript are as follows: I performed the metaproteomics (from sample preparation to mass spectral analysis), general protein data analyses, statistical analysis and epifluorescence microscopy. Assistance in mass spectrometry was provided by Ling Zhong and Mark Raftery. Tim Williams performed growth studies. Tim Williams and I annotated the proteins, analysed the data and drafted the manuscript. Ricardo Cavicchioli, Tim Williams and I wrote the manuscript.


2.1 Abstract

Deep Lake is a hypersaline lake in the Vestfold Hills, Antarctica that harbours a low-complexity microbial community dominated by haloarchaea. The isolated species Hht. litchfieldiae, DL31 and Hrr. lacusprofundi represent the most abundant species in the community and genomic analysis revealed that they possess different physiologies and target distinct substrates in the lake. Hht. litchfieldiae, the dominant species, is specialized in the uptake and utilization of carbohydrates, including glycerol produced by the lake’s only eukaryote, Dunaliella. By contrast, DL31 mostly targets proteinaceous substrates like amino acids and oligopeptides. Hrr. lacusprofundi is the least specialized species and targets a range of substrates. In this study, the physiologies of the three main species in the lake were studied using metaproteomics. Analysis of detected transporter proteins showed great differences in targeted substrates. Distinctions between the three species were also found with respect to their targeted nitrogen sources. Our data further indicated that Hht. litchfieldiae is highly motile and depends on substrates produced by Dunaliella. A high number of detected proteins involved in the protection from and repair of UV-induced damage highlighted that Deep Lake haloarchaea employ diverse mechanisms against harmful UV irradiation during the Antarctic summer. No functional differences were found between samples from different depths, indicating that Deep Lake harbours a functionally uniform microbial community throughout the water column. The study demonstrates how metaproteomics can be applied to a hypersaline environment to investigate the main physiological traits of abundant members of the community.


2.2 Introduction

Deep Lake harbours a uniform haloarchaea-dominated microbial community in which the three most abundant haloarchaea, Hht. litchfieldiae, DL31 and Hrr. lacusprofundi, coexist throughout the water column without apparent size partitioning (DeMaere et al., 2013). The main physiological traits and substrates of these three species, which are all aerobic heterotrophs, were inferred from detailed genomic analyses, including over- and underrepresentation of specific Clusters of Orthologous Groups (COG) categories. In addition, growth assays using different substrates as carbon and/or nitrogen sources were performed to support genomic predictions (Williams et al., 2014). Of the three species, Hht. litchfieldiae had the highest number of carbohydrate-targeting ATP-binding cassette (ABC) transporter genes and was generally overrepresented in genes of the COG category carbohydrate uptake and metabolism. It also contained three glycerol kinase orthologues in its genome, compared to one each for DL31 and Hrr. lacusprofundi. Hence it was hypothesised that Hht. litchfieldiae possesses a highly saccharolytic metabolism relying on carbohydrates and glycerol (DeMaere et al., 2013; Williams et al., 2014). DL31 contained the highest number of genes for peptide-targeting ABC transporters and proteolytic enzymes in its genome and was therefore predicted to rely mainly on proteinaceous substrates. Based on its inferred metabolic capacities and preferences, Hrr. lacusprofundi appeared to be the least specialized species. It also contained the smallest number of over/underrepresented COG categories. The genomic differences between the three isolate species were clearly indicative of niche partitioning, with each of them having distinct substrate preferences. Collectively they exploit the nutrient pool available in Deep Lake (Williams et al., 2014).
This chapter describes the application of metaproteomics to biomass collected from different depths and filter sizes of Deep Lake during the Antarctic summer of 2008/2009. The samples for the metaproteomics matched the samples previously used for metagenomic sequencing of Deep Lake (DeMaere et al., 2013). Through metaproteomics, synthesised proteins were detected and inferences made about active metabolic pathways and nutrients utilized at the time of sampling. Metaproteomics proved to be a valuable method to confirm or refute the predictions previously made from genomic inferences (Williams et al., 2014). Metaproteomics also made it possible to investigate whether functional differences existed between communities from different depths or size fractions. The focus of this chapter is on proteins relevant to the diverse physiology of the three most abundant haloarchaeal species in Deep Lake.

2.3 Materials and Methods

2.3.1 Sample collection

Biomass was collected from Deep Lake (68°33′36.8″S, 78°11′48.7″E), Vestfold Hills, Antarctica between November 30 and December 5, 2008. Lake water, taken from 5, 13, 24 and 36 m depths (50 litres each depth except 25 litres from 36 m), was filtered through a 20 µm prefilter sequentially onto 293 mm polyethersulfone membrane filters with 3.0, 0.8 and 0.1 µm pore sizes, as described previously (DeMaere et al., 2013). In addition, a surface sample was taken from close to the lake’s shore. A total of 15 distinct samples were obtained (three distinct filter sizes over five depths). All filters were placed in storage buffer, frozen in liquid nitrogen and cryogenically maintained at -80°C until being processed (DeMaere et al., 2013).

2.3.2 Metaproteomics

In general, the metaproteomic analysis of Deep Lake was based on methods previously developed for Antarctic lake and marine metaproteomic studies (Ng et al., 2010; Williams et al., 2012; Williams et al., 2013) with some modifications.

2.3.2.1 Protein extraction

From each filter one half was used for protein extraction. For this, each filter was taken out of the storage tube with sterile forceps and unwrapped on sterile aluminium foil. Filters were cut in half using sterile scissors, with one half put back into the storage tube and returned to -80°C. The half filter used for protein extraction was cut into smaller pieces and divided into two 50 ml centrifuge tubes, each containing 18 ml extraction buffer (10 mM Tris pH 8, 1 mM EDTA pH 8, 0.1% SDS, 1 mM DTT, 1:1000 Protease Inhibitor VI (Calbiochem)). Loosening of biomass from the filters and initial cell lysis was achieved through three rounds of freeze-thawing, where tubes were first put into liquid nitrogen for 2 min followed by incubation in a 35°C water bath for 25 min. The tubes were briefly vortexed between freeze-thaw cycles. Subsequently the filter pieces were removed from the extraction buffer using sterile forceps. For further cell lysis the extraction buffer containing the biomass was sonicated five times using a Branson Sonifier (settings: 40 s intervals, 0.5 s pulse on/pulse off, 20% amplitude). Soluble proteins were separated from insoluble aggregates and cell debris through centrifugation (5000 g, 15 min, 4°C). The supernatant containing the soluble proteins was retained and stored at -80°C until further processing. Protein solutions were further concentrated using Amicon filter units (Ultra-15 3 kDa cut-off; Merck Millipore). Amicon filter units were pre-washed through centrifugation (5000 g, 15 min, 4°C) with 10 ml autoclaved water. Subsequently protein samples were loaded onto Amicon filter units and proteins concentrated into a volume of around 500 µl through centrifugation (5000 g, ~30 min, 4°C). The exact time of this centrifugation step depended on the amount of protein in the sample. The buffer was then changed through application of 15 ml 10 mM TE (pH 8) and centrifugation (5000 g, 30 min, 4°C). This last step was repeated one more time and the centrifugation (5000 g, 4°C) carried out until the volume of the protein solution was around 500 µl. Protein concentration of samples was measured using the Pierce™ BCA Protein Assay Kit (Thermo Fisher Scientific, Rockford, IL, USA). The concentrated protein samples were stored at -80°C.

2.3.2.2 Preparation of samples for mass spectrometry

From each sample 25 µg of protein was prepared for mass spectrometry (MS). To each sample, 100 mM NH4HCO3 was added to a final concentration of 25 mM. pH indicator strips were used to ensure that the pH was between 8 and 8.5 (adjusted with 100 mM NH4HCO3 if necessary). Samples were reduced with dithiothreitol (DTT; 100 mM stock) at a final concentration of 2.5 mM for 30 min at 37°C in the dark, and subsequently alkylated with 5 mM iodoacetamide (300 mM stock) for 30 min at room temperature in the dark. Proteins were digested into peptides through the addition of 0.3 µg trypsin (Sequencing Grade Modified Trypsin, Promega, Madison, WI, USA) and overnight incubation at 37°C. Peptide samples were stored at -80°C until MS analysis.
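The stock volumes required to reach the stated final concentrations follow from a simple mass balance. The sketch below is illustrative only: the function name and the 30 µl starting volume are assumptions and are not taken from the protocol, and the sample is assumed to contain none of the reagent beforehand.

```python
def stock_volume_to_add(sample_volume_ul, stock_mM, final_mM):
    """Return the stock volume (µl) that brings a sample to final_mM.

    Mass balance: stock_mM * x = final_mM * (sample_volume_ul + x),
    assuming the sample contains none of the reagent beforehand.
    """
    if not 0 < final_mM < stock_mM:
        raise ValueError("final concentration must lie below the stock concentration")
    return final_mM * sample_volume_ul / (stock_mM - final_mM)


# Hypothetical 30 µl protein sample (this volume is an assumption).
sample_ul = 30.0
bicarbonate_ul = stock_volume_to_add(sample_ul, stock_mM=100.0, final_mM=25.0)
print(f"100 mM NH4HCO3 to add: {bicarbonate_ul:.1f} µl")

# DTT is added to the buffered sample, so the volume has grown.
buffered_ul = sample_ul + bicarbonate_ul
dtt_ul = stock_volume_to_add(buffered_ul, stock_mM=100.0, final_mM=2.5)
print(f"100 mM DTT to add: {dtt_ul:.2f} µl")
```

Because each addition increases the total volume, the calculation is repeated on the updated volume for every reagent rather than on the starting volume.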

2.3.2.3 Mass spectrometry

Mass spectrometry followed the protocol described in Williams et al. (2012). The only major difference from Williams et al. (2012) was that a shotgun approach was followed, with whole peptide samples loaded directly onto the nano-liquid chromatography (nano-LC) system without prior SDS-PAGE separation. Peptide solutions were first diluted 1:2 in 1% formic acid, 0.05% heptafluorobutyric acid and then separated by nano-LC using an Ultimate 3000 HPLC and autosampler system (Dionex, Amsterdam, Netherlands). Samples (2.5 µl) were concentrated and de-salted onto a micro C18 precolumn (500 µm x 2 mm, Michrom Bioresources, Auburn, CA, USA) with H2O:CH3CN (98:2, 0.1% TFA) at 15 µl min-1. After a 4-min wash the pre-column was switched (Valco 10 port valve, Dionex) into line with a fritless nano column (75 µm x ~10 cm) containing C18 media (3 µm, 200 Å Magic, Michrom Bioresources).

Peptides were eluted using a linear gradient of H2O:CH3CN (98:2, 0.1% formic acid) to H2O:CH3CN (64:36, 0.1% formic acid) at 250 nl min-1 over 30 min. High voltage (2000 V) was applied to the low volume tee (Upchurch Scientific, Oak Harbor, WA, USA) and the column tip positioned ~0.5 cm from the heated capillary (T = 280°C) of an Orbitrap Velos (Thermo Electron, Bremen, Germany) mass spectrometer. Positive ions were generated by electrospray and the Orbitrap operated in data-dependent acquisition mode. A survey scan m/z 350–1750 was acquired in the Orbitrap (Resolution = 30 000 at m/z 400, with an accumulation target value of 1 000 000 ions) with lockmass enabled. Up to the 10 most abundant ions (> 5000 counts) with charge states > +2 were sequentially isolated and fragmented within the linear ion trap using collision-induced dissociation with an activation q = 0.25 and activation time of 30 ms at a target value of 30 000 ions. m/z ratios selected for MS/MS were dynamically excluded for 30 s. At least two technical replicates were performed for each sample. Peak lists from Orbitrap Velos mass spectra were generated using extract_msn and peptides identified through automated database searches using Mascot Daemon and the Mascot server (version 2.3; Matrix Science, Thermo, London, UK) with ThermoFinnigan LCQ/DECA RAW file as the import filter and the following settings: one accepted missed cleavage for the tryptic digest, peptide mass tolerance of +/- 4 p.p.m., fragment mass tolerance of +/- 0.4 Da and variable modifications of oxidation and carbamidomethylation. Ions were matched against peptides using an in-house constructed database containing a Deep Lake metagenome and the genomes of isolated Deep Lake species. The Deep Lake metagenome comprised 5,837 assembled contigs > 2 kb in length, annotated using SHAP and representing 38,071 predicted protein sequences (DeMaere et al., 2011; DeMaere et al., 2013).
In addition, the database contained all 14,181 predicted protein sequences from the genomes of Hht. litchfieldiae strain tADL, DL31, Hrr. lacusprofundi and DL1 (sourced from the IMG portal (http://img.jgi.doe.gov/); Markowitz et al., 2012). To facilitate calculations of false discovery rates (FDRs), the database contained randomized decoy proteins equal in number to those present in the reference database. The mass spectrometry proteomics data including the protein database have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository (Vizcaino et al., 2014) with the dataset identifier PXD001436 and DOI 10.6019/PXD001436. All technical replicates of a sample were combined and merged into a single sample file during Mascot analysis, resulting in 15 sample files (5 depths x 3 filters). Peptide and protein validation and further analysis were performed using Scaffold (version Scaffold_4.2.1, Proteome Software Inc., Portland, OR, USA) with strict settings of 95% minimum peptide identification probability and 99% minimum protein identification probability. The number of confidently identified proteins was maximized while minimizing false positives by including single peptide matches but maintaining low FDRs, as has been recommended (Gupta and Pevzner, 2009; Claassen, 2012) and successfully applied in recent metaproteomic studies (Morris et al., 2010; Schneider et al., 2012; Herbst et al., 2013). The peptide and protein FDRs were 0.06% and 0.4%, respectively. Proteins sharing the same set of identified peptides were grouped together into protein families. For quantitative analyses the Normalized Total Spectrum Count (Scaffold_4.2.1, Proteome Software Inc., Portland, OR, USA) for each identified protein was combined across all 15 samples and used to quantify abundance.
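The role of the decoy sequences can be illustrated with the widely used target-decoy estimator, in which the number of decoy matches above a score cut-off approximates the number of false target matches. This is a minimal sketch under that assumption. It is not the calculation performed by Scaffold, and the function names and example scores are hypothetical.

```python
def decoy_fdr(target_hits, decoy_hits):
    """Estimate the FDR as decoy hits divided by target hits."""
    if target_hits <= 0:
        raise ValueError("at least one target hit is required")
    return decoy_hits / target_hits


def score_threshold_for_fdr(matches, max_fdr):
    """Return the lowest score cut-off whose estimated FDR is <= max_fdr.

    matches is a list of (score, is_decoy) tuples; None is returned
    when no cut-off satisfies the requested FDR.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    threshold = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoy_fdr(targets, decoys) <= max_fdr:
            threshold = score
    return threshold


# Hypothetical peptide-spectrum matches: (score, matched a decoy protein?)
example = [(90, False), (80, False), (70, True), (60, False), (50, False)]
print(score_threshold_for_fdr(example, max_fdr=0.25))
```

A decoy set equal in size to the reference database, as used here, is what makes the decoy count a direct stand-in for the expected number of false target matches.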

2.3.3 Protein annotation

All identified proteins were manually annotated using BLASTP searches against the genome encoded proteins of Hht. litchfieldiae, DL31, Hrr. lacusprofundi and DL1 on IMG, and against all entries in the ExPASy database (Artimo et al., 2012), recording the best overall match including percentage identity, and organism. Further functional information was gained through searches of conserved domains and sequence motifs using the InterProScan web-tool (Jones et al., 2014). Annotated proteins were classified into taxonomic and functional categories. Taxonomic categories were: Hht. litchfieldiae, DL31, Hrr. lacusprofundi, DL1, other Halobacteriaceae (best match to a member of the Halobacteriaceae other than the four Deep Lake isolates), other Archaea (best match to a member of the Archaea other than Halobacteriaceae), Bacteria, Viruses and Dunaliella. Functional categories were: Transport; Carbohydrate Metabolism; Glycerol Metabolism; Amino Acid Metabolism; Central Carbon Metabolism; Energy Conservation; Metabolism (Other); Cell Division; DNA Replication, Repair, and CRISPR; Oxidative Stress; Transduction; Transcriptional Regulators; Transcription; Ribosomes; Protein Chaperones; Translation (Other); Proteolysis; Cell Surface; Hypothetical; Viruses. Cell Surface proteins were subdivided into subcategories based on proteins that comprise archaella (Archaella), adhesion pili (Adhesion), and the archaeal surface layer including hypothetical proteins that possess Sec, TAT or PGF-CTERM sequences (Cell Surface Proteins – Other). The Hypothetical category included archaeal and bacterial proteins for which no function in the cell envelope or in cellular metabolism could be inferred. Hypothetical proteins were further subdivided into subcategories based on the presence of transmembrane helices (Hypothetical - Membrane); nucleic-acid-binding domains (Hypothetical - Nucleic Acid Binding); or domains that provided no indication as to function, or no identifiable domains at all (Hypothetical - Other). Other categories were subdivided into subcategories to provide increased resolution of cellular processes such as transport and metabolism. Transport: ABC Transporter - Amino Acids; ABC Transporter - Oligopeptides/Dipeptides; ABC Transporter - Carbohydrates; ABC Transporter - Phosphate/Phosphonate; ABC Transporter - Iron; ABC Transporter (Other); TRAP/TTT Transport; Cation Transport; Secretion; Other Transporter. Carbohydrate Metabolism: Glycosylation/Capsular Polysaccharide; Carbohydrate Metabolism (Other). Metabolism (Other): Nitrogen Metabolism; Sulfur Metabolism; Isoprenoid Metabolism; Vitamin/Cofactor Biosynthesis; Other (comprising those proteins inferred to be involved in metabolism, but for which no precise function or substrate specificity could be inferred based on identified domains). DNA Replication, Repair and CRISPR: DNA Replication and Repair; CRISPR.
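The taxonomic assignment rules above amount to a short decision cascade. The annotation in this study was performed manually, so the following is only a sketch of the stated rules. The function name, the isolate labels and the lineage input (a list of taxon names for the best-match organism) are hypothetical.

```python
# Labels for the four Deep Lake isolates (hypothetical spellings).
ISOLATES = ("Hht. litchfieldiae", "DL31", "Hrr. lacusprofundi", "DL1")


def taxonomic_category(best_match, lineage):
    """Assign a taxonomic category from the best BLASTP match.

    best_match is the organism label of the best hit; lineage is the
    list of taxon names that organism belongs to (assumed input).
    """
    if best_match in ISOLATES:
        return best_match
    if "Halobacteriaceae" in lineage:
        return "other Halobacteriaceae"
    if "Archaea" in lineage:
        return "other Archaea"
    if "Bacteria" in lineage:
        return "Bacteria"
    if "Viruses" in lineage:
        return "Viruses"
    if "Dunaliella" in lineage:
        return "Dunaliella"
    return "unassigned"


print(taxonomic_category("DL31", ["Archaea", "Halobacteriaceae"]))
print(taxonomic_category("Haloquadratum walsbyi",
                         ["Archaea", "Halobacteriaceae", "Haloquadratum"]))
```

The order of the checks matters: the isolates are tested before the family, and the family before the domain, so that each protein falls into the most specific category available.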

2.3.4 Statistical analysis

The PRIMER 6 software (Clark and Gorley, 2006) was used to statistically test for functional relationships between depth and filter-size groups within the set of 15 samples. Input data were provided as normalized total spectrum count for functional categories from each of the 15 samples. The data were standardized and square-root transformed prior to calculations using a Bray-Curtis resemblance matrix. Non-metric multi-dimensional scaling (NMDS) plots were created using standard settings. Two-way crossed analysis of similarity (ANOSIM) without replicates was performed on the factors filter size and sample depth to test for statistically significant differences within these groups.
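The pre-treatment and resemblance calculation can be sketched in a few lines. The example below assumes that standardization means expressing each category as a percentage of the sample total, and it uses made-up spectrum counts. It is not a re-implementation of PRIMER 6, and all names are hypothetical.

```python
import math


def standardise(counts):
    """Express each value as a percentage of the sample total."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample total must be positive")
    return [100.0 * value / total for value in counts]


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity (0-100) between two equal-length profiles."""
    difference = sum(abs(x - y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return 100.0 * (1.0 - difference / total)


def resemblance_matrix(samples):
    """Standardise, square-root transform and compare all sample pairs."""
    transformed = {
        name: [math.sqrt(v) for v in standardise(counts)]
        for name, counts in samples.items()
    }
    names = sorted(transformed)
    return {
        (p, q): bray_curtis_similarity(transformed[p], transformed[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Made-up spectrum counts for three functional categories per sample.
samples = {
    "5m_0.1um": [40.0, 10.0, 5.0],
    "5m_0.8um": [15.0, 30.0, 25.0],
    "5m_3.0um": [14.0, 32.0, 24.0],
}
for pair, similarity in resemblance_matrix(samples).items():
    print(pair, round(similarity, 1))
```

The square-root transformation down-weights the most abundant categories so that less abundant ones also contribute to the similarity between samples.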


2.3.5 Epifluorescence microscopy of Deep Lake water samples

Microscopic analysis was based on the method described in Yau et al. (2013) with modifications. Deep Lake surface water samples were taken during biomass collection using sequential size filtration, representing the size fractions 3.0–20 µm, 0.8–3.0 µm and 0.1–0.8 µm (see 2.3.1). Samples collected in the Antarctic summer of 2008/2009 were preserved in 2% (v/v) formaldehyde and samples collected in 2013/2014 were preserved in 0.5% (v/v) glutaraldehyde; all samples were stored at -80°C until microscopy was performed. Deep Lake water (4 ml) was filtered through a 25 mm diameter, 0.02 µm pore size Whatman® Anodisc filter membrane (GE Healthcare Life Science, UK) with a 0.45 µm pore size backing filter (Type HA, Merck Millipore, MA, USA). Filters with captured biomass were stained with 10 µl SYBR® Gold nucleic acid stain (Invitrogen, Life Technologies, NY, USA) for 18 min in the dark and subsequently mounted on a glass slide with a drop of ProLong® Gold antifade reagent (Invitrogen, Life Technologies, NY, USA). Microscopic analysis of slides was performed using an Olympus BX51WI epifluorescence microscope together with an Olympus DP71 camera and the cellSens Standard imaging software (all Olympus, Hamburg, Germany). Slides were visualized under excitation with blue light (460–495 nm, emission 510–550 nm).

2.3.6 Growth studies

To assess growth characteristics based on inferences made from Deep Lake metaproteomic data, Hht. litchfieldiae was grown in batch cultures based on DBCM2 media (19, 20) using specific carbon and nitrogen sources. The substrates tested were dihydroxyacetone (DHA) (10 mM) as a carbon source; starch (10 g L-1) as a carbon source; acetamide (10 mM) as either a carbon source (with ammonia) or a nitrogen source (with pyruvate); 2-aminoethylphosphonic acid (AEP; 5 mM) as a phosphorus source replacing phosphate in DBCM2 medium (tested both with and without peptone and yeast extract); and Hht. litchfieldiae genomic DNA (200 µg ml-1 final concentration) as a phosphorus source.

2.4 Results

2.4.1 Overview of the Deep Lake metaproteome


Shotgun metaproteomics of Deep Lake was performed on 15 samples representing three distinct size fractions (0.1–0.8, 0.8–3.0 and 3.0–20.0 µm) from five different depths (surface (0 m), 5 m, 13 m, 24 m and 36 m). The highest number of detected proteins was 748 for the 5 m, 3 µm sample, as opposed to a minimum of 74 detected proteins for the 36 m, 0.1 µm sample. Across the 15 samples there was a trend of more detected proteins in samples from shallower depths compared to greater depths; and for each depth more proteins were detected in larger compared to smaller size fractions (Figure 2.1). Epifluorescence microscopy of surface water samples showed association of cells with particulate matter (Figure 2.2), and particulate matter was more often observed in the larger size fractions compared to the 0.1 µm fraction. Hence the smaller number of detected proteins for the smaller size fractions, especially the 0.1 µm fractions, likely reflects a lower amount of biomass in that size range.

Figure 2.1 Number of proteins detected for single filter samples. Numbers on the x-axis reflect the depth in meters the sample was taken from.


Figure 2.2 Microscopy of Deep Lake water. The image depicts cells attached to particulate matter from surface water representing the 3.0–20 µm fraction. Magnification, 100x; scale bar, 10 µm.

All detected proteins were manually annotated and assigned into taxonomic and functional categories. The normalized total spectrum count for each protein was used as a measure of abundance. Overall, the taxonomic composition of the 15 samples was found to be similar, with the exception of a relatively higher abundance of virus proteins and a lower abundance of Hht. litchfieldiae proteins on the 0.1 µm filter samples compared to the two larger size fractions (Figure 2.3). The 36 m, 0.1 µm sample had a skewed taxonomic composition compared to the other samples; a higher presence of low-abundance taxa at the bottom of the lake was also reported by DeMaere et al. (2013). At the very bottom the lake has a shallow depression, which might prevent this water from regular mixing and allow cells to settle (DeMaere et al., 2013). The metaproteome data indicated that Deep Lake harbours a uniform microbial composition throughout the water column, as had previously been reported (DeMaere et al., 2013), with viruses being a major component of the small size fraction.


Figure 2.3 Taxonomic composition of metaproteomic samples. The chart shows the relative abundance of taxonomic categories in the 15 metaproteomic samples. Numbers on the x-axis (0, 5, 13, 24 and 36) refer to the depth in meters the samples were taken from.

The functional composition of the 15 filters was compared to examine possible functional partitioning within the dataset. The 0.8 and 3.0 µm filters shared a similar functional composition that differed from that of the 0.1 µm filters (Figure 2.4B). Common to the 0.1 µm filters and distinct from the larger filter sizes was a relatively high abundance of cell surface and virus proteins. This distinction was also reflected in an NMDS plot, where the larger size fractions clustered closely together, separate from the 0.1 µm filters, which themselves did not form a tight cluster (Figure 2.4A). Statistical analysis using ANOSIM confirmed a significant difference (p < 0.01) between filter sizes but not between depths. Archaellin proteins (part of the category Cell Surface) form the structural subunits of the archaeal motility structures, archaella, and were particularly abundant on the 0.1 µm filters, indicating that the smallest size fraction comprised more free-living and motile cells as opposed to more particle-associated cells in the larger fractions; it is likely that free-living cells differed from particle-associated ones with respect to their physiology. A comparison between proteins from the 0.1 µm filters and the combined 0.8 and 3.0 µm filters would have been desirable to learn about potential physiological differences between these fractions. However, since only a relatively small number of proteins was detected in the 0.1 µm samples (Figure 2.1), with only 37 of the proteins unique to this size fraction (and not detected in any 0.8 or 3.0 µm filter), it was not possible to address this question in the current study. Higher metaproteomic coverage of the 0.1 µm fraction would be required to learn about the physiological distinctions of cells in this size fraction.

Figure 2.4 Functional composition of metaproteomic samples. (A) NMDS plot of filter samples according to the abundance of functional categories. The plot highlights the functional distinction of 0.1 µm filters compared to the larger filter sizes. Numbers in the plot refer to the depth in meters the samples were taken from. Two-dimensional stress of the plot is 0.02. (B) Relative abundance of functional categories in the single filter samples. Numbers on the x-axis refer to the depth in meters the samples were taken from.

Due to the overall similarity between the 15 filter samples, all samples were combined into a single metaproteome to facilitate further analyses. The Deep Lake metaproteome comprised a total of 1109 distinct proteins. Hht. litchfieldiae represents the most abundant species in the lake, recruiting ~59% of all detected proteins (655) (Figure 2.5). DL31 is second in abundance with ~14% of detected proteins (154), followed by Hrr. lacusprofundi with ~8% (93). The fourth isolate species, DL1, is of very low abundance and recruited only 7 detected proteins (0.6%). In addition to proteins from the isolate species, proteins were detected matching other haloarchaea (93 proteins, ~8%), other Archaea (4 proteins, 0.4%), Viruses (25 proteins, ~2%), Bacteria (37 proteins, ~3%) and the only reported eukaryote in Deep Lake, Dunaliella (6 proteins, 0.5%). It is likely that Dunaliella was underrepresented in the metaproteome data, as the metagenome database only included few matches to available Dunaliella sequences (DeMaere et al., 2013). Thirty-seven proteins could not be unambiguously assigned to any of the taxonomic categories. Some of these proteins had their best match to one of the isolate species, but the encoding metagenomic contig encoded additional proteins more similar to other haloarchaeal species. Other proteins were 100% conserved between two or more of the isolate species and could therefore not be assigned unambiguously (see 3.4.2). Unlike the protein count, the normalized total spectrum count is a measure of abundance for each identified protein. Pooling the normalized total spectrum count of all proteins for each taxonomic category resulted in a similar taxonomic composition to that of the protein count (Figure 2.5). However, viruses were around twice as abundant in spectra (4.5%) compared to protein numbers (2.3%), reflecting a particularly high abundance of some viral proteins (Figure 2.5; also see 3.4.3). Overall, the taxonomic composition of the Deep Lake metaproteome was reflective of a low-complexity microbial community dominated by few haloarchaeal species, as has been shown previously with SSU rRNA sequencing (DeMaere et al., 2013).
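The relative abundances by protein count are each category's share of the 1109 detected proteins. The short sketch below reproduces that arithmetic for the four isolate species using the counts given in the text; the variable and function names are illustrative.

```python
TOTAL_PROTEINS = 1109  # distinct proteins in the combined metaproteome

# Protein counts for the four isolate species, as stated in the text.
protein_counts = {
    "Hht. litchfieldiae": 655,
    "DL31": 154,
    "Hrr. lacusprofundi": 93,
    "DL1": 7,
}


def percent_share(count, total=TOTAL_PROTEINS):
    """Return count as a percentage of the total number of proteins."""
    return 100.0 * count / total


for taxon, count in protein_counts.items():
    print(f"{taxon}: {percent_share(count):.1f}%")
```

The same calculation applied to pooled normalized total spectrum counts, rather than protein counts, gives the spectrum-based shares shown alongside them in Figure 2.5.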

Figure 2.5 Taxonomic composition of the Deep Lake metaproteome. Relative abundance of categories is shown based on number of identified proteins (black bars) and normalized total spectrum count (grey bars). Total number of detected proteins for each category is indicated by numbers above black bars. An additional 37 proteins could not be unambiguously assigned to any of the taxonomic categories and are not shown in the chart.

The metaproteome was also assessed regarding its functional composition, and a high number of proteins were assigned into the categories Ribosomes (130 proteins, ~12%), Cell Surface (112 proteins, ~10%), Metabolism (Other) (100 proteins, 9%) and Transport (80 proteins, ~7%) (Figure 2.6). Of particularly high abundance were proteins from the categories Cell Surface, accounting for almost a quarter (~24%) of all spectra, and Protein Chaperones (~14%). The higher abundance of these categories in spectra compared to protein count indicated that some of the proteins were present in high copy number. For 188 proteins, no putative function could be assigned; hence they were grouped into the category Hypothetical.

Figure 2.6 Functional composition of the Deep Lake metaproteome. Relative abundance of functional categories is shown based on number of identified proteins (black bars) and normalized total spectrum count (grey bars). Total number of detected proteins for each category is indicated by numbers above black bars. Categories are ranked by the numbers of detected proteins.

The three most abundant species in Deep Lake, Hht. litchfieldiae, DL31 and Hrr. lacusprofundi, together recruited ~81% (902) of all detected proteins in the metaproteome, allowing for an in-depth comparative analysis to test for physiological distinctions and/or similarities between them. Hht. litchfieldiae accounted for 655 detected proteins that matched to 513 distinct genes, covering ~15% of the 3465 protein encoding genes in the Hht. litchfieldiae genome. A total of 154 and 93 proteins were detected for DL31 and Hrr. lacusprofundi, respectively. The latter two species shared a similar relative abundance profile of functional categories (Figure 2.7), with proteins assigned to Cell Surface, Protein Chaperones and Transport functions accounting for > 70% of the respective spectra. Proteins from these three categories accounted for only ~40% of all Hht. litchfieldiae spectra. In contrast, all other functional categories had higher relative abundance for Hht. litchfieldiae compared to DL31 and Hrr. lacusprofundi. To some extent this difference could be explained by the around two- to four-fold higher abundance of Hht. litchfieldiae in Deep Lake (DeMaere et al., 2013), which facilitates metaproteomic detection of proteins synthesised at low levels. For less abundant species, detection of such proteins is less likely. Below, the findings from the Deep Lake metaproteome describing the physiology of the three main species are presented.

Figure 2.7 Relative abundance of functional categories for the three main species. Hht. litchfieldiae (black); DL31 (dark grey); Hrr. lacusprofundi (light grey). Categories are ranked by the relative abundance of Hht. litchfieldiae proteins.

2.4.2 Transport proteins reveal distinctions in nutrient preferences

Metaproteomic detection of proteins involved in transport functions can reveal nutrient requirements of organisms in their environment (Sowell et al., 2009). The Deep Lake metaproteome included 64 transport proteins assigned to Hht. litchfieldiae, DL31 and Hrr. lacusprofundi (Table 2.1), which were of particularly high abundance for the latter two species (Figure 2.7). Most of the transport proteins, including the most abundant ones, represented the substrate-binding components of ATP-binding cassette (ABC) transporters. Less frequently detected were the membrane-associated components of the same transporters. The difference in abundance for the distinct ABC transporter subunits is in part due to the protein extraction procedure, which focused on the soluble protein fraction as opposed to the insoluble one (including membranes and membrane-bound proteins). It also reflects a cell’s need for certain nutrients and the scarcity of these nutrients in the environment; overexpression of the substrate-binding component increases the amount of captured substrate (Sowell et al., 2009; Williams and Cavicchioli, 2014). The relative abundances of transport proteins targeting different substrates varied greatly between the three main species (Figure 2.8). For Hht. litchfieldiae, phosphate-targeting ABC transporter lipoproteins recruited > 50% of all spectra from transport proteins, compared to only ~1% and ~6% for DL31 and Hrr. lacusprofundi, respectively, likely reflecting a particularly high demand for phosphate for this species. In addition, Hht. litchfieldiae is the only one of the three species with the genomic capacity to metabolically utilize phosphonates as a phosphate source (Williams et al., 2014), and an ABC transporter lipoprotein targeting phosphonate was detected. Growth experiments confirmed the ability of Hht. litchfieldiae to grow on 2-aminoethylphosphonate (a ubiquitous, naturally occurring phosphonate) (Figure 2.9). Multiple detected PhoU phosphate uptake regulator proteins for Hht. litchfieldiae indicated that expression of components from phosphate uptake systems was tightly regulated. The metaproteome further contained one Hht. litchfieldiae protein (halTADL_0044) with a C-terminal transmembrane domain and an N-terminal DNA-binding domain predicted to reside in the extracytoplasmic space, indicating the binding of extracytoplasmic DNA (eDNA). However, no growth of Hht. litchfieldiae was observed using DNA as the sole source of phosphate, carbon or nitrogen. Hht. litchfieldiae was also the only one of the three species with detected substrate-binding ABC transporter lipoproteins targeting carbohydrates (Table 2.1). In contrast to Hht. litchfieldiae, the main substrates targeted by detected DL31 ABC transporter lipoproteins were oligopeptides, with eight distinct transporter proteins accounting for > 50% of all transporter spectra. Oligopeptides were also targeted by Hrr. lacusprofundi. The genome of Hht. litchfieldiae lacks genes for this type of transporter (Williams et al., 2014). Unlike for the other two species, tripartite ATP-independent periplasmic transporters (TRAP)/tripartite tricarboxylate transporters (TTT) represented the most abundant class of transporters for Hrr. lacusprofundi (Figure 2.8). These secondary transporters have been implicated in the uptake of carboxylic acids (Forward et al., 1997). However, the targeted substrates for these transporters in Deep Lake are unknown. ABC transporter lipoproteins targeting amino acids were detected for all three species (Table 2.1).


Figure 2.8 Relative abundance of transport proteins in the metaproteome. Hht. litchfieldiae (black); DL31 (dark grey); Hrr. lacusprofundi (light grey). In addition to the relatively abundant ABC and TRAP/TTT transporter proteins shown in the figure, other transporter proteins were identified representing 18, 17 and 3% of transport spectra for Hht. litchfieldiae, DL31 and Hrr. lacusprofundi, respectively (Table 2.1).

Figure 2.9 Growth response of Hht. litchfieldiae to phosphonate. AEP was supplied as phosphorus source. Growth was assessed using optical density (OD) measurements at a wavelength of 600 nm.


Table 2.1 Transporter proteins in the Deep Lake metaproteome. The table shows all detected proteins with transport functions for Hht. litchfieldiae, DL31 and Hrr. lacusprofundi. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. ‘nd’ denotes ‘not determined’ due to peptides matching to a protein family. Asterisk (*) denotes a truncated protein. The protein with ‘No match’ in the locus tag column has no sequence similarity to Hht. litchfieldiae but it is encoded on a metagenomic contig which otherwise matches Hht. litchfieldiae.

Hht. litchfieldiae

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Phosphate uptake | | | | |
| Phosphate ABC transporter solute-binding protein (PstS) | 13 | halTADL_2155 | 99 | 235 |
| Phosphate ABC transporter solute-binding protein (PstS) | 61 | halTADL_2155 | 88 | 92 |
| Phosphate ABC transporter solute-binding protein (PstS) | 75 | halTADL_2155 | 93 | 82 |
| Phosphate ABC transporter ATPase (PstB) | 517 | halTADL_2152 | 100 | 7.6 |
| Phosphate ABC transporter solute-binding protein (PstS) | 85 | halTADL_1182 | 51 | 71 |
| Phosphonate uptake | | | | |
| Phosphonate ABC transporter solute-binding protein (PhnD) | 347 | halTADL_1334 | 100 | 16 |
| Carbohydrate uptake | | | | |
| Carbohydrate ABC transporter solute-binding lipoprotein | 265 | halTADL_2357 | 100 | 25 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 269 | halTADL_2761 | 100 | 24 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 702 | halTADL_1911 | 100 | 3.2 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 247 | halTADL_2761 | 84 | 26 |
| Carbohydrate ABC transporter ATPase | 1099 | halTADL_2764 | 88 | 0.4 |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein | 373 | halTADL_2916 | 100 | 14 |
| Branched-chain amino acid ABC transporter solute-binding protein | 202 | halTADL_2916 | 88 | 33 |
| Polar amino acid ABC transporter solute-binding protein | 486 | halTADL_0024 | 100 | 8.5 |
| Ammonium uptake | | | | |
| Ammonium permease (ammonium transporter) (Amt) | 220 | halTADL_1826 | 100 | 30 |
| Urea uptake | | | | |
| Urea ABC transporter solute-binding protein | 617 | halTADL_0628 | 100 | 4.8 |
| Iron uptake | | | | |
| Iron ABC transporter solute-binding protein | 159 | halTADL_1788 | 100 | 40.9 |
| Iron ABC transporter solute-binding protein | 856 | halTADL_1788 | 83 | 1.6 |
| Cation transport | | | | |
| K+ uptake system, TrkA subunit | 806 | halTADL_3061 | 100 | 2.0 |
| K+ uptake system, TrkA subunit | 1048 | halTADL_3061 | 89 | 0.6 |
| K+ uptake system, TrkA subunit | 287 | halTADL_3258 | 100 | 21.2 |
| K+ uptake system, TrkA subunit | 854 | halTADL_2713 | 100 | 1.6 |
| Mechanosensitive ion channel (MscS) | 436 | halTADL_2994 | 100 | 10.5 |
| Secretion | | | | |
| Signal peptide peptidase SppA | 282 | halTADL_2673 | 100 | 22.0 |
| PilT protein: Type II/IV secretion system domain + KH domain protein | 558 | halTADL_0825 | 100 | 6.3 |
| SecD/SecF/SecDF export membrane protein | 213 | halTADL_0787 | 100 | 31.4 |
| Signal recognition particle Srp54, secretory pathway | 500 | halTADL_2202 | 100 | 7.9 |
| SecD/SecF/SecDF export membrane protein | 722 | halTADL_0788 | 100 | 3.0 |
| TRAP/TTT transport | | | | |
| Tripartite Tricarboxylate transporter (TTT), solute receptor | 138 | halTADL_0690 | 100 | 45.2 |
| TRAP transporter solute receptor, TAXI family | 299 | halTADL_0243 | 100 | 19.8 |
| Other transport | | | | |
| Thiamine ABC transporter, substrate-binding protein (ThiB) | 961 | halTADL_2794 | 100 | 0.9 |
| Nucleoside ABC transporter substrate-binding protein | 881 | halTADL_2623 | nd | 1.5 |
| Nitrate/sulfonate/bicarbonate ABC transporter solute-binding protein | 697 | No match | - | 3.3 |
| ABC-type antimicrobial peptide transport system, permease component | 492 | halTADL_1613 | 100 | 8.1 |
| ABC-type antimicrobial peptide transport system, permease component | 693 | halTADL_1613 | 88 | 3.4 |
| Heavy metal-exporting ATPase (copper?) | 782 | halTADL_1767 | 100 | 2.3 |
| Formate/nitrite transporter | 868 | halTADL_2501 | 100 | 1.5 |
| Phosphate/sulfate permease (PiT family) | 837 | halTADL_3083 | 100 | 1.7 |
| RND superfamily family / MMPL (mycobacterial membrane protein large) family protein | 957 | halTADL_0082 | 100 | 0.9 |

DL31

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Phosphate uptake | | | | |
| Phosphate ABC transporter solute-binding protein (PstS) | 533 | Halar_1873 | 100 | 7.0 |
| Oligopeptide uptake | | | | |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 9 | Halar_2016 | 100 | 277 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 76 | Halar_1439 | 100 | 82 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 189 | Halar_0722 | 100 | 35 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 255 | Halar_1146 | 100 | 26 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 582 | Halar_1285 | 100 | 5.5 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 654 | Halar_3436 | 100 | 4.1 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 818 | Halar_2651 | 100 | 1.8 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 879 | Halar_2024 | 100 | 1.5 |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein | 158 | Halar_1569 | 100 | 41 |
| Branched-chain amino acid ABC transporter solute-binding protein | 187 | Halar_2890 | 100 | 36 |
| Branched-chain amino acid ABC transporter solute-binding protein | 192 | Halar_3433 | 100 | 35 |
| Iron uptake | | | | |
| Iron ABC transporter solute-binding protein | 33 | Halar_0820 | 100 | 124 |
| Iron ABC transporter solute-binding protein | 599 | Halar_1080 | 100 | 5.1 |
| Other transport | | | | |
| Membrane transporter: MMPL (mycobacterial membrane protein large) family protein / RND superfamily | 29 | Halar_1791 | 99 | 134 |
| Uncharacterized transporter (export?), ATPase component | 919 | Halar_1798 | 100 | 1.2 |

Hrr. lacusprofundi

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Phosphate uptake | | | | |
| Phosphate ABC transporter solute-binding protein (PstS) | 326 | Hlac_3551 | 100 | 17 |
| Oligopeptide uptake | | | | |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 1058 | Hlac_0069 | 100 | 87 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 630 | Hlac_0244 | 100 | 4.5 |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein* | 160 | Hlac_2093 | 100 | 41 |
| Polar amino acid ABC transporter solute-binding protein | 891 | Hlac_1804 | 100 | 1.4 |
| Iron uptake | | | | |
| Iron ABC transporter solute-binding protein | 663 | Hlac_0162 | 100 | 3.9 |
| TRAP/TTT transport | | | | |
| TRAP transporter solute receptor, TAXI family | 56 | Hlac_2586 | 100 | 97 |
| TRAP transporter solute receptor, TAXI family | 144 | Hlac_2329 | 100 | 44 |
| Other transport | | | | |
| Nucleoside ABC transporter solute-binding protein | 496 | Hlac_1417 | 100 | 8.0 |
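The statement in the text that oligopeptide transporters account for more than 50% of the DL31 transporter spectra can be checked against the DL31 rows of Table 2.1. The Python sketch below is illustrative only. It assumes that the share is the category sum divided by the sum over all DL31 transport proteins listed in the table, and it uses the rounded spectrum counts as printed, so the result is approximate. The names `dl31_spectra` and `category_shares` are hypothetical and are not part of the thesis workflow.

```python
# Normalized spectrum counts for the DL31 rows of Table 2.1, grouped by category.
# Values are the rounded counts as printed in the table.
dl31_spectra = {
    "Phosphate uptake": [7.0],
    "Oligopeptide uptake": [277, 82, 35, 26, 5.5, 4.1, 1.8, 1.5],
    "Amino acid uptake": [41, 36, 35],
    "Iron uptake": [124, 5.1],
    "Other transport": [134, 1.2],
}


def category_shares(spectra_by_category):
    """Return each category's percentage of the summed spectrum counts.

    Assumption: the share is category total / total over all listed proteins.
    """
    totals = {name: sum(counts) for name, counts in spectra_by_category.items()}
    grand_total = sum(totals.values())
    return {name: 100.0 * total / grand_total for name, total in totals.items()}


shares = category_shares(dl31_spectra)
# Print categories from the largest to the smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

Under these assumptions the oligopeptide share comes to about 53%, consistent with the "> 50%" given in the text, and "Other transport" comes to about 17%, the value given for DL31 in the caption of Figure 2.8.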

2.4.3 Carbohydrate metabolism of Hht. litchfieldiae

Fifty proteins for the uptake and metabolism of carbohydrates were detected for Hht. litchfieldiae (Table 2.2), in strong contrast to only five such proteins each for DL31 and Hrr. lacusprofundi. These data agree with previous genomic inferences that, compared to the other two species, Hht. litchfieldiae has a highly saccharolytic metabolism (Williams et al., 2014). The Hht. litchfieldiae proteins included enzymes of the Embden-Meyerhof pathway and of a modified (semi-phosphorylative) Entner-Doudoroff pathway for the catabolic breakdown of glucose (Brasen et al., 2014; Williams et al., 2014).

The metaproteome contained a total of 10 proteins involved in the metabolic utilization of glycerol. Nine proteins were assigned to Hht. litchfieldiae (Table 2.2) and only one low-abundance glycerol kinase was assigned to Hrr. lacusprofundi. Hht. litchfieldiae contains two catabolic pathways for the conversion of glycerol into dihydroxyacetone phosphate (DHAP) (Williams et al., 2014). The first pathway involves glycerol kinase for the initial phosphorylation of glycerol into glycerol-3-phosphate, followed by oxidation into DHAP by glycerol-3-phosphate dehydrogenase. DHAP can either be fed into glycolysis, or used for gluconeogenesis or the synthesis of glycerol-1-phosphate, which is the backbone of haloarchaeal phospholipids. Two distinct glycerol kinases and one glycerol-3-phosphate dehydrogenase were detected in the metaproteome. The second pathway involves glycerol dehydrogenase for the oxidation of glycerol into dihydroxyacetone (DHA) and DHA kinase for the phosphorylation of DHA into DHAP. DHA kinase was detected in the metaproteome but glycerol dehydrogenase was not, which suggests that DHA taken up from the environment, rather than DHA derived from the oxidation of glycerol, was used as a substrate. Growth experiments confirmed that Hht. litchfieldiae grows on DHA supplied as the sole carbon source (Figure 2.10). Overall, more proteins, recruiting a higher number of spectra, were detected for the first pathway than for the second (Table 2.2). The Deep Lake metaproteome further contained a number of Lrp-like transcription regulators, including one from Hht. litchfieldiae (halTADL_1491) with 68% sequence similarity to an Lrp from Halobacterium salinarum R1 that was shown to regulate glycerol dehydrogenase activity (Schwaiger et al., 2010).
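The comparison of the two glycerol pathways can be retraced from the "Glycerol catabolism" rows of Table 2.2. The Python sketch below is a hedged illustration, not the analysis used in the thesis. It assumes that each table row counts as one detected protein and that pathway totals are plain sums of the printed (rounded) spectrum counts. The grouping labels and the names `glycerol_pathways` and `summarize_pathways` are hypothetical.

```python
# Spectrum counts from the "Glycerol catabolism" rows of Table 2.2,
# grouped by the two glycerol-to-DHAP routes described in the text.
# Glycerol dehydrogenase was not detected, so the second route is
# represented by the DHA kinase subunits only.
glycerol_pathways = {
    "Pathway 1 (GlpK + GlpA)": {
        "Glycerol kinase (GlpK)": [206, 13, 135, 9.5],
        "Glycerol-3-phosphate dehydrogenase (GlpA)": [34, 17],
    },
    "Pathway 2 (DHA kinase)": {
        "DHA kinase, L subunit (DhaL)": [25, 0.9],
        "DHA kinase, K subunit (DhaK)": [45],
    },
}


def summarize_pathways(pathways):
    """Return {pathway: (number of detected proteins, summed spectrum count)}."""
    summary = {}
    for pathway, enzymes in pathways.items():
        # Flatten the per-enzyme lists; each entry is one table row.
        counts = [c for enzyme_counts in enzymes.values() for c in enzyme_counts]
        summary[pathway] = (len(counts), sum(counts))
    return summary


summary = summarize_pathways(glycerol_pathways)
for pathway, (n_proteins, total) in summary.items():
    print(f"{pathway}: {n_proteins} proteins, {total:.1f} spectra")
```

With the values as printed, the first pathway has six detected proteins and about 415 spectra, against three proteins and about 71 spectra for the DHA kinase route. Together these are the nine glycerol-utilization proteins assigned to Hht. litchfieldiae.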


Figure 2.10 Growth response of Hht. litchfieldiae to DHA supplied as carbon source.

The detection of three Hht. litchfieldiae enzymes potentially involved in starch degradation (glucoamylase, α-amylase and 4-α-glucanotransferase) indicated that starch-derived sugars were used as carbon and energy sources. In growth experiments, Hht. litchfieldiae showed weak growth when starch was used as the sole carbon source, strong growth on pyruvate and the best growth on pyruvate plus starch (Figure 2.11). A similar growth pattern was observed with sucrose (Williams et al., 2014), suggesting that both sucrose and starch are most readily utilized in the presence of pyruvate.

Figure 2.11 Growth response of Hht. litchfieldiae to starch and pyruvate supplied as carbon sources.


Table 2.2 Carbohydrate uptake and metabolism proteins detected for Hht. litchfieldiae. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to the Hht. litchfieldiae locus tag (column denoted “Locus tag”). Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. Asterisk (*) denotes a match to a protein that is represented by an incomplete (truncated) gene on a contig.

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Carbohydrate uptake | | | | |
| Carbohydrate ABC transporter solute-binding lipoprotein | 265 | halTADL_2357 | 100 | 25 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 269 | halTADL_2761 | 100 | 24 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 702 | halTADL_1911 | 100 | 3.2 |
| Carbohydrate ABC transporter solute-binding lipoprotein | 247 | halTADL_2761 | 84 | 26 |
| Carbohydrate ABC transporter ATPase | 1099 | halTADL_2764 | 88 | 0.4 |
| Polysaccharide/starch degradation | | | | |
| α-amylase (glycosyl hydrolase, family 13) | 65 | halTADL_0142 | 100 | 90 |
| α-amylase (glycosyl hydrolase, family 13) | 193 | halTADL_0142 | 84 | 35 |
| α-amylase (glycosyl hydrolase, family 13) | 543 | halTADL_0142 | 94 | 6.8 |
| Glucan 1,4-α-glucosidase (glucoamylase) (glycosyl hydrolase, family 15) | 557 | halTADL_0141 | 100 | 6.4 |
| 4-α-glucanotransferase (amylomaltase) (glycosyl hydrolase, family 77) | 710 | halTADL_2529 | 100 | 3.1 |
| Glycerol catabolism | | | | |
| Glycerol kinase (GlpK) | 17 | halTADL_2249 | 100 | 206 |
| Glycerol kinase (GlpK) | 28 | halTADL_2249 | 96 | 13 |
| Glycerol kinase (GlpK) | 459 | halTADL_0681 | 100 | 135 |
| Glycerol kinase (GlpK)* | 381 | halTADL_0681 | 100 | 9.5 |
| Glycerol-3-phosphate dehydrogenase (GlpA) | 195 | halTADL_2244 | 100 | 34 |
| Glycerol-3-phosphate dehydrogenase (GlpA) | 327 | halTADL_2244 | 87 | 17 |
| Dihydroxyacetone (DHA) kinase, L subunit (DhaL) | 266 | halTADL_2259 | 100 | 25 |
| Dihydroxyacetone (DHA) kinase, L subunit (DhaL) | 982 | halTADL_2259 | 92 | 0.9 |
| Dihydroxyacetone (DHA) kinase, K subunit (DhaK) | 140 | halTADL_2260 | 94 | 45 |
| Glycosylation / Capsular polysaccharide | | | | |
| Glucose-1-phosphate thymidylyltransferase (RfbA, RffH) | 37 | halTADL_3353 | 100 | 118 |
| Glucose-1-phosphate thymidylyltransferase (RfbA, RffH) | 421 | halTADL_3353 | 93 | 11 |
| Glycosyl transferase group 1 (possible α-D-glucan synthase) | 780 | halTADL_2565 | 100 | 2.3 |
| Nucleoside-diphosphate-sugar epimerase | 497 | halTADL_3057 | 100 | 8.0 |
| NUDIX hydrolase (NUDIX = NUcleoside DIphosphate linked to some other moiety X) | 1097 | halTADL_2550 | 100 | 0.4 |
| NUDIX hydrolase (NUDIX = NUcleoside DIphosphate linked to some other moiety X) | 1053 | halTADL_3253 | 100 | 0.6 |
| Oligosaccharyltransferase AglB | 540 | halTADL_2411 | 100 | 6.8 |
| Carbohydrate metabolism (other) | | | | |
| Carbohydrate kinase, FGGY (possible xylulokinase [XylB]) | 1041 | halTADL_2660 | 100 | 0.6 |
| Phosphoglucomutase/phosphomannomutase | 181 | halTADL_1712 | 100 | 36 |
| Ribose 5-phosphate isomerase A (RpiA) | 1012 | halTADL_1707 | 100 | 0.6 |
| Embden-Meyerhof (EM) pathway | | | | |
| Phosphoglucose isomerase | 528 | halTADL_0801 | 100 | 7.2 |
| 2-dehydro-3-deoxy-D-gluconate (KDG) kinase (ribokinase family) (KdgK) | 1038 | halTADL_2089 | 100 | 0.6 |
| Fructose-1,6-bisphosphate aldolase, class I (FbaB) | 171 | halTADL_0575 | 100 | 39 |
| Fructose-1,6-bisphosphate aldolase, class I (FbaB) | 733 | halTADL_0575 | 92 | 2.8 |
| Fructose 1,6-bisphosphate aldolase (multifunctional) | 225 | halTADL_3234 | 100 | 29 |
| Fructose 1,6-bisphosphate aldolase (multifunctional) | 420 | halTADL_3234 | 96 | 11 |
| Fructose-1,6-bisphosphate aldolase, class II (FbaA) | 845 | halTADL_3223 | 100 | 1.6 |
| Triosephosphate isomerase (TpiA) | 175 | halTADL_2532 | 100 | 37 |
| Triosephosphate isomerase (TpiA) | 249 | halTADL_2532 | 89 | 26 |
| Entner-Doudoroff (ED) pathway (semi-phosphorylative) | | | | |
| Gluconate dehydratase (GnaD) | 216 | halTADL_0374 | 100 | 31 |
| 2-dehydro-3-deoxy-D-gluconate (KDG) kinase (ribokinase family) (KdgK) | 1038 | halTADL_2089 | 100 | 0.6 |
| 2-dehydro-3-deoxyphosphogluconate (KDPG) aldolase (Eda) | 355 | halTADL_0882 | 100 | 15 |
| ED/EM common pathway | | | | |
| Glyceraldehyde-3-phosphate dehydrogenase (NAD(P)+-dependent), type I (Gap) | 110 | halTADL_0817 | 100 | 54 |
| Glyceraldehyde-3-phosphate dehydrogenase (NAD(P)+-dependent), type I (Gap) | 591 | halTADL_0817 | 90 | 5.2 |
| Phosphoglycerate kinase (Pgk) | 272 | halTADL_0816 | 100 | 24 |
| Phosphoglycerate kinase (Pgk) | 833 | halTADL_0816 | 94 | 1.7 |
| Enolase (phosphopyruvate hydratase) (Eno) | 25 | halTADL_2780 | 100 | 142 |
| Enolase (phosphopyruvate hydratase) (Eno) | 391 | halTADL_2780 | 90 | 12 |
| Pyruvate kinase (Pyk) | 334 | halTADL_3014 | 100 | 17 |
| Pyruvate kinase (Pyk) | 695 | halTADL_3014 | 91 | 3.4 |
| Gluconeogenesis | | | | |
| Phosphoenolpyruvate (PEP) synthase (Pps) | 451 | halTADL_1011 | 79 | 9.8 |

2.4.4 Nitrogen metabolism of Hht. litchfieldiae, DL31 and Hrr. lacusprofundi

The metaproteome revealed a number of different nitrogen sources targeted by the Deep Lake haloarchaea (Table 2.3). Secreted proteolytic enzymes (halolysins and aminopeptidases) for the degradation of extracytoplasmic proteinaceous material were detected for both Hht. litchfieldiae and DL31. The resulting products, oligopeptides and amino acids, can be taken up through specific transporters (see 2.4.2). No enzymes for the extracytoplasmic degradation of proteins were detected for Hrr. lacusprofundi, but transporter proteins for the uptake of amino acids and peptides were detected (Figure 2.8). Manual inspection of the Hrr. lacusprofundi genome revealed the presence of putative proteolytic enzymes. Hence, either Hrr. lacusprofundi does not express its own proteolytic enzymes and instead uses products generated by other species, or these enzymes escaped metaproteomic detection because this species is less abundant than the other two. Collectively, the data indicated that all three species used proteins and amino acids as potential nitrogen sources.

Assimilation of ammonia by Hht. litchfieldiae was indicated by the presence of an ammonium transporter, glutamine synthetase (GS) and glutamate synthase (GOGAT) in the metaproteome. GS was also detected for Hrr. lacusprofundi, whereas neither GS nor GOGAT was detected for DL31. Glutamate dehydrogenases (GDH) were detected for Hht. litchfieldiae and DL31. Hht. litchfieldiae is the only one of the three species that encodes proteins for the uptake and utilization of urea, and a urea ABC transporter lipoprotein and a urease subunit were present in the metaproteome.


Table 2.3 Proteins involved in the uptake and metabolism of nitrogen sources. The table shows all detected proteins involved in the uptake and metabolism of nitrogen sources for Hht. litchfieldiae, DL31 and Hrr. lacusprofundi. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. ‘nd’ denotes ‘not determined’ due to peptides matching to a protein family.

Hht. litchfieldiae

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Protein/peptide digestion (extracytoplasmic) | | | | |
| Halolysin (peptidase S8 and S53 subtilisin kexin sedolisin) | 477 | halTADL_1514 | 100 | 8.7 |
| Aminopeptidase (peptidase family M42) | 723 | halTADL_0101 | nd | 3.0 |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein | 373 | halTADL_2916 | 100 | 14 |
| Branched-chain amino acid ABC transporter solute-binding protein | 202 | halTADL_2916 | 88 | 33 |
| Polar amino acid ABC transporter solute-binding protein | 486 | halTADL_0024 | 100 | 8.5 |
| Ammonium uptake & assimilation | | | | |
| Ammonium permease (ammonium transporter) (Amt) | 220 | halTADL_1826 | 100 | 30 |
| Glutamine synthetase (GlnA) → Gln | 88 | halTADL_3423 | 100 | 67 |
| Glutamate synthase (GltB) → Glu | 760 | halTADL_0125 | 100 | 2.5 |
| Amino acid biosynthesis | | | | |
| Aspartate aminotransferase (AspB) → Asp, Phe | 1086 | halTADL_0403 | 100 | 0.4 |
| Aspartate aminotransferase (AspB) → Asp, Phe | 173 | halTADL_3081 | 100 | 38 |
| Aspartate aminotransferase (AspB) → Asp, Phe | 619 | halTADL_3081 | 97 | 4.8 |
| Carbamoyl-phosphate synthase, large subunit (CarB) → Arg | 469 | halTADL_0988 | 100 | 9.1 |
| Aspartate kinase (LysC) → Lys, Thr, Met | 527 | halTADL_1916 | 100 | 7.3 |
| Aspartate kinase (LysC) → Lys, Thr, Met | 905 | halTADL_1916 | 90 | 1.3 |
| Aspartate-semialdehyde dehydrogenase (Asd) → Lys, Thr, Met | 642 | halTADL_0714 | 100 | 4.3 |
| Homoserine dehydrogenase (MetL) → Lys, Thr, Met | 430 | halTADL_0649 | 100 | 11 |
| Threonine synthase (ThrC) → Thr | 655 | halTADL_2266 | 97 | 4.0 |
| Cystathionine gamma-synthase (MetB) or O-acetylhomoserine (thiol)-lyase (MetY) → Met | 553 | halTADL_1890 | 100 | 6.5 |
| 5-methyltetrahydropteroyltriglutamate-homocysteine methyltransferase (MetE) → Met | 487 | halTADL_0179 | 100 | 8.4 |
| 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase (DapD) → Lys | 402 | halTADL_0281 | nd | 12 |
| Glutamate-5-semialdehyde dehydrogenase (ProA) → Pro | 600 | halTADL_2358 | nd | 5.1 |
| Pyrroline-5-carboxylate reductase (ProC) → Pro | 588 | halTADL_2360 | 100 | 5.3 |
| ATP phosphoribosyltransferase (HisG) → His | 602 | halTADL_1729 | 100 | 5.1 |
| Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA) → His | 787 | halTADL_1799 | 100 | 2.2 |
| Imidazoleglycerol-phosphate dehydratase (HisB) → His | 442 | halTADL_1797 | 100 | 10 |
| Phosphoserine phosphatase (SerB) → Ser | 590 | halTADL_1053 | 100 | 5.3 |
| Phosphoserine phosphatase (SerB) → Ser | 309 | halTADL_2046 | 96 | 19 |
| D-3-phosphoglycerate dehydrogenase (SerA) → Ser | 131 | halTADL_2045 | 100 | 47 |
| D-3-phosphoglycerate dehydrogenase (SerA) → Ser | 236 | halTADL_2045 | 93 | 28 |
| D-3-phosphoglycerate dehydrogenase (SerA) → Ser | 626 | halTADL_0712 | 100 | 4.6 |
| Glycine hydroxymethyltransferase (GlyA) → Gly | 501 | halTADL_3114 | nd | 7.9 |
| Rhodanese-like protein / thiosulfate sulfurtransferase → Cys? | 83 | halTADL_2750 | 100 | 74 |
| Rhodanese-like protein / thiosulfate sulfurtransferase → Cys? | 258 | halTADL_2750 | 92 | 25 |
| Fructose 1,6-bisphosphate aldolase (multifunctional) → Trp, Phe, Tyr | 225 | halTADL_3234 | 100 | 29 |
| Fructose 1,6-bisphosphate aldolase (multifunctional) → Trp, Phe, Tyr | 420 | halTADL_3234 | 96 | 11 |
| 2-amino-3,7-dideoxy-D-threo-hept-6-ulosonate synthase → Trp, Phe, Tyr | 171 | halTADL_0575 | 100 | 39 |
| 2-amino-3,7-dideoxy-D-threo-hept-6-ulosonate synthase → Trp, Phe, Tyr | 733 | halTADL_0575 | 92 | 2.8 |
| Dehydroquinate synthase II → Trp, Phe, Tyr | 408 | halTADL_0574 | 100 | 12 |
| Shikimate kinase (AroB) → Trp, Phe, Tyr | 984 | halTADL_2582 | 100 | 0.9 |
| Tryptophan synthase, alpha subunit (TrpA) → Trp | 888 | halTADL_0576 | nd | 1.5 |
| Anthranilate phosphoribosyltransferase (TrpD) → Trp | 390 | halTADL_0889 | 100 | 13 |
| Anthranilate phosphoribosyltransferase (TrpD) → Trp | 290 | halTADL_3066 | 100 | 21 |
| Anthranilate phosphoribosyltransferase (TrpD) → Trp | 1049 | halTADL_3066 | 92 | 0.6 |
| Prephenate dehydratase (PheA2) → Phe | 701 | halTADL_2073 | 100 | 3.3 |
| Branched-chain amino acid aminotransferase (IlvE) → Leu, Val, Ile | 841 | halTADL_1961 | nd | 1.6 |
| 3-isopropylmalate dehydrogenase (LeuB) → Leu | 385 | halTADL_0366 | 100 | 13 |
| 3-isopropylmalate dehydrogenase (LeuB) → Leu | 1084 | halTADL_0366 | 93 | 0.4 |
| 3-isopropylmalate/(R)-2-methylmalate dehydratase, large subunit (LeuC) → Leu, Ile | 890 | halTADL_0364 | nd | 1.4 |
| 3-isopropylmalate/(R)-2-methylmalate dehydratase, small subunit (LeuD) → Leu, Ile | 894 | halTADL_0365 | nd | 1.4 |
| Acetolactate synthase, small subunit (IlvH) → Ile, Val | 560 | halTADL_0361 | 100 | 6.3 |
| Ketol-acid reductoisomerase (IlvC) → Ile, Val | 251 | halTADL_0362 | 100 | 26 |
| Dihydroxy-acid dehydratase (IlvD) → Ile, Val | 323 | halTADL_2417 | nd | 18 |
| 2-isopropylmalate synthase (LeuA) → Leu | 729 | halTADL_0359 | 100 | 2.9 |
| Citramalate synthase (CimA) → Ile | 605 | halTADL_1156 | 100 | 5.0 |
| Amino acid degradation | | | | |
| Glutamate dehydrogenase (GdhA) → Glu | 1013 | halTADL_1757 | 81 | 0.6 |
| S-adenosylmethionine synthetase (Mat) → Met | 458 | halTADL_3028 | 100 | 9.5 |
| S-adenosylhomocysteine hydrolase (AchY) → Met | 826 | halTADL_1723 | 100 | 1.8 |
| 2-oxoacid dehydrogenase complex, E2 component → Leu, Val, Ile | 631 | halTADL_2147 | 100 | 4.5 |
| 2-oxoacid dehydrogenase complex, dihydrolipoamide dehydrogenase → Leu, Val, Ile | 679 | halTADL_2144 | 100 | 3.6 |
| Urea uptake & degradation | | | | |
| Urea ABC transporter solute-binding protein | 617 | halTADL_0628 | 100 | 4.8 |
| Urease, beta subunit (UreB) | 1089 | halTADL_0634 | nd | 0.4 |

DL31

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Protein/peptide digestion (extracytoplasmic) | | | | |
| Halolysin (peptidase S8 and S53 subtilisin kexin sedolisin) | 1080 | Halar_3678 | 100 | 0.4 |
| Aminopeptidase (peptidase family M42) | 745 | Halar_3640 | 100 | 2.7 |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein | 158 | Halar_1569 | 100 | 41.1 |
| Branched-chain amino acid ABC transporter solute-binding protein | 187 | Halar_2890 | 100 | 35.5 |
| Branched-chain amino acid ABC transporter solute-binding protein | 192 | Halar_3433 | 100 | 35.2 |
| Oligopeptide/dipeptide uptake | | | | |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 9 | Halar_2016 | 100 | 277.4 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 76 | Halar_1439 | 100 | 81.8 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 189 | Halar_0722 | 100 | 35.4 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 255 | Halar_1146 | 100 | 25.7 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 582 | Halar_1285 | 100 | 5.5 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 654 | Halar_3436 | 100 | 4.1 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 818 | Halar_2651 | 100 | 1.8 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 879 | Halar_2024 | 100 | 1.5 |
| Amino acid biosynthesis & degradation | | | | |
| Branched chain amino acid aminotransferase (IlvE) → Leu, Val, Ile | 371 | Halar_2889 | 100 | 13.7 |
| Pyrroline-5-carboxylate dehydrogenase (RocA) → Pro | 743 | Halar_1051 | 100 | 2.7 |
| 3-isopropylmalate dehydrogenase (LeuB) → Leu | 1026 | Halar_2164 | 100 | 0.6 |
| Glutamate dehydrogenase (GdhA) → Glu | 378 | Halar_0758 | 100 | 13.3 |

Hrr. lacusprofundi

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Amino acid uptake | | | | |
| Branched-chain amino acid ABC transporter solute-binding protein (gene appears to be interrupted by a transposase) | 160 | Hlac_2093 | 100 | 40.5 |
| Polar amino acid ABC transporter solute-binding protein | 891 | Hlac_1804 | 100 | 1.4 |
| Oligopeptide/dipeptide uptake | | | | |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 71 | Hlac_0069 | 100 | 86.5 |
| Oligopeptide/dipeptide ABC transporter solute-binding protein | 630 | Hlac_0244 | 100 | 4.5 |
| Ammonium uptake & assimilation | | | | |
| Glutamine synthetase (GS) (GlnA) → Gln | 375 | Hlac_2374 | 100 | 13.5 |
| Amino acid biosynthesis | | | | |
| Histidinol-phosphate aminotransferase (HisC) → His | 1017 | Hlac_0235 | 100 | 0.6 |

2.4.5 Motility and taxis

Archaellins were a group of proteins with high abundance in the Deep Lake metaproteome, particularly in the 0.1 µm filter samples. For Hht. litchfieldiae, ten different archaellin proteins (FlaA/B) were detected, including the proteins with the 1st, 3rd, 7th and 12th highest spectrum counts (Table 2.4). The Hht. litchfieldiae genome is conspicuous in containing a total of seven archaellin genes. Four of them (halTADL_1810–1813) are organised in one cluster adjacent to a set of fla accessory genes. The remaining three archaellin genes are located at two distinct locations in the genome, both without neighbouring fla accessory genes. In addition to archaellin proteins, the metaproteome contained type IV adhesion pili (PilA) proteins, which are involved in surface adhesion and biofilm formation (Esquivel et al., 2013). Three PilA proteins were detected for Hht. litchfieldiae and two each for DL31 and Hrr. lacusprofundi (Table 2.4).

The Deep Lake metaproteome contained a number of Hht. litchfieldiae proteins involved in chemotactic signal transduction, including several methyl-accepting chemotaxis proteins (MCPs), which act as receptors for environmental stimuli (Table 2.4). The relatively high number of detected MCP proteins for Hht. litchfieldiae is consistent with the overrepresentation of MCP genes in its genome (Williams et al., 2014). Hht. litchfieldiae is the only one of the isolated Deep Lake haloarchaea possessing a gene for the light-dependent proton pump bacteriorhodopsin (BR) (Williams et al., 2014), and the protein was detected in the metaproteome. In addition, the metaproteome contained the Hht. litchfieldiae transducer protein HtrII for the phototactic sensory rhodopsin II (SRII). Apart from those of Hht. litchfieldiae, only one other archaellin protein was detected, which was assigned to Hrr. lacusprofundi; no signal transduction proteins were detected for this species. DL31 does not possess genes for archaella (Williams et al., 2014), but four proteins involved in chemotaxis were detected (Table 2.4), indicating that DL31 might use motility structures different from archaella.


Table 2.4 Motility and taxis proteins. The table shows all detected proteins involved in motility and taxis for Hht. litchfieldiae, DL31 and Hrr. lacusprofundi. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples.

Hht. litchfieldiae

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Archaella | | | | |
| Archaellin (FlaA or FlaB) | 1 | halTADL_1544 | 100 | 695 |
| Archaellin (FlaA or FlaB) | 786 | halTADL_1544 | 75 | 2 |
| Archaellin (FlaA or FlaB) | 3 | halTADL_1812 | 100 | 497 |
| Archaellin (FlaA or FlaB) | 59 | halTADL_1812 | 77 | 94 |
| Archaellin (FlaA or FlaB) | 7 | halTADL_1813 | 100 | 351 |
| Archaellin (FlaA or FlaB) | 12 | halTADL_1813 | 76 | 239 |
| Archaellin (FlaA or FlaB) | 21 | halTADL_1810 | 100 | 169 |
| Archaellin (FlaA or FlaB) | 80 | halTADL_1811 | 100 | 79 |
| Archaellin (FlaA or FlaB) | 118 | halTADL_0078 | 100 | 52 |
| Archaellin (FlaA or FlaB) | 95 | halTADL_0078 | 75 | 59 |
| Archaellar protein FlaG | 429 | halTADL_1803 | 100 | 12 |
| Archaellar protein FlaC or FlaD or FlaE | 761 | halTADL_1805 | 100 | 2.5 |
| Adhesion | | | | |
| Adhesion pili (PilA) | 750 | halTADL_0751 | 65 | 2.6 |
| Adhesion pili (PilA) | 176 | halTADL_1387 | 100 | 37.2 |
| Adhesion pili (PilA) | 666 | halTADL_1387 | 66 | 3.9 |
| Taxis | | | | |
| Bacteriorhodopsin | 752 | halTADL_1952 | 100 | 2.6 |
| Methyl-accepting chemotaxis sensory transducer (HtrII) (for sensory rhodopsin II) | 245 | halTADL_3325 | 100 | 26 |
| Globin domain + methyl-accepting chemotaxis sensory transducer | 235 | halTADL_0074 | 100 | 28 |
| Heme-based aerotactic transducer HemAT | 351 | halTADL_1627 | 100 | 15 |
| PBS_HEAT protein; taxis signaling | 684 | halTADL_1768 | 100 | 3.4 |
| Methyl-accepting chemotaxis sensory transducer with Pas/Pac sensor | 1093 | halTADL_1218 | 100 | 0.4 |
| Signal transduction protein with CBS domains | 292 | halTADL_1865 | 100 | 21 |
| Chemotaxis signal transduction protein CheW | 473 | halTADL_1838 | 100 | 8.9 |
| Chemotaxis signal transduction protein CheW | 925 | halTADL_1838 | 91 | 1.2 |
| Chemotaxis response regulator CheY | 300 | halTADL_1808 | 100 | 20 |
| Chemotaxis response regulator CheY | 566 | halTADL_1808 | 94 | 6.0 |
| Response regulator receiver protein | 538 | halTADL_2200 | 100 | 6.9 |
| Response regulator receiver protein | 1036 | halTADL_1816 | 100 | 0.6 |
| Response regulator receiver domain + HalX domain | 102 | halTADL_0055 | 96 | 29 |
| KaiC domain | 278 | halTADL_1815 | 100 | 23 |
| KaiC domain | 924 | halTADL_1815 | 94 | 1.2 |

DL31

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Taxis | | | | |
| Signal transduction protein with CBS domains | 673 | Halar_1253 | 100 | 3.7 |
| Signal transduction protein with CBS domains | 1008 | Halar_2982 | 100 | 0.6 |
| CheY-like receiver domain | 1007 | Halar_1479 | 100 | 0.6 |
| Globin-like domain | 536 | Halar_2132 | 100 | 6.9 |
| Adhesion | | | | |
| Adhesion pilin (PilA) | 314 | Halar_2365 | 100 | 18.3 |
| Adhesion pilin (PilA) | 976 | Halar_3709 | 100 | 0.9 |

Hrr. lacusprofundi

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Archaella | | | | |
| Archaellin (FlaA or FlaB) | 30 | Hlac_2557 | 43 | 132.8 |
| Adhesion | | | | |
| Adhesion pilin (PilA) | 434 | Hlac_1363 | 100 | 10.5 |
| Adhesion pilin (PilA) | 1108 | Hlac_3311 | 100 | 0.4 |

2.4.6 Haloarchaeal responses to UV and oxidative stress

During the Antarctic summer, with its prolonged periods of sunshine, UV radiation levels can be high and, because UV radiation is mutagenic, constitute challenging conditions for the resident macro- and microfauna. Since Deep Lake has very little organic input and biological activity, the water is very clear and light penetrates to the bottom of the lake (Barker, 1981). The Deep Lake metaproteome included a number of proteins involved in the repair of UV-induced DNA damage and proteins that directly protect cells against intense sunlight (Table 2.5). In addition to its direct deleterious effects on nucleic acids, UV irradiation causes the production of harmful reactive oxygen species (ROS) when oxygen is present (Santos et al., 2012a), and dissolved oxygen levels in Deep Lake were reported to be close to saturation (Barker, 1981). A number of proteins in the metaproteome indicate that the haloarchaeal species were exposed to oxidative stress at the time of sampling (Table 2.5).


Table 2.5 UV and oxidative stress repair and protection proteins. The table shows all detected proteins involved in the repair and prevention of damage from UV irradiation and oxidative stress for Hht. litchfieldiae, DL31 and Hrr. lacusprofundi. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. ‘nd’ denotes ‘not determined’ due to peptides matching to a protein family.

Hht. litchfieldiae

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Manganese superoxide dismutase (SOD) | 49 | halTADL_2687 | 100 | 102.2 |
| Manganese superoxide dismutase (SOD) | 164 | halTADL_2687 | 95 | 39.2 |
| Methionine sulfoxide reductase | 686 | halTADL_1172 | 100 | 3.4 |
| Thioredoxin | 196 | halTADL_1178 | 100 | 33.4 |
| Thioredoxin | 359 | halTADL_1756 | 100 | 14.5 |
| Thioredoxin | 341 | halTADL_2077 | 100 | 15.9 |
| Thioredoxin | 343 | halTADL_2563 | 100 | 15.7 |
| Thioredoxin | 665 | halTADL_2563 | 91 | 3.9 |
| Glutaredoxin | 753 | halTADL_0399 | 100 | 2.5 |
| Glutaredoxin | 689 | halTADL_2104 | 100 | 3.4 |
| Glutaredoxin | 1014 | halTADL_2104 | 94 | 0.6 |
| Peroxiredoxin | 609 | halTADL_0067 | 100 | 4.9 |
| Dps ferritin (miniferritin) | 15 | halTADL_1068 | 100 | 220.3 |
| Dps ferritin (miniferritin) | 20 | halTADL_1068 | 92 | 170.1 |
| Universal stress protein (UspA) | 846 | halTADL_0697 | 100 | 1.6 |
| Universal stress protein (UspA) | 621 | halTADL_1044 | 100 | 4.7 |
| Universal stress protein (UspA) | 45 | halTADL_1904 | 100 | 104.8 |
| Universal stress protein (UspA) | 851 | halTADL_1904 | 76 | 1.6 |
| Universal stress protein (UspA) | 802 | halTADL_2110 | 100 | 2.1 |
| Universal stress protein (UspA) | 502 | halTADL_2112 | 100 | 7.8 |
| Universal stress protein (UspA) | 910 | halTADL_2276 | 75 | 1.2 |
| Universal stress protein (UspA) | 574 | halTADL_2351 | 100 | 5.8 |
| RosR transcriptional regulator | 849 | halTADL_0352 | nd | 1.6 |
| RosR transcriptional regulator | 107 | halTADL_1645 | 100 | 55.3 |
| RosR transcriptional regulator | 376 | halTADL_1645 | 92 | 13.5 |
| Dodecin | 238 | halTADL_3198 | 95 | 27.5 |
| Dodecin | 714 | halTADL_3198 | 100 | 3 |
| RadA (Rad51/RecA recombinase homolog) | 363 | halTADL_1827 | 100 | 14.1 |
| RadA (Rad51/RecA recombinase homolog) | 284 | halTADL_2135 | 90 | 21.7 |
| RadA (Rad51/RecA recombinase homolog) | 295 | halTADL_2135 | 100 | 20.1 |
| RecJ-like exonuclease | 644 | halTADL_1000 | nd | 4.3 |
| Topoisomerase VI (subunit B (Top6B)) | 1046 | halTADL_3021 | 100 | 0.6 |
| Ribonucleotide reductase (RNR; adenosylcobalamin-dependent) | 393 | halTADL_0884 | 100 | 12.4 |
| Ribonucleotide reductase (RNR; adenosylcobalamin-dependent) | 965 | halTADL_0884 | 91 | 0.9 |
| UvrD helicase domain protein | 911 | halTADL_2299 | 100 | 1.2 |
| Replication protein A (RPA) ssDNA-binding complex component (RPA32) | 960 | halTADL_2569 | 100 | 0.9 |
| Replication protein A (RPA) ssDNA-binding complex component (RPA32) | 757 | halTADL_3434 | 100 | 2.5 |
| Replication protein A (RPA) ssDNA-binding complex component (RPA41) | 206 | halTADL_3433 | 100 | 32 |

DL31

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Manganese superoxide dismutase (SOD) | 122 | Halar_1640 | 100 | 50.5 |
| Thioredoxin | 548 | Halar_3305 | 100 | 6.6 |
| Peroxiredoxin | 861 | Halar_1849 | 100 | 1.5 |
| Dps ferritin (miniferritin) | 185 | Halar_0843 | 100 | 35.9 |
| Dps ferritin (miniferritin) | 974 | Halar_0845 | 100 | 0.9 |
| RosR transcriptional regulator | 791 | Halar_0879 | 100 | 2.2 |
| Dodecin | 109 | Halar_2184 | 100 | 10.5 |
| RadA (Rad51/RecA recombinase homolog) | 956 | Halar_3361 | 100 | 0.9 |
| Replication protein A (RPA) ssDNA-binding complex component (RPA32) | 826 | Halar_3463 | 100 | 1.7 |

Hrr. lacusprofundi

| Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count |
| --- | --- | --- | --- | --- |
| Manganese superoxide dismutase (SOD) | 480 | Hlac_2515 | 100 | 8.7 |
| Thioredoxin | 637 | Hlac_0372 | 100 | 4.4 |
| Dps ferritin (miniferritin) | 386 | Hlac_0536 | 100 | 12.8 |
| RadA (Rad51/RecA recombinase homolog) | 1107 | Hlac_2624 | 100 | 0.4 |

2.5 Discussion

Deep Lake is the first hypersaline lake to be studied using metaproteomics. Following metagenomic analysis of Deep Lake (DeMaere et al., 2013) and genomic analysis of four isolated haloarchaeal species (Williams et al., 2014), the metaproteomics complements our understanding of the Deep Lake microbial community. In many instances the metaproteomic data agreed with the previous studies, but they also revealed unanticipated and novel findings. DeMaere et al. (2013) showed that Deep Lake harbours a low-complexity microbial community dominated by a few haloarchaeal species and that community composition is overall homogeneous throughout the lake. These findings were confirmed in this study. The metaproteomics indicated that functional partitioning occurs between the smallest and the larger size fractions. However, the lake otherwise appeared functionally homogeneous throughout the water column. This makes Deep Lake distinct from other studied lakes in the Vestfold Hills region such as Ace Lake and Organic Lake, which are stratified, with microbial communities and their physiological functioning varying between size fractions and depths (Lauro et al., 2011; Yau et al., 2013). The Deep Lake metaproteome was dominated by proteins from the three most abundant species, Hht. litchfieldiae, DL31 and Hrr. lacusprofundi, and revealed marked physiological distinctions between them. The main findings are discussed below.

2.5.1 The dependency of Hht. litchfieldiae on Dunaliella

The unicellular green alga Dunaliella sp., the only eukaryotic primary producer in Deep Lake (Wright and Burton, 1981), is known to produce high cellular amounts of glycerol as an osmotic solute (Borowitzka et al., 1977). Glycerol has therefore been proposed to serve as a main carbon source for halophilic archaea (Elevi Bardavid et al., 2008; Falb et al., 2008), and in vitro growth assays showed growth of Hht. litchfieldiae and Hrr. lacusprofundi with glycerol as the sole carbon source (Franzmann et al., 1988; Williams et al., 2014). All three main species have the genomic potential to utilize glycerol, and Hht. litchfieldiae stands out with three glycerol kinase orthologues present in its genome (Williams et al., 2014). The metaproteome data highlighted that Hht. litchfieldiae utilized glycerol in the environment, which might be a contributing factor to its dominance in Deep Lake (Table 2.2). The lack of detection of glycerol dehydrogenase and the higher abundance of proteins from the first glycerol pathway compared to the second one could be due to repression of glycerol dehydrogenase by an Lrp transcriptional regulator, reducing the flow of glycerol to DHA, as has been hypothesised for Halobacterium salinarum R1 (Schwaiger et al., 2010). DHA is produced by Dunaliella as an intermediate during synthesis and degradation of glycerol and can be readily used as a carbon source by haloarchaeal species (Elevi Bardavid et al., 2008; Elevi Bardavid and Oren, 2008; Ouellette et al., 2013). Furthermore, the metaproteome indicated that Hht. litchfieldiae used starch as a nutrient, which is consistent with genomic analysis where starch-degrading glycosyl hydrolases were found to be overrepresented (COG categories 0366 and 3387) for Hht. litchfieldiae (Williams et al., 2014). Starch is produced by Dunaliella and used as a carbon/energy storage compound as well as a precursor for glycerol production during osmoregulation (Goyal, 2007). Monosaccharides produced during starch degradation could be taken up through carbohydrate ABC transporters, which were detected for Hht. litchfieldiae (Table 2.1). Following these metaproteomic findings, both DHA and starch were confirmed as potential substrates for Hht. litchfieldiae in growth experiments (Figure 2.10 and Figure 2.11).

2.5.2 Phosphorus as a limiting factor for Hht. litchfieldiae

The relatively high abundance of phosphate-targeting ABC transporters detected for Hht. litchfieldiae (Figure 2.8; Table 2.1) indicated a high demand for phosphorus for this species. Phosphate ABC transporters were also prominent in the metaproteome of the phosphorus-depleted (< 5 nM) Sargasso Sea (Sowell et al., 2009). By contrast, phosphate ABC transporters were absent in the Ace Lake (Lauro et al., 2011) and South Atlantic gyre (Morris et al., 2010) metaproteomes, two environments with comparatively higher phosphate concentrations (1 – 12 µM and 0.1 – 0.4 µM, respectively). In Deep Lake, phosphate concentrations have been reported to be low throughout the lake (0.52 – 2.34 nM) (Barker, 1981; Williams et al., 2014). Hht. litchfieldiae also targeted phosphonates as a phosphorus source, and growth on phosphonates was confirmed for Hht. litchfieldiae cultures (Figure 2.9). Phosphonates are widespread organophosphorus molecules characterized by highly stable C-P bonds, which can be used as a phosphate source when other, more bioavailable phosphate is scarce (Quinn et al., 2007). The high demand for phosphorus of Hht. litchfieldiae could be explained by its saccharolytic metabolism, during which incorporated carbohydrates including glycerol are ‘trapped’ inside the cell through phosphorylation. The detection of the putative eDNA-binding protein halTADL_0044 further suggested that Hht. litchfieldiae could be using eDNA as a source of phosphate. Haloferax volcanii was shown to be capable of using eDNA as a source of phosphate (Chimileski et al., 2014), with the membrane-bound nuclease Hvo_1477 identified as a key enzyme. On the Hht. litchfieldiae genome, halTADL_0044 is encoded next to a homologue of Hvo_1477 (halTADL_0045, not detected), suggesting a similar function in Hht. litchfieldiae. No measurements of eDNA concentrations were available for Deep Lake, but eDNA is present in many different environments (Dell'Anno and Danovaro, 2005), with particularly high concentrations found in hypersaline systems (Danovaro et al., 2005). Multiple strains of different haloarchaeal species have tested positive for eDNA degradation (Oren, 2014d). It is therefore possible that eDNA plays an important role as a phosphorus source in Deep Lake, where bioavailable phosphate concentrations are low and high salinity and cold temperatures should support eDNA preservation. However, since growth experiments with eDNA were unsuccessful, other possible functions for the eDNA-binding protein need to be considered, e.g. binding of eDNA during transformation. Further experiments are required to elucidate the function of halTADL_0044.

2.5.3 Diverse strategies of nitrogen acquisition

Of the three isolated species, Hht. litchfieldiae appears to be the most versatile regarding its targeted nitrogen sources. The metaproteome included proteins involved in the utilization of ammonia, proteins, amino acids and urea as nitrogen sources (Table 2.3). Williams et al. (2014) showed that Hht. litchfieldiae is capable of growth with urea as the sole nitrogen source but not with urea as the sole carbon source. Urea in Deep Lake may be derived from pinniped excreta (Williams et al., 2014); however, no measurements of urea are available. Since urea utilization requires more energy, it has been hypothesised that it would only be advantageous when available ammonia concentrations are low (Alonso-Saez et al., 2012). It is therefore likely that at the time of sampling ammonia availability was low for Deep Lake haloarchaea and that the detected GDH was involved in glutamate catabolism rather than ammonia assimilation, as has been described for the pathogen Porphyromonas gingivalis, a member of the Bacteroidetes (Takahashi et al., 2000). DL31, on the other hand, appears to obtain its nitrogen exclusively from proteinaceous material through the uptake of proteins and amino acids, which is reflected in the high abundance of the respective ABC transporter proteins (Figure 2.8). The absence of GS and GOGAT in the metaproteome for DL31 and the lack of an ammonia transporter in the DL31 genome are in agreement with the hypothesis that DL31 might not require exogenous ammonia to satisfy its nitrogen needs (Williams et al., 2014). For Hrr. lacusprofundi, usage of ammonia, peptides and amino acids as nitrogen sources was reflected in the metaproteome (Table 2.3).

2.5.4 Hht. litchfieldiae is very motile and responsive to the environment

The name archaellum, derived from that of the bacterial motility structure, the flagellum, was introduced to describe functionally similar structures in archaea (Jarrell and Albers, 2012). However, the similarity between the archaellum and the flagellum is limited to the fact that both serve as a structure for swimming motility. The two structures are composed of different sets of proteins and have distinct underlying mechanisms (Albers and Jarrell, 2015). While there is no evolutionary relationship between the two structures, archaella are related to bacterial type IV pili and have been described as a ‘rotating variant of type IV pili’ (Albers and Jarrell, 2015). The metaproteomic data highlight that the multiple copies of archaellin genes present on the Hht. litchfieldiae genome are actively expressed by the Hht. litchfieldiae population (Table 2.4). Since archaellin proteins were particularly abundant in the 0.1 µm filter samples compared to the larger size fractions, it is possible that archaellins were predominantly expressed by single, motile cells in the smallest size fraction as opposed to mostly particle-associated cells in the larger fractions. The rotation direction of archaella can be switched, which allows directed movement of cells in response to certain environmental stimuli in a process termed taxis (Alam and Oesterhelt, 1984; Szurmant and Ordal, 2004). The Deep Lake metaproteome was indicative of active chemo-, aero- and phototaxis for Hht. litchfieldiae (Table 2.4). The detection of BR indicated that Hht. litchfieldiae uses light as an energy source. Furthermore, the detection of the Hht. litchfieldiae signal transducer HtrII is indicative of an active SRII system; HtrII is the protein responsible for signal transduction from SRII to the archaellar apparatus (Inoue et al., 2014). Like BR, SRII is an integral membrane protein that absorbs light, but while BR acts as a proton pump, SRII acts as a light-signal receptor in phototaxis (Inoue et al., 2014). SRII captures blue light and induces negative phototaxis, i.e. movement away from a potentially harmful light source. This mechanism could help to protect cells from damaging levels of UV radiation present during the Antarctic summer. One detected MCP (halTADL_0074) belongs to the HemAT family of signal transducers that is involved in aerotaxis in Halobacterium salinarum (Hou et al., 2000). Another MCP (halTADL_1218) with a PAS domain at its N-terminus has similarity to the transducer protein Car in Hbt. salinarum, shown to be involved in chemotaxis towards arginine (Storch et al., 1999). One detected protein (halTADL_1768) contained a HEAT_PBS domain and the encoding gene is located near an MCP gene (halTADL_1773). In Hbt. salinarum a protein with this domain was shown to link chemotaxis (Che) signal transduction proteins with the archaellar apparatus. Cells lacking this protein were unable to switch the rotation direction of their archaella (Schlesner et al., 2009). Multiple MCPs were also detected for DL31 (Table 2.4) which, unlike Hht. litchfieldiae, does not possess archaellin genes, indicating that transduction of environmental signals is also linked to cellular components other than archaella. For example, myxobacteria can glide over solid surfaces using type IV pili (Mauriello et al., 2010). Unlike archaella, type IV pili can extend and retract and thereby allow a bacterium to move over a surface (twitching motility) (Burrows, 2005). The detection of adhesion pili for all three species (Table 2.4) is indicative of cells attaching to surfaces. This is consistent with results from microscopy of Deep Lake water, where many cells were found associated with particles.

2.5.5 Multiple responses to UV-induced damage

High irradiation levels, in particular irradiation within the UV range, act mutagenically on nucleic acids and cause the production of ROS. Hence, they can have deleterious effects on cells and organisms. Hypersaline environments around the world are often exposed to high irradiation levels (Ventosa, 2006). In Antarctica this is especially the case during the long summer days. Many studies have shown that haloarchaea are well adapted to UV- and ROS-induced stress and have identified a number of key proteins involved in protection and repair (McCready and Marcello, 2003; McCready et al., 2005; Boubriak et al., 2008; Sharma et al., 2012; Oren, 2014a).

The Deep Lake metaproteome contained a number of proteins hypothesised to be involved in a range of different repair and protection mechanisms (Table 2.5). RadA, the archaeal homologue of the bacterial RecA and the eukaryal Rad51 recombinase, was detected for all three species. In Hfx. volcanii, deletion of the radA gene led to a UV-sensitive mutant (Woods and Dyall-Smith, 1997). In gene expression studies of Hbt. salinarum NRC-1, radA showed the highest upregulation following UV exposure (McCready et al., 2005; Boubriak et al., 2008), indicating a strong role of homologous recombination in UV-induced DNA repair. Other genes that were upregulated in these studies include RecJ-like exonuclease, topoisomerase VI and ribonucleotide reductase (RNR). RecJ-like exonuclease and topoisomerase VI are both thought to be involved in resolving stalled forks during DNA replication (Boubriak et al., 2008) and both proteins were detected for Hht. litchfieldiae. RNR catalyses the rate-limiting step in the de novo synthesis of deoxyribonucleotide triphosphates, necessary for DNA synthesis and repair (Boubriak et al., 2008), and the protein was detected for Hht. litchfieldiae. Furthermore, a homologue of the bacterial UvrD helicase involved in nucleotide excision repair and multiple replication protein A (RPA) ssDNA-binding complex proteins were detected for Hht. litchfieldiae. In eukaryotes, RPA-ssDNA complexes are involved in many DNA repair mechanisms and an RPA-ssDNA complex was shown to be induced through UV exposure (McCready et al., 2005). One RPA-ssDNA protein was also detected for DL31. Other proteins potentially involved in protection from light damage were dodecins, small proteins that bind and store riboflavin (Grininger et al., 2009). For all three species, superoxide dismutase (SOD), involved in the degradation of ROS, was found in the metaproteome (Table 2.5). Methionine sulfoxide reductase (MsrA), an enzyme catalysing the reduction of ROS-caused methionine sulfoxide back to methionine (Lowther et al., 2000), was detected for Hht. litchfieldiae. Peroxiredoxins, a ubiquitous family of antioxidant enzymes that reduce hydrogen peroxide and other hydroperoxides (Wood et al., 2003), were detected for Hht. litchfieldiae and DL31. Thioredoxins (Trxs) and glutaredoxins (Grxs) are small proteins that reduce disulphide bridges in a number of different partner enzymes, including many antioxidant enzymes, e.g. MsrA and peroxiredoxin (Meyer et al., 2009). Multiple Grxs were detected for Hht. litchfieldiae and Trxs were detected for all three species (Table 2.5). The metaproteome also included transcription factors from Hht. litchfieldiae and DL31 that are homologous to reactive oxygen species regulator (RosR) from Hbt. salinarum NRC-1; RosR was shown to be involved in the regulation of gene expression in response to oxidative stress (Sharma et al., 2012). Dps (DNA protection during starvation)-like ferritin proteins were detected for all three isolate species. Dps ferritins protect cells from oxidative damage in different ways: they prevent the formation of iron-induced hydroxyl radicals by sequestering free Fe2+ using either H2O2 (thereby eliminating harmful H2O2) or O2 as oxidizing agent, and they physically protect DNA from oxidative damage (Pulliainen et al., 2005; Wiedenheft et al., 2005; Zeth, 2012).

2.6 Conclusion

This study described the successful application of metaproteomics to a low-complexity haloarchaeal community to investigate the metabolic functioning of the most abundant species. Detected transport proteins and proteins involved in carbohydrate and nitrogen metabolism indicated that the three main species target different substrates and have distinct physiological traits. These data are indicative of niche partitioning, whereby the three haloarchaeal species coexist in the seemingly uniform water column of Deep Lake and collectively exploit the available nutrients. The data were particularly insightful for the dominant species Hht. litchfieldiae, which was found to rely on carbohydrate-based substrates produced by Dunaliella. The high abundance of archaella proteins and the detection of various proteins involved in sensing and signalling of environmental stimuli indicated that Hht. litchfieldiae is very responsive and actively moves through the lake towards favourable conditions. A high number of proteins were detected that are involved in the protection from and repair of damage caused by UV irradiation. These data indicated that the Deep Lake haloarchaea are exposed to harmful UV irradiation levels during the Antarctic summer and that they employ diverse mechanisms to protect themselves.

Chapter 3

Metaproteomics-led analyses of haloarchaea-virus interactions in Deep Lake

Co-authorship statement

Sections from this chapter have been published as:

Tschitschko B, Williams TJ, Allen MA, Paez-Espino D, Kyrpides N, Zhong L, Raftery MJ, Cavicchioli R (2015) Antarctic archaea-virus interactions: metaproteome-led analysis of invasion, evasion and adaptation. The ISME Journal 9: 2094-2107.

The contributions to the manuscript are as follows: I performed the metaproteomics (from sample preparation to mass spectral analysis), general protein data analyses, analyses of cell surface protein variation, and analyses of HIRs and BREX systems. Assistance in mass spectrometry was provided by Ling Zhong and Mark Raftery. Michelle Allen performed CRISPR spacer and repeat identification, and the identification and analysis of Cas proteins. Tim Williams performed analyses of viral proteins and encoding metagenomic contigs, and analyses of CRISPR loci and protospacer-containing metagenomic contigs. David Paez-Espino provided assistance for CRISPR analyses. Tim Williams and I annotated the proteins. Ricardo Cavicchioli, Tim Williams and I wrote the manuscript.

3.1 Abstract

Viruses are abundant in hypersaline environments in mild and temperate climates, where they are one of the main forces controlling microbial community dynamics. In hypersaline Deep Lake, the microbial community is dominated by haloarchaea. However, no study has yet assessed the Deep Lake virus population. The aim of this study was to investigate the viral community in Deep Lake and how it interacts with the haloarchaeal host populations. To achieve this, data from the Deep Lake metaproteome were complemented with further assessments of the metagenome. Detected viral proteins were analysed and provide a first assessment of the viral community in Deep Lake. Multiple proteins were detected that are involved in host-defence mechanisms against viral infection. Variation in host cell surface proteins, in particular the S-layer, was identified as a prominent mechanism to escape viral infection. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) systems were identified and shown to be actively employed by the Deep Lake haloarchaea against virus infection. Host-virus relationships were elucidated using CRISPR-spacer information. The study also included an assessment of the recently discovered BREX (Bacteriophage Exclusion) defence systems in the genomes of two Deep Lake haloarchaea and the metagenome. Overall, the study highlighted that viruses play a fundamental role in shaping the Deep Lake haloarchaeal community.

3.2 Introduction

Hypersaline aquatic systems are well known for harbouring high viral densities that generally outnumber bacterial and archaeal cells by a factor of 10 or more (Guixa-Boixareu et al., 1996; Oren et al., 1997; Santos et al., 2012b). Similarly, virus diversity (the number of distinct viral species) in hypersaline environments has been estimated to be around 10 times higher than host cell diversity (Emerson et al., 2013). Virus-induced cell lysis and the accompanying release of nutrients is an important contributor to nutrient cycling in aquatic environments (Wilhelm and Suttle, 1999; Weinbauer and Rassoulzadegan, 2004; Suttle, 2007; Anesio and Bellas, 2011) and might be of particular importance for the heterotrophic microbial communities in hypersaline systems, where primary production is often limited to the green alga Dunaliella (Elevi Bardavid et al., 2008; Oren, 2014b). Constant viral predation pressure has also been hypothesised to be a main driver of microbial diversity (Weinbauer and Rassoulzadegan, 2004; Rodriguez-Valera et al., 2009), which again might particularly apply to hypersaline systems, where protozoan grazers are often rare (Guixa-Boixareu et al., 1996). However, knowledge about haloarchaeoviruses (i.e. viruses infecting haloarchaea) and their specific interactions with host organisms is limited, partly due to the relatively small number of virus isolates (Dyall-Smith et al., 2003; Luk et al., 2014). Great progress has been made throughout the last decade through multiple culture-independent, ‘-omics’-based studies that directly investigated virus communities of temperate hypersaline systems, elucidating the existence of heterogeneous viral populations and how they interact with the mostly haloarchaea-dominated microbial populations (Santos et al., 2007; Rodriguez-Brito et al., 2010; Santos et al., 2010; Santos et al., 2011; Atanasova et al., 2012; Emerson et al., 2013).
However, there is still a shortage of available haloarchaeoviral genome sequences in public databases, impeding the interpretation of viral metagenome data.

Diverse viral populations have been documented in Antarctic freshwater, saline and hypersaline lake systems, including some from the Vestfold Hills region (Kepner et al., 1998; Madan et al., 2005; Lopez-Bueno et al., 2009; Lauro et al., 2011; Yau et al., 2011). In hypersaline Organic Lake, a metaproteogenomic approach (metagenomics and metaproteomics) led to the discovery of a virophage preying on the local phycodnavirus population (Yau et al., 2011). Organic Lake is a stratified, shallow (6.8 m) system with high concentrations of dimethylsulfide in the bottom water and a microbial community largely dominated by bacteria and to a lesser extent eukarya, whereas archaea are rare (Yau et al., 2013). By contrast, Deep Lake is a deep (36 m) and homogeneous lake that is so salty it remains ice-free throughout the year (Barker, 1981). Its microbial community is dominated by haloarchaea, making it vastly different to Organic Lake. Similar to temperate hypersaline systems, the haloarchaeal community of Deep Lake is of very low complexity but with different abundant species (Bowman et al., 2000; DeMaere et al., 2013). Hht. litchfieldiae strain tADL, DL31 and Hrr. lacusprofundi account for > 70% of the cellular community in Deep Lake (DeMaere et al., 2013). The three species plus a fourth, low-abundance member of the community (DL1, ~0.3%) were successfully isolated and their genomes sequenced (DeMaere et al., 2013).

Throughout the co-evolution of bacterial and archaeal host cells with their infecting viruses, many diverse anti-viral defence systems and strategies have emerged within host organisms to counter virus infections (Stern and Sorek, 2011; Samson et al., 2013). A first opportunity for defence is the prevention of virus adsorption (the binding of viruses to receptors on the host cell surface). Potential virus receptors are cell surface structures such as pili or archaella and other surface proteins, polysaccharides or lipopolysaccharides (LPSs) (Hyman and Abedon, 2010; Samson et al., 2013; Quemin and Quax, 2015). In bacteria, strategies to prevent virus adsorption include the hiding of virus receptors behind extracellular polymers, loss of receptors through aborted expression, or alterations in receptor structures through mutations (Hyman and Abedon, 2010). However, little is known about such strategies in archaea.
CRISPR-Cas represents an adaptive immune system that is present in ~40% of bacterial and ~90% of archaeal genomes (Jansen et al., 2002; Stern and Sorek, 2011). CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) are genomic loci that consist of multiple copies of conserved repeats (typically 20-50 bp long). Upon infection with foreign DNA (e.g. from a virus), short pieces of the invading DNA can be incorporated into the host genome in between CRISPR repeats; these incorporated foreign pieces of DNA are called spacers. A single CRISPR locus comprises multiple concatenated spacer-repeat units. CRISPR arrays are transcribed into one long RNA molecule (pre-crRNA) and subsequently cleaved into single repeat-spacer RNA units (crRNA). In case of an infection by a virus for which a spacer is present in the CRISPR array, the spacer-matching sequence in the virus DNA (called ‘protospacer’) is recognized and degraded in an RNA interference-like mode (Sorek et al., 2013). CRISPR defence is mediated by CRISPR-associated (Cas) proteins, which are encoded upstream of the CRISPR array. Cas proteins are responsible for the incorporation of new spacers, the processing of the pre-crRNA and the recognition and degradation of the target DNA (Makarova et al., 2011). CRISPR systems are highly dynamic: during viral predation new spacers are incorporated at the cas-proximal end of the array, whereas old spacers are lost from the distal end (Barrangou et al., 2007; Tyson and Banfield, 2008). Hence CRISPR spacers represent a memory of previous infection events and can be used to infer host-virus relationships.

Recently a novel anti-virus defence system termed BREX (Bacteriophage Exclusion) was discovered in Bacillus cereus (Goldfarb et al., 2015). Transformation of the B. cereus BREX gene cluster into the BREX-negative species B. subtilis conferred resistance against a wide range of viruses. BREX genes encode DNA/RNA manipulating proteins (helicase and methylase) and also protein modifying enzymes (Lon-like protease, alkaline phosphatase, serine/threonine kinase), but the exact mechanism by which these proteins counter virus infection is not yet known (Goldfarb et al., 2015). Based on the presence/absence of certain genes, BREX systems can be divided into distinct subtypes, with the type 5 BREX systems only identified within the archaeal class Halobacteria, including the Deep Lake species DL31 and Hrr. lacusprofundi. Overall, BREX systems are present in ~10% of microbial genomes (Goldfarb et al., 2015).

This chapter describes results from the Deep Lake metaproteome that provide information on haloarchaeal host-virus relationships in Deep Lake. The emphasis lies on proteins derived from viruses and host proteins informative of host-defence mechanisms. These data were further complemented with metagenomic and genomic analyses.
Using this combined ‘omics’ approach, the chapter provides a first assessment of the virus community in Deep Lake, elucidates specific virus-host relationships and describes haloarchaeal defence mechanisms employed against virus infection. Many of the described mechanisms and interactions have so far only been studied in ecosystems in mild and temperate climates, making this study a precedent for assessing the distinctiveness of Antarctic ecosystems.
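The spacer-repeat organisation of a CRISPR array described in this introduction can be illustrated with a short sketch. The Python below is not part of the thesis workflow (which used CRT and CRISPRFinder, see Materials and Methods); the function name `extract_spacers`, the repeat and the toy array are hypothetical and invented for illustration, and exact copies of the repeat are assumed.

```python
def extract_spacers(array_seq: str, repeat: str) -> list[str]:
    """Return the sequences lying between consecutive exact copies of `repeat`."""
    spacers = []
    pos = array_seq.find(repeat)
    while pos != -1:
        start = pos + len(repeat)            # first base after this repeat
        nxt = array_seq.find(repeat, start)  # start of the next repeat, if any
        if nxt == -1:
            break
        if nxt > start:
            spacers.append(array_seq[start:nxt])
        pos = nxt
    return spacers


# Toy array: repeat-spacer-repeat-spacer-repeat (all sequences are invented).
REPEAT = "GTTTCAGACG"
toy_array = REPEAT + "AAACCCGGGTTT" + REPEAT + "TTGACCATGCAA" + REPEAT
spacers = extract_spacers(toy_array, REPEAT)
print(spacers)
```

Real repeats within one array can differ by a few bases, so dedicated tools tolerate mismatches; the exact-match search here is a deliberate simplification.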

3.3 Materials and Methods

Deep Lake biomass sampling, metaproteomic analyses and functional and taxonomic annotation of proteins were performed as described in Chapter 2.

3.3.1.1 Metagenomic coverage analysis

Fragment recruitment (FR) of metagenomic reads to the four isolate genomes was performed using Artemis (Carver et al., 2012). Aligned metagenomic reads were loaded into Artemis using BAM files generated by GS Reference Mapper v2.6, as described previously (DeMaere et al., 2013). The quantitative value used for comparing recruited reads between different genes was reads per kilobase per recruited reads (rpkm).

3.3.1.2 CRISPR analyses

All CRISPR repeat sequences identified by the CRISPR Recognition Tool (CRT; Bland et al., 2007) were re-examined using CRISPRFinder (default settings; Grissa et al., 2007) and inspected for motifs associated with functional repeats (Maier et al., 2013). CRISPR repeat sequences were analysed by BLAST against the four-genome database or the metagenome database (see 2.3.2.3) using default settings, with matches required to have 100% identity over their full length. CRISPR spacer sequences identified using CRT were compared against the four-genome database or the metagenome database using BLAST with default settings, except that matches were required to have at least 97% identity over the full spacer length (no more than 1 bp mismatch over a 30 bp sequence), with all except 10 of the 194 resulting hits having 100% identity over their full length. The taxonomic affinities of the sources of CRISPR spacer sequences were determined using the identities of the adjacent CRISPR repeat sequences. Contigs in the metagenome assembly that represented the invading DNA that prompted the generation of the spacers were manually annotated using BLASTP searches. Leader sequences of CRISPR loci were identified as described previously (Li et al., 2013). Cas protein sequences (including Csh and Csc) were identified from the Deep Lake metagenome using BLAST with the Hrr. lacusprofundi type I-B and type I-D Cas proteins as queries, and the requirement that matches cover at least 90% of the length of the query protein.
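The spacer-matching criterion above (at least 97% identity over the full spacer length) could be applied to BLAST hits along the following lines. This is a minimal sketch, not the procedure used in the thesis: it assumes hits were exported in the standard 12-column BLAST tabular format (outfmt 6), and the function name `full_length_hits`, the spacer and contig names and all hit values are hypothetical.

```python
import csv
import io

MIN_IDENTITY = 97.0  # percent identity threshold stated in the text


def full_length_hits(blast_tsv, spacer_lengths, min_identity=MIN_IDENTITY):
    """Keep ungapped hits that span the whole spacer at >= min_identity."""
    kept = []
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        # outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
        #                   qstart qend sstart send evalue bitscore
        qseqid, sseqid = row[0], row[1]
        pident, gapopen = float(row[2]), int(row[5])
        qstart, qend = int(row[6]), int(row[7])
        spans_spacer = qstart == 1 and qend == spacer_lengths[qseqid]
        if spans_spacer and gapopen == 0 and pident >= min_identity:
            kept.append((qseqid, sseqid, pident))
    return kept


# Invented example hits (hypothetical spacer and contig identifiers).
rows = [
    ["spacer_1", "contigA", "100.00", "36", "0", "0", "1", "36", "501", "536", "1e-12", "67.6"],
    ["spacer_1", "contigB", "97.22", "36", "1", "0", "1", "36", "88", "123", "2e-10", "62.1"],
    ["spacer_1", "contigC", "100.00", "30", "0", "0", "1", "30", "10", "39", "5e-09", "56.5"],
    ["spacer_2", "contigD", "94.12", "34", "2", "0", "1", "34", "7", "40", "3e-07", "52.8"],
]
blast_tsv = "\n".join("\t".join(r) for r in rows)
kept = full_length_hits(blast_tsv, {"spacer_1": 36, "spacer_2": 34})
print(kept)
```

Spacer lengths are passed in separately because the default tabular columns do not include the query length; the partial hit (contigC) and the low-identity hit (contigD) are rejected.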

3.3.1.3 Protein sequence alignments

Multiple sequence alignments were created using Clustal Omega (McWilliam et al., 2013) with output files loaded onto the ENDscript server (Robert and Gouet, 2014) and alignments generated using default settings. Alignments of two protein sequences were created using the EMBOSS Needle protein alignment tool from the EMBL-EBI analysis tool web services (Li et al., 2015).
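Sequence identity from a pairwise alignment can be computed as sketched below. This is an illustration only, assuming that identity is counted as identical positions divided by the alignment length with gaps included; the function name `percent_identity` and the toy sequences are hypothetical and are not taken from the thesis data.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    # A gap ("-") never counts as a match, even against another gap.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Toy alignment of seven columns with one gap and one substitution.
identity = percent_identity("MKT-LLA", "MKTALIA")
print(round(identity, 1))
```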

3.4 Results

Metaproteomics was performed on 15 Deep Lake samples, representing three distinct size fractions (0.1–0.8, 0.8–3.0 and 3.0–20.0 µm) from five different depths (surface [0 m], 5 m, 13 m, 24 m and 36 m). The combined metaproteome of Deep Lake comprised 1109 proteins, with 902 of them assigned to the three main species Hht. litchfieldiae, DL31 and Hrr. lacusprofundi. An additional seven proteins were assigned to the low-abundance fourth isolate species DL1. The remaining proteins matched to the taxonomic categories Other Halobacteriaceae, Other Archaea, Bacteria, Viruses and Dunaliella (see Chapter 2). The majority of the proteins assigned to the four isolate species matched with 100% sequence identity. However, a proportion of proteins (170) had their best match to the isolate species but with < 100% sequence identity. These proteins were encoded on metagenome contigs, which represent genomic variants (phylotypes) of the isolate species present within the populations in Deep Lake. In this thesis the term ‘variant’ is used to describe proteins that have their best match to the isolate species but with < 100% sequence identity. All variants required at least one detected peptide covering a region with sequence variation. The percentage sequence identity between the variant and the isolate species is used as a measure of the extent of variation; the lower the sequence identity, the higher the extent of variation. Cell surface proteins were found to be particularly prone to sequence variation. Annotation as a cell surface protein was based on the presence of an N-terminal secretion signal peptide and/or a single C-terminal transmembrane helix and/or homology to experimentally characterized cell surface proteins, including S-layer proteins, archaella and adhesion pili.
Cell surface proteins are of particular importance in host-virus interactions, since they represent the structures with which a virus makes first contact with a potential host (Quemin and Quax, 2015). A total of 68 cell surface proteins were detected for the four isolate species. Even though these 68 cell surface proteins accounted for only 7% of the total proteins assigned to the isolate species, they recruited a relatively high ~22% of detected spectra, reflecting that some of these proteins were present in high copy number (Table 3.1).

3.4.1 Cell surface protein variation

Twenty-one out of the 68 cell surface proteins were variants. The sequence identity of the cell surface variants ranged from 29 to 77%. All detected Hrr. lacusprofundi variants (three) and two out of four DL31 variants were cell surface proteins (Table 3.1). For DL31 they constituted 83% of variant spectra. For Hht. litchfieldiae, 16 out of 163 variants (~10%) were cell surface proteins, accounting for ~25% of variant spectra. Cell surface proteins accounted for a particularly high proportion of variants with a high degree of variation (Figure 3.1): 14 out of the 15 variants with < 60% sequence identity were cell surface proteins. Some of these highly divergent variants were of very high abundance; details of all the cell surface variants are given in Table 3.2.


Table 3.1 Cell surface proteins detected for the four Deep Lake isolate species.

                                All detected proteins  Detected cell surface proteins
                                Number of   Spectrum   Number of   Proportion of all   Spectrum   Proportion of all
                                proteins    count      proteins    proteins in %       count      spectra in %
Hht. litchfieldiae              492         11733      26          5                   2162       18
Hht. litchfieldiae variants     163         4353       16          10                  1066       24
DL31                            150         2362       10          7                   113        5
DL31 variants                   4           905        2           50                  747        83
Hrr. lacusprofundi              90          1328       8           9                   189        14
Hrr. lacusprofundi variants     3           300        3           100                 300        100
DL1                             7           138        3           43                  104        75
DL1 variants                    0           0          0           -                   0          -
Total                           909         21119      68          7                   4681       22
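The totals and proportions in Table 3.1 can be re-derived from the per-species counts. The short Python sketch below does this; the dictionary layout and variable names are illustrative assumptions, while the numbers are those given in the table.

```python
# Per row of Table 3.1:
# (all proteins, all spectra, cell surface proteins, cell surface spectra)
table_3_1 = {
    "Hht. litchfieldiae": (492, 11733, 26, 2162),
    "Hht. litchfieldiae variants": (163, 4353, 16, 1066),
    "DL31": (150, 2362, 10, 113),
    "DL31 variants": (4, 905, 2, 747),
    "Hrr. lacusprofundi": (90, 1328, 8, 189),
    "Hrr. lacusprofundi variants": (3, 300, 3, 300),
    "DL1": (7, 138, 3, 104),
    "DL1 variants": (0, 0, 0, 0),
}

# Column-wise totals across all rows
totals = tuple(sum(row[k] for row in table_3_1.values()) for k in range(4))
all_proteins, all_spectra, surface_proteins, surface_spectra = totals

# Share of cell surface proteins among proteins and among spectra
pct_proteins = 100.0 * surface_proteins / all_proteins
pct_spectra = 100.0 * surface_spectra / all_spectra

if __name__ == "__main__":
    print(totals)
    print(round(pct_proteins), round(pct_spectra))
```

The contrast between the two percentages is what indicates that cell surface proteins are over-represented among the detected spectra relative to their share of protein identifications.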

Figure 3.1 Proportion of cell surface protein variants. All detected protein variants from Hht. litchfieldiae, DL31 and Hrr. lacusprofundi are shown categorized into cell surface (dark grey) and all other functions (light grey), and grouped according to their identity relative to the best matching sequence in their respective genome. The total number of variant proteins was 170, and for each range of identities the number of variants was: 90 – 99%, 105 proteins; 80 – 89%, 33; 70 – 79%, 14; 60 – 69%, 3; 50 – 59%, 5; 40 – 49%, 6; 30 – 39%, 3; 20 – 29%, 1. Variants with high levels of variation tended to be cell surface proteins.
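The grouping used in Figure 3.1 can be expressed as a simple binning of percentage identities. The following Python sketch is a hypothetical helper, not the procedure used to produce the figure; it assigns an identity value to its 10% range and checks that the per-range counts quoted in the caption add up to the 170 variants, 15 of which fall below 60% identity.

```python
def identity_bin(identity_percent):
    """Return the 10% identity range (e.g. '40-49') for a variant protein."""
    if not 0 <= identity_percent < 100:
        raise ValueError("variants have a sequence identity below 100%")
    lower = int(identity_percent // 10) * 10
    return f"{lower}-{lower + 9}"


# Number of variants per identity range, as listed in the Figure 3.1 caption
variants_per_bin = {
    "90-99": 105, "80-89": 33, "70-79": 14, "60-69": 3,
    "50-59": 5, "40-49": 6, "30-39": 3, "20-29": 1,
}

total_variants = sum(variants_per_bin.values())
# Variants whose identity range lies entirely below 60%
below_60 = sum(count for label, count in variants_per_bin.items()
               if int(label.split("-")[1]) < 60)

if __name__ == "__main__":
    print(identity_bin(47), total_variants, below_60)
```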

3.4.1.1 Surface (S)-layer proteins

The S-layer is a crystalline protein structure encapsulating the cytoplasmic membrane of most archaeal and also some bacterial species (Sara and Sleytr, 2000; Albers and Meyer, 2011). Usually the S-layer is composed of multiple copies of a single glycoprotein. The glycoproteins are anchored to the cytoplasmic membrane either via a carboxy-terminal transmembrane domain or a lipid anchor (Sara and Sleytr, 2000; Albers and Meyer, 2011; Kandiba and Eichler, 2014). The genomes of Hht. litchfieldiae, DL31 and DL1 each encode one S-layer protein on their main replicon (halTADL_1043/Halar_0829/HalDL1_0395), while Hrr. lacusprofundi is conspicuous in encoding two (Hlac_0412 and Hlac_2976). Hlac_0412 is located on the primary replicon while Hlac_2976 is located on a secondary replicon. All these S-layer proteins share a common architecture with an N-terminal surface glycoprotein signal peptide (around 35 amino acids (AA) in length), a conserved C-terminal PGF-CTERM archaeal protein-sorting signal (around 25 AA) and a large (around 750-1000 AA) extracytoplasmic domain in between. The metaproteome contained five distinct variants of the Hht. litchfieldiae S-layer protein halTADL_1043 (Table 3.2). They shared 34-51% sequence identity with halTADL_1043 and were all identified through distinct sets of peptides (Table 3.2). Two variants were detected matching the DL31 S-layer protein Halar_0829 with 45-47% sequence identity, including the second most abundant protein in the metaproteome. One variant was identified matching the Hrr. lacusprofundi S-layer protein Hlac_2976 with 38% sequence identity. The 100% matching S-layer proteins were also identified for Hht. litchfieldiae and DL31 and found to be of lower abundance than the most abundant variants. For the second S-layer protein of Hrr. lacusprofundi, Hlac_0412, the 100% matching protein was detected but no variant. For DL1, the 100% matching protein was detected but no S-layer variant; this could be due to the very low abundance of this species (DeMaere et al., 2013), which would not allow assembly of the variant gene during metagenomic sequencing. Multiple sequence alignments of detected DL31 S-layer proteins revealed that variation mainly occurred in the extracytoplasmic part of the proteins while the C-terminal end was more conserved (Figure 3.2).


Figure 3.2 Multiple sequence alignments of DL31 S-layer proteins. S-layer protein sequence from DL31 genome (Halar_0829) aligned with variants present on Deep Lake metagenome contigs (Halar_0829_variant1/2). The alignment illustrates the high conservation of the C-terminal transmembrane domain and a high degree of variation in the preceding extracytoplasmic part. All proteins were detected in the metaproteome. Identical residues (white characters on red background) and similar residues (red characters on white background, similarity score > 0.6) are framed in blue.

Mapping of metagenomic sequencing reads to genomes of isolate species and calculation of the FR coverage can reveal genomic regions that are prone to variation within populations; this has previously been used to identify phylotype variation for the Deep Lake haloarchaea (DeMaere et al., 2013). FR coverage for the S-layer protein genes of the isolate species was found to be low. For Hht. litchfieldiae, the average FR coverage of predicted open reading frames (ORFs) was 460 RPKM (reads per kilobase per 1,000,000 recruited reads), while the coverage for halTADL_1043 was only 106 RPKM (Figure 3.3). FR coverage for the conserved C-terminus alone (AA 808-end) was similar to the replicon average at 518 RPKM, while the coverage for the rest of the protein, including the extracytoplasmic region with high variation (AA 1-807), was low at only 17 RPKM. Low FR coverage coincided with a lower GC-content for the same region. Average GC-content for the Hht. litchfieldiae replicon was 60% compared to only 55% for halTADL_1043, with only 54% from AA 1-807 compared to 59% from AA 808-end. The region of low FR coverage and GC-content already started in the middle of the neighbouring gene halTADL_1042, a predicted cell surface protein detected in the metaproteome. Low FR coverage was also observed for the DL31 and Hrr. lacusprofundi S-layer genes (Halar_0829 and Hlac_2976) with 141 and 121 RPKM respectively, compared to an average coverage of 474 and 3114 RPKM for the encoding replicons (Table 3.2). Unlike for halTADL_1043, the GC-content of Halar_0829 and Hlac_2976 (63 and 59% respectively) was similar to the average of the encoding replicons (65 and 59% respectively). The second Hrr. lacusprofundi S-layer gene Hlac_0412 also had low coverage with 67 RPKM compared to an average of 534 RPKM for the encoding primary replicon, indicative of variation in that locus even though no variant could be detected. GC-content for Hlac_0412 was considerably lower with 60% compared to the average of 68% for the replicon.
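The two quantities compared above, FR coverage in RPKM and GC-content, follow directly from their definitions. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the read counts in the usage example are invented for demonstration only, and the region length is taken in base pairs.

```python
def rpkm(reads_mapped, region_length_bp, total_recruited_reads):
    """Reads per kilobase of region per 1,000,000 recruited reads."""
    kilobases = region_length_bp / 1_000
    millions_of_reads = total_recruited_reads / 1_000_000
    return reads_mapped / kilobases / millions_of_reads


def gc_content(sequence):
    """Percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


if __name__ == "__main__":
    # Hypothetical numbers: 100 reads on a 2 kb gene out of 1,000,000 recruited reads
    print(rpkm(100, 2_000, 1_000_000))
    # Fold difference between the replicon average and halTADL_1043 (values from the text)
    print(460 / 106)
    print(gc_content("ATGGCC"))
```

Because RPKM normalises for both region length and sequencing depth, values for a single gene, a sub-region of a gene and a whole replicon can be compared directly, as is done for the N- and C-terminal parts of halTADL_1043.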

Figure 3.3 Low coverage for Hht. litchfieldiae S-layer gene. The figure highlights the low FR coverage of the S-layer gene halTADL_1043. FR coverage was obtained from Deep Lake metagenomic sequencing reads mapped against the Hht. litchfieldiae genome (DeMaere et al., 2013) viewed using Artemis (Carver et al., 2012). Numbers represent locus tags. Highlighted in purple are genes with detected protein product in the metaproteome.

Inspection of metagenomic contigs encoding for variant S-layer proteins revealed further variation in the form of gene re-arrangement. In the genome of Hrr. lacusprofundi the S-layer gene Hlac_2976, encoded on a secondary replicon, is located in a region next to two transposase genes and a HIR. However, on the encoding metagenome contig the gene for the detected Hlac_2976 variant is adjacent to a gene with similarity to BJ1 virus (Figure 3.6A).


Table 3.2 Characteristics of cell surface proteins. The table lists all the cell surface proteins from the isolate species that are discussed in the text. Each entry is headed by the matching locus tag. ID, identification number: protein numbers are according to spectrum count rank within the complete Deep Lake metagenome (Appendix A). ‘Sequence identity’ shows the percentage amino acid sequence identity of the detected protein to the locus tag. ‘FR coverage’ displays the coverage of the locus tags in RPKM values. ‘GC’ shows the GC-content of the encoding genes in percent. ‘Spectrum count’ shows the sum of the normalized total spectrum count across all 15 samples.

Hht. litchfieldiae single replicon (FR coverage: 460 RPKM; GC: 60%)

S-layer proteins

halTADL_1043
  ID: 156; Sequence identity: 100%; Protein length: 984 AA; FR coverage: 105 RPKM; GC: 55%; Spectrum count: 42
  Detected peptides: SNVDDTTDYQVR; ELDTTANDGEGSIEGLVEEVSIDSDDR; TLGTAVDTER
  Domains:
    AA 1 - 37: Surface glycoprotein signal peptide
    AA 37 - 961: Non-cytoplasmic domain
    AA 956 - 983: PGF-CTERM archaeal protein-sorting signal

halTADL_1043 variant 1
  ID: 8; Sequence identity: 51%; Protein length: 933 AA; Spectrum count: 338
  Detected peptides: GDTTELDVETNR; GSSDVVVHADGDLDDEELTQIFETEGDLSVR; EEIALDDRl; SLSDGDEFSATLEYSTDSDFER; VSSTDASISFR
  Domains:
    AA 7 - 36: Surface glycoprotein signal peptide
    AA 30 - 910: Non-cytoplasmic domain
    AA 905 - 932: PGF-CTERM archaeal protein-sorting signal

halTADL_1043 variant 2
  ID: 18; Sequence identity: 44%; Protein length: 805 AA; Spectrum count: 205
  Detected peptides: ADDEDESMTVQVNTR; TLGTGASLGQVYDEGDNVDTIESK; YEFADTADPAPFDNTK; VSSTDASSTFR
  Domains:
    N- and C-termini are missing, likely due to a lack of assembly

halTADL_1043 variant 3
  ID: 339; Sequence identity: 50%; Protein length: 447 AA; Spectrum count: 16
  Detected peptides: EQTLDVAFDDDEVSTGDESELDLETNR
  Domains:
    AA 9 - 38: Surface glycoprotein signal peptide
    AA 37 - 447: Non-cytoplasmic domain
    C-terminus is missing, likely due to a lack of assembly

halTADL_1043 variant 4
  ID: 356; Sequence identity: 42%; Protein length: 842 AA; Spectrum count: 15
  Detected peptides: GDDAYLVEVQR; DEIAIEDR
  Domains:
    AA 1 - 26: Surface glycoprotein signal peptide
    AA 25 - 819: Non-cytoplasmic domain
    AA 814 - 841: PGF-CTERM archaeal protein-sorting signal

halTADL_1043 variant 5
  ID: 618; Sequence identity: 34%; Protein length: 348 AA; Spectrum count: 5
  Detected peptides: SGVTTFDTGAWDVAIYGENADDYENK
  Domains:
    N- and C-termini are missing, likely due to a lack of assembly

Other cell surface proteins

halTADL_1047 variant
  ID: 800; Sequence identity: 59%; Protein length: 228 AA; FR coverage: 139 RPKM; GC: 62%; Spectrum count: 2
  Detected peptides: LSGADLIAR
  Domains:
    AA 1 - 23: Signal peptide and Prokar_Lipoprotein site
    AA 24 - 228: Non-cytoplasmic domain

halTADL_0878
  ID: 274; Sequence identity: 100%; Protein length: 249 AA; FR coverage: 439 RPKM; GC: 58%; Spectrum count: 24
  Detected peptides: LLDSIEVGVDFDPNA
  Domains:
    AA 1 - 24: Signal peptide
    AA 25 - 249: Non-cytoplasmic domain

halTADL_0878 variant
  ID: 490; Sequence identity: 50%; Protein length: 225 AA; Spectrum count: 8
  Detected peptides: IGLAINLDQDIK
  Domains:
    AA 1 - 24: Signal peptide
    AA 25 - 255: Non-cytoplasmic domain

halTADL_1761
  ID: 170; Sequence identity: 100%; Protein length: 181 AA; FR coverage: 464 RPKM; GC: 61%; Spectrum count: 39
  Detected peptides: VDQQALQQK; QIQSGNMTQEEAR
  Domains:
    AA 1 - 34: Signal peptide
    All amino acids are predicted to be outside of the cytoplasm

halTADL_1761 variant
  ID: 370; Sequence identity: 40%; Protein length: 150 AA; Spectrum count: 14
  Detected peptides: QVTVSIQPDQETLQAR
  Domains:
    No domains detected since the N-terminus is missing, likely due to a lack of assembly

halTADL_1765
  ID: 313; Sequence identity: 100%; Protein length: 341 AA; FR coverage: 458 RPKM; GC: 63%; Spectrum count: 18
  Detected peptides: AIDSESNLVETDR; VTVPVGDIR; QAESNIVADR
  Domains:
    AA 1 - 22: Signal peptide
    AA 23 - 318: Non-cytoplasmic domain
    AA 313 - 339: PGF-CTERM archaeal protein-sorting signal

halTADL_1765 variant
  ID: 361; Sequence identity: 76%; Protein length: 333 AA; Spectrum count: 14
  Detected peptides: QAESNVVADR; ATDSESNLVATER; VTVPVGDIR
  Domains:
    AA 1 - 22: Signal peptide
    AA 23 - 310: Non-cytoplasmic domain
    AA 305 - 331: PGF-CTERM archaeal protein-sorting signal

halTADL_1403 variant
  ID: 126; Sequence identity: 29%; Protein length: 1849 AA; FR coverage: 118 RPKM; GC: 60%; Spectrum count: 49
  Detected peptides: VLDQVDESLEGGDVR; TSATTTTSLSVNR
  Domains:
    AA 1 - 1829: Non-cytoplasmic domain
    AA 83 - 185: Bacterial Ig-like, group 1 domains; Invasin/intimin cell-adhesion (AA 101 - 178)
    AA 189 - 279: Bacterial Ig-like, group 1 domains
    AA 604 - 684: Bacterial Ig-like, group 1 domains
    AA 700 - 783: Bacterial Ig-like, group 1 domains
    AA 908 - 964: Bacterial Ig-like, group 1 domains; Invasin/intimin cell-adhesion
    AA 1004 - 1086: Bacterial Ig-like, group 1 domains; Invasin/intimin cell-adhesion (AA 1005 - 1085)
    AA 1089 - 1187: Bacterial Ig-like, group 1 domains; Invasin/intimin cell-adhesion (AA 1113 - 1186)
    AA 1203 - 1288: Bacterial Ig-like, group 1 domains
    AA 1392 - 1482: Bacterial Ig-like, group 1 domains
    AA 1648 - 1716: PGF-pre-PGF domain
    AA 1830 - 1846: Transmembrane domain

halTADL_0751 variant
  ID: 750; Sequence identity: 65%; Protein length: 446 AA; FR coverage: 182 RPKM; GC: 49%; Spectrum count: 3
  Detected peptides: ITALQNQFQETGTGR
  Domains:
    AA 1 - 44: Signal peptide
    AA 1 - 19: Cytoplasmic domain
    AA 20 - 40: Transmembrane helix
    AA 41 - 446: Non-cytoplasmic domain

Adhesion proteins

halTADL_1387
  ID: 176; Sequence identity: 100%; Protein length: 156 AA; FR coverage: 407 RPKM; GC: 49%; Spectrum count: 37
  Detected peptides: VVWTNPSGGSSNTLAER
  Domains:
    AA 1 - 19: Non-cytoplasmic domain
    AA 12 - 34: Flagellin/pilin, N-terminal
    AA 12 - 84: Pilin/Protein of unknown function DUF1628, archaeal
    AA 20 - 42: Transmembrane helix
    AA 42 - 156: Cytoplasmic domain

halTADL_1387 variant
  ID: 666; Sequence identity: 66%; Protein length: 130 AA; Spectrum count: 4
  Detected peptides: EIPYQSDDTVR; VIWENPAGGSSNVLAER
  Domains:
    AA 1 - 19: Non-cytoplasmic domain
    AA 20 - 41: Transmembrane helix
    AA 11 - 34: Flagellin/pilin, N-terminal
    AA 12 - 81: Pilin/Protein of unknown function DUF1628, archaeal
    AA 42 - 130: Cytoplasmic domain

Archaellins

halTADL_1812
  ID: 3; Sequence identity: 100%; Protein length: 188 AA; FR coverage: 368 RPKM; GC: 55%; Spectrum count: 497
  Detected peptides: LQIQSATGEINSTR; VITNVLSESQITNISAETNDNVITDR; SDRYQLNFDASEDLDGDLQGGEK; YQLNFDASEDLDGDLQGGEK; ISVTLTTAPGASTVTELR; ISVTLTTAPGASTVTELRVPDSLVDR; VPDSLVDR
  Domains:
    AA 1 - 11: Non-cytoplasmic domain
    AA 11 - 33: Flagellin/pilin, N-terminal
    AA 12 - 173: Arch_flagellin
    AA 12 - 37: Transmembrane helix
    AA 38 - 188: Cytoplasmic domain

halTADL_1812 variant
  ID: 59; Sequence identity: 77%; Protein length: 188 AA; Spectrum count: 94
  Detected peptides: VSVTLTTAAGASTVTELR; VPDSLVSRSAVSL; VPDSLVSR
  Domains:
    AA 1 - 11: Non-cytoplasmic domain
    AA 6 - 29: Flagellin/pilin, N-terminal
    AA 8 - 173: Arch_flagellin
    AA 12 - 33: Transmembrane helix
    AA 34 - 188: Cytoplasmic domain

halTADL_1813
  ID: 7; Sequence identity: 100%; Protein length: 181 AA; FR coverage: 562 RPKM; GC: 55%; Spectrum count: 351
  Detected peptides: IAVTLTTAPGASTVTELR; IAVTLTTAPGASTVTELRVPDSLVDR; VPDSLVDR
  Domains:
    AA 1 - 11: Non-cytoplasmic domain
    AA 6 - 29: Flagellin/pilin, N-terminal
    AA 12 - 33: Transmembrane helix
    AA 34 - 181: Cytoplasmic domain

halTADL_1813 variant
  ID: 12; Sequence identity: 76%; Protein length: 188 AA; Spectrum count: 239
  Detected peptides: YILQFQTDTAMGAYLQGGDR; VSVTLTTAAGASTVTELR; VPDSLVDR
  Domains:
    AA 1 - 11: Non-cytoplasmic domain
    AA 6 - 29: Flagellin/pilin, N-terminal
    AA 12 - 33: Transmembrane helix
    AA 34 - 188: Cytoplasmic domain

halTADL_0078
  ID: 118; Sequence identity: 100%; Protein length: 189 AA; FR coverage: 446 RPKM; GC: 53%; Spectrum count: 52
  Detected peptides: IQIQSATGAVVYNADGK; ISNITAQTNDNVITDR; VPDSLIGK
  Domains:
    AA 1 - 11: Cytoplasmic domain
    AA 6 - 29: Flagellin/pilin, N-terminal
    AA 8 - 174: Arch_flagellin
    AA 12 - 33: Transmembrane helix
    AA 34 - 189: Non-cytoplasmic domain

halTADL_0078 variant
  ID: 95; Sequence identity: 75%; Protein length: 184 AA; Spectrum count: 59
  Detected peptides: IEFNATNAIGGK; VTLTLTTAPGASTVTELR; VPDSLVSK; VPDSLVSKSAVSL
  Domains:
    AA 1 - 11: Cytoplasmic domain
    AA 5 - 28: Flagellin/pilin, N-terminal
    AA 7 - 169: Arch_flagellin
    AA 12 - 32: Transmembrane helix
    AA 33 - 184: Non-cytoplasmic domain

halTADL_1544
  ID: 1; Sequence identity: 100%; Protein length: 200 AA; FR coverage: 789 RPKM; GC: 51%; Spectrum count: 695
  Detected peptides: VTLTTAAGASTVAELRVPDSLVDK; VTLTTAAGASTVAELR; VPDSLVDK; VPDSLVDKSAVSL
  Domains:
    AA 1 - 11: Cytoplasmic domain
    AA 10 - 33: Flagellin/pilin, N-terminal
    AA 12 - 37: Transmembrane
    AA 12 - 183: Arch_flagellin
    AA 38 - 200: Non-cytoplasmic domain

halTADL_1544 variant
  ID: 786; Sequence identity: 75%; Protein length: 200 AA; Spectrum count: 2
  Detected peptides: NILDSGLTGGDSVQVTLTTAPGASTVTELR; VPDSLVSR
  Domains:
    AA 1 - 11: Cytoplasmic domain
    AA 10 - 33: Flagellin/pilin, N-terminal
    AA 12 - 37: Transmembrane
    AA 12 - 177: Arch_flagellin
    AA 38 - 200: Non-cytoplasmic domain

DL31 replicon 1 (FR coverage: 474 RPKM; GC: 65%)

S-layer proteins

Halar_0829
  ID: 106; Sequence identity: 100%; Protein length: 835 AA; FR coverage: 141 RPKM; GC: 63%; Spectrum count: 55.3
  Detected peptides: RGGSESSTFVR; GGSESSTFVR; ARSTGDTQPR; STGDTQPR
  Domains:
    AA 7 - 36: Surface glycoprotein signal peptide
    AA 34 - 813: Non-cytoplasmic domain
    AA 505 - 517: EF_HAND_1, calcium-binding site (PS00018)
    AA 808 - 813: PGF-CTERM archaeal sorting signal

Halar_0829 variant 1
  ID: 2; Sequence identity: 47%; Protein length: 675 AA; Spectrum count: 629
  Detected peptides: RNTSDSSSLVVQQK; NTSDSSSLVVQQK; TAGRPTGDYFLTTGTGSTYK; FELATQEISADFAADTVDNGGSATTVDFEAESNRASYEHVLEATYEGDAVAASDLQDILGGAGTVK; DEPQDLGALVIGER; TSGLVLEDTPK; TSGLVLEDTPKEIK; ATFTVANDR
  Domains:
    AA 1 - 653: Non-cytoplasmic domain
    AA 648 - 674: PGF-CTERM archaeal sorting signal; also described as transmembrane region
    N-terminus is missing, likely due to a lack of assembly

Halar_0829 variant 2
  ID: 36; Sequence identity: 45%; Protein length: 826 AA; Spectrum count: 118
  Detected peptides: RVTDSGTSFVR; VTDSGTSFVR; QISADADGEVVASTSGLSAGDYR; NDWVIHQVSATGLGGLVDAK; TVAADEEYTATFTVDDSK; SAGDTSPR
  Domains:
    AA 7 - 35: Surface glycoprotein signal peptide
    AA 34 - 804: Non-cytoplasmic domain
    AA 799 - 825: PGF-CTERM archaeal sorting signal

Hrr. lacusprofundi replicon 2 (FR coverage: 3114 RPKM; GC: 58%)

S-layer proteins

Hlac_2976
  ID: 303; Sequence identity: 100%; Protein length: 880 AA; FR coverage: 122 RPKM; GC: 59%; Spectrum count: 19
  Detected peptides: ASAGPNAQR; SASGTQPSFVK
  Domains:
    AA 10 - 36: Surface glycoprotein signal peptide
    AA 37 - 857: Non-cytoplasmic domain
    AA 852 - 879: PGF-CTERM archaeal protein-sorting signal

Hlac_2976 variant
  ID: 223; Sequence identity: 38%; Protein length: 956 AA; FR coverage: 2090 RPKM; GC: 60%; Spectrum count: 30
  Detected peptides: EDGSSYEAFR; LRSTDDTSPR; STDDTSPR
  Domains:
    AA 10 - 37: Surface glycoprotein signal peptide
    AA 37 - 933: Non-cytoplasmic domain
    AA 928 - 955: PGF-CTERM archaeal protein-sorting signal

Hrr. lacusprofundi replicon 1 (FR coverage: 452 RPKM; GC: 68%)

S-layer protein

Hlac_0412
  ID: 52; Sequence identity: 100%; Protein length: 1063 AA; FR coverage: 67 RPKM; GC: 60%; Spectrum count: 101
  Detected peptides: TISIETDR; SLTVSGDR; LRGEAEYVLR; GEAEYVLR; SASGTSPSFIETSEGVR
  Domains:
    AA 10 - 37: Surface glycoprotein signal peptide
    AA 31 - 1040: Non-cytoplasmic domain
    AA 1036 - 1062: PGF-CTERM archaeal protein-sorting signal

Archaellin

Hlac_2557 variant
  ID: 30; Sequence identity: 43%; Protein length: 206 AA; FR coverage: 2090 RPKM; GC: 60%; Spectrum count: 133
  Detected peptides: IDVTSEVGIVGSEANGELSEIR; LSVSGAPGADQIDLSETTIQAVGPGGQQNLVFTDR
  Domains:
    AA 1 - 11: Cytoplasmic domain
    AA 10 - 33: Flagellin/pilin, N-terminal
    AA 12 - 155: Flagellin, archaea
    AA 15 - 37: Transmembrane
    AA 38 - 212: Non-cytoplasmic domain

Other cell surface proteins

Hlac_0476
  ID: 365; Sequence identity: 100%; Protein length: 193 AA; FR coverage: 153 RPKM; GC: 59%; Spectrum count: 14
  Detected peptides: SVTIDTADDAYAFLR; FGEDAAHDEAGLFAIR
  Domains:
    AA 1 - 32: TAT signal sequence
    AA 21 - 193: Non-cytoplasmic domain

Hlac_0476 variant
  ID: 27; Sequence identity: 49%; Protein length: 108 AA; Spectrum count: 138
  Detected peptides: TLSIDTAGDADAFLSITK
  Domains:
    AA 1 - 32: TAT signal sequence
    AA 21 - 108: Non-cytoplasmic domain
    C-terminus is missing, likely due to a lack of assembly

DL1 primary replicon (HalDL1_Contig38) (FR coverage: 433 RPKM; GC: 68%)

S-layer protein

DL1_0395
  ID: 115; Sequence identity: 100%; Protein length: 834 AA; FR coverage: 15 RPKM; GC: 63%; Spectrum count: 53
  Detected peptides: TGTDVTDTANDISR
  Domains:
    AA 7 - 35: Surface glycoprotein signal peptide
    AA 37 - 811: Non-cytoplasmic domain
    AA 809 - 833: PGF-CTERM archaeal protein-sorting signal

3.4.1.2 Archaella proteins

Archaellins represent the structural subunits of the archaeal motility structures, archaella (Jarrell and Albers, 2012) (see 2.4.5 and 2.5.4). Similar to S-layer proteins, archaellins are amongst the proteins synthesised in the highest copy numbers in archaeal cells (Jarrell et al., 2010), which was reflected in their high abundance in the Deep Lake metaproteome (Table 3.2). Unlike DL31, which does not encode archaellins in its genome, Hht. litchfieldiae, Hrr. lacusprofundi and DL1 all have the genomic capacity for archaella synthesis (Williams et al., 2014). The genome of Hht. litchfieldiae contains a total of seven archaellin genes (halTADL_0078/1543/1544/1810/1811/1812/1813) and the 100% matching proteins were detected for all except halTADL_1543. In addition, variant proteins were detected matching halTADL_0078/1544/1812/1813 with 75-77% sequence identity. Multiple sequence alignment of all the archaellin variants together with the 100% matching proteins highlighted that sequence variation mainly occurred within distinct regions: at the very N-terminus and around AA 70, AA 120 and AA 150; the regions in between are conserved (Figure 3.4). The FR coverage of Hht. litchfieldiae archaellin genes with detected variants was average (368-789 RPKM) while GC-content was low with 51-55% (Table 3.2).


Figure 3.4 Alignment of Hht. litchfieldiae archaellin proteins. Archaellin protein sequences from the genome of Hht. litchfieldiae were aligned with sequences of variants present on Deep Lake metagenome contigs. All proteins were detected in the metaproteome. Identical residues (white characters on red background) and similar residues (red characters on white background, similarity score > 0.6) are framed in blue.

Hrr. lacusprofundi encodes one archaellin gene (Hlac_2557) and a variant with a high degree of variation (43% sequence identity) was detected. The 100% matching protein was not present in the metaproteome. Alignment of the variant with Hlac_2557 revealed that the 44 N-terminal AA of the variant were 100% conserved with Hlac_2557, followed by a region of medium sequence similarity until AA 118. Further downstream there are large gaps interrupting the alignment (Figure 3.5). Although FR coverage of Hlac_2557 was overall very high with 2090 RPKM, there was a coverage peak of 4254 RPKM for the region that aligned well with the variant (AA 1-118), followed by a drop in coverage to 197 RPKM for the differing C-terminal region (AA 119-end). The 60% GC-content of Hlac_2557 was lower than the replicon average (68%), with the N-terminal part (AA 1-118) slightly higher in GC (61%) compared to the C-terminal part (59%; AA 119-end). On the encoding metagenomic contig an additional archaellin gene was located next to the variant gene, matching Hlac_2557 with a comparably high 98% sequence identity. Other genes on the contig matched Hrr. lacusprofundi with 100% identity and conserved gene synteny, indicating that variation was restricted to the archaellin gene. The metaproteome contained a further archaellin protein that matched Hlac_2557 with 77% identity. However, other genes on the same contig had higher matches to other Halorubrum species than to Hrr. lacusprofundi. The contig might therefore be derived from an additional, low abundance Halorubrum species. Despite the low abundance of DL1 in the lake, two archaellin proteins were detected for this species, matching with 100% sequence identity to HalDL1_1517 and HalDL1_1518 respectively; no archaellin variants were detected for DL1.


variant      1 MFETILNEEERGQVGIGTLIVFIAMVLVAAIAAGVLINTAGFLQTQAEAT 50
               ||||||||||||||||||||||||||||||||||||||||||||:|||||
Hlac_2557    1 MFETILNEEERGQVGIGTLIVFIAMVLVAAIAAGVLINTAGFLQSQAEAT 50

variant     51 GEESTSQVADRLQIVSQSGNVTDLDIDGNGPERAIDTLQLTVAQSPGAGN 100
               |:|||..|::|:.:.|:.|      |.||.....::::::.|..:.|:..
Hlac_2557   51 GQESTDLVSERIDVTSEVG------IVGNNSTGELESIRVAVTGAAGSDQ 94

variant    101 IDLTEVSVELIGTGGQVN----------------------GELASDNITM 128
               |||:|.:::.:|..||.|                      ..|.:....:
Hlac_2557   95 IDLSETTIQAVGPNGQANLVFTDEAANGTSLVNNESTYNASSLNASEFAV 144

variant    129 FTGDGDEV-----VLTDNSDRAEIVLN--------LNSSGTFG------- 158
               ....||.|     ||.|.:|.. ||||        |.:.||.|
Hlac_2557  145 QDSQGDWVSSGGAVLDDENDYT-IVLNPGAEPFGSLTADGTDGTAVYGGT 193

variant    159 --YNNDASTDGGLQAGDSLSVTFTTASGASTSTELRVPTTLTDEQTSVRL 206
                 |.:..:.:...... |.|:...:.:.|:||.||..|...:::..:|||
Hlac_2557  194 WTYAHQTADEEAFGQSQSSSLEIVSPASATTSLELTSPDLYSEDGEAVRL 243

variant    207 * 207

Hlac_2557  244 - 243

Figure 3.5 Hrr. lacusprofundi archaellin variant alignment. Shown is the alignment of the detected variant protein with the 100% matching Hlac_2557 protein. Identical residues are indicated by a bar (|) symbol while conserved substitutions are indicated by a colon (:); non-conserved residues are indicated by a dot (.).

3.4.1.3 Adhesin and other cell surface proteins

Cell surface variants detected for Hht. litchfieldiae included an adhesion pili protein (PilA). Adhesion pili are cell surface structures that can be used for attachment to foreign surfaces, e.g. during biofilm formation (Esquivel et al., 2013). The variant had 66% sequence identity to halTADL_1387. The 100% match for this protein was also detected. One other protein was detected that had its highest identity to a Hht. litchfieldiae pili protein (33% match to halTADL_1885) but the encoding metagenome contig could not conclusively be assigned to any particular species (see 3.4.3). Six more Hht. litchfieldiae cell surface variants of unknown function were detected matching halTADL_0751/0878/1047/1403/1761/1765 with 29-76% sequence identity. They all contained N-terminal signal sequences and large extra-cytoplasmic domains, except for the halTADL_1403 variant. The 100% matching proteins were also detected for halTADL_0878/1761/1765. Metagenome FR coverage was found to be low for halTADL_0751/1047/1403 with 182/139/118 RPKM, while GC-content was only low for halTADL_0751 with 49% (Table 3.2). For Hrr. lacusprofundi one more cell surface variant was detected matching Hlac_0476 with 49% identity; the 100% matching protein was also detected. No putative function could be assigned to this protein, though it contains an N-terminal signal sequence followed by an extra-cytoplasmic domain. Metagenome FR coverage and GC-content were both found to be low with 153 RPKM and 49% respectively.

3.4.2 HIRs

The presence of HIRs has been described as a unique feature of the genomes of the Deep Lake isolate species, not described for any other haloarchaeal species (DeMaere et al., 2013). The metaproteome contained nine proteins that were encoded in six distinct HIRs, with three of the HIRs each encoding two detected proteins (Table 3.3). All of the HIR encoded proteins matched to Hrr. lacusprofundi and at least one other isolate species; in one case a detected protein matched to all four isolate species (Figure 3.6B). Five of the HIRs were located on Hrr. lacusprofundi’s secondary replicon Hlac_replicon 2, which has previously been identified as a hub for HIRs (DeMaere et al., 2013). While it is not possible to determine which of the species expressed the HIR encoded proteins, their detection is evidence that genes within HIRs are functional. Further inspection revealed that four out of the six HIRs also encoded multiple proteins of suspected viral origin; these included proteins with sequence similarity to viral proteins and/or with domains of potential viral origin (e.g. recombinase of the pfam00589 family) (Table 3.3, Figure 3.6).


Table 3.3 HIR encoded proteins from the metaproteome. Horizontal lines denote borders of HIRs. ID, identification number: protein numbers are according to spectrum count rank within the complete Deep Lake metagenome (Appendix A). ‘Locus tags’ lists all locus tags matching the detected protein, with the organism/replicon in parentheses. ‘Genes in HIR with viral signature’ lists all genes within or adjacent to the respective HIR with putative viral origin. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples.

------------------------------------------------------------
ID 277: Thiazole biosynthetic enzyme Thi1; Spectrum count: 23
  Locus tags: Hlac_2980 (Hrr. lacusprofundi/Hlac_replicon 2); halTADL_1093 (Hht. litchfieldiae single replicon)

ID 394: Carbohydrate ABC transporter substrate-binding protein; Spectrum count: 12
  Locus tags: Hlac_2984 (Hrr. lacusprofundi/Hlac_replicon 2); halTADL_1095 (Hht. litchfieldiae single replicon)
------------------------------------------------------------
ID 200: Hypothetical (transmembrane domains); Spectrum count: 33
  Locus tags: Hlac_3030 (Hrr. lacusprofundi/Hlac_replicon 2); HalDL1_3101 (DL1/HalDL1_contig37: secondary replicon)

ID 203: Hypothetical; Spectrum count: 33
  Locus tags: Hlac_3031 (Hrr. lacusprofundi/Hlac_replicon 2); HalDL1_3100 (DL1/HalDL1_contig37: secondary replicon)

  Genes in HIR with viral signature:
    Hlac_3022 (not present in DL1 because a transposase jumped in between): type III restriction enzyme, helicase domains, which is also associated with Hepatitis C virus NS3 helicases
    Hlac_3023/HalDL1_3109 (99% identical): BLAST match to Mu-like prophage FluMu DNA-binding protein Ner family protein from Eikenella corrodens
    Hlac_3024/DL1_3107: part of superfamily PHA03359 consisting of Herpes virus DNA packaging enzyme tegument protein UL17
    Hlac_3025/DL1_3106: phage integrase (pfam00589)
------------------------------------------------------------
ID 1070: Hypothetical (nucleic acid binding domains); Spectrum count: 0.4
  Locus tags: Hlac_3118 (Hrr. lacusprofundi/Hlac_replicon 2); HalDL1_3302 (DL1/HalDL1_contig37: secondary replicon)

ID 744: Metallophosphoesterase; Spectrum count: 3
  Locus tags: Hlac_3129 (Hrr. lacusprofundi/Hlac_replicon 2); HalDL1_3291 (DL1/HalDL1_contig37: secondary replicon); Halar_0424 (DL31/DL31_replicon 2); halTADL_2019 (Hht. litchfieldiae single replicon)

  Genes in HIR with viral signature:
    Hlac_3116/DL1_3304 (93% identity): BLAST matches against environmental halophages (best match is eHP-6) and BJ1 virus
    Hlac_3120/DL1_3300/Halar_0414 (74% identity)/halTADL_2010 (90% identity): DUF955 domain described as family of bacterial and viral proteins with undetermined function
    Hlac_3122/DL1_3298/Halar_0417 (95% identity)/halTADL_2012 (85% identity): multiple BLAST matches against haloviruses, best match is to protein 68 from HCTV-5
    Hlac_3125/DL1_3295/Halar_0420: multiple BLAST matches against environmental halophages, best match is against hypothetical protein OSG eHP20
    Hlac_3134/DL1_3286/Halar_0429/tADL_2024: BLAST match to Halovirus HHTV-2 protein 87
------------------------------------------------------------
ID 392: Hypothetical; Spectrum count: 12
  Locus tags: Hlac_3168 (Hrr. lacusprofundi/Hlac_replicon 2); Halar_0079 (DL31/DL31_replicon 2)

  Genes in HIR with viral signature:
    Hlac_3166/Halar_0081: BLAST match to uncharacterized protein of BJ1 virus
    Hlac_3162/Halar_0083: BLAST match against Alteromonas phage proteins
------------------------------------------------------------
ID 377: Short-chain dehydrogenase/reductase (SDR); Spectrum count: 14
  Locus tags: Hlac_3251 (Hrr. lacusprofundi/Hlac_replicon 2); halTADL_1233 (Hht. litchfieldiae single replicon)

  Genes in HIR with viral signature:
    Hlac_3240 (adjacent to HIR): phage integrase (pfam00589); not present at the equivalent position in tADL
------------------------------------------------------------
ID 738: TATA-box binding protein (Tbp); Spectrum count: 3
  Locus tags: Hlac_3413 (Hrr. lacusprofundi/Hlac_replicon 3); halTADL_1279 (Hht. litchfieldiae single replicon)

  Genes in HIR with viral signature:
    Hlac_3406/halTADL_1287: phage integrase (pfam00589)
    Hlac_3410/halTADL_1282 (99% identical): BLAST matches against putative cell surface proteins/lipoproteins (including proteins with a PGF-CTERM archaeal sorting signal)
    Hlac_3412/halTADL_1280 (99% identical): BLAST matches against putative cell surface proteins/lipoproteins (including proteins with a PGF-CTERM archaeal sorting signal)
    Hlac_3414/halTADL_1278: DUF955 domain described as family of bacterial and viral proteins with undetermined function
    Hlac_3415/halTADL_1277: BLAST matches to Haloviruses (HCTV-1, HCTV-5)
------------------------------------------------------------


Figure 3.6 Comparison of selected regions of HIRs, viral, transposase and expressed genes of Deep Lake haloarchaea. (A) Gene re-arrangement of the Hrr. lacusprofundi S-layer gene in the Deep Lake metagenome. The S-layer gene Hlac_2976 appears to be located in a mobile region next to transposases and a HIR. The detected Hlac_2976 variant is located next to a putative BJ1 virus gene on a metagenome contig. ORFs are shown as arrows indicating their orientation. Nucleotide sequence identity between metagenome and genome sequences is shown as percentage. (B) One HIR shared between tADL, DL31, Hrr. lacusprofundi and DL1. The HIR contains genes of viral origin and proteins detected in the metaproteome. Protein sequence identity relative to the Hrr. lacusprofundi sequences is shown as percentage, and ORFs are shown as uniform length, horizontal black bars; numbers below ORFs indicate locus tags. Grey areas indicate regions of no homology. (A and B) Genes with detected protein in the metaproteome are highlighted in orange. Genes of putative viral origin are in green. Transposases are in dark red.

3.4.3 Viral proteins

In addition to proteins derived from haloarchaea, the Deep Lake metaproteome contained a total of 23 proteins that had highest sequence identity to viruses (Table 3.4). These included major capsid proteins (MCPs) indicative of at least eight distinct head-tailed viruses. Five of these MCPs matched to isolated haloarchaeal siphoviruses (HVTV-1, HCTV-5, HHTV-1, HCTV-2 and Halorubrum virus CGΦ46) (Kukkaro and Bamford, 2009; Atanasova et al., 2012; Sencilo and Roine, 2014), one MCP matched to the Hlac-Pro1 provirus of Hrr. lacusprofundi that has extensive sequence similarity to BJ1 virus (Krupovic et al., 2010), another MCP matched to putative head-tailed viruses assembled from metagenome data (eHP12 and eHP6) (Garcia-Heredia et al., 2012) and one matched to the bacterial myovirus VBM1. Besides MCPs, two prohead proteases and one scaffold protein, all involved in the assembly and maturation of the capsid of head-tailed viruses (Dokland, 1999; Pietila et al., 2013), were detected (Table 3.4). The metagenome contig encoding the detected Halorubrum virus CGΦ46 MCP also contained genes homologous to BJ1 virus, and many of the other contigs encoding detected viral proteins contained genes homologous to haloarchaeal genomes. These results conformed to the mosaic structure reported for many haloarchaeal head-tailed viruses, containing genes matching other viruses and cellular host genes (Krupovic et al., 2010). The two prohead proteases were derived from two metagenome contigs, each of which also encoded one of the detected MCPs (HVTV-1 and HCTV-5), with prohead protease and MCP encoded within one gene cluster (Table 3.4). The metaproteome further contained six proteins matching haloarchaea with the encoding metagenome contigs also possessing genes of putative viral origin (Table 3.4). Three of the proteins were cell surface proteins, including one matching the Hht. litchfieldiae pili protein halTADL_1885; the contig encoding the pili protein contained an ORF matching to pleolipoviruses. Two of the proteins were linocin M18 type bacteriocins, which have been described as adaptive prophage-derived elements that can be carried and introduced into host genomes through prophages (Bobay et al., 2014). The sixth protein had no annotated function (hypothetical) and was encoded on the same contig as one of the bacteriocins. Based on the spectrum count, viruses matching HVTV-1 and HCTV-5 were highest in abundance, with one of the MCPs being 5th highest in abundance in the whole metaproteome. Viral proteins were of particularly high abundance in the 0.1 µm filter samples compared to the larger filter sizes (Figure 2.3). Many of the contigs matching viruses were previously described as ‘degenerate’ due to their very high metagenome FR coverage, indicative of high abundance of the respective viruses (Table 3.4) (DeMaere et al., 2013). Collectively, these data are indicative of an abundant and diverse viral community in Deep Lake.


Table 3.4 Viral proteins in the Deep Lake metaproteome. ID, identification number: protein numbers are according to spectrum count rank within the complete Deep Lake metagenome (Appendix A). Contig ID refers to the contig identifier in the Deep Lake metagenome database with contigs beginning with ‘deg’ describing degenerate contigs with high FR coverage in the metagenome (DeMaere et al., 2013). Contig taxonomy describes the taxonomic assignment of the contig based on BLASTP searches of encoding ORFs. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. Highlighted in light grey are detected proteins that match to haloarchaea with the encoding metagenome contig also encoding proteins matching to viruses. Contigs targeted by CRISPR spacers (Table 3.7) are underlined. Asterisk (*) denotes an ambiguous match (protein family).

Columns: Protein #; Annotation; BLAST match and sequence identity; Contig ID; Contig taxonomy; Spectrum count

33% match to prohead protease [119] 433 Prohead protease 10.6 [Halovirus HVTV-1] 64% match to major capsid protein

5 Major capsid protein 433.1 [120] [Halovirus HVTV-1] deg7180000395111 HCTV-1-like type head-tail

45% match to hypothetical protein virus 166 Hypothetical protein 39.1 [114] [Halovirus HCTV-1] 49% match to hypothetical protein 892 Hypothetical protein 1.4 [123] [Halovirus HVTV-1]

47% match to hypothetical protein HCTV-1-like type head-tail 876 ctg7180000406861 1.5 [121] [Halovirus HCTV-1] virus Hypothetical protein

51% match to hypothetical protein 877 1.5 [121] [Halovirus HCTV-1] ctg7180000406863 HCTV-1-like type head-tail 43% match to hypothetical protein virus 38 Hypothetical protein 116.4 [121] [Halovirus HCTV-5]

68% match to major capsid protein 24 Major capsid protein 147.4 [119] [Halovirus HCTV-5] 35% match to prohead protease [118] 635 Prohead protease 4.4 [Halovirus HCTV-5] 31% match to major capsid protein 90 Major capsid protein [HALG_00004] [Halorubrum phage 65.8 Head-tail viruses CGΦ46 & CGΦ46] deg7180000401097 BJ1; also homologs in 34% match to major capsid protein haloarchaeal genomes 174 Major capsid protein [HALG_00004] [Halorubrum phage 37.6 CGΦ46] 45% match to major capsid protein [21] 167 Major capsid protein 39.0 [Halovirus HHTV-1] 43% match to major capsid protein [21] deg7180000413894

340 Major capsid protein Head-tail virus HHTV-1 15.9 [Halovirus HHTV-1]

34% match to hypothetical protein [20] 583 Hypothetical protein 5.4 [Halovirus HHTV-1] 32% match to major capsid protein Major capsid protein 91 [Hlac_0760] [Hrr. lacusprofundi Head-tail virus Hlac-Pro1 65.1 (Hlac_0260 homolog) ACAM34] deg7180000409203 provirus & HCTV-5; also 32% match to major capsid protein homologs in haloarchaeal Major capsid protein 425 [Hlac_0760] [Hrr. lacusprofundi genomes 10.8 (Hlac_0260 homolog) ACAM34] 34% match to hypothetical protein 367 Hypothetical protein Head-tail virus VBM1 14.0 [VPGG_00035] [Vibrio phage VBM1] ctg7180000427076 (vibriophage) 41% match to major capsid protein 521 Major capsid protein 7.4 [VPGG_00034] [Vibrio phage VBM1]

54% match to major capsid protein [33] 452 Major capsid protein 9.7 [Halovirus HCTV-2] deg7180000398907 Head-tail virus HCTV-2 51% match to major capsid protein [33] 127 Major capsid protein 48.4 [Halovirus HCTV-2] 42% match to uncharacterized protein Head-tail virus eHP-12/eHP- 104 Major capsid protein [OSG_eHP12_00035] [Environmental ctg7180000408321 6; also homologs in 55.5 Halophage eHP-12] haloarchaeal genomes Head-tail HRTV-7; also 59% match to hypothetical protein [12] 726 Scaffold protein ctg7180000410975 homologs in haloarchaeal 3.0 [Halovirus HRTV-7] genomes 54% match to hypothetical protein HCTV-1-like type head-tail 875 Hypothetical protein ctg7180000399498 1.5 [122] [Halovirus HCTV-5] virus 61% match to adhesion pilus PilA Haloarchaeal genomes & 358 Adhesion pilus [halTADL_1885] [Halohasta ctg7180000459513 14.6 pleolipovirus HRPV-3 litchfieldiae]

38% to Halorubrum kocurii Haloarchaeal genomes, [C468_14103]; 31% match to 825 Cell surface protein ctg7180000403008 Halorubrum virus GNf2 & 1.8 hypothetical protein [HAPG_00095] bacteriophage ORF [Halorubrum phage GNf2] 66% match to hypothetical protein 147 Hypothetical protein [EL22_00080] [Halostagnicola sp. 43.0 A56] deg7180000399274 Haloarchaeal genomes & 82% match to linocin M18 bacteriocin head-tail virus HGTV-1 719 Linocin M18 bacteriocin [EL22_00075] [Halostagnicola sp. 3.0 A56] 39%/42% to Haloferax gibbonsii Haloarchaeal genomes & 1066* Cell surface protein ctg7180000446992 0.4 [C454_02855]; 38% match to Halorubrum virus GNf2

hypothetical protein [HAPG_00095] [Halorubrum phage GNf2] 85% identity to linocin_M18 bacteriocin [C472_12565] ctg7180000411793 Haloarchaeal genomes & 551 Linocin M18 bacteriocin 6.6 [Halorubrum tebenquichense DSM head-tail virus HGTV-1 14210]


3.4.4 Host–defence systems against viral infection

3.4.4.1 CRISPR

The metaproteome contained a total of six Cas proteins, indicative of active CRISPR systems in the Deep Lake haloarchaea (Table 3.5). Cas7 and Cas8b were detected for Hht. litchfieldiae; Cas7, Cas10d and Csc2 were detected for Hrr. lacusprofundi; one detected Cas7 protein had the best match to haloarchaeal species other than the Deep Lake isolates. Cas7 and Cas8b are part of type I-B CRISPR systems, while Cas10d and Csc2 are part of a type I-D system. All detected Cas proteins are part of the CRISPR-associated complex for antiviral defence (Cascade) (Makarova et al., 2011).

Table 3.5 Cas proteins detected in the metaproteome. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples.

ID | Annotation | Organism/locus tag | Spectrum count
214 | Cas7 (type I-B) | Halohasta litchfieldiae/ halTADL_1360 | 30.9
595* | Cas8b (type I-B) | Halohasta litchfieldiae/ halTADL_1361 | 5.2
627 | Cas7 (type I-B) | Hrr. lacusprofundi/ Hlac_3331 | 4.6
914 | Cas10d (type I-D) | Hrr. lacusprofundi/ Hlac_3573 | 1.2
626 | Csc2 (type I-D) | Hrr. lacusprofundi/ Hlac_3574 | 4.6
827 | Cas7 (type I-B) | DL1/ HalDL1_3062 | 1.7
913 | Cas7 (type I-B) | 68% match to Cas7 from Haloquadratum walsbyi C23 | 1.2

The genomes of all four isolate species contain CRISPR systems (Table 3.6): Hht. litchfieldiae and DL1 each contain a single type I-B system, DL31 a single type I-D system and Hrr. lacusprofundi contains a type I-B and a type I-D system. In addition, the genomes of Hht. litchfieldiae and DL31 contain three and one more CRISPR arrays without associated Cas gene clusters, respectively. Two out of the four Hht. litchfieldiae CRISPR loci have identical repeat sequences. Using repeat sequences as queries, CRISPR loci of the isolate species were also identified on metagenome contigs.
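The repeat-guided identification of CRISPR loci described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the procedure used in this study: it only finds exact, forward-strand copies of the consensus repeat, and the function name, the spacer length bounds and the toy sequence in the usage note are hypothetical. The repeat shown is the consensus repeat of Hht. litchfieldiae locus 1 (Table 3.6).

```python
import re

# Consensus repeat of Hht. litchfieldiae CRISPR locus 1 (Table 3.6).
REPEAT = "GTTTCAGACGAACTCTCGTGAGGTTGAAGC"


def extract_spacers(sequence, repeat, min_len=20, max_len=60):
    """Return the sequences lying between consecutive exact copies of `repeat`.

    Assumptions (hypothetical, for illustration only): repeats are exact,
    lie on the forward strand, and genuine spacers are min_len..max_len nt.
    """
    sequence = sequence.upper()
    repeat = repeat.upper()
    # Start positions of every non-overlapping exact repeat copy.
    starts = [m.start() for m in re.finditer(re.escape(repeat), sequence)]
    spacers = []
    for left, right in zip(starts, starts[1:]):
        spacer = sequence[left + len(repeat):right]
        # Skip gaps that are too short or too long to be a spacer.
        if min_len <= len(spacer) <= max_len:
            spacers.append(spacer)
    return spacers
```

Applied to a toy array of the form repeat, spacer, repeat, spacer, repeat, the function returns the two spacers in order. A real analysis would also need to tolerate degenerate repeats and search the reverse strand.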


Table 3.6 CRISPR loci of the Deep Lake isolate species. The column with the header ‘Position’ lists the nucleotide range of the respective CRISPR locus on the encoding replicon.

Species | CRISPR locus | Position | No. spacers | Repeat sequence (consensus) | Comments
Hht. litchfieldiae | 1 | single replicon 1305778..1311449 | 86 | GTTTCAGACGAACTCTCGTGAGGTTGAAGC | Adjacent to type I-B cas gene cluster (halTADL_1355 to halTADL_1362)
Hht. litchfieldiae | 2 | single replicon 1320912..1321203 | 4 | GTTTCAGACGAACGCTTGTGCGGTTGAAGC | In intergenic region between halTADL_1362 & halTADL_1363 (transposase)
Hht. litchfieldiae | 3 | single replicon 1474449..1475583 | 17 | GTTTCAGACCAACCCTCGTGGGGTCGGAGG | In intergenic region between halTADL_1520 (transposase) & halTADL_1521
Hht. litchfieldiae | 4 | single replicon 2597028..2598626 | 24 | GTTTCAGACGAACTCTCGTGAGGTTGAAGC | In intergenic region between halTADL_2674 (fragments) & halTADL_2675
DL31 | 1 | replicon 1 190284..194678 | 60 | GTTTCAATCCCGTTCTGGGTTTTCTGGGTGTCGCGAC | Adjacent to type I-D cas gene cluster (Halar_0915 to Halar_0923)
DL31 | 2 | replicon 1 1876825..1877379 | 8 | GTTTCAGACGTACCCTTGTGGGGTTGAAGC | In intergenic region between Halar_2656 & Halar_2657
Hrr. lacusprofundi | 1 | replicon 3 67425..76153 | 131 | CTTTCAGCCGAACCCCTCGTGGGTTTGAAGC | Adjacent to type I-B cas gene cluster (Hlac_3326 to Hlac_3333)
Hrr. lacusprofundi | 2 | replicon 3 333645..339403 | 79 | GTTTCAATCCCGTGCTGGGTTTTCTGAGTGTCTCGAC | Adjacent to type I-D cas gene cluster (Hlac_3572 to Hlac_3579)
DL1 | 1 | Contig37 17051..20676 | 55 | GTTTCAGACGTACCCTCGTGGGGTTGAAGT | Adjacent to type I-B cas gene cluster (HalDL1_3060 to HalDL1_3067)

To determine targets of the CRISPR systems of the isolate species, spacer sequences from all identified CRISPR loci (genomes and metagenome) were extracted from the databases and searched against the genomes of the isolate species plus the Deep Lake metagenome contigs. Metagenome contigs that were matched by spacers were then manually inspected. Most of the spacers matched head-tailed viruses, while some matched pleolipoviruses, haloarchaeal plasmids and haloarchaeal replicons (Figure 3.7; Table 3.7). Spacers were identified matching four of the head-tailed viral contigs with a detected MCP: eHP12 and eHP6-like, CGΦ46-like, HHTV-1-like, and HCTV-2-like (see 3.4.3; Table 3.7); the first three were targeted by Hht. litchfieldiae CRISPR loci and the last was targeted by DL31. A further metagenome contig matching CGΦ46 was targeted by DL31, though it could not be determined whether it represents the same CGΦ46-like virus also targeted by Hht. litchfieldiae or a different one. Spacers were also identified matching two putative viral contigs encoding the detected linocin proteins: one was targeted by Hrr. lacusprofundi and one by an unknown haloarchaeon. One definite pleolipovirus contig (all ORFs matching pleolipoviruses) was targeted by the Hrr. lacusprofundi CRISPR system (Table 3.7). In a few instances, CRISPR systems of different haloarchaeal species contained spacers matching the same metagenome contig (Table 3.7). Most of these contigs had similarity to head-tailed haloarchaeal viruses including myoviruses. These data are indicative of Deep Lake viruses with a broad host-range, capable of infecting across different genera. Myoviruses with broad host-range have been reported previously from temperate environments including hypersaline ones (Sullivan et al., 2003; Atanasova et al., 2012). Hht. litchfieldiae and Hrr. lacusprofundi CRISPR systems both targeted the contig encoding the detected pilus protein together with a pleolipovirus ORF (see 3.4.3). The other two contigs encoding detected cell surface proteins plus putative viral genes were also targeted by CRISPR systems: one by both Hht. litchfieldiae and Hrr. lacusprofundi, the other only by Hht. litchfieldiae (Table 3.7). Some metagenome contigs matching haloarchaeal viruses were targeted by both Hrr. lacusprofundi CRISPR systems (type I-B and type I-D). Hrr. lacusprofundi CRISPR systems were also the only ones containing spacers that matched the genomes of other isolates. The targeted genes encoded a Cdc6-like protein of DL31, a DNA methylase of Hht. litchfieldiae and a protein with no assigned function of DL1 (Table 3.7). Methylases can also be present in viral genomes (e.g. haloarchaeal myoviruses (Sencilo and Roine, 2014)) and a cdc6-like gene is present in the genome of BJ1 virus (Krupovic et al., 2010), suggesting a possible viral origin for the CRISPR-targeted genes. Within Hht. litchfieldiae and DL1 CRISPR loci, spacers were found that matched regions of their own genomes. The targeted genes were annotated as peptidase and hypothetical, with the hypothetical gene of Hht. litchfieldiae encoded in a region containing multiple transposases and an integrase (Table 3.7). The CRISPR spacer data also included examples of CRISPR arrays containing multiple spacers matching the same metagenome contig, indicating that the targeted elements elicited a strong CRISPR response. Many of these targeted contigs were derived from head-tailed viruses. Collectively, the data highlighted that CRISPR systems of Deep Lake haloarchaea were active and employed against a wide range of different sources of invading DNA (Figure 3.7).
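The logic of matching spacers to proto-spacers, and of flagging contigs targeted by more than one host, can be sketched as follows. This is a simplified stand-in rather than the search used in this study: it assumes exact matches on either strand, whereas the similarity search and thresholds actually applied are not specified in this section, and all function names, identifiers and sequences are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_protospacers(spacers, contigs):
    """Exact-match search of every spacer against every contig, both strands.

    spacers: {spacer_id: sequence}; contigs: {contig_id: sequence}.
    Returns a list of (spacer_id, contig_id, 0-based position, strand).
    """
    hits = []
    for spacer_id, spacer in spacers.items():
        queries = (("+", spacer), ("-", reverse_complement(spacer)))
        for contig_id, contig in contigs.items():
            for strand, query in queries:
                pos = contig.find(query)
                while pos != -1:
                    hits.append((spacer_id, contig_id, pos, strand))
                    pos = contig.find(query, pos + 1)
    return hits


def contigs_targeted_by_several_hosts(hits, spacer_host):
    """Contigs matched by spacers from more than one host species.

    spacer_host: {spacer_id: host name} (hypothetical bookkeeping).
    """
    hosts = defaultdict(set)
    for spacer_id, contig_id, _pos, _strand in hits:
        hosts[contig_id].add(spacer_host[spacer_id])
    return {contig: sorted(names) for contig, names in hosts.items() if len(names) > 1}
```

Contigs returned by the second function correspond to the situation described above in which CRISPR systems of different species carry spacers against the same element. Manual inspection of the matched contigs would still be required to decide whether the element is viral.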


Figure 3.7 CRISPR systems in Deep Lake haloarchaea. Cas proteins detected in the metaproteome (yellow), and Cas protein gene clusters for Hht. litchfieldiae tADL, Hrr. lacusprofundi, DL31 and DL1 are shown with their associated CRISPR loci containing repeats (black) and spacers (white). Spacers identified in metagenome contigs are shown separately, linked to specific CRISPR loci based on their repeat sequences. Hht. litchfieldiae tADL locus 1 and 4 cannot be differentiated in metagenome data because they have identical repeats. For both genomic and metagenomic spacers, spacers that matched to sequences in metagenome contigs or isolate genome sequences are in red. Spacer number is shown for repeats (red) from genomic loci, numbered relative to the first spacer in the locus. The contig (red text) matching the spacer (red) includes a description (taxa and/or function; black writing) of relevant genes that were able to be annotated (no description indicates insufficient level of match to provide annotation). Cdc6, cell division control protein 6; IF-2, translation initiation factor IF-2; IgA, immunoglobulin A; IgB, immunoglobulin B; MCP, major capsid protein; PadR, PadR transcriptional regulator; PLD, Phospholipase D domain; PQQ/WD-40, pyrrolo-quinoline quinone repeat; TFIIIB, transcription initiation factor IIB; UspA, universal stress protein A; VP, virion protein; VWA, von Willebrand factor type A domain; wHTH, winged helix-turn-helix; XRE, XRE transcriptional regulator.

Table 3.7 Spacers from Deep Lake CRISPR loci. The table shows DNA elements (metagenome contigs and genomes) that were targeted by Deep Lake CRISPR spacers. The first column designates the spacer-containing CRISPR locus; the second column specifies the origin of the spacer-containing CRISPR locus (genome or metagenome) and the number of matching spacers; the third column lists the proto-spacer-containing element matched by the respective spacers (either metagenome contig ID or genome); the fourth column gives a taxonomic and functional description of the element specified in column three; the fifth column lists the proto-spacer-containing ORFs. Metagenome contigs that are targeted by more than one CRISPR locus are highlighted in light grey. Metagenome contigs that also encode a protein detected in the metaproteome are highlighted in bold. Cdc6, cell division control protein 6; PLD, Phospholipase D Active site domain; PQQ/WD-40, pyrrolo-quinoline quinone repeat; RepH, plasmid replication protein; TerL, phage terminase large subunit; UspA, universal stress protein A; VP, virion protein; VWA, von Willebrand factor type A domain.

Columns: CRISPR locus; Origin of CRISPR (genome or metagenome) and # matching spacers; Proto-spacer containing contig/genome; Contig/replicon description; Proto-spacer containing ORF

Hht. litchfieldiae metagenome (1) no identifiable ORF (locus 1 or 4) Hrr. lacusprofundi (I-B) ctg7180000271658 ORFs with no known homologs hypothetical protein; (locus 1) & (I-D) (locus metagenome (2,3) no identifiable ORF 2) Hht. litchfieldiae metagenome (1) hypothetical protein (locus 1 or 4) homologs to haloarchaeal ORFs (unknown Hrr. lacusprofundi (I-B) ctg7180000403008 function) + Halorubrum phage GNf2 ORF (cell genome (1), hypothetical protein (locus 1) & surface, unknown function) metagenome (1) (cell surface) (I-D) (locus 2) Hht. litchfieldiae metagenome (1) homologs to haloarchaeal ORFs (unknown hypothetical protein (locus 1) ctg7180000403010 function) Hrr. lacusprofundi (I-B) metagenome (1) hypothetical protein (locus 1)

hypothetical protein Hht. litchfieldiae homologs to head-tail virus HSTV-2/HRTV-7 (phage); hypothetical metagenome (4) (locus 1 or 4) ORFs (unknown function), haloarchaeal ORF protein (VWA); no ctg7180000439875 (VWA domain-like/ATPase protein), & identifiable ORF Hrr. lacusprofundi (I-B) genome (1), mycobacterial phage ORF (unknown function) hypothetical protein (locus 1) metagenome (1) (phage) genome (3, but 2 Hht. litchfieldiae identical), homologs of pleolipovirus HRPV-3 ORF (ORF6 hypothetical protein; (locus 1 genome; locus 1 metagenome (2, same - unknown function) & haloarchaeal ORFs (incl. no identifiable ORF or 4 metagenome) ctg7180000459513 as genome) adhesion pilus [haltADL_1885 homolog] & Hrr. lacusprofundi (I-B) integrase) integrase metagenome (1) (locus 1) genome (1) homologs of haloarchaeal ORF (unknown DL31 (locus 2) TerL function), head-tail virus HSTV-1 ORF (large deg7180000398860 terminase [TerL]), & head-tail virus BJ1 ORF

DL1 (locus 1) metagenome (1) (unknown function) TerL

hypothetical protein DL31 (locus 1) genome (2) homolog to ORF of possible surface protein ctg7180000296261 (WD-40) (PQQ/WD-40 repeat protein) Unknown haloarchaeon hypothetical protein metagenome (1) #1 (WD-40) hypothetical protein DL1 (locus 1) genome (1) (virus) homologs to ORFs from head-tail viruses eHP- metagenome (1) ctg7180000414238 DL31 (locus 2) 32/eHP-36 (unknown function) & haloarchaeal hypothetical protein

ORF (nucleic acid binding protein) Unknown haloarchaeon nucleic acid-binding metagenome (1) #2 protein

Hht. litchfieldiae genome (1), Ig-binding regulator (locus 1 genome; locus 1 metagenome (1, same A family protein or 4 metagenome) as genome) homologs to Firmicutes ORFs (immunoglobulin Ig-binding regulator [Ig]-binding regulator A family protein, Ig- Hrr. lacusprofundi (I-B) deg7180000395624 A family protein; Ig- metagenome (2) binding regulator B family protein, (locus 1) binding regulator B acetyltransferase-like protein) family protein Ig-binding regulator DL1 (locus 1) genome (1) A family protein Hrr. lacusprofundi (I-B) homolog to head-tail virus HSTV-2/HRTV-7 hypothetical protein; (locus 1) & (1-D) (locus metagenome (1,3) ctg7180000271717 ORF (unknown function) & ORF with no no identifiable ORF 2) known homolog Hrr. lacusprofundi (I-B) homologs to haloarchaeal ORFs (incl. hypothetical protein (locus 1) & (I-D) (locus metagenome (1,1) ctg7180000396713 methyltransferase, other unknown function) &

(same) 2) head-tail virus BJ1 ORF (unknown function)

homologs of haloarchaeal ORFs (GTP-binding Hht. litchfieldiae genome (1) ctg7180000455053 protein, XRE transcriptional regulator), & ORFs no identifiable ORF (locus 1) with no known homologs Hht. litchfieldiae homologs of haloarchaeal ORFs (unknown hypothetical protein genome (2), (locus 1 genome; locus 1 ctg7180000446992 function) & Halorubrum phage GNf2 (cell (cell surface); metagenome (1) or 4 metagenome) surface, unknown function) hypothetical protein Hht. litchfieldiae homologs of haloarchaeal ORFs (UspA domain metagenome (1) ctg7180000438477 hypothetical protein (locus 1 or 4) protein, otherwise unknown function) homologs to haloarchaeal ORFs (zinc finger Hht. litchfieldiae hypothetical protein metagenome (1) ctg7180000452388 SWIM domain protein, ribosomal L36-like (locus 1 or 4) (membrane) protein, membrane protein of unknown function)

homologs to haloarchaeal ORFs Hht. litchfieldiae metagenome (1) ctg7180000454867 (haltADL_0529, _0533, _0534, all unknown no identifiable ORF (locus 1 or 4) function) Hht. litchfieldiae halTADL_1427: metagenome (1) tADL genome Hht. litchfieldiae single replicon (locus 1 or 4) hypothetical protein homologs to head-tail virus HHTV-1 ORFs Hht. litchfieldiae metagenome (1) deg7180000413894 (incl. major capsid protein) & ORFs with no major capsid protein (locus 1 or 4) known homologs Hht. litchfieldiae homologs to haloarchaeal ORFs (incl. winged hypothetical protein; metagenome (1,1) ctg7180000268294 (locus 1 or 4 & 2) helix-turn-helix DNA-binding protein) no identifiable ORF homologs to haloarchaeal ORFs (incl. PadR-like Hht. litchfieldiae metagenome (1) ctg7180000277369 transcriptional regulator) & BJ1 head-tail virus no identifiable ORF (locus 2) ORF (PhiH-like repressor) homologs of haloarchaeal ORFs (PRC-barrel Hht. litchfieldiae domain protein, helicase), bacteriophage ORF metagenome (1) ctg7180000456477 helicase

(locus 3) (unknown function), & ORF with no known homolog homologs to Halorubrum phages (head-tail Hht. litchfieldiae viruses) CGΦ46 & BJ1 ORFs (incl. major metagenome (1) deg7180000401097 no identifiable ORF (locus 3) capsid protein) & Natrialba phage ΦCh1 (unknown function) genome (2), Hht. litchfieldiae homologs to ORFs from head-tail virus eHP- metagenome (2, same FtsH; hypothetical (locus 4 genome; locus 1 ctg7180000408321 12/eHP-6 & Halomonas (ATP-dependent zinc as genome) protein or 4 metagenome) metalloprotease [FtsH])

homologs to head-tail virus eHP-36 ORF Hht. litchfieldiae genome (1), (unknown function), haloarchaeal ORF (locus 4 genome; locus 1 metagenome (1, same ctg7180000408328 terminase B (terminase), & Natrialba phage ΦCh1 ORF or 4 metagenome) as genome) (unknown function) DL31 homologs of head-tail virus HCTV-2 ORFs metagenome (1) deg7180000398907 major capsid protein (locus 1) (incl. major capsid protein & prohead protease) homologs to Halorubrum phages (head-tail DL31 hypothetical protein genome (1) deg7180000401099 viruses) CGΦ46 & BJ1 ORFs (unknown (locus 1) (virus?) function) homologs of head-tail virus HCTV-2/HHTV-2 DL31 genome (1), ORFs (incl. tail tube, tail assembly protein, tape deg7180000398909 hypothetical protein (locus 2) metagenome (1) measure protein) & Firmicutes (Streptococcus sp.) phage ORFs Hrr. lacusprofundi (I-B)

genome (1) ctg7180000396891 homolog to Halar_0011 (putative DNA primase) DNA primase (locus 1)

segment of DL31 genome (Halar_2946-_2949): Hrr. lacusprofundi (I-B) Halar_2947: Cdc6- genome (1) ctg7180000405763 transposase, Cdc6-like protein, signal peptidase, (locus 1) like protein aminopeptidase segment of tADL genome (halTADL_0207- Hrr. lacusprofundi (I-B) _0211): incl. DNA methylase N4/N6 domain DNA methylase-like genome (1) ctg7180000417783 (locus 1) protein, endonuclease, DUF1524/RloF, other protein (tADL) hypothetical proteins homologs to haloarchaeal ORFs (unknown Hrr. lacusprofundi (I-B) genome (1) ctg7180000424023 function); ORFs homologous to those on hypothetical protein (locus 1) ctg7180000464057

homologs of haloarchaeal ORFs (helicase domain protein, Phospholipase D Active site Hrr. lacusprofundi (I-B) metagenome (1) ctg7180000438795 motif domain protein, DNA-cytosine hypothetical protein (locus 1) methyltransferase, A/G-specific DNA glycosylase, & proteins of unknown function) homologs to haloarchaeal ORF (unknown Hrr. lacusprofundi (I-B) genome (1) ctg7180000440486 function) & methanogen ORF (AsnC hypothetical protein (locus 1) transcriptional regulator) Hrr. lacusprofundi (I-B) homologs to haloarchaeal ORFs (unknown genome (2) ctg7180000447311 hypothetical protein (locus 1) function) homologs to haloarchaeal ORFs (unknown Hrr. lacusprofundi (I-B) genome (1) ctg7180000464057 function); ORFs homologous to those on hypothetical protein (locus 1) ctg7180000424023 Hrr. lacusprofundi (I-B) DL31 replicon 1 – same as ctg7180000405763 Halar_2947: Cdc6-

genome (1) DL31 genome (locus 1) (above) like protein

Hrr. lacusprofundi (I-B) genome (1), HalDL1_3267: DL1 genome DL1 replicon (“Contig37”) (locus 1) metagenome (1) hypothetical protein halTADL_0211: Hrr. lacusprofundi (I-B) tADL single replicon – genome (1) tADL genome DNA methylase N- (locus 1) same as ctg7180000417783 (above) 4/N-6 Hrr. lacusprofundi (I-D) homologs to haloarchaeal & Halorubrum phage metagenome (1) ctg7180000261785 no identifiable ORF (locus 2) GNf2 ORFs (unknown function) Hrr. lacusprofundi (I-D) genome (1), all homologs to pleolipovirus HRPV-1 ORFs hypothetical protein; ctg7180000266221 (locus 2) metagenome (1) (incl. VP8) VP8 protein homologs of haloarchaeal ORFs (incl. Hrr. lacusprofundi (I-D) metagenome (1) ctg7180000435415 transcription initiation factor IIB, translation no identifiable ORF (locus 2) initiation factor IF-2, histone-like protein)

homologs to haloarchaeal ORFs (PadR Hrr. lacusprofundi (I-D) transcriptional metagenome (1) ctg7180000436748 transcriptional regulator, protein of unknown (locus 2) regulator function) & ORFs with no known homologs homologs to haloarchaeal ORFs (incl. linocin Hrr. lacusprofundi (I-D) M18 bacteriocin), head-tail HGTV-1 ORF genome (1) deg7180000399274 linocin (locus 2) (prohead protease), & ORF with no known homolog DL1 homologs to haloarchaeal ORFs (incl. RepH, genome (1) ctg7180000271318 RepH (locus 1) replication protein) DL1 homologs to ORFs from head-tail viruses eHP- genome (2) ctg7180000408317 hypothetical proteins (locus 1) 6/eHP-12 (unknown function) DL1 homologs to ORFs from head-tail viruses eHP- hypothetical protein; genome (2) ctg7180000408318 (locus 1) 12 & HGTV-1 (incl. tail proteins) no identifiable ORF homologs to haloarchaeal ORFs (unknown DL1 function) & Haloarcula phage genome (1) ctg7180000475061 hypothetical protein

(locus 1) (halosphaerovirus) SH1 ORF (unknown function) HalDL1_0747: DL1 genome (1) DL1 genome DL1 replicon (“Contig38”) Peptidase MA (locus 1) superfamily protein Unknown haloarchaeon all homologs to head-tail virus HSTV-2/HRTV- hypothetical protein metagenome (1) ctg7180000428629 #1 7 ORFs (unknown function) (virus) Unknown haloarchaeon homologs to haloarchaeal ORFs (helicase, metagenome (1) deg7180000422908 helicase #1 phage/plasmid primase) Unknown haloarchaeon homologs to haloarchaeal ORFs (incl. linocin) & metagenome (1) ctg7180000411793 no identifiable ORF #2 head-tail virus HGTV-1 ORF (prohead protease) Unknown haloarchaeon metagenome (1) ctg7180000450146 homolog to haloarchaeal ORF (DNA primase) DNA primase #2

Unknown haloarchaeon Halar_0011: DNA metagenome (1) DL31 genome DL31 replicon 3 #2 primase homologs to bacterial (Halomonas sp.) ORFs Bacterial (Halomonas?) metagenome (1) ctg7180000297164 (unknown function) & bacteriophage ORFs no identifiable ORF (incl. tail fiber protein) homologs to bacteriophage (rhizobial) ORF, & Bacterial (Halomonas?) metagenome (1) ctg7180000447433 hypothetical protein ORF with no known homolog


3.4.4.2 BREX systems

Hrr. lacusprofundi and DL31 were previously reported to possess BREX systems in their genomes (Goldfarb et al., 2015). In both species the genes forming the BREX system are located on secondary replicons. In Hrr. lacusprofundi, the BREX system spans locus tags Hlac_3187 to Hlac_3197 and in DL31 it spans Halar_0254 to Halar_0263 (Figure 3.8; Table 3.8). BREX proteins from both species share considerable sequence similarity, and parts of the BREX clusters lie within a ~14 kb HIR (only one nucleotide mismatch) shared only between Hrr. lacusprofundi and DL31.

Table 3.8 BREX clusters in Hrr. lacusprofundi and DL31. Protein sequence identity between the Hrr. lacusprofundi and DL31 homologues is given in percent. Metagenome FR coverage of genes is given as rpkm (reads per kilobase per 1,000,000 recruited reads) values. Highlighted in light grey are genes that are part of an HIR; the HIR starts in the N-terminal half of brxC/pglY. FR coverage is not shown for the transposase as coverage is uncertain due to multiple copies of the gene being present in Hrr. lacusprofundi and other Deep Lake genomes. TA, toxin-antitoxin.

Annotation | Hrr. lacusprofundi locus tag | DL31 locus tag | Sequence identity | Hrr. lacusprofundi FR coverage | DL31 FR coverage
brxHII | Hlac_3187 | Halar_0263 | 99% | 3322 | 3886
pglZ | Hlac_3188 | Halar_0262 | 92% | 2929 | 3772
Hypothetical | - | Halar_0261 | - | - | 723
pglX | Hlac_3189 | Halar_0260 (AA 281 – 1402) | 66% | 1458 | 871
Transposase | Hlac_3190 | - | - | Not shown | -
pglX | Hlac_3191 | Halar_0260 (AA 1 – 287) | 92% | 2993 | 3984
TA system | Hlac_3192 | Halar_0259 | 99% | 6044 | 7247
TA system | Hlac_3193 | Halar_0258 | 100% | 4574 | 5626
brxC/pglY | Hlac_3194 | Halar_0257 | 95% | 2889 | 3251
brxB | Hlac_3195 | Halar_0256 | 85% | 3739 | 4529
brxC/pglY | Hlac_3196 | Halar_0255 | 95% | 2804 | 3500
brxA | Hlac_3197 | Halar_0254 | 100% | 3687 | 4313
Complete BREX clusters | Hlac_3187 to Hlac_3197 | Halar_0254 to Halar_0263 | - | 2367 | 2796
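The rpkm unit used for FR coverage in Table 3.8 normalises the number of reads recruited to a gene by the gene length and by the total number of recruited reads. A minimal sketch of that calculation is given here, following the definition in the table caption; the function and argument names are hypothetical, and the counts in the usage note are illustrative and not values from this study.

```python
def rpkm(reads_recruited_to_gene, gene_length_bp, total_recruited_reads):
    """Reads per kilobase of gene per 1,000,000 recruited reads."""
    if gene_length_bp <= 0 or total_recruited_reads <= 0:
        raise ValueError("gene length and total recruited reads must be positive")
    kilobases = gene_length_bp / 1_000
    millions_of_reads = total_recruited_reads / 1_000_000
    return reads_recruited_to_gene / kilobases / millions_of_reads
```

For example, a hypothetical 3,000 bp gene recruiting 300 reads out of 1,000,000 recruited reads has an rpkm of 100. Because the value is length-normalised, a short gene and a long gene with equal per-base coverage receive the same rpkm.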

In the metaproteome, a peptide was detected matching the methylase PglX from Hrr. lacusprofundi (Hlac_3189). In the Hrr. lacusprofundi genome, the pglX gene is interrupted by a transposase (Figure 3.8), which would render the gene non-functional. However, within the metagenome database a contig was identified encoding the uninterrupted pglX gene. The FR coverage of BREX systems in Hrr. lacusprofundi and DL31 revealed that in both species the respective regions have overall high coverage with the exception of the pglX genes (Figure 3.8; Table 3.8). The PglX proteins also exhibit the lowest degree of sequence identity among the BREX proteins of the two species.

Two genes encoding a toxin-antitoxin (TA) system lie within the BREX gene clusters of Hrr. lacusprofundi and DL31 (Figure 3.8). The genes are located on the opposite strand compared to the BREX genes, and the encoded VapB and VapC proteins have high sequence identity between the two species (99% and 100%). Similarly, a VapBC TA system is also encoded near the type I-D CRISPR system of Hrr. lacusprofundi, with the VapC (Hlac_3585) protein detected in the metaproteome. vapBC genes are also present in the genomes of Hht. litchfieldiae and DL1. In DL31, an additional gene encoding a short hypothetical protein (Halar_0261) is present within the BREX cluster next to the pglX gene (Figure 3.8). No homologue of this protein could be found in the other Deep Lake isolate species, but homologues with ~80% sequence identity were identified in some other haloarchaeal species, e.g. Haloarcula californiae, where the gene is also located between genes with PglZ and methyltransferase domains.


Figure 3.8 BREX gene clusters of DL31 and Hrr. lacusprofundi. The lower panel shows ORFs (blue arrows) of BREX clusters in DL31 and Hrr. lacusprofundi. Nomenclature of BREX genes (on top of DL31 cluster) is according to Goldfarb et al. (2015). Numbers below genes denote locus tags. Amino acid sequence identity between the encoded proteins of DL31 and Hrr. lacusprofundi is given as percentage. In both species a toxin-antitoxin system (light brown arrows) is integrated in between BREX genes. In Hrr. lacusprofundi a transposase (red arrow) interrupts the pglX gene. In DL31 an additional gene encoding a hypothetical protein (grey arrow) sits in between BREX genes. Beginning in the N-terminal part of Halar_0255/Hlac_3196 and going upstream is an HIR of ~14 kb shared between the two species. The orange arrow highlights a gene with detected protein in the metaproteome. The top panel shows metagenome FR of the DL31 BREX cluster, highlighting comparably low coverage for the pglX gene.


3.5 Discussion

Led by the metaproteomic identification of viral proteins, host cell surface variants and other host proteins involved in anti-viral defence, this chapter describes a rich network of host–virus interactions in Deep Lake. It provides a first snapshot of a diverse Deep Lake viral community and describes a number of defence systems actively employed by the haloarchaeal hosts.

3.5.1 Diverse and abundant viruses in Deep Lake This study represents the first assessment of the viral community in Deep Lake. MCPs of head-tailed haloarchaeoviruses dominated the viral fraction of the Deep Lake metaproteome (Table 3.4) with some of the encoding contigs also having very high FR coverage in the metagenome. This is in contrast to some temperate hypersaline environments where head-tailed viruses were reported to represent only a minor fraction of the viral communities with spindle-shaped viruses being highest in abundance (Oren et al., 1997; Kukkaro and Bamford, 2009; Luk et al., 2014). Head-tailed viruses represent the majority of haloarchaeovirus isolates and as a consequence are also overrepresented in genomic and proteomic databases (Luk et al., 2014). By contrast, until now His1 is only one isolated haloarchaeal spindle-shaped virus (Bath and Dyall- Smith, 1998) making the identification of potential spindle-shaped virus proteins difficult. Microscopic analysis of the virus fraction will be required for a better understanding of the abundance of different types of viruses in Deep Lake. One detected MCP had its closest match to the bacteriophage VBM1, which infects the marine bacterium Vibrio parahaemolyticus. In Deep Lake, proteins matching VBM1 could be derived from a virus infecting Halomonas sp., which, like Vibrio parahaemolyticus, belongs to the Gammaproteobacteria and represents one of the most abundant bacterial taxa in Deep Lake (0.8% of the lake community) (DeMaere et al., 2013). The sizes of reported head-tailed haloarchaeoviruses range from 40-90 nm for the head structures and 40-170 nm for the tails (Luk et al., 2014). The higher abundance of virus proteins on the 0.1 µm filters compared to the larger filter sizes (Figure 2.3) indicates that relatively large, free living viruses were predominantly captured on the filters. 
However, the detected prohead proteases and scaffold protein, involved in the assembly and maturation of capsid structures of new virus particles (virions) during lytic infection (Dokland, 1999; Pietila et al., 2013), most likely originate from virus particles in the process of assembly caught within host cells. It is likely that a proportion of Deep Lake viruses were too small to be captured by the 0.1 µm filters (Emerson et al., 2013). Hence, the present study might not represent a complete picture of the Deep Lake virus community but rather provides a first insight. Metagenomic sequencing of the post-filtrate fraction is likely to reveal many additional viruses.

3.5.2 Avoiding viral infection through variation in cell surface structures A striking finding of this study was the detection of multiple cell surface proteins with a high degree of sequence variation (Figure 3.1). Multiple variants, some of them with very high abundance, were detected for the main S-layer protein of the isolate species, in particular for the most dominant species Hht. litchfieldiae (Table 3.2). Metagenome FR further showed that the S-layer genes of the isolate species are not well represented in the Deep Lake metagenome (Figure 3.3). These data are indicative of phylotype variation that exists within the populations of Deep Lake where different members of the population exhibit different S-layer proteins on their cell surface. Regions of low metagenome FR coverage containing cell surface genes, including the S-layer gene, have been described in other microbial species and have been hypothesised to represent a phage evasion strategy (Coleman et al., 2006; Legault et al., 2006; Cuadros-Orellana et al., 2007; Rodriguez-Valera et al., 2009; Avrani et al., 2011). Exposure of Prochlorococcus cultures to predating phages led to the evolution of resistant strains that accumulated mutations predominantly in cell surface genes (Avrani et al., 2011). Therefore, variation in cell surface structures potentially prevents virus infection. Our data provided evidence that both S-layer genes in the genome of Hrr. lacusprofundi are functional and expressed within the population. Since the archaeal S-layer is usually composed of multiple copies of a single protein subunit (Albers and Meyer, 2011) it is likely that the two S-layer proteins are expressed by distinct parts of the population. In some pathogenic bacteria with multiple S-layer gene copies, recombination events cause a switching of the actively expressed S-layer homologue resulting in distinct antigenic characteristics (Fagan and Fairweather, 2014).
It is therefore possible that a similar mechanism exists in Hrr. lacusprofundi. However, no indications for recombination events of the Hrr. lacusprofundi S-layer genes were found in this study.

Genomic and metagenomic analyses revealed that the second Hrr. lacusprofundi S-layer gene (Hlac_2976) is located within a putatively mobile genomic region next to transposases, viral genes and HIR (Figure 3.6). This suggests that S-layer genes are subject to HGT. Through HGT, an S-layer protein conferring resistance to a particular virus could be distributed within a population, allowing the population to adapt promptly to newly emerging viruses. Just as hosts vary their cell surface proteins to escape virus infection, viruses can respond and alter their host range by introducing changes in the structures that interact with the host cell receptor (Samson et al., 2013; Sorek et al., 2013). The head-tailed haloarchaeovirus ɸCh1 infecting Natrialba magadii introduces changes in its putative tail fibre proteins through a recombinase-mediated recombination mechanism termed phase variation (Rossler et al., 2004) resulting in distinct binding characteristics (Klein et al., 2012). The affected tail fibre proteins contain predicted galactose-binding domains, suggesting cell surface glycoprotein structures like the S-layer or archaella as putative host cell receptors (Klein et al., 2012). In the metaviriomes (metagenome of only the virus fraction) of temperate hypersaline systems, glucanase genes were found in viral genomes and were hypothesised to be involved in the degradation of the host S-layer glycoproteins during virus release or preceding virus infection (Garcia-Heredia et al., 2012), similar to the exopolysaccharide (EPS)-degrading genes of bacterial viruses (Cornelissen et al., 2011). The observed variation of the S-layer proteins might therefore also occur as a response to viral glucanase functioning. This is the first study to show that diverse cell surface proteins are synthesised within populations of haloarchaea in the environment as a response to virus infection pressure.

3.5.3 Virus encoded cell surface genes A further intriguing finding of this study was the detection of cell surface proteins with sequence similarity to haloarchaea but encoded on contigs together with putative viral genes. Haloarchaeal viruses frequently harbour genes matching haloarchaea in their genomes (Pagaling et al., 2007; Santos et al., 2007; Krupovic et al., 2010; Santos et al., 2010; Emerson et al., 2012) and the genomes of haloarchaea, including those from Deep Lake, have been reported to contain many virus-derived genes (Legault et al., 2006; Cuadros-Orellana et al., 2007; Dyall-Smith et al., 2011; DeMaere et al., 2013). Hence it was not possible to unambiguously assign the detected proteins and the encoding metagenomic contigs to either a haloarchaeal host species or a virus. The respective contigs could represent genomic islands from Deep Lake haloarchaeal species since haloarchaeal cell surface and virus-derived genes have been reported for genomic islands of Hqr. walsbyi (Legault et al., 2006; Cuadros-Orellana et al., 2007). Whatever regions these contigs might represent, the particular abundance of expressed cell surface genes next to putative viral genes raises the possibility that viruses are involved in exchanging cell surface genes within populations, thereby potentially also introducing sequence variation within these genes.

3.5.4 CRISPR spacer analyses reveal haloarchaea-virus relationships in Deep Lake The detection of Cas proteins for Hht. litchfieldiae and Hrr. lacusprofundi (Table 3.5) indicates that CRISPR systems were actively employed against invading DNA by these two species. The lack of detected Cas proteins for DL31, even though it is the second most abundant species in Deep Lake, suggests DL31 might be less prone to virus infection (e.g. no expression of archaella as potential virus receptors) or that other defence mechanisms are more prominent (Samson et al., 2013). Since spacer sequences represent signatures of previous infection events, they can be used to elucidate host-virus relationships occurring in the environment (Andersson and Banfield, 2008; Garcia-Heredia et al., 2012; Emerson et al., 2013; Sorek et al., 2013). CRISPR systems of Deep Lake haloarchaea were found to mostly target head-tailed viruses and also, to a lesser extent, other viruses, plasmids and regions of chromosomal DNA with similarity to viruses (Table 3.7). The identification of viruses capable of infecting multiple Deep Lake haloarchaeal species suggests viruses may function as mediators for HGT. Virus-mediated gene transfer (transduction) occurs frequently in the environment and has been hypothesised to be one of the key drivers of microbial diversity (Paul, 1999; Weinbauer and Rassoulzadegan, 2004). HIRs, shown in this study to encode genes that are expressed in the environment, include a high number of putative virus-derived genes (Figure 3.6), suggesting a possible role of viruses in their distribution. The identification of Deep Lake viruses capable of infecting hosts from different genera is further support for this hypothesis. Viruses were also hypothesised to be responsible for gene transfer events in genomic islands of Hqr. walsbyi (Legault et al., 2006) since they contain many genes with similarity to viruses.
However, HIRs are ~100% conserved between species from different genera whereas genomic islands are hot-spots for variation within populations of single species (Coleman et al., 2006; Legault et al., 2006; DeMaere et al., 2013), indicating fundamental functional differences between the two genomic elements. No spacers matched the most abundant viruses in the metaproteome, indicating that these highly abundant viruses were not captured by host CRISPR systems. However, multiple spacers matched viral contigs with very high coverage in the metagenome (Table 3.7), indicating that other highly abundant viruses were indeed targeted by the CRISPR systems. These results are similar to those from a temperate hypersaline lake, Lake Tyrrell, in which spacers for only some of the highly abundant viruses were identified (Emerson et al., 2013). For a large number of spacers, no match could be identified in the metagenome (Figure 3.7). This could be due to several reasons: (A) very high stringency settings were used in the BLAST search of spacers against the metagenome (only 100% matches over the full length) in order to reduce false-positive matches. Since targeted viruses can escape CRISPR systems through mutations within proto-spacer regions (Andersson and Banfield, 2008; Sorek et al., 2013), a single point mutation in the proto-spacer would have been sufficient for the proto-spacer not to be detected in our analysis. (B) In Lake Tyrrell most of the viruses targeted by CRISPR systems were of too low abundance to be assembled into a metagenome database (Emerson et al., 2013). An assembled metagenome database was also used for this present Deep Lake study, potentially excluding low-abundance viruses. (C) In the same Lake Tyrrell study most of the viruses were identified in the metagenome of the post-filtrate (0.1 µm filter flow-through); no post-filtrate fraction was used for the metagenomic sequencing of Deep Lake (DeMaere et al., 2013).
However, it should be noted that even though the post-filtrate was included in the Lake Tyrrell study, the number of spacers matching the respective metagenome was still small (Emerson et al., 2013).
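
The effect described under (A) can be illustrated with a minimal Python sketch of exact, full-length spacer matching on both strands. The spacer and contig sequences are hypothetical toy data, and the plain substring search is only a stand-in for the BLAST search used in this study; it is meant to show why a single point mutation in a proto-spacer removes the match.

```python
from typing import Dict, List


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def spacer_hits(spacer: str, contigs: Dict[str, str]) -> List[str]:
    """Names of contigs that contain the spacer at 100% identity over its
    full length, on either strand."""
    spacer = spacer.upper()
    rc = reverse_complement(spacer)
    return [name for name, seq in contigs.items()
            if spacer in seq.upper() or rc in seq.upper()]


# Hypothetical toy data, not taken from the Deep Lake metagenome.
spacer = "ACGTTGCAAGGCTTACGATC"
contigs = {
    "viral_contig_exact": "TTTT" + spacer + "GGGG",
    "viral_contig_revcomp": "CC" + reverse_complement(spacer) + "AA",
    # Same proto-spacer with one point mutation (G -> T at position 10).
    "viral_contig_escape": "TTTT" + spacer[:9] + "T" + spacer[10:] + "GGGG",
}

print(spacer_hits(spacer, contigs))
```

Under these assumptions the mutated contig is not reported, although it differs from the spacer at only one position.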

3.5.5 BREX and TA systems in Deep Lake The Deep Lake metaproteome included the putative methylase PglX belonging to the BREX anti-viral defence system of Hrr. lacusprofundi. Besides Hrr. lacusprofundi, a BREX system is also present in the genome of DL31, and both systems belong to the subtype 5 BREX systems, which so far have only been identified in haloarchaea (Goldfarb et al., 2015). The high sequence similarity between BREX proteins from Hrr. lacusprofundi and DL31, with parts of the system belonging to an HIR shared between the two species, is indicative of HGT of BREX systems between Deep Lake haloarchaea. These data are in agreement with a previous report that BREX systems are subject to extensive HGT, indicated by a lack of coherence between the distribution of BREX systems and species phylogeny (Goldfarb et al., 2015). Comparison of identified BREX systems from sequenced bacterial and archaeal genomes revealed that the pglX gene was prone to irregularities including frameshift mutations, duplications and occurrence of multiple truncated versions next to a full length gene. The pglX gene was further hypothesised to be subject to recombinase-mediated phase variation (Goldfarb et al., 2015). Low metagenome FR of the Hrr. lacusprofundi and DL31 pglX genes and the presence of an interrupting transposase in parts of the Hrr. lacusprofundi population (Figure 3.8) indicate that the pglX gene is not only prone to variation between different species but that variation occurs at a strain level within the populations of single species of Deep Lake haloarchaea. Almost identical VapBC TA systems were identified within BREX gene clusters of Hrr. lacusprofundi and DL31 (Figure 3.8) and a further VapBC system was found next to the type I-D CRISPR system of Hrr. lacusprofundi. The VapC protein of the latter was detected in the metaproteome, indicating that this system is functional. VapBC is a TA system present in all three domains of life (Arcus et al., 2011). VapB represents the inhibitory anti-toxin of VapC, a ribonuclease toxin, and co-expression leads to the formation of a tight VapBC complex and hence inactivation of the toxin. However, the toxin is more stable than the labile anti-toxin and degradation of the anti-toxin (e.g. as a stress response or when no longer expressed) activates the toxin, leading to dormancy and cell death (Gerdes, 2000; Yamaguchi et al., 2011). TA systems were first discovered on plasmids where they promoted plasmid maintenance in host cells during cell division (Gerdes et al., 1986).
By a similar mechanism, chromosomally encoded TA systems could help maintain closely linked genes, especially in regions prone to variation and HGT (Gerdes et al., 2005). In the genome of the acidothermophilic archaeon Acidianus hospitalis W1, TA systems were found within and adjacent to multiple CRISPR-Cas loci, presumably to maintain the loci on the chromosome (You et al., 2011). In addition, there are also examples of TA systems that directly induce cell death upon virus infection to prevent virus propagation (Samson et al., 2013). Our data indicate that within the community of Deep Lake haloarchaea, TA systems are employed against virus infection and are used to distribute and maintain anti-viral defence systems like BREX and CRISPR.

Chapter 4

Analyses of intra-species variation within Deep Lake haloarchaea

Co-authorship statement

Sections from this chapter have been published in the same manuscript as described in the Co-authorship statement of Chapter 2. I performed all work specific for this chapter.

4.1 Abstract Hht. litchfieldiae strain tADL represents the dominant member of the Deep Lake microbial community. Analysis of the Deep Lake metagenome previously indicated the presence of an additional strain, here referred to as tADL-II, within the Hht. litchfieldiae population. The Deep Lake metaproteome contained a number of proteins derived from tADL-II, allowing for a comparative analysis with tADL. Detected proteins and protein abundances were indicative of possible physiological distinctions between the two strains. Besides tADL and tADL-II, the metaproteome contained signatures of Hht. litchfieldiae phylotypes that exhibited variation in key metabolic functions. These data, together with further analysis of the Deep Lake metagenome, indicated that the population structure of Hht. litchfieldiae is distinct from the populations of other abundant haloarchaeal species in Deep Lake.

4.2 Introduction From metagenomic sequencing data of Deep Lake, a large number of metagenomic contigs were assembled that could be assigned to the genomes of the three most abundant haloarchaeal species (DeMaere et al., 2013). In a GC/read-depth plot, the corresponding contigs formed discrete clusters (Figure 4.1). In addition to clusters assigned to Hht. litchfieldiae strain tADL, DL31 and Hrr. lacusprofundi, a fourth ‘unknown’ cluster of contigs was observed. This cluster comprised 52 large contigs (> 15 kb) with a total length of 1.89 Mb (DeMaere et al., 2013). Average nucleotide identity (ANI), tetranucleotide usage deviation (TUD) and mapping of the contigs using CONTIGuator (Galardini et al., 2011) revealed high similarity of the additional cluster of contigs to Hht. litchfieldiae strain tADL. No 16S rRNA gene sequence was present on these contigs and SSU rRNA gene sequencing data of Deep Lake indicated that there was no other abundant species present. The cluster of 52 contigs was therefore thought to represent an additional strain of Hht. litchfieldiae and was referred to as the “tADL-related 5th genome” (DeMaere et al., 2013). In this study the organism representing the “tADL-related 5th genome” is called tADL-II. In addition to the discovery of tADL-II contigs, variation in Deep Lake haloarchaea was identified through FR of metagenomic reads to the genomes of the isolate species. This allowed the identification of fixed single nucleotide polymorphisms (SNPs; present in >90% of the population) and genomic regions with low FR coverage, which were indicative of the presence of different phylotypes (DeMaere et al., 2013). This chapter is based on the metaproteomic detection of variant proteins, indicative of strain variation within the populations of the Deep Lake isolate species.
While variation of cell surface proteins was previously described as a host response to viral predation pressure (see 3.5.2), this chapter describes variation between tADL and tADL-II and variation linked to key metabolic functions. Furthermore, the metaproteome data were complemented by metagenome analysis in a first attempt to describe community structures of Deep Lake haloarchaea.

Figure 4.1 Initial identification of tADL-II contigs. The figure shows GC/read-depth plots of metagenomic contigs; plots and parts of the figure legends were taken from DeMaere et al. (2013). Metagenomic contigs larger than 15 kb were plotted according to their GC content (y-axis) and mean metagenomic read-depth (x-axis). Contigs possessing sufficient BLASTN scores (coverage > 90%, e-value < 10^-10) to the isolate species were labelled in color. (A) Clusters of contigs assigned to Hht. litchfieldiae strain tADL (red), DL31 (blue) and Hrr. lacusprofundi (green). An additional dense cluster is formed by unlabelled contigs (yellow) corresponding to tADL-II. (B) Clusters of metagenomic contigs with read-depth < 100. Three clusters were identified for DL31 (green circles), Hrr. lacusprofundi (red squares) and tADL-II (cyan crosses).
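
The selection and labelling criteria given in the legend can be written down as a short Python sketch. The function names and the idea of passing BLASTN coverage and e-value as plain numbers are assumptions made for illustration; the original plots were produced by DeMaere et al. (2013) and their implementation is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """GC content of a nucleotide sequence as a percentage of its A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


def is_labelled(length_bp: int, blast_coverage_pct: float, blast_evalue: float,
                min_length: int = 15_000, min_coverage: float = 90.0,
                max_evalue: float = 1e-10) -> bool:
    """Criteria from the figure legend: contig larger than 15 kb, BLASTN
    coverage > 90% and e-value < 1e-10 against an isolate genome."""
    return (length_bp > min_length
            and blast_coverage_pct > min_coverage
            and blast_evalue < max_evalue)


# Hypothetical values for two contigs.
print(gc_content("GGCCAATT"))           # percentage of G and C bases
print(is_labelled(20_000, 95.0, 1e-50))  # passes all three criteria
print(is_labelled(12_000, 95.0, 1e-50))  # too short to be plotted
```

Each plotted contig would then contribute one point, with `gc_content` on the y-axis and its mean read-depth on the x-axis.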

4.3 Materials and Methods Deep Lake biomass sampling, metaproteomic analyses and functional and taxonomic annotation of proteins were performed as described in 2.3.2 and 2.3.3.

4.3.1 BLAST analysis of the Deep Lake metagenome protein database All ORFs from annotated metagenome contigs (total of 38071 ORFs) were used as query sequences and searched against a custom designed database using the standalone BLAST+ 2.2.30 software (Camacho et al., 2009) on the Linux computational cluster Katana (supported by the Faculty of Science, UNSW Australia). Using the blastdb_aliastool, a virtual BLAST database was constructed out of the nr database, comprising all entries from the taxonomic groups Bacteria, Archaea (without the Deep Lake isolate species), Viruses and Dunaliella. At the time of analysis, Hht. litchfieldiae strain tADL was not part of the nr database. Therefore, and also to easily access the IMG locus tags for each protein, the sequences of the Deep Lake isolate species were provided in a separate database. The BLAST search was carried out against the combination of both databases (i.e. the virtual subset of nr and the Deep Lake isolates).

4.3.2 Mapping of tADL-II contigs and creation of circular plots tADL-II contigs were mapped onto the Hht. litchfieldiae tADL genome using the CONTIGuator web server (Galardini et al., 2011) and by manual assignment in Artemis (Carver et al., 2012). Circular plots of tADL-II contigs mapped onto the tADL genome and highlighted genes with detected protein were created using DNAPlotter (Carver et al., 2009) in Artemis (Carver et al., 2012).

4.3.3 Phylogenetic analyses of tADL-II ribosomal proteins The set of species included in the analyses comprised all finished haloarchaeal genomes from IMG (for Haloferax mediterranei there were two entries and only the one with the IMG Submission ID 40688 was included in the analysis). Also included in the analyses were the genome of Hht. litchfieldiae strain tADL (currently deposited as draft genome on IMG) and the unfinished genomes of two Halonotius strains, since Halonotius represents the most closely related genus to Hht. litchfieldiae (based on 16S rRNA gene sequence) (Mou et al., 2012). Ribosomal protein sequences were retrieved using the respective tADL-II sequences as query in a BLAST search on the IMG web site. Phylogenetic analyses of ribosomal protein sequences were performed using MEGA6 (Tamura et al., 2013). Alignments of protein sequences were created using MUSCLE (Edgar, 2004), retaining all positions in the alignments. Phylogenetic trees were constructed using the Maximum Likelihood method based on the JTT matrix-based model (Jones et al., 1992) with 1000 bootstrap replicates. The trees were drawn to scale, with branch lengths measured in the number of substitutions per site.
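
The bootstrap step can be illustrated with a minimal Python sketch that resamples alignment columns with replacement. This shows only the resampling principle; MEGA6 performs the resampling and the Maximum Likelihood tree inference internally, and the toy alignment, the function name and the fixed random seed are assumptions made for illustration.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Build one bootstrap replicate by sampling alignment columns with
    replacement; the same columns are drawn for every sequence."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


# Hypothetical toy alignment (gaps retained, as all positions were kept).
alignment = {
    "tADL": "MKV-LSA",
    "tADL-II": "MKVALSG",
    "Halonotius": "MRVALTG",
}
rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(len(replicates), replicates[0])
```

In a full analysis a tree would be inferred from each replicate, and the bootstrap support of a branch is the fraction of replicate trees that contain it.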

4.4 Results

4.4.1 Metaproteomic and metagenomic signatures of tADL-II The Deep Lake metaproteome contained 163 variants assigned to Hht. litchfieldiae, which is in strong contrast to DL31 and Hrr. lacusprofundi with only four and three detected variants, respectively (Table 3.1). One hundred and six Hht. litchfieldiae variants were encoded on metagenome contigs previously assigned to tADL-II (DeMaere et al., 2013). Inspection of the metagenome contigs encoding the remaining variant proteins revealed 22 contigs which each contained multiple ORFs with ~70–99% amino acid sequence identity to tADL sequences (lower for some cell surface proteins) and largely conserved gene synteny. These are characteristics shared with the contigs previously assigned to tADL-II. GC/read-depth binning of the 22 new contigs together with the original set of 52 contigs showed an overlapping distribution (Figure 4.2). Mean read-depth and GC content were similar between the two sets of contigs, with 39 and 63% for the original set of 52 contigs and 35 and 61% for the additional 22 contigs. The original GC/read-depth binning was limited to contigs > 15 kb (DeMaere et al., 2013). Out of the 22 new contigs, 18 were shorter than 15 kb and were therefore not included in the initial analysis. Alignments of the 52 original contigs with the 22 novel contigs revealed that ten of the novel contigs overlapped (at their ends) with the original contigs; hence they were joined together into larger contigs. The 22 novel tADL-II contigs encoded 38 of the detected variants including 11 ribosomal proteins with 88–97% sequence identity to tADL.
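
Joining two contigs that overlap at their ends can be sketched in Python as follows. The text does not state which tool or criteria were used for the alignments, so the exact suffix-prefix matching, the `min_overlap` parameter and the toy sequences are all assumptions; real contig ends would normally be compared with an aligner that tolerates mismatches.

```python
from typing import Optional


def end_overlap(left: str, right: str, min_overlap: int = 20) -> int:
    """Length of the longest suffix of `left` that exactly equals a prefix of
    `right`, or 0 if no overlap of at least `min_overlap` bases exists."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def join_contigs(left: str, right: str, min_overlap: int = 20) -> Optional[str]:
    """Merge two contigs across an end overlap; return None if they do not overlap."""
    k = end_overlap(left, right, min_overlap)
    return left + right[k:] if k else None


# Hypothetical toy contigs sharing the six bases "CCGGGG" at their ends.
original = "AAAACCCCGGGG"
novel = "CCGGGGTTTT"
print(join_contigs(original, novel, min_overlap=4))
```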

Figure 4.2 GC/read-depth plot of tADL-II contigs. tADL-II contigs were plotted according to their GC content and metagenomic read-depth coverage. Black diamonds, 22 new contigs; grey squares, 52 original contigs.

Overall, 146 proteins could be assigned to tADL-II (Table 4.1), accounting for ~22% of all proteins assigned to Hht. litchfieldiae (655). For 122 locus tags, proteins from both strains (tADL and tADL-II) were detected, and the tADL proteins were ~3.5 times (median) more abundant than the tADL-II ones. Only twelve tADL-II proteins were more abundant than the respective tADL proteins, including a branched-chain amino acid (BCAA) ABC transporter substrate-binding protein (Table 4.1). Average sequence identity of detected tADL-II proteins to tADL was 88%. Cell surface (63%) and transporter (82%) proteins showed the highest degree of variation compared to more conserved transcription (94%) or cell division (96%) proteins (Figure 4.3). Two proteins were unique to tADL-II with no encoded homologue in the tADL genome: a nitrate/sulfonate/bicarbonate ABC transporter substrate-binding protein and a cell surface protein (Table 4.1).
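
The two summary figures in this paragraph follow from simple arithmetic, sketched below in Python. The protein counts (146 and 655) are taken from the text; the paired spectrum counts are hypothetical and serve only to show how a median abundance ratio is obtained, since the per-protein tADL counts are listed in Appendix A rather than here.

```python
from statistics import median
from typing import Iterable, Tuple


def percent_of(part: float, total: float) -> float:
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total


def median_abundance_ratio(pairs: Iterable[Tuple[float, float]]) -> float:
    """Median of tADL / tADL-II spectrum count ratios over paired proteins."""
    return median(tadl / tadl2 for tadl, tadl2 in pairs if tadl2 > 0)


# 146 tADL-II proteins out of 655 proteins assigned to Hht. litchfieldiae.
print(round(percent_of(146, 655), 1))

# Hypothetical (tADL, tADL-II) spectrum counts for three locus tags.
example_pairs = [(35.0, 10.0), (7.0, 2.0), (4.0, 4.0)]
print(median_abundance_ratio(example_pairs))
```

The median is used rather than the mean because a few very abundant proteins would otherwise dominate the ratio.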

Figure 4.3 Variation between tADL and tADL-II. The chart shows the average sequence identity between proteins in different functional categories from Hht. litchfieldiae tADL and tADL-II. Categories are ranked based on their average sequence identity, highlighting the extent of variation within cell surface proteins.

In 37 cases, proteins from tADL and tADL-II were identified through the same set of peptides with no peptide covering a region with variation between them. In these cases metaproteomics did not allow unambiguous identification of which protein was present in the sample. It could have been either the tADL or the tADL-II protein, or both of them. Therefore these identifications were grouped into protein families but they could still be confidently assigned to Hht. litchfieldiae.
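
The grouping logic can be sketched in Python: a detected peptide discriminates between the two strains only if it occurs in one protein sequence and not in the other. The function name, the returned labels and the toy sequences are hypothetical, and the sketch ignores details of real protein inference such as missed cleavages and isobaric residues.

```python
from typing import Iterable


def assign_by_peptides(peptides: Iterable[str], tadl_seq: str, tadl2_seq: str) -> str:
    """Classify an identification as 'tADL', 'tADL-II', 'both' or 'ambiguous'
    according to whether any peptide is unique to one of the two sequences."""
    peptides = list(peptides)
    tadl_unique = any(p in tadl_seq and p not in tadl2_seq for p in peptides)
    tadl2_unique = any(p in tadl2_seq and p not in tadl_seq for p in peptides)
    if tadl_unique and tadl2_unique:
        return "both"
    if tadl_unique:
        return "tADL"
    if tadl2_unique:
        return "tADL-II"
    return "ambiguous"  # grouped into one protein family


# Hypothetical homologues differing at a single position (I versus T).
tadl_protein = "MKTAYIAKQRQISFVKSHFSRQ"
tadl2_protein = "MKTAYIAKQRQTSFVKSHFSRQ"

print(assign_by_peptides(["MKTAYIAK", "SHFSR"], tadl_protein, tadl2_protein))
print(assign_by_peptides(["QISFVK"], tadl_protein, tadl2_protein))
print(assign_by_peptides(["QISFVK", "QTSFVK"], tadl_protein, tadl2_protein))
```

The first call corresponds to the 37 cases described above: both peptides are shared, so the identification can only be assigned at the level of Hht. litchfieldiae.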

Table 4.1 Proteins detected for tADL-II. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. Highlighted in purple are tADL-II proteins with a higher spectrum count than the respective tADL proteins.

Protein annotation | Protein # | Matching tADL locus tag | Sequence identity (%) | Spectrum count

Amino acid metabolism
3-isopropylmalate dehydrogenase (LeuB) | 1084 | halTADL_0366 | 93 | 0.4
Agmatinase (SpeB) | 374 | halTADL_1131 | 95 | 13
Glutamate dehydrogenase (GdhA) | 1013 | halTADL_1757 | 81 | 0.6
Aspartate kinase (LysC) | 905 | halTADL_1916 | 90 | 1.3
D-3-phosphoglycerate dehydrogenase (SerA) | 236 | halTADL_2045 | 93 | 28
Phosphoserine phosphatase (SerB) | 309 | halTADL_2046 | 96 | 19
Anthranilate phosphoribosyltransferase (TrpD) | 1049 | halTADL_3066 | 92 | 0.6
Aspartate aminotransferase (AspB) | 619 | halTADL_3081 | 97 | 4.8
Glutamine synthetase (GS) (GlnA) | 319 | halTADL_3423 | 97 | 18

Carbohydrate metabolism
Alpha-amylase (glycosyl hydrolase, family 13) | 193 | halTADL_0142 | 84 | 35
Glucose-1-phosphate thymidylyltransferase (RfbA, RffH) | 421 | halTADL_3353 | 93 | 11

Cell division
Cell division protein FtsA | 467 | halTADL_0130 | 96 | 9.2
VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 927 | halTADL_2740 | 95 | 1.2

Cell surface
Invasin/intimin cell-adhesion domain protein (Sec signal) | 577 | No match to tADL – unique to tADL-II | - | 5.5
Archaellin FlaA or FlaB | 95 | halTADL_0078 | 75 | 59
Hypothetical protein (TAT signal) | 490 | halTADL_0878 | 50 | 8.3
Adhesion pilin (PilA) | 666 | halTADL_1387 | 66 | 3.9
Hypothetical protein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 126 | halTADL_1403 | 29 | 49
Archaellin FlaA or FlaB | 786 | halTADL_1544 | 75 | 2.3
Hypothetical protein (TAT signal) | 370 | halTADL_1761 | 40 | 14
Hypothetical protein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 361 | halTADL_1765 | 76 | 14
Archaellin FlaA or FlaB | 59 | halTADL_1812 | 77 | 94
Archaellin FlaA or FlaB | 12 | halTADL_1813 | 76 | 239

Central carbon metabolism
Pyruvate:ferredoxin , alpha subunit (PorA) | 691 | halTADL_0382 | 93 | 3.4
Phosphoenolpyruvate carboxylase (Ppc) | 572 | halTADL_0401 | 94 | 5.9
Fructose-1,6-bisphosphate aldolase, class I (FbaB) | 733 | halTADL_0575 | 92 | 2.8
Phosphoglycerate kinase (Pgk) | 833 | halTADL_0816 | 94 | 1.7
Glyceraldehyde-3-phosphate dehydrogenase (NAD(P)+-dependent), type I (Gap) | 591 | halTADL_0817 | 90 | 5.2
Phosphoenolpyruvate (PEP) synthase (Pps) | 451 | halTADL_1011 | 79 | 9.8
Acetate : CoA (Acs) | 449 | halTADL_1017 | 92 | 9.8
Triosephosphate isomerase (TpiA) | 249 | halTADL_2532 | 89 | 26
Enolase (phosphopyruvate hydratase) (Eno) | 391 | halTADL_2780 | 90 | 12
Aconitate hydratase (AcnA) | 401 | halTADL_2902 | 96 | 12
Pyruvate kinase (Pyk) | 695 | halTADL_3014 | 91 | 3.4
Fructose 1,6-bisphosphate aldolase (multifunctional) | 420 | halTADL_3234 | 96 | 11

DNA replication & repair
Methylated-DNA-[protein]-cysteine S-methyltransferase (Ogt) | 1031 | halTADL_0579 | 82 | 0.6
DNA polymerase sliding clamp subunit (PCNA homolog) | 244 | halTADL_1713 | 97 | 27
DNA repair and recombination protein RadA | 284 | halTADL_2135 | 90 | 22

Energy conservation
Cytochrome c oxidase, subunit II (CoxB) | 959 | halTADL_1060 | 82 | 0.9
Inorganic pyrophosphatase (Ppa) | 937 | halTADL_1644 | 88 | 1.0
ATP synthase, K subunit (AtpK) | 293 | halTADL_1940 | 80 | 21
ATP synthase, beta subunit (AtpB) | 406 | halTADL_1945 | 92 | 12
Ferredoxin | 42 | halTADL_2137 | 96 | 109
NADPH-dependent F420 reductase | 1096 | halTADL_2320 | 91 | 0.4

Glycerol metabolism
Glycerol-3-phosphate dehydrogenase (GlpA) | 327 | halTADL_2244 | 87 | 17
Glycerol kinase (GlpK) | 381 | halTADL_2249 | 96 | 13
Dihydroxyacetone (DHA) kinase, L subunit (DhaL) | 982 | halTADL_2259 | 92 | 0.9
Dihydroxyacetone (DHA) kinase, K subunit (DhaK) | 140 | halTADL_2260 | 94 | 45

Hypothetical
Hypothetical protein | 431 | halTADL_0015 | 94 | 10
Nucleic acid-binding/OB-fold/TRAM domain protein | 267 | halTADL_0109 | 93 | 24
DUF964 / YheA/YmcA domain | 777 | halTADL_0133 | 88 | 2.4
DUF655 (predicted RNA-binding domain) | 428 | halTADL_0183 | 89 | 10
Hypothetical protein | 887 | halTADL_0395 | 85 | 1.5
Predicted RNA-binding protein containing KH domain, possibly ribosomal protein | 1088 | halTADL_0545 | 81 | 0.4
DUF827 / WEB family domain | 183 | halTADL_0555 | 77 | 36
SelT/SelW/SelH selenoprotein domain / Rdx domain | 979 | halTADL_1063 | 81 | 0.9
ThiJ/PfpI domain-containing protein | 199 | halTADL_1769 | 83 | 33
Hypothetical protein (DUF4382) | 215 | halTADL_2062 | 87 | 31
Hypothetical protein | 307 | halTADL_2560 | 89 | 19
Hypothetical protein | 629 | halTADL_2576 | 95 | 4.5
Hypothetical protein | 1100 | halTADL_3036 | 79 | 0.4
Nucleic acid binding OB-fold tRNA/helicase-type | 150 | halTADL_3218 | 86 | 42
DUF4013 (4 x transmembrane domains) | 727 | halTADL_3238 | 73 | 3.0

Metabolism (other)
Nucleoside-diphosphate kinase (Ndk) | 641 | halTADL_0169 | 84 | 4.4
Orotate phosphoribosyltransferase (PyrE) | 749 | halTADL_0398 | 90 | 2.6
Thiamine-phosphate pyrophosphorylase ThiE | 1087 | halTADL_0473 | 74 | 0.4
Ribonucleoside-diphosphate reductase, alpha subunit (NrdE) | 965 | halTADL_0884 | 91 | 0.9
Oxidoreductase FAD-binding domain | 672 | halTADL_1014 | 80 | 3.8
FAD-dependent pyridine nucleotide-disulfide oxidoreductase | 650 | halTADL_2528 | 90 | 4.2
Rhodanese-like protein | 258 | halTADL_2750 | 92 | 25
Adenine phosphoribosyltransferase (Apt) | 985 | halTADL_2952 | 95 | 0.9
Dodecin | 238 | halTADL_3198 | 95 | 28

Oxidative stress
Ferritin Dps family protein | 20 | halTADL_1068 | 92 | 170
UspA domain-containing protein | 851 | halTADL_1904 | 76 | 1.6
Glutaredoxin | 1014 | halTADL_2104 | 94 | 0.6
UspA domain-containing protein | 910 | halTADL_2276 | 75 | 1.2
Thioredoxin | 665 | halTADL_2563 | 91 | 3.9
Manganese/iron superoxide dismutase (Sod) | 164 | halTADL_2687 | 95 | 39

Protein chaperones
Group II chaperonin (thermosome) | 135 | halTADL_0092 | 91 | 45
Hsp20-type chaperone | 335 | halTADL_0114 | 93 | 17
Peptidylprolyl isomerase FKBP-type | 77 | halTADL_0251 | 89 | 81
Chaperone protein DnaK | 212 | halTADL_0595 | 94 | 32
Heat shock protein Hsp20 | 217 | halTADL_0724 | 95 | 31
Prefoldin, beta subunit (PfdB) | 177 | halTADL_1114 | 95 | 37
Group II chaperonin (thermosome) | 19 | halTADL_1928 | 95 | 200
Prefoldin, alpha subunit (PfdA) | 148 | halTADL_2197 | 95 | 43
Peptidylprolyl isomerase, cyclophilin type | 152 | halTADL_2273 | 91 | 42
Peptidylprolyl isomerase, FKBP-type | 827 | halTADL_3026 | 90 | 1.8
Group II chaperonin (thermosome) | 6 | halTADL_3279 | 95 | 387

Proteolysis
Membrane metalloprotease (peptidase M50) | 640 | halTADL_0323 | 79 | 4.4
Proteasome alpha subunit (PsmA) | 50 | halTADL_2681 | 95 | 102
Proteasome beta subunit (PsmB) | 441 | halTADL_2911 | 92 | 10

Ribosomes
Ribosomal protein L11 | 705 | halTADL_0103 | 94 | 3.2
Ribosomal protein L1 | 246 | halTADL_0105 | 99 | 26
Acidic ribosomal protein P0-like protein | 478 | halTADL_0106 | 94 | 8.7
Ribosomal protein L7Ae | 93 | halTADL_0166 | 98 | 63
Ribosomal protein S28e | 922 | halTADL_0167 | 95 | 1.2
Ribosomal protein S7 | 304 | halTADL_0623 | 87 | 19
Ribosomal protein S6e | 612 | halTADL_2119 | 91 | 4.9
Ribosomal LX protein | 471 | halTADL_2196 | 97 | 9.0
Ribosomal protein S4 | 648 | halTADL_2772 | 93 | 4.2
Ribosomal protein L18e | 713 | halTADL_2775 | 93 | 3.0
Ribosomal protein L13 | 607 | halTADL_2776 | 96 | 5.0
Ribosomal protein S3Ae | 592 | halTADL_3142 | 98 | 5.2
Ribosomal protein S8e | 762 | halTADL_3327 | 89 | 2.5
Ribosomal protein L30P | 456 | halTADL_3366 | 94 | 9.7
Ribosomal protein L32e | 268 | halTADL_3370 | 91 | 24
Ribosomal protein L6P | 834 | halTADL_3371 | 90 | 1.7
Ribosomal protein S8 | 647 | halTADL_3372 | 97 | 4.3
Ribosomal protein L5 | 674 | halTADL_3374 | 90 | 3.7
Ribosomal protein S4e | 338 | halTADL_3375 | 90 | 16
Ribosomal protein L29 | 194 | halTADL_3380 | 92 | 34
Ribosomal protein S19 | 643 | halTADL_3383 | 97 | 4.3
Ribosomal protein L23 | 463 | halTADL_3385 | 88 | 9.3
Ribosomal protein L4P | 620 | halTADL_3386 | 89 | 4.8
Ribosomal protein L3 | 1001 | halTADL_3387 | 93 | 0.8

Transcription
TATA-box-binding protein (Tbp) | 523 | halTADL_0042 | 99 | 7.4
DNA-directed RNA polymerase subunit H (RpoH) | 639 | halTADL_0616 | 84 | 4.4
DNA-directed RNA polymerase subunit A (RpoA1) | 589 | halTADL_0619 | 96 | 5.3
DNA-directed RNA polymerase subunit A2 (RpoA2) | 360 | halTADL_0620 | 93 | 14
TATA-box-binding protein (Tbp) | 966 | halTADL_1732 | 96 | 0.9
DNA-directed RNA polymerase subunit D (RpoD) | 499 | halTADL_2774 | 91 | 7.9
DNA-directed RNA polymerase subunit N (RpoN) | 842 | halTADL_2778 | 97 | 1.6

Transcriptional regulators
Transcriptional regulator, AsnC family | 921 | halTADL_0058 | 93 | 1.2
Phosphate uptake regulator, PhoU | 699 | halTADL_1186 | 92 | 3.3
Transcriptional regulator, RosR (PadR family) | 376 | halTADL_1645 | 92 | 14
Transcriptional regulator, XRE family | 926 | halTADL_2533 | 70 | 1.2
Phosphate uptake regulator, PhoU | 908 | halTADL_3204 | 93 | 1.3
Transcriptional regulator, AsnC family | 172 | halTADL_3422 | 96 | 38

Transduction
Response regulator receiver domain + HalX domain | 229 | halTADL_0055 | 96 | 29
Response regulator receiver protein | 566 | halTADL_1808 | 94 | 6.0
KaiC domain | 924 | halTADL_1815 | 94 | 1.2
Chemotaxis signal transduction protein CheW | 925 | halTADL_1838 | 91 | 1.2

Translation (other)
Translation elongation factor aEF-2 (FusA) | 89 | halTADL_0647 | 95 | 66
Translation initiation factor 2, alpha subunit (a/eIF2-alpha) (Eif2a) | 1090 | halTADL_0923 | 91 | 0.4
Translation initiation factor 2, beta subunit (a/eIF2-beta) (Eif2b) | 545 | halTADL_2337 | 95 | 6.7
Methionyl-tRNA synthetase (MetG) | 611 | halTADL_3069 | 87 | 4.9
Elongation factor 1-beta (aEF-1beta) (Ef1b) | 454 | halTADL_3453 | 94 | 9.7

Transport
Phosphate ABC transporter solute-binding protein (PstS) | 85 | halTADL_1182 | 51 | 71
ABC-type antimicrobial peptide transport system, permease component | 693 | halTADL_1613 | 88 | 3.4
Iron ABC transporter solute-binding protein | 856 | halTADL_1788 | 83 | 1.6
Carbohydrate ABC transporter solute-binding protein | 247 | halTADL_2761 | 84 | 26
Carbohydrate ABC transporter ATPase | 1099 | halTADL_2764 | 88 | 0.4
Branched-chain amino acid ABC transporter solute-binding protein | 202 | halTADL_2916 | 88 | 33
K+ uptake system, TrkA subunit | 1048 | halTADL_3061 | 89 | 0.6
Nitrate/sulfonate/bicarbonate ABC transporter solute-binding protein | 697 | No match to tADL – unique to tADL-II | - | 3.3

4.4.2 Phylogenetic analyses of detected tADL-II ribosomal proteins

The amino acid sequences of detected tADL-II ribosomal proteins L3, L4, L23 and S19 were used for phylogenetic analysis of tADL-II together with other haloarchaeal species (Figure 4.4). In three of the constructed phylogenetic trees, tADL-II and tADL formed a discrete cluster. In one instance (Figure 4.4B) the tADL-II protein was more similar to those from Halonotius species than to the tADL protein.


Figure 4.4 Phylogenetic trees of ribosomal proteins L3 (A), L4 (B), L23 (C) and S19 (D). Optimal trees were constructed using the Maximum Likelihood method. For all trees the respective Sulfolobus islandicus sequence was used as outgroup. The percentage of trees in which the associated taxa clustered together is shown next to the branches (based on bootstrapping with 1000 replicates). Bootstrap values <50% are not shown. All positions of the respective alignments were used for calculating the trees; the numbers of positions were 359 (A), 266 (B), 89 (C) and 142 (D). The trees were drawn to scale, with branch lengths measured in the number of substitutions per site.

4.4.3 Signatures of strain variation within the Deep Lake metagenome

After learning that the Hht. litchfieldiae population in Deep Lake comprised two different and rather abundant strains (tADL and tADL-II), the question arose whether this was also the case for the populations of DL31 and Hrr. lacusprofundi. Hence, all predicted protein sequences from the Deep Lake metagenome were used as queries in a BLAST search against a database comprising all bacterial, archaeal, viral and Dunaliella protein sequences from the NCBI nr database plus all the protein sequences from the four Deep Lake isolate species. The best match for each metagenome-encoded protein sequence was recorded and those matching to tADL, DL31 and Hrr. lacusprofundi proteins are shown in Table 4.2. tADL proteins were matched by almost twice as many metagenome-encoded proteins (6282) as either DL31 (3454) or Hrr. lacusprofundi (3401). However, only 3057 different tADL proteins were matched by the 6282 metagenome-encoded proteins. The majority of tADL proteins (1722) were matched by two different metagenome-encoded proteins: one matching with 100% sequence identity and a second one matching with an average of 84% sequence identity. It is likely that the 100% matching metagenome-encoded proteins were derived from tADL, while those matching with lower sequence identity came from tADL-II. In addition, 571 tADL proteins were matched by three or more metagenome-encoded proteins, which is indicative of additional phylotypes with variation in the respective genes. The data further suggested that the populations of DL31 and Hrr. lacusprofundi are each rather uniform concerning the majority of their encoded proteins; this is indicated by the single matching metagenome-encoded protein found for most of the DL31 (2668) and Hrr. lacusprofundi (2541) proteins (Table 4.2). Average sequence identity between these 2668 and 2541 metagenome-encoded proteins and DL31 and Hrr. lacusprofundi, respectively, was 99.6%. Variation in the populations of DL31 and Hrr. lacusprofundi appeared to be limited to the relatively small number of protein-encoding genes which were matched by two or more metagenome-encoded proteins (Table 4.2). It should be noted that the presence of low abundance strains could not be investigated using this approach, since it is unlikely that low abundance strains were assembled into the metagenome database.

Table 4.2 Metagenome-encoded proteins matching the Deep Lake isolate species.

Number of matching metagenome-encoded proteins per isolate protein | Number of proteins (tADL) | Number of proteins (DL31) | Number of proteins (Hrr. lacusprofundi)
1 | 764 | 2668 | 2541
2 | 1722 | 291 | 282
3 or more | 571 | 63 | 83
Total number of isolate proteins matched by a metagenome-encoded protein | 3057 | 3022 | 2906
Total number of matching metagenome-encoded proteins | 6282 | 3454 | 3401
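The tallies in Table 4.2 follow from counting, for each isolate protein, how many metagenome-encoded proteins had it as their best match. The counting step can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the thesis: it assumes the BLAST results have already been reduced to one (metagenome protein, isolate protein) pair per query, and the function name and toy identifiers are hypothetical.

```python
from collections import Counter


def bin_best_hits(best_hits):
    """Bin isolate proteins by how many metagenome proteins best-match them.

    best_hits: iterable of (metagenome_protein_id, isolate_protein_id) pairs,
    one pair per metagenome-encoded protein (its single best match).
    Returns counts for the bins used in Table 4.2.
    """
    hits_per_isolate_protein = Counter(isolate for _, isolate in best_hits)
    bins = {"1": 0, "2": 0, "3 or more": 0}
    for n in hits_per_isolate_protein.values():
        if n == 1:
            bins["1"] += 1
        elif n == 2:
            bins["2"] += 1
        else:
            bins["3 or more"] += 1
    return bins


# Toy input with hypothetical identifiers
toy_hits = [
    ("mg_1", "iso_A"),
    ("mg_2", "iso_B"), ("mg_3", "iso_B"),
    ("mg_4", "iso_C"), ("mg_5", "iso_C"), ("mg_6", "iso_C"),
]
toy_bins = bin_best_hits(toy_hits)

# Consistency check on the tADL column of Table 4.2
tadl_bins = {"1": 764, "2": 1722, "3 or more": 571}
tadl_isolate_proteins = sum(tadl_bins.values())
# Lower bound on matching metagenome proteins ("3 or more" counted as exactly 3)
tadl_min_metagenome_proteins = 764 * 1 + 1722 * 2 + 571 * 3
```

The three bins of the tADL column sum to the 3057 matched isolate proteins, and the lower bound of 5921 matching metagenome-encoded proteins is compatible with the reported total of 6282.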

4.4.4 Further variation in the Deep Lake metaproteome

In addition to variants that were derived from tADL-II, the metaproteome contained variants which were encoded on metagenome contigs with multiple ORFs with 100% sequence identity to tADL plus some ORFs (typically one) with 97-99% identity. These comprised a total of eight variants; six of them had 99% sequence identity (five containing a single SNP plus one with a 3 nt deletion) and two had 97% sequence identity, and all had neighbouring genes with 100% sequence identity and conserved gene synteny with tADL. These variants were therefore assigned to tADL (Table 4.3). One of the tADL variants was an ABC transporter phosphate-binding protein PstS (halTADL_2155) which arose from a previously identified SNP (DeMaere et al., 2013). The SNPs characterized in the previous study represented ≥ 90% of the population (DeMaere et al., 2013), indicating that the SNP-containing PstS was the dominant form in the population. For the same ABC transporter phosphate-binding protein PstS, two more variants (88% and 93% sequence identity) were identified in the metaproteome. Both were encoded on contigs showing characteristics of tADL-II and could not be confidently assigned to either of the strains; they could also originate from an additional strain. Hence they were treated as variants of Hht. litchfieldiae (without any strain specification).


Further variants of Hht. litchfieldiae comprised an α-amylase with 94% sequence identity to tADL (halTADL_0142). The only other ORF on the contig matched tADL with 96% sequence identity and both ORFs had synteny with tADL. For the same α-amylase, a protein matching tADL with 100% sequence identity was also identified (and therefore assigned to tADL), as was one that matched tADL-II (84% identity to tADL). Multiple cell surface proteins were also annotated as Hht. litchfieldiae variants (Table 4.3). Only a small number of variants were detected for DL31 and Hrr. lacusprofundi and most of them represented cell surface proteins (Table 4.3). The metaproteome contained one further group of variants (Table 4.4). These variants were encoded on metagenome contigs which, other than the variant-encoding genes, did not resemble the isolate species; they contained multiple ORFs that were more similar to other haloarchaea. It was therefore not possible to unambiguously determine the taxonomic origin of these variants. This group of variants included three distinct glycerol kinases that all matched the same tADL glycerol kinase (halTADL_0681) with 98% sequence identity (Table 4.4).
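The reasoning used in this section to assign variants can be summarised as a small decision rule. The Python sketch below is a simplification for illustration only: the function name, its arguments and the exact ordering of the checks are assumptions, tADL-II variants are not covered, and the assignments in the thesis were based on inspection of the contigs rather than on this function.

```python
def assign_variant(variant_identity, neighbour_identities,
                   synteny_with_tadl, neighbours_match_other_haloarchaea):
    """Return a coarse label for a variant protein detected in the metaproteome.

    variant_identity: % amino acid identity of the variant to its tADL best match
    neighbour_identities: % identities to tADL of the other ORFs on the contig
    synteny_with_tadl: True if gene order around the variant is conserved with tADL
    neighbours_match_other_haloarchaea: True if the other ORFs on the contig
        are more similar to other haloarchaea than to the isolate
    """
    if neighbours_match_other_haloarchaea:
        return "uncertain taxonomic origin"
    all_identical = bool(neighbour_identities) and all(
        identity == 100 for identity in neighbour_identities
    )
    if all_identical and synteny_with_tadl and 97 <= variant_identity < 100:
        return "tADL variant"
    return "Hht. litchfieldiae variant (strain unassigned)"


# Three cases described in the text (neighbour counts are illustrative)
psts_snp = assign_variant(99, [100, 100, 100], True, False)
alpha_amylase = assign_variant(94, [96], True, False)
glycerol_kinase = assign_variant(98, [], False, True)
```

With these inputs the PstS SNP variant is labelled as a tADL variant, the 94% α-amylase as a strain-unassigned Hht. litchfieldiae variant, and the glycerol kinases as being of uncertain taxonomic origin, in line with Tables 4.3 and 4.4.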


Table 4.3 Variant proteins detected for Hht. litchfieldiae, tADL, DL31 and Hrr. lacusprofundi. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. SNP denotes variation caused by a single nucleotide polymorphism.

Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count | Additional notes

Hht. litchfieldiae – variants
α-amylase (glycosyl hydrolase, family 13) | 543 | halTADL_0142 | 94 | 6.8 | also detected for tADL and tADL-II
Adhesion pilin (PilA) | 750 | halTADL_0751 | 65 | 2.6
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 8 | halTADL_1043 | 51 | 338
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 18 | halTADL_1043 | 44 | 205
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 339 | halTADL_1043 | 50 | 16
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 356 | halTADL_1043 | 42 | 15
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 618 | halTADL_1043 | 34 | 4.8
Hypothetical protein (TAT signal) | 800 | halTADL_1047 | 59 | 2.1
SMC domain protein | 709 | halTADL_1458 | 62 | 3.1
Phosphate ABC transporter solute-binding protein (PstS) | 61 | halTADL_2155 | 88 | 92
Phosphate ABC transporter solute-binding protein (PstS) | 75 | halTADL_2155 | 93 | 82

tADL – variants
Phosphate ABC transporter solute-binding protein (PstS) | 13 | halTADL_2155 | 99 | 235 | SNP
Methylated-DNA-[protein]-cysteine S-methyltransferase (Ogt) | 273 | halTADL_0579 | 99 | 24 | SNP
Winged helix-turn-helix DNA-binding domain | 128 | halTADL_0044 | 99 | 48 | SNP
Hypothetical protein | 495 | halTADL_2296 | 99 | 8.0 | SNP
Hypothetical protein (transmembrane helix near N-terminal) | 852 | halTADL_2505 | 97 | 1.6
Ribosomal protein L1 | 78 | halTADL_0105 | 99 | 81
Methionyl-tRNA synthetase (MetG) | 325 | halTADL_3069 | 99 | 17 | SNP
Threonine synthase (ThrC) | 655 | halTADL_2266 | 97 | 4.0

DL31 – variants
RND superfamily / MMPL (mycobacterial membrane protein large) family protein | 29 | Halar_1791 | 99 | 134
Ribosomal protein L29 | 271 | Halar_2474 | 99 | 24
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 2 | Halar_0829 | 47 | 629
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 36 | Halar_0829 | 45 | 118

Hrr. lacusprofundi – variants
Hypothetical cell surface protein (TAT signal) | 27 | Hlac_0476 | 49 | 138
Archaellin FlaA or FlaB | 30 | Hlac_2557 | 38 | 133
Cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 223 | Hlac_2976 | 38 | 30

Table 4.4 Variants of uncertain taxonomic origin. The table lists detected proteins with high sequence identity to Hht. litchfieldiae, DL31 or Hrr. lacusprofundi encoded on contigs with neighbouring genes that best matched to other haloarchaeal species. Protein numbers are given according to Appendix A (ranked by the sum of the normalized total spectrum count). Sequence identity refers to the amino acid sequence identity of the detected protein to its best match (column denoted “Locus tag”) in a BLASTP search. Spectrum count shows the sum of the normalized total spectrum count across all 15 samples. Contig IDs are from the Deep Lake metagenome assemblies (Antarctic Lakes Metagenome: whole_lake.gbk at http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=AntLakMetagenome).

Protein annotation | Protein # | Locus tag | Sequence identity (%) | Spectrum count | Contig ID
Glycerol kinase (GlpK) | 191 | halTADL_0681 | 98 | 35 | 7180000420363
Glycerol kinase (GlpK) | 593 | halTADL_0681 | 98 | 5.2 | 7180000399115
Glycerol kinase (GlpK) | 586 | halTADL_0681 | 98 | 5.4 | 7180000397231
Transcriptional regulator, AsnC family | 47 | halTADL_1491 | 88 | 104 | 7180000456564
Hypothetical protein (Sec signal, Ig fold domain, C-terminal transmembrane helix) | 163 | halTADL_1042 | 73 | 39 | 7180000422580
Transcriptional regulator, AsnC family | 222 | halTADL_1491 | 80 | 30 | 7180000396862
Adhesion pilin (PilA) | 358 | halTADL_1885 | 33 | 15 | 7180000459513
Hypothetical protein (transmembrane helix near N-terminal) | 708 | halTADL_1615 | 70 | 3.1 | 7180000398936
Nucleic acid binding OB-fold tRNA/helicase-type (RPA32 homolog) | 747 | halTADL_3434 | 44 | 2.6 | 7180000455564
Response regulator receiver protein | 759 | halTADL_2696 | 65 | 2.5 | 7180000268900
FeS assembly protein SufB | 820 | halTADL_0973 | 95 | 1.8 | 7180000432870
Core histone | 903 | halTADL_1708 | 89 | 1.3 | 7180000446839
Winged helix-turn-helix DNA-binding domain | 909 | halTADL_0044 | 62 | 1.2 | 7180000394553
SecD/SecF/SecDF export membrane protein | 958 | halTADL_0787 | 51 | 0.9 | 7180000457612
TATA-box-binding protein (Tbp) | 980 | halTADL_1450 | 93 | 0.9 | 7180000443701
Hsp20-type chaperone | 795 | Halar_3162 | 47 | 2.1 | 7180000457612
Adhesion pilin (PilA) | 839 | Halar_2364 | 39 | 1.6 | 7180000242587
Hypothetical protein (Sec signal; 2 x PKD/chitinase domains) | 585 | Hlac_2824 | 34 | 5.4 | 7180000456692
Hypothetical protein: 3 x chitinase/PKD domains, C-terminal transmembrane helix | 809 | Hlac_2824 | 27 | 1.9 | 7180000434565
Carbohydrate ABC transporter solute-binding protein | 870 | Hlac_2862 | 72 | 1.5 | 7180000267581
VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 1106 | Hlac_2377 | 67 | 0.4 | 7180000414114
Archaellin FlaA or FlaB | 239 | Hlac_2557 | 77 | 28 | 7180000295546

4.5 Discussion

4.5.1 Distinctions between tADL-II and tADL revealed through metaproteomics

The Deep Lake metaproteome and metagenome contained many signatures of strain variation occurring within the Deep Lake populations of the isolated haloarchaeal species. The data were particularly insightful for the most dominant member of the Deep Lake community, Hht. litchfieldiae, and highlighted that, besides the major strain tADL, the population comprises a second strain, tADL-II. The proportion of proteins detected for tADL-II and their relative abundance compared to detected tADL proteins suggest that tADL-II accounts for ~20% of the Hht. litchfieldiae population. This value is in accordance with the 15% calculated based on read-depth coverage of metagenomic contigs (DeMaere et al., 2013), making tADL-II similar in abundance to DL31 and Hrr. lacusprofundi. Metaproteomic detection of tADL-II proteins facilitated the identification of 22 novel tADL-II contigs. These contigs were not identified previously because of (A) their shorter length or (B) a deviation in GC-content or metagenomic read-depth. The 22 novel tADL-II contigs had a combined length of 248 kb which, together with the original set of contigs, resulted in an overall set of 74 tADL-II contigs totalling 2.14 Mb (Figure 4.5).


Figure 4.5 Circular plot of Hht. litchfieldiae tADL and tADL-II genomes. Outer blue annulus: coding sequences of the tADL genome; inner orange annulus: tADL-II metagenome contigs mapped onto the tADL genome; short black bars: genes corresponding to proteins detected in the metaproteome. Included in the 447 tADL proteins are 11 variants assigned to Hht. litchfieldiae (also represented by black bars).

Differences found in detected proteins and gene content of metagenome contigs were indicative of physiological distinctions between tADL-II and tADL. The nitrate/sulfonate/bicarbonate ABC transporter substrate-binding protein that is unique to tADL-II could potentially be used to target a novel substrate. The absence of the respective gene in the tADL genome suggests that it was acquired through HGT. The branched-chain amino acid (BCAA) ABC transporter substrate-binding protein was two-fold more abundant for tADL-II than for tADL (10-fold when normalized to strain abundance), suggesting that tADL-II might be more reliant on the uptake of amino acids. In line with this hypothesis was the detection of glutamate dehydrogenase (GDH) and phosphoenolpyruvate synthase (Pps) only for tADL-II. GDH could be involved in amino acid catabolism and release of ammonia following amino acid uptake. Pps was the only detected protein involved in the gluconeogenesis pathway through which amino acids can be used as sole carbon sources (Williams et al., 2014). Besides cell surface proteins, proteins involved in transporter functions exhibited the highest degree of variation between tADL-II and tADL (Figure 4.3). Genomic variability in transporter genes has been observed on genomic islands of Haloquadratum walsbyi (Cuadros-Orellana et al., 2007) and Prochlorococcus populations (Coleman et al., 2006). Changes in the substrate-binding component of ABC transporters are likely to alter substrate specificity. It has been hypothesised that by introducing variation in transporter proteins a population can increase the number of compounds it targets and therefore make the most use of available substrates (Cuadros-Orellana et al., 2007).
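The step from a two-fold to a roughly 10-fold difference for the BCAA transporter protein can be illustrated with simple arithmetic. The Python sketch below assumes (this is not stated explicitly in the text) that normalization means dividing each strain's spectrum count by that strain's share of the Hht. litchfieldiae population; the function name is hypothetical. It uses the two tADL-II abundance estimates given in this section, ~20% from the metaproteome and 15% from metagenomic read depth.

```python
def normalized_fold_change(raw_fold, minor_fraction):
    """Fold change of the minor strain over the major strain after dividing
    each strain's abundance by its share of the population.

    raw_fold: ratio of minor-strain to major-strain spectrum counts
    minor_fraction: share of the population made up by the minor strain (0-1)
    """
    major_fraction = 1.0 - minor_fraction
    return raw_fold * major_fraction / minor_fraction


# Two-fold raw difference, tADL-II at ~20% and at 15% of the population
fold_at_20_percent = normalized_fold_change(2.0, 0.20)
fold_at_15_percent = normalized_fold_change(2.0, 0.15)
```

Under this assumption the normalized difference is about 8-fold for a 20% share and about 11-fold for a 15% share, which brackets the reported 10-fold value.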

4.5.2 Phylotypes of Hht. litchfieldiae

In addition to tADL-II proteins, further variants were detected that supported the presence of phylotypes in the Hht. litchfieldiae population of Deep Lake. The metaproteome data highlighted that at least three distinct variants of the ABC transporter phosphate-binding protein PstS (halTADL_2155) were synthesised by the Hht. litchfieldiae population (Table 4.3). Multiple PstS variants were also detected for the SAR11 population of the Sargasso Sea (Sowell et al., 2009). In both environments, Deep Lake and the Sargasso Sea, measured phosphate levels were reported to be low (Barker, 1981; Sowell et al., 2009; Williams et al., 2014). Hence it is possible that a population synthesising multiple different variants of the phosphate-binding protein PstS is better equipped to target most of the available phosphate compared to a clonal population. Variation of phosphate-binding components might be particularly important for species with a high demand for phosphate, as is the case for Hht. litchfieldiae (see 2.5.2). The data further showed that at least three distinct proteins (one 100% match and two variants) of the starch-degrading α-amylase halTADL_0142 were synthesised by the Hht. litchfieldiae population (Table 4.3 and Table 4.4). Hht. litchfieldiae has a highly saccharolytic metabolism and utilizes different carbohydrate-containing substrates, many of them (e.g. starch, glycerol, DHA) produced by Dunaliella (see 2.5.1). Therefore, similar to variation of the substrate-binding transporter proteins, variation of α-amylase could potentially increase the efficiency of starch utilization for the Hht. litchfieldiae population. The detection of three distinct glycerol kinases, matching the same tADL protein with 98% sequence identity, was indicative of phylotype variation for this protein within the Hht. litchfieldiae population (Table 4.4), similar to the phosphate-binding transporter protein and the α-amylase. Since glycerol was also identified as a substrate utilized by Hht. litchfieldiae (see 2.5.1 and Williams et al., 2014), variation in glycerol kinases could potentially enhance glycerol utilization. However, the taxonomic origin of the glycerol kinase variants is uncertain because no other ORFs on the glycerol kinase-encoding metagenome contigs matched to tADL. They could be derived from additional strains of Hht. litchfieldiae which have acquired variation in the genomic region surrounding the glycerol kinase genes. However, since intergenera HGT occurs frequently amongst Deep Lake haloarchaea (DeMaere et al., 2013), it is also possible that other haloarchaeal species acquired and expressed the tADL-derived glycerol kinase.

4.5.3 Community structures of Deep Lake haloarchaea

The metaproteomic data, together with previous (DeMaere et al., 2013) and additional (Table 4.2) metagenomic analyses, allowed speculation about the assembly of haloarchaeal populations in Deep Lake. The data illustrated that the population of Hht. litchfieldiae comprises two major strains (Figure 4.5; Table 4.2): tADL is the more dominant strain, representing ~80% of the population, while tADL-II accounts for ~20%. Both strains exhibit variation throughout their genomes and the metaproteomic data indicate that the strains could possibly occupy different ecological niches in the lake. In addition to the two main strains, phylotypes with variation at distinct genomic loci were identified (Table 4.3). Besides cell surface structures, proteins involved in the uptake and processing of important substrates (phosphate, starch and glycerol) were found to be particularly prone to phylotype variation. Unlike Hht. litchfieldiae, the populations of DL31 and Hrr. lacusprofundi appeared to consist mainly of their respective type strains. The data suggested that additional strains for these two species exhibited variation only in a relatively small number of genes (Table 4.2).


Chapter 5

Strain variation in Hrr. lacusprofundi assessed through genomic comparison

Co-authorship statement

The contributions to this chapter are as follows: Susanne Erdmann isolated the strain R1S1, grew R1S1 cultures and extracted DNA for sequencing. I performed all other work and wrote the chapter.


5.1 Abstract

The isolation of Halorubrum lacusprofundi strain ACAM34 from Deep Lake, reported in 1988, represents the first account of a haloarchaeal species isolated from a cold environment. Almost 30 years later, Hrr. lacusprofundi strain R1S1 was isolated from hypersaline Rauer 1 Lake on the Rauer Islands, around 35 km away from Deep Lake. This chapter describes the sequencing of R1S1 and the genomic comparison between R1S1 and ACAM34 to identify strain variation between the two. The primary replicons of the two Hrr. lacusprofundi strains were found to be highly similar, with variation mostly occurring in response to viral infection pressure. In contrast, the majority of sequences on secondary replicons were found to be unique to each strain, and the unique sequences of R1S1 contained high identity regions (HIRs) shared with other Deep Lake haloarchaea. This is the first time HIRs have been identified outside of Deep Lake.


5.2 Introduction

Metaproteomic and metagenomic analyses of Deep Lake have revealed that the populations of Deep Lake haloarchaea comprise multiple phylotypes that exhibit variation at distinct genomic locations (DeMaere et al., 2013) (see Chapters 3 and 4). A high degree of variation was identified for cell surface structures like the S-layer, hypothesised to occur in response to viral infection pressure (see 3.5.2) (Rodriguez-Valera et al., 2009). Genomic variation was also found through metagenome analysis in populations of Haloquadratum walsbyi in crystallizer ponds, with variation mostly occurring in distinct genomic locations termed genomic islands (Legault et al., 2006; Cuadros-Orellana et al., 2007). Genomic islands were also identified in populations of the marine cyanobacterium Prochlorococcus (Coleman et al., 2006; Avrani et al., 2011; Biller et al., 2015). In both species cell surface genes were commonly found in genomic islands (Cuadros-Orellana et al., 2007; Avrani et al., 2011), and experiments on isolated Prochlorococcus strains identified that variation in cell surface genes occurs in response to viral infections (Avrani et al., 2011). It is hypothesised that genomic islands arose through horizontal gene transfer (HGT) and undergo constant rearrangement (Coleman et al., 2006). A genomic comparison of two distinct Hqr. walsbyi strains also showed that variation was confined to distinct genomic locations, many of them coinciding with genomic islands (Dyall-Smith et al., 2011). The observation that the genomes of different strains from the same microbial species all contained some unique genetic material led to the introduction of the term ‘pan-genome’. It denotes the pool of genetic material comprised by all members of a species (Tettelin et al., 2005). The pan-genome of a species consists of the core genome, common to all members of a species, and all flexible genome content, present in only some members of a species. Genomes of the marine cyanobacterium Prochlorococcus typically comprise around 2000 genes and around half of them are shared between all sequenced isolates, representing the Prochlorococcus core genome. The other half of the genes represents the flexible part of the genomes (Biller et al., 2015). However, genomic and metagenomic analyses have so far identified more than 13,000 genes contained by the global Prochlorococcus population, with estimates that the pan-genome contains up to 85,000 genes (Biller et al., 2015).
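The relationship between pan-genome, core genome and flexible genome can be expressed with set operations. The Python sketch below is a toy illustration only; the strain names and gene identifiers are hypothetical, and in practice genes would first have to be grouped into orthologous families before such a comparison.

```python
def pan_core_flexible(genomes):
    """Split gene content into pan-genome, core genome and flexible genome.

    genomes: dict mapping strain name -> set of gene-family identifiers.
    pan      = genes present in at least one strain (union)
    core     = genes present in every strain (intersection)
    flexible = genes present in some but not all strains (pan minus core)
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets) if gene_sets else set()
    flexible = pan - core
    return pan, core, flexible


# Toy example with hypothetical strains and gene families
toy_genomes = {
    "strain_1": {"g1", "g2", "g3", "g4"},
    "strain_2": {"g1", "g2", "g5"},
    "strain_3": {"g1", "g2", "g3", "g6"},
}
pan, core, flexible = pan_core_flexible(toy_genomes)
```

In this toy case two gene families form the core genome, while the pan-genome of six families is larger than any single genome, mirroring on a small scale the Prochlorococcus situation described above.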


Haloarchaeal genomes are well known to comprise more than one replicon (Soppa, 2006). Three out of the four isolated Deep Lake haloarchaea possess multiple replicons and only Hht. litchfieldiae strain tADL contains a single replicon (DeMaere et al., 2013). Primary and secondary replicons in Deep Lake haloarchaea are very distinct regarding their encoded genes. Core genes were found mostly on primary replicons while the secondary replicons contained most of the flexible gene pool (DeMaere et al., 2013). The presence of HIRs, mainly found on secondary replicons, indicated extensive HGT occurred between Deep Lake haloarchaea (DeMaere et al., 2013). The isolation and characterisation of Hrr. lacusprofundi strain ACAM34 from Deep Lake was published in 1988 (Franzmann et al., 1988) and an analysis of the closed genome sequence was reported in 2013 (DeMaere et al., 2013). The genome of ACAM34 comprises three distinct replicons: one primary replicon of ~2.7 Mb plus two secondary replicons of 525 and 431 kb. The 525 kb secondary replicon was shown to contain the largest amount of HIRs compared to all other replicons from isolated Deep Lake species and by network analysis was described as a hub for HIR distribution (DeMaere et al., 2013). The Rauer Islands are located around 30 km off the coast from Davis Station in the Vestfold Hills. They include around 10 major islands which harbour a series of lakes and ponds (Hodgson et al., 2001). Similar to the lakes in the Vestfold Hills, it is believed that most of the Rauer Island lakes are marine-derived and originated through continental uplift, with ocean water getting trapped in land depressions. Due to an imbalance between evaporation and water input (melt water, precipitation), some of the lakes in the Rauer Islands have become hypersaline (Hodgson et al., 2001). 
Analysis of pigments and light microscopy studies revealed the presence of microscopic algae and cyanobacteria in some of the Rauer Island lakes (Hodgson et al., 2001). The cyanobacterial communities of some of the Rauer Island lakes were also studied using culture-based (Taton et al., 2006b) and culture-independent sequencing approaches (Taton et al., 2006a; Verleyen et al., 2010; Pessi et al., 2016). However, other bacterial and archaeal communities have not been assessed so far. During an Antarctic expedition from 2013 to 2015, the Rauer Islands were visited and water samples were taken from many of the lakes, including hypersaline Rauer 1 (R1) Lake (Figure 5.1) (Cavicchioli group Antarctic expedition, Australian Antarctic Science project 4031). From an enrichment culture based on R1 Lake water, a microbial strain (R1S1) was isolated that matched the Deep Lake haloarchaeon Hrr. lacusprofundi based on 16S rRNA gene sequencing. To investigate the extent of genomic strain variation between Hrr. lacusprofundi strains from distinct environments, the genome of R1S1 was sequenced. This chapter reports the comparison of the genome of R1S1 to ACAM34, including an examination of features so far only described for Deep Lake haloarchaea, e.g. HIRs.

Figure 5.1 Rauer 1 Lake, Filla Island, Rauer Islands, Antarctica. Picture courtesy of Alyce Hancock.

5.3 Materials and Methods

5.3.1 Strain isolation and preparation for sequencing

A water sample was collected from R1 Lake, Filla Island, Rauer Island group, Antarctica (S68° 48.49’, E77°51.303’) on September 16th 2014. An enrichment culture was grown from the R1 Lake water in liquid DBCM2 medium (Dyall-Smith, 2009) supplemented with 1 g/l peptone and 0.1 g/l yeast extract. The enrichment culture was incubated at room temperature. A single colony was obtained by plating the enrichment culture on agar plates prepared from the same medium supplemented with 16 g/l agar. For DNA extraction, a liquid culture was grown in DBCM2 medium (Dyall-Smith, 2009) supplemented with 1 g/l peptone and 0.1 g/l yeast extract in a glass flask at 30°C with shaking at 120 rpm. Cells were harvested by centrifugation at 6000 x g for 25 min and DNA was extracted using the DNeasy Blood & Tissue Kit (QIAGEN).


5.3.2 Sequencing, genome assembly and further bioinformatic analyses

Isolated DNA was sequenced using paired-end MiSeq Illumina technology at the Ramaciotti Centre for Genomics (University of New South Wales, Sydney, Australia). Manual gap closure between contigs of the R1S1 primary replicon was done using polymerase chain reaction (PCR). PCR products were purified using the QIAquick PCR Purification Kit or the QIAquick Gel Extraction Kit (QIAGEN) and sequenced using Sanger sequencing at the Ramaciotti Centre for Genomics (University of New South Wales, Sydney, Australia) and the Micromon DNA sequencing facility (Monash University, Melbourne, Australia). The manual gap closure resulted in a linear 2.7 Mb contig representing the R1S1 primary replicon. Circularization of the contig was not achieved because of sequencing failures (see 5.4.1). However, a PCR product of ~900 bp was obtained for this last gap using the primers CGCTCATCGGAGTGTAG and GTGGGAACGGATGGAAC (both in 5’-3’ orientation). The final set of contigs was uploaded to IMG (Markowitz et al., 2012) and deposited as Halorubrum lacusprofundi strain R1S1 draft genome (not publicly available). Several software tools were used for the computational analyses. Assembly of sequencing reads into contigs was performed using the genome assembler SPAdes (Nurk et al., 2013). Scaffolding of contigs was performed using SSPACE (Boetzer et al., 2011). Mapping of R1S1 contigs onto the ACAM34 reference genome was done using the Mauve Contig Mover (MCM) (Rissman et al., 2009). Calculations of the average nucleotide identity (ANI) were performed using the ANI calculator provided by Varghese et al. (2015). Genome synteny plots were created using NUCmer (Delcher et al., 2002). The Artemis Comparison Tool (ACT) (Carver et al., 2005) was used for detailed analysis of variation between R1S1 and ACAM34. ACT files for the primary replicons were created on IMG (Markowitz et al., 2012). CONTIGuator (Galardini et al., 2011) was used for matching R1S1 secondary replicon contigs onto the ACAM34 secondary replicons and to distinguish between shared and unique secondary replicon content. CONTIGuator analysis also gave rise to the ACT files for the secondary replicon comparison. Multiple sequence alignments were created using Clustal Omega (McWilliam et al., 2013). For functional comparison between R1S1 and ACAM34, archaeal Clusters of Orthologous Genes (arCOGs) (Makarova et al., 2015) were assigned using COGnitor, part of the COG software package (Kristensen et al., 2010).


HIRs were identified using the standalone BLAST+ 2.2.30 software (Camacho et al., 2009). Software requiring a UNIX environment was run on the Linux computational cluster Katana (supported by the Faculty of Science, UNSW Australia).

5.4 Results

5.4.1 Genome sequencing and manual closure of the R1S1 primary replicon

Paired-end sequencing of isolated R1S1 DNA produced a total of 718229 read-pairs with a bulk read length of 250 nt. De novo assembly of sequencing reads gave rise to an initial set of 106 contigs totalling around 3.5 Mb. An additional scaffolding step reduced the number of contigs to 89. The set of 89 contigs was then mapped onto the genome sequence of the reference strain ACAM34, which revealed that 27 contigs, totalling around 2.7 Mb, mapped onto the primary replicon of ACAM34. In a plot showing GC content and coverage, these contigs formed a distinct cluster between 65-70% GC-content and coverage of around 90 (Figure 5.2). Two contigs matching the primary replicon had higher coverage; both of them mapped to a region with multiple transposases. One of the contigs matching the primary replicon had a lower GC content; this contig contained open reading frames (ORFs) for a TATA-binding protein and an origin of replication gene (orc1/cdc6).

Figure 5.2 GC/coverage plot of the initial set of 89 R1S1 contigs. Assembled Hrr. lacusprofundi R1S1 contigs were plotted according to their GC content and their coverage from the genomic sequencing. Contigs matching the ACAM34 primary replicon are shown as grey triangles and form a discrete cluster. Contigs originating from additional replicons are shown as black diamonds. Only contigs larger than 1 kb were included in the analysis.
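
The two values plotted for each contig can be derived as in the following minimal sketch. It assumes that contig headers follow the SPAdes naming scheme `NODE_<n>_length_<bp>_cov_<coverage>`; the helper names and the two toy records are hypothetical and serve only to show the calculation and the 1 kb length cut-off.

```python
import re


def gc_content(seq: str) -> float:
    """Return GC content in percent, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


# Assumed SPAdes-style header, e.g. "NODE_1_length_1200_cov_90.5".
HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def length_and_coverage(header: str):
    """Extract contig length and k-mer coverage from the contig header."""
    m = HEADER.search(header)
    if m is None:
        raise ValueError(f"unrecognised header: {header}")
    return int(m.group(1)), float(m.group(2))


def summarise(records, min_length=1000):
    """records: iterable of (header, sequence); short contigs are skipped."""
    rows = []
    for header, seq in records:
        length, cov = length_and_coverage(header)
        if length >= min_length:
            rows.append((header, round(gc_content(seq), 1), cov))
    return rows


# Toy records (invented sequences, not R1S1 data).
records = [
    ("NODE_1_length_1200_cov_90.5", "GC" * 400 + "AT" * 200),
    ("NODE_2_length_800_cov_352.0", "AT" * 400),
]
rows = summarise(records)
print(rows)
```

Plotting the second against the third element of each row would give a GC/coverage plot of the kind shown in Figure 5.2.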

Following the identification and ordering of the 27 R1S1 contigs matching the primary replicon of ACAM34, manual closure of the gaps between these contigs was attempted using PCR and Sanger sequencing. All but one gap could be closed using this approach, including the manual amplification of an rRNA gene cluster of around 5.2 kb between two contigs. For the only gap which could not be closed, a PCR product of ~900 bp was obtained but the sequencing was not successful. Using the PCR primers as sequencing primers, the sequencing reactions from both ends each terminated after ~410 sequenced nucleotides, leaving ~80 bp in the middle of the PCR product unsequenced. At the same position in the ACAM34 primary replicon is a stretch of 80 bp (between locus tags Hlac_2468 and Hlac_2469) with a high GC content (76%) and a number of single nucleotide repeats, which likely caused a structural problem for the DNA polymerase during the sequencing reaction. Attempts to overcome the presumed blocking secondary structure during sequencing included: (1) the addition of dimethylsulfoxide or glycine betaine to the sequencing reaction (to relax secondary structures), (2) designing new primers closer to the site of the premature termination, (3) cloning the PCR product into a plasmid vector and (4) all combinations of the previous three. None of the troubleshooting attempts was successful. Even though no closed circular replicon was obtained, the manual gap closure resulted in one continuous, linear contig of around 2.7 Mb representing the R1S1 primary replicon, with only ~80 bp missing between the two ends of the contig.

In a final attempt to obtain a closed primary replicon, the assembly of the sequencing reads was repeated, this time including the 2.7 Mb primary replicon contig as a trusted contig during the assembly (Nurk et al., 2013). This second assembly reduced the overall number of contigs to 61 but also failed to produce one circular primary replicon. Instead, two large contigs were assembled that matched the primary replicon. These two contigs were replaced by the manually obtained 2.7 Mb contig. The remaining 59 contigs from this second assembly plus the 2.7 Mb contig were submitted to IMG (Markowitz et al., 2012) for functional annotation and further data analyses. Because IMG does not consider contigs < 1 kb, the genome of R1S1 is deposited on IMG as a set of 47 contigs.
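
The two properties named as the likely cause of the sequencing failure (high GC content and single nucleotide repeats) can be quantified for any sequence stretch as in the sketch below. The function names and the toy sequence are invented for illustration; the actual 80 bp ACAM34 sequence is not reproduced here.

```python
from itertools import groupby


def gc_percent(seq: str) -> float:
    """GC content in percent over A, C, G and T only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


def longest_homopolymer(seq: str):
    """Return (base, run_length) of the longest single-nucleotide repeat."""
    best = ("", 0)
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if n > best[1]:
            best = (base, n)
    return best


# Toy sequence (hypothetical), built to contain two long runs.
toy = "AT" + "G" * 7 + "CCG" + "C" * 6 + "GGTA"
print(round(gc_percent(toy), 1), longest_homopolymer(toy))
```

Applied to a sliding window along a reference sequence, the same two functions could flag regions that are likely to be difficult templates for Sanger sequencing.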

5.4.2 Overview of the R1S1 genome in comparison to ACAM34

The general characteristics of the R1S1 genome together with those of ACAM34 are shown in Table 5.1. The genome of R1S1 comprises 47 contigs: one 2.7 Mb contig representing the primary replicon and 46 additional contigs, totalling 769 kb, representing secondary replicon content. R1S1 has only two predicted 16S rRNA genes but three copies each of the 5S and 23S rRNA genes (Table 5.1). Two of the rRNA gene clusters are complete and located on the primary replicon. At the very end of one of the additional contigs is a 52 bp fragment of the third (incomplete) 23S rRNA gene copy, next to the third 5S rRNA gene copy (which is only 122 bp in total). It is therefore likely that the majority of the third 23S rRNA gene and the third 16S rRNA gene were not assembled due to the presence of the two other rRNA gene clusters on the primary replicon.

Table 5.1 Genome characteristics of ACAM34 and R1S1.

                       Hrr. lacusprofundi ACAM34   Hrr. lacusprofundi R1S1
Genome size in Mb      3.693                       3.468
GC content             64.0%                       64.7%
DNA scaffolds          3                           47
Protein coding genes   3665                        3501
5S rRNA genes          3                           3
16S rRNA genes         3                           2
23S rRNA genes         3                           3

5.4.3 Comparative analysis of primary replicons

The manual closure of the R1S1 primary replicon allowed for an in-depth comparison with the primary replicon of ACAM34. Both primary replicons exhibit very similar characteristics (Table 5.2). The R1S1 primary replicon is around 37 kb shorter and contains 45 fewer genes than that of ACAM34. This is mostly due to the provirus Hlac-Pro1 (Krupovic et al., 2010), 29 kb in length and encoding 38 predicted ORFs, which is integrated in ACAM34 but missing in R1S1.

Table 5.2 Primary replicon characteristics of ACAM34 and R1S1.

                              Hrr. lacusprofundi ACAM34   Hrr. lacusprofundi R1S1
Primary replicon size in Mb   2.735                       2.698
GC content                    66.7%                       66.8%
Protein coding genes          2745                        2700
rRNA gene clusters            2                           2
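
The differences quoted for the primary replicons follow directly from Table 5.2, and the share attributable to Hlac-Pro1 can be checked with a few lines of arithmetic. The sketch below uses only values stated in this section (replicon sizes, gene counts, and the 29 kb and 38 ORFs of the provirus); the variable names are arbitrary.

```python
# Values from Table 5.2.
acam34 = {"size_mb": 2.735, "genes": 2745}
r1s1 = {"size_mb": 2.698, "genes": 2700}

# Provirus Hlac-Pro1, as described in the text.
provirus_kb = 29
provirus_orfs = 38

size_diff_kb = round((acam34["size_mb"] - r1s1["size_mb"]) * 1000)
gene_diff = acam34["genes"] - r1s1["genes"]

# Difference left over once the provirus is accounted for.
remaining_kb = size_diff_kb - provirus_kb
remaining_genes = gene_diff - provirus_orfs

print(size_diff_kb, gene_diff, remaining_kb, remaining_genes)
```

Under this simple subtraction, roughly 8 kb and 7 genes of the difference are not explained by the provirus.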

ANI calculation (Varghese et al., 2015) showed that the primary replicons are 99.8% identical over 98% of their gene sequences. NUCmer analysis (Delcher et al., 2002) confirmed the high sequence identity and further revealed that the primary replicons are also highly syntenic with no major rearrangements (Figure 5.3). The only larger gap in the NUCmer plot is due to Hlac-Pro1 missing from R1S1.

Figure 5.3 NUCmer plot of R1S1 and ACAM34 primary replicons. The black circle indicates the missing Hlac-Pro1 provirus in R1S1.

For a detailed analysis of variation between the two primary replicons, their sequences were aligned using the ACT (Carver et al., 2005). Consistent with the ANI and NUCmer analyses, the ACT alignment confirmed the high similarity between the two primary replicons (Figure 5.4). However, closer examination of the alignment also revealed differences between the two replicons (Table 5.3, Table 5.4 and Table 5.5; Figure 5.4). A total of 27 predicted transposases were identified that were only present at certain positions in the primary replicon of one of the two strains: 17 were unique to ACAM34 and ten to R1S1 (Table 5.3). Two of the transposases were integrated within ORFs, rendering the ORFs non-functional in the respective strain. One of the interrupted ORFs encodes a protein containing a domain also present in glycosyl hydrolases, potentially involved in the glycosylation of cell surface glycoproteins, and the other is annotated as a putative replication initiator protein (Table 5.3).

Figure 5.4 ACT alignment of R1S1 and ACAM34 primary replicons. The primary replicons of the two strains are indicated through grey framed bars on the top and the bottom of the figure. Red areas in between the two replicons indicate regions of sequence similarity ≥ 80%, the darker the red the higher the similarity. Numbers and coloured bars refer to regions with variation between the two replicons: green numbers (and bars) indicate unique features described in Table 5.4; blue numbers (and bars) indicate regions with sequence identity < 99% described in Table 5.5.

Besides transposases, there were only eleven regions containing annotated features that were unique to either of the two strains: nine regions encoding genes and two encoding non-coding RNAs (Table 5.4; Figure 5.4). Three predicted cell surface protein genes were unique to R1S1. One of these genes denotes the region where the Hlac-Pro1 provirus is inserted in ACAM34. The second cell surface protein gene is an additional copy of an archaellin gene (flaB). R1S1 therefore contains two archaellin genes (Ga0123509_16091/16092) where ACAM34 has only one (Hlac_2557). Protein sequence alignment revealed that the two R1S1 archaellins share only 44% sequence identity with each other (Figure 5.5). Hlac_2557 is almost identical (98%) to Ga0123509_16092 but only 43% similar to Ga0123509_16091. All three archaellins share a highly conserved N-terminus (the first 55 AA) (Figure 5.5). Furthermore, the R1S1 archaellin Ga0123509_16091 is 100% identical to a Hrr. lacusprofundi variant protein detected in the metaproteome (see 3.4.1.2). The metagenome contig encoding the detected variant has > 99.9% nucleotide sequence identity (3641 out of 3643 nt are identical) to the archaellin-encoding region of R1S1 and hence also contains the same two archaellin gene copies. These results indicate that the Hrr. lacusprofundi population of Deep Lake contains a strain that is identical to R1S1 with respect to the archaellin genes. The only unique gene with an assigned metabolic function is an alpha amylase, present in ACAM34 but absent in R1S1 (Table 5.4). Other unique features of ACAM34 include the Hlac-Pro1 provirus and a putative toxin-antitoxin (TA) system (Table 5.4).
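
Identity values such as those quoted above are ratios of matching positions to compared positions. The sketch below shows one common convention, counting only alignment columns in which neither sequence has a gap; this choice of denominator is an assumption, since the convention used by the alignment tools is not stated in the text. The short aligned sequences are invented.

```python
def percent_identity(a: str, b: str) -> float:
    """Identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Invented toy alignment: six compared columns, five of them identical.
toy_identity = percent_identity("MKV-LSA", "MKVTLTA")

# Nucleotide identity from the counts given in the text.
contig_identity = 100.0 * 3641 / 3643

print(round(toy_identity, 1), round(contig_identity, 2))
```

With 3641 of 3643 nucleotides identical, the archaellin-encoding region differs at only two positions.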

Figure 5.5 Archaellin protein alignment. The figure shows a multiple amino acid sequence alignment of the two R1S1 (Ga0123509_16091/16092) and the single ACAM34 (Hlac_2557) archaellin proteins. An asterisk (*) denotes fully conserved residues; a colon (:) denotes residues with strongly similar properties; a period (.) denotes residues with weakly similar properties.

The ACT analysis further revealed that within the primary replicons there are only five genomic regions (≥ 1 kb) that align with < 99% sequence similarity, here referred to as regions of low similarity (Table 5.5; Figure 5.4). Common to all five regions is that they contain at least one gene encoding a putative cell surface protein or a protein involved in the biosynthesis of cell surface structures. The genes encoding the S-layer glycoprotein share the lowest degree of similarity (only 54%), a value similar to those obtained for detected S-layer variants in the Deep Lake metaproteome (see 3.4.1.1). The largest region of low similarity (Feature number 4 in Table 5.5; 36-37 kb) contains multiple genes encoding proteins predicted to be responsible for the N-glycosylation of cell surface structures such as the S-layer or archaella; one of these genes encodes the oligosaccharyltransferase AglB, the most conserved component of N-glycosylation pathways in archaea (Jarrell et al., 2014).

Table 5.3 Unique transposases of ACAM34 and R1S1. This table contains all transposases which are not present at the same genomic position in the respective other strain.

Hrr. lacusprofundi ACAM34                      Hrr. lacusprofundi R1S1
Coordinates       Length (bp)  Locus tags      Coordinates       Length (bp)  Locus tags                 Comments
-                 -            -               220497…221663     1167         Ga0123509_160215
53104…54566       1463         Hlac_0054       -                 -            -
-                 -            -               726583…727833     1251         Ga0123509_160717
452870…454709     1840         Hlac_0436/0437  -                 -            -
597160…598692     1533         Hlac_0580       -                 -            -                          Within a region of low similarity (Table 5.5)
-                 -            -               914035…915183     1149         Ga0123509_160913
919624…920622     999          Hlac_0921       -                 -            -
1015778…1017239   1462         Hlac_1024       -                 -            -
-                 -            -               1328455…1329885   1431         Ga0123509_1601346          This transposase lies within the ACAM34 gene Hlac_1071 (hypothetical with five-bladed beta-propeller domain found in some glycosyl hydrolases); within a region of low similarity (Table 5.5)
1137543…1139396   1854         Hlac_1128       -                 -            -
1222965…1223963   999          Hlac_1212       -                 -            -
-                 -            -               1707365…1708804   1440         Ga0123509_1601717
1501533…1502171   639          Hlac_1494       -                 -            -
1525929…1526567   639          Hlac_1512       -                 -            -
-                 -            -               1848367…1849545   1179         Ga0123509_1601858
1625006…1626467   1462         Hlac_1606       -                 -            -
-                 -            -               1918636…1919808   1173         Ga0123509_1601929
1771722…1772970   1249         Hlac_1756       -                 -            -
1772965…1774372   1408         Hlac_1757       -                 -            -
1775301…1776702   1402         Hlac_1759       -                 -            -                          This transposase lies within the R1S1 gene Ga0123509_1602045 which is annotated as putative replication initiator protein (rep)
1847940…1849401   1462         Hlac_1840       -                 -            -
-                 -            -               2257564…2258808   1245         Ga0123509_1602307          Within a region of low similarity (Table 5.5)
2082149…2083550   1402         Hlac_2092       -                 -            -
2238850…2240187   1338         Hlac_2250       -                 -            -
-                 -            -               2581257…2582977   1721         Ga0123509_1602633/1602634  Adjacent to the transposases is a HgcC family RNA which is also missing in ACAM34

Table 5.4 Unique features of ACAM34 and R1S1.

         Hrr. lacusprofundi ACAM34                          Hrr. lacusprofundi R1S1
Feature  Coordinates       Length (bp)  Locus tags          Coordinates                   Length (bp)  Locus tags                    Comments and annotation
1        -                 -            -                   75437…75810 and 76016…76389   373          Ga0123509_16078 to 16081      Duplication of 373 nt in R1S1; affected ORFs are annotated as ERCC domain protein and hypothetical proteins
2        -                 -            -                   85713…86333                   620          Ga0123509_16091               Two consecutive archaellin genes (flaB) in R1S1 where ACAM34 has only one archaellin gene
3        -                 -            -                   403081…403459                 379          Ga0123509_160398 to 160401    Four identical tRNA-Asp copies side by side in R1S1; only two copies in ACAM34
4        -                 -            -                   782723…783296                 574          Ga0123509_160778              Putative cell surface protein (N-terminal TAT signal sequence followed by a larger non-cytoplasmic domain), missing in ACAM34
5        605333…606991     1659         Hlac_0587           -                             -            -                             Alpha amylase, missing in R1S1
6        750801…779997     29197        Hlac_0763 to 0775   1038242…1039459               1217         Ga0123509_1601043             Provirus Hlac-Pro1 inserted in ACAM34; R1S1 contains one hypothetical ORF (N-terminal signal peptide (AA 1-21), non-cytoplasmic domain (AA 22-221), transmembrane domain (AA 222-242) and cytoplasmic domain (AA 243-405)) at the respective location
7        -                 -            -                   1169239…1169367               128          Ga0123509_1601183             HgcC family ncRNA, missing in ACAM34. Family of ncRNAs associated with transposons. First identified in Methanococcus jannaschii (Klein et al., 2002) but ncRNAs with similar characteristics were also identified in Halobacterium salinarum NRC-1 (Gomes-Filho et al., 2015). Their function is not known
8        1057147…1057488   342          Hlac_1060           -                             -            -                             Hypothetical protein (winged helix DNA-binding domain), missing in R1S1; within a region of low similarity (Table 5.5)
9        1081868…1082583   716          Hlac_1081/1082      -                             -            -                             Putative toxin-antitoxin (TA) system, missing in R1S1
10       1972868…1973351   484          Hlac_1975 (partly)  -                             -            -                             Part of hypothetical protein and intergenic region, missing in R1S1
11       -                 -            -                   2581257…2582977               1721         Ga0123509_1602632 to 1602634  HgcC family ncRNA and two transposases, missing in ACAM34

Table 5.5 Regions with low sequence similarity between ACAM34 and R1S1.

         Hrr. lacusprofundi ACAM34                                  Hrr. lacusprofundi R1S1
Feature  Coordinates       Length (bp)  Locus tags                  Coordinates       Length (bp)  Locus tags                           ID    Comments and annotation
1        114519…123903     9385         Hlac_0109 to 0118           403285…412681     9397         Ga0123509_160402 to 160413           95%   Acetoin utilization deacetylase AcuC; archaeal histone-like protein; replication factor A1; tRNA-Arg; sodium/proton antiporter, CPA1 family (TC2.A.36); NAD+ dependent glucose-6-phosphate dehydrogenase; hypothetical (cytoplasmic, transmembrane and non-cytoplasmic domains); hypothetical (DUF309 domain); hypothetical (uncharacterized protein family SepF-related domain); hypothetical
2        426616…428965     2350         Hlac_0409 (partly) to 0411  715497…717777     2281         Ga0123509_160708 (partly) to 160710  93%   Hypothetical (WD40/YVTN repeat-like-containing domain, oligoxyloglucan reducing end-specific cellobiohydrolase superfamily); integral membrane protein-like protein; putative transcription regulator, CopG/Arc/MetJ family
2        428966…432296     3331         Hlac_0412                   717778…720739     2962         Ga0123509_160711                     54%   S-layer glycoprotein
3        596866…597159     294          Hlac_0579                   885886…886179     294          Ga0123509_160884                     93%   Hypothetical (multiple transmembrane and non-cytoplasmic domains)
3        597160…598692     1533         Hlac_0580                   -                 -            -                                    -     Transposase, missing in R1S1
3        598833…605292     6460         Hlac_0581 to 0586           886275…892719     6445         Ga0123509_160885 to 160891           89%   Dipeptidyl aminopeptidase/acylaminoacyl peptidase; glycosyltransferase; halocyanin domain-containing protein; phosphoribosyltransferase; MoxR-like ATPase; tRNA_Ser; hypothetical
4        1045360…1056892   11532        Hlac_1051 to 1059           1302658…1314181   11523        Ga0123509_1601325 to 1601334         93%   Xaa-Pro aminopeptidase; phosphoglycerate dehydrogenase; amidohydrolase; 4-aminobutyrate aminotransferase; choline/carnitine/betaine transport; amidohydrolase; alcohol dehydrogenase zinc-binding domain protein; capsule biosynthesis protein; hypothetical
4        1057147…1057488   342          Hlac_1060                   -                 -            -                                    -     Hypothetical protein (winged helix DNA-binding domain), missing in R1S1
4        1057665…1071621   13956        Hlac_1061 to 1071           1314574…1328460   13886        Ga0123509_1601335 to 1601345         95%   Putative membrane protein; oligosaccharyltransferase STT3 subunit; glycosyl transferase family 2; glycosyl transferase group 1; sulfatase; glycosyl transferase group 1; hypothetical (alkaline-phosphatase-like core domain); polysaccharide biosynthesis protein; formyl transferase; hypothetical with five-bladed beta-propeller domain found in some glycosyl hydrolases
4        -                 -            -                           1328455…1329885   1431         Ga0123509_1601346                    -     Transposase, missing in ACAM34
4        1071622…1081793   10171        Hlac_1072 to 1080           1329886…1340080   10194        Ga0123509_1601347 to 1601355         95%   NAD-dependent epimerase/dehydratase; NAD-dependent epimerase/dehydratase; UDP-sulfoquinovose synthase; transcriptional regulator TrmB; hypothetical; ORC complex protein Cdc6/Orc1; glutamine--fructose-6-phosphate transaminase; nucleotidyl transferase; hypothetical
5        1999910…2007189   7279         Hlac_2006 to 2013           2249787…2257210   7423         Ga0123509_1602299 to 1602306         91%   Membrane protein involved in the export of O-antigen and teichoic acid; transcriptional regulator, TetR family; succinylglutamate desuccinylase; arginine decarboxylase; proteasome-activating nucleotidase; response regulator receiver protein; hypothetical
5        -                 -            -                           2257564…2258808   1245         Ga0123509_1602307                    -     Transposase, missing in ACAM34
5        2007190…2008357   1168         Hlac_2014                   2258824…2259992   1169         Ga0123509_1602308                    95%   Peptidase M29 aminopeptidase I
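
The size of 36-37 kb quoted for the largest region of low similarity (Feature number 4) can be reproduced from the coordinates in Table 5.5. The sketch below takes the outermost coordinates of all rows belonging to Feature 4 in each strain; the helper name `span_kb` is arbitrary, and treating the feature as one contiguous span is an assumption of this example.

```python
def span_kb(segments):
    """Length in kb from the first start to the last end (inclusive)."""
    start = min(s for s, _ in segments)
    end = max(e for _, e in segments)
    return (end - start + 1) / 1000


# Coordinates of the Feature 4 rows in Table 5.5.
acam34_feature4 = [
    (1045360, 1056892),
    (1057147, 1057488),
    (1057665, 1071621),
    (1071622, 1081793),
]
r1s1_feature4 = [
    (1302658, 1314181),
    (1314574, 1328460),
    (1328455, 1329885),
    (1329886, 1340080),
]

print(round(span_kb(acam34_feature4), 1), round(span_kb(r1s1_feature4), 1))
```

The region spans about 36.4 kb in ACAM34 and about 37.4 kb in R1S1, in line with the range given in the text.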

5.4.4 Comparison of secondary replicons

The assembly of the R1S1 genome generated 46 contigs which were not part of the primary replicon and therefore represent additional genomic material, likely originating from secondary replicons. Their combined characteristics are similar to those of the ACAM34 secondary replicons, although they are around 188 kb shorter in length with 119 fewer genes (Table 5.6). This difference in length is likely due in part to some transposase genes not being assembled in R1S1 (see also results of arCOG analysis further below). A GC/coverage plot of R1S1 contigs revealed that most contigs are part of two clusters with distinct coverage (Figure 5.6). Thirty-one contigs with a total length of 545 kb formed a cluster with coverage values between 42 and 57. Twelve contigs with a total length of 220 kb formed a second cluster with coverage values between 87 and 107. These data indicated that, similar to ACAM34, R1S1 likely contains two secondary replicons. The coverage data further suggested that the smaller secondary replicon is present in higher copy number than the larger secondary replicon. However, because the R1S1 secondary replicons were not fully assembled, for all further analyses all contigs except for the one representing the primary replicon were combined and referred to as secondary replicon contigs.

Table 5.6 Secondary replicon characteristics of ACAM34 and R1S1.

                                        Hrr. lacusprofundi ACAM34   Hrr. lacusprofundi R1S1
Number of secondary replicons/contigs   2 circular replicons        46 linear contigs
Size in kb                              957 (525 and 431)           769
GC content                              55 and 57%                  43-62%; 57% average
Protein coding genes                    920                         801
Shared sequences                        240 kb                      240 kb
Unique sequences                        717 kb                      529 kb
Genes on shared sequences               269                         257
Genes on unique sequences               651                         544
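
The entries of Table 5.6 are internally related: unique sequence is total size minus shared sequence, and the gene counts on shared and unique sequences sum to the total. The following sketch recomputes these relations from the table values and derives the percentage of genes located on shared sequences; the dictionary layout is arbitrary.

```python
# Values from Table 5.6.
shared_kb = 240
strains = {
    "ACAM34": {"size_kb": 957, "genes": 920, "shared_genes": 269, "unique_genes": 651},
    "R1S1": {"size_kb": 769, "genes": 801, "shared_genes": 257, "unique_genes": 544},
}

unique_kb = {}
shared_gene_pct = {}
for name, s in strains.items():
    # Shared and unique genes should add up to the total gene count.
    assert s["shared_genes"] + s["unique_genes"] == s["genes"]
    unique_kb[name] = s["size_kb"] - shared_kb
    shared_gene_pct[name] = round(100 * s["shared_genes"] / s["genes"], 1)

print(unique_kb, shared_gene_pct)
```

In both strains, roughly 30% of the secondary replicon genes lie on sequences shared with the other strain.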

Figure 5.6 GC/coverage plot of R1S1 secondary replicon contigs. Assembled Hrr. lacusprofundi R1S1 contigs were plotted according to their GC-content and their coverage from the genomic sequencing. Orange and green ovals indicate contigs forming putative secondary replicons. Black diamonds indicate contigs not part of the primary replicon; the grey triangle indicates the primary replicon. Not included are two small contigs (1 and 1.3 kb) with high coverage values (352 and 162), which encode a transposase and an ATPase respectively.
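
The inference about relative copy number rests on comparing the sequencing coverage of each contig cluster with that of the primary replicon. A minimal sketch of this comparison is given below. It assumes that coverage scales linearly with copy number and summarises each cluster by the midpoint of its reported coverage range, which is a crude simplification; the primary replicon coverage of 90 is the approximate value given in 5.4.1.

```python
def relative_copy_number(cluster_coverage: float, reference_coverage: float) -> float:
    """Ratio of a contig cluster's coverage to the primary replicon coverage."""
    return cluster_coverage / reference_coverage


primary_coverage = 90.0  # "around 90" for the primary replicon contigs

# Coverage ranges of the two contig clusters, as reported in the text.
clusters = {
    "larger secondary replicon (545 kb)": (42, 57),
    "smaller secondary replicon (220 kb)": (87, 107),
}

estimates = {}
for name, (low, high) in clusters.items():
    midpoint = (low + high) / 2  # crude summary of the reported range
    estimates[name] = round(relative_copy_number(midpoint, primary_coverage), 2)
    print(name, estimates[name])
```

Under these assumptions, the smaller secondary replicon is present at about the same copy number as the primary replicon, and the larger one at about half that.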

The calculated ANI between the two combined ACAM34 secondary replicons and the R1S1 secondary replicon contigs was high (97%). However, only around 30% of all genes were included in the ANI calculation. The remaining genes did not fulfil the ANI requirement of an alignment with ≥ 30% sequence identity over ≥ 70% of their length. This implied that less than a third of the genes encoded on secondary replicons were shared between ACAM34 and R1S1. The difference in secondary replicon composition was illustrated by mapping R1S1 secondary replicon contigs against the ACAM34 secondary replicons using CONTIGuator (Galardini et al., 2011) and the ACT (Figure 5.7). CONTIGuator matched only 240 kb of R1S1 secondary replicon contig sequences against the ACAM34 secondary replicons. The majority of secondary replicon content was identified as unique to either of the two strains (Table 5.6).
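
How a high ANI can coincide with a small fraction of included genes is shown in the simplified sketch below. The identity and length thresholds are taken from the text; the length-weighted averaging, the input format and the four example hits are assumptions made for illustration and may differ from the calculator of Varghese et al. (2015).

```python
def gene_ani(hits, min_identity=30.0, min_aligned_fraction=0.70):
    """Return (ANI in percent, fraction of genes included).

    hits: list of dicts with 'identity' (percent), 'aligned_len' and
    'gene_len' for the best match of each gene. Genes failing either
    threshold are excluded from the average.
    """
    kept = [
        h for h in hits
        if h["identity"] >= min_identity
        and h["aligned_len"] / h["gene_len"] >= min_aligned_fraction
    ]
    if not kept:
        return None, 0.0
    total_aligned = sum(h["aligned_len"] for h in kept)
    # Length-weighted mean identity (an assumed weighting scheme).
    ani = sum(h["identity"] * h["aligned_len"] for h in kept) / total_aligned
    return ani, len(kept) / len(hits)


# Invented example hits (not R1S1 data).
hits = [
    {"identity": 98.0, "aligned_len": 900, "gene_len": 1000},  # kept
    {"identity": 96.0, "aligned_len": 600, "gene_len": 600},   # kept
    {"identity": 99.0, "aligned_len": 200, "gene_len": 1000},  # too short
    {"identity": 25.0, "aligned_len": 900, "gene_len": 1000},  # too divergent
]
ani, included = gene_ani(hits)
print(round(ani, 1), included)
```

Because excluded genes do not enter the average, the ANI describes only the shared gene set, so it must be read together with the fraction of genes included.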

Figure 5.7 ACT alignment of ACAM34 secondary replicons and matching R1S1 contigs. The secondary replicons of ACAM34 are indicated through grey bars on top of the figure; R1S1 secondary replicon contigs matching the ACAM34 secondary replicons are indicated through grey bars on the bottom of the figure. Red areas in between grey bars indicate regions of sequence similarity ≥ 80%, the darker the red the higher the similarity.

Potential functional differences between the ACAM34 and R1S1 unique secondary replicon regions were assessed through comparison of arCOG functional class abundances (Figure 5.8). The largest observed difference was a higher abundance of genes assigned to the Mobilome functional class in ACAM34 compared to R1S1. This functional class includes transposases, which are present in high copy numbers on ACAM34 secondary replicons (DeMaere et al., 2013). The smaller number of transposases on R1S1 secondary replicon contigs is likely the result of many transposases not being assembled due to their high sequence identity. A similar lack of assembly was observed for the highly identical rRNA gene cluster copies: one rRNA gene cluster on the primary replicon was only completed with PCR during the manual gap closure, and the rRNA gene cluster on a secondary replicon contig is incomplete. Apart from the Mobilome, the relative abundance of functional classes was similar, indicating that the unique regions of the secondary replicons encode distinct but functionally similar genes.

Figure 5.8 Functional profile of ACAM34 and R1S1 unique secondary replicon regions. The plot shows the relative abundance of genes assigned to arCOG functional classes within the unique secondary replicon regions of ACAM34 (black bars) and R1S1 (grey bars). Functional classes are ordered according to relative abundance in R1S1, highest to the right and lowest to the left.

HIRs were previously identified as genomic features unique to haloarchaeal species isolated from Deep Lake (DeMaere et al., 2013). The regions of DNA unique to R1S1 (and not present in ACAM34) were investigated for the presence of HIRs shared with other Deep Lake species. BLAST analysis revealed 10 regions of 2-14 kb in length that were shared between R1S1 and either Hht. litchfieldiae, DL31 or DL1 (Table 5.7). In one instance, a 3.7 kb region was shared between the three species R1S1, Hht. litchfieldiae and DL1. In total, the R1S1 secondary replicon contigs contained 57 kb of HIRs shared with other Deep Lake haloarchaea which were not present in the genome of ACAM34. All of these novel HIRs were present on secondary replicons in DL31 and DL1. In Hht. litchfieldiae, which contains only a single replicon, the novel HIRs were adjacent to HIRs described previously (DeMaere et al., 2013).

Table 5.7 Novel HIRs of R1S1. Shown are unique regions of R1S1 secondary replicon contigs that are shared with the Deep Lake species Hht. litchfieldiae, DL31 and DL1. The region marked with an asterisk (*) is part of the 14 kb HIR shared between R1S1 and DL1. DL (Deep Lake); replicon names are according to IMG.

HIR length (bp)  ID (%)  R1S1 contig  Start  End    DL species/replicon                                 Start    End
13970            99.87   2671192833   127    14096  DL1/HalDL1_Contig37 (secondary replicon)            278097   264129
9603             100     2671192825   1      9603   DL31/DL31_replicon 2 (secondary replicon)           101664   111266
6021             100     2671192827   58     6078   DL31/DL31_replicon 2 (secondary replicon)           363470   357450
4800             99.98   2671192843   21705  26504  DL31/DL31_replicon 2 (secondary replicon)           451899   456698
4386             100     2671192829   3750   8135   Hht. litchfieldiae/Hlit_replicon (single replicon)  1224426  1220041
4302             100     2671192846   263    4564   DL1/HalDL1_Contig37 (secondary replicon)            97783    102084
4225             99.95   2671192827   6073   10297  DL31/DL31_replicon 2 (secondary replicon)           297162   292938
4064             99.98   2671192829   8125   12188  Hht. litchfieldiae/Hlit_replicon (single replicon)  1504145  1508208
3694*            99.89   2671192833   1501   5194   Hht. litchfieldiae/Hlit_replicon (single replicon)  1483561  1479868
3469             99.94   2671192837   10141  13609  DL1/HalDL1_Contig37 (secondary replicon)            250326   246858
2330             100     2671192812   1      2330   DL31/DL31_replicon 2 (secondary replicon)           679626   677297
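
The total of 57 kb of novel HIRs can be recomputed from Table 5.7. The sketch below sums the HIR lengths after excluding any region that lies entirely within another region on the same R1S1 contig, so that the 3.7 kb region inside the 14 kb HIR is not counted twice. Handling the nested region in this way is an assumption of this example, made because it reproduces the stated count of 10 regions.

```python
# (length in bp, R1S1 contig, start, end) from Table 5.7.
hirs = [
    (13970, "2671192833", 127, 14096),
    (9603, "2671192825", 1, 9603),
    (6021, "2671192827", 58, 6078),
    (4800, "2671192843", 21705, 26504),
    (4386, "2671192829", 3750, 8135),
    (4302, "2671192846", 263, 4564),
    (4225, "2671192827", 6073, 10297),
    (4064, "2671192829", 8125, 12188),
    (3694, "2671192833", 1501, 5194),
    (3469, "2671192837", 10141, 13609),
    (2330, "2671192812", 1, 2330),
]


def is_nested(row, rows):
    """True if row lies entirely within another row on the same contig."""
    _, contig, start, end = row
    return any(
        other is not row
        and other[1] == contig
        and other[2] <= start
        and end <= other[3]
        for other in rows
    )


top_level = [r for r in hirs if not is_nested(r, hirs)]
total_bp = sum(r[0] for r in top_level)
print(len(top_level), total_bp)
```

Ten regions remain, with a combined length of 57170 bp, consistent with the 57 kb given in the text.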

5.5 Discussion

The sequencing of the R1S1 genome and its comparison to the type strain ACAM34 enabled a thorough examination of strain variation for the Antarctic haloarchaeal species Hrr. lacusprofundi. The manual closure of the R1S1 primary replicon allowed a comparison to be performed between the primary replicons of R1S1 and ACAM34. In addition, the partially reconstructed secondary replicons of R1S1 were compared to those of ACAM34. This analysis revealed that most inter-strain variation occurred in the secondary replicons of Hrr. lacusprofundi.

5.5.1 Highly conserved primary replicons exhibit variation related to virus infection pressure

The primary replicons of R1S1 and ACAM34 are almost identical over their full length, with conserved gene content and synteny (Figure 5.3 and Figure 5.4). Besides differences in transposase content, variation occurred in only a small number of genomic regions. The ACAM34 primary replicon had uniform metagenomic coverage across its whole length (except for Hlac-Pro1, see below), indicating conservation of the primary replicon within the Hrr. lacusprofundi population of Deep Lake (DeMaere et al., 2013). Highly conserved primary replicons were also reported for two Hqr. walsbyi strains and for two Halobacterium salinarum strains (Pfeiffer et al., 2008; Dyall-Smith et al., 2011). The authors of the Hbt. salinarum study concluded that both analysed strains originated from the same original isolate and only diverged during cultivation in the laboratory (Pfeiffer et al., 2008). That study is therefore not informative about strain variation occurring in the environment. However, the two studied Hqr. walsbyi strains were isolated around the same time from environments on opposite sides of the planet (Spain and Australia) (Bolhuis et al., 2004; Burns et al., 2004). The limited diversity observed between their primary replicons was discussed as a result of high selection pressure on the primary replicons that left little room for variation (Dyall-Smith et al., 2011). The highly conserved primary replicons are hypothesised to provide Hqr. walsbyi with high fitness in many different hypersaline environments, where it is often the dominant species (Dyall-Smith et al., 2011). In comparison, the distance between the isolation sites of R1S1 and ACAM34 was only ~35 km but the time span between their isolation was ~30 years (Franzmann et al., 1988). Hence, the high conservation between the R1S1 and ACAM34 primary replicons, together with results from the Deep Lake metagenome (DeMaere et al., 2013), is indicative of high selection pressure, preserving primary replicon structure and gene content for Hrr. lacusprofundi over time within Antarctic hypersaline systems.

Variation between the two primary replicons of R1S1 and ACAM34 often involved genes encoding cell surface proteins or proteins involved in the assembly and modification of cell surface structures (Table 5.4 and Table 5.5). Of all genes common to R1S1 and ACAM34, the S-layer genes had the lowest degree of sequence similarity (Table 5.5). This finding is in agreement with analysis of the Deep Lake metaproteome, which highlighted that within the populations of Deep Lake haloarchaea, multiple phylotypes exist with highly different S-layer proteins on their cell surfaces (see 3.4.1). Variation in cell surface structures was also observed within populations and between isolated strains of Hqr. walsbyi and was discussed as a response to virus infection pressure (Legault et al., 2006; Cuadros-Orellana et al., 2007; Rodriguez-Valera et al., 2009; Dyall-Smith et al., 2011) (see 3.5.2). Prochlorococcus strains accumulated mutations predominantly in cell surface protein genes after exposure to infecting viruses (Avrani et al., 2011). The observed variation between R1S1 and ACAM34 cell surface protein genes therefore likely occurs in response to infecting viruses. The variation of cell surface protein genes could also indicate the presence of distinct virus populations in R1 Lake and Deep Lake. However, this latter hypothesis requires further examination of the two virus populations, since variation in cell surface protein genes was already observed within the population of Hrr. lacusprofundi in Deep Lake (see 3.4.1).

R1S1 possesses an additional archaellin gene copy (total of two archaellin genes) at the same genomic position where ACAM34 encodes its single archaellin gene.
Multiple copies of archaellin genes are not uncommon in haloarchaeal genomes (Pyatibratov et al., 2008); for example, Hht. litchfieldiae contains a total of seven archaellin genes in its genome and most of them are expressed in the environment (see 2.4.5). Switching of archaellin gene expression was observed in Haloarcula marismortui and resulted in archaella with distinct morphologies and antigenic properties (Pyatibratov et al., 2008). Because archaella are exposed on the cell surface, they represent potential attachment sites for viruses. It was therefore hypothesised that variation in expressed archaellins occurred in response to viral predation pressure (Porter et al., 2007; Pyatibratov et al., 2008). Analyses of the Deep Lake metaproteome and metagenome identified a protein and a metagenome contig identical to R1S1 with respect to archaellin gene organisation. The variation in archaellin genes is therefore not related to the distinct environments but might represent a common feature of Hrr. lacusprofundi populations.

The single largest difference between the R1S1 and ACAM34 primary replicons related to the absence of the provirus Hlac-Pro1 in R1S1 (Figure 5.3, Figure 5.4 and Table 5.4). Mapping of Deep Lake metagenomic reads to the genome of ACAM34 revealed low coverage for the region representing Hlac-Pro1 (DeMaere et al., 2013). These data suggest that Hlac-Pro1 is either specific to ACAM34 or only present in a small part of the Hrr. lacusprofundi population in Deep Lake. Proviruses are a common feature of bacterial genomes, where they can represent a large portion of strain-specific sequences (Weinbauer and Rassoulzadegan, 2004), and proviruses were also found in the relatively small number of sequenced haloarchaeal genomes (Krupovic et al., 2010). Distinctions in provirus content represented a large proportion of the differences between two Hqr. walsbyi strains (Dyall-Smith et al., 2011). These results indicate that proviruses are a major contributor to strain diversity in haloarchaeal species.

The largest region with conserved gene synteny but < 99% sequence identity between the R1S1 and ACAM34 primary replicons encodes many proteins putatively involved in the glycosylation of cell surface structures, including a number of glycosyltransferases and the oligosaccharyltransferase AglB (Table 5.5). N-glycosylation refers to the covalent attachment of glycans (single or multiple sugars and sugar derivatives) onto certain asparagine residues of target proteins (Jarrell et al., 2014). During the glycan assembly, glycosyltransferases attach specific sugars and sugar derivatives onto a lipid carrier. The transfer of the preassembled glycan onto the protein is mediated by oligosaccharyltransferases, which represent the most conserved component of N-glycosylation systems within archaea (Jarrell et al., 2014).
N-glycosylation of cell surface proteins is a widespread posttranslational modification present in all three domains of life (Jarrell et al., 2014). In haloarchaea, N-glycosylation of the S-layer and, to a lesser extent, archaella is particularly well studied (Jarrell et al., 2010; Kaminski and Eichler, 2014). In Haloferax volcanii, N-glycosylation of the S-layer glycoproteins and archaella was shown to be essential for correct assembly and stability of the S-layer and for proper swimming motility, respectively (Jarrell et al., 2014). Surface proteins of haloarchaeal viruses can also be N-glycosylated and it was shown that N-glycosylation of viral proteins plays a crucial role in the recognition of receptors on host cell surfaces (Kandiba et al., 2012). Furthermore, carbohydrates and carbohydrate derivatives, e.g. sialic acid, on host cell surfaces are well known receptors

for a number of human pathogenic viruses (Cohen, 2015; Kato and Ishiwa, 2015). Genes putatively involved in the glycosylation of cell surface proteins were also found within one of the genomic islands described for Hqr. walsbyi (Cuadros-Orellana et al., 2007). The methanogen Methanococcus voltae strain PS contains archaella which are glycosylated with a glycan comprising four carbohydrate residues (Chaban et al., 2009; Jarrell et al., 2014). However, one version of M. voltae strain PS was cultivated for > 25 years and lacked the fourth sugar residue in its glycan. It was hypothesised that this latter strain must have accumulated mutations in the gene responsible for the biosynthesis or attachment of the fourth sugar, leading to a shorter glycan (Jarrell et al., 2014).

Besides being involved in the glycosylation of proteins, glycosyltransferases also catalyse the transfer of sugar residues onto other molecules including lipids, DNA or various secondary metabolites (Schmid et al., 2016). The substrate specificity of glycosyltransferases can be artificially altered through targeted mutation, which is used in the production of natural products for the discovery of novel drugs (Schmid et al., 2016). After the artificial introduction of three amino acid changes, a glycosyltransferase isolated from Streptomyces antibioticus was able to transfer ten distinct tested sugar residues, compared to only three sugar residues transferred by the wild type enzyme (Williams et al., 2008). It is therefore possible that variation in enzymes involved in the N-glycosylation pathway in Hrr. lacusprofundi results in the attachment of differently composed glycans onto cell surface proteins. This variation could potentially represent a further defence strategy of host cells to escape virus infection.

5.5.2 Great diversity between secondary replicon content

In contrast to the high conservation between the R1S1 and ACAM34 primary replicons (Figure 5.4), a high degree of variation was observed between the secondary replicons (Figure 5.7). Similar to ACAM34, our data indicated that R1S1 also contains two secondary replicons (Figure 5.6). However, only ~25-30% (240 kb) of the secondary replicon sequences were shared between the two strains (Table 5.6; Figure 5.7), designating ~70-75% (529-717 kb) of the secondary replicons as unique for each strain. Overall, ~80-85% of the genomes (primary and secondary replicons combined) are conserved between R1S1 and ACAM34, while ~15-20% of the genomes are unique to each strain, representing the flexible part of the genome. These values are similar to pathogenic Streptococcus agalactiae strains which share around 80% of their genomes (Tettelin et al., 2005). In a large pan-genome analysis it was shown that ~28% of genes

in bacterial species are either strain specific or shared only by a few members (Lapierre and Gogarten, 2009). The flexible gene content is thought to be responsible for adaptations to local environments (Biller et al., 2015). Our results provide a first assessment of the potential pan-genome for Hrr. lacusprofundi. However, for a better estimation of the pan-genome size, more strains need to be analysed.

Even though the secondary replicons of R1S1 and ACAM34 contained a majority of unique genes, they exhibited a similar composition of broad functional categories (Figure 5.8). Considering that the primary replicons of R1S1 and ACAM34 are almost identical, genes required for adaptation to different ecological niches would be expected to be part of the flexible gene content located on the secondary replicons. Even though both strains were isolated from hypersaline environments not far apart, the environments were distinct in several aspects. Deep Lake is a 36 m deep and homogeneous lake that never freezes. The bottom half of the lake stays extremely cold throughout the year (-8 to -20°C) (Barker, 1981; Ferris and Burton, 1988). When the water sample used for the isolation of R1S1 was taken from R1 Lake, the lake was reported to be mostly around 30 cm deep with only a few small, deeper depressions (Cavicchioli group Antarctic expedition, Australian Antarctic Science project 4031). Hence it can be assumed that there are large temperature differences between summer and winter throughout the lake. During winter R1 Lake has an ice cover which could potentially go down to the bottom, leaving little liquid water for its inhabitants (Hodgson et al., 2001). It was therefore expected that these differences between the two environments would be reflected in the unique gene content of R1S1 and ACAM34. 
However, categorization of the unique gene content of R1S1 and ACAM34 into broad functional categories did not reveal great functional differences between the two strains (Figure 5.8). It needs to be noted though that a detailed analysis of the unique gene content may reveal some physiological distinctions between the two strains.
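
The shared and unique fractions quoted in this section follow from simple arithmetic on the replicon sizes. The sketch below is only an illustration: it assumes the 240 kb of shared sequence given in this section and the secondary replicon totals of 769 and 957 kb given in section 5.5.3. The dictionary keys are placeholder labels, not strain assignments, because the text does not state which total belongs to which strain.

```python
# Values taken from the text (in kb). The labels are placeholders only.
SHARED_KB = 240
SECONDARY_TOTAL_KB = {"smaller total": 769, "larger total": 957}


def shared_and_unique(total_kb, shared_kb=SHARED_KB):
    """Return the unique sequence (kb) and the shared/unique percentages."""
    unique_kb = total_kb - shared_kb
    return {
        "unique_kb": unique_kb,
        "shared_pct": round(100 * shared_kb / total_kb, 1),
        "unique_pct": round(100 * unique_kb / total_kb, 1),
    }


summary = {
    label: shared_and_unique(total)
    for label, total in SECONDARY_TOTAL_KB.items()
}

for label, values in summary.items():
    print(label, values)
```

The unique amounts come out at 529 and 717 kb, matching the range stated above, and the shared percentages fall close to the rounded ~25-30% quoted in the text.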

5.5.3 Different strategies of genome organisation

Not considering transposases, strain specific sequences represented ≤ 1% (5-32 kb) of the R1S1 and ACAM34 primary replicons (Table 5.4). As a comparison, Dyall-Smith et al. (2011) identified twelve regions of major divergence between two Hqr. walsbyi strains, some of them corresponding to previously identified genomic islands (Legault et al., 2006; Cuadros-Orellana et al., 2007). These regions of major divergence added up to around 285-360 kb or 9-11% of the primary replicons (not included in this

calculation was the hypervariable region six, which exhibits sequence variation but mostly conserved gene synteny; Dyall-Smith et al., 2011). Our analysis indicated that, with the exception of the Hlac-Pro1 provirus, the primary replicons of ACAM34 and R1S1 do not contain regions resembling the genomic islands or regions of major divergence described for Hqr. walsbyi. Hence it appears that the primary replicon of Hrr. lacusprofundi is more conserved than that of Hqr. walsbyi. The genomes of the two Hqr. walsbyi strains were 3.2-3.3 Mb in size, with primary replicons of ~3.1 Mb. Only 2-3% of the genomes were encoded on secondary replicons or plasmids (Dyall-Smith et al., 2011). In contrast, the two analysed Hrr. lacusprofundi strains contained secondary replicons totalling 769-957 kb, accounting for 22-26% of the genomes.

Collectively, the data suggest that Hrr. lacusprofundi and Hqr. walsbyi have evolved distinct strategies to accommodate variation in their genomes. Hrr. lacusprofundi possesses a comparatively smaller primary replicon that contains the majority of the core gene content (DeMaere et al., 2013). In addition, Hrr. lacusprofundi contains a relatively large amount of secondary replicon gene content (DeMaere et al., 2013), which was found to be highly flexible in this study. This flexibility in secondary replicon genes potentially allows Hrr. lacusprofundi to respond to a changing environment and to generate strains that can exploit different nutrients. The outsourcing of genomic variation into secondary replicons probably helps Hrr. lacusprofundi to maintain synteny and functionality of its core genes on the primary replicon. On the other hand, the majority of the Hqr. walsbyi genome is contained on the primary replicon, with only very little secondary replicon material (Dyall-Smith et al., 2011). Variation between Hqr. 
walsbyi strains is therefore more likely to occur on the primary replicons, where variation is mostly constrained to genomic islands (Cuadros-Orellana et al., 2007). Hypervariable genomic islands were also described for Prochlorococcus and it was speculated that restricting variation to genomic islands helps maintain the integrity of the rest of the genome (Avrani et al., 2011; Biller et al., 2015). Hht. litchfieldiae strain tADL harbours a genome comprising only a single, large replicon (DeMaere et al., 2013). Metagenome FR analysis revealed regions on the single replicon with a high degree of variation in the metagenome (DeMaere et al., 2013). These regions contained a high density of mobile elements and were more similar to secondary replicons of other Deep Lake haloarchaea (DeMaere et al., 2013). Hence it appears that Hht. litchfieldiae has adopted a strategy similar to that of Hqr. walsbyi

regarding its genomic organisation. Considering that both species represent the most abundant species in their environments it is intriguing to speculate that their similar genomic organisation is a factor contributing to their dominance, as has been hypothesised previously (DeMaere et al., 2013).

5.5.4 HIRs, a haloarchaeal pan-genome in the Vestfold Hills and Rauer Islands

The secondary replicon sequences of R1S1 contained 57 kb representing HIRs, shared with the Deep Lake species Hht. litchfieldiae strain tADL, DL31 and DL1 (Table 5.7). No 16S rRNA gene sequencing data are yet available to investigate if these Deep Lake haloarchaea are also present in R1 Lake. Up until this study, HIRs were only found in the genomes of haloarchaea isolated from Deep Lake (DeMaere et al., 2013). No HIRs were identified in any of the genomes of haloarchaea isolated from warm-temperate climates (except between strains of the same species), and no sequences matching the Deep Lake HIRs were found in the metagenomes of two solar salterns (DeMaere et al., 2013). The research in this chapter has revealed that the exchange of HIRs between different haloarchaeal genera is not restricted to Deep Lake, and that it might be more of a general feature of hypersaline environments in the Vestfold Hills and the Rauer Islands. Thus, in contrast to the pan-genome of a single species, which comprises the genes found across all members of that species, HIRs could be described as part of a broader haloarchaeal pan-genome: a pool of genes that is shared between the collective haloarchaeal community of hypersaline environments in the Vestfold Hills and Rauer Islands.

Chapter 6

Conclusion

This thesis describes the assessment of haloarchaeal community dynamics in Deep Lake, Antarctica, using a variety of ‘-omics’ techniques. For the first time metaproteomics was applied to a hypersaline environment and this study demonstrated the benefits of metaproteomics for studying low-complexity haloarchaeal communities. In combination with metagenomics and genomics, the metaproteomics provided insight into physiological distinctions between abundant haloarchaeal species and revealed intra-species variation within the populations of Deep Lake haloarchaea. The genomic analysis of a new haloarchaeal strain isolated from hypersaline Rauer 1 Lake delivered first insight into the yet unexplored genomic diversity of Antarctic haloarchaea outside of Deep Lake. In this chapter a few additional aspects are discussed that arose during the thesis work. Embedded in this final discussion is a summary of the main results of the thesis and an outlook of possible future work.

6.1 The importance of the Deep Lake metagenome database for the metaproteomics

A critical step in every metaproteomic workflow is peptide-spectrum matching and the choice of the database (see 1.3). In this study a composite database was used. It comprised all the predicted proteins from a Deep Lake metagenome database which was generated from the same samples that were used for the metaproteomics (same depth and same filter size fractions), and all the predicted proteins from the genomes of the four isolated Deep Lake haloarchaea. The Deep Lake metagenome in particular proved to be invaluable for many aspects of the metaproteomic study. The detection of variant proteins, indicative of variation within the populations of Deep Lake haloarchaea, was a core result of this study (see Chapters 3 and 4) and was only achievable through the use of the Deep Lake metagenome database for the peptide-spectrum matching. Different groups of variants could be distinguished within the data: (A) A high number of variants, often with a high degree of sequence variation, were

detected for cell surface structures, in particular the S-layer glycoprotein (see Chapter 3). These variants highlighted that the populations of Deep Lake haloarchaea harbour a wide range of diverse proteins on their surfaces in order to dilute the viral infection pressure (Rodriguez-Valera et al., 2009). (B) Variants related to the uptake and processing of nutrients were detected for Hht. litchfieldiae (see Chapter 4). These variants indicate that the population of Hht. litchfieldiae comprises different phylotypes that collectively exploit the available nutrients more effectively than a clonal population would. (C) Many variants were detected that were derived from tADL-II (see Chapter 4). Their detection initiated a strain comparison between tADL and tADL-II, describing possible physiological distinctions between them. All these variants were encoded on assembled metagenome contigs and would not have been detected if only the predicted proteins from the genomes of the isolate species had been used for the peptide-spectrum matching.

Other environments from which variants were detected using metaproteomics include the Sargasso Sea (Sowell et al., 2009) and an acid mine drainage biofilm (Ram et al., 2005). In both cases the databases used for the peptide-spectrum matching contained metagenome sequences from the same environment. Furthermore, both of these environments and Deep Lake have in common that they harbour a relatively low complexity microbial community, where single groups of organisms account for a large proportion of the community. In the Sargasso Sea, Bacteria from the SAR11 clade can account for 35% of all bacterial and archaeal cells during summer, and variants were detected for SAR11 phosphate-binding ABC transporter lipoproteins (Sowell et al., 2009). 
In the acid mine drainage biofilm, 68% of the detected proteins were derived from Leptospirillum group II Bacteria and variants were detected for a Leptospirillum group II cytochrome (Ram et al., 2005). Thus it becomes apparent that the high abundance of single organism groups facilitates the detection of variants in metaproteomics. In more complex microbial communities, the detection of variants is more difficult due to the relatively lower abundance of the individual community members.

Detected viral proteins were also encoded on metagenome contigs. Sequence identity of the detected viral proteins to viral proteins in the public databases was generally low. Therefore, using publicly available viral protein sequences for the peptide-spectrum matching instead of the Deep Lake metagenome would have presumably left many viral proteins undetected, underpinning again the importance of

the Deep Lake metagenome. However, with respect to the detection of viral proteins, it needs to be noted that some viruses might not have been represented in the Deep Lake metagenome database. Because many haloarchaeal viruses have sizes < 1 µm (Luk et al., 2014) they could have passed the filters used for the biomass sampling and not been captured. For future assessment of the viral community in Deep Lake it would be beneficial to prepare distinct samples of the virus fraction, as has been done in a number of studies on other aquatic systems. Metagenomic sequencing of samples enriched in viruses revealed a diverse viral community in Lake Limnopolar, Byers Peninsula, Antarctica, with most of the obtained sequences representing novel viruses, harbouring no similarity to sequences in public databases (Lopez-Bueno et al., 2009). A near complete halovirus genome was obtained through sequencing a fosmid clone generated from a virus-enriched sample from a crystallizer pond (Santos et al., 2007). In hypersaline Lake Tyrrell, 140 viral contigs > 10 kb could be retrieved from a metagenome that included a sample enriched in viruses (Emerson et al., 2013). It is therefore likely that analysis of a distinct viral fraction of Deep Lake will deepen our understanding of the local viral population and how it interacts with the haloarchaeal hosts.

One important component of this study was the elucidation of host-virus relationships in Deep Lake (see Chapter 3). This was done by searching databases for CRISPR-spacer containing sequences. While a few of the spacers matched to the genomes of the four isolate species, the majority of identified spacers matched to metagenome contigs, mostly derived from viruses. This sort of analysis would not have been possible without a metagenome database.

When there is no matching metagenome database available, publicly available metagenomes and genomes from similar environments can provide a useful alternative for the peptide-spectrum matching. 
Morris et al. (2010) used the large protein database (> 6 million proteins) generated from the metagenomic library of the Global Ocean Sampling (GOS) expedition (Rusch et al., 2007), to identify > 2000 membrane proteins from Atlantic surface water samples (Morris et al., 2010). The GOS database combined with > 200 sequenced genomes from marine Bacteria and Archaea was also used in a metaproteomic study of the northwest Atlantic, observing a shift in expressed transporter proteins from winter to spring (Georges et al., 2014). Williams et al. (2012) combined specific publicly available fosmid libraries, marine microbial genomes and Southern Ocean metagenomes to describe metabolic functions carried out

by the bacterioplankton community off the coast of the Antarctic Peninsula (Williams et al., 2012). In contrast to the detected variant proteins, most detected proteins used for characterizing the physiologies of the three main species matched to the isolate genomes with 100% sequence identity (see Chapter 2). This indicated that proteins pertaining to metabolic pathways were less prone to strain variation than, for example, cell surface proteins. Therefore, this specific question could have been addressed using a database including only the protein sequences generated from the genomes of the isolate species, without the Deep Lake metagenome. These examples illustrate that a matching metagenome database is not always a prerequisite for a successful metaproteomic experiment. However, for the Deep Lake metaproteomics, the metagenome was of great value.
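
The role of the composite database in this section can be illustrated with a deliberately simplified sketch. It assumes that peptides have already been identified and reduces peptide-spectrum matching to an exact string lookup of in silico tryptic peptides, which is not how a real search engine scores spectra. All sequences and identifiers are invented toy values, and the function names are hypothetical.

```python
def tryptic_peptides(protein, min_len=6):
    """Cleave after K or R (proline rule ignored) and keep peptides >= min_len."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return {p for p in peptides if len(p) >= min_len}


def build_index(proteins):
    """Map each tryptic peptide to the set of protein IDs that contain it."""
    index = {}
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            index.setdefault(peptide, set()).add(protein_id)
    return index


def classify(peptide, isolate_index, metagenome_index):
    """Report which part of the composite database explains a peptide."""
    if peptide in isolate_index:
        return "isolate genome"
    if peptide in metagenome_index:
        return "metagenome only (candidate variant)"
    return "unmatched"


# Invented toy proteins: the metagenome copy differs by one residue (T -> S).
isolate_proteins = {"isolate_protein_1": "MSTLKAVDGTEELRLNPQWESGK"}
metagenome_proteins = {"contig_protein_1": "MSTLKAVDGSEELRLNPQWESGK"}

isolate_index = build_index(isolate_proteins)
metagenome_index = build_index(metagenome_proteins)

for observed in ["LNPQWESGK", "AVDGSEELR", "GGGGGGK"]:
    print(observed, "->", classify(observed, isolate_index, metagenome_index))
```

In this toy case the peptide carrying the substitution is explained only by the metagenome-derived entry, which mirrors why the variants described above would have been missed with an isolate-only database.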

6.2 The value of genomic sequencing in the age of metagenomics

Chapter 5 of this thesis describes the analysis of strain variation between two isolated Hrr. lacusprofundi strains, based on the isolation and genomic sequencing of a new strain (R1S1) and its comparison with the genome of the type strain (ACAM34). Alternatively, it is also possible to reconstruct the genomes of single species out of metagenomic datasets (Sangwan et al., 2016), which is particularly useful to gain genomic information for species with no isolate. This approach was successfully applied to generate the genomes of dominant bacterial species from a low complexity biofilm community (Tyson et al., 2004) but also to assemble the genome of a member of the group II Euryarchaeota from a more complex ocean surface sample (Iverson et al., 2012). The near complete genome of a green sulphur bacterium was assembled from the metagenome of a sample taken at the interface of the aerobic surface and anaerobic bottom zone from Ace Lake, a meromictic lake in the Vestfold Hills (Ng et al., 2010). The assembly of single genomes out of a metagenome dataset is mostly based on abundance information (in the form of coverage) and nucleotide composition information (e.g. nucleotide frequencies, GC content) (Sangwan et al., 2016) and for both of these parameters difficulties arise when analysing haloarchaeal communities. Many haloarchaeal genomes contain secondary replicons, which in the case of Hrr. lacusprofundi make up around a quarter of the genome (DeMaere et al., 2013). Secondary replicons have different nucleotide characteristics compared to the primary

ones (e.g. a lower GC-content) (DeMaere et al., 2013) and for R1S1 some secondary replicon contigs had higher coverage compared to the primary replicon (Figure 5.6), indicative of a higher copy number for these replicons. It is therefore unlikely that the complete secondary replicon content of a haloarchaeal genome would be grouped together with the correct primary replicon out of a metagenome. The genomic comparison of the Hrr. lacusprofundi strains has revealed that most of the strain variation occurs on secondary replicons, and this variation could not have been assessed if only a metagenome had been available. A method that could potentially resolve the relationship of primary and secondary replicons in metagenome data is Hi-C (Lieberman-Aiden et al., 2009). Hi-C relies on the physical cross-linking of molecules within cells prior to cell lysis, thereby conserving information about co-localization of DNA elements (Beitel et al., 2014). Using Hi-C on a synthetic microbial community, it was possible to link plasmids together with the chromosome of a species (Beitel et al., 2014). It still needs to be evaluated if Hi-C is applicable to the metagenomes of complex microbial communities from the environment.

Strain variation on the Hrr. lacusprofundi primary replicons was limited to only a few genomic features and regions (see 5.5.1). Over large stretches the primary replicons had 99-100% identical nucleotide sequences. This high level of identity between different strains of a species would make it very difficult to segregate distinct strains out of a metagenome dataset (Sangwan et al., 2016). By contrast, tADL and tADL-II exhibit a lower degree of sequence identity across their whole genomes (see Chapter 4). This higher degree of variation between the two Hht. 
litchfieldiae strains facilitated the assembly of distinct contigs for each of the strains and tADL-II was initially identified as a cluster of metagenomic contigs with a distinct coverage and GC-content (DeMaere et al., 2013). In addition, the tADL genome consists of a single replicon. If that is also the case for tADL-II, using metagenomic sequencing to pull out the genome of tADL-II is a more promising approach compared to Hrr. lacusprofundi strains. An alternative approach to metagenomic sequencing is single-cell genomics: isolating a single cell from a complex community and sequencing its DNA molecules (Gawad et al., 2016). Single-cell genomics would resolve the issues of assigning secondary replicons to primary replicons and discriminating between highly similar genomes. Compared to the traditional genome sequencing, and similar to metagenomics, single-cell genomics has the great advantage in that it can be used on non-cultivatable species, making it applicable to the vast number of microbial species

without an isolated representative (Rinke et al., 2013). However, single-cell genomics is still a rather recent technology with a number of technical challenges, in particular when sequencing complete genomes, as opposed to only selected regions (Gawad et al., 2016).

Even though the rapid advances in technologies make it nowadays tempting to study microorganisms solely using computational approaches, the isolation and cultivation of species and strains is still an invaluable part of today’s microbiology. Isolating and characterising a strain together with sequencing its genome provides the opportunity of linking phenotypic and genotypic characteristics. Provided that genetic tools are available for the species, the function of interesting target genes and their encoded proteins can be assessed. For example, the primary replicon of ACAM34 contains a gene annotated as alpha amylase which is missing on the primary replicon of R1S1. The alpha amylase could be involved in the utilization of starch or other polysaccharides as a nutrient source. If growth studies revealed differences between ACAM34 and R1S1 regarding their ability to use certain polysaccharides as substrates, the alpha amylase would be a first candidate gene potentially responsible for the different phenotype. The function of the alpha amylase, and other candidate genes, could be further assessed using a recently developed transformation and gene knock-out technique for Hrr. lacusprofundi (Yan Liao, personal communication). In metagenomics and metaproteomics, functional annotation of genes/proteins is mostly based on sequence comparison to characterised genes/proteins. But while the amount of generated sequencing data and predicted proteins is increasing enormously, only a fraction of the predicted proteins have their existence proven at the transcript or protein level. 
Compared to the increase in sequencing data, the number of characterised proteins is only increasing very slowly (Temperton and Giovannoni, 2012). Further characterisation of strains and interesting target genes in the laboratory is therefore indispensable for a meaningful characterisation of metagenomic and metaproteomic datasets.
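
The binning difficulty described earlier in this section, where secondary replicons differ from the primary replicon in GC content and coverage, can be made concrete with a minimal sketch. It is not the method of any particular binning tool: the tolerance values, the function names and all contig values are invented for illustration, and the toy GC difference is exaggerated.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    sequence = sequence.upper()
    gc = sum(sequence.count(base) for base in "GC")
    acgt = sum(sequence.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


def same_bin(contig_a, contig_b, gc_tol=0.02, cov_ratio_tol=1.5):
    """Group two contigs only if both GC content and coverage are similar.

    gc_tol and cov_ratio_tol are arbitrary illustrative thresholds.
    """
    gc_close = abs(contig_a["gc"] - contig_b["gc"]) <= gc_tol
    high = max(contig_a["coverage"], contig_b["coverage"])
    low = min(contig_a["coverage"], contig_b["coverage"])
    cov_close = low > 0 and high / low <= cov_ratio_tol
    return gc_close and cov_close


# Invented toy contigs: two from a primary replicon and one from a secondary
# replicon with lower GC content and higher coverage (higher copy number).
primary_a = {"gc": gc_content("GCGCGGCCAT"), "coverage": 100.0}
primary_b = {"gc": gc_content("GCGCGGCCTA"), "coverage": 110.0}
secondary = {"gc": gc_content("GCGATTACAT"), "coverage": 250.0}

print("primary_a with primary_b:", same_bin(primary_a, primary_b))
print("primary_a with secondary:", same_bin(primary_a, secondary))
```

Under these assumptions the two primary replicon contigs are grouped together while the secondary replicon contig is left out, even though it belongs to the same genome, which is the problem that isolate sequencing avoids.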

6.3 Vestfold Hills and Rauer Islands: great scope for unexplored haloarchaeal communities

To date, Deep Lake is the only known Antarctic hypersaline system that harbours a haloarchaea-dominated microbial community (Bowman et al., 2000; DeMaere et al., 2013). Besides Deep Lake, the Vestfold Hills and also the Rauer Islands

harbour a number of other hypersaline lakes (Wright and Burton, 1981; Hodgson et al., 2001) and haloarchaea were identified in other lakes, e.g. Organic Lake, but only as low-abundance members of the microbial community (Bowman et al., 2000; Yau et al., 2013). However, many of the hypersaline lakes in the Vestfold Hills and the Rauer Islands have not yet been studied using environmental sequencing techniques and harbour untapped microbial communities. The isolation of Hrr. lacusprofundi strain R1S1 from Rauer 1 Lake indicates that haloarchaea are likely to be found in many of these hypersaline lakes. One lake which could potentially harbour a haloarchaea-dominated microbial community is Club Lake (68° 33’ 21.6” S, 78° 14’ 06.0” E) in the Vestfold Hills. Similar to Deep Lake, Club Lake is reported to be hypersaline and monomictic (Gibson, 1999) and Club Lake and Deep Lake are separated by < 1 km. Many of the so far neglected hypersaline lakes of the Vestfold Hills and Rauer Islands, including Club Lake, were visited and sampled during a recent expedition of the Cavicchioli group (Australian Antarctic Science Project 4031). Metagenomic and 16S rRNA gene sequencing of these samples will reveal the community composition of these lakes. It will be exciting to find out if any of these lakes also harbour haloarchaea-dominated microbial communities and if the same species are present in similar abundances as in Deep Lake.

Since the genome of R1S1 contains HIRs, which were shared with other Deep Lake haloarchaea, it will be particularly interesting to learn about the community composition of Rauer 1 Lake. Hrr. lacusprofundi was identified as a hub for HIR distribution in Deep Lake, and it is possible that the novel HIRs are also part of the Hrr. lacusprofundi pan-genome in Deep Lake (but were just not present in the isolated strain ACAM34). Alternatively, the novel HIRs could be unique to the Hrr. lacusprofundi population of Rauer 1 Lake. 
In this latter case it will be interesting to find out if the other Deep Lake species, with which the HIRs are shared, are also present in Rauer 1 Lake, or if the HIRs are shared with yet unknown species in Rauer 1 Lake. The microbial community of Deep Lake has been studied twice by sequencing DNA from Deep Lake biomass (Bowman et al., 2000; DeMaere et al., 2013). The samples in DeMaere et al. (2013) were taken from November to December 2008. Bowman et al. (2000) do not mention the date when they collected the samples but it must have been during one of the Antarctic summers before the year 2000. Both studies have described a low-complexity haloarchaea-dominated community in Deep Lake but with different species being the most dominant members. While it is likely that the differences between the two studies are related to distinct sampling and sequencing

techniques (see 1.2.2), it cannot be ruled out that they reflect a temporal change of community composition in Deep Lake. In Organic Lake, a hypersaline and meromictic lake in the Vestfold Hills, the community composition changed substantially over the course of two years (Yau, 2013). The microbial communities of lakes in the McMurdo Dry Valleys, Antarctica, were found to shift between summer and autumn, indicating a seasonal cycle (Vick-Majors et al., 2014). Similarly, in Lake Limnopolar the viral community shifted from spring to summer, possibly reflecting a change in host organism abundance (Lopez-Bueno et al., 2009). However, Deep Lake harbours a large water body with a uniform haloarchaeal community and growth was predicted to occur mainly near the surface during the three months of summer; the growth rate was estimated to be less than six generations a year (DeMaere et al., 2013). Changes in community composition would therefore be expected to happen only slowly. For a better understanding of the Deep Lake community dynamics, it is essential to assess the community composition at different time points throughout a year, and also over a longer time scale (multiple years). Furthermore, metaproteomics on samples collected during the Antarctic winter should reveal how the Deep Lake haloarchaea respond physiologically to the permanent darkness and cooler temperatures. Deep Lake was visited and sampled on a number of occasions, including winter, during the latest Antarctic expedition of the Cavicchioli group. Analysis of these samples should shed some light on the dynamics of the Deep Lake microbial community.

6.4 Concluding remarks

While this study has advanced our understanding of Antarctic haloarchaeal community functioning, it has also highlighted the need for further investigations. For example, this study gave a first insight into a diverse viral community in Deep Lake but a detailed assessment of the viruses that are present in the lake and how they shape the haloarchaeal community is still outstanding. Also, HIRs were identified in the genome of Hrr. lacusprofundi strain R1S1 isolated from Rauer 1 Lake. Studies on additional Antarctic hypersaline systems should reveal if HIRs are a general feature of Antarctic haloarchaeal communities, potentially connecting all Antarctic haloarchaeal species. The mechanism by which HIRs are distributed also still awaits discovery. This thesis provides a solid framework for future studies on Antarctic haloarchaeal communities. The future for Antarctic haloarchaeal research is therefore looking bright and will undoubtedly continue to reveal surprising and fascinating results. It is therefore

of utmost importance that Antarctica as a whole, with all its pristine and unique ecosystems, stays protected from all forms of human exploitation in future.

References

Alam, M., and Oesterhelt, D. (1984) Morphology, function and isolation of halobacterial flagella. J Mol Biol 176: 459-475.

Albers, S.V., and Meyer, B.H. (2011) The archaeal cell envelope. Nat Rev Microbiol 9: 414-426.

Albers, S.V., and Jarrell, K.F. (2015) The archaellum: how Archaea swim. Front Microbiol 6: 23.

Alonso-Saez, L., Waller, A.S., Mende, D.R., Bakker, K., Farnelid, H., Yager, P.L. et al. (2012) Role for urea in nitrification by polar marine Archaea. Proc Natl Acad Sci U S A 109: 17989-17994.

Andam, C.P., Harlow, T.J., Papke, R.T., and Gogarten, J.P. (2012) Ancient origin of the divergent forms of leucyl-tRNA synthetases in the Halobacteriales. BMC Evol Biol 12: 85.

Andersson, A.F., and Banfield, J.F. (2008) Virus population dynamics and acquired virus resistance in natural microbial communities. Science 320: 1047-1050.

Anesio, A.M., and Bellas, C.M. (2011) Are low temperature habitats hot spots of microbial evolution driven by viruses? Trends Microbiol 19: 52-57.

Anton, J., Oren, A., Benlloch, S., Rodriguez-Valera, F., Amann, R., and Rossello-Mora, R. (2002) Salinibacter ruber gen. nov., sp. nov., a novel, extremely halophilic member of the Bacteria from saltern crystallizer ponds. Int J Syst Evol Microbiol 52: 485-491.

Arcus, V.L., McKenzie, J.L., Robson, J., and Cook, G.M. (2011) The PIN-domain ribonucleases and the prokaryotic VapBC toxin-antitoxin array. Protein Eng Des Sel 24: 33-40.

Artimo, P., Jonnalagedda, M., Arnold, K., Baratin, D., Csardi, G., de Castro, E. et al. (2012) ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res 40: W597-603.

Atanasova, N.S., Bamford, D.H., and Oksanen, H.M. (2016) Virus-host interplay in high salt environments. Environ Microbiol Rep 8: 431-444.

Atanasova, N.S., Roine, E., Oren, A., Bamford, D.H., and Oksanen, H.M. (2012) Global network of specific virus-host interactions in hypersaline environments. Environ Microbiol 14: 426-440.

Avrani, S., Wurtzel, O., Sharon, I., Sorek, R., and Lindell, D. (2011) Genomic island variability facilitates Prochlorococcus-virus coexistence. Nature 474: 604-608.

Barker, R. (1981) Physical and chemical parameters of Deep Lake, Vestfold Hills, Antarctica. Australian National Antarctic Research Expeditions Series B(V) Limnology Publication No. 130.

Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S. et al. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315: 1709-1712.

Bath, C., and Dyall-Smith, M.L. (1998) His1, an archaeal virus of the Fuselloviridae family that infects Haloarcula hispanica. J Virol 72: 9392-9395.

Becker, E.A., Seitzer, P.M., Tritt, A., Larsen, D., Krusor, M., Yao, A.I. et al. (2014) Phylogenetically driven sequencing of extremely halophilic archaea reveals strategies for static and dynamic osmo-response. PLoS Genet 10: e1004784.

Beitel, C.W., Froenicke, L., Lang, J.M., Korf, I.F., Michelmore, R.W., Eisen, J.A., and Darling, A.E. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2: e415.

Beja, O., Aravind, L., Koonin, E.V., Suzuki, M.T., Hadd, A., Nguyen, L.P. et al. (2000) Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 289: 1902-1906.

Biller, S.J., Berube, P.M., Lindell, D., and Chisholm, S.W. (2015) Prochlorococcus: the structure and function of collective diversity. Nat Rev Microbiol 13: 13-27.

Bland, C., Ramsey, T.L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N.C., and Hugenholtz, P. (2007) CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8: 209.

Bobay, L.M., Touchon, M., and Rocha, E.P. (2014) Pervasive domestication of defective prophages by bacteria. Proc Natl Acad Sci U S A 111: 12127-12132.

Bodaker, I., Sharon, I., Suzuki, M.T., Feingersch, R., Shmoish, M., Andreishcheva, E. et al. (2010) Comparative community genomics in the Dead Sea: an increasingly extreme environment. ISME J 4: 399-407.

Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., and Pirovano, W. (2011) Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27: 578-579.

Bolhuis, H., Poele, E.M., and Rodriguez-Valera, F. (2004) Isolation and cultivation of Walsby's square archaeon. Environ Microbiol 6: 1287-1291.

Borowitzka, L.J., Kessly, D.S., and Brown, A.D. (1977) The salt relations of Dunaliella. Further observations on glycerol production and its regulation. Arch Microbiol 113: 131-138.

Boubriak, I., Ng, W.L., DasSarma, P., DasSarma, S., Crowley, D.J., and McCready, S.J. (2008) Transcriptional responses to biologically relevant doses of UV-B radiation in the model archaeon, Halobacterium sp. NRC-1. Saline Systems 4: 13.

Bowman, J.P., McCammon, S.A., Rea, S.M., and McMeekin, T.A. (2000) The microbial composition of three limnologically disparate hypersaline Antarctic lakes. FEMS Microbiol Lett 183: 81-88.

Brasen, C., Esser, D., Rauch, B., and Siebers, B. (2014) Carbohydrate metabolism in Archaea: current insights into unusual enzymes and pathways and their regulation. Microbiol Mol Biol Rev 78: 89-175.

Burns, D.G., Camakaris, H.M., Janssen, P.H., and Dyall-Smith, M.L. (2004) Cultivation of Walsby's square haloarchaeon. FEMS Microbiol Lett 238: 469-473.

Burrows, L.L. (2005) Weapons of mass retraction. Mol Microbiol 57: 878-888.

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421.

Campbell, P.J. (1978) Primary Productivity of a Hypersaline Antarctic Lake. Aust J Mar Freshwater Res 29: 717-724.

Carver, T., Thomson, N., Bleasby, A., Berriman, M., and Parkhill, J. (2009) DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 25: 119-120.

Carver, T., Harris, S.R., Berriman, M., Parkhill, J., and McQuillan, J.A. (2012) Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28: 464-469.

Carver, T.J., Rutherford, K.M., Berriman, M., Rajandream, M.A., Barrell, B.G., and Parkhill, J. (2005) ACT: the Artemis Comparison Tool. Bioinformatics 21: 3422-3423.

Cavicchioli, R. (2011) Archaea--timeline of the third domain. Nat Rev Microbiol 9: 51-61.

Cavicchioli, R. (2015) Microbial ecology of Antarctic aquatic systems. Nat Rev Microbiol 13: 691-706.

Cavicchioli, R., Amils, R., Wagner, D., and McGenity, T. (2011) Life and applications of extremophiles. Environ Microbiol 13: 1903-1907.

Chaban, B., Logan, S.M., Kelly, J.F., and Jarrell, K.F. (2009) AglC and AglK are involved in biosynthesis and attachment of diacetylated glucuronic acid to the N-glycan in Methanococcus voltae. J Bacteriol 191: 187-195.

Chimileski, S., Dolas, K., Naor, A., Gophna, U., and Papke, R.T. (2014) Extracellular DNA metabolism in Haloferax volcanii. Front Microbiol 5: 57.

Chown, S.L., Clarke, A., Fraser, C.I., Cary, S.C., Moon, K.L., and McGeoch, M.A. (2015) The changing form of Antarctic biodiversity. Nature 522: 431-438.

Claassen, M. (2012) Inference and validation of protein identifications. Mol Cell Proteomics 11: 1097-1104.

Clark, K.R., and Gorley, R.N. (2006) PRIMER v6: User Manual/Tutorial. PRIMER-E: Plymouth, 192 pp.

Cohen, M. (2015) Notable Aspects of Glycan-Protein Interactions. Biomolecules 5: 2056-2072.

Coleman, M.L., Sullivan, M.B., Martiny, A.C., Steglich, C., Barry, K., Delong, E.F., and Chisholm, S.W. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311: 1768-1770.

Cornelissen, A., Ceyssens, P.J., T'Syen, J., Van Praet, H., Noben, J.P., Shaburova, O.V. et al. (2011) The T7-related Pseudomonas putida phage phi15 displays virion-associated biofilm degradation properties. PLoS One 6: e18597.

Cuadros-Orellana, S., Martin-Cuadrado, A.B., Legault, B., D'Auria, G., Zhaxybayeva, O., Papke, R.T., and Rodriguez-Valera, F. (2007) Genomic plasticity in prokaryotes: the case of the square haloarchaeon. ISME J 1: 235-245.

Danovaro, R., Corinaldesi, C., Dell'Anno, A., Fabiano, M., and Corselli, C. (2005) Viruses, prokaryotes and DNA in the sediments of a deep-hypersaline anoxic basin (DHAB) of the Mediterranean Sea. Environ Microbiol 7: 586-592.

Delcher, A.L., Phillippy, A., Carlton, J., and Salzberg, S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30: 2478-2483.

Dell'Anno, A., and Danovaro, R. (2005) Extracellular DNA plays a key role in deep-sea ecosystem functioning. Science 309: 2179.

DeMaere, M.Z., Lauro, F.M., Thomas, T., Yau, S., and Cavicchioli, R. (2011) Simple high-throughput annotation pipeline (SHAP). Bioinformatics 27: 2431-2432.

DeMaere, M.Z., Williams, T.J., Allen, M.A., Brown, M.V., Gibson, J.A., Rich, J. et al. (2013) High level of intergenera gene exchange shapes the evolution of haloarchaea in an isolated Antarctic lake. Proc Natl Acad Sci U S A 110: 16939-16944.

Dokland, T. (1999) Scaffolding proteins and their role in viral assembly. Cell Mol Life Sci 56: 580-603.

Dyall-Smith, M., Tang, S.L., and Bath, C. (2003) Haloarchaeal viruses: how diverse are they? Res Microbiol 154: 309-313.

Dyall-Smith, M.L. (2009) The Halohandbook: protocols for haloarchaeal genetics. http://www.haloarchaea.com/resources/halohandbook.

Dyall-Smith, M.L., Pfeiffer, F., Klee, K., Palm, P., Gross, K., Schuster, S.C. et al. (2011) Haloquadratum walsbyi: limited diversity in a global pond. PLoS One 6: e20968.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792-1797.

Elevi Bardavid, R., and Oren, A. (2008) Dihydroxyacetone metabolism in Salinibacter ruber and in Haloquadratum walsbyi. Extremophiles 12: 125-131.

Elevi Bardavid, R., Khristo, P., and Oren, A. (2008) Interrelationships between Dunaliella and halophilic prokaryotes in saltern crystallizer ponds. Extremophiles 12: 5-14.

Emerson, J.B., Thomas, B.C., Andrade, K., Allen, E.E., Heidelberg, K.B., and Banfield, J.F. (2012) Dynamic viral populations in hypersaline systems as revealed by metagenomic assembly. Appl Environ Microbiol 78: 6309-6320.

Emerson, J.B., Andrade, K., Thomas, B.C., Norman, A., Allen, E.E., Heidelberg, K.B., and Banfield, J.F. (2013) Virus-host and CRISPR dynamics in Archaea-dominated hypersaline Lake Tyrrell, Victoria, Australia. Archaea 2013: 370871.

Esquivel, R.N., Xu, R., and Pohlschroder, M. (2013) Novel archaeal adhesion pilins with a conserved N terminus. J Bacteriol 195: 3808-3818.

Fagan, R.P., and Fairweather, N.F. (2014) Biogenesis and functions of bacterial S-layers. Nat Rev Microbiol 12: 211-222.

Falb, M., Muller, K., Konigsmaier, L., Oberwinkler, T., Horn, P., von Gronau, S. et al. (2008) Metabolism of halophilic archaea. Extremophiles 12: 177-196.

Ferris, J.M., and Burton, H.R. (1988) The Annual Cycle of Heat-Content and Mechanical Stability of Hypersaline Deep Lake, Vestfold Hills, Antarctica. Hydrobiologia 165: 115-128.

Forward, J.A., Behrendt, M.C., Wyborn, N.R., Cross, R., and Kelly, D.J. (1997) TRAP transporters: a new family of periplasmic solute transport systems encoded by the dctPQM genes of Rhodobacter capsulatus and by homologs in diverse gram-negative bacteria. J Bacteriol 179: 5482-5493.

Franzmann, P.D., Stackebrandt, E., Sanderson, K., Volkman, J.K., Cameron, D.E., Stevenson, P.L. et al. (1988) Halobacterium lacusprofundi sp. nov., a Halophilic Bacterium Isolated from Deep Lake, Antarctica. Systematic and Applied Microbiology 11: 20-27.

Fullmer, M.S., Gogarten, J.P., and Papke, R.T. (2014) Horizontal Gene Transfer in Halobacteria. In Halophiles: Genetics and Genomes. Papke, R.T., and Oren, A. (eds). Norfolk, UK: Caister Academic Press.

Galardini, M., Biondi, E.G., Bazzicalupo, M., and Mengoni, A. (2011) CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code Biol Med 6: 11.

Garcia-Heredia, I., Martin-Cuadrado, A.B., Mojica, F.J., Santos, F., Mira, A., Anton, J., and Rodriguez-Valera, F. (2012) Reconstructing viral genomes from the environment using fosmid clones: the case of haloviruses. PLoS One 7: e33802.

Gawad, C., Koh, W., and Quake, S.R. (2016) Single-cell genome sequencing: current state of the science. Nat Rev Genet 17: 175-188.

Georges, A.A., El-Swais, H., Craig, S.E., Li, W.K., and Walsh, D.A. (2014) Metaproteomic analysis of a winter to spring succession in coastal northwest Atlantic Ocean microbial plankton. ISME J.

Gerdes, K. (2000) Toxin-antitoxin modules may regulate synthesis of macromolecules during nutritional stress. J Bacteriol 182: 561-572.

Gerdes, K., Christensen, S.K., and Lobner-Olesen, A. (2005) Prokaryotic toxin-antitoxin stress response loci. Nat Rev Microbiol 3: 371-382.

Gerdes, K., Bech, F.W., Jorgensen, S.T., Lobner-Olesen, A., Rasmussen, P.B., Atlung, T. et al. (1986) Mechanism of postsegregational killing by the hok gene product of the parB system of plasmid R1 and its homology with the relF gene product of the E. coli relB operon. EMBO J 5: 2023-2029.

Gibson, J.A.E. (1999) The meromictic lakes and stratified marine basins of the Vestfold Hills, East Antarctica. Antarctic Science 11: 175-192.

Goldfarb, T., Sberro, H., Weinstock, E., Cohen, O., Doron, S., Charpak-Amikam, Y. et al. (2015) BREX is a novel phage resistance system widespread in microbial genomes. EMBO J 34: 169-183.

Gomes-Filho, J.V., Zaramela, L.S., Italiani, V.C.S., Baliga, N.S., Vêncio, R.Z.N., and Koide, T. (2015) Sense overlapping transcripts in IS1341-type transposase genes are functional non-coding RNAs in archaea. RNA Biol 12: 490-500.

Goyal, A. (2007) Osmoregulation in Dunaliella, Part II: Photosynthesis and starch contribute carbon for glycerol synthesis during a salt stress in Dunaliella tertiolecta. Plant Physiol Biochem 45: 705-710.

Grininger, M., Staudt, H., Johansson, P., Wachtveitl, J., and Oesterhelt, D. (2009) Dodecin is the key player in flavin homeostasis of archaea. J Biol Chem 284: 13068-13076.

Grissa, I., Vergnaud, G., and Pourcel, C. (2007) CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 35: W52-57.

Grote, M., and O'Malley, M.A. (2011) Enlightening the life sciences: the history of halobacterial and microbial rhodopsin research. FEMS Microbiol Rev 35: 1082-1099.

Groussin, M., Boussau, B., Szollosi, G., Eme, L., Gouy, M., Brochier-Armanet, C., and Daubin, V. (2016) Gene Acquisitions from Bacteria at the Origins of Major Archaeal Clades Are Vastly Overestimated. Mol Biol Evol 33: 305-310.

Guixa-Boixareu, N., Calderón-Paz, J.I., Heldal, M., Bratbak, G., and Pedrós-Alió, C. (1996) Viral lysis and bacterivory as prokaryotic loss factors along a salinity gradient. Aquatic Microbial Ecology 11: 215-227.

Gupta, N., and Pevzner, P.A. (2009) False discovery rates of protein identifications: a strike against the two-peptide rule. J Proteome Res 8: 4173-4181.

Hand, R.M. (1980) Bacterial Populations of Two Saline Antarctic Lakes. In Biogeochemistry of Ancient and Modern Environments: Proceedings of the Fourth International Symposium on Environmental Biogeochemistry (ISEB). Trudinger, P.A., and Walters, M.R. (eds). Canberra: Australian Academy of Science, pp. 123-129.

Heidelberg, K.B., Nelson, W.C., Holm, J.B., Eisenkolb, N., Andrade, K., and Emerson, J.B. (2013) Characterization of eukaryotic microbial diversity in hypersaline Lake Tyrrell, Australia. Front Microbiol 4: 115.

Herbst, F.A., Bahr, A., Duarte, M., Pieper, D.H., Richnow, H.H., von Bergen, M. et al. (2013) Elucidation of in situ polycyclic aromatic hydrocarbon degradation by functional metaproteomics (protein-SIP). Proteomics 13: 2910-2920.

Hodgson, D.A., Vyverman, W., and Sabbe, K. (2001) Limnology and biology of saline lakes in the Rauer Islands, eastern Antarctica. Antarctic Science 13: 255-270.

Hou, S., Larsen, R.W., Boudko, D., Riley, C.W., Karatan, E., Zimmer, M. et al. (2000) Myoglobin-like aerotaxis transducers in Archaea and Bacteria. Nature 403: 540-544.

Hyman, P., and Abedon, S.T. (2010) Bacteriophage host range and bacterial resistance. Adv Appl Microbiol 70: 217-248.

Inoue, K., Tsukamoto, T., and Sudo, Y. (2014) Molecular and evolutionary aspects of microbial sensory rhodopsins. Biochim Biophys Acta 1837: 562-577.

Iverson, V., Morris, R.M., Frazar, C.D., Berthiaume, C.T., Morales, R.L., and Armbrust, E.V. (2012) Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science 335: 587-590.

Jansen, R., Embden, J.D., Gaastra, W., and Schouls, L.M. (2002) Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 43: 1565-1575.

Jarrell, K.F., and Albers, S.V. (2012) The archaellum: an old motility structure with a new name. Trends Microbiol 20: 307-312.

Jarrell, K.F., Jones, G.M., Kandiba, L., Nair, D.B., and Eichler, J. (2010) S-layer glycoproteins and flagellins: reporters of archaeal posttranslational modifications. Archaea 2010.

Jarrell, K.F., Ding, Y., Meyer, B.H., Albers, S.V., Kaminski, L., and Eichler, J. (2014) N-linked glycosylation in Archaea: a structural, functional, and genetic analysis. Microbiol Mol Biol Rev 78: 304-341.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8: 275-282.

Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C. et al. (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics 30: 1236-1240.

Kaminski, L., and Eichler, J. (2014) Haloferax volcanii N-glycosylation: delineating the pathway of dTDP-rhamnose biosynthesis. PLoS One 9: e97441.

Kandiba, L., and Eichler, J. (2014) Archaeal S-layer glycoproteins: post-translational modification in the face of extremes. Front Microbiol 5: 661.

Kandiba, L., Aitio, O., Helin, J., Guan, Z., Permi, P., Bamford, D.H. et al. (2012) Diversity in prokaryotic glycosylation: an archaeal-derived N-linked glycan contains legionaminic acid. Mol Microbiol 84: 578-593.

Kato, K., and Ishiwa, A. (2015) The role of carbohydrates in infection strategies of enteric pathogens. Trop Med Health 43: 41-52.

Kepner, R.L., Jr., Wharton, R.A., Jr., and Suttle, C.A. (1998) Viruses in Antarctic lakes. Limnol Oceanogr 43: 1754-1761.

Klein, R., Rossler, N., Iro, M., Scholz, H., and Witte, A. (2012) Haloarchaeal myovirus phiCh1 harbours a phase variation system for the production of protein variants with distinct cell surface adhesion specificities. Mol Microbiol 83: 137-150.

Klein, R.J., Misulovin, Z., and Eddy, S.R. (2002) Noncoding RNA genes identified in AT-rich hyperthermophiles. Proc Natl Acad Sci U S A 99: 7542-7547.

Kletzin, A. (2007) General Characteristics and Important Model Organisms. In Archaea: Molecular and Cellular Biology. Cavicchioli, R. (ed). Washington DC: American Society for Microbiology Press.

Kocur, M., and Hodgkiss, W. (1973) Taxonomic Status of the Genus Halococcus Schoop. International Journal of Systematic Bacteriology 23: 151-156.

Kristensen, D.M., Kannan, L., Coleman, M.K., Wolf, Y.I., Sorokin, A., Koonin, E.V., and Mushegian, A. (2010) A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics 26: 1481-1487.

Krupovic, M., Forterre, P., and Bamford, D.H. (2010) Comparative analysis of the mosaic genomes of tailed archaeal viruses and proviruses suggests common themes for virion architecture and assembly with tailed viruses of bacteria. J Mol Biol 397: 144-160.

Kukkaro, P., and Bamford, D.H. (2009) Virus-host interactions in environments with a wide range of ionic strengths. Environ Microbiol Rep 1: 71-77.

Lapierre, P., and Gogarten, J.P. (2009) Estimating the size of the bacterial pan-genome. Trends Genet 25: 107-110.

Lauro, F.M., DeMaere, M.Z., Yau, S., Brown, M.V., Ng, C., Wilkins, D. et al. (2011) An integrative study of a meromictic lake ecosystem in Antarctica. ISME J 5: 879-895.

Leary, D.H., Hervey, W.J., 4th, Deschamps, J.R., Kusterbeck, A.W., and Vora, G.J. (2013) Which metaproteome? The impact of protein extraction bias on metaproteomic analyses. Mol Cell Probes 27: 193-199.

Legault, B.A., Lopez-Lopez, A., Alba-Casado, J.C., Doolittle, W.F., Bolhuis, H., Rodriguez-Valera, F., and Papke, R.T. (2006) Environmental genomics of "Haloquadratum walsbyi" in a saltern crystallizer indicates a large pool of accessory genes in an otherwise coherent species. BMC Genomics 7: 171.

Li, M., Liu, H., Han, J., Liu, J., Wang, R., Zhao, D. et al. (2013) Characterization of CRISPR RNA biogenesis and Cas6 cleavage-mediated inhibition of a provirus in the haloarchaeon Haloferax mediterranei. J Bacteriol 195: 867-875.

Li, W., Cowley, A., Uludag, M., Gur, T., McWilliam, H., Squizzato, S. et al. (2015) The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res 43: W580-584.

Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A. et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326: 289-293.

Lopez-Bueno, A., Tamames, J., Velazquez, D., Moya, A., Quesada, A., and Alcami, A. (2009) High diversity of the viral community from an Antarctic lake. Science 326: 858-861.

Lowther, W.T., Brot, N., Weissbach, H., Honek, J.F., and Matthews, B.W. (2000) Thiol-disulfide exchange is involved in the catalytic mechanism of peptide methionine sulfoxide reductase. Proc Natl Acad Sci U S A 97: 6463-6468.

Luk, A.W., Williams, T.J., Erdmann, S., Papke, R.T., and Cavicchioli, R. (2014) Viruses of haloarchaea. Life (Basel) 4: 681-715.

Madan, N.J., Marshall, W.A., and Laybourn-Parry, J. (2005) Virus and microbial loop dynamics over an annual cycle in three contrasting Antarctic lakes. Freshwater Biology 50: 1291-1300.

Maier, L.K., Lange, S.J., Stoll, B., Haas, K.A., Fischer, S., Fischer, E. et al. (2013) Essential requirements for the detection and degradation of invaders by the Haloferax volcanii CRISPR/Cas system I-B. RNA Biol 10: 865-874.

Makarova, K.S., Wolf, Y.I., and Koonin, E.V. (2015) Archaeal Clusters of Orthologous Genes (arCOGs): An Update and Application for Analysis of Shared Features between Thermococcales, Methanococcales, and Methanobacteriales. Life (Basel) 5: 818-840.

Makarova, K.S., Aravind, L., Wolf, Y.I., and Koonin, E.V. (2011) Unification of Cas protein families and a simple scenario for the origin and evolution of CRISPR-Cas systems. Biol Direct 6: 38.

Markowitz, V.M., Chen, I.-M.A., Palaniappan, K., Chu, K., Szeto, E., Grechkin, Y. et al. (2012) IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Research 40: D115-D122.

Martínez‐Murcia, A.J., Acinas, S.G., and Rodriguez‐Valera, F. (1995) Evaluation of prokaryotic diversity by restrictase digestion of 16S rDNA directly amplified from hypersaline environments. FEMS Microbiology Ecology 17: 247-255.

Matte-Tailliez, O., Brochier, C., Forterre, P., and Philippe, H. (2002) Archaeal phylogeny based on ribosomal proteins. Mol Biol Evol 19: 631-639.

Mauriello, E.M., Mignot, T., Yang, Z., and Zusman, D.R. (2010) Gliding motility revisited: how do the myxobacteria move without flagella? Microbiol Mol Biol Rev 74: 229-249.

McCready, S., and Marcello, L. (2003) Repair of UV damage in Halobacterium salinarum. Biochem Soc Trans 31: 694-698.

McCready, S., Muller, J.A., Boubriak, I., Berquist, B.R., Ng, W.L., and DasSarma, S. (2005) UV irradiation induces homologous recombination genes in the model archaeon, Halobacterium sp. NRC-1. Saline Systems 1: 3.

McWilliam, H., Li, W., Uludag, M., Squizzato, S., Park, Y.M., Buso, N. et al. (2013) Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res 41: W597-600.

Meyer, Y., Buchanan, B.B., Vignols, F., and Reichheld, J.P. (2009) Thioredoxins and glutaredoxins: unifying elements in redox biology. Annu Rev Genet 43: 335-367.

Mongodin, E.F., Nelson, K.E., Daugherty, S., Deboy, R.T., Wister, J., Khouri, H. et al. (2005) The genome of Salinibacter ruber: convergence and gene exchange among hyperhalophilic bacteria and archaea. Proc Natl Acad Sci U S A 102: 18147-18152.

Morris, R.M., Nunn, B.L., Frazar, C., Goodlett, D.R., Ting, Y.S., and Rocap, G. (2010) Comparative metaproteomics reveals ocean-scale shifts in microbial nutrient utilization and energy transduction. ISME J 4: 673-685.

Mou, Y.Z., Qiu, X.X., Zhao, M.L., Cui, H.L., Oh, D., and Dyall-Smith, M.L. (2012) Halohasta litorea gen. nov. sp. nov., and Halohasta litchfieldiae sp. nov., isolated from the Daliang aquaculture farm, China and from Deep Lake, Antarctica, respectively. Extremophiles 16: 895-901.

Mullakhanbhai, M.F., and Larsen, H. (1975) Halobacterium volcanii spec. nov., a Dead Sea halobacterium with a moderate salt requirement. Arch Microbiol 104: 207-214.

Murray, A.E., Kenig, F., Fritsen, C.H., McKay, C.P., Cawley, K.M., Edwards, R. et al. (2012) Microbial life at -13 degrees C in the brine of an ice-sealed Antarctic lake. Proc Natl Acad Sci U S A 109: 20626-20631.

Narasingarao, P., Podell, S., Ugalde, J.A., Brochier-Armanet, C., Emerson, J.B., Brocks, J.J. et al. (2012) De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities. ISME J 6: 81-93.

Nelson-Sathi, S., Dagan, T., Landan, G., Janssen, A., Steel, M., McInerney, J.O. et al. (2012) Acquisition of 1,000 eubacterial genes physiologically transformed a methanogen at the origin of Haloarchaea. Proc Natl Acad Sci U S A 109: 20537-20542.

Ng, C., DeMaere, M.Z., Williams, T.J., Lauro, F.M., Raftery, M., Gibson, J.A. et al. (2010) Metaproteogenomic analysis of a dominant green sulfur bacterium from Ace Lake, Antarctica. ISME J 4: 1002-1019.

Ng, W.V., Kennedy, S.P., Mahairas, G.G., Berquist, B., Pan, M., Shukla, H.D. et al. (2000) Genome sequence of Halobacterium species NRC-1. Proc Natl Acad Sci U S A 97: 12176-12181.

Nurk, S., Bankevich, A., Antipov, D., Gurevich, A., Korobeynikov, A., Lapidus, A. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In Research in Computational Molecular Biology: 17th Annual International Conference, RECOMB 2013, Beijing, China, April 7-10, 2013 Proceedings. Deng, M., Jiang, R., Sun, F., and Zhang, X. (eds). Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 158-170.

Oesterhelt, D., and Stoeckenius, W. (1971) Rhodopsin-like protein from the purple membrane of Halobacterium halobium. Nat New Biol 233: 149-152.

Oren, A. (2002) Molecular ecology of extremely halophilic Archaea and Bacteria. FEMS Microbiol Ecol 39: 1-7.

Oren, A. (2013a) Salinibacter: an extremely halophilic bacterium with archaeal properties. FEMS Microbiol Lett 342: 1-9.

Oren, A. (2013b) Life at high salt concentrations, intracellular KCl concentrations, and acidic proteomes. Front Microbiol 4: 315.

Oren, A. (2014a) Halophilic archaea on Earth and in space: growth and survival under extreme conditions. Philos Trans A Math Phys Eng Sci 372.

Oren, A. (2014b) The ecology of Dunaliella in high-salt environments. J Biol Res (Thessalon) 21: 23.

Oren, A. (2014c) Taxonomy of halophilic Archaea: current status and future challenges. Extremophiles 18: 825-834.

Oren, A. (2014d) DNA as genetic material and as a nutrient in halophilic Archaea. Front Microbiol 5: 539.

Oren, A. (2015) Halophilic microbial communities and their environments. Curr Opin Biotechnol 33: 119-124.

Oren, A., Bratbak, G., and Heldal, M. (1997) Occurrence of virus-like particles in the Dead Sea. Extremophiles 1: 143-149.

Oren, A., Ginzburg, M., Ginzburg, B.Z., Hochstein, L.I., and Volcani, B.E. (1990) Haloarcula marismortui (Volcani) sp. nov., nom. rev., an extremely halophilic bacterium from the Dead Sea. Int J Syst Bacteriol 40: 209-210.

Ouellette, M., Makkay, A.M., and Papke, R.T. (2013) Dihydroxyacetone metabolism in Haloferax volcanii. Front Microbiol 4: 376.

Pagaling, E., Haigh, R.D., Grant, W.D., Cowan, D.A., Jones, B.E., Ma, Y. et al. (2007) Sequence analysis of an Archaeal virus isolated from a hypersaline lake in Inner Mongolia, China. BMC Genomics 8: 410.

Pašić, L., and Rodríguez-Valera, F. (2014) Ecology and Evolution of Haloquadratum walsbyi Through the Lens of Genomics and Metagenomics. In Halophiles: Genetics and Genomes. Papke, R.T., and Oren, A. (eds). Norfolk, UK: Caister Academic Press.

Pašić, L., Rodriguez-Mueller, B., Martin-Cuadrado, A.B., Mira, A., Rohwer, F., and Rodríguez-Valera, F. (2009) Metagenomic islands of hyperhalophiles: the case of Salinibacter ruber. BMC Genomics 10: 570.

Paul, J.H. (1999) Microbial gene transfer: an ecological perspective. J Mol Microbiol Biotechnol 1: 45-50.

Peat, H.J., Clarke, A., and Convey, P. (2007) Diversity and biogeography of the Antarctic flora. Journal of Biogeography 34: 132-146.

Pena, A., Teeling, H., Huerta-Cepas, J., Santos, F., Yarza, P., Brito-Echeverria, J. et al. (2010) Fine-scale evolution: genomic, phenotypic and ecological differentiation in two coexisting Salinibacter ruber strains. ISME J 4: 882-895.

Peña, A., Gomariz, M., Lucio, M., González-Torres, P., Huertas-Cepa, J., Martínez-García, M. et al. (2014) Salinibacter ruber: The Never Ending Microdiversity? In Halophiles: Genetics and Genomes. Papke, R.T., and Oren, A. (eds). Norfolk, UK: Caister Academic Press.

Pessi, I.S., Maalouf, P.D.C., Laughinghouse, H.D., Baurain, D., and Wilmotte, A. (2016) On the use of high-throughput sequencing for the study of cyanobacterial diversity in Antarctic aquatic mats. Journal of Phycology 52: 356-368.

Pfeifer, F. (2012) Distribution, formation and regulation of gas vesicles. Nat Rev Microbiol 10: 705-715.

Pfeiffer, F., Schuster, S.C., Broicher, A., Falb, M., Palm, P., Rodewald, K. et al. (2008) Evolution in the laboratory: the genome of Halobacterium salinarum strain R1 compared to that of strain NRC-1. Genomics 91: 335-346.

Pietila, M.K., Laurinmaki, P., Russell, D.A., Ko, C.C., Jacobs-Sera, D., Butcher, S.J. et al. (2013) Insights into head-tailed viruses infecting extremely halophilic archaea. J Virol 87: 3248-3260.

Podell, S., Ugalde, J.A., Narasingarao, P., Banfield, J.F., Heidelberg, K.B., and Allen, E.E. (2013) Assembly-driven community genomics of a hypersaline microbial ecosystem. PLoS One 8: e61692.

Porter, K., Russ, B.E., and Dyall-Smith, M.L. (2007) Virus-host interactions in salt lakes. Curr Opin Microbiol 10: 418-424.

Prangishvili, D., Forterre, P., and Garrett, R.A. (2006) Viruses of the Archaea: a unifying view. Nat Rev Microbiol 4: 837-848.

Pulliainen, A.T., Kauko, A., Haataja, S., Papageorgiou, A.C., and Finne, J. (2005) Dps/Dpr ferritin-like protein: insights into the mechanism of iron incorporation and evidence for a central role in cellular iron homeostasis in Streptococcus suis. Mol Microbiol 57: 1086-1100.

Pyatibratov, M.G., Beznosov, S.N., Rachel, R., Tiktopulo, E.I., Surin, A.K., Syutkin, A.S., and Fedorov, O.V. (2008) Alternative flagellar filament types in the haloarchaeon Haloarcula marismortui. Can J Microbiol 54: 835-844.

Quemin, E.R., and Quax, T.E. (2015) Archaeal viruses at the cell envelope: entry and egress. Front Microbiol 6: 552.

Quinn, J.P., Kulakova, A.N., Cooley, N.A., and McGrath, J.W. (2007) New ways to break an old bond: the bacterial carbon-phosphorus hydrolases and their role in biogeochemical phosphorus cycling. Environ Microbiol 9: 2392-2400.

Ram, R.J., Verberkmoes, N.C., Thelen, M.P., Tyson, G.W., Baker, B.J., Blake, R.C., 2nd et al. (2005) Community proteomics of a natural microbial biofilm. Science 308: 1915-1920.

Rhodes, M.E., Oren, A., and House, C.H. (2012) Dynamics and persistence of Dead Sea microbial populations as shown by high-throughput sequencing of rRNA. Appl Environ Microbiol 78: 2489-2492.

Rhodes, M.E., Fitz-Gibbon, S.T., Oren, A., and House, C.H. (2010) Amino acid signatures of salinity on an environmental scale with a focus on the Dead Sea. Environ Microbiol 12: 2613-2623.

Rinke, C., Schwientek, P., Sczyrba, A., Ivanova, N.N., Anderson, I.J., Cheng, J.F. et al. (2013) Insights into the phylogeny and coding potential of microbial dark matter. Nature 499: 431-437.

Rissman, A.I., Mau, B., Biehl, B.S., Darling, A.E., Glasner, J.D., and Perna, N.T. (2009) Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics 25: 2071-2073.

Robert, X., and Gouet, P. (2014) Deciphering key features in protein structures with the new ENDscript server. Nucleic Acids Res 42: W320-324.

Rodriguez-Brito, B., Li, L., Wegley, L., Furlan, M., Angly, F., Breitbart, M. et al. (2010) Viral and microbial community dynamics in four aquatic environments. ISME J 4: 739-751.

Rodriguez-Valera, F., Martin-Cuadrado, A.B., Rodriguez-Brito, B., Pasic, L., Thingstad, T.F., Rohwer, F., and Mira, A. (2009) Explaining microbial population genomics through phage predation. Nat Rev Microbiol 7: 828-836.

Rossler, N., Klein, R., Scholz, H., and Witte, A. (2004) Inversion within the haloalkaliphilic virus phi Ch1 DNA results in differential expression of structural proteins. Mol Microbiol 52: 413-426.

Rusch, D.B., Halpern, A.L., Sutton, G., Heidelberg, K.B., Williamson, S., Yooseph, S. et al. (2007) The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol 5: e77.

Samson, J.E., Magadan, A.H., Sabri, M., and Moineau, S. (2013) Revenge of the phages: defeating bacterial defences. Nat Rev Microbiol 11: 675-687.

Sangwan, N., Xia, F., and Gilbert, J.A. (2016) Recovering complete and draft population genomes from metagenome datasets. Microbiome 4: 8.

Santos, A.L., Gomes, N.C., Henriques, I., Almeida, A., Correia, A., and Cunha, A. (2012a) Contribution of reactive oxygen species to UV-B-induced damage in bacteria. J Photochem Photobiol B 117: 40-46.

Santos, F., Yarza, P., Parro, V., Briones, C., and Anton, J. (2010) The metavirome of a hypersaline environment. Environ Microbiol 12: 2965-2976.

Santos, F., Meyerdierks, A., Pena, A., Rossello-Mora, R., Amann, R., and Anton, J. (2007) Metagenomic approach to the study of halophages: the environmental halophage 1. Environ Microbiol 9: 1711-1723.

Santos, F., Yarza, P., Parro, V., Meseguer, I., Rossello-Mora, R., and Anton, J. (2012b) Culture-independent approaches for studying viruses from hypersaline environments. Appl Environ Microbiol 78: 1635-1643.

Santos, F., Moreno-Paz, M., Meseguer, I., Lopez, C., Rossello-Mora, R., Parro, V., and Anton, J. (2011) Metatranscriptomic analysis of extremely halophilic viral communities. ISME J 5: 1621-1633.

Sara, M., and Sleytr, U.B. (2000) S-Layer proteins. J Bacteriol 182: 859-868.

Sawstrom, C., Lisle, J., Anesio, A.M., Priscu, J.C., and Laybourn-Parry, J. (2008) Bacteriophage in polar inland waters. Extremophiles 12: 167-175.

Schaechter, M. (2007) Talmudic Question #8. Web log post.

Schleper, C., Jurgens, G., and Jonuscheit, M. (2005) Genomic studies of uncultivated archaea. Nat Rev Microbiol 3: 479-488.

Schlesner, M., Miller, A., Streif, S., Staudinger, W.F., Muller, J., Scheffer, B. et al. (2009) Identification of Archaea-specific chemotaxis proteins which interact with the flagellar apparatus. BMC Microbiol 9: 56.

Schmid, J., Heider, D., Wendel, N.J., Sperl, N., and Sieber, V. (2016) Bacterial Glycosyltransferases: Challenges and Opportunities of a Highly Diverse Enzyme Class Toward Tailoring Natural Products. Front Microbiol 7: 182.

Schneider, T., Keiblinger, K.M., Schmid, E., Sterflinger-Gleixner, K., Ellersdorfer, G., Roschitzki, B. et al. (2012) Who is who in litter decomposition? Metaproteomics reveals major microbial players and their biogeochemical functions. ISME J 6: 1749-1762.

Schwaiger, R., Schwarz, C., Furtwangler, K., Tarasov, V., Wende, A., and Oesterhelt, D. (2010) Transcriptional control by two leucine-responsive regulatory proteins in Halobacterium salinarum R1. BMC Mol Biol 11: 40.

Seifert, J., Herbst, F.A., Halkjaer Nielsen, P., Planes, F.J., Jehmlich, N., Ferrer, M., and von Bergen, M. (2013) Bioinformatic progress and applications in metaproteogenomics for bridging the gap between genomic sequences and metabolic functions in microbial communities. Proteomics 13: 2786-2804.

Sencilo, A., and Roine, E. (2014) A Glimpse of the genomic diversity of haloarchaeal tailed viruses. Front Microbiol 5: 84.

Sharma, K., Gillum, N., Boyd, J.L., and Schmid, A. (2012) The RosR transcription factor is required for gene expression dynamics in response to extreme oxidative stress in a hypersaline-adapted archaeon. BMC Genomics 13: 351.

Soppa, J. (2006) From genomes to function: haloarchaea as model organisms. Microbiology 152: 585-590.

Soppa, J., Baumann, A., Brenneis, M., Dambeck, M., Hering, O., and Lange, C. (2008) Genomics and functional genomics with haloarchaea. Arch Microbiol 190: 197-215.

Sorek, R., Lawrence, C.M., and Wiedenheft, B. (2013) CRISPR-mediated adaptive immune systems in bacteria and archaea. Annu Rev Biochem 82: 237-266.

Sowell, S.M., Wilhelm, L.J., Norbeck, A.D., Lipton, M.S., Nicora, C.D., Barofsky, D.F. et al. (2009) Transport functions dominate the SAR11 metaproteome at low-nutrient extremes in the Sargasso Sea. ISME J 3: 93-105.

Stern, A., and Sorek, R. (2011) The phage-host arms race: shaping the evolution of microbes. Bioessays 33: 43-51.

Stetter, K.O. (2013) A brief history of the discovery of hyperthermophilic life. Biochem Soc Trans 41: 416-420.

Storch, K.F., Rudolph, J., and Oesterhelt, D. (1999) Car: a cytoplasmic sensor responsible for arginine chemotaxis in the archaeon Halobacterium salinarum. EMBO J 18: 1146-1158.

Sullivan, M.B., Waterbury, J.B., and Chisholm, S.W. (2003) Cyanophages infecting the oceanic cyanobacterium Prochlorococcus. Nature 424: 1047-1051.

Suttle, C.A. (2007) Marine viruses--major players in the global ecosystem. Nat Rev Microbiol 5: 801-812.

Szurmant, H., and Ordal, G.W. (2004) Diversity in chemotaxis mechanisms among the bacteria and archaea. Microbiol Mol Biol Rev 68: 301-319.

Takahashi, N., Sato, T., and Yamada, T. (2000) Metabolic pathways for cytotoxic end product formation from glutamate- and aspartate-containing peptides by Porphyromonas gingivalis. J Bacteriol 182: 4704-4710.

Takai, K., Nakamura, K., Toki, T., Tsunogai, U., Miyazaki, M., Miyazaki, J. et al. (2008) Cell proliferation at 122 degrees C and isotopically heavy CH4 production by a hyperthermophilic methanogen under high-pressure cultivation. Proc Natl Acad Sci U S A 105: 10949-10954.

Tamura, K., Stecher, G., Peterson, D., Filipski, A., and Kumar, S. (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30: 2725-2729.

Taton, A., Grubisic, S., Balthasart, P., Hodgson, D.A., Laybourn-Parry, J., and Wilmotte, A. (2006a) Biogeographical distribution and ecological ranges of benthic cyanobacteria in East Antarctic lakes. FEMS Microbiol Ecol 57: 272-289.

Taton, A., Grubisic, S., Ertz, D., Hodgson, D.A., Piccardi, R., Biondi, N. et al. (2006b) Polyphasic study of Antarctic cyanobacterial strains. Journal of Phycology 42: 1257-1270.

Temperton, B., and Giovannoni, S.J. (2012) Metagenomics: microbial diversity through a scratched lens. Curr Opin Microbiol 15: 605-612.

Tettelin, H., Masignani, V., Cieslewicz, M.J., Donati, C., Medini, D., Ward, N.L. et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 102: 13950-13955.

Torsvik, T., and Dundas, I.D. (1974) Bacteriophage of Halobacterium salinarium. Nature 248: 680-681.

Tyson, G.W., and Banfield, J.F. (2008) Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environ Microbiol 10: 200-207.

Tyson, G.W., Chapman, J., Hugenholtz, P., Allen, E.E., Ram, R.J., Richardson, P.M. et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37-43.

Varghese, N.J., Mukherjee, S., Ivanova, N., Konstantinidis, K.T., Mavrommatis, K., Kyrpides, N.C., and Pati, A. (2015) Microbial species delineation using whole genome sequences. Nucleic Acids Res 43: 6761-6771.

Ventosa, A. (2006) Unusual micro-organisms from unusual habitats: hypersaline environments. In Prokaryotic diversity: mechanisms and significance. Logan, N.A., Lappin-Scott, H.M., and Oyston, P.C.F. (eds): Cambridge University Press.

Ventosa, A., de la Haba, R.R., Sanchez-Porro, C., and Papke, R.T. (2015) Microbial diversity of hypersaline environments: a metagenomic approach. Curr Opin Microbiol 25: 80-87.

Verleyen, E., Sabbe, K., Hodgson, D.A., Grubisic, S., Taton, A., Cousin, S. et al. (2010) Structuring effects of climate-related environmental factors on Antarctic microbial mat communities. Aquatic Microbial Ecology 59: 11-24.

Vick-Majors, T.J., Priscu, J.C., and Amaral-Zettler, L.A. (2014) Modular community structure suggests metabolic plasticity during the transition to polar night in ice-covered Antarctic lakes. ISME J 8: 778-789.

Vizcaino, J.A., Deutsch, E.W., Wang, R., Csordas, A., Reisinger, F., Rios, D. et al. (2014) ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol 32: 223-226.

Walsby, A.E. (1980) A square bacterium. Nature 283: 69-71.

Weinbauer, M.G., and Rassoulzadegan, F. (2004) Are viruses driving microbial diversification and diversity? Environ Microbiol 6: 1-11.

Wiedenheft, B., Mosolf, J., Willits, D., Yeager, M., Dryden, K.A., Young, M., and Douglas, T. (2005) An archaeal antioxidant: characterization of a Dps-like protein from Sulfolobus solfataricus. Proc Natl Acad Sci U S A 102: 10551-10556.

Wilhelm, S.W., and Suttle, C.A. (1999) Viruses and Nutrient Cycles in the Sea: Viruses play critical roles in the structure and function of aquatic food webs. BioScience 49: 781-788.

Wilkins, D., Yau, S., Williams, T.J., Allen, M.A., Brown, M.V., DeMaere, M.Z. et al. (2013) Key microbial drivers in Antarctic aquatic environments. FEMS Microbiol Rev 37: 303-335.

Williams, G.J., Goff, R.D., Zhang, C., and Thorson, J.S. (2008) Optimizing glycosyltransferase specificity via "hot spot" saturation mutagenesis presents a catalyst for novobiocin glycorandomization. Chem Biol 15: 393-401.

Williams, T.J., and Cavicchioli, R. (2014) Marine metaproteomics: deciphering the microbial metabolic food web. Trends Microbiol 22: 248-260.

Williams, T.J., Wilkins, D., Long, E., Evans, F., DeMaere, M.Z., Raftery, M.J., and Cavicchioli, R. (2013) The role of planktonic Flavobacteria in processing algal organic matter in coastal East Antarctica revealed using metagenomics and metaproteomics. Environ Microbiol 15: 1302-1317.

Williams, T.J., Allen, M.A., DeMaere, M.Z., Kyrpides, N.C., Tringe, S.G., Woyke, T., and Cavicchioli, R. (2014) Microbial ecology of an Antarctic hypersaline lake: genomic assessment of ecophysiology among dominant haloarchaea. ISME J 8: 1645-1658.

Williams, T.J., Long, E., Evans, F., DeMaere, M.Z., Lauro, F.M., Raftery, M.J. et al. (2012) A metaproteomic assessment of winter and summer bacterioplankton from Antarctic Peninsula coastal surface waters. ISME J 6: 1883-1900.

Woese, C.R. (2007) The Archaea: an Invitation to Evolution. In Archaea: Molecular and Cellular Biology. Cavicchioli, R. (ed). Washington DC: American Society for Microbiology Press.

Woese, C.R., Kandler, O., and Wheelis, M.L. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87: 4576-4579.

Wood, Z.A., Schroder, E., Robin Harris, J., and Poole, L.B. (2003) Structure, mechanism and regulation of peroxiredoxins. Trends Biochem Sci 28: 32-40.

Woods, W.G., and Dyall-Smith, M.L. (1997) Construction and analysis of a recombination-deficient (radA) mutant of Haloferax volcanii. Mol Microbiol 23: 791-797.

Wright, S.W., and Burton, H.R. (1981) The biology of Antarctic saline lakes. Hydrobiologia 82: 319-338.

Yamaguchi, Y., Park, J.H., and Inouye, M. (2011) Toxin-antitoxin systems in bacteria and archaea. Annu Rev Genet 45: 61-79.

Yau, S. (2013) Molecular microbial ecology of Antarctic lakes. Sydney, Australia: University of New South Wales.

Yau, S., Lauro, F.M., Williams, T.J., DeMaere, M.Z., Brown, M.V., Rich, J. et al. (2013) Metagenomic insights into strategies of carbon conservation and unusual sulfur biogeochemistry in a hypersaline Antarctic lake. ISME J 7: 1944-1961.

Yau, S., Lauro, F.M., DeMaere, M.Z., Brown, M.V., Thomas, T., Raftery, M.J. et al. (2011) Virophage control of antarctic algal host-virus dynamics. Proc Natl Acad Sci U S A 108: 6163-6168.

You, X.Y., Liu, C., Wang, S.Y., Jiang, C.Y., Shah, S.A., Prangishvili, D. et al. (2011) Genomic analysis of Acidianus hospitalis W1 a host for studying crenarchaeal virus and plasmid life cycles. Extremophiles 15: 487-497.

Zeth, K. (2012) Dps biomineralizing proteins: multifunctional architects of nature. Biochem J 445: 297-311.

Zwartz, D., Bird, M., Stone, J., and Lambeck, K. (1998) Holocene sea-level change and ice-sheet history in the Vestfold Hills, East Antarctica. Earth and Planetary Science Letters 155: 131-145.

Appendix A

Proteins detected in the Deep Lake metaproteome

Table A1 Complete list of proteins identified in the Deep Lake metaproteome. Column labelled ‘ID’ gives the protein identification number according to the spectrum count (last column). Columns labelled ‘% ID’, ‘Locus tag’ and ‘Organism’ show the result of a BLAST search and list the best match for each identified proteins. % ID gives the amino acid sequence identity of the detected protein to the best BLAST match. The column labelled ‘Sum of spectra’ gives the sum of the normalized total spectrum count for each protein across all 15 samples. Proteins are sorted according to spectrum count. ‘ud’ denotes undefined. Entries with grey background denote protein families. sum of ID annotation % ID locus tag organism spectra 1 archaellin FlaA or FlaB 100% halTADL_1544 tADL 695.1 cell surface glycoprotein (Sec signal, PGF-CTERM, 2 47% Halar_0829 DL31 628.7 C-terminal transmembrane helix) 3 archaellin FlaA or FlaB 100% halTADL_1812 tADL 496.7 group II chaperonin 4 100% halTADL_3279 tADL 478.1 (thermosome) Halovirus 5 major capsid protein 64% [120] 433.1 HCTV-1 group II chaperonin 6 95% halTADL_3279 tADL 386.6 (thermosome) 7 archaellin FlaA or FlaB 100% halTADL_1813 tADL 351.0 cell surface glycoprotein (Sec signal, PGF-CTERM, 8 51% halTADL_1043 tADL 337.6 C-terminal transmembrane helix) oligopeptide/dipeptide ABC 9 transporter solute-binding 100% Halar_2016 DL31 277.4 protein group II chaperonin 10 100% Halar_3034 DL31 269.3 (thermosome) group II chaperonin 11 100% halTADL_1928 tADL 250.0 (thermosome) 12 archaellin FlaA or FlaB 76% halTADL_1813 tADL 239.1 phosphate ABC transporter 13 99% halTADL_2155 tADL 234.5 solute-binding protein (PstS) group II chaperonin Hrr. 
14 100% Hlac_2662 225.7 (thermosome) lacusprofundi 15 ferritin Dps family protein 100% halTADL_1068 tADL 220.3 nucleoside phosphorylase Haloferax 16 30% ud 219.7 domain larsenii glycerol kinase (GlpK) (EC 17 100% halTADL_2249 tADL 205.7 2.7.1.30) cell surface glycoprotein (Sec signal, PGF-CTERM, 18 44% halTADL_1043 tADL 204.7 C-terminal transmembrane helix) group II chaperonin 19 95% halTADL_1928 tADL 200.3 (thermosome)

232

20 ferritin Dps family protein 92% halTADL_1068 tADL 170.1 21 archaellin FlaA or FlaB 100% halTADL_1810 tADL 169.3 group II chaperonin 22 100% Halar_3265 DL31 166.3 (thermosome) translation elongation factor 23 100% halTADL_0652 tADL 165.7 EF-1, alpha subunit (Tuf) Halovirus 24 major capsid protein 68% [119] 147.4 HVTV-1 enolase (phosphopyruvate 25 hydratase) (Eno) (EC 100% halTADL_2780 tADL 142.4 4.2.1.11) hypothetical protein (Sec Candidatus 26 43% ud 142.1 signal) Nanosalina sp. hypothetical protein (TAT Hrr. 27 49% Hlac_0476 137.6 signal) lacusprofundi glycerol kinase (GlpK) (EC 28 100% halTADL_0681 tADL 134.7 2.7.1.30) RND superfamily family / MMPL (mycobacterial 29 99% Halar_1791 DL31 133.5 membrane protein large) family protein Hrr. 30 archaellin FlaA or FlaB 38% Hlac_2557 132.8 lacusprofundi glycine-rich hypothetical protein (PGF-pre-PGF Halorubrum 31 39% ud 132.0 domain, C-terminal aidingense transmembrane helix) hypothetical protein 32 (DUF827 / WEB family 100% halTADL_0555 tADL 130.0 domain) iron ABC transporter solute- 33 100% Halar_0820 DL31 123.9 binding protein isocitrate dehydrogenase 34 100% halTADL_0758 tADL 123.6 (Icd) (EC 1.1.1.42) 35 ferredoxin 100% halTADL_2137 tADL 118.7 cell surface glycoprotein (Sec signal, PGF-CTERM, 36 45% Halar_0829 DL31 118.4 C-terminal transmembrane helix) glucose-1-phosphate 37 thymidylyltransferase (RfbA, 100% halTADL_3353 tADL 117.9 RffH) (EC 2.7.7.24) Halovirus 38 hypothetical protein 43% [121] 116.4 HCTV-5 Alcanivorax 39 porin 51% ud 113.8 pacificus 40 ribosomal protein L7Ae 100% halTADL_0166 tADL 110.8 prefoldin, beta subunit 41 100% Halar_3076 DL31 110.0 (PfdB) 42 ferredoxin 96% halTADL_2137 tADL 109.2 proteasome alpha subunit 43 100% halTADL_2681 tADL 108.4 (PsmA) (EC 3.4.25.1) group II chaperonin 44 100% halTADL_0092 tADL 104.8 (thermosome)

233

UspA domain-containing 45 100% halTADL_1904 tADL 104.8 protein hypothetical protein (Sec Marinobacter 46 53% ud 104.6 signal; DUF3359) nanhaiticus transcriptional regulator, 47 88% halTADL_1491 tADL 104.1 AsnC family cell surface glycoprotein (Sec signal, PGF-CTERM, Halalkaicoccu 48 28% HacjB3_14595 103.0 C-terminal transmembrane s jeotgali helix) manganese/iron superoxide 49 dismutase (Sod) (EC 100% halTADL_2687 tADL 102.2 1.15.1.1) proteasome alpha subunit 50 95% halTADL_2681 tADL 102.2 (PsmA) (EC 3.4.25.1) peptidylprolyl isomerase 51 100% halTADL_0251 tADL 101.6 FKBP-type cell surface glycoprotein (Sec signal, PGF-CTERM, Hrr. 52 100% Hlac_0412 100.9 C-terminal transmembrane lacusprofundi helix) translation elongation factor 53 100% Halar_0925 DL31 99.7 EF-1, alpha subunit (Tuf) Halorubrum 54 archaellin FlaA or FlaB 65% ud 98.2 californiensis nascent polypeptide- 55 associated complex protein 100% halTADL_3243 tADL 97.9 (Nac) TRAP transporter solute Hrr. 56 100% Hlac_2586 96.8 receptor, TAXI family lacusprofundi nucleic acid binding OB-fold 57 100% halTADL_3218 tADL 96.0 tRNA/helicase-type ATP synthase, alpha subunit Dunaliella 58 (AtpA) (EC 3.6.3.14), 95% ud 95.9 tertiolecta chloroplastic 59 archaellin FlaA or FlaB 77% halTADL_1812 tADL 93.9 60 Hsp20-type chaperone 100% halTADL_0724 tADL 93.1 phosphate ABC transporter 61 88% halTADL_2155 tADL 92.1 solute-binding protein (PstS) prefoldin, alpha subunit Hrr. 62 100% Hlac_0567 91.1 (PfdA) lacusprofundi transcriptional regulator, 63 100% halTADL_3422 tADL 90.6 AsnC family transcriptional regulator, 64 100% halTADL_1491 tADL 90.5 AsnC family alpha-amylase (glycosyl 65 hydrolase, family 13) (EC 100% halTADL_0142 tADL 90.3 3.2.1.1) 66 SufBD protein 100% halTADL_0974 tADL 90.2 DNA polymerase sliding 67 clamp subunit (PCNA 100% halTADL_1713 tADL 89.5 homolog)

234

hypothetical protein (2 x PKD/chitinase domains + carboxypeptidase-like Natronococcus 68 25% C491_10439 88.8 regulatory domain; C- amylolyticus terminal transmembrane helix) prefoldin, beta subunit 69 100% halTADL_1114 tADL 88.8 (PfdB) translation elongation factor 70 100% halTADL_0647 tADL 86.6 aEF-2 (FusA) oligopeptide/dipeptide ABC Hrr. 71 transporter solute-binding 100% Hlac_0069 86.5 lacusprofundi protein ATP synthase, alpha subunit 72 100% halTADL_1944 tADL 85.7 (AtpA) (EC 3.6.3.14) group II chaperonin Hrr. 73 100% Hlac_0416 85.3 (thermosome) lacusprofundi hypothetical protein (TAT Natronobacteri 74 59% Natgr_3403 83.2 signal) um gregoryi phosphate ABC transporter 75 93% halTADL_2155 tADL 82.2 solute-binding protein (PstS) oligopeptide/dipeptide ABC 76 transporter solute-binding 100% Halar_1439 DL31 81.8 protein peptidylprolyl isomerase 77 89% halTADL_0251 tADL 81.1 FKBP-type 78 ribosomal protein L1 99% halTADL_0105 tADL 80.9 prefoldin, alpha subunit 79 100% Halar_3048 DL31 80.9 (PfdA) 80 archaellin FlaA or FlaB 100% halTADL_1811 tADL 79.2 81 ribosomal protein S3Ae 100% halTADL_3142 tADL 79.0 hypothetical protein (Sec Halogranum 82 33% HSB1_10380 77.9 signal) salarium B-1 83 rhodanese-like protein 100% halTADL_2750 tADL 73.9 84 chaperone protein DnaK 100% halTADL_0595 tADL 71.2 phosphate ABC transporter 85 51% halTADL_1182 tADL 70.9 solute-binding protein (PstS) succinyl-CoA synthetase, 86 beta subunit (SucC) (EC 100% halTADL_0568 tADL 70.0 6.2.1.5) SPFH domain membrane 87 protease (N-terminal 100% halTADL_3017 tADL 69.8 transmembrane helix) glutamine synthetase (GS) 88 100% halTADL_3423 tADL 66.7 (GlnA) (EC 6.3.1.2) translation elongation factor 89 95% halTADL_0647 tADL 66.2 aEF-2 (FusA) Halococcus 90 major capsid protein 46% ud 65.8 hamelinensis hypothetical protein (viral 91 32% Hlac_0760 H.lac 65.1 major capsid protein?) Halonotius sp. 92 tubulin/FtsZ, GTPase domain 68% ud 63.1 J07HN4 93 ribosomal protein L7Ae 98% halTADL_0166 tADL 62.9

235

nucleoside-diphosphate 94 100% halTADL_0169 tADL 61.9 kinase (Ndk) (EC 2.7.4.6) 95 archaellin FlaA or FlaB 75% halTADL_0078 tADL 58.6 96 ribosomal protein L18e 100% halTADL_2775 tADL 58.0 aconitate hydratase (AcnA) 97 100% halTADL_2902 tADL 58.0 (EC 4.2.1.3) 98 hypothetical protein 100% halTADL_2576 tADL 57.9 99 ribosomal protein S4e 100% halTADL_3375 tADL 57.8 succinyl-CoA synthetase, 100 alpha subunit (SucD) (EC 100% halTADL_0569 tADL 57.5 6.2.1.5) VCP-like protein (2 x 101 CDC48 domains + 2 x AAA 100% halTADL_1701 tADL 56.7 family ATPase domains) response regulator receiver 102 100% halTADL_0055 tADL 56.7 domain + HalX domain hypothetical protein (Sec Marinobacter 103 signal; LTXXQ motif family 64% ud 56.4 algicola protein) Environmental 104 major capsid protein 42% eHP12_00035 Halophage 55.5 eHP-12 105 ribosomal protein S2 100% halTADL_2781 tADL 55.4 cell surface glycoprotein (Sec signal, PGF-CTERM, 106 100% Halar_0829 DL31 55.3 C-terminal transmembrane helix) transcriptional regulator, 107 100% halTADL_1645 tADL 55.3 RosR (PadR family) proteasome beta subunit 108 100% halTADL_2911 tADL 55.2 (PsmB) [EC:3.4.25.1] 109 dodecin 100% Halar_2184 DL31 54.8 glyceraldehyde-3-phosphate dehydrogenase (NAD(P)+- 110 100% halTADL_0817 tADL 54.4 dependent), type I (Gap) (EC 1.2.1.12) hypothetical protein (Sec signal, Ig fold domain, C- 111 100% halTADL_1042 tADL 54.3 terminal transmembrane helix) hypothetical protein (Sec Haloterrigena 112 signal, C-terminal 44% C478_17416 54.3 thermotolerans transmembrane helix) 113 ribosomal protein S6e 100% halTADL_2119 tADL 53.8 Marinobacter 114 TonB-dependent receptor 51% ud hydrocarbonoc 53.5 lasticus cell surface glycoprotein (Sec signal, PGF-CTERM, 115 100% HalDL1_0395 DL1 53.0 C-terminal transmembrane helix) ThiJ/PfpI domain-containing 116 100% halTADL_1769 tADL 52.6 protein

236

nucleic acid-binding/OB- 117 100% halTADL_0109 tADL 52.1 fold/TRAM domain 118 archaellin FlaA or FlaB 100% halTADL_0078 tADL 51.7 119 ribosomal protein S13P 100% halTADL_2771 tADL 51.6 Halovivax 120 adhesion pilin (PilA) 47% ud 51.4 ruber 121 ribosomal protein S3 100% halTADL_3381 tADL 50.9 manganese/iron superoxide 122 dismutase (Sod) (EC 100% Halar_1640 DL31 50.5 1.15.1.1) 123 ribosomal protein L23 100% halTADL_3385 tADL 50.4 ATP synthase, beta subunit Dunaliella 124 (AtpB) (EC 3.6.3.14), 96% ud tertiolecta/sali 50.3 chloroplastic na ATP synthase, beta subunit 125 100% halTADL_1945 tADL 49.3 (AtpB) (EC 3.6.3.14) hypothetical protein (Sec signal, PGF-CTERM, C- 126 29% halTADL_1403 tADL 48.8 terminal transmembrane helix) Halovirus 127 major capsid protein 51% [33] 48.4 HCTV-2 winged helix-turn-helix 128 99% halTADL_0044 tADL 48.3 DNA-binding domain acidic ribosomal protein P0- 129 100% halTADL_0106 tADL 47.6 like protein translation elongation factor ambig Hrr. 130 Hlac_0156 47.5 EF-1, alpha subunit (Tuf) uous lacusprofundi D-3-phosphoglycerate 131 dehydrogenase (Ser A) (EC 100% halTADL_2045 tADL 47.4 1.1.1.95) aspartyl-tRNA synthetase 132 100% halTADL_1088 tADL 47.2 (AspS) (EC 6.1.1.12) 133 ribosomal protein S5 100% halTADL_3367 tADL 46.8 DNA-directed RNA 134 polymerase subunit A 100% halTADL_0619 tADL 46.0 (RpoA1) (EC 2.7.7.6) group II chaperonin 135 91% halTADL_0092 tADL 45.4 (thermosome) Halopiger 136 hypothetical protein 66% Halxa_0033 45.3 xanaduensis 137 FeS assembly ATPase SufC 100% halTADL_0972 tADL 45.3 Tripartite Tricarboxylate 138 transporter (TTT), solute 100% halTADL_0690 tADL 45.2 receptor 139 ribosomal protein L4P 100% halTADL_3386 tADL 45.0 dihydroxyacetone (DHA) 140 kinase, K subunit (DhaK) 94% halTADL_2260 tADL 45.0 (EC 2.7.1.29) hypothetical protein (TAT Natronobacteri 141 62% Natgr_3403 44.5 signal) um gregoryi ribosomal protein Hrr. 142 100% Hlac_1842 43.9 L7Ae/L30e/S12e/Gadd45 lacusprofundi

237

methyl-accepting chemotaxis Halorubrum 143 60% ud 43.7 sensory transducer californiensis TRAP transporter solute Hrr. 144 100% Hlac_2329 43.5 receptor, TAXI family lacusprofundi Halogeometric hypothetical protein (Sec 145 35% Hbor_31340 um 43.3 signal) borinquense 146 hypothetical protein 100% halTADL_0395 tADL 43.1 Halostagnicola 147 hypothetical protein 66% EL22_00080 43.0 sp. prefoldin, alpha subunit 148 95% halTADL_2197 tADL 42.8 (PfdA) 149 hypothetical protein 100% halTADL_2560 tADL 42.4 nucleic acid binding OB-fold 150 86% halTADL_3218 tADL 42.3 tRNA/helicase-type 151 Hsp20-type chaperone 100% halTADL_0114 tADL 42.2 peptidylprolyl isomerase, 152 91% halTADL_2273 tADL 42.1 cyclophilin type protein translation factor ambig 153 halTADL_1141 tADL 42.0 SUI1 homolog (Sui1) uous ATP synthase, K subunit 154 100% halTADL_1940 tADL 41.9 (AtpK) (EC 3.6.3.14) 155 ribosomal protein L32e 100% halTADL_3370 tADL 41.7 cell surface glycoprotein (Sec signal, PGF-CTERM, 156 100% halTADL_1043 tADL 41.7 C-terminal transmembrane helix) transferase 1 / rSAM / 157 100% halTADL_3462 tADL 41.3 selenodomain-associated branched-chain amino acid 158 ABC transporter solute- 100% Halar_1569 DL31 41.1 binding protein iron ABC transporter solute- 159 100% halTADL_1788 tADL 40.9 binding protein branched-chain amino acid ABC transporter solute- Hrr. 160 binding protein; appears to 100% Hlac_2093 40.5 lacusprofundi be interrupted by a transposase PKD/chitinase + APHP + terpenoid cyclases/protein Haloferax 161 prenyltransferase alpha-alpha 32% ud 40.3 gibbonsii toroid domains; homologs have Sec signal pyruvate:ferredoxin 162 oxidoreductase, alpha 100% halTADL_0382 tADL 39.9 subunit (PorA) (EC 1.2.7.1) hypothetical protein (Sec signal, Ig fold domain, C- 163 73% halTADL_1042 tADL 39.3 terminal transmembrane helix) manganese/iron superoxide 164 dismutase (Sod) (EC 95% halTADL_2687 tADL 39.2 1.15.1.1)

238

carboxypeptidase-like, Halonotius sp. 165 regulatory domain; central 41% ud 39.2 J07HN4 transmembrane helix Halovirus 166 hypothetical protein 45% [114] 39.1 HCTV-1 Halovirus 167 major capsid protein 45% [21] 39.0 HHTV-1 peptidylprolyl isomerase, 168 100% halTADL_3026 tADL 39.0 FKBP-type group II chaperonin Natrinema 169 97% C489_13810 38.9 (thermosome) versiforme hypothetical protein (TAT 170 100% halTADL_1761 tADL 38.8 signal) fructose-1,6-bisphosphate 171 aldolase, class I (FbaB) (EC 100% halTADL_0575 tADL 38.5 4.1.2.13) transcriptional regulator, 172 96% halTADL_3422 tADL 38.3 AsnC family aspartate aminotransferase 173 100% halTADL_3081 tADL 37.7 (AspB) (EC 2.6.1.1) Archaeoglobus 174 major capsid protein 43% Asulf_01513 37.6 sulfaticallidus triosephosphate isomerase 175 100% halTADL_2532 tADL 37.4 (TpiA) (EC 5.3.1.1) 176 adhesion pilin (PilA) 100% halTADL_1387 tADL 37.2 prefoldin, beta subunit 177 95% halTADL_1114 tADL 37.2 (PfdB) 178 ribosomal protein L3 100% halTADL_3387 tADL 37.0 citrate synthase (GltA) (EC 179 100% halTADL_0824 tADL 36.8 2.3.3.1) 180 archaellin FlaA or FlaB 100% HalDL1_1517 DL1 36.4 phosphoglucomutase/phosph 181 100% halTADL_1712 tADL 36.3 omannomutase winged helix-turn-helix 182 transcription repressor, HrcA 100% halTADL_0462 tADL 36.0 DNA-binding domain hypothetical protein 183 (DUF827 / WEB family 77% halTADL_0555 tADL 36.0 domain) 184 ribosomal protein L29 100% halTADL_3380 tADL 35.9 185 ferritin Dps family protein 100% Halar_0843 DL31 35.9 186 cell division protein FtsA 100% halTADL_0130 tADL 35.7 branched-chain amino acid 187 ABC transporter solute- 100% Halar_2890 DL31 35.5 binding protein histone-like transcription 188 100% halTADL_1071 tADL 35.5 factor (CBF/NF-Y) domain oligopeptide/dipeptide ABC 189 transporter solute-binding 100% Halar_0722 DL31 35.4 protein ATP synthase, E subunit 190 100% halTADL_1941 tADL 35.4 (AtpE) (EC 3.6.3.14) glycerol kinase (GlpK) (EC 191 98% halTADL_0681 tADL 35.2 2.7.1.30)

239

branched-chain amino acid 192 ABC transporter solute- 100% Halar_3433 DL31 35.2 binding protein alpha-amylase (glycosyl 193 hydrolase, family 13) (EC 84% halTADL_0142 tADL 35.1 3.2.1.1) 194 ribosomal protein L29 92% halTADL_3380 tADL 34.4 glycerol-3-phosphate 195 dehydrogenase (GlpA) (EC 100% halTADL_2244 tADL 33.9 1.1.5.3) protein-disulfide isomerase / 196 100% halTADL_1178 tADL 33.4 thioredoxin hypothetical protein (Sec 197 100% halTADL_1882 tADL 33.4 signal) Haladaptatus methyl-accepting chemotaxis 198 42% ud paucihalophilu 33.3 sensory transducer s ThiJ/PfpI domain-containing 199 83% halTADL_1769 tADL 33.0 protein hypothetical protein Hrr. Hlac_3030/Hal 200 (DUF2078; 2 x 100% lacusprofundi/ 32.9 DL1_3101 transmembrane helices) DL1 peptidylprolyl isomerase, 201 100% halTADL_2273 tADL 32.9 cyclophilin type branched-chain amino acid 202 ABC transporter solute- 88% halTADL_2916 tADL 32.9 binding protein hypothetical protein Hlac_3031/Hal DL1/Hrr. 203 100% 32.7 (DUF302) DL1_3100 lacusprofundi thiamine-phosphate 204 pyrophosphorylase (ThiE) 100% halTADL_0473 tADL 32.6 (EC 2.5.1.3) 205 ribosomal protein L22 100% halTADL_3382 tADL 32.0 nucleic acid binding OB-fold 206 tRNA/helicase-type (RPA41 100% halTADL_3433 tADL 32.0 homolog) 207 ribosomal protein L11 100% halTADL_0103 tADL 31.8 winged helix-turn-helix 208 100% HalDL1_1865 DL1 31.8 DNA-binding 209 ribosomal protein S4 100% halTADL_2772 tADL 31.7 ATP synthase, C subunit 210 100% halTADL_1942 tADL 31.6 (AtpC) (EC 3.6.3.14) hypothetical protein 211 100% halTADL_2791 tADL 31.5 (DRTGG domain) 212 chaperone protein DnaK 94% halTADL_0595 tADL 31.5 SecD/SecF/SecDF export 213 100% halTADL_0787 tADL 31.4 membrane protein CRISPR-associated protein 214 100% halTADL_1360 tADL 30.9 Cas7 (= Csh2) (subtype I-B) hypothetical protein 215 87% halTADL_2062 tADL 30.6 (DUF4382) 216 100% halTADL_0374 tADL 30.5 (GnaD) (EC 4.2.1.39) 217 heat shock protein Hsp20 95% halTADL_0724 tADL 30.5

240

proteasome alpha subunit 218 100% Halar_1652 DL31 30.4 (PsmA) (EC 3.4.25.1) 219 hypothetical protein 100% Halar_1724 DL31 30.3 ammonium permease 220 (ammonium transporter) 100% halTADL_1826 tADL 30.1 (Amt) 221 hypothetical protein 100% halTADL_0752 tADL 29.9 transcriptional regulator, 222 80% halTADL_1491 tADL 29.7 AsnC family cell surface glycoprotein (Sec signal, PGF-CTERM, Hrr. 223 38% Hlac_2976 29.5 C-terminal transmembrane lacusprofundi helix) hypothetical protein 224 100% halTADL_2062 tADL 29.4 (DUF4382) fructose 1,6-bisphosphate 225 aldolase (multifunctional) 100% halTADL_3234 tADL 29.4 (EC 4.1.2.13) ATP synthase, beta subunit Dunaliella 226 (AtpB) (EC 3.6.3.14), 95% ud 29.4 salina chloroplastic group II chaperonin Natrinema 227 96% C489_07195 29.1 (thermosome) versiforme ambig 228 ribosomal protein S11P halTADL_2773 tADL 29.0 uous response regulator receiver 229 96% halTADL_0055 tADL 29.0 domain + HalX domain acidic ribosomal protein P0- 230 100% Halar_1845 DL31 28.9 like protein 231 protein tyrosine phosphatase 100% halTADL_0311 tADL 28.8 232 archaellin FlaA or FlaB 41% HalDL1_1563 DL1 28.4 oxidoreductase FAD-binding 233 100% halTADL_1014 tADL 28.2 domain hypothetical protein with Haloterrigena 234 carboxypeptidase regulatory- 48% ud 27.9 thermotolerans like domain (Sec signal) globin domain + methyl- accepting chemotaxis 235 100% halTADL_0074 tADL 27.6 sensory transducer (aerotaxis?) D-3-phosphoglycerate 236 dehydrogenase (Ser A) (EC 93% halTADL_2045 tADL 27.6 1.1.1.95) isocitrate dehydrogenase 237 100% Halar_2872 DL31 27.6 (Icd) (EC 1.1.1.42) 238 dodecin 95% halTADL_3198 tADL 27.5 Hrr. 239 archaellin FlaA or FlaB 77% Hlac_2557 27.5 lacusprofundi prefoldin, alpha subunit 240 100% halTADL_2197 tADL 27.3 (PfdA) 241 PRC-barrel domain 100% halTADL_2525 tADL 27.3 Oleispira 242 porin 34% ud 27.0 antarctica 243 ribosomal protein L30P 100% halTADL_3366 tADL 27.0

244 | DNA polymerase sliding clamp subunit (PCNA homolog) | 97% | halTADL_1713 | tADL | 26.9
245 | methyl-accepting chemotaxis sensory transducer (HtrII) (for sensory rhodopsin II) | 100% | halTADL_3325 | tADL | 26.4
246 | ribosomal protein L1 | 99% | halTADL_0105 | tADL | 26.4
247 | carbohydrate ABC transporter solute-binding protein | 84% | halTADL_2761 | tADL | 26.3
248 | hypothetical protein (Sec signal) | 41% | C446_10320 | Halobiforma nitratireducens JCM 10879 | 26.2
249 | triosephosphate isomerase (TpiA) (EC 5.3.1.1) | 89% | halTADL_2532 | tADL | 26.2
250 | peptidylprolyl isomerase, FKBP-type | 100% | Hlac_3380 | Hrr. lacusprofundi | 26.0
251 | ketol-acid reductoisomerase (IlvC) (EC 1.1.1.86) | 100% | halTADL_0362 | tADL | 25.9
252 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 33% | ud | Haloterrigena thermotolerans | 25.8
253 | hypothetical protein (DUF1302) | 59% | ud | Marinobacter nanhaiticus | 25.8
254 | TATA-box-binding protein (Tbp) | 100% | halTADL_0042 | tADL | 25.8
255 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Halar_1146 | DL31 | 25.7
256 | hypothetical protein (TAT signal) | 48% | ud | Methanobacterium sp. | 25.5
257 | ThiJ/PfpI domain-containing protein | 100% | Halar_1043 | DL31 | 25.5
258 | rhodanese-like protein | 92% | halTADL_2750 | tADL | 25.4
259 | hypothetical protein | 100% | halTADL_3133 | tADL | 25.3
260 | ribosomal protein L21e | 100% | halTADL_0185 | tADL | 25.3
261 | ribosomal protein S24e | 100% | halTADL_3266 | tADL | 24.9
262 | hypothetical protein (Sec signal) | 100% | Hlac_1040 | Hrr. lacusprofundi | 24.9
263 | hypothetical protein (DUF4013; 4 x transmembrane domains) | 100% | halTADL_3238 | tADL | 24.7
264 | hypothetical protein (DUF4112) | 100% | Halar_1493 | DL31 | 24.5
265 | carbohydrate ABC transporter solute-binding protein | 100% | halTADL_2357 | tADL | 24.5
266 | dihydroxyacetone (DHA) kinase, L subunit (DhaL) (EC 2.7.1.29) | 100% | halTADL_2259 | tADL | 24.5
267 | nucleic acid-binding/OB-fold/TRAM domain | 93% | halTADL_0109 | tADL | 24.3
268 | ribosomal protein L32e | 91% | halTADL_3370 | tADL | 24.2

269 | carbohydrate ABC transporter solute-binding protein | 100% | halTADL_2761 | tADL | 24.1
270 | hypothetical protein (PGF-pre-PGF domain, C-terminal transmembrane helix) | 31% | C453_16833 | Haloferax elongans | 24.1
271 | ribosomal protein L29 | 99% | Halar_2474 | DL31 | 24.1
272 | phosphoglycerate kinase (Pgk) (EC 2.7.2.3) | 100% | halTADL_0816 | tADL | 24.1
273 | methylated-DNA-[protein]-cysteine S-methyltransferase (Ogt) (EC 2.1.1.63) | 99% | halTADL_0579 | tADL | 24.1
274 | hypothetical protein (TAT signal) | 100% | halTADL_0878 | tADL | 23.6
275 | pyruvate:ferredoxin oxidoreductase, beta subunit (PorB) (EC 1.2.7.1) | ambiguous | halTADL_0383 | tADL | 23.6
276 | ribulose bisphosphate carboxylase, large chain (RbcL), chloroplastic | 98% | ud | Dunaliella tertiolecta/salina | 23.5
277 | thiazole biosynthetic enzyme Thi1 | 100% | Hlac_2980/halTADL_1093 | Hrr. lacusprofundi/tADL | 23.3
278 | KaiC domain | 100% | halTADL_1815 | tADL | 23.0
279 | ribosomal protein L13 | 100% | Hlac_1821 | Hrr. lacusprofundi | 22.5
280 | DNA-directed RNA polymerase subunit H (RpoH) (EC 2.7.7.6) | 100% | halTADL_0616 | tADL | 22.4
281 | ribosomal protein L23 | 100% | Halar_2479 | DL31 | 22.0
282 | signal peptide peptidase SppA | 100% | halTADL_2673 | tADL | 22.0
283 | ribosomal protein S3Ae | 100% | Halar_1046 | DL31 | 21.9
284 | DNA repair and recombination protein RadA | 90% | halTADL_2135 | tADL | 21.7
285 | ribosomal protein L1 | 100% | Halar_1846 | DL31 | 21.5
286 | ribosomal protein S9P | ambiguous | halTADL_2777 | tADL | 21.3
287 | K+ uptake system, TrkA subunit | 100% | halTADL_3258 | tADL | 21.2
288 | ribosomal protein L19e | 100% | halTADL_3369 | tADL | 21.2
289 | non-histone chromosomal MC1 family protein | 100% | halTADL_2183 | tADL | 21.0
290 | anthranilate phosphoribosyltransferase (TrpD) (EC 2.4.2.18) | 100% | halTADL_3066 | tADL | 20.9
291 | membrane metalloprotease (peptidase M50) | 100% | halTADL_0323 | tADL | 20.7
292 | signal transduction protein with CBS domains | 100% | halTADL_1865 | tADL | 20.7
293 | ATP synthase, K subunit (AtpK) (EC 3.6.3.14) | 80% | halTADL_1940 | tADL | 20.6
294 | FeS assembly protein SufB | ambiguous | halTADL_0973 | tADL | 20.4

295 | DNA repair and recombination protein RadA | 100% | halTADL_2135 | tADL | 20.1
296 | D-isomer specific 2-hydroxyacid dehydrogenase NAD-binding | 100% | halTADL_1212 | tADL | 20.1
297 | nucleoside-diphosphate kinase (Ndk) (EC 2.7.4.6) | 100% | Hlac_1845 | Hrr. lacusprofundi | 19.9
298 | hypothetical protein (DUF2150) | 100% | Halar_3711 | DL31 | 19.9
299 | TRAP transporter solute receptor, TAXI family | 100% | halTADL_0243 | tADL | 19.8
300 | chemotaxis response regulator CheY | 100% | halTADL_1808 | tADL | 19.8
301 | hypothetical protein (transmembrane helix near C-terminal) | 100% | halTADL_0100 | tADL | 19.7
302 | ribosomal protein L14P | ambiguous | halTADL_3377 | tADL | 19.5
303 | cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 100% | Hlac_2976 | Hrr. lacusprofundi | 19.4
304 | ribosomal protein S7 | 87% | halTADL_0623 | tADL | 19.4
305 | transcriptional regulator, MarR family | ambiguous | halTADL_2793 | tADL | 19.3
306 | ferredoxin | 100% | Hlac_2176 | Hrr. lacusprofundi | 19.3
307 | hypothetical protein | 89% | halTADL_2560 | tADL | 19.1
308 | ribosomal protein S10 | 100% | halTADL_0653 | tADL | 18.8
309 | phosphoserine phosphatase (SerB) (EC 3.1.3.3) | 96% | halTADL_2046 | tADL | 18.6
310 | ribosomal protein L32e | 100% | Halar_2464 | DL31 | 18.4
311 | ribonuclease J / beta-lactamase domain | 100% | halTADL_2415 | tADL | 18.4
312 | proteasome beta subunit (PsmB) (EC 3.4.25.1) | 100% | Halar_1691 | DL31 | 18.4
313 | hypothetical protein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 100% | halTADL_1765 | tADL | 18.3
314 | adhesion pilin (PilA) | 100% | Halar_2365 | DL31 | 18.3
315 | hypothetical protein (TAT signal) | 100% | Hlac_0313 | Hrr. lacusprofundi | 18.2
316 | SPFH domain membrane protease (N-terminal transmembrane helix) | 100% | halTADL_2057 | tADL | 18.2
317 | peptidoglycan-associated lipoprotein (OmpA/MotB) | 65% | ud | Marinobacter santoriniensis | 18.0
318 | hypothetical protein | 100% | Halar_3727 | DL31 | 17.9
319 | glutamine synthetase (GS) (GlnA) (EC 6.3.1.2) | 97% | halTADL_3423 | tADL | 17.8
320 | ribosomal LX protein | 100% | halTADL_2196 | tADL | 17.7
321 | ATP synthase, H subunit (AtpH) (EC 3.6.3.14) | 100% | halTADL_1938 | tADL | 17.7

322 | ribosomal protein S6e | 100% | Halar_2261 | DL31 | 17.7
323 | dihydroxy-acid dehydratase (IlvD) (EC 4.2.1.9) | ambiguous | halTADL_2417 | tADL | 17.6
324 | DNA-directed RNA polymerase subunit A2 (RpoA2) (EC 2.7.7.6) | ambiguous | halTADL_0620 | tADL | 17.6
325 | methionyl-tRNA synthetase (MetG) (EC 6.1.1.10) | 99% | halTADL_3069 | tADL | 17.4
326 | phosphate ABC transporter solute-binding protein (PstS) | 100% | Hlac_3551 | Hrr. lacusprofundi | 17.3
327 | glycerol-3-phosphate dehydrogenase (GlpA) (EC 1.1.5.3) | 87% | halTADL_2244 | tADL | 17.3
328 | hypothetical protein | 100% | halTADL_0015 | tADL | 17.1
329 | zinc-binding alcohol dehydrogenase | 100% | halTADL_1853 | tADL | 17.1
330 | transcriptional regulator, AsnC family | 100% | Hlac_2373 | Hrr. lacusprofundi | 17.0
331 | hypothetical protein | 91% | ud | Natrinema versiforme | 16.9
332 | transcriptional regulator, XRE family | 100% | halTADL_2533 | tADL | 16.9
333 | transcription factor TFIIE, alpha subunit | 100% | Hlac_1708 | Hrr. lacusprofundi | 16.9
334 | pyruvate kinase (Pyk) (EC 2.7.1.40) | 100% | halTADL_3014 | tADL | 16.8
335 | Hsp20-type chaperone | 93% | halTADL_0114 | tADL | 16.6
336 | proteasome alpha subunit (PsmA) (EC 3.4.25.1) | 100% | halTADL_3109 | tADL | 16.4
337 | DNA-directed RNA polymerase subunit L (RpoL) (EC 2.7.7.6) | 100% | halTADL_3313 | tADL | 16.4
338 | ribosomal protein S4e | 90% | halTADL_3375 | tADL | 16.2
339 | cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 50% | halTADL_1043 | tADL | 16.2
340 | major capsid protein | 43% | [21] | Halovirus HHTV-1 | 15.9
341 | thioredoxin | 100% | halTADL_2077 | tADL | 15.9
342 | pyridoxine biosynthesis protein (lyase) PdxS | 100% | halTADL_3362 | tADL | 15.8
343 | thioredoxin | 100% | halTADL_2563 | tADL | 15.7
344 | amidohydrolase (acetamidase/formamidase) | 100% | halTADL_0419 | tADL | 15.7
345 | FAD-dependent pyridine nucleotide-disulfide oxidoreductase | ambiguous | halTADL_2528 | tADL | 15.6
346 | rhodanese | 94% | C489_15497 | Natrinema versiforme | 15.6
347 | phosphonate ABC transport system, solute-binding protein (PhnD) | 100% | halTADL_1334 | tADL | 15.5
348 | ribosomal protein S17P | 100% | halTADL_3378 | tADL | 15.4

349 | elongation factor Tu, chloroplastic (EF-Tu) (TufA) | 98% | ud | Dunaliella tertiolecta | 15.3
350 | inorganic pyrophosphatase (Ppa) (EC 3.6.1.1) | 100% | halTADL_1644 | tADL | 15.2
351 | heme-based aerotactic transducer HemAT | 100% | halTADL_1627 | tADL | 15.1
352 | hypothetical protein (transmembrane helix near N-terminal) | 50% | HalDL1_0733 | DL1 | 15.0
353 | ribosomal protein S17e | 100% | halTADL_0708 | tADL | 15.0
354 | FAD-dependent pyridine nucleotide-disulfide oxidoreductase | 100% | halTADL_1122 | tADL | 15.0
355 | 2-dehydro-3-deoxyphosphogluconate (KDPG) aldolase (Eda) (EC 4.1.2.14) | 100% | halTADL_0882 | tADL | 15.0
356 | cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 42% | halTADL_1043 | tADL | 14.9
357 | translation initiation factor 2, gamma subunit (a/eIF2-gamma) (Eif2g) | 100% | halTADL_3271 | tADL | 14.6
358 | adhesion pilin (PilA) | 33% | halTADL_1885 | tADL | 14.6
359 | thioredoxin | 100% | halTADL_1756 | tADL | 14.5
360 | DNA-directed RNA polymerase subunit A2 (RpoA2) (EC 2.7.7.6) | 93% | halTADL_0620 | tADL | 14.4
361 | hypothetical protein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 76% | halTADL_1765 | tADL | 14.3
362 | archaellin FlaA or FlaB | 100% | HalDL1_1518 | DL1 | 14.1
363 | DNA repair and recombination protein RadA | 100% | halTADL_1827 | tADL | 14.1
364 | ribosomal protein L13 | 100% | halTADL_2776 | tADL | 14.1
365 | hypothetical protein (TAT signal) | 100% | Hlac_0476 | Hrr. lacusprofundi | 14.0
366 | phage shock protein A, PspA | 70% | Nmlp_2541 | Natronomonas moolapensis | 14.0
367 | hypothetical protein | 34% | ud | Vibrio phage VBM1 | 14.0
368 | hypothetical protein (transmembrane helix near N-terminal) | 39% | ud | Halococcus salifodinae | 14.0
369 | cell division protein FtsZ | ambiguous | halTADL_0937 | tADL | 14.0
370 | hypothetical protein (TAT signal) | 40% | halTADL_1761 | tADL | 13.8
371 | branched-chain amino acid aminotransferase (IlvE) (EC 2.6.1.42) | 100% | Halar_2889 | DL31 | 13.7

372 | hypothetical protein (Sec signal) | 100% | halTADL_1203 | tADL | 13.7
373 | branched-chain amino acid ABC transporter solute-binding protein | 100% | halTADL_2916 | tADL | 13.6
374 | agmatinase (SpeB) (EC 3.5.3.11) | 95% | halTADL_1131 | tADL | 13.6
375 | glutamine synthetase (GS) (GlnA) (EC 6.3.1.2) | 100% | Hlac_2374 | Hrr. lacusprofundi | 13.5
376 | transcriptional regulator, RosR (PadR family) | 92% | halTADL_1645 | tADL | 13.5
377 | short-chain dehydrogenase/reductase (SDR): glucose/ribitol dehydrogenase domain | 100% | Hlac_3251/halTADL_1233 | Hrr. lacusprofundi/tADL | 13.5
378 | glutamate dehydrogenase (GdhA) (EC 1.4.1.3/1.4.1.4) | 100% | Halar_0758 | DL31 | 13.3
379 | hypothetical protein (DUF541) (TAT signal) | 100% | halTADL_0483 | tADL | 13.2
380 | ATP synthase, K subunit (AtpK) (EC 3.6.3.14) | 100% | Halar_3027 | DL31 | 13.2
381 | glycerol kinase (GlpK) (EC 2.7.1.30) | 96% | halTADL_2249 | tADL | 13.1
382 | SPFH domain membrane protease (N-terminal transmembrane helix) | 100% | Halar_3276 | DL31 | 13.0
383 | hypothetical protein (DUF2110) | 100% | halTADL_0856 | tADL | 13.0
384 | TrkA-N domain + DHH phosphatase family domain | 100% | halTADL_2524 | tADL | 12.9
385 | 3-isopropylmalate dehydrogenase (LeuB) (EC 1.1.1.85) | 100% | halTADL_0366 | tADL | 12.9
386 | ferritin Dps family protein | 100% | Hlac_0536 | Hrr. lacusprofundi | 12.8
387 | hypothetical protein (DUF1508) | 100% | halTADL_3148 | tADL | 12.8
388 | ribosomal protein S8e | 100% | halTADL_3327 | tADL | 12.6
389 | phosphate uptake regulator, PhoU | 100% | halTADL_1186 | tADL | 12.6
390 | anthranilate phosphoribosyltransferase (TrpD) (EC 2.4.2.18) | 100% | halTADL_0889 | tADL | 12.5
391 | enolase (phosphopyruvate hydratase) (Eno) (EC 4.2.1.11) | 90% | halTADL_2780 | tADL | 12.4
392 | hypothetical protein | 100% | Hlac_3168/Halar_0079 | Hrr. lacusprofundi/DL31 | 12.4
393 | ribonucleoside-diphosphate reductase, alpha subunit (NrdE) (EC 1.17.4.1) | 100% | halTADL_0884 | tADL | 12.4

394 | carbohydrate ABC transporter solute-binding protein | 100% | Hlac_2984/halTADL_1095 | tADL/Hrr. lacusprofundi | 12.3
395 | ribosomal protein L7/L12 | 89% | ud | Marinobacter santoriniensis | 12.3
396 | transcriptional regulator, TrmB | 100% | halTADL_2989 | tADL | 12.3
397 | beta-lactamase / transpeptidase domain | 100% | halTADL_0425 | tADL | 12.1
398 | PUA-domain-containing protein | ambiguous | halTADL_0666 | tADL | 12.1
399 | prefoldin, beta subunit (PfdB) | 95% | ud | Natrinema pallidum | 12.1
400 | VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 100% | Halar_1865 | DL31 | 12.1
401 | aconitate hydratase (AcnA) (EC 4.2.1.3) | 96% | halTADL_2902 | tADL | 12.0
402 | 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase (DapD) (EC 2.3.1.117) | ambiguous | halTADL_0281 | tADL | 11.9
403 | ribosomal protein L19e | 100% | Halar_2463 | DL31 | 11.9
404 | hypothetical protein (C-terminal transmembrane helix) | 43% | J07HN4v3_00566 | Halonotius sp. J07HN4 | 11.9
405 | hypothetical protein (Sec signal; bacterial virulence factor lipase N-terminal domain; alpha/beta hydrolase fold near C-terminal) | 42% | ud | Marinobacter algicola | 11.8
406 | ATP synthase, beta subunit (AtpB) (EC 3.6.3.14) | 92% | halTADL_1945 | tADL | 11.8
407 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | Halar_2920 | DL31 | 11.8
408 | dehydroquinate synthase II (EC 1.4.1.-) | 100% | halTADL_0574 | tADL | 11.7
409 | cold-shock protein (CspE) | 81% | ud | Oceanimonas sp. | 11.7
410 | HTH domain | 100% | Halar_2438 | DL31 | 11.6
411 | transcriptional regulator, TrmB | 100% | Halar_2383 | DL31 | 11.6
412 | haloacid dehalogenase-like hydrolase | 100% | halTADL_2160 | tADL | 11.5
413 | DNA-directed RNA polymerase subunit B (RpoB1) (EC 2.7.7.6) | 100% | halTADL_0618 | tADL | 11.5
414 | CBS domain containing membrane protein | 100% | halTADL_0719 | tADL | 11.4
415 | DNA gyrase subunit A GyrA (EC 5.99.1.3) | 100% | halTADL_3019 | tADL | 11.3
416 | DUF1508; 4 x transmembrane helices | 100% | Hlac_2647 | Hrr. lacusprofundi | 11.2

417 | methenyltetrahydromethanopterin cyclohydrolase (Mch) (EC 3.5.4.27) | 100% | halTADL_3392 | tADL | 11.2
418 | ribosomal protein L5 | 100% | halTADL_3374 | tADL | 11.1
419 | Rieske iron-sulfur membrane protein | 100% | halTADL_0720 | tADL | 11.1
420 | fructose 1,6-bisphosphate aldolase (multifunctional) (EC 4.1.2.13) | 96% | halTADL_3234 | tADL | 11.0
421 | glucose-1-phosphate thymidylyltransferase (RfbA, RffH) (EC 2.7.7.24) | 93% | halTADL_3353 | tADL | 11.0
422 | Type II secretory pathway, pseudopilin PulG | 56% | ud | Marinobacter algicola | 11.0
423 | hypothetical protein with carboxypeptidase regulatory-like domain (Sec signal, C-terminal transmembrane helices) | 100% | Halar_0957 | DL31 | 10.9
424 | adenine phosphoribosyltransferase (Apt) (EC 2.4.2.7) | 100% | halTADL_2952 | tADL | 10.9
425 | major capsid protein | 32% | Hlac_0760 | Hlac-Pro1 | 10.8
426 | phosphoenolpyruvate carboxylase (Ppc) (EC 4.1.1.31) | 100% | halTADL_0401 | tADL | 10.7
427 | PRC-barrel domain | 100% | halTADL_0609 | tADL | 10.7
428 | hypothetical protein (DUF655; predicted RNA-binding domain) | 89% | halTADL_0183 | tADL | 10.7
429 | archaellar protein FlaG | 100% | halTADL_1803 | tADL | 10.7
430 | homoserine dehydrogenase (MetL) (EC 1.1.1.3) | 100% | halTADL_0649 | tADL | 10.7
431 | hypothetical protein | 94% | halTADL_0015 | tADL | 10.6
432 | ferredoxin | 96% | ud | Haloterrigena turkmenica | 10.6
433 | prohead protease | 33% | [119] | Halovirus HVTV-1 | 10.6
434 | adhesion pilin (PilA) | 100% | Hlac_1363 | Hrr. lacusprofundi | 10.5
435 | ribosomal protein S8 | 100% | halTADL_3372 | tADL | 10.5
436 | mechanosensitive ion channel (MscS) | 100% | halTADL_2994 | tADL | 10.5
437 | translation elongation factor 1-beta (EF-1-beta) (Ef1b) | 100% | Halar_2185 | DL31 | 10.5
438 | zinc-binding alcohol dehydrogenase | 93% | C438_00325 | Haloferax denitrificans | 10.4
439 | short-chain dehydrogenase/reductase (SDR): glucose/ribitol dehydrogenase domain | 100% | Halar_1075 | DL31 | 10.4
440 | transcriptional regulator, XRE family | 100% | halTADL_0915 | tADL | 10.3

441 | proteasome beta subunit (PsmB) (EC 3.4.25.1) | 92% | halTADL_2911 | tADL | 10.3
442 | imidazoleglycerol-phosphate dehydratase (HisB) (EC 4.2.1.19) | 100% | halTADL_1797 | tADL | 10.2
443 | transcriptional regulator, AsnC family | 100% | halTADL_0058 | tADL | 10.2
444 | transcription elongation factor NusA | 81% | ud | Marinobacter | 10.2
445 | ribosomal protein S3 | 86% | ud | Marinobacter | 10.2
446 | elongation factor 1-beta (aEF-1beta) (Ef1b) | 100% | halTADL_3453 | tADL | 10.1
447 | hypothetical protein (TAT signal) | 39% | Nmag_3745 | Natrialba magadii | 9.9
448 | hypothetical protein (Zn-finger domain) | 100% | halTADL_0613 | tADL | 9.9
449 | acetate:CoA ligase (Acs) (EC 6.2.1.1) | 92% | halTADL_1017 | tADL | 9.8
450 | alpha-1,4-glucan-protein synthase | 100% | Halar_1081 | DL31 | 9.8
451 | phosphoenolpyruvate (PEP) synthase (Pps) (EC 2.7.9.2) | 79% | halTADL_1011 | tADL | 9.8
452 | major capsid protein | 54% | [33] | Halovirus HCTV-2 | 9.7
453 | hypothetical protein (DUF964 / YheA/YmcA domain) | 32% | ud | Halorubrum arcis | 9.7
454 | elongation factor 1-beta (aEF-1beta) (Ef1b) | 94% | halTADL_3453 | tADL | 9.7
455 | ribosomal protein L24e | 100% | halTADL_0168 | tADL | 9.7
456 | ribosomal protein L30P | 94% | halTADL_3366 | tADL | 9.7
457 | phage shock protein A, PspA | 100% | halTADL_2278 | tADL | 9.5
458 | S-adenosylmethionine synthetase (Mat) (EC 2.5.1.6) | 100% | halTADL_3028 | tADL | 9.5
459 | glycerol kinase (GlpK) (EC 2.7.1.30) | 100% | halTADL_0681 | tADL | 9.5
460 | hypothetical protein (DUF655 - predicted RNA-binding domain) | 100% | halTADL_0183 | tADL | 9.5
461 | proteasome-activating nucleotidase (Pan) | 100% | halTADL_0964 | tADL | 9.4
462 | SBDS ribosome maturation protein SDO1 | 100% | halTADL_2242 | tADL | 9.4
463 | ribosomal protein L23 | 88% | halTADL_3385 | tADL | 9.3
464 | glutamine synthetase (GS) (GlnA) (EC 6.3.1.2) | 95% | ud | Natrinema/Haloterrigena | 9.2
465 | cell division protein SepF/SepF-related / DUF1621 | ambiguous | halTADL_0668 | tADL | 9.2
466 | ribosomal protein S6e | 100% | Hlac_2128 | Hrr. lacusprofundi | 9.2
467 | cell division protein FtsA | 96% | halTADL_0130 | tADL | 9.2

468 | VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 100% | Halar_2098 | DL31 | 9.1
469 | carbamoyl-phosphate synthase, large subunit (CarB) (EC 6.3.5.5) | 100% | halTADL_0988 | tADL | 9.1
470 | hypothetical protein | ambiguous | halTADL_2908 | tADL | 9.1
471 | ribosomal LX protein | 97% | halTADL_2196 | tADL | 9.0
472 | peptidylprolyl isomerase, FKBP-type | 100% | Hlac_0806 | Hrr. lacusprofundi | 8.9
473 | chemotaxis signal transduction protein CheW | 100% | halTADL_1838 | tADL | 8.9
474 | ribosomal protein L7Ae | 100% | Halar_2173 | DL31 | 8.9
475 | ribosomal protein S19 | 100% | halTADL_3383 | tADL | 8.8
476 | ribosomal protein L24 | 100% | halTADL_3376 | tADL | 8.8
477 | halolysin (peptidase S8 and S53 subtilisin kexin sedolisin) | 100% | halTADL_1514 | tADL | 8.7
478 | acidic ribosomal protein P0-like protein | 94% | halTADL_0106 | tADL | 8.7
479 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | halTADL_3076 | tADL | 8.7
480 | manganese/iron superoxide dismutase (Sod) (EC 1.15.1.1) | 100% | Hlac_2515 | Hrr. lacusprofundi | 8.7
481 | phosphate uptake regulator, PhoU | 100% | halTADL_3071 | tADL | 8.7
482 | ribosomal protein L6P | 100% | halTADL_3371 | tADL | 8.6
483 | ribosomal protein L2 | 100% | halTADL_3384 | tADL | 8.6
484 | translation initiation factor 5A (aIF-5A) (Eif5a) | 100% | halTADL_2262 | tADL | 8.5
485 | winged helix-turn-helix DNA-binding domain | 100% | halTADL_3005 | tADL | 8.5
486 | polar amino acid ABC transporter solute-binding protein | 100% | halTADL_0024 | tADL | 8.5
487 | 5-methyltetrahydropteroyltriglutamate-homocysteine methyltransferase (MetE) (EC 2.1.1.14) | 100% | halTADL_0179 | tADL | 8.4
488 | DNA-directed RNA polymerase subunit F (RpoF) (EC 2.7.7.6) | 100% | halTADL_0184 | tADL | 8.4
489 | chaperonin GroES | 89% | ud | Marinobacter aquaeolei | 8.3
490 | hypothetical protein (TAT signal) | 50% | halTADL_0878 | tADL | 8.3
491 | BolA (bacterial stress-induced morphogen)-related protein | 100% | halTADL_0160 | tADL | 8.2

492 | ABC-type antimicrobial peptide transport system, permease component | 100% | halTADL_1613 | tADL | 8.1
493 | ribosomal protein S23 | 100% | Hlac_1010/halTADL_0622 | Hrr. lacusprofundi/tADL | 8.1
494 | short-chain dehydrogenase/reductase (SDR): glucose/ribitol dehydrogenase domain | 100% | halTADL_1620 | tADL | 8.0
495 | hypothetical protein | 99% | halTADL_2296 | tADL | 8.0
496 | nucleoside ABC transporter solute-binding protein | 100% | Hlac_1417 | Hrr. lacusprofundi | 8.0
497 | nucleoside-diphosphate-sugar epimerase | 100% | halTADL_3057 | tADL | 8.0
498 | geranylgeranyl diphosphate synthase, type I (IsdA) (EC 2.5.1.1 / 2.5.1.10 / 2.5.1.29) | 100% | halTADL_2983 | tADL | 7.9
499 | DNA-directed RNA polymerase subunit D (RpoD) (EC 2.7.7.6) | 91% | halTADL_2774 | tADL | 7.9
500 | signal recognition particle Srp54, secretory pathway | 100% | halTADL_2202 | tADL | 7.9
501 | glycine hydroxymethyltransferase (GlyA) (EC 2.1.2.1) | ambiguous | halTADL_3114 | tADL | 7.9
502 | UspA domain-containing protein | 100% | halTADL_2112 | tADL | 7.8
503 | ribosomal protein L4/L1e | 100% | Hlac_2448 | Hrr. lacusprofundi | 7.8
504 | phosphate uptake regulator, PhoU | 100% | halTADL_3204 | tADL | 7.8
505 | isocitrate dehydrogenase (Icd) (EC 1.1.1.42) | 100% | Hlac_2330 | Hrr. lacusprofundi | 7.8
506 | hypothetical protein | 45% | HRED_00231 | Candidatus Haloredivivus | 7.7
507 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 25% | Hmuk_1667 | Halomicrobium mukohataei | 7.7
508 | non-histone chromosomal MC1 family protein | 100% | Halar_2922 | DL31 | 7.7
509 | ABC transporter, ribose(?)-binding protein | 84% | ud | Haloarcula sinaiiensis | 7.7
510 | IMP dehydrogenase (GuaB) (EC 1.1.1.205) | 100% | halTADL_0053 | tADL | 7.6
511 | transcriptional regulator, AsnC family | 84% | ud | Haloferax elongans | 7.6
512 | ribosomal LX protein | 100% | Halar_3049 | DL31 | 7.6
513 | von Willebrand factor type A domain | 100% | Hlac_3401 | Hrr. lacusprofundi | 7.6
514 | hypothetical protein (DUF302) | 74% | ud | Halosimplex carlsbadense | 7.6
515 | hypothetical protein | 100% | halTADL_0257 | tADL | 7.6

516 | translation elongation factor aEF-2 (FusA) | 100% | Hlac_0152 | Hrr. lacusprofundi | 7.6
517 | phosphate ABC transporter ATP-binding protein (PstB) | 100% | halTADL_2152 | tADL | 7.6
518 | lactoylglutathione lyase (glyoxalase I) (GloA) (EC 4.4.1.5) | 100% | halTADL_3348 | tADL | 7.5
519 | VapC ribonuclease / PIN domain | 100% | Halar_2111 | DL31 | 7.5
520 | GMP synthase (glutamine-hydrolyzing) (GuaA) (EC 6.3.5.2) | 100% | halTADL_2295 | tADL | 7.5
521 | major capsid protein? | 41% | [VPGG_00034] | Vibrio phage VBM1 | 7.4
522 | Hsp20-type chaperone | 100% | Halar_3260 | DL31 | 7.4
523 | TATA-box-binding protein (Tbp) | 99% | halTADL_0042 | tADL | 7.4
524 | ABC-type oligopeptide transport system, periplasmic component | 69% | ud | Haloquadratum walsbyi | 7.4
525 | hypothetical protein (multiple PKD/chitinase domains) | 30% | C457_16252 | Haloferax prahovense | 7.3
526 | core histone | 100% | Halar_3008 | DL31 | 7.3
527 | aspartate kinase (LysC) (EC 2.7.2.4) | 100% | halTADL_1916 | tADL | 7.3
528 | phosphoglucose isomerase (EC 5.3.1.9) | 100% | halTADL_0801 | tADL | 7.2
529 | succinyl-CoA synthetase, beta subunit (SucC) (EC 6.2.1.5) | 100% | Halar_3400 | DL31 | 7.2
530 | hypothetical protein | 100% | halTADL_2117 | tADL | 7.1
531 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | Halar_2611 | DL31 | 7.1
532 | phosphoribosylformylglycinamidine synthase (PurL) (EC 6.3.5.3) | 100% | halTADL_2726 | tADL | 7.1
533 | phosphate ABC transporter solute-binding protein (PstS) | 100% | Halar_1873 | DL31 | 7.0
534 | DNA-directed RNA polymerase subunit L (RpoL) (EC 2.7.7.6) | 100% | Halar_2527 | DL31 | 7.0
535 | hypothetical protein (DUF839; Sec signal) | 71% | HSB1_34530 | Halogranum salarium B-1 | 6.9
536 | globin-like domain | 100% | Halar_2132 | DL31 | 6.9
537 | orotate phosphoribosyltransferase (PyrE) (EC 2.4.2.10) | 100% | halTADL_0398 | tADL | 6.9
538 | response regulator receiver protein | 100% | halTADL_2200 | tADL | 6.9
539 | 2-oxoglutarate:ferredoxin oxidoreductase, alpha subunit (KorA) (EC 1.2.7.3) | 100% | halTADL_1013 | tADL | 6.8

540 | oligosaccharyltransferase AglB | 100% | halTADL_2411 | tADL | 6.8
541 | ribosomal protein S11 | 93% | ud | Marinobacter/Salinisphaera | 6.8
542 | hypothetical protein (DUF555) | 100% | halTADL_0916 | tADL | 6.8
543 | alpha-amylase (glycosyl hydrolase, family 13) (EC 3.2.1.1) | 94% | halTADL_0142 | tADL | 6.8
544 | ribosomal protein L30P | 100% | Hlac_2428 | Hrr. lacusprofundi | 6.7
545 | translation initiation factor 2, beta subunit (a/eIF2-beta) (Eif2b) | 95% | halTADL_2337 | tADL | 6.7
546 | ferritin-like domain | 100% | halTADL_2748 | tADL | 6.7
547 | ribosomal protein L32e | 100% | Hlac_2432 | Hrr. lacusprofundi | 6.7
548 | thioredoxin | 100% | Halar_3305 | DL31 | 6.6
549 | inositol monophosphatase family; possible fructose-1,6-bisphosphatase (Fbp) (EC 3.1.3.9) | ambiguous | halTADL_3201 | tADL | 6.6
550 | outer membrane protein | 38% | ud | Marinobacter adhaerens | 6.6
551 | linocin_M18 bacteriocin protein | 84% | ud | Halorubrum aidingense | 6.6
552 | cell division protein FtsZ | ambiguous | Halar_2224 | DL31 | 6.6
553 | cystathionine gamma-synthase (MetB) (EC 2.5.1.48) or O-acetylhomoserine (thiol)-lyase (MetY) (EC 2.5.1.49) | 100% | halTADL_1890 | tADL | 6.5
554 | hypothetical protein | 100% | Hlac_0725 | Hrr. lacusprofundi | 6.5
555 | alanyl-tRNA synthetase (AlaS) (EC 6.1.1.7) | 100% | halTADL_3231 | tADL | 6.5
556 | formyltetrahydrofolate deformylase (PurU) (EC 3.5.1.10) | ambiguous | halTADL_3051 | tADL | 6.5
557 | glucan 1,4-alpha-glucosidase (glucoamylase) (glycosyl hydrolase, family 15) (EC 3.2.1.3) | 100% | halTADL_0141 | tADL | 6.4
558 | PilT protein: Type II/IV secretion system domain + KH domain | 100% | halTADL_0825 | tADL | 6.3
559 | D-isomer specific 2-hydroxyacid dehydrogenase NAD-binding | 100% | halTADL_0558 | tADL | 6.3
560 | acetolactate synthase, small subunit (IlvH) (EC 2.2.1.6) | 100% | halTADL_0361 | tADL | 6.3
561 | transcriptional regulator, AsnC family | 100% | halTADL_1997 | tADL | 6.3

562 | terpenoid cyclases/protein prenyltransferase alpha-alpha toroid domains; homologs have Sec signal | 30% | Hbor_31320 | Halogeometricum borinquense | 6.2
563 | hydroxyethylthiazole kinase (ThiM) (EC 2.7.1.50) | 100% | halTADL_0472 | tADL | 6.2
564 | proteasome-activating nucleotidase (Pan) | 100% | halTADL_0603 | tADL | 6.2
565 | (Can) (EC 4.2.1.1) | 100% | halTADL_2162 | tADL | 6.1
566 | response regulator receiver protein | 94% | halTADL_1808 | tADL | 6.0
567 | DNA-directed RNA polymerase subunit B" (RpoB2) (EC 2.7.7.6) | 100% | halTADL_0617 | tADL | 6.0
568 | ribosomal protein L4P | 100% | Halar_2480 | DL31 | 6.0
569 | Hermes transposase DNA-binding domain | 100% | Halar_2265 | DL31 | 6.0
570 | adenylosuccinate synthase (PurA) (EC 6.3.4.4) | 100% | halTADL_3004 | tADL | 6.0
571 | chaperone protein DnaK | 100% | Hlac_0682 | Hrr. lacusprofundi | 5.9
572 | phosphoenolpyruvate carboxylase (Ppc) (EC 4.1.1.31) | 94% | halTADL_0401 | tADL | 5.9
573 | ribosomal protein L29 | 100% | Hlac_2442 | Hrr. lacusprofundi | 5.8
574 | UspA domain-containing protein | 100% | halTADL_2351 | tADL | 5.8
575 | hypothetical protein (C-terminal transmembrane helix) | 28% | J07HN6_00220 | Halonotius sp. J07HN6 | 5.7
576 | VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 100% | halTADL_2740 | tADL | 5.6
577 | invasin/intimin cell-adhesion domain (Sec signal) | 25% | Bcav_2782 | Beutenbergia cavernae | 5.5
578 | ribosomal protein S19e | 100% | halTADL_0985 | tADL | 5.5
579 | DsrE/DsrF domain | 100% | Halar_1162 | DL31 | 5.5
580 | thiazole biosynthetic enzyme Thi1 | 100% | Halar_3704 | DL31 | 5.5
581 | citrate lyase, beta subunit (CitE) (EC 4.1.3.6) | 100% | Hlac_0213 | Hrr. lacusprofundi | 5.5
582 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Halar_1285 | DL31 | 5.5
583 | hypothetical protein | 34% | [20] | Halovirus HHTV-1 | 5.4
584 | hypothetical protein (DUF296 - possible DNA-binding) | 100% | halTADL_1861 | tADL | 5.4
585 | hypothetical protein (Sec signal; 2 x PKD/chitinase domains) | 34% | Hlac_2824 | Hrr. lacusprofundi | 5.4

586 | glycerol kinase (GlpK) (EC 2.7.1.30) | 98% | halTADL_0681 | tADL | 5.4
587 | ribosomal protein L15 | 100% | Hlac_1820 | Hrr. lacusprofundi | 5.4
588 | pyrroline-5-carboxylate reductase (ProC) (EC 1.5.1.2) | 100% | halTADL_2360 | tADL | 5.3
589 | DNA-directed RNA polymerase subunit A (RpoA1) (EC 2.7.7.6) | 96% | halTADL_0619 | tADL | 5.3
590 | phosphoserine phosphatase (SerB) (EC 3.1.3.3) | 100% | halTADL_1053 | tADL | 5.3
591 | glyceraldehyde-3-phosphate dehydrogenase (NAD(P)+-dependent), type I (Gap) (EC 1.2.1.12) | 90% | halTADL_0817 | tADL | 5.2
592 | ribosomal protein S3Ae | 98% | halTADL_3142 | tADL | 5.2
593 | glycerol kinase (GlpK) (EC 2.7.1.30) | 98% | halTADL_0681 | tADL | 5.2
594 | branched-chain amino acid ABC transporter solute-binding protein | 62% | ud | Haladaptatus paucihalophilus | 5.2
595 | 6,7-dimethyl-8-ribityllumazine synthase (RibH) (EC 2.5.1.78) | 100% | halTADL_3082 | tADL | 5.2
596 | CRISPR-associated protein Cas8b (=Csh1) (subtype I-B) | ambiguous | halTADL_1361 | tADL | 5.2
597 | DNA-directed RNA polymerase subunit E' (RpoE1) (EC 2.7.7.6) | 100% | halTADL_3269 | tADL | 5.2
598 | hypothetical protein (C-terminal transmembrane helix) | 28% | Natpe_4026 | Natrinema pellirubrum | 5.2
599 | iron ABC transporter solute-binding protein | 100% | Halar_1080 | DL31 | 5.1
600 | glutamate-5-semialdehyde dehydrogenase (ProA) (EC 1.2.1.41/EC 1.2.1.88) | ambiguous | halTADL_2358 | tADL | 5.1
601 | translation initiation factor 2, alpha subunit (a/eIF2-alpha) (Eif2a) | 100% | Hlac_1009 | Hrr. lacusprofundi | 5.1
602 | ATP phosphoribosyltransferase (HisG) (EC 2.4.2.17) | 100% | halTADL_1729 | tADL | 5.1
603 | hypothetical protein (Amphi-Trp + 2 x DUF1508) | 100% | halTADL_0235 | tADL | 5.1
604 | aldehyde dehydrogenase (AldY) (EC 1.2.1.3) | 100% | halTADL_2368 | tADL | 5.0
605 | citramalate synthase (CimA) (EC 2.3.1.182) | 100% | halTADL_1156 | tADL | 5.0
606 | GTPase domain | ambiguous | halTADL_0307 | tADL | 5.0
607 | ribosomal protein L13 | 96% | halTADL_2776 | tADL | 5.0
608 | hypothetical protein | 100% | halTADL_2579 | tADL | 4.9

609 | alkyl hydroperoxide reductase / peroxiredoxin | 100% | halTADL_0067 | tADL | 4.9
610 | hypothetical protein (Zn-finger domain) | 100% | halTADL_0118 | tADL | 4.9
611 | methionyl-tRNA synthetase (MetG) (EC 6.1.1.10) | 87% | halTADL_3069 | tADL | 4.9
612 | ribosomal protein S6e | 91% | halTADL_2119 | tADL | 4.9
613 | hypothetical protein (DUF151) | 100% | halTADL_2936 | tADL | 4.9
614 | DNA polymerase sliding clamp subunit (PCNA homolog) | 100% | Hlac_2653 | Hrr. lacusprofundi | 4.9
615 | ribosomal protein L18e | 100% | Halar_1554 | DL31 | 4.9
616 | proteasome beta subunit (PsmB) (EC 3.4.25.1) | 61% | ud | Halomicrobium mukohataei | 4.9
617 | urea ABC transporter, solute-binding protein | 100% | halTADL_0628 | tADL | 4.8
618 | cell surface glycoprotein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 34% | halTADL_1043 | tADL | 4.8
619 | aspartate aminotransferase (AspB) (EC 2.6.1.1) | 97% | halTADL_3081 | tADL | 4.8
620 | ribosomal protein L4P | 89% | halTADL_3386 | tADL | 4.8
621 | UspA domain-containing protein | 100% | halTADL_1044 | tADL | 4.7
622 | amino acid-binding ACT domain | 100% | halTADL_1796 | tADL | 4.7
623 | hypothetical protein (PGF-CTERM, C-terminal transmembrane helix) | 41% | ud | Halonotius sp. J07HN6 | 4.7
624 | DisA bacterial checkpoint controller nucleotide-binding domain | 100% | halTADL_2993 | tADL | 4.6
625 | SPFH domain membrane protease (N-terminal transmembrane helix) | 39% | HSB1_19170 | Halogranum salarium | 4.6
626 | D-3-phosphoglycerate dehydrogenase (SerA) (EC 1.1.1.95) | 100% | halTADL_0712 | tADL | 4.6
627 | CRISPR-associated protein Csc2 (subtype I-D) | 100% | Hlac_3574 | Hrr. lacusprofundi | 4.6
628 | CRISPR-associated protein Cas7 (= Csh2) (subtype I-B) | 100% | Hlac_3331 | Hrr. lacusprofundi | 4.6
629 | hypothetical protein | 95% | halTADL_2576 | tADL | 4.5
630 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Hlac_0244 | Hrr. lacusprofundi | 4.5
631 | 2-oxoacid dehydrogenase complex, E2 component | 100% | halTADL_2147 | tADL | 4.5
632 | hypothetical protein (DUF88 / NYN domain, limkain-b1-type) | 100% | halTADL_0855 | tADL | 4.5

633 | pre-mRNA processing ribonucleoprotein, binding domain | 100% | halTADL_2985 | tADL | 4.5
634 | NifU (FeS cluster assembly) domain | ambiguous | halTADL_2317 | tADL | 4.5
635 | prohead protease | 35% | [118] | Halovirus HCTV-5 | 4.4
636 | DNA-directed RNA polymerase subunit D (RpoD) (EC 2.7.7.6) | 100% | halTADL_2774 | tADL | 4.4
637 | thioredoxin | 100% | Hlac_0372 | Hrr. lacusprofundi | 4.4
638 | chaperonin GroES | 78% | ud | Marinobacter nanhaiticus | 4.4
639 | DNA-directed RNA polymerase subunit H (RpoH) (EC 2.7.7.6) | 84% | halTADL_0616 | tADL | 4.4
640 | membrane metalloprotease (peptidase M50) | 79% | halTADL_0323 | tADL | 4.4
641 | nucleoside-diphosphate kinase (Ndk) (EC 2.7.4.6) | 84% | halTADL_0169 | tADL | 4.4
642 | aspartate-semialdehyde dehydrogenase (Asd) (EC 1.2.1.11) | 100% | halTADL_0714 | tADL | 4.3
643 | ribosomal protein S19 | 97% | halTADL_3383 | tADL | 4.3
644 | RecJ-like endonuclease | ambiguous | halTADL_1000 | tADL | 4.3
645 | selT/selW/selH selenoprotein domain / Rdx domain | 100% | halTADL_1063 | tADL | 4.3
646 | ribosome-binding GTPase | 100% | halTADL_1919 | tADL | 4.3
647 | ribosomal protein S8 | 97% | halTADL_3372 | tADL | 4.3
648 | ribosomal protein S4 | 93% | halTADL_2772 | tADL | 4.2
649 | nucleoside ABC transporter solute-binding protein | 49% | ud | Natrinema versiforme | 4.2
650 | FAD-dependent pyridine nucleotide-disulfide oxidoreductase | 90% | halTADL_2528 | tADL | 4.2
651 | HpcH/HpaI aldolase; possible 2-dehydro-3-deoxyglucarate aldolase (GarL) (EC 4.1.2.20) | 100% | halTADL_0089 | tADL | 4.2
652 | hypothetical protein (Sec signal) | 100% | halTADL_1753 | tADL | 4.1
653 | PAC2 (proteasome assembly chaperone) | 100% | halTADL_0921 | tADL | 4.1
654 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Halar_3436 | DL31 | 4.1
655 | threonine synthase (ThrC) (EC 4.2.3.1) | 97% | halTADL_2266 | tADL | 4.0
656 | malic enzyme (MaeB) (EC 1.1.1.40) | 100% | halTADL_0683 | tADL | 4.0

657 | aldo-keto reductase family: NADPH-dependent oxidoreductase | 100% | halTADL_0327 | tADL | 4.0
658 | ribosomal protein L6P | 100% | Halar_2465 | DL31 | 4.0
659 | transcriptional regulator, TrmB | 100% | halTADL_3163 | tADL | 3.9
660 | phospholipase/carboxylesterase | 100% | halTADL_0131 | tADL | 3.9
661 | ribosomal protein L25/L23 | 100% | Hlac_2447 | Hrr. lacusprofundi | 3.9
662 | ribosomal protein L29 | 100% | Halar_2474 | DL31 | 3.9
663 | iron ABC transporter solute-binding protein | 100% | Hlac_0162 | Hrr. lacusprofundi | 3.9
664 | ribosomal protein L18P/L5E | ambiguous | halTADL_3368 | tADL | 3.9
665 | thioredoxin | 91% | halTADL_2563 | tADL | 3.9
666 | adhesion pilin (PilA) | 66% | halTADL_1387 | tADL | 3.9
667 | ATP synthase, F subunit (AtpF) (EC 3.6.3.14) | 100% | halTADL_1943 | tADL | 3.9
668 | ribosomal protein S15 | 100% | halTADL_3145 | tADL | 3.8
669 | acyl-CoA synthetase (NDP forming) | 100% | halTADL_2790 | tADL | 3.8
670 | DNA-binding TFAR19-related protein | 100% | halTADL_0984 | tADL | 3.8
671 | ribosomal protein L15 | 100% | halTADL_3365 | tADL | 3.8
672 | oxidoreductase FAD-binding domain | 80% | halTADL_1014 | tADL | 3.8
673 | signal transduction protein with CBS domains | 100% | Halar_1253 | DL31 | 3.7
674 | ribosomal protein L5 | 90% | halTADL_3374 | tADL | 3.7
675 | hypothetical protein (DUF3209) | ambiguous | halTADL_3160 | tADL | 3.7
676 | hypothetical protein (DUF4382) (TAT signal) | 100% | halTADL_1778 | tADL | 3.7
677 | ribonuclease II | 100% | halTADL_0664 | tADL | 3.6
678 | pterin dehydratase? | ambiguous | halTADL_3321 | tADL | 3.6
679 | 2-oxoacid dehydrogenase complex, dihydrolipoamide dehydrogenase | 100% | halTADL_2144 | tADL | 3.6
680 | ribosomal protein L19e | 100% | Hlac_2431 | Hrr. lacusprofundi | 3.6
681 | aconitate hydratase (AcnA) (EC 4.2.1.3) | 90% | Htur_3383 | Haloterrigena turkmenica | 3.5
682 | hypothetical protein (DUF1610 / Zn ribbon domain) | 100% | Halar_2186 | DL31 | 3.5
683 | zinc-binding alcohol dehydrogenase | ambiguous | Hlac_1058 | Hrr. lacusprofundi | 3.4
684 | PBS_HEAT protein; taxis signaling (Schlesner et al., 2009) | 100% | halTADL_1768 | tADL | 3.4
685 | fumarate hydratase (FumC) (EC 4.2.1.2) | 100% | halTADL_0161 | tADL | 3.4

686 | peptide-methionine (S)-S-oxide reductase (MsrA) (EC 1.8.4.11) | 100% | halTADL_1172 | tADL | 3.4
687 | amino acid-binding ACT domain | 100% | halTADL_0648 | tADL | 3.4
688 | restriction/modification enzyme (EC 2.1.1.-) (EC 3.1.21.-) | 100% | Halar_0232 | DL31 | 3.4
689 | glutaredoxin | 100% | halTADL_2104 | tADL | 3.4
690 | hypothetical protein | 100% | halTADL_1480 | tADL | 3.4
691 | pyruvate:ferredoxin oxidoreductase, alpha subunit (PorA) (EC 1.2.7.1) | 93% | halTADL_0382 | tADL | 3.4
692 | molybdopterin molybdotransferase (MoeA) (EC 2.10.1.1) | 100% | halTADL_0723 | tADL | 3.4
693 | ABC-type antimicrobial peptide transport system, permease component | 88% | halTADL_1613 | tADL | 3.4
694 | hypothetical protein | 100% | halTADL_3262 | tADL | 3.4
695 | pyruvate kinase (Pyk) (EC 2.7.1.40) | 91% | halTADL_3014 | tADL | 3.4
696 | ribosomal protein L13 | 100% | Halar_1553 | DL31 | 3.3
697 | nitrate/sulfonate/bicarbonate ABC transporter solute-binding protein | 78% | J07HN4v3_00021 | Halonotius sp. J07HN4 | 3.3
698 | hypothetical protein (Amphi-Trp domain) | 100% | Halar_3643 | DL31 | 3.3
699 | phosphate uptake regulator, PhoU | 92% | halTADL_1186 | tADL | 3.3
700 | DNA-directed RNA polymerase subunit F (RpoF) (EC 2.7.7.6) | 100% | Halar_1831 | DL31 | 3.3
701 | prephenate dehydratase (PheA2) (EC 4.2.1.51) | 100% | halTADL_2073 | tADL | 3.3
702 | carbohydrate ABC transporter solute-binding protein | 100% | halTADL_1911 | tADL | 3.2
703 | hypothetical protein | 100% | halTADL_1081 | tADL | 3.2
704 | glycerol kinase (GlpK) (EC 2.7.1.30) | 100% | Hlac_1122 | Hrr. lacusprofundi | 3.2
705 | ribosomal protein L11 | 94% | halTADL_0103 | tADL | 3.2
706 | hypothetical protein (Amphi-Trp domain) | 100% | halTADL_3120 | tADL | 3.1
707 | prefoldin, beta subunit (PfdB) | 79% | ud | Halococcus sp. | 3.1
708 | hypothetical protein (transmembrane helix near N-terminal) | 70% | halTADL_1615 | tADL | 3.1
709 | SMC domain | 62% | halTADL_1458 | tADL | 3.1
710 | 4-alpha-glucanotransferase (amylomaltase) (glycosyl hydrolase, family 77) (MalQ) (EC 2.4.1.25) | 100% | halTADL_2529 | tADL | 3.1

711 | ATP synthase, alpha subunit (AtpA) (EC 3.6.3.14) | 100% | Halar_3031 | DL31 | 3.1
712 | cytidylate kinase (Cmk) (EC 2.7.4.14) | 100% | halTADL_2798 | tADL | 3.1
713 | ribosomal protein L18e | 93% | halTADL_2775 | tADL | 3.0
714 | dodecin | 100% | halTADL_3198 | tADL | 3.0
715 | succinyl-CoA synthetase, alpha subunit (SucD) (EC 6.2.1.5) | 100% | Hlac_2208 | Hrr. lacusprofundi | 3.0
716 | chaperone protein DnaK | 100% | Halar_2629 | DL31 | 3.0
717 | archaellin FlaA or FlaB | 47% | ud | Natronolimnobius innermongolicus | 3.0
718 | lactoylglutathione lyase-like lyase | 88% | ud | Natrinema pellirubrum | 3.0
719 | linocin_M18 bacteriocin protein | 82% | ud | Haloterrigena thermotolerans | 3.0
720 | conjugative transfer protein (TraF) | 43% | ud | Marinobacter adhaerens | 3.0
721 | hypothetical protein | ambiguous | Halar_2513 | DL31 | 3.0
722 | SecD/SecF/SecDF export membrane protein | 100% | halTADL_0788 | tADL | 3.0
723 | aminopeptidase (peptidase family M42) | ambiguous | halTADL_0101 | tADL | 3.0
724 | hypothetical protein (DUF2150) | 92% | ud | Natrinema/Haloterrigena | 3.0
725 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | halTADL_0301 | tADL | 3.0
726 | scaffold protein | 59% | [12] | Halovirus HRTV-7 | 3.0
727 | hypothetical protein (DUF4013; 4 x transmembrane domains) | 73% | halTADL_3238 | tADL | 3.0
728 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | halTADL_1037 | tADL | 2.9
729 | 2-isopropylmalate synthase (LeuA) (EC 2.3.3.13) | 100% | halTADL_0359 | tADL | 2.9
730 | hypothetical protein | 100% | halTADL_0721 | tADL | 2.9
731 | phosphoribosylaminoimidazolecarboxamide formyltransferase / IMP cyclohydrolase (PurH) (EC 2.1.2.3/EC 3.5.4.10) | 100% | halTADL_0571 | tADL | 2.9
732 | hypothetical protein with carboxypeptidase regulatory-like domain (Sec signal, C-terminal transmembrane helices) | 100% | Halar_2903 | DL31 | 2.9

733 | fructose-1,6-bisphosphate aldolase, class I (FbaB) (EC 4.1.2.13) | 92% | halTADL_0575 | tADL | 2.8
734 | hypothetical protein (Zn-ribbon domain / DUF2072) | 100% | halTADL_1917 | tADL | 2.8
735 | multi-copper oxidase: possible nitrite reductase (NirK) (EC 1.7.2.1) | 100% | halTADL_2997 | tADL | 2.8
736 | ThiJ/PfpI domain-containing protein | 100% | Hlac_0958 | Hrr. lacusprofundi | 2.8
737 | ATP synthase, I subunit (AtpI) (EC 3.6.3.14) | 100% | halTADL_1939 | tADL | 2.8
738 | TATA-box binding protein (Tbp) | 100% | Hlac_3413/halTADL_1279 | Hrr. lacusprofundi/tADL | 2.8
739 | hypothetical protein (Sec signal) | 56% | [C464_10963] | Halorubrum sp. AJ67 | 2.8
740 | hypothetical protein (DUF88 / NYN domain, limkain-b1-type) | 100% | Halar_0815 | DL31 | 2.7
741 | ribosomal protein S4 | 100% | Halar_1557 | DL31 | 2.7
742 | ribosomal protein S4E, central domain | 100% | Hlac_2437 | Hrr. lacusprofundi | 2.7
743 | pyrroline-5-carboxylate dehydrogenase (RocA) (EC 1.2.1.88) | 100% | Halar_1051 | DL31 | 2.7
744 | metallophosphoesterase | 100% | Hlac_3129/HalDL1_3291/halTADL_2019/Halar_0424 | Hrr. lacusprofundi/DL1/tADL/DL31 | 2.7
745 | aminopeptidase (peptidase family M28) | 100% | Halar_3640 | DL31 | 2.7
746 | ferritin-like domain | 100% | Halar_2275/Halar_3188 | DL31 | 2.6
747 | nucleic acid binding OB-fold tRNA/helicase-type (RPA32 homolog) | 44% | halTADL_3434 | tADL | 2.6
748 | rhodanese-like protein | 100% | Halar_2935 | DL31 | 2.6
749 | orotate phosphoribosyltransferase (PyrE) (EC 2.4.2.10) | 90% | halTADL_0398 | tADL | 2.6
750 | adhesion pilin (PilA) | 65% | halTADL_0751 | tADL | 2.6
751 | transcriptional regulator | 100% | ud | Marinobacter nanhaiticus | 2.6
752 | bacteriorhodopsin | 100% | halTADL_1952 | tADL | 2.6
753 | glutaredoxin | 100% | halTADL_0399 | tADL | 2.5
754 | aldo-keto reductase family: NADPH-dependent oxidoreductase | 100% | halTADL_1064 | tADL | 2.5
755 | NH(3)-dependent NAD(+) synthetase (EC 6.3.1.5) | ambiguous | halTADL_2231 | tADL | 2.5
756 | OsmC family protein | 100% | Halar_1442 | DL31 | 2.5

757 | nucleic acid binding OB-fold tRNA/helicase-type (RPA32 homolog) | 100% | halTADL_3434 | tADL | 2.5
758 | hypothetical protein | 100% | halTADL_3407 | tADL | 2.5
759 | response regulator receiver protein | 65% | halTADL_2696 | tADL | 2.5
760 | glutamate synthase (GOGAT) (GltB) (EC 1.4.1.13/1.4.1.14) | 100% | halTADL_0125 | tADL | 2.5
761 | archaellar proteins FlaC or FlaD or FlaE | 100% | halTADL_1805 | tADL | 2.5
762 | ribosomal protein S8e | 89% | halTADL_3327 | tADL | 2.5
763 | hypothetical protein (DUF302) | 100% | halTADL_1653 | tADL | 2.5
764 | DegV family protein (fatty metabolism related) | 79% | ud | Marinobacter algicola | 2.5
765 | ribosomal protein S5 | 100% | Halar_2461 | DL31 | 2.5
766 | hypothetical protein (Sec signal) | ambiguous | Halar_3358 | DL31 | 2.5
767 | ribosomal protein L24E | 100% | Hlac_1844 | Hrr. lacusprofundi | 2.5
768 | ribosomal protein L3 | 100% | Hlac_2449 | Hrr. lacusprofundi | 2.5
769 | aldo/keto reductase | 100% | Halar_1866 | DL31 | 2.4
770 | hypothetical protein (DUF1508) | ambiguous | Halar_2379 | DL31 | 2.4
771 | TATA-box-binding protein (Tbp) | 100% | Halar_1670 | DL31 | 2.4
772 | hypothetical protein | 100% | Halar_3204 | DL31 | 2.4
773 | hypothetical protein (DUF424) | ambiguous | halTADL_2131 | tADL | 2.4
774 | thioesterase | 100% | halTADL_2491 | tADL | 2.4
775 | aspartyl-tRNA synthetase (AspS) (EC 6.1.1.12) | 100% | Halar_3227 | DL31 | 2.4
776 | hypothetical protein | ambiguous | halTADL_0400 | tADL | 2.4
777 | hypothetical protein (DUF964 / YheA/YmcA domain) | 88% | halTADL_0133 | tADL | 2.4
778 | uridylate kinase (PyrH) (EC 2.7.4.22) | 100% | halTADL_1147 | tADL | 2.4
779 | ribosomal protein L11 | 100% | Hlac_1983 | Hrr. lacusprofundi | 2.4
780 | glycosyl transferase group 1 (possible alpha-D-glucan synthase) | 100% | halTADL_2565 | tADL | 2.3
781 | nucleic acid binding OB-fold tRNA/helicase-type | 48% | HalDL1_2841 | DL1 | 2.3
782 | heavy metal-exporting ATPase (copper?) | 100% | halTADL_1767 | tADL | 2.3
783 | ribosomal protein S27e | 100% | halTADL_0924 | tADL | 2.3
784 | ribosomal protein L3 | 100% | Halar_2481 | DL31 | 2.3

785 | pyruvate:ferredoxin oxidoreductase, alpha subunit (PorA) (EC 1.2.7.1) | 100% | halTADL_0382 | tADL | 2.3
786 | archaellin FlaA or FlaB | 75% | halTADL_1544 | tADL | 2.3
787 | phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (HisA) (EC 5.3.1.16) | 100% | halTADL_1799 | tADL | 2.2
788 | ribosomal L37ae protein | ambiguous | halTADL_1111 | tADL | 2.2
789 | TATA-box-binding protein (Tbp) | 100% | Hlac_1523 | Hrr. lacusprofundi | 2.2
790 | PRC-barrel domain | 100% | Halar_2624 | DL31 | 2.2
791 | transcriptional regulator, RosR (PadR family) | 100% | Halar_0879 | DL31 | 2.2
792 | metallo-beta-lactamase superfamily (RNA metabolism?) | 100% | halTADL_2982 | tADL | 2.2
793 | DNA polymerase sliding clamp subunit (PCNA homolog) | 100% | Halar_3180 | DL31 | 2.2
794 | rhodanese-like protein | 100% | Hlac_3034/HalDL1_3098/Halar_3573 | Hrr. lacusprofundi/DL1/DL31 | 2.1
795 | Hsp20-type chaperone | 47% | Halar_3162 | DL31 | 2.1
796 | hypothetical protein | ambiguous | halTADL_0358 | tADL | 2.1
797 | predicted RNA-binding protein containing KH domain, possibly ribosomal protein | 100% | halTADL_0545 | tADL | 2.1
798 | ATP synthase, D subunit (AtpD) (EC 3.6.3.14) | 100% | halTADL_1953 | tADL | 2.1
799 | hypothetical protein | 100% | halTADL_1843 | tADL | 2.1
800 | hypothetical protein (TAT signal) | 59% | halTADL_1047 | tADL | 2.1
801 | type I phosphodiesterase/nucleotide pyrophosphatase or sulfatase | 100% | halTADL_1660 | tADL | 2.1
802 | UspA domain-containing protein | 100% | halTADL_2110 | tADL | 2.1
803 | hypothetical protein | 100% | halTADL_0370 | tADL | 2.1
804 | nucleic acid-binding/OB-fold/TRAM domain | 58% | ud | Halorubrum coriense | 2.0
805 | phosphoadenosine phosphosulfate reductase (PAPS reductase) (EC 1.8.4.8) | ambiguous | halTADL_1176 | tADL | 2.0
806 | K+ uptake system, TrkA subunit | 100% | halTADL_3061 | tADL | 2.0
807 | hypothetical protein (2 x transmembrane helices) | 100% | Hlac_2059 | Hrr. lacusprofundi | 2.0

808 | DNA-directed RNA polymerase subunit F (RpoF) (EC 2.7.7.6) | 100% | Hlac_2278 | Hrr. lacusprofundi | 2.0
809 | hypothetical protein: 3 x chitinase/PKD domains, C-terminal transmembrane helix | 27% | Hlac_2824 | Hrr. lacusprofundi | 1.9
810 | aldo-keto reductase family: NADPH-dependent oxidoreductase | 100% | halTADL_1049 | tADL | 1.9
811 | enolase C-terminal domain | ambiguous | halTADL_0393 | tADL | 1.9
812 | ribosomal protein L23P | 98% | ud | Haloterrigena thermotolerans | 1.9
813 | transcriptional regulator, TrmB | 100% | halTADL_2871 | tADL | 1.9
814 | TRAP dicarboxylate transporter, DctP subunit | 62% | ud | Marinobacter nanhaiticus | 1.8
815 | KH-domain/beta-lactamase-domain | 100% | Halar_0869 | DL31 | 1.8
816 | ribosomal protein L22 | 100% | Halar_2476 | DL31 | 1.8
817 | ribosomal protein S19 | 100% | Halar_2477 | DL31 | 1.8
818 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Halar_2651 | DL31 | 1.8
819 | ATP synthase, beta subunit (AtpB) (EC 3.6.3.14) | 100% | HalDL1_2816 | DL1 | 1.8
820 | FeS assembly protein SufB | 95% | halTADL_0973 | tADL | 1.8
821 | aldo-keto reductase family: NADPH-dependent oxidoreductase | 100% | halTADL_1109 | tADL | 1.8
822 | ribosomal protein S2 | 100% | Hlac_1826 | Hrr. lacusprofundi | 1.8
823 | hypothetical protein | 100% | Hlac_1219 | Hrr. lacusprofundi | 1.8
824 | 5-(carboxyamino)imidazole ribonucleotide synthase (PurK) (EC 6.3.4.18) | 100% | halTADL_3087 | tADL | 1.8
825 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 38% | ud | Halorubrum kocurii | 1.8
826 | S-adenosylhomocysteine hydrolase (AchY) (EC 3.3.1.1) | 100% | halTADL_1723 | tADL | 1.8
827 | peptidylprolyl isomerase, FKBP-type | 90% | halTADL_3026 | tADL | 1.8
828 | ribosomal protein L11 | 100% | Halar_2189 | DL31 | 1.7
829 | hypothetical protein (chlorite dismutase domain) | 100% | halTADL_1349 | tADL | 1.7
830 | transcriptional regulator, AsnC family | 100% | halTADL_3062 | tADL | 1.7
831 | transcriptional regulator, AsnC family | 100% | halTADL_3318 | tADL | 1.7

832 | DNA binding protein, Tfx family | 100% | halTADL_3431 | tADL | 1.7
833 | phosphoglycerate kinase (Pgk) (EC 2.7.2.3) | 94% | halTADL_0816 | tADL | 1.7
834 | ribosomal protein L6P | 90% | halTADL_3371 | tADL | 1.7
835 | phasin | 100% | ud | Halomonas | 1.7
836 | nucleic acid binding OB-fold tRNA/helicase-type (RPA32 homolog) | 100% | Halar_3463 | DL31 | 1.7
837 | phosphate/sulfate permease (PiT family) | 100% | halTADL_3083 | tADL | 1.7
838 | GCN5-related N-acetyltransferase | 100% | Halar_1412 | DL31 | 1.6
839 | adhesion pilin (PilA) | 39% | Halar_2364 | DL31 | 1.6
840 | hypothetical protein | 73% | ud | Prevotella sp | 1.6
841 | branched-chain amino acid aminotransferase (IlvE) (EC 2.6.1.42) | ambiguous | halTADL_1961 | tADL | 1.6
842 | DNA-directed RNA polymerase subunit N (RpoN) (EC 2.7.7.6) | 97% | halTADL_2778 | tADL | 1.6
843 | protein translation factor SUI1 homolog (Sui1) | 100% | Halar_1901 | DL31 | 1.6
844 | hypothetical protein (N-terminal transmembrane helix) | 73% | ud | Natrinema versiforme | 1.6
845 | fructose-1,6-bisphosphate aldolase, class II (FbaA) (EC 4.1.2.13) | 100% | halTADL_3223 | tADL | 1.6
846 | UspA domain-containing protein | 100% | halTADL_0697 | tADL | 1.6
847 | hypothetical protein (TAT signal) | 100% | Halar_0006 | DL31 | 1.6
848 | group II chaperonin (thermosome) | 92% | ud | Natrinema pellirubrum | 1.6
849 | transcriptional regulator, RosR (PadR family) | ambiguous | halTADL_0352 | tADL | 1.6
850 | translation initiation factor 2, alpha subunit (a/eIF2-alpha) (Eif2a) | 100% | halTADL_0923 | tADL | 1.6
851 | UspA domain-containing protein | 76% | halTADL_1904 | tADL | 1.6
852 | hypothetical protein (transmembrane helix near N-terminal) | 97% | halTADL_2505 | tADL | 1.6
853 | AAA ATPase central domain | ambiguous | halTADL_2577 | tADL | 1.6
854 | K+ uptake system, TrkA subunit | 100% | halTADL_2713 | tADL | 1.6
855 | tyrosyl-tRNA synthetase | 100% | halTADL_3299 | tADL | 1.6
856 | iron ABC transporter solute-binding protein | 83% | halTADL_1788 | tADL | 1.6
857 | ribosomal protein S4 | 94% | ud | Natrinema sp. | 1.5

858 | bacterioferritin | 69% | ud | Marinobacter nanhaiticus | 1.5
859 | peptidylprolyl isomerase, FKBP-type | 43% | ud | Marinobacter adhaerens | 1.5
860 | FAD-dependent oxidoreductase (geranylgeranyl reductase? (EC 1.3.1.83)) | 100% | Halar_0855 | DL31 | 1.5
861 | alkyl hydroperoxide reductase / peroxiredoxin | 100% | Halar_1849 | DL31 | 1.5
862 | ribosomal protein S8e | ambiguous | Halar_1876 | DL31 | 1.5
863 | hypothetical protein (Sec signal, PGF-CTERM, C-terminal transmembrane helix) | 100% | Halar_2424 | DL31 | 1.5
864 | alpha/beta hydrolase fold containing protein | 100% | Halar_3618 | DL31 | 1.5
865 | hypothetical protein (N-terminal transmembrane helix) | 100% | halTADL_0271 | tADL | 1.5
866 | alpha/beta hydrolase fold containing protein | 100% | halTADL_0486 | tADL | 1.5
867 | hypothetical protein | 100% | halTADL_0830 | tADL | 1.5
868 | formate/nitrite transporter | 100% | halTADL_2501 | tADL | 1.5
869 | hypothetical protein | ambiguous | halTADL_3006 | tADL | 1.5
870 | carbohydrate ABC transporter solute-binding protein | 72% | Hlac_2862 | Hrr. lacusprofundi | 1.5
871 | methyltransferase domain / DNA methylase | 100% | Hlac_3189 | Hrr. lacusprofundi | 1.5
872 | transcriptional regulator, LysR family | 46% | ud | Klebsiella pneumoniae | 1.5
873 | iron or corrinoid ABC transporter solute-binding protein | 51% | ud | Natronorubrum sulfidifaciens | 1.5
874 | ribosomal protein S5 | 86% | ud | Hahella chejuensis | 1.5
875 | hypothetical protein | 54% | [122] | Halovirus HCTV-5 | 1.5
876 | hypothetical protein | 47% | [121] | Halovirus HCTV-1 | 1.5
877 | hypothetical protein | 51% | [121] | Halovirus HCTV-1 | 1.5
878 | hypothetical protein | 100% | Halar_1167 | DL31 | 1.5
879 | oligopeptide/dipeptide ABC transporter solute-binding protein | 100% | Halar_2024 | DL31 | 1.5
880 | FAD-dependent pyridine nucleotide-disulfide oxidoreductase | ambiguous | halTADL_0461 | tADL | 1.5
881 | nucleoside ABC transporter solute-binding protein | ambiguous | halTADL_2623 | tADL | 1.5

882 | cell division protein FtsZ | 100% | halTADL_3056 | tADL | 1.5
883 | phage shock protein A, PspA | 87% | ud | Natrinema versiforme | 1.5
884 | levansucrase (glycoside hydrolase, family 68) | 74% | ud | Natronococcus jeotgali | 1.5
885 | ATP synthase, B subunit (AtpF) (EC 3.6.3.14), chloroplastic | 95% | ud | Dunaliella | 1.5
886 | UspA domain-containing protein | 73% | C487_11961 | Natrinema pallidum | 1.5
887 | hypothetical protein | 85% | halTADL_0395 | tADL | 1.5
888 | tryptophan synthase, alpha subunit (TrpA) (EC 4.2.1.20) | ambiguous | halTADL_0576 | tADL | 1.5
889 | hypothetical protein (transmembrane helix near N-terminal) | 100% | halTADL_0290 | tADL | 1.4
890 | 3-isopropylmalate/(R)-2-methylmalate dehydratase, large subunit (LeuC) (EC 4.2.1.33/4.2.1.35) | ambiguous | halTADL_0364 | tADL | 1.4
891 | polar amino acid ABC transporter solute-binding protein | 100% | Hlac_1804 | Hrr. lacusprofundi | 1.4
892 | hypothetical protein | 48% | ud | halovirus | 1.4
893 | fumarate hydratase (FumC) (EC 4.2.1.2) | 100% | Halar_2608 | DL31 | 1.4
894 | 3-isopropylmalate/(R)-2-methylmalate dehydratase, small subunit (LeuD) (EC 4.2.1.33/4.2.1.35) | ambiguous | halTADL_0365 | tADL | 1.4
895 | halocyanin | 100% | halTADL_2996 | tADL | 1.4
896 | hypothetical protein (TAT signal) | 100% | halTADL_0488 | tADL | 1.4
897 | transcriptional regulator, XRE family | ambiguous | halTADL_2729 | tADL | 1.4
898 | nascent polypeptide-associated complex protein (Nac) | 100% | Hlac_1390 | Hrr. lacusprofundi | 1.4
899 | Tubby C-terminal-like domain | 82% | ud | Natrialba taiwanensis | 1.3
900 | orotate phosphoribosyltransferase (PyrE) (EC 2.4.2.10) | 100% | halTADL_1735 | tADL | 1.3
901 | translation initiation factor 6 (aIF-6) (Eif6) | 100% | halTADL_2195 | tADL | 1.3
902 | ATP/cobalamin adenosyltransferase (cob(I)yrinic acid a,c-diamide adenosyltransferase) (PduO/EutT) (EC 2.5.1.17) | ambiguous | halTADL_2490 | tADL | 1.3
903 | core histone | 89% | halTADL_1708 | tADL | 1.3
904 | ThiJ/PfpI domain-containing protein | 100% | halTADL_1681 | tADL | 1.3

905 | aspartate kinase (LysC) (EC 2.7.2.4) | 90% | halTADL_1916 | tADL | 1.3
906 | hypothetical protein (4 x transmembrane domains); DUF21 + CBS + CorC_HlyC (transporter associated) domains | 100% | halTADL_2590 | tADL | 1.3
907 | hypothetical protein | 100% | halTADL_3036 | tADL | 1.3
908 | phosphate uptake regulator, PhoU | 93% | halTADL_3204 | tADL | 1.3
909 | winged helix-turn-helix DNA-binding domain | 62% | halTADL_0044 | tADL | 1.2
910 | UspA domain-containing protein | 75% | halTADL_2276 | tADL | 1.2
911 | UvrD/REP helicase | 100% | halTADL_2299 | tADL | 1.2
912 | rhodanese domain | 100% | Hlac_1687 | Hrr. lacusprofundi | 1.2
913 | CRISPR-associated protein Cas7 (= Csh2) (subtype I-B) | 69% | ud | Haloquadratum walsbyi | 1.2
914 | DNA polymerase, family X (DNA-directed DNA-polymerase) | ambiguous | halTADL_1656 | tADL | 1.2
915 | hypothetical protein (DUF336) | 82% | ud | Halomonas | 1.2
916 | electron transfer flavoprotein, alpha subunit (EtfA) | 100% | Halar_3024 | DL31 | 1.2
917 | CopG/Arc/MetJ DNA-binding domain and a metal-binding domain; predicted transcriptional regulator | 100% | halTADL_3445 | tADL | 1.2
918 | CRISPR-associated protein Cas10d (=Csc3) (subtype I-D) | 100% | Hlac_3573 | Hrr. lacusprofundi | 1.2
919 | uncharacterized transporter (export?), ATPase component | 100% | Halar_1798 | DL31 | 1.2
920 | ribosomal protein L5 | 100% | Halar_2468 | DL31 | 1.2
921 | transcriptional regulator, AsnC family | 93% | halTADL_0058 | tADL | 1.2
922 | ribosomal protein S28e | 95% | halTADL_0167 | tADL | 1.2
923 | cysteinyl-tRNA synthetase (CysS) (EC 6.1.1.16) | 100% | halTADL_0865 | tADL | 1.2
924 | KaiC domain | 94% | halTADL_1815 | tADL | 1.2
925 | chemotaxis signal transduction protein CheW | 91% | halTADL_1838 | tADL | 1.2
926 | transcriptional regulator, XRE family | 70% | halTADL_2533 | tADL | 1.2
927 | VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 95% | halTADL_2740 | tADL | 1.2
928 | ribosomal protein S3Ae | 100% | Hlac_0618 | Hrr. lacusprofundi | 1.2

929 | selT/selW/selH selenoprotein domain / Rdx domain | 100% | Halar_1175 | DL31 | 1.1
930 | DNA-directed RNA polymerase subunit N (RpoN) (EC 2.7.7.6) | 100% | halTADL_2778 | tADL | 1.1
931 | hypothetical protein (DUF4168; Sec signal) | 46% | ud | Marinobacter lipolyticus | 1.0
932 | transcriptional regulator, TrmB | 100% | Halar_1790 | DL31 | 1.0
933 | hypothetical protein | 100% | Hlac_0081 | Hrr. lacusprofundi | 1.0
934 | DNA-directed RNA polymerase subunit L (RpoL) (EC 2.7.7.6) | 100% | Hlac_0620 | Hrr. lacusprofundi | 1.0
935 | hypothetical protein (DUF2150) | 100% | Hlac_1933 | Hrr. lacusprofundi | 1.0
936 | hydroxymethylglutaryl-CoA reductase (NADPH) (HmgA) (EC 1.1.1.34) | 100% | halTADL_0487 | tADL | 1.0
937 | inorganic pyrophosphatase (Ppa) (EC 3.6.1.1) | 88% | halTADL_1644 | tADL | 1.0
938 | endoribonuclease L-PSP | 100% | Hlac_0419 | Hrr. lacusprofundi | 1.0
939 | helix-hairpin-helix domain | ambiguous | Halar_1318 | DL31 | 1.0
940 | citrate synthase (GltA) (EC 2.3.3.1) | 100% | halTADL_0686 | tADL | 1.0
941 | PIN domain | 100% | Hlac_3585 | Hrr. lacusprofundi | 1.0
942 | invasin/intimin cell-adhesion domain | 38% | ud | Thioalkalivibrio nitratireducens | 0.9
943 | chaperonin GroEL | 83% | ud | Marinobacter lipolyticus | 0.9
944 | nitrogen regulatory protein P-II | 96% | ud | Marinobacter | 0.9
945 | hypothetical protein (DUF4168) | 44% | ud | Marinobacter nanhaiticus | 0.9
946 | polysaccharide deacetylase | 82% | ud | Natronococcus jeotgali | 0.9
947 | ferritin Dps family protein | 92% | ud | Natrinema versiforme | 0.9
948 | fructose-1,6-bisphosphate aldolase, class I (FbaB) (EC 4.1.2.13) | ambiguous | Halar_0751 | DL31 | 0.9
949 | RecJ-like endonuclease | 100% | Halar_1542 | DL31 | 0.9
950 | ribosomal protein S2 | 100% | Halar_1548 | DL31 | 0.9
951 | ribosomal protein L15 | 100% | Halar_2459 | DL31 | 0.9
952 | ribosomal protein S4e | 100% | Halar_2469 | DL31 | 0.9
953 | ribosomal protein S24e | 100% | Halar_2723 | DL31 | 0.9
954 | ATP synthase, H subunit (AtpH) (EC 3.6.3.14) | 100% | Halar_3025 | DL31 | 0.9
955 | nucleic acid-binding/OB-fold/TRAM domain | 100% | Halar_3244 | DL31 | 0.9

956 | DNA repair and recombination protein RadA | 100% | Halar_3361 | DL31 | 0.9
957 | RND superfamily / MMPL (mycobacterial membrane protein large) family protein | 100% | halTADL_0082 | tADL | 0.9
958 | SecD/SecF/SecDF export membrane protein | 51% | halTADL_0787 | tADL | 0.9
959 | cytochrome c oxidase, subunit II (CoxB) | 82% | halTADL_1060 | tADL | 0.9
960 | nucleic acid binding OB-fold tRNA/helicase-type (RPA32 homolog) | 100% | halTADL_2569 | tADL | 0.9
961 | thiamine ABC transporter, solute-binding protein (ThiB) | 100% | halTADL_2794 | tADL | 0.9
962 | fibronectin-binding A domain + DUF814 protein | ambiguous | halTADL_3222 | tADL | 0.9
963 | thiolase / acetyl-CoA acetyltransferase | 95% | Htur_0386 | Haloterrigena turkmenica | 0.9
964 | Hsp20-type chaperone | 100% | Halar_3161 | DL31 | 0.9
965 | ribonucleoside-diphosphate reductase, alpha subunit (NrdE) (EC 1.17.4.1) | 91% | halTADL_0884 | tADL | 0.9
966 | TATA-box-binding protein (Tbp) | 96% | halTADL_1732 | tADL | 0.9
967 | PUA domain containing protein | 100% | Hlac_0035 | Hrr. lacusprofundi | 0.9
968 | thioredoxin | 56% | ud | Haloferax sp. | 0.9
969 | hypothetical protein (DUF336) | 100% | halTADL_2664 | tADL | 0.9
970 | UspA domain-containing protein | 85% | ud | Natrinema altunense | 0.9
971 | FAD-dependent oxidoreductase/dehydrogenase | 83% | ud | Natronomonas / Halonotius | 0.9
972 | D-isomer specific 2-hydroxyacid dehydrogenase NAD-binding | ambiguous | halTADL_0315/halTADL_0088 | tADL | 0.9
973 | prefoldin, alpha subunit (PfdA) | 65% | C479_06996 | Halovivax asiaticus | 0.9
974 | ferritin Dps family protein | 100% | Halar_0845 | DL31 | 0.9
975 | SBDS ribosome maturation protein SDO1 | 100% | Halar_2575 | DL31 | 0.9
976 | adhesion pilin (PilA) | 100% | Halar_3709 | DL31 | 0.9
977 | D-isomer specific 2-hydroxyacid dehydrogenase NAD-binding | 100% | halTADL_0088 | tADL | 0.9
978 | GTPase (probable translation factor) | 100% | halTADL_0258 | tADL | 0.9
979 | selT/selW/selH selenoprotein domain / Rdx domain | 81% | halTADL_1063 | tADL | 0.9
980 | TATA-box-binding protein (Tbp) | 93% | halTADL_1450 | tADL | 0.9

981 | winged helix-turn-helix DNA-binding domain + riboflavin kinase domain | 100% | halTADL_1959 | tADL | 0.9
982 | dihydroxyacetone (DHA) kinase, L subunit (DhaL) (EC 2.7.1.29) | 92% | halTADL_2259 | tADL | 0.9
983 | hypothetical protein (Sec signal) | ambiguous | halTADL_2297 | tADL | 0.9
984 | shikimate kinase (AroB) (EC 2.7.1.71) | 100% | halTADL_2582 | tADL | 0.9
985 | adenine phosphoribosyltransferase (Apt) (EC 2.4.2.7) | 95% | halTADL_2952 | tADL | 0.9
986 | ribosomal LX protein | 100% | Hlac_0823 | Hrr. lacusprofundi | 0.9
987 | ribosomal protein S13 | 100% | Hlac_1816 | Hrr. lacusprofundi | 0.9
988 | ferritin Dps family protein | 86% | ud | Natrinema pallidum | 0.8
989 | hypothetical protein | 46% | ud | Halorhabdus tiamatea | 0.8
990 | peptidylprolyl isomerase, cyclophilin type | 100% | Halar_1967 | DL31 | 0.8
991 | ribosomal protein S8 | 100% | Halar_2466 | DL31 | 0.8
992 | phytoene synthase (CrtB) (EC 2.5.1.32) | 100% | halTADL_0465 | tADL | 0.8
993 | short-chain dehydrogenase/reductase (SDR): glucose/ribitol dehydrogenase domain | 100% | halTADL_0693 | tADL | 0.8
994 | hypothetical protein (2 x transmembrane helices) | 100% | halTADL_0953 | tADL | 0.8
995 | 2-oxoglutarate:ferredoxin oxidoreductase, beta subunit (KorB) (EC 1.2.7.3) | 100% | halTADL_1012 | tADL | 0.8
996 | glycyl-tRNA synthetase (GlyS) (EC 6.1.1.14) | ambiguous | halTADL_1866 | tADL | 0.8
997 | glycine-rich hypothetical protein | 100% | halTADL_2802 | tADL | 0.8
998 | ribosomal protein L10e | 100% | halTADL_2806 | tADL | 0.8
999 | fibrillarin-like rRNA/tRNA 2'-O-methyltransferase | 100% | halTADL_2986 | tADL | 0.8
1000 | aspartate carbamoyltransferase catalytic subunit (PyrB) (EC 2.1.3.2) | 100% | halTADL_3037 | tADL | 0.8
1001 | ribosomal protein L3 | 93% | halTADL_3387 | tADL | 0.8
1002 | TonB-dependent receptor | 44% | Hhal_0508 | Halorhodospira halophila | 0.8
1003 | OsmC family protein | 100% | Hlac_1348 | Hrr. lacusprofundi | 0.8
1004 | hypothetical protein (Sec signal, C-terminal transmembrane helix) | 100% | Hlac_2682 | Hrr. lacusprofundi | 0.8

1005 | translation elongation factor EF-1, alpha subunit (Tuf) | 85% | ud | Candidatus Nanosalina sp. J07AB43 | 0.6
1006 | polysaccharide biosynthesis protein (multiple transmembrane domains) | 84% | ud | Halorubrum aidingense | 0.6
1007 | CheY-like receiver domain | 100% | Halar_1479 | DL31 | 0.6
1008 | signal transduction protein with CBS domains | 100% | Halar_2982 | DL31 | 0.6
1009 | CopG/Arc/MetJ DNA-binding domain and a metal-binding domain; predicted transcriptional regulator | ambiguous | halTADL_0355 | tADL | 0.6
1010 | hypothetical protein (transmembrane helix near N-terminal) | 100% | halTADL_0890 | tADL | 0.6
1011 | hypothetical protein | 100% | halTADL_1662 | tADL | 0.6
1012 | ribose 5-phosphate isomerase A (RpiA) (EC 5.3.1.6) | 100% | halTADL_1707 | tADL | 0.6
1013 | glutamate dehydrogenase (GdhA) (EC 1.4.1.3/1.4.1.4) | 81% | halTADL_1757 | tADL | 0.6
1014 | glutaredoxin | 94% | halTADL_2104 | tADL | 0.6
1015 | 5-(carboxyamino)imidazole ribonucleotide mutase (PurE) (EC 5.4.99.18) | ambiguous | halTADL_3088 | tADL | 0.6
1016 | 2-dehydro-3-deoxyphosphogluconate (KDPG) aldolase (Eda) (EC 4.1.2.14) | 100% | Hlac_0151 | Hrr. lacusprofundi | 0.6
1017 | histidinol-phosphate aminotransferase (HisC) (EC 2.6.1.9) | 100% | Hlac_0235 | Hrr. lacusprofundi | 0.6
1018 | succinyl-CoA synthetase, beta subunit (SucC) (EC 6.2.1.5) | 100% | Hlac_2207 | Hrr. lacusprofundi | 0.6
1019 | adenine-specific DNA methylase | 79% | ud | Haloferax sulfurifontis | 0.6
1020 | hypothetical protein (DUF1269; transmembrane helix) | 64% | ud | Natronococcus jeotgali | 0.6
1021 | hypothetical protein | 30% | ud | Dorea longicatena | 0.6
1022 | hypothetical protein | 46% | ud | Halorhabdus tiamatea | 0.6
1023 | VapC ribonuclease / PIN domain | 86% | ud | Haloferax elongans | 0.6
1024 | hypothetical protein (P-loop containing nucleoside triphosphate hydrolase) | ambiguous | C450_05395 | Halococcus salifodinae | 0.6
1025 | ribosomal protein S17e | 100% | Halar_1761 | DL31 | 0.6
1026 | 3-isopropylmalate dehydrogenase (LeuB) (EC 1.1.1.85) | 100% | Halar_2164 | DL31 | 0.6
1027 | blue (type 1) copper domain | 100% | Halar_2448 | DL31 | 0.6

1028 | hypothetical protein | 100% | Halar_2756 | DL31 | 0.6
1029 | branched-chain amino acid ABC transporter solute-binding protein | 100% | HalDL1_1115 | DL1 | 0.6
1030 | RimK domain + Zn protease (ATP-dependent) domain | 100% | halTADL_0438 | tADL | 0.6
1031 | methylated-DNA-[protein]-cysteine S-methyltransferase (Ogt) (EC 2.1.1.63) | 82% | halTADL_0579 | tADL | 0.6
1032 | peptidase S16, Lon-like protease | 100% | halTADL_0582 | tADL | 0.6
1033 | transcriptional regulator, XRE family | 100% | halTADL_1102 | tADL | 0.6
1034 | hypothetical protein (transmembrane helix) | 100% | halTADL_1123 | tADL | 0.6
1035 | TATA-box-binding protein (Tbp) | 100% | halTADL_1490 | tADL | 0.6
1036 | response regulator receiver protein | 100% | halTADL_1816 | tADL | 0.6
1037 | arginyl-tRNA synthetase (ArgS) (EC 6.1.1.19) | ambiguous | halTADL_1958 | tADL | 0.6
1038 | 2-dehydro-3-deoxy-D-gluconate (KDG) kinase (ribokinase family) (KdgK) (EC 2.7.1.45) | 100% | halTADL_2089 | tADL | 0.6
1039 | ribosomal protein L31e | 100% | halTADL_2194 | tADL | 0.6
1040 | NADPH-dependent F420 reductase | 100% | halTADL_2320 | tADL | 0.6
1041 | carbohydrate kinase, FGGY (possible xylulokinase [XylB] [EC 2.7.1.17]) | 100% | halTADL_2660 | tADL | 0.6
1042 | hypothetical protein (DUF555) | ambiguous | halTADL_2698 | tADL | 0.6
1043 | alcohol dehydrogenase GroES domain / Zn-binding alcohol dehydrogenase | ambiguous | halTADL_2784 | tADL | 0.6
1044 | transcriptional regulator, AsnC family | 100% | halTADL_2829 | tADL | 0.6
1045 | thioesterase (acyl-CoA thioester hydrolase?) | 100% | halTADL_2937 | tADL | 0.6
1046 | DNA topoisomerase VI, subunit B (Top6B) (EC 5.99.1.3) | 100% | halTADL_3021 | tADL | 0.6
1047 | aspartate carbamoyltransferase regulatory subunit (PyrI) | 100% | halTADL_3038 | tADL | 0.6
1048 | K+ uptake system, TrkA subunit | 89% | halTADL_3061 | tADL | 0.6
1049 | anthranilate phosphoribosyltransferase (TrpD) (EC 2.4.2.18) | 92% | halTADL_3066 | tADL | 0.6
1050 | NADH-quinone oxidoreductase subunit C/D (NuoCD) (EC 1.6.5.3) | 100% | halTADL_3093 | tADL | 0.6

1051 | hypothetical protein | 100% | halTADL_3119 | tADL | 0.6
1052 | hypothetical protein | 100% | halTADL_3251 | tADL | 0.6
1053 | NUDIX hydrolase (NUDIX = NUcleoside DIphosphate linked to some other moiety X) | 100% | halTADL_3253 | tADL | 0.6
1054 | proteasome beta subunit (PsmB) (EC 3.4.25.1) | 100% | Hlac_0608 | Hrr. lacusprofundi | 0.6
1055 | hypothetical protein | 100% | Hlac_2035 | Hrr. lacusprofundi | 0.6
1056 | ribosomal protein S19e | 100% | Hlac_2312 | Hrr. lacusprofundi | 0.6
1057 | ribosomal protein L22 | 100% | Hlac_2444 | Hrr. lacusprofundi | 0.6
1058 | hypothetical protein | 100% | halTADL_0069 | tADL | 0.5
1059 | hydroxymethylbilane synthase (HemC) (EC 2.5.1.61) | 100% | halTADL_1820 | tADL | 0.5
1060 | translation initiation factor 2, beta subunit (a/eIF2-beta) (Eif2b) | 100% | halTADL_2337 | tADL | 0.5
1061 | hypothetical protein | 45% | ud | Salmonella | 0.4
1062 | ribosomal protein L29P | 97% | ud | Natrinema versiforme | 0.4
1063 | hypothetical protein | 65% | ud | Haloarcula japonica | 0.4
1064 | catalase-peroxidase (KatG) (EC 1.11.1.21) | 87% | ud | Halorubrum sp. AJ67 | 0.4
1065 | TATA-box-binding protein (Tbp) | 100% | ud | Natronorubrum tibetense | 0.4
1066 | hypothetical protein (Sec signal) | ambiguous | ud | Haloferax gibbonsii | 0.4
1067 | hypothetical protein | 64% | ud | Halosarcina pallida | 0.4
1068 | 2-oxoacid dehydrogenase complex, dihydrolipoamide dehydrogenase | 80% | ud | Halonotius sp | 0.4
1069 | oligopeptide/dipeptide ABC transporter solute-binding protein | 34% | ud | Haloferax volcanii | 0.4
1070 | winged helix-turn-helix DNA-binding domain + nucleotidyl transferase domain | 100% | Hlac_3118/HalDL1_3302 | DL1/Hrr. lacusprofundi | 0.4
1071 | deoxycytidine triphosphate deaminase [Dcd] (EC 3.5.4.13) | ambiguous | ambiguous | ambiguous | 0.4
1072 | PUA-domain-containing protein | 100% | Halar_0909 | DL31 | 0.4
1073 | translation initiation factor 2, alpha subunit (a/eIF2-alpha) (Eif2a) | 100% | Halar_1299 | DL31 | 0.4
1074 | glycine-rich hypothetical protein | 100% | Halar_1460 | DL31 | 0.4

1075 | double-stranded DNA-binding domain | 100% | Halar_2691 | DL31 | 0.4
1076 | DNA-directed RNA polymerase subunit E' (RpoE1) (EC 2.7.7.6) | 100% | Halar_2727 | DL31 | 0.4
1077 | UPF0278 family protein; possibly nucleic acid binding | 100% | Halar_2773 | DL31 | 0.4
1078 | cupin 2 conserved barrel domain | 100% | Halar_2869 | DL31 | 0.4
1079 | endoribonuclease L-PSP | 100% | Halar_3193 | DL31 | 0.4
1080 | halolysin (peptidase S8 and S53 subtilisin kexin sedolisin) | 100% | Halar_3678 | DL31 | 0.4
1081 | ribosomal protein S2 | ambiguous | HalDL1_2024 | DL1 | 0.4
1082 | FAD-dependent pyridine nucleotide-disulfide oxidoreductase | 100% | halTADL_0017 | tADL | 0.4
1083 | glutamyl-tRNA(Gln) amidotransferase subunit E | ambiguous | halTADL_0163 | tADL | 0.4
1084 | 3-isopropylmalate dehydrogenase (LeuB) (EC 1.1.1.85) | 93% | halTADL_0366 | tADL | 0.4
1085 | transcriptional regulator NikR, CopG family | ambiguous | halTADL_0376 | tADL | 0.4
1086 | aspartate aminotransferase (AspB) (EC 2.6.1.1) | 100% | halTADL_0403 | tADL | 0.4
1087 | thiamine-phosphate pyrophosphorylase ThiE (EC 2.5.1.3) | 74% | halTADL_0473 | tADL | 0.4
1088 | predicted RNA-binding protein containing KH domain, possibly ribosomal protein | 81% | halTADL_0545 | tADL | 0.4
1089 | urease, beta subunit (UreB) (EC 3.5.1.5) | ambiguous | halTADL_0634 | tADL | 0.4
1090 | translation initiation factor 2, alpha subunit (a/eIF2-alpha) (Eif2a) | 91% | halTADL_0923 | tADL | 0.4
1091 | aspartyl-tRNA(Asn)/glutamyl-tRNA(Gln) amidotransferase subunit B (GatB) (EC 6.3.5.6 / EC 6.3.5.7) | ambiguous | halTADL_1078 | tADL | 0.4
1092 | hypothetical protein | 100% | halTADL_1167 | tADL | 0.4
1093 | methyl-accepting chemotaxis sensory transducer with Pas/Pac sensor | 100% | halTADL_1218 | tADL | 0.4
1094 | NifU (FeS cluster assembly) domain | ambiguous | halTADL_1844 | tADL | 0.4
1095 | CobQ/CobB/MinD/ParA nucleotide binding domain | ambiguous | halTADL_1956 | tADL | 0.4
1096 | NADPH-dependent F420 reductase | 91% | halTADL_2320 | tADL | 0.4

1097 | NUDIX hydrolase (NUDIX = NUcleoside DIphosphate linked to some other moiety X) | 100% | halTADL_2550 | tADL | 0.4
1098 | hypothetical protein (Sec signal, DUF839, C-terminal transmembrane helix) | 100% | halTADL_2727 | tADL | 0.4
1099 | carbohydrate ABC transporter ATP-binding protein | 88% | halTADL_2764 | tADL | 0.4
1100 | hypothetical protein | 79% | halTADL_3036 | tADL | 0.4
1101 | hypothetical protein (DUF35 OB-fold domain - DNA- or RNA-binding) | 100% | halTADL_3135 | tADL | 0.4
1102 | AAA ATPase central domain | ambiguous | halTADL_3346 | tADL | 0.4
1103 | peptidylprolyl isomerase, cyclophilin type | 100% | Hlac_1668 | Hrr. lacusprofundi | 0.4
1104 | hypothetical protein | 100% | Hlac_1709 | Hrr. lacusprofundi | 0.4
1105 | ribosomal protein S4 | 100% | Hlac_1817 | Hrr. lacusprofundi | 0.4
1106 | VCP-like protein (2 x CDC48 domains + 2 x AAA family ATPase domains) | 67% | Hlac_2377 | Hrr. lacusprofundi | 0.4
1107 | DNA repair and recombination protein RadA | 100% | Hlac_2624 | Hrr. lacusprofundi | 0.4
1108 | adhesion pilin (PilA) | 100% | Hlac_3311 | Hrr. lacusprofundi | 0.4
1109 | hypothetical protein | 26% | ud | Natronomonas moolapensis | 0.4
