
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1615

Culture-independent methods and high-throughput sequencing applied to evolutionary microbial genomics

ANDERS E. LIND

ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-513-0196-9 UPPSALA urn:nbn:se:uu:diva-336794 2018 Dissertation presented at Uppsala University to be publicly examined in B22, BMC, Husargatan 3, Uppsala, Friday, 16 February 2018 at 13:00 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Purificación López-García (Paris-Sud University, Paris, France).

Abstract Lind, A. E. 2018. Culture-independent methods and high-throughput sequencing applied to evolutionary microbial genomics. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1615. 68 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-0196-9.

Culture-independent methods allow us to better understand the diversity of microorganisms on earth. By omitting the culturing step, we can gain access to species previously out of our reach. To better capture and understand the diversity of microorganisms, we have developed an alternative to 16S amplicon sequencing. This method exploits the fact that the rRNA operon in most prokaryotes contains both the 16S and the 23S genes in a determined order, close to each other. Combining this with PacBio long molecule sequencing, we can now generate amplicons containing the 16S-ITS-23S operon, allowing for a stronger phylogenetic signal. Using two culture-independent methods, single-cell genomics and metagenomics, we have been able to fully sequence the genomes of two archaeal endosymbionts from the genera Methanobrevibacter and Methanocorpusculum. These methanogenic archaea were isolated from the ciliates Nyctotherus ovalis and Metopus contortus. The genomic data show evidence of genome degradation, mainly through pseudogenization of genes. These genomes represent the first genomic data from archaeal endosymbionts. The endosymbiotic archaea we have sequenced live in close proximity to hydrogen-producing mitochondria, or hydrogenosomes. In ciliates it has previously been shown that the hydrogenosome of N. ovalis has retained a genome, something many other ciliate hydrogenosomes have not. Using genomic and transcriptomic data from seven anaerobic ciliates, we have been able to show that these ciliate hydrogenosomes have independently evolved from ancestral mitochondria, and show various degrees of genome reduction. Furthermore, we have been able to shed some new light on haloarchaeal evolution by reconstructing the genomes of five species belonging to the Marine Group IV archaea. These archaea have previously been found to be closely related to the Haloarchaea, however they are not . All five genomes were obtained using metagenomic binning of the publicly available TARA Oceans dataset. The genomes are all of high quality and completeness, and are used in tree-aware ancestral reconstruction, in order to try to better understand the evolutionary transition from an anaerobic to a . These studies all show how culture-independent methods are powerful tools in gathering genomic information about microbes we are currently unable to culture in the lab.

Keywords: Single cell genomics, metagenomics, amplicon, archaea, haloarchaea, endosymbiosis, ciliate, hydrogenosome, tree reconciliation

Anders E. Lind, Department of Cell and Molecular Biology, Box 596, Uppsala University, SE-75124 Uppsala, Sweden.

© Anders E. Lind 2018

ISSN 1651-6214 ISBN 978-91-513-0196-9 urn:nbn:se:uu:diva-336794 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-336794)

For the curious

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Martijn, J*., Lind, AE*., Spiertz, I., Juzokaite, L., Bunikis, I., Pettersson, OV., Ettema, TJG. (2018) Amplicon sequencing of the 16S-ITS-23S rRNA operon with long-read technology for improved phylogenetic classification of uncultured prokaryotes. bioRxiv, doi:10.1101/234690

II Lind, AE., Lewis, WH., Spang, A., Guy, L., Embley, TM., Ettema, TJG. (2018) Genomic insights into two independent events of archaeal endosymbiosis. Manuscript

III Lewis, WH*., Lind, AE*., Sendra, KM., Gustafson, HO., Williams, TA., Esteban, GF., Hirt, RP., Ettema, TJG., Embley, TM. (2018) The roles of exaptation, gene loss and gene transfer in the evolution of ciliate hydrogenosomes. Manuscript

IV Lind, AE*., Martijn, J*., Schön, ME., Vosseberg, J., Williams, T., Spang, A., Ettema, TJG. (2018) First genomes of Marine Group IV archaea enlighten the evolutionary origins of Haloarchaea. Manuscript

* Equal contribution

Papers by the author not included in this thesis

Baker, BJ., Saw, JH., Lind, AE., Lazar, CS., Hinrichs, K., Teske, AP., Ettema, TJG. (2016) Genomic inference of the metabolism of cosmopolitan subsurface Archaea, Hadesarchaea. Nature Microbiology, 1:16002

Spang, A., Saw, JH., Jørgensen, SL., Zaremba-Niedzwiedzka, K., Martijn, J., Lind, AE., van Eijk, R., Schleper, C., Guy, L., Ettema, TJG. (2015) Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature, 521(7551):173-179

Schneider, DI., Klasson, L., Lind, AE., Miller, WJ. (2014) More than fishing in the dark: PCR of a dispersed sequence produces simple but ultrasensitive Wolbachia detection. BMC Microbiology, 14(1):121

Spang, A., Martijn, J., Saw, JH., Lind, AE., Guy, L., Ettema, TJG. (2013) Close encounters of the Third Domain: the emerging genomic view of archaeal diversity and evolution. Archaea, 2013

Bernander, R., Lind, AE., Ettema, TJG. (2011) Archaeal origin of the actin cytoskeleton: implications for eukaryogenesis. Communicative & integrative biology, 4(6):664-667

Contents

Introduction
The rise of DNA sequencing
Early history
The high throughput revolution
Putting the pieces together: genome assembly
From micro to pico
The future
Culture independence
Who is out there?
What are they doing?
Single cell genomics
Metagenomics
What we can learn
Problems with culture-independent methods
Genomics of symbionts
Symbiosis
Archaeal endosymbionts
Hydrogen-producing mitochondria
Evolution of the Haloarchaea
Results
Paper I – PacBio sequencing of rRNA amplicons
Paper II – Genome sequences of archaeal endosymbionts
Paper III – Evolution of hydrogenosomes in ciliates
Paper IV – Marine Group IV and the evolution of Haloarchaea
Perspectives
Svensk sammanfattning (Summary in Swedish)
Acknowledgements
References

Abbreviations

BES      2-bromoethanesulfonate
bp       Base pairs
CPR      Candidate Phyla Radiation
ddNTP    Di-deoxynucleotide triphosphate
ETC      Electron Transport Chain
FISH     Fluorescent in situ hybridization
HGP      Human Genome Project
HGT      Horizontal Gene Transfer
ITS      Internal transcribed spacer
Kb       Kilo base pairs (1,000 bp)
KTH      Kungliga Tekniska Högskolan (Royal Institute of Technology)
LB       Lysogeny Broth
MAG      Metagenome-assembled Genome
Mb       Megabase (1,000,000 bp)
MG-IV    Marine Group IV
PPi      Pyrophosphate
SAG      Single Amplified Genome
SBL      Sequence by Ligation
SBS      Sequence by Synthesis
OLC      Overlap-Layout-Consensus
WGA      Whole Genome Amplification

Introduction

In this thesis I hope to provide a thorough introduction to both the methods used and our current understanding of the questions addressed in the included papers. The topics of the papers are wide, but they all seek to improve our understanding of organisms previously not studied using genomic tools. The technologies that have made this work possible have all been developed over the last decade, and can serve as examples of the types of studies which we can now perform, but which were not feasible just a few years ago.

In this process I have, together with all my collaborators, generated sequence data from organisms for which there was previously no such information available. This represents just a small drop in the vast amount of sequences that are now being generated on a daily basis. Especially when it comes to prokaryotic genome sequencing, the ease with which such projects can be undertaken has never been greater.

From this massive wealth of sequence data, we have been able to gain a better understanding of the forces that have shaped the evolution of all manner of organisms. Culture-independent methods have been key in allowing us to look up from our agar plates and explore new frontiers. We are now able to study and understand life from all possible, and sometimes seemingly impossible, corners of the earth. And while perhaps not everything found to be true for E. coli must also be true for the elephant, as Jacques Monod once famously said, they are built with the same tools.

The rise of DNA sequencing

Early history

Sequencing, the determination of the exact sequence of the four bases of DNA, is the foundation of the methods this thesis relies upon. From the discovery of the DNA structure in 1953 by James Watson and Francis Crick (Watson et al., 1953), with important contributions by Rosalind Franklin and Maurice Wilkins (Maddox, 2003), up until today, there have been astonishing improvements in the technologies underlying sequencing.

In 1977 Frederick Sanger published both the genome sequence of the bacteriophage phiX174 (Sanger et al., 1977a) and, a short time later, his new and improved method for DNA sequencing (Sanger et al., 1977b). While this was not the first method ever developed for sequencing of nucleic acids (Holley et al., 1965, Fiers et al., 1972, Padmanabhan et al., 1974), Sanger’s method proved robust, as evidenced by its continued use today. The method he developed to determine a DNA sequence, called chain termination, is based on the process of DNA polymerization. In normal polymerization, a deoxynucleotide triphosphate (dNTP) is incorporated. However, if a modified form of dNTP (called a di-deoxynucleotide triphosphate, or ddNTP) is incorporated instead, further DNA polymerization becomes impossible, due to the ddNTP’s lack of a 3’ hydroxyl group.

In short, Sanger’s method is this: the DNA template of choice is incubated together with DNA polymerase, a primer, and a mix of dNTPs with a single type of radioactively labelled di-deoxynucleotide triphosphate (ddATP, ddGTP, ddTTP or ddCTP). The incorporation of ddNTPs is approximately 100-fold lower than that of dNTPs. The primer hybridizes with the template DNA and extension by the DNA polymerase occurs via the incorporation of dNTPs. Occasionally a ddNTP will be incorporated instead of its dNTP equivalent, and the elongation of the DNA strand stops. The mixture of different DNA fragments can then be separated based on length on an acrylamide gel (Figure 1). Doing so gives the position of the incorporated ddNTP. The reaction is then repeated using each different ddNTP, and by comparing all gels to each other, the complete sequence of the DNA can be determined (Sanger et al., 1977b).
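The logic of chain termination can be illustrated with a small simulation. The sketch below is a simplification and not a description of the original protocol: the function names, the number of molecules and the stopping probability (roughly 1 in 100, mirroring the lower incorporation of ddNTPs) are assumptions made for illustration, the template is written in the order the polymerase reads it, and strand directionality is ignored.

```python
import random

# Base incorporated opposite each template base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def run_lane(template, terminator, n_molecules=5000, p_ddntp=0.01, rng=None):
    """Simulate one reaction containing a single type of ddNTP.

    Returns the set of fragment lengths that would appear as bands in
    the corresponding lane of the gel.
    """
    rng = rng or random.Random(0)
    bands = set()
    for _ in range(n_molecules):
        for length, template_base in enumerate(template, start=1):
            incoming = COMPLEMENT[template_base]
            # Occasionally the ddNTP is used instead of the dNTP,
            # which stops elongation at this position.
            if incoming == terminator and rng.random() < p_ddntp:
                bands.add(length)
                break
    return bands


def read_gel(lanes, template_length):
    """Read the four lanes from the shortest fragment to the longest."""
    bases = []
    for length in range(1, template_length + 1):
        hits = [base for base, bands in lanes.items() if length in bands]
        # A length with no band (or more than one) cannot be called.
        bases.append(hits[0] if len(hits) == 1 else "N")
    return "".join(bases)


def sanger_sequence(template, seed=0):
    """Run the four reactions and read the synthesized strand off the gel."""
    rng = random.Random(seed)
    lanes = {base: run_lane(template, base, rng=rng) for base in "ACGT"}
    return read_gel(lanes, len(template))


print(sanger_sequence("TACGGTCAATGC"))
```

Each lane only contains fragments ending in its own terminator, so ordering all bands by length and noting which lane each band sits in recovers the synthesized strand, i.e. the complement of the template.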

Figure 1. Acrylamide gel showing the length of sequences following two primers, named A14 and A12d, for each of the ddNTPs used as terminator. The interpretation of the DNA sequence, done manually, can be seen annotated in the margins together with an indication of its position. Original figure from Sanger et al. (1977b).

While Sanger’s method for DNA sequencing remained the staple for many years, advancements in automation and parallelization took place rapidly. Several companies developed automatic DNA sequencers based on the Sanger method, and to this day these produce some of the highest quality-to-length reads of any DNA sequencers, with read accuracy above 99.9% (Liu et al., 2012). These types of machines were also instrumental in the Human Genome Project (HGP) initiated in 1988 (National Research Council, 1988). This ambitious project aimed to completely map the approximately 3 billion bases of the human genetic code. The project was declared completed in 2003, and the first high quality genome with 99% of the euchromatic sequences mapped was published in 2004 (International Human Genome Sequencing Consortium, 2004).

The HGP was not the only effort to sequence the human genome. There was competition from a private venture, Celera, led by Craig Venter. The Celera project had a fundamentally different attitude to who would be the beneficiary of the genome projects. Where the HGP had daily public releases of all new data generated, the Celera project explored avenues for how to monetize their results. Ultimately they wished to patent their findings, and license the use of the sequence data to anybody who could afford it. While the two projects were not able to find grounds for direct co-operation, they did agree upon trying not to interfere with each other, and this led to simultaneous publications of the initial results in 2001 (Lander et al., 2001, Venter et al., 2001).

In parallel with the HGP, several other genomes were completely sequenced. The first eukaryotic genome was that of the yeast Saccharomyces cerevisiae in 1996 (Goffeau et al., 1996), and was followed by model organisms such as Caenorhabditis elegans, Arabidopsis thaliana and Guillardia theta (The C. elegans Sequencing Consortium, 1998, Kaul et al., 2000, Douglas et al., 2001). The first cellular genomes sequenced were those of the bacterium Haemophilus influenzae and the archaeon Methanococcus jannaschii, in 1995 and 1996 respectively (Fleischmann et al., 1995, Bult et al., 1996).

The high throughput revolution

Apart from delineating human DNA, the HGP functioned as a technological catalyst. In the wake of the HGP, advances were made that allowed for development of various high-throughput sequencing technologies, which in turn allowed for massively larger amounts of sequence data to be generated.

One of the first of these high-throughput technologies was the pyrosequencing method (Nyrén et al., 1993), pioneered at KTH (Royal Institute of Technology), Stockholm. This method led to the development of the 454 sequencing platform (Margulies et al., 2005), which allowed for medium length sequence reads (up to 700 bp) at a much reduced cost compared to Sanger sequencing, due to the improvements in parallelization.

The principle behind 454 sequencing is as follows. A DNA template is fixed to a solid surface and a primer is hybridized to it. A single type of dNTP is added to the mixture and any incorporation of dNTP is accompanied by the release of pyrophosphate (PPi). This PPi is converted to ATP by ATP-sulfurylase, which in turn is used by luciferase to emit light. The emitted light can be measured with a luminometer (Ronaghi et al., 1996) and used to determine the sequence of bases following the primer. The reaction vessel is then washed and the procedure is repeated using the other types of dNTP to determine the following bases of the DNA template. This technique leads to a loss of signal when many bases of the same type are being incorporated in the same step. Stretches of 3 or more identical nucleotides are referred to as homopolymer regions, and their inaccurate sequencing makes for one of the main weaknesses of the 454 technology (Huse et al., 2007).

The next major disruption to the sequencing industry came with the introduction of the Solexa (later named Illumina) Genome Analyzer in 2006 and the ABI SOLiD sequencer in 2007. Where Sanger and 454 sequencing would result in reads between 300–1000 bp, these new methods would produce reads of only around 35 bp (Morozova et al., 2008). Their main advantage, however, comes in the form of massively increased output of a sequencing run (Morozova et al., 2008, Harismendy et al., 2009). A single experiment could now produce several gigabases of data over only a few days (Liu et al., 2012). As a consequence, new assembly methods had to be developed that could take this massive amount of short length data into account (these methods are further discussed in the next section).

Illumina sequencers, just as Sanger and 454 sequencing, use a variation on sequence by synthesis (SBS). A major difference in Illumina sequencing, as compared to previous methods, was the bridge amplification, which allowed for the creation of millions of clonal sequencing clusters attached to a sequencing flow-cell (Mitra et al., 1999, Adessi et al., 2000, Goodwin et al., 2016). These millions of clusters can then be sequenced by polymerization in several rounds. A primer sequence binds to each strand of DNA in the cluster, and a single fluorescent nucleotide is incorporated. Besides being fluorescent, this nucleotide will have its 3’ end blocked so no additional polymerization can occur. This blocking prevents multiple fluorophores being incorporated at once, solving the homopolymer issue. Due to Illumina holding the patent to this technique, 454 and similar technologies were unable to also use this method, which caused them to continue to be plagued by the homopolymer problem.
A laser then reads the colour of the fluorophores, which will have strong enough signals due to the many clonal reads of each cluster. The 3’ end is then unblocked and the process is repeated (Goodwin et al., 2016). While only approximately 35 bp could be sequenced in the early days of the Illumina method, today this technique can sequence more than 300 bp, rivalling the lengths reached by 454 sequencing.

While it produces similar read lengths, DNA sequencing using the SOLiD method does not rely on sequence by synthesis, but rather on sequence by ligation (SBL) (McKernan et al., 2009). The DNA template is fixed to a surface and an oligonucleotide anchor is hybridized to the template. A mixture of probes, each containing two known bases followed by six variable ones and a specific fluorophore, is added to the mix. The probe with two complementary bases is ligated onto the template and imaged in order to determine the identity of the two bases. The last base containing the fluorophore is then cleaved off the 3’ end, and a new probe is ligated on. After 10 rounds, where two bases out of every five are identified, the whole complementary strand is washed off the surface, a new n+1 oligonucleotide anchor is hybridized to the template, and the whole process is repeated until every base of the fragment can be identified (Goodwin et al., 2016).

These techniques, i.e. 454 pyrosequencing, Illumina and SOLiD sequencing methods, were the first of what are sometimes called the “Next Generation Sequencing methods”. This name is somewhat ridiculous since it means that any even newer methods would have to be called “Next-Next Generation sequencing methods,” and so on, in a clearly absurd and uninformative pattern. I will therefore refrain from using this term, and instead refer to them as the second-generation sequencing methods, where Sanger sequencing is the first generation method. In addition to the methods mentioned here, there were other methods developed with varying commercial success, such as IonTorrent and Complete Genomics (Heather et al., 2016). While some of these are still being used today, they will not be discussed in further detail here.

The exact definition of what constitutes a third generation sequencing method has been debated (Schadt et al., 2010, Niedringhaus et al., 2011, Liu et al., 2012), but many would argue that the third generation started with the development of single molecule sequencing. While the second generation sequencers relied on some sort of DNA amplification in order to obtain enough material to get a signal (fluorescent or otherwise), the third generation sequencers can detect nucleotide incorporation into a single piece of DNA (Braslavsky et al., 2003). Pioneered in the early 2000s, these methods were first commercialized by Helicos Biosciences, and later other companies developed their own takes on single molecule sequencing, with the Pacific Biosciences (PacBio) sequencers being the most prominent today (Goodwin et al., 2016).

Just like Illumina and 454, the PacBio sequencers work on the SBS principle.
A DNA polymerase is fixed on a surface, and a laser excites fluorescent nucleotides as they are being incorporated in real time. This allows for very long reads, over 20 kb in length, to be generated (Giordano et al., 2017). Where the second generation sequencing methods produced reads with very low error rates, less than 0.1% (Liu et al., 2012, Luo et al., 2012), the PacBio platform has an error rate closer to 15% (Weirather et al., 2017). A number of solutions have been implemented to bring these rates down, such as circularizing the DNA template and repeatedly looping it around the polymerase many times over (Rhoads et al., 2015), and correcting the reads using short read data (Au et al., 2012).

The second technology belonging to the third generation worth mentioning is that of Oxford Nanopore. In this method, instead of sequencing by synthesis, a protein pore sits in a membrane and a DNA molecule is passed through it. The fluctuation in current over the membrane is measured, and the nucleotides passing through can thus be determined (Branton et al., 2008, Venkatesan et al., 2011). This method allows for single molecule sequencing in real time, in the smallest available package on the market (an Oxford Nanopore MinION sequencer is about the size of a USB memory-stick). While the initial data produced using this method had a high error rate (over 35%; Goodwin et al., 2015, Laver et al., 2015), more recently the error rate has been shown to be around 15% due to improved sequencing strategies and chemistries (Weirather et al., 2017). The long reads produced, up to 900 kb, are still useful when it comes to resolving repeat structures and scaffolding of genomes, a feature it shares with PacBio sequencing (Utturkar et al., 2014, Karlsson et al., 2015).

There has also been development of the existing second generation, short-read technologies, which allows for creation of longer reads out of barcoded short reads (Kuleshov et al., 2014). Illumina calls this method TruSeq Synthetic Long-Reads, and it allows for the generation of reads up to 18 kb in length (McCoy et al., 2014).

Apart from increasing the amount of data obtainable from a single sequencing run, these technologies have also helped to radically drive down the cost of sequencing. By the time the HGP was declared completed, the cost to sequence a megabase of a DNA template (1 million base pairs) was approximately $2986. In just 5 years, this price had fallen to $102, and today that cost is around $0.012 per megabase (Wetterstrand, 2017). Of course, calculating such costs is not a straightforward process, since one has to take into account several factors apart from the pure reagent costs, such as the initial purchasing cost of the sequencing platform, labour, analysis etc. And given the amount of data generated by these new technologies, the ever-increasing cost of computational resources should not be underestimated (Sboner et al., 2011, Muir et al., 2016). However, despite these concerns it is safe to say that DNA sequencing, as a tool, is now available to a very broad range of scientists and labs.
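To put the per-megabase figures quoted above into perspective, the short calculation below expresses them as fold reductions and as the raw cost of producing 3 billion bases of sequence (the approximate size of the human genome). This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the calculation assumes one-fold coverage and ignores the instrument, labour and computational costs discussed in the text.

```python
# Cost per megabase in US dollars, as quoted in the text (Wetterstrand, 2017).
COST_PER_MB = {
    "HGP declared complete": 2986.0,
    "five years later": 102.0,
    "today": 0.012,
}

HUMAN_GENOME_BP = 3_000_000_000  # approximate genome size given in the text


def fold_reduction(old_cost, new_cost):
    """How many times cheaper new_cost is compared to old_cost."""
    return old_cost / new_cost


def raw_cost(n_bases, cost_per_mb):
    """Cost of producing n_bases of raw sequence (one-fold coverage only)."""
    return n_bases / 1_000_000 * cost_per_mb


start = COST_PER_MB["HGP declared complete"]
for label, cost in COST_PER_MB.items():
    print(f"{label}: {fold_reduction(start, cost):,.0f}-fold cheaper, "
          f"${raw_cost(HUMAN_GENOME_BP, cost):,.2f} per 3 Gb of raw sequence")
```

Real projects sequence each base many times over, so actual project costs are considerably higher than these raw figures; the point here is only the relative change.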

Putting the pieces together: genome assembly

Currently available sequencing technologies are unable to sequence the genetic material of an organism in its entirety. Even the smallest cellular organisms have genomes that are hundreds of kilobases long. While the newest technologies such as PacBio and Oxford Nanopore can produce very long sequence reads, up to 900 kb, we are as of yet often faced with millions of shorter DNA reads. A typical Illumina-based sequencing project generates reads that are 250 bp in length, which then require sequence assembly, i.e. piecing the reads together into contiguous sequences (contigs). The two major methods used by assemblers are overlap-layout-consensus (OLC) and De Bruijn graphs.

OLC assemblers, such as the Celera assembler and Newbler (Miller et al., 2010a), were commonly used with first generation sequencing methods (i.e. Sanger sequencing). A single contig is built by finding all of these long reads that overlap with each other, given a defined overlap length. Any redundantly inferred contiguous stretches are removed during the layout step. Lastly the sequence of the contigs is determined by the consensus of the reads used to build it (Li et al., 2012). This process is well-suited for long sequences with easily identified overlaps, but struggles when starting with many smaller pieces of DNA. This is due to the need for performing pair-wise alignments for every possible read-pair. The amount of computational power necessary for assembling millions of reads can quickly become impractical.

To assemble the shorter reads produced by Illumina and other high throughput methods more quickly, one can use De Bruijn graph assemblers, such as Velvet and SPAdes (Zerbino et al., 2008, Bankevich et al., 2012). Reads are subdivided into k-mers, and graphs are built from every possible k-mer pair overlapping by k-1 nucleotides, where k is the length of the k-mer (Idury et al., 1995, Zerbino et al., 2008). The contig is then inferred by finding the path connecting k-mers in the graph. Since this method does not involve constructing alignments between each sequence read, it is both faster and less memory intensive than OLC assemblers.

Regardless of the type of assembler used, certain genomic regions are difficult to assemble when using only short reads (< 300 bp). A 400 bp DNA element repeated several times would be impossible to completely resolve using 250 bp reads, since there is no way for an assembler to span this region. By combining the new long molecule sequencing methods, such as PacBio and Oxford Nanopore, with the deep sequence coverage offered by Illumina and other methods in hybrid assemblers, such problems can be overcome (Koren et al., 2012, Goodwin et al., 2015). This combination also has a synergistic benefit in that the short, high quality reads can be used to correct any sequencing errors present in the longer reads. Using such methods it is possible to completely assemble microbial chromosomes into single contigs, without the need for any manual finishing (Koren et al., 2015).
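The De Bruijn graph idea, and the way repeats break it, can be shown with a toy example. The sketch below is a deliberately minimal illustration and not how Velvet or SPAdes are implemented: the function names, the example reads and the choice of k = 4 are assumptions, and real assemblers additionally handle sequencing errors, reverse complements and coverage information.

```python
from collections import defaultdict


def kmers(read, k):
    """All k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map every k-mer to the k-mers that overlap its last k-1 bases."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: sorted(by_prefix.get(node[1:], ())) for node in nodes}


def start_nodes(graph):
    """k-mers that no other k-mer leads into."""
    has_predecessor = {s for successors in graph.values() for s in successors}
    return sorted(node for node in graph if node not in has_predecessor)


def extend_contig(graph, start):
    """Walk from start for as long as the next k-mer is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        node = graph[node][0]
        if node in seen:  # guard against walking in circles
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Four overlapping reads from a sequence without repeats.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA", "TGCAATC"]
graph = de_bruijn_graph(reads, k=4)
print(extend_contig(graph, start_nodes(graph)[0]))

# A sequence in which ATTA occurs twice: the walk stops at the branch.
repeat_graph = de_bruijn_graph(["GATTACATTAGG"], k=4)
print(repeat_graph["ATTA"], extend_contig(repeat_graph, "GATT"))
```

In the first case the four reads collapse into a single path and the original sequence is recovered. In the second case the repeated element is as long as the k-mer itself, so the node ATTA has two possible successors and the contig ends there, which is the small-scale equivalent of a 400 bp repeat that 250 bp reads cannot span.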

From micro to pico

The ever-decreasing price of sequencing is one of the major factors behind the rapid increase of available sequence data, but one should not overlook the importance of innovation in the fields of DNA library creation and whole genome amplification (WGA). These two techniques have allowed us to sequence the DNA of organisms from environments that were previously out of our reach. At the introduction of the first high throughput Illumina sequencer, the Genome Analyzer, the construction of the DNA libraries needed 1–4 µg of input DNA (Quail et al., 2008). This can be compared to today, when a sequencing library can be constructed from between 5 ng and 1 pg (Rinke et al., 2016, Ring et al., 2017).

In synergy with this decrease in the amount of required starting material, WGA technologies have allowed researchers to gain enough DNA from precious samples in order to perform genome sequencing. Most WGA methods amplify DNA via exponential or linear polymerase reactions (De Bourcy et al., 2014, Normand et al., 2016, Babayan et al., 2017). One of the most successful methods is multiple displacement amplification (MDA), which uses the strand-displacing capabilities of the phage phi29 DNA polymerase to create high concentrations of high molecular weight DNA (Esteban et al., 1993, Paez et al., 2004). Limitations of these WGA methods include uneven genomic coverage (Pinard et al., 2006, Marcy et al., 2007) and chimera formation (Lasken et al., 2007). But despite that, the combination of WGA and low-input DNA library construction allows for the sequencing of genomic and metagenomic samples previously unavailable to us, due to limited available starting material. This includes samples from low-accessibility environments such as ocean floor sediments (Spang et al., 2015), samples with low cell density (Duhaime et al., 2012) and precious samples such as clinical samples, preserved samples or ancient DNA (Greninger et al., 2015, Kader et al., 2016).

The future

As sequencing technology has progressed, the number of organisms from which there are genomic sequences has increased substantially. In June 2003, the NCBI RefSeq genome database contained genome sequences from 280 vertebrate species (O'Leary et al., 2015). By September of 2017, this number had increased to 4691. And that is only the vertebrate genomes. If we also include the genomes of invertebrates, plants, fungi, prokaryotes and protists, the number of species for which we have access to a genome has increased from 456 to 65413. These numbers do not include the thousands of mitochondrial, viral and plasmid sequences which we have also sequenced over the last 15 years.

The technologies of DNA sequencing are sure to advance again in the future, and they will be able to provide even longer sequences with ever-increasing quality and yield. Established companies such as Illumina and Pacific Biosciences are continuously pushing the envelope for their methods. New emerging technologies based on both the tried-and-true method of sequencing by synthesis and that of Nanopore sequencing are sure to appear. An effort to replace the protein pores used by Oxford Nanopore with solid state pores is currently underway and shows some initial promise (Dekker, 2007, Lindsay, 2016). Apart from the increase in yield and quality of these new methods (Figure 2), the third generation sequencing methods and their successors will provide DNA sequencing capabilities beyond those that exist today.

Figure 2. Read length and output per sequencing run (Gb) for some of the major sequencing platforms. Image adapted from Lex Nederbragt, http://dx.doi.org/10.6084/m9.figshare.100940

The extremely compact real-time sequencer MinION from Oxford Nanopore has already allowed researchers to conduct DNA sequencing in field conditions (Hoenen et al., 2016). The rapid turnover of these types of sequencers allows them to be used as diagnostic and monitoring tools for clinicians.

One can also speculate about the impact these new technologies will have on the fields of molecular evolution and genomics. As the length and accuracy of the DNA reads increase, the complex methods of modern genome assembly will become simpler. This should allow more research groups to perform genomic studies, and focus can be shifted from the technical aspects to the interpretation of the data itself. Despite these technological and methodological advances, the challenge of interpreting the ever-increasing amount of DNA sequences available will remain a significant bottleneck to our understanding of life on Earth.

This incredible development of technology over the last 40 years is perhaps only rivalled by the advancement in heavier-than-air flight, which started with the Wright brothers’ first powered flight in 1903, and reached the heights of the moon with the lunar landing of Apollo 11 in 1969. But just as flight has continued to expand and improve, so too are we nowhere near the end of the journey of DNA sequencing. Even though we now might have the blueprints to life, we still do not know what all the parts are.

Culture independence

Since the days of Antonie Van Leeuwenhoek, microbiological studies have revolved around microorganisms we can culture in the lab. This is due to the fact that in order to be able to answer questions about what a certain species of microbes does, we need to have it isolated from other species so we can perform experiments. This includes experiments trying to elucidate the me- tabolism of some species, by seeing which compounds can be broken down as in traditional biochemical assays, or in experiments trying to extract the DNA of some microbe, among others. Having multiple species in a sample would make it impossible to know where a certain DNA sequence came from, but also the amount of DNA needed for sequencing necessitates a very large number of cells as starting material. This has lead to a bias in the types of species that have been extensively studied so far. The ones that could readily grow in the lab, and did so in a timely manner, have been good can- didates for study. The classical example is of course Escherichia coli, which grows in well defined media such as LB (Lysogeny Broth) (Bertani, 1951, Bertani, 2004) and has a doubling time of around 20 minutes. Such species make for excellent study subjects, and E. coli has been a model organism for decades. However, our need for culturing has been the cause of a blind spot in our microbiological field of vision. Some organisms that we might find in nature defy our attempts to grow them at the lab, despite our efforts to do so. The fact that some species grow better, or at all, using culture media was first noticed by Razumov in 1932 and later coined the Great Plate Count Anoma- ly by Staley in 1985 (Razumov, 1932, Staley et al., 1985). In short, this means that the diversity of microorganism that we are able to observe in the cultures we produce in the lab does not reflect the diversity in nature. 
Estimates of the fraction of microorganisms that we are not yet able to grow range up to 99% (Epstein, 2013). This has led some people to refer to species we are unable to culture as “unculturable”. I believe this term is unfortunate, since it implies that producing a monoculture of such species would be an impossible task. This mind-set is, if nothing else, a bit pessimistic. The species we are able to grow in the lab today are much more numerous than ever before, and advances are still being made (Browne et al., 2016). Despite these advancements, we would benefit from alternative methods of studying the enormous natural diversity that we are able to observe with more direct methods such as microscopy. A good way of doing so would be to study organisms without the step of cultivation.

Further skewing our knowledge of the true diversity of microbes is the fact that some organisms offer a greater incentive for study. These include microorganisms that have a medical impact, either on us humans directly or on our pets. A tremendous amount of scientific funding and work has gone into bettering our understanding of the microorganisms that cause diseases such as tuberculosis, hepatitis or cholera. There has also been extensive study of microbes of economic importance. These include beneficial and pathogenic microbes of livestock and crops, as well as organisms involved in the refinement of products such as milk or beer. Finally, there are the organisms important to the biotechnology industry. Insulin is now produced using bacteria, and the biofuel industry is borrowing heavily from nature in coming up with new ways of converting organic and inorganic compounds into climate-friendly fuel. While it is not surprising that medical and economic needs have steered our examination of natural diversity, this has led to a certain bias in our understanding of which microorganisms are out there.

One group of organisms that has been understudied is the archaea. First described by Carl Woese in the late 1970s, the archaea are known as the third domain of life (Woese et al., 1977, Woese et al., 1990). While physically more similar to bacteria, the archaea possess core information-processing machinery, such as that for transcription and translation, that is more similar to that of the eukaryotes (Huet et al., 1983, Pühler et al., 1989, Langer et al., 1995, Spang et al., 2013).
They are of particular interest from an evolutionary standpoint, as it has been shown that eukaryotes branch inside the archaeal part of the tree of life, rendering the three-domains model invalid in favour of a two-domains topology (Iwabe et al., 1989, Guy et al., 2011, Williams et al., 2013). Furthermore, there is evidence indicating that the eukaryotes are most closely related to the group of archaea named the Asgard archaea (Spang et al., 2015, Zaremba-Niedzwiedzka et al., 2017). Not only does the phylogenetic evidence point towards this, but a large number of genes and proteins previously thought to be exclusive to eukaryotes are now found in archaeal species, particularly in the Asgard archaea (Zaremba-Niedzwiedzka et al., 2017).

Despite the impact the archaea have had on evolutionary research, there is an oddity about this group that has led to their underrepresentation in scientific studies: while both bacteria and eukaryotes are found as pathogens of humans, to date there is no example of direct medical harm caused by an archaeal species. There is however no lack of examples of archaea living in close association with humans. Archaea have been found living in our gastrointestinal tract (Dridi et al., 2009), vagina (Belay et al., 1990), mouth (Bik et al., 2010) and on our skin (Probst et al., 2013). Recently there have also been reports of archaea found in connection with periodontal disease (Lepp et al., 2004) and brain abscesses (Drancourt et al., 2017); however, there is no indication that these archaea were the direct cause of these conditions. The lack of examples of archaeal pathogens has been a topic of discussion for quite some time (Reeve, 1999, Cavicchioli et al., 2003, Eckburg et al., 2003). We are now starting to see some of the first cases of archaea implicated in human health, like the examples mentioned above. This might very well be due to clinicians and medical surveys using methods suitable for detecting not only bacteria, but also archaeal species.

As we are now aware that we have a skewed image of the microbial life found on planet Earth, many have taken to the new cultivation-independent methods in order to improve our knowledge of natural biodiversity. These methods have, among many examples, helped us understand organisms from extreme environments not easily replicated in the lab, organisms that are very rare in the biosphere, and endosymbionts that rely on their hosts for many of their metabolic needs. Many organisms are still being found, and not only new species. A group of bacterial phyla was recently discovered and named the “Candidate Phyla Radiation” (CPR) (Brown et al., 2015). These never-before-seen bacteria are estimated to make up 15% of the bacterial sequence diversity. New phyla are also being discovered in the domain of archaea. The DPANN group of archaea (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaeota) was first reported in 2013 (Rinke et al., 2013), and the aforementioned Asgard archaea in 2015 (Spang et al., 2015). Both were detected by means of cultivation-independent sequencing efforts.

So even though we are yet to discover how to culture the vast majority of the diversity in nature, alternative methods can now be used.
One might even relish the notion that today, when many things seem already discovered and explored, there exists a myriad of organisms still waiting to be encountered. The tools for doing so have expanded impressively in the last decades.

Who is out there?

Studying the diversity of nature has long been at the heart of biology. Carl von Linné, sometimes referred to as the inventor of modern taxonomy, gave us the tools for classification with his publication Systema Naturae. Since that time, we have striven to identify and classify the natural world around us. Antonie van Leeuwenhoek discovered the microbial world using the first microscopes, and the microscope has since remained one of the most important tools in microbiology. The microscope has many advantages even today. The ability to visualize the subject of study is of profound importance for our understanding. The mind tends to believe what it can see. But of course it also has several shortcomings. As a tool for discovering new species it has a low throughput, requires its operator to have extensive knowledge of the existing diversity, and can be restricted to certain samples. One might also argue that the extensive sample preparation needed by many microscopy techniques alters the nature of the sample, putting into question just how “direct” such observations are. This is true for modern electron microscopes as well as classic light microscopes.

Diversity surveys today are usually based on DNA sequence instead of direct observation. While many different genes can be used as the target for such experiments, by far the most common one is the gene encoding the small-subunit ribosomal RNA. This is commonly referred to as the 16S gene in prokaryotes and the 18S gene in eukaryotes. The strength of using these genes for studying diversity was well laid out by Carl Woese in 1987, referring to them as the “Ultimate Molecular Chronometers” (Woese, 1987). He pointed out that such marker genes should be conserved across all three domains of life, be readily sequenced, and maintain a phylogenetic signal even between the most distantly related organisms.

There are however certain shortcomings of using the 16S/18S genes for microbial ecology studies. One of them is that these genes are often found in multiple, heterogeneous copies within a single genome (Crosby et al., 2003). This can cause biases when trying to determine the abundances of members of a microbial community. While methods have been suggested for correcting such issues via bioinformatic means, this remains an issue today (Angly et al., 2014). Other genes have been suggested to complement 16S surveys, such as the rpoB gene, which encodes the RNA polymerase subunit β (Case et al., 2007). However, despite the advantages any alternative marker gene might have, the 16S/18S genes have been in use for decades. This has led to a vast accumulated wealth of 16S/18S sequences at our disposal.
One of the most prominent biological databases for 16S/18S gene sequences, SILVA, contains over five and a half million sequences (as of December 2017). This large abundance of sequences is the true strength of the 16S/18S gene as a tool for molecular ecology today. Starting in 1990 with the exploration of the genetic diversity in the Sargasso Sea (Giovannoni et al., 1990), numerous surveys of every kind of available environment have been undertaken using the 16S/18S gene as a marker.

The fundamental workflow of these surveys is to extract all DNA found in an environmental sample and amplify the 16S/18S gene using PCR (Saiki et al., 1985) with primers universal to all known organisms. Before the advent of second and third generation sequencing methods, the amplicons (amplified gene fragments) were cloned into plasmids and subsequently sequenced. With modern sequencing methods we can now sequence millions of 16S/18S amplicons directly in a single sequencing run. This deeper sequencing allows for the detection of community members with very low abundances.

While 16S/18S amplicon sequencing has been key in unravelling the unknown microbial diversity, it has some limitations. Firstly, there are technical aspects that are difficult to overcome. Soil samples are for instance harder to work with than aquatic ones, due to compounds in the soil that inhibit either the DNA extraction or the PCR (Miller et al., 1999, Schrader et al., 2012). The way DNA is extracted from the sample also carries risks, as different methods give different final results, both in how well they capture the true diversity of the sample and in the yield of DNA (de Lipthay et al., 2004).

The PCR method itself can also create problems. PCR can introduce artefacts and biases due to template loading (Polz et al., 1998), polymerase error rates (Acinas et al., 2005) and chimera formation (Haas et al., 2011). The issue of designing suitable primers for PCR should also not be underestimated. A primer is a small piece of DNA that is designed to be complementary to the gene one wants to amplify, and it primes the DNA polymerase in the PCR. These small oligonucleotides are generally around 15–25 bp in length, which means that in order to make a truly universal PCR primer pair, there have to be two stretches of DNA (one in each direction) in the 16S/18S gene that are conserved in all organisms on Earth. Such sequences do not exist. Primers designed to be as universal as possible, at least for the domains bacteria and archaea, can in the best cases achieve coverage of up to 90% of all known species. Even so, it has been shown that entire phyla of bacteria and archaea could still be missed using such primers (Klindworth et al., 2013, Takahashi et al., 2014).

One should also note that PCR primers are designed based on our current knowledge of 16S/18S sequences. They can therefore fail to amplify novel sequences with mismatches to the primers, which might lead to such sequences going undetected.
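As an illustration of what primer "coverage" means in practice, the following minimal sketch checks what fraction of a set of template sequences contains a site matching a degenerate primer. The primer, the templates and all function names are hypothetical and chosen only for this example; it considers a single primer on a single strand, whereas a real evaluation would test both primers of a pair against a curated reference database.

```python
# IUPAC nucleotide codes: each primer symbol stands for a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer."""
    return sum(base not in IUPAC[p] for p, base in zip(primer, site))


def primer_binds(primer, template, max_mismatches=0):
    """True if any window of the template matches the primer closely enough."""
    k = len(primer)
    return any(
        mismatches(primer, template[i:i + k]) <= max_mismatches
        for i in range(len(template) - k + 1)
    )


def coverage(primer, templates, max_mismatches=0):
    """Fraction of template sequences that contain a matching primer site."""
    hits = sum(primer_binds(primer, t, max_mismatches) for t in templates)
    return hits / len(templates)


# Hypothetical toy data: the third template differs from the primer at the
# degenerate position (A where only C or T is allowed).
toy_primer = "GTGYCAGC"
toy_templates = ["AAGTGCCAGCTT", "AAGTGTCAGCTT", "AAGTGACAGCTT"]

strict = coverage(toy_primer, toy_templates)                      # no mismatches
relaxed = coverage(toy_primer, toy_templates, max_mismatches=1)   # one allowed
```

With no mismatches allowed, the third toy template is missed; allowing one mismatch recovers it. The same logic explains why a lineage whose sequences diverge at the primer site can go undetected in an amplicon survey.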
It is also noteworthy that primer design has to take into account not only sequence conservation, but also the length of the target fragment and the region of the gene it comes from. A sequence that is very short, or that comes from too conserved a region of the marker gene, will contain very little phylogenetic signal, which makes classification of the species it comes from quite difficult.

An alternative to PCR-based 16S/18S amplicon methods is to use RNA sequencing. By extracting all RNA from a sample and using reverse transcriptase to generate DNA, full-length 16S/18S sequences can be generated (Karst et al., 2016). A welcome side effect is that the RNA of the large ribosomal subunit can also be sequenced in the same experiment. While this method does remove the primer issue, it does mean that RNA has to be extracted, which can be technically difficult since RNA molecules are less stable than DNA (Fleige et al., 2006). The issue of some species containing a large number of ribosomal RNA operon copies, which skews the abundance profiles, also remains.
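The effect of rRNA operon copy number on abundance profiles can be shown with a small worked example. This is only a sketch of the general idea behind copy-number correction, not the method of any particular tool; the taxa, read counts and copy numbers below are invented, and in practice the copy number of an uncultured organism is usually unknown and has to be estimated.

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Convert 16S read counts into estimated relative cell abundances."""
    # Reads per gene copy approximate the number of cells sampled per taxon.
    cells = {
        taxon: read_counts[taxon] / copy_numbers[taxon]
        for taxon in read_counts
    }
    total = sum(cells.values())
    return {taxon: value / total for taxon, value in cells.items()}


# Hypothetical community: taxon_A carries 7 rRNA operon copies, taxon_B one.
counts = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}

corrected = correct_for_copy_number(counts, copies)
```

In this toy case taxon_A accounts for 70% of the reads but only 25% of the estimated cells, which is the kind of skew the text refers to.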

Despite their shortcomings, both amplicon and RNA sequencing are of much use when trying to identify environmental samples of interest. Being aware of each method’s strengths and shortcomings, and compensating for these, will allow these methods to remain relevant for a long time to come. These techniques were used in some of the massive global sampling expeditions undertaken in the last decades. One of the first major sampling efforts was the Global Ocean Sampling expedition in 2003, which sampled the marine environments it traversed during its circumnavigation of the Earth (Rusch et al., 2007). This was followed by further efforts, like the Pacific Ocean Virome project (Hurwitz et al., 2013) and the TARA Oceans expedition (Sunagawa et al., 2015). Together these projects have generated vast amounts of sequence data, showing us some of the true microbial diversity on our planet. They also serve as public resources from which diversity data from around the globe can be extracted.

What are they doing?

Microbial composition and abundance answer the question of who is out there, but not what they are doing. Taxonomy can be used to infer functionality, but this has its pitfalls (Langille et al., 2013). We know that even closely related species can differ in genome size and genomic capabilities. An example is the 5.5 Mb genome of the human pathogen E. coli O157:H7. This genome is 35% larger than the genome of the human commensal E. coli K-12, and contains genes not previously found in E. coli strains (Hayashi et al., 2001). This is despite the fact that these two strains are 99% identical in their 16S sequence.

In order to fully understand which species inhabit a certain environment, and to know what they are doing there, we benefit from having the genomes of these organisms available to us. As discussed previously, it is not always feasible to culture these organisms, so culture-independent whole genome sequencing is required. In the current state of the field, two techniques are dominant in procuring such genomes, namely single cell genomics and metagenomic binning.

Single cell genomics

Isolating and sequencing the DNA of a single cell can be done in various ways, but the general workflow is to capture a cell of the organism of interest, lyse it, extract the DNA and sequence it using high-throughput methods (Figure 3).

Figure 3. A cell is captured and isolated in a microtiter plate via fluorescence-activated cell sorting. Lysis and whole genome amplification are then performed. A screen for organisms of interest, usually by 16S PCR, is performed to identify candidates for further processing. Sequencing libraries are constructed and the DNA is sequenced and assembled into a Single Amplified Genome (SAG).

The variation in the workflow comes mainly from how one captures the single cell. This can be done using micromanipulation (Inoue et al., 2008), laser dissection (Frumkin et al., 2008), microfluidics (Nishikawa et al., 2015) or fluorescence-activated cell sorting (FACS) (Rinke et al., 2014). While micromanipulation and laser dissection can provide very good specificity, they are low throughput, requiring extensive manual work for each cell analysed. FACS can achieve a moderately high throughput, where cells are sorted into 96- or 384-well microtiter plates and the 16S genes from the resulting samples can be screened simultaneously using multiplexed, high-throughput methods. Yet the highest-throughput workflow currently available is based on droplet microfluidics. These systems encapsulate single cells from an environmental sample into millions of picoliter-sized drops using automated high-speed droplet-generating systems. These droplets are then used as miniature reaction chambers for cell lysis, whole genome amplification and library construction. The SAGs generated are then sequenced as previously described (Nishikawa et al., 2015, Hammond et al., 2016). Apart from being very high throughput, reaching processing capacities of millions of amplification reactions per experiment, performing whole genome amplification in picoliter volumes also has advantages in terms of coverage and completeness of the resulting SAGs (Marcy et al., 2007, Hammond et al., 2016).

Completeness of the SAG, i.e. how much of the organism’s genome is represented by the generated sequence data, is typically lower in single cell genomics than in traditional whole genome sequencing. The uneven genome coverage generated by the use of random primers in the MDA techniques probably explains the lower completeness (Pinard et al., 2006, Ellegaard et al., 2013). Ways of increasing the completeness of SAGs include, as previously mentioned, using smaller reaction volumes, or pooling the isolated single cells. This pooling can be done before or after the whole genome amplification step (Dodsworth et al., 2013).
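The link between uneven coverage and low completeness can be made concrete with a toy calculation. The sketch below assumes that a per-position read depth along a known reference genome is available, which is rarely the case for a novel organism; the depth values and function names are invented for illustration only.

```python
def mean_depth(depths):
    """Average number of reads covering each genome position."""
    return sum(depths) / len(depths)


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of genome positions covered by at least min_depth reads."""
    covered = sum(d >= min_depth for d in depths)
    return covered / len(depths)


# Two hypothetical ten-position "genomes" sequenced with the same total effort.
even_depths = [10] * 10                 # evenly amplified genome
uneven_depths = [50, 50] + [0] * 8      # amplification concentrated in one region

even_breadth = breadth_of_coverage(even_depths)
uneven_breadth = breadth_of_coverage(uneven_depths)
```

Both toy genomes have the same mean depth, but only a fifth of the unevenly amplified one is represented in the data, so additional sequencing of the same amplified material would not recover the missing regions.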

Metagenomics

An alternative method to single cell genomics, which avoids some of the technical challenges associated with capturing single cells, is metagenomics. Instead of processing the cells from an environmental sample one by one, all cells are lysed and the sum of all DNA is used for library preparation and high-throughput sequencing. This means that one does not need to optimize the sometimes complicated single-cell protocols.

Metagenomic techniques avoid the biases introduced by primers, rRNA operon copies and sample inhibitors, since they can be performed without any DNA amplification. Metagenomic surveys do still share some technical issues with 16S/18S amplicon sequencing. The methods used to sample the environment, lyse the cells and extract the DNA will still influence the final composition of the sample (de Lipthay et al., 2004). Cell lysis in particular has been shown to be key, and using a combination of extraction methods can sometimes improve the recovered biodiversity of a sample by up to 83% compared to using a single extraction method (Delmont et al., 2011).

Of course, analysing all DNA at once introduces some new problems, especially in terms of bioinformatics. The raw sequencing reads are a mix of short stretches that represent the genomes of an unknown number of organisms. While genome assembly can be performed, the assembled contigs almost never represent a full genome, which necessitates methods for determining which contigs belong to which organisms. The assembly itself can also be problematic, since a metagenomic study of a sample, especially a complex one with high diversity, necessitates a deep sequencing effort in order to make sure that the DNA of all organisms has been sequenced with enough coverage. This large amount of data puts higher demands on computational resources, particularly computer memory.

Determining the organisms present in a sample, and piecing together which contigs belong to which organism, can be done using mapping techniques. The contigs or the raw sequence reads are simply aligned and mapped against a fully sequenced reference genome. This, of course, necessitates good reference sequences. In samples with novel lineages, or with different species from the same genus, it becomes hard to fully separate the contigs into individual genomes. One might also miss any unusual genome rearrangements, since the reads will map poorly to these regions.

As an alternative to reference-guided methods, many metagenomic experiments now use binning methods. This means that each contig is, by some means, grouped together with all other contigs from the same organism into a single group, or bin. There are many competing methods for performing binning, but many of them utilize the fact that an organism’s DNA is mostly homogeneous in terms of sequence composition. This is the reason why GC content (the percentage of the genome consisting of the nucleotides G and C) has long been used for identifying species. GC content in itself is too crude a measure to be used for binning, so instead a related metric is often used, the tetranucleotide frequency (Pride et al., 2003, Teeling et al., 2004). The frequency of each possible tetranucleotide in a contig is calculated, and the combined information is used as a profile for comparison with every other contig in a metagenome. Contigs that have similar tetranucleotide frequency profiles are binned together. In addition to sequence composition, metagenomic binning strategies can be improved by using information such as taxonomic classification (Brady et al., 2009), paired-read linkage (Lu et al., 2017) and coverage (Alneberg et al., 2013).
For example, if one has access to multiple samples that are close in space or time, one can use differential coverage, meaning that multiple samples are combined in order to bin contigs with similar coverage profiles across the samples (Albertsen et al., 2013, Alneberg et al., 2013).

There are also some downsides to metagenomic binning. Compared to single cell genomics, where one can often be sure about the exact origin of every contig, there is a possibility that contigs are grouped together incorrectly. This can lead to artificially chimeric genomes, or genomes split into multiple parts. A sample with many closely related species will have many contigs with similar nucleotide compositions, making binning challenging. Even within a genome there are regions that are more difficult to bin than others. Large non-coding sequences such as rRNA operons, and newly acquired material such as prophages or horizontally transferred genes, can all deviate in their nucleotide composition, making them harder to bin properly. Organisms with a very low abundance might also be missed unless the sequencing depth is high enough, but increasing the depth is not always possible owing to purely financial limitations. In fact, these low-abundance organisms will be challenging to capture whatever the method used.
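The two binning signals discussed in this section, tetranucleotide composition and coverage across samples, can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm of any published binning tool: the contig sequences, coverage values, thresholds and function names are all hypothetical, and real binners use more elaborate statistics and clustering.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Relative frequency of every tetranucleotide in a contig."""
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in TETRAMERS]


def distance(u, v):
    """Euclidean distance between two profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def same_bin(contig_1, contig_2, max_composition=0.1, max_coverage=5.0):
    """Group two contigs only if composition AND coverage profiles agree.

    The thresholds are arbitrary values chosen for this toy example.
    """
    composition = distance(tetra_profile(contig_1["seq"]),
                           tetra_profile(contig_2["seq"]))
    coverage = distance(contig_1["coverage"], contig_2["coverage"])
    return composition <= max_composition and coverage <= max_coverage


# Hypothetical contigs; "coverage" is the mean read depth in three samples.
contig_a = {"seq": "ATATTATAATATTATA" * 6, "coverage": [20.0, 5.0, 1.0]}
contig_b = {"seq": "ATATTATAATATTATA" * 4, "coverage": [22.0, 4.0, 1.0]}
contig_c = {"seq": "GCGCCGCGGCGCCGCG" * 5, "coverage": [2.0, 30.0, 15.0]}

a_with_b = same_bin(contig_a, contig_b)
a_with_c = same_bin(contig_a, contig_c)
```

In this toy data set contigs a and b share both a composition and a coverage pattern and are grouped, whereas contig c differs on both counts. Requiring agreement on two independent signals is one way to reduce the chimeric bins mentioned above, at the cost of leaving atypical regions such as prophages unbinned.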

What we can learn

Regardless of the challenges, culture-independent sequencing has allowed us to sequence the genomes of many newly discovered organisms. Today, a single metagenomic project can yield thousands of new genomes, many of them from never-before-seen taxa (Brown et al., 2015, Parks et al., 2017). This has greatly expanded our genomic sampling of the world, allowing us to better understand the evolution of the tree of life (Hug et al., 2016, Zaremba-Niedzwiedzka et al., 2017). We are not, however, limited to new sampling efforts. The publicly available sequence data from projects such as GOS and TARA (Rusch et al., 2007, Sunagawa et al., 2015) can be mined for genomes of new and interesting organisms. We are able not only to find who is out there, but also to better understand what these newly discovered organisms are doing in their environments. Through comparative genomics we have the ability to discover new interactions and important players in biochemical cycles (Manor et al., 2014, Evans et al., 2015, McIlroy et al., 2017).

Problems with culture-independent methods

The development of culture-independent methods, like single-cell genomics and metagenomics, means that we are no longer limited to studying the biology of organisms that we can grow, as we are also able to infer aspects of biology from the genomes of uncultivated organisms. We now have the means to discover and analyse microbial taxa that were out of reach only a few years ago. Culture-independent methods have allowed us to discover new groups of organisms with diverse metabolisms and ecological roles that we were previously completely unaware of, like the CPR bacteria (Brown et al., 2015) and the Asgard archaea (Zaremba-Niedzwiedzka et al., 2017).

Despite these advantages, the age of microbial culturing is probably far from over. Comparative genomics is only as good as the references we have to compare with, and the need for high-quality, experimentally verified genome annotation will only increase. In large-scale automated genome sequencing projects there is a risk of information overload and of erroneous conclusions being drawn from inferred homology (Richardson et al., 2012). Automatic genome annotation also risks propagating previous annotation errors if no steps are taken to carefully correct them. The databases that exist today often have inadequate mechanisms in place to update and correct the available sequences, so mistakes often remain unidentified.

Given the sheer number of genomes that are generated today, it is key that the analyses of these genomes are held to a high standard, and that the automatic aspects of genome annotation are also experimentally verified. There have been calls for the implementation of rules and guidelines to uphold minimum information requirements for genomes obtained by culture-independent methods (Bowers et al., 2017), which is a step in the right direction. Of course, it is the responsibility of researchers to make sure that such guidelines are upheld.

Genomics of symbionts

Symbiosis

The term “symbiont” was coined by Heinrich Anton de Bary in 1879 (Oulhen et al., 2016). The term covers any organism that lives in association with an organism of another species. Symbiotic relationships can be mutualistic, commensal or parasitic. A mutualistic relationship benefits both partners, a commensal relationship benefits one partner at no expense to the other, and a parasitic relationship benefits one partner at a cost to the other. There are numerous examples of such partnerships in nature, including fungi living on the roots of plants (Remy et al., 1994), the classic case of the lichen, which is one or more fungi living with a photosynthetic partner (Spribille et al., 2016), and the bacteria that inhabit our own gastrointestinal tracts (Turnbaugh et al., 2007, Yatsunenko et al., 2012).

As there have been multiple definitions of the different forms of symbiosis, I will here describe the definitions used in this thesis. The term “symbiosis” will refer to any direct ecological relationship between two or more organisms, no matter the benefits or costs to the different partners. For the more specific symbiotic relationship of endosymbiosis, I will consider any organism residing inside a tissue or cell of another organism an endosymbiont. This condition is sometimes called intracellular endosymbiosis, but I will use these terms interchangeably. This means that a microorganism in a gastrointestinal tract would not be considered an endosymbiont as long as it lives on the outside of the gut epithelium. Rather, such microorganisms could be examples of ectosymbionts, which live on the exterior of cells or tissues.
Apart from the organisms in our gastrointestinal tracts, other examples include the nanoarchaeon Nanoarchaeum equitans living on the surface of its crenarchaeal host Ignicoccus hospitalis (Huber et al., 2002), the devescovinid flagellates that move using ectosymbiotic bacteria (Tamm, 1982), and ticks, which suck blood from their mammalian hosts.

The phenomenon of endosymbiosis is widespread in nature, but arguably the most important endosymbiosis, from an evolutionary perspective, is the one that gave rise to mitochondria in eukaryotes (Sagan, 1967, Margulis, 1970). While the idea was highly debated when it was introduced by Lynn Margulis (Cavalier-Smith, 1975, Bonen et al., 1977), it is now a well-established theory that the mitochondrial progenitor, related to the alphaproteobacteria (Williams et al., 2007, Gray, 2012), was taken up by an archaeal common ancestor of eukaryotes. Recently this archaeal ancestor has been shown to be closely related to the Asgard superphylum of archaea (Spang et al., 2015, Zaremba-Niedzwiedzka et al., 2017). The mitochondrion represents a case of endosymbiosis where the symbiont has been reduced to the level of an organelle in the eukaryotic cell. While the term ‘organelle’ is not well defined, it can be said to represent an intracellular structure that is present in most, if not all, cells of a multicellular organism. While mitochondria represent an extreme case of endosymbiont adaptation, there are many examples of endosymbionts in various stages of host adaptation (Toft et al., 2010).

Endosymbiotic relationships seem especially common in insects. The most widespread example is possibly that of the bacterial endosymbionts of the genus Wolbachia, which might inhabit up to 60% of all insect species (Hilgenboecker et al., 2008). These endosymbionts have various effects on the hosts they inhabit, and in some cases they function as reproductive parasites, through mechanisms that include feminization and killing of males (O'Neill et al., 1997, Bordenstein et al., 2001). In the fruit fly Drosophila paulistorum, where Wolbachia is essential for the host’s survival, the endosymbiont functions as a sexual barrier, allowing mating to occur only between flies infected with the same strain of Wolbachia and thereby driving speciation of the host (Miller et al., 2010b).

Organisms living as endosymbionts do have some intuitive advantages over free-living ones. Living inside a nutrient-rich cell not only supplies the endosymbiont with the sustenance it needs for growth, but also protects it from predation and from sudden detrimental changes in the environment.
The host can also benefit greatly from having an endosymbiont, since in some cases endosymbionts perform functions that the host is unable to perform itself. Examples include the nitrogen-recycling Blattabacterium in wood-feeding cockroaches (Sabree et al., 2009), the green algae and cyanobacteria in lichens that produce energy for the host via photosynthesis (Spribille et al., 2016), and amino acid and vitamin synthesis by Buchnera in aphids (Nakabachi et al., 1999). Biosynthesis of nutrients not available in the host diet seems to be a fairly common trait of bacterial endosymbionts (Moran et al., 2000, Akman et al., 2002, Kuriwada et al., 2010, Smith et al., 2015), and there are even examples of multipartite relationships where biosynthesis pathways are divided between multiple endosymbionts (Ponce-de-Leon et al., 2017).

Genome evolution of endosymbionts

The endosymbionts mentioned in the previous section are all examples of well-established symbioses where the host and symbionts have a long-term evolutionary relationship. Interactions over such timescales lead to genomic adaptations, and for endosymbionts a pattern of genome reduction is apparent, where some endosymbiont genomes have been reduced to less than a megabase in length (McCutcheon et al., 2012). Endosymbionts have small effective population sizes, and it has been shown that this, combined with relaxed selection due to the host providing many of the factors normally encoded by the genomes of free-living organisms, leads to the loss of large portions of endosymbiont genomes (Burke et al., 2011). Isolation from free-living, closely related species and small population sizes also lead to degradation of the genomic content through Muller’s ratchet, where accumulated deleterious or neutral mutations are not compensated for by an inflow of genetic material that could be incorporated via recombination (Moran, 1996).

The process of endosymbiont genome reduction typically follows a progressive trend, in which recently acquired endosymbionts undergo loss of gene function due to pseudogenization, either by mutation or by insertion of mobile genetic elements. As the endosymbiosis stabilizes, the genome shrinks through the elimination of non-functional genes and of genes whose function is supplied by the host (Figure 4).
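The logic of Muller’s ratchet mentioned above can be illustrated with a small simulation. This is a textbook-style toy model written for this example, not an analysis from the cited studies: it assumes an asexual population of constant size, a fixed chance of one new deleterious mutation per cell division, a multiplicative fitness cost, and no recombination or back mutation. All parameter values and names are arbitrary.

```python
import random


def mullers_ratchet(pop_size, generations, mutation_rate=0.3, cost=0.02, seed=1):
    """Track the lowest mutation load in an asexual population over time."""
    rng = random.Random(seed)
    loads = [0] * pop_size  # deleterious mutations carried by each cell
    least_loaded = []
    for _ in range(generations):
        # Selection: parents are drawn in proportion to fitness (1 - cost)^load.
        weights = [(1 - cost) ** k for k in loads]
        parents = rng.choices(loads, weights=weights, k=pop_size)
        # Mutation: an offspring may gain one mutation; none is ever removed,
        # because this model has no recombination and no back mutation.
        loads = [k + (rng.random() < mutation_rate) for k in parents]
        least_loaded.append(min(loads))
    return least_loaded


# A small population, loosely mimicking the bottlenecks endosymbionts experience.
trajectory = mullers_ratchet(pop_size=20, generations=200)
```

Because offspring can never carry fewer mutations than their parent, the lowest load in the population can only stay the same or increase, and each increase is one irreversible "click" of the ratchet. In models of this kind, small populations tend to lose their least-loaded class by chance more often than large ones, which is consistent with the role the text assigns to small effective population size.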

Figure 4. Endosymbiont genomes are reduced over time by gradual gene loss and pseudogenization, genome rearrangement and deletions. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Genetics (Toft et al., 2010), copyright 2010.

In the most advanced stages, all that remains are the core genes needed for the survival of the endosymbiont and its host (Figure 4). These typically include genes necessary for the propagation of the endosymbiont as well as genes that are useful to the host (Toft et al., 2010, McCutcheon et al., 2012). Again, the most extreme example is the mitochondrion, whose genome is around 16.5 kb in humans (Ingman et al., 2000) and has been completely lost in some anaerobic protists (Palmer, 1997). Many of the functions that were once encoded in the mitochondrial genome are now encoded in the nuclear genome, and it is clear that some of these genes have been acquired through horizontal gene transfer (HGT) from the alphaproteobacteria that later became the mitochondria (Mourier et al., 2001).

Next-generation sequencing has been key in studying these endosymbiotic relationships. The wealth of fully sequenced endosymbiont genomes available today has made comparative genomic studies a reality and has provided insights into how endosymbionts evolve and interact with their hosts.

In addition to understanding how genome evolution occurs, there have been advances in elucidating how endosymbionts interact with their hosts. This includes both the initial evasion of the host defences as the endosymbiotic relationship is established, and how partners in mature syntrophies interact with one another. For the initial steps that any organism takes towards symbiosis, one could perhaps speak of a general strategy of adhering-invading-resisting. Many bacteria use type III secretion systems to invade and interact with their hosts, as in Sodalis glossinidius and Sinorhizobium fredii (Viprey et al., 1998, Dale et al., 2001). These systems are used to penetrate the membrane of the host cell and inject effector proteins into it. Other systems functioning in similar ways include fimbriae and flagella (Hentschel et al., 2000, Somvanshi et al., 2010).

Multiple mechanisms have been shown for resisting degradation by the host. In Drosophila, endosymbionts of the genera Wolbachia and Spiroplasma seem to be able to alter the host expression levels of genes involved in immune responses (Xi et al., 2008, Anbutsu et al., 2010). In protists it has been shown that invading prokaryotes can avoid being degraded in the host lysosome by counteracting the host's metal-ion pumps, and Shigella flexneri seems to be able to escape by lysing the phagosome altogether (High et al., 1992).
There are even some symbionts that produce a toxin/antitoxin complex to make sure that the cost of getting rid of the invading organism is high, or even fatal (Leplae et al., 2011). These examples and many more illustrate that there does not seem to be any universal mechanism that endosymbionts employ in order to establish themselves inside a host, but that such systems are needed for the successful establishment of the symbiosis. One should also keep in mind that these mechanisms predominantly come from bacterial examples. For archaeal endosymbionts no mechanisms have yet been described. When examining more ancient endosymbiotic relationships, many of them no longer contain any traces of such systems at all. Instead there seems to be some convergence in what is lost by the endosymbionts (Merhej et al., 2009), leading to the idea that loss of function could also be a driving force of endosymbiosis.

Archaeal endosymbionts

As discussed, there are many well-studied examples of diverse endosymbiotic bacteria (Moya et al., 2008, Fisher et al., 2017), but far fewer examples of endosymbiotic archaea. While cases of endosymbiotic archaea have been known for decades (van Bruggen et al., 1983), much less effort has gone into understanding these systems than into studying the bacterial systems. All known archaeal endosymbionts are methanogens that inhabit anaerobic microbial eukaryote hosts (Hackstein, 2010). Many methanogenic archaeal endosymbionts can be found in anaerobic ciliates, including free-living aquatic species such as Metopus contortus (Figure 3a), Trimyema sp. and Cyclidium porcatum (Clarke et al., 1993, Embley et al., 1993, Finlay et al., 1993), as well as ciliate species from gastrointestinal tracts, such as Nyctotherus ovalis (Figure 3b) (Van Hoek et al., 1998). Endosymbiotic archaea can also be found in parabasalid and oxymonad flagellates such as Hexamastix termopsidis, Tricercomitus termopsidis, Microjoenia sp. and Dinenympha parva (Lee et al., 1987, Tokura et al., 2000), which inhabit termite guts. Giant amoebae, like Pelomyxa palustris, are also known to harbour endosymbiotic methanogens (Gutiérrez et al., 2017).

Figure 5. The archaeal endosymbionts of Metopus contortus (a) and Nyctotherus ovalis (b) are intimately associated with the host hydrogenosomes. Examples of endosymbionts are marked with a red arrow, and hydrogenosomes with a blue arrow.

Relationships between methanogens and anaerobic microbial eukaryotes are thought to centre on the transfer of hydrogen. Some anaerobic microbial eukaryotes have specialised hydrogen-producing mitochondria, traditionally called hydrogenosomes. The methanogenic endosymbionts use this hydrogen to generate energy via methanogenesis. The hosts benefit from a lowered intracellular partial pressure of hydrogen, which lowers the overall free energy of the fermentation process (Worm et al., 2010). The extent of the benefit to the host seems to vary between different host/symbiont systems. For example, the ciliates Metopus contortus and Plagiopyla frontata were found to have a reduced growth rate when cured of their methanogenic endosymbionts by growing the ciliates with the methanogenesis inhibitor 2-bromoethanesulfonate (BES). This inhibitor stops methanogenesis and causes the endosymbionts to stop dividing. Conversely, Metopus palaeformis showed no difference in growth with or without the endosymbiont present (Fenchel et al., 1991). A similar result was observed in Trimyema compressum, where curing the ciliate of its endosymbionts did not substantially affect the ciliate's growth (Yamada et al., 1997).

In the parabasalid flagellate Trichomitopsis termopsidis, which harbours a Methanobrevibacter-like endosymbiont (Lee et al., 1987), growth of the methanogens was also inhibited when the flagellate was treated with BES. Interestingly, the growth rate was rescued by using heat-killed Bacteroides sp. JW20 as a food source for the flagellate (Odelson et al., 1985). Using other bacterial species, however, did not have this rescuing effect on the growth rate of the flagellate. This indicates that the methanogenic endosymbiont not only benefits the host by removing hydrogen, but might also provide the host with nutrients, the nature of which was not determined (Odelson et al., 1985).

All the described archaeal endosymbionts come from three different orders of methanogens: Methanobacteriales, Methanomicrobiales and Methanosarcinales. A previous study suggested that endosymbiotic archaea associated with free-living protists tend to come from the Methanomicrobiales, whereas the gut/faecal-associated protists tend to harbour Methanobacteriales-type endosymbionts (van Hoek et al., 2000).
There are exceptions to this observation, however, such as the free-living ciliate Trimyema compressum, which has an endosymbiont of the genus Methanobrevibacter (of the order Methanobacteriales) (Shinzato et al., 2007). The third group of methanogens, Methanosarcinales, was only recently found living as endosymbionts of protists (Narayanan et al., 2009, Gutiérrez et al., 2017). Specifically, the endosymbionts identified in these studies seem to belong to the genus Methanosaeta.

Limited co-speciation seems to have occurred between anaerobic protists and their endosymbionts (van Hoek et al., 2000). Instead, based on phylogenetic evidence, there seem to be numerous cases where the endosymbionts in anaerobic protists have been replaced (Smith et al., 2013, Husnik et al., 2016). One possible reason is that the host acquires a new species that outcompetes the existing population of endosymbionts. Endosymbionts are thought to suffer a reduced long-term fitness due to Muller's ratchet effects (Doolittle, 1998): their asexual mode of reproduction means that they are unable to recombine to avoid the accumulation of deleterious mutations, and deleterious mutations are driven to fixation more quickly because of small effective population sizes. The uptake of methanogenic archaea into a protist lacking endosymbionts has previously been demonstrated experimentally (Wagener et al., 1990); together with the lack of evidence for co-speciation, this suggests that these relationships are not entirely stable. One might speculate that the protists' proximity to free-living methanogens that are closely related to their own "pre-existing" endosymbiotic methanogens (for example inside animal digestive tracts) might increase the frequency of such replacements.

Isolation of archaeal endosymbionts

In paper II of this thesis we provide the first genomic data from archaeal endosymbionts. In order to obtain these genomes we used culture-independent methods, since previous work with bacterial endosymbionts has shown that endosymbionts are difficult to culture outside their hosts (Kikuchi, 2009). This difficulty stems from their tendency to have reduced genomes, which often lack genes for key metabolic pathways, or parts thereof. The decision was made despite numerous reports of isolation and growth of archaeal endosymbionts in the past (van Bruggen et al., 1984, Van Bruggen et al., 1986, Goosen et al., 1988, Van Bruggen et al., 1988, Broers et al., 1993, Tokura et al., 1999).

Many of these studies were performed before the advent of sophisticated molecular biological techniques that can unequivocally confirm the endosymbiotic nature of the identified methanogens (e.g., fluorescent in situ hybridization). Methods such as these are crucial in providing direct evidence of the intracellular nature of the organism in question; however, there are additional issues with these studies that need to be addressed. Several studies found Methanobacterium formicicum as an endosymbiotic partner of a number of different protists (van Bruggen et al., 1984, Goosen et al., 1988, Van Bruggen et al., 1988, Broers et al., 1993). It would be remarkable if these various ciliates, giant amoebae and amoeboflagellates harboured the same endosymbiont; however, the endosymbionts of these hosts were not identified using sequence data. Instead, the studies relied on less accurate criteria such as growth conditions and nucleotide composition (mol% GC). These are of course blunter instruments than the methods available to us today, such as 16S gene and whole genome sequencing.
Even more remarkable is the fact that the hosts all seem to have been isolated from the same 200-L aquarium situated in the lab of the authors (van Bruggen et al., 1983, van Bruggen et al., 1984, Goosen et al., 1988, Van Bruggen et al., 1988). The authors also indicate that several different fluorescent colonies (fluorescence being a marker trait of methanogenic archaea) grew on the solid media plates they used for isolation, and that several rounds of re-plating were done in order to obtain a pure culture.

Later studies have also found that the 16S sequences of the Methanobacterium formicicum suggested to be the endosymbiont are identical between hosts. These studies could not identify M. formicicum as the endosymbiont, but rather point to other species as probable endosymbionts (Embley et al., 1993, Gutiérrez et al., 2017), which suggests that the Methanobacterium formicicum reported in these papers, which has later been fully sequenced (Gutiérrez, 2012), is not an endosymbiont but rather a common contaminant.

Other studies have shown other species to be endosymbionts of ciliates, for instance Methanoplanus endosymbiosus sp. nov. as an endosymbiont of Metopus contortus (Van Bruggen et al., 1986). Here, the authors admit that contaminants were present during the time of cultivation, and repeat studies using in situ hybridization methods have been unable to confirm this species (Embley et al., 1993). In the case of the isolation of Methanobrevibacter sp. as an endosymbiont of sheep rumen-associated ciliates (Tokura et al., 1999), the evidence provided is not compelling enough to classify the isolated and cultured organism as an endosymbiont. The cultures of methanogens, which were obtained from a sample with multiple ciliates isolated from sheep, showed contamination by bacterial species despite the use of antibiotics in the growth media. Tokura et al. also did not use any in situ localisation methods, and the authors themselves report that the isolated species did not show the same growth traits (halophilism) as in their own previous reports (Tokura et al., 1999).

Hydrogen-producing mitochondria

Many of the archaeal endosymbionts we know today come from hosts containing mitochondria that produce hydrogen (Hackstein, 2010). These types of mitochondria are common in anaerobic protists, and have evolved independently in several lineages (Embley et al., 1995, Stairs et al., 2015). Classifying the many different kinds of mitochondria is not a straightforward task. There have been attempts to classify the different types into distinct classes based on metabolic properties, such as mitosomes, aerobic and anaerobic mitochondria, hydrogen-producing mitochondria, etc. (Müller et al., 2012). However, recent work has shown that these capabilities are not easily split into discrete classes, but rather represent a spectrum (Gawryluk et al., 2016). Here I will discuss the hydrogen-producing mitochondria often referred to as hydrogenosomes. While there is a debate regarding the difference between hydrogen-producing mitochondria and hydrogenosomes (Stairs et al., 2015), I will use the term "hydrogenosomes" for any mitochondrion that produces hydrogen.

More typical mitochondria, like the ones in human cells, respire oxygen in order to produce ATP via the electron transport chain (ETC). This is achieved by pumping protons out of the mitochondria, establishing a gradient that can be used to generate ATP via an ATPase. Hydrogenosomes in ciliates such as N. ovalis and M. contortus, which live in anaerobic conditions, instead couple ATP and hydrogen production. This process benefits from a low partial pressure of hydrogen (Worm et al., 2010), and the endosymbionts in the cell ensure this by using the hydrogen for methanogenesis, as mentioned previously.

The hydrogenosomes of ciliates seem to have evolved numerous times (Embley et al., 1995), and in this process many of the ciliate hydrogenosomes have lost their genome (Palmer, 1997). An exception to this is the hydrogenosome of the ciliate N. ovalis, which has retained a 48 kb genome that was discovered and partially sequenced in 1998 (Akhmanova et al., 1998). A follow-up study revealed a near-complete genome, and showed that the hydrogenosome of N. ovalis contained a partial ETC (Boxma et al., 2005). This partial ETC is thought to be useful for establishing a proton gradient used for driving protein transporters rather than ATPases (Gasser et al., 1982, Chacinska et al., 2009). Another possible function would be to maintain a high production rate of Fe-S clusters in the hydrogenosomes, something that is reduced when the membrane potential is decreased (Kispal et al., 1999, Veatch et al., 2009).

In order to expand our knowledge of hydrogenosomal evolution in ciliates, we need a broader sampling of hydrogenosomal genomes. When the hydrogenosomal genome of the ciliate N. ovalis was sequenced in 1998, it was the first example of a ciliate hydrogenosome that had retained a genome. However, there is no reason to believe that other closely related species will not contain any genetic material. The ciliates that harbour methanogenic endosymbionts together with hydrogenosomes represent an opportunity to use culture-independent methods for sequencing the genomes of both the endosymbiont and the hydrogenosome.

Evolution of the Haloarchaea

There are many groups of halophilic (salt-loving) organisms. Many of these groups are archaea, but there are also bacteria that are halophilic to various degrees, including extreme halophiles (Antón et al., 2002, Amoozegar et al., 2013, Oh et al., 2016). The archaeal halophiles come from many different parts of the archaeal tree, including species from the recently described phylum nanohaloarchaea (Narasingarao et al., 2012). These species belong to the wider group of the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaea) archaea (Rinke et al., 2013). However, most of the halophiles belong to the euryarchaea. Within this phylum there are species of the families (Paterek et al., 1988, Wilharm et al., 1991, Boone et al., 1993) and Methanocalculaceae (Ollivier et al., 1998, Lai et al., 2004). The largest group, however, is that of the aptly named haloarchaea. They are found in salt-rich environments ranging from modest salt content up to saturation (~26%) (Purdy et al., 2004, Savage et al., 2007).

The haloarchaea were originally named the halobacteria. This name persists in scientific texts today, despite these organisms having been known to be members of the archaeal domain of life for several decades. To alleviate some of this confusion, I will refer to this group only as the haloarchaea throughout this text.

Most of the haloarchaea known today are aerobic heterotrophs, and are seemingly quite flexible in their metabolism (Andrei et al., 2012). Many of them use amino acids as their organic energy/carbon source, but they may also use carbohydrates (Altekar et al., 1992, Johnsen et al., 2001, Anderson et al., 2011). Some species are capable of growing in anoxic environments as facultative anaerobes (Oren, 1991, Antunes et al., 2008, Werner et al., 2014), and recently two obligately anaerobic species were found in the sediments of hypersaline lakes (Sorokin et al., 2016, Sorokin et al., 2017b).
One of these species, Halanaeroarchaeum sulfurireducens, seems to respire elemental sulfur (Sorokin et al., 2016).

As a supplement to their aerobic respiration, some species of the haloarchaea contain light-sensitive proteins, rhodopsins. The rhodopsins function as proton pumps, supplementing the proton motive force that drives the generation of ATP through ATPase (Hartmann et al., 1980, Mukohata et al., 1988). There are also other rhodopsins in the haloarchaea, with functions ranging from signalling to chloride ion pumping (Spudich, 1998, Essen, 2002, Sharma et al., 2007). Different types of rhodopsins can be present in the same organism, with Haloarcula marismortui having six rhodopsins encoded in its genome (Baliga et al., 2004).

All organisms living in high-salinity environments need some mechanism for coping with the increased osmotic stress. The haloarchaea have evolved a mechanism usually referred to as the 'salt-in strategy'. In short, they pump potassium ions (K+) into the cytoplasm at the expense of sodium ions (Na+) using H+/K+ symporters and H+/Na+ antiporters (Becker et al., 2014). Apart from using ion pumps, the haloarchaea all synthesize isoprenoid lipids, which have been shown to make membranes less permeable to ions (Yamauchi et al., 1993), and genes involved in isoprenoid biosynthesis are upregulated in response to increased salinity (Bidle et al., 2007).

The evolution of the haloarchaea has been under increased scrutiny over the last decade. Their phylogenetic position in the euryarchaeal tree, where they branch together with anaerobic methanogens (Brochier-Armanet et al., 2011, Spang et al., 2017), coupled with them being aerobic heterotrophs, has led many to ask how these metabolic changes came to pass. It has been suggested that a single event involving a large influx of over 1000 bacterial genes led to the transition from a methanogenic lifestyle to the one known in haloarchaea today (Nelson-Sathi et al., 2012, Nelson-Sathi et al., 2015). However, this study only used the ten haloarchaeal genomes that were available at the time of the analysis. Subsequent work that added a further 65 haloarchaeal genomes has shown that the number of horizontally acquired bacterial genes at the base of the haloarchaea might be artificially inflated (Becker et al., 2014). The method used to infer the massive gene influx has also been questioned.
After re-analysing the data using an evolutionary model for gene gain and loss, it seems that the number of genes acquired by the haloarchaeal ancestor is closer to 200, and that many of the genes acquired by the haloarchaea have been transferred gradually to individual lineages, rather than acquired at the root (Groussin et al., 2015).

The transition from an anaerobic methanogen to an aerobic heterotroph suggests that major evolutionary changes have taken place. Any method that tries to infer the changes that led to the ancestor of the haloarchaea will need a good sampling of lineages closely related to the haloarchaea. Two such potential lineages are the SA1 (Sorokin et al., 2017a) and MG-IV (Marine Group IV) groups of archaea. SA1 has been described as a new class of halophilic methanogens, the 'Methanonatronarchaeia', and might be closely related to the haloarchaea. The other group, the MG-IV archaea, was first discovered in 2001, at 3000 m depth off the coast of Antarctica (López-García et al., 2001a, López-García et al., 2001b). Given their close phylogenetic placement to the haloarchaea, and the fact that they are non-halophilic aerobes, they might be key to understanding the transition from methanogen to haloarchaea.

Results

Paper I – PacBio sequencing of rRNA amplicons

Traditional 16S surveys rely on amplifying regions of the 16S small subunit rRNA gene using primers designed to match the conserved sites of that gene. Given properly designed primers, this allows the capture of 16S fragments from a broad range of taxonomic groups. A caveat of this method is the relatively short sequences one can obtain, due to the maximum read length of the Illumina sequencing platform. Typically, the largest fragment size one can achieve by overlapping the paired-end reads is around 500 bp. While such short fragments can be used to identify many of the known species of prokaryotes, it can be hard to gain any insight into the identity of novel species. Typically one would construct phylogenetic trees of any hard-to-classify sequences, but given the short fragment length, a 16S amplicon does not carry much phylogenetic signal.

In this paper we describe a method that exploits the fact that, in most prokaryotic species, the 16S and 23S rRNA genes are located together in a single operon, with an internal transcribed spacer (ITS) between them. This allowed us to design primers that amplify the near full-length 16S-ITS-23S operon in a single experiment. Since this fragment is usually more than 3 kb in length, we used the long-molecule sequencing offered by the Pacific Biosciences, or PacBio, sequencers. Using this technology, we were able to sequence the near full-length 16S-ITS-23S amplicons. An issue we had to overcome was the higher error rate that results from choosing PacBio sequencers over Illumina. We were able to mitigate this by establishing an in silico pipeline that brought the error rate down from 2.5% to 0.2%. This we consider to be well within the natural variation that can be found between rRNA operons from the same genome.
With these longer fragments we could now construct phylogenetic trees from the combined 16S and 23S sequences together with a reference set of sequences. The sequences used to build these trees carry a stronger phylogenetic signal than a small fraction of the 16S gene would, and we were able to observe phylogenetic relationships and classify sequences that we would not have been able to without both the 16S and the 23S combined.

By using this method, we were able to generate 911 bacterial and 595 archaeal 23S sequences, which will be deposited in databases and be of further use in future studies. We believe that this method could be used as an alternative to pure 16S studies when one wants to better classify novel rRNA sequences.
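To put the error rates quoted for Paper I into perspective, the following minimal sketch computes the expected number of erroneous bases per amplicon before and after correction. The amplicon length of 3000 bp is an assumed round figure (the text only states "more than 3 kb"), and the helper name `expected_errors` is illustrative rather than part of the published pipeline.

```python
def expected_errors(length_bp, error_rate):
    """Expected number of erroneous bases in a read of the given length."""
    return length_bp * error_rate

AMPLICON_BP = 3000  # assumption: round figure for a 16S-ITS-23S amplicon

raw = expected_errors(AMPLICON_BP, 0.025)        # 2.5% error rate
corrected = expected_errors(AMPLICON_BP, 0.002)  # 0.2% error rate
print(f"raw: ~{raw:.0f} errors, corrected: ~{corrected:.0f} errors per amplicon")
```

Under these assumptions, correction reduces the expected number of errors per amplicon by more than an order of magnitude.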

Paper II – Genome sequences of archaeal endosymbionts

Archaea, sometimes referred to as the third domain of life, remain the most under-studied branch of the tree of life. Previously known mainly as extremophiles, archaea have now been found in almost every corner of the globe. Here we present the draft genomes of two archaea living in endosymbiosis with their ciliate hosts. The ciliates are Nyctotherus ovalis, isolated from a cockroach hindgut, and Metopus contortus, isolated from a marine pool.

While many examples of endosymbiotic bacteria exist, far fewer are known from the world of archaea. The genomes presented here are the first genomes ever sequenced from archaeal endosymbionts. Cultures of endosymbionts are usually hard to establish since they often rely upon their hosts for various metabolites, which is why we chose to use culture-independent techniques to obtain the genomes of these methanogenic archaea.

The genome of the endosymbiont of N. ovalis, referred to as Methanobrevibacter sp. NOE, was obtained using single-cell genomics. Ciliate cells were isolated from cockroach hindguts by exploiting their galvanotactic behaviour. The cells were then lysed and sorted using fluorescence-activated cell sorting (FACS). A total of 19 cells identified as the endosymbiont by their 16S sequence were pooled together and sequenced.

In contrast, we used a metagenomic approach to sequence the genome of the M. contortus endosymbiont, Methanocorpusculum sp. MCE. Hand-picked ciliate cells were lysed and the nuclear DNA was removed using a combination of filtration and centrifugation. The remaining DNA, coming from the endosymbiont, the hydrogenosome and any phagocytosed prokaryotes, was sequenced together, and the endosymbiont genome was binned out using tetranucleotide frequency profiling.

Both of the sequenced genomes show gene loss due to pseudogenization, a pattern commonly found in endosymbionts.
Compared with the genomes of long-term, stable endosymbionts, however, we did not see extensive genome reduction, indicating that these organisms might be recent acquisitions, or that these endosymbionts are frequently replaced.

We were not able to directly identify the mechanisms by which the endosymbionts escape the hosts' immune system, but we were able to identify several genes of unknown function that contain secretion peptides. These genes could prove interesting for further study of the interactions between the endosymbionts and their hosts.
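The tetranucleotide frequency profiling mentioned for the M. contortus endosymbiont can be sketched in a few lines. This is a minimal illustration of the general idea only, not the binning software used in Paper II: the function names are hypothetical, the toy sequences are built from repeated motifs purely so that the example is deterministic, and refinements used by real binning tools (for example merging reverse-complement tetramers or adding coverage information) are left out.

```python
from collections import Counter
from itertools import product
import math

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_profile(seq):
    """Return the relative frequency of each tetranucleotide in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]

def profile_distance(p, q):
    """Euclidean distance between two tetranucleotide profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy "genomes" (assumption: repeated motifs, not real sequence data).
genome_1 = "ATTATAAATTTAGAT" * 40
genome_2 = "GCCGGCGCGGCCGAC" * 40

contig_a = genome_1[:300]    # two contigs from the same toy genome
contig_b = genome_1[7:307]
contig_c = genome_2[:300]    # one contig from a different toy genome

d_same = profile_distance(tetra_profile(contig_a), tetra_profile(contig_b))
d_diff = profile_distance(tetra_profile(contig_a), tetra_profile(contig_c))
print(f"same genome: {d_same:.3f}, different genome: {d_diff:.3f}")
```

Contigs from the same genome tend to have similar profiles and therefore a small distance, which is what allows contigs to be grouped into genome bins; in practice a clustering step would be applied to the profiles of all contigs.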

Paper III – Evolution of hydrogenosomes in ciliates

Archaeal endosymbionts are most commonly associated with hydrogen-producing mitochondria, called hydrogenosomes. The endosymbionts benefit from a steady supply of hydrogen for methanogenesis provided by the hydrogenosomes, whose metabolism in turn benefits from the lowered partial pressure of hydrogen that results from this interaction.

In paper II we provided the first genomic data from two such endosymbiont/hydrogenosome systems. The ciliate hosts in those relationships were Nyctotherus ovalis and Metopus contortus. N. ovalis was the first ciliate shown to contain a hydrogenosome that had retained a mitochondrial genome. This discovery confirmed that the hydrogenosomes of N. ovalis were derived from aerobic mitochondria, and that the genome contained genes for a partial electron transport chain.

Based on 18S phylogenies, there is evidence that anaerobic lifestyles, and hydrogenosomes, have evolved independently on multiple occasions in different ciliate lineages. In order to better understand the evolution of hydrogenosomes, we investigated whether anaerobic ciliates from three different lineages contain a mitochondrial genome: Metopus contortus, Metopus es, Metopus striatus, Plagiopyla frontata, Trimyema finlayi and Cyclidium porcatum, together with a re-sequencing of the hydrogenosome of Nyctotherus ovalis. We were successful in obtaining at least partial genome sequences from five species: M. contortus, M. es, M. striatus, C. porcatum and N. ovalis. For P. frontata and T. finlayi no sequences could be obtained, suggesting that they lack mtDNA altogether.

Transcriptomic data were also generated for each of these ciliates and used to reconstruct key pathways that have been shown to function in the mitochondria of aerobic ciliates, as well as pathways that function in the hydrogenosomes of other anaerobic eukaryotes.
Combined with phylogenetic analysis of key enzymes, this revealed that diverse patterns of gene retention and gene loss, as well as gene acquisition by lateral gene transfer, have had a key role in shaping how the hydrogenosomes of these ciliates function and provide the ciliates with energy in anoxic environments.

Paper IV – Marine Group IV and the evolution of Haloarchaea

The haloarchaea, sometimes referred to as the halobacteria, are a group of halophilic, aerobic archaea of the phylum euryarchaea. They are thought to have evolved from anaerobic methanogens, a view supported by their placement in the phylogenetic tree of life as a sister clade to the . How this transition took place is currently a debated subject. It has been suggested that a large influx of genes (> 1000) of bacterial origin, via horizontal gene transfer, facilitated this transition. However, more recent work has disputed this claim, arguing that such conclusions were drawn from insufficient taxon sampling.

In this paper we perform a gene flow analysis of the euryarchaea, with a focus on the haloarchaeal branch. This analysis was done using tree-reconciliation methods and with a much increased taxon sampling. Not only did we include many more haloarchaeal genomes in this analysis, but we also obtained 5 new genome bins of the Marine Group IV archaea, which were binned from publicly available data from the TARA Oceans project. These genomes represent the first genomic data from this group of archaea, and are of interest since they are not halophilic, but rather come from marine environments. They have also been shown to be a sister branch to the haloarchaea. In addition, we examined the recently discovered methanonatronarchaeia and their phylogenetic relationship to the haloarchaea. While we found that they probably do not represent a sister branch to the haloarchaea, we did include them in the gene flow analysis.

While our results are preliminary, and additional work is required, they show that the inflow of genetic material via horizontal gene transfer was not as high as previously claimed.
It should be mentioned that the topology of the euryarchaeal tree used in this study for species-tree/gene-tree reconciliation is still under investigation, and the reconciliation method itself also needs further improvement.

Perspectives

Via DNA (and RNA) sequencing we now have the means to peer into the blueprints of all organisms on earth. This has made us realize that being able to do so does not mean that we understand how it all works. There is much more complexity to life than just genes and their products. Advances in regulation, epigenetics, protein-protein interactions, etc. have shown that we still have much to learn.

Despite this, we have come a long way since the first viral genome was sequenced. We have discovered new forms of life, living in some of the most extreme environments on earth. From these genomes we have seen how life has adapted to overcome all sorts of challenges. We have expanded our own toolboxes for medical, agricultural, biotechnical and other fields of research. With the culture-independent methods developed over the last decades we are more capable than ever of exploring organisms that were previously out of our reach. This allows us to better understand how life has evolved on earth over billions of years.

Looking back over just the last 10 years or so, the amount of sequence data we are now able to generate is astonishing. It has been pointed out before, but it is worth repeating that making sense of this abundance of information is one of the true challenges of the coming years.

The large-scale genomics projects that have been undertaken, from the Human Genome Project to the TARA Oceans sampling expeditions, have also made a vast amount of data publicly accessible. This has in a way levelled the playing field to a point where it is no longer only the big genome centres that have access to large-scale datasets. Labs from all over the world can now access these data and try to find key organisms in order to answer their biological research questions. I do believe that these kinds of efforts are very worthwhile, and that society as a whole will benefit from more of them.
Of course one should remember that comparative genomics is just that, comparative. While these methods are powerful when it comes to correlations and for suggesting new avenues of research, there is still a great need for experimental efforts in understanding genomic traits. We should also remain wary of how we interpret and use automatic methods for annotating and analyzing genomes. Predicted functions are just that, predicted, and verification should remain a priority if we ever want to truly understand how the organisms we study work.

Swedish summary

Sequencing an organism's DNA, that is, determining the order of the four nucleotides (A, T, C and G), has been possible for several decades. In the 1970s, when DNA sequencing first began, it was a very slow and resource-intensive process. Sequencing the entire genetic material, the genome, of a virus of about 5,000 nucleotides (or base pairs) took weeks of work and cost tens of thousands of Swedish kronor. Many were therefore skeptical when it was proposed to sequence the entire human genome, which is more than 3 billion base pairs. Despite this, the work began at the end of the 1980s. The project was deemed complete in 2004, and had by then cost close to two and a half billion Swedish kronor. It thus cost almost one krona to determine a single base pair. During this time, however, there was an explosion of innovation in the field of DNA sequencing. Today we have access to methods that can produce a human genome in less than a week, at a cost of about 12,000 Swedish kronor. From costing one krona per base pair, the price has thus dropped to about 0.0004 öre in just 20 years. It is hard to think of other technologies that have had an equally incredible price development.

At the same time, the field of microbiology has advanced strongly. Previously, researchers were largely limited to studying the microbes that can be grown in the lab. A large number of cells were needed of the bacteria or other microbes one wanted to study, for example by DNA sequencing. This has given us a skewed picture of the diversity of microorganisms. It is now a well-known fact that only about 1 percent of all species of microorganisms found in nature can be cultivated in laboratory environments at all. We have therefore missed studying about 99% of all microbial life on our planet. Over the last ten years, great effort has therefore been put into developing so-called culture-independent methods, where we can skip the step of culturing these organisms.
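The per-base-pair cost figures quoted in this summary can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the rounded numbers from the text (3 billion base pairs, 2.5 billion and 12,000 Swedish kronor), and the variable and function names are hypothetical, chosen here for readability.

```python
# Rounded figures quoted in the summary (Swedish kronor, SEK).
HUMAN_GENOME_BP = 3_000_000_000   # "more than 3 billion base pairs"
HGP_COST_SEK = 2_500_000_000      # "close to two and a half billion"
MODERN_COST_SEK = 12_000          # "about 12,000"
ORE_PER_SEK = 100                 # 1 krona = 100 öre


def cost_per_bp(total_cost_sek: float, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Return the cost in SEK of determining a single base pair."""
    return total_cost_sek / genome_bp


# Cost per base pair for the Human Genome Project, in SEK.
hgp_sek_per_bp = cost_per_bp(HGP_COST_SEK)
# Cost per base pair with current methods, converted to öre.
modern_ore_per_bp = cost_per_bp(MODERN_COST_SEK) * ORE_PER_SEK
# How many times cheaper a human genome has become.
fold_drop = HGP_COST_SEK / MODERN_COST_SEK

print(f"Human Genome Project: {hgp_sek_per_bp:.2f} SEK per base pair")
print(f"Today: {modern_ore_per_bp:.4f} öre per base pair")
print(f"Price drop: roughly {fold_drop:,.0f}-fold")
```

With these rounded inputs the first figure comes out at about 0.83 SEK per base pair ("almost one krona") and the second at about 0.0004 öre, in line with the numbers given in the text.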
The combination of the new, modern sequencing methods that have been developed, together with the fact that we can now map the genomes of organisms we previously could not study, has led us to discover a myriad of microorganisms we did not previously know existed. These discoveries come from all parts of the tree of life (Figure 6).


Figure 6. The so-called tree of life, in which the three major branches are Bacteria, Archaea and Eukaryotes. We humans are part of the eukaryotic group, and we share it with all multicellular life. Most biological diversity, however, is not made up of multicellular life but of microorganisms. These include the bacteria, the archaea and unicellular eukaryotes.

Besides bacteria that may have a negative impact on our health, so-called pathogens, we have now also discovered entirely new groups of archaea, including species that appear to represent our very earliest relatives. This group of archaea is now called the Asgard archaea, and contains species named after various Norse gods such as Loki, Thor, Odin, Freyja and others. It appears that the very first eukaryotes on earth arose when an Asgard archaeon entered into a symbiotic relationship with a bacterium, which over billions of years has evolved into what we today call the mitochondrion. Asgard archaea are a good example of how we can learn more about evolution on earth by sequencing DNA from microorganisms using culture-independent techniques. Biotechnology is another field where great progress has been made thanks to our mapping of how new species of microorganisms can carry out different processes. One such example is biofuels, where researchers are now trying to mimic the algae and cyanobacteria that can convert sugar, or even just sunlight and carbon dioxide, into biofuel. In nature, billions of years of evolution have shaped a wealth of methods for carrying out various biochemical reactions. It has been said before, but it is worth repeating, that nature is the greatest innovator we know of.

In this thesis, my co-authors and I present a new method for discovering new species of microorganisms by using a gene that is shared by all organisms on earth. Similar studies have been possible before, but we show for the first time how a much larger fragment can be sequenced, which is an advantage when constructing so-called phylogenetic trees from these sequences. These phylogenetic experiments lay the foundation for our understanding of how different organisms are related to one another. I also use culture-independent techniques to map the genomes of organisms we have not previously been able to study, such as distant relatives of the haloarchaea. Haloarchaea are halophilic (salt-loving) archaea that live in extremely saline environments such as the Dead Sea and salt deserts. They are aerobic, which means that they need oxygen to live and reproduce. Interestingly, these haloarchaea have evolved from methanogenic (methane-producing) archaea, which cannot survive in the presence of oxygen. We try to answer how this transformation came about by studying DNA from a group of organisms that appear to lie between these two groups.

I have also sequenced the genomes of two archaea that live in endosymbiotic relationships with hydrogen-producing mitochondria, inside eukaryotic cells. That archaea can live inside eukaryotic cells and make use of the hydrogen gas produced by these mitochondria has long been known. Until now, however, there have been no such endosymbiotic archaea whose entire genome has been mapped. These new genomes will hopefully increase our understanding of how archaea, and other organisms, can live in these very intimate, endosymbiotic relationships.

Acknowledgements

My journey at the university started with me applying to a course in psychology at Uppsala University in 2004. This became a full year of psychology classes, until we started reading more about “biological psychology”, or neurology as one might also call it. I thoroughly enjoyed that class and realised that this was much more interesting to me, which made me start the biology program instead. Towards the end of my bachelor’s, I took a small evening course called “Microbial diversity”, where guest lecturers told us about various interesting microbes. This included lectures about archaea, which I thought were just the best thing ever. Organisms living in 80°C acid baths? Sign me up! I contacted the head teacher of that course, and inquired about any positions for doing a summer internship in his lab. He told me that he didn’t have any openings, but instead directed me to a colleague of his, Thijs Ettema. This led to me doing an internship, my master’s thesis, and finally my PhD in Thijs’ lab. So for that I am very thankful to Rolf Bernander, who was the head teacher of that microbial diversity course, and the one who first pointed me towards my academic future.

So first of all, my biggest thanks go to Thijs for all these years we’ve been working together. I think I can fairly say that most of the skills I have learned as a researcher and scientist are thanks to you. You have been a great supervisor, and more than that, a very understanding and accommodating friend during this whole time.

A great big thank you also goes to my co-supervisor Lionel. You have an awesome ability to get to the core of most things, and you are very straightforward in doing so. I’ve been very lucky to have had two good supervisors to discuss and evaluate things with.
The papers in this thesis would not have been possible without all the work done by all the collaborators, so to Joran, Ian, Lina, Ignas, Olga, Will, Anja, Martin, Kacper, Sendra, Henning, Tom, Genoveva, Robert, Julian and Max, thanks for showing that science is better together. There were also several people who were kind enough to help me proofread and edit this thesis, so a big thank you goes to team-thesis: Eva, Dani, Courtney, Laura, Jennah, Jonathan and Will.

I would also like to extend a big thank you to all the wonderful people in the molecular evolution department. Siv, thank you for really being an inspiration and an awesome storyteller, it’s such a great skill you have. Then there are my fellow PhD colleagues. First of all my office-mate during this whole time, Joran. I think it is fair to say that our working styles are different, but I think they’ve been complementary. Thanks for sharing the ride and learning together. Eva, you’ve been essential to this thesis in many, many ways, and I’ll always be grateful for that. You’ve made this whole experience a whole lot nicer. Andrea, thanks for teaching me only useful Spanish words, and for being a good friend. Now that we live so close you should come over more often. Dani, you are a great teacher, and there is no one who is both so knowledgeable and so humble. Thanks for all the great advice and lessons you’ve given me. Jennah, you have an awesome spirit and thanks to you there have been a lot of fun evenings, making this job even better. Max, even though we’ve only collaborated a bit towards the end, it was great working with such a knowledgeable person. You’re always pushing for new tech, and thanks for teaching the rest of us. Alejandro, together with Joran you’ve turned into a climbing/toughest machine. Let me know if you want to have some more great photos of you in the mud-baths. Wei, now that you’re a Santiago pilgrim I expect your remaining time will be super smooth. But, please invite me for more dumplings. Guilherme, you’ve taught me not to stress too much, and that one person can learn all the dances in the world. And thank you Minna for adding your energy and sisu to the group. I still sometimes think about your performance in Dani’s spex. Kirsten, thanks for all the help in the lab and for teaching me about taking care of flies. Henning, I hope you will keep being the most environmentally friendly protist catcher by hitchhiking to your sampling sites. Don’t fall in! Disa, I hope you keep finding wonderfully strange archaeal viruses. Karl, you have the best poker face, and nobody will ever know when you are the werewolf. Emil, our time only partially overlapped, but I think you will fit in very well in the molev group.
Mayank, thanks for showing the rest of us how a proper moustache should be grown. And let me know if you ever need a ride back from Herräng again.

Then there are all the wonderful molev post-docs. Jimmy, you were one of the very first ones in the group, and you were such a great person to learn from. Few people are so calm and willing to help out. I hope you are having a great time now in the US. Anja, you’ve taught me a lot about archaea and I think you are going to kick butt now in the Netherlands. Kasia, you always have time for questions, even when I have weird math ones. Feifei, thank you for letting me be your toastmaster. And make sure that Wei doesn’t keep stealing your kids’ stuffed animals. Courtney, you’re such a great friend and it’s awesome having a trivia buddy. Let’s keep doing that. Laura, you are always helpful and easy to discuss with, and I very much look forward to our camera calendar. Will, you have been a great collaborator, and a key player in this thesis. Let me know if you want to watch some bad Swedish football some day. Erik P, you’re one of the original team archaea, going back to the EBC days. Your commitment to outreach and teaching is really inspirational. Jonathan, thanks for many great lunch discussions. And for always speaking your mind in journal clubs. Maria, the work you’ve done with droplets is so cool, and I hope it just keeps on improving. Gaelle, please continue being the queen of the GIFs. Christian S, I hope that the running with Seeger will take off. And we will see about those half-marathons in the future. Together with Erik H you guys have also shown that the experimental work has such an important place in the program. Ellie, thanks for being someone to talk Wolbachia with, and for the British influences. Yu, you often have good insights in scientific discussions. I hope you keep asking questions in the journal clubs. Ahjit, thanks for letting me be an honorary spicy food eater together with Jessin.

A great big thanks also goes to all the people working in the lab. Even though I didn’t spend so much time there towards the end, it was great to learn from and work with so many skilled people. First of all I’d like to say thank you to Kristina and Ann-Sofie who were the first people to show me how to make a proper clone library and how to do Sanger sequencing. Roel, I hope you are now enjoying your canoe in the Netherlands. Lina, you were always so helpful and I hope you’ve found your calling now. I hope to see more of you in the future. Tom, you are such a natural lab manager, and you know your stuff. Thanks for all the help, and for being ok with my messy lab drawer. A big thanks to the Si-Cell platform staff, Anna-Maria and Claudia. You guys have been a lot of help, especially in the lab.

Furthermore, I’d like to give a big thank you to all the other molev staff throughout the years. Especially Felix, who is always patient with all our computer worries. And I think you are also very good at sometimes applying the brakes when we get out of hand. I will never agree about your pizza ideas however! Lisa, you were the one who introduced me to Wolbachia, and who also helped me have my Austrian adventure. You are a great scientist, and always really good at explaining things.
Jan, thank you for teaching me about phylogenetics and evolution. Then there is the original IT master, Håkan, who taught me about ‘rm’. Ino, you were only with us for a short while, but you taught me a lot about bash. Martha, you have made our lives a lot easier. Thank you for taking care of us all. Johan, you and Lionel were the ones who taught me to program. Maybe one day I will understand half the magic you did. Christian J, A.K.A. Tyler. You were there in the very beginning and taught me that the terminal is better than the GUI. It took a while to realise you were right, but you sure were. Torbjörn, it’s great having you with us and I’m impressed that you’re now switching fields.

I would also like to thank the students who worked with me over the years, Santhanam, Electra and Guillaume. I hope you guys had a good time. I did.

Finally I’d like to thank my family. My parents Mats and Marianne, and my sister Erika and brother-in-law Roger. And of course also Linn and Sid! Thanks for being curious about what I was doing these last five years, even though I think I didn’t always explain very well what I was up to.

54 References

Acinas, S. G., R. Sarma-Rupavtarm, V. Klepac-Ceraj and M. F. Polz 2005. PCR- induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample. Applied and environmental microbiology 71(12): 8966-8969. Adessi, C., G. Matton, G. Ayala, G. Turcatti, J.-J. Mermod, P. Mayer and E. Kawashima 2000. Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic acids research 28(20): e87- e87. Akhmanova, A., F. Voncken, T. van Alen, A. van Hoek, B. Boxma, G. Vogels, M. Veenhuis and J. H. Hackstein 1998. A hydrogenosome with a genome. Nature 396(6711): 527-528. Akman, L., A. Yamashita, H. Watanabe, K. Oshima, T. Shiba, M. Hattori and S. Aksoy 2002. Genome sequence of the endocellular obligate symbiont of tsetse flies, Wigglesworthia glossinidia. Nature genetics 32(3): 402-407. Albertsen, M., P. Hugenholtz, A. Skarshewski, K. L. Nielsen, G. W. Tyson and P. H. Nielsen 2013. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature biotechnology 31(6): 533-538. Alneberg, J., B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, N. J. Loman, A. F. Andersson and C. Quince 2013. CONCOCT: clustering contigs on coverage and composition. arXiv preprint arXiv:1312.4038. Anbutsu, H. and T. Fukatsu 2010. Evasion, suppression and tolerance of Drosophila innate immunity by a male‐killing Spiroplasma endosymbiont. Insect molecular biology 19(4): 481-488. Angly, F. E., P. G. Dennis, A. Skarshewski, I. Vanwonterghem, P. Hugenholtz and G. W. Tyson 2014. CopyRighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction. Microbiome 2(1): 11. Au, K. F., J. G. Underwood, L. Lee and W. H. Wong 2012. Improving PacBio long read accuracy by short read alignment. PloS one 7(10): e46679. Babayan, A., M. Alawi, M. Gormley, V. Müller, H. Wikman, R. P. 
McMullin, D. A. Smirnov, W. Li, M. Geffken and K. Pantel 2017. Comparative study of whole genome amplification and next generation sequencing performance of single cancer cells. Oncotarget 8(34): 56066. Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham and A. D. Prjibelski 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology 19(5): 455-477. Belay, N., B. Mukhopadhyay, E. C. De Macario, R. Galask and L. Daniels 1990. Methanogenic bacteria in human vaginal samples. Journal of clinical microbiology 28(7): 1666-1668.

55 Bertani, G. 1951. Studies on lysogenesis I.: The Mode of Phage Liberation by Lysogenic Escherichia coli. Journal of bacteriology 62(3): 293. Bertani, G. 2004. Lysogeny at mid-twentieth century: P1, P2, and other experimental systems. Journal of bacteriology 186(3): 595-600. Bik, E. M., C. D. Long, G. C. Armitage, P. Loomer, J. Emerson, E. F. Mongodin, K. E. Nelson, S. R. Gill, C. M. Fraser-Liggett and D. A. Relman 2010. Bacterial diversity in the oral cavity of 10 healthy individuals. The ISME journal 4(8): 962-974. Bonen, L., R. Cunningham, M. Gray and W. Doolittle 1977. Wheat embryo mitochondrial 18S ribosomal RNA: evidence for its prokaryotic nature. Nucleic acids research 4(3): 663-671. Bordenstein, S. R., F. P. O'hara and J. H. Werren 2001. Wolbachia-induced incompatibility precedes other hybrid incompatibilities in Nasonia. Nature 409(6821): 707-710. Bowers, R. M., N. C. Kyrpides, R. Stepanauskas, M. Harmon-Smith, D. Doud, T. Reddy, F. Schulz, J. Jarett, A. R. Rivers and E. A. Eloe-Fadrosh 2017. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nature biotechnology 35(8): 725-731. Boxma, B., R. M. de Graaf, G. W. van der Staay, T. A. van Alen, G. Ricard, T. Gabaldón, A. H. Van Hoek, S. Y. Moon-Van Der Staay, W. J. Koopman and J. J. van Hellemond 2005. An anaerobic mitochondrion that produces hydrogen. Nature 434(7029): 74-79. Brady, A. and S. L. Salzberg 2009. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature methods 6(9): 673-676. Branton, D., D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs and X. Huang 2008. The potential and challenges of nanopore sequencing. Nature biotechnology 26(10): 1146-1153. Braslavsky, I., B. Hebert, E. Kartalov and S. R. Quake 2003. Sequence information can be obtained from single DNA molecules. 
Proceedings of the National Academy of Sciences 100(7): 3960-3964. Broers, C. A., H. H. Meijers, J. C. Symens, C. K. Stumm, G. D. Vogels and G. Brugerolle 1993. Symbiotic association of Psalteriomonas vulgaris n. spec. with Methanobacterium formicicum. European journal of protistology 29(1): 98-105. Brown, C. T., L. A. Hug, B. C. Thomas, I. Sharon, C. J. Castelle, A. Singh, M. J. Wilkins, K. C. Wrighton, K. H. Williams and J. F. Banfield 2015. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523(7559): 208-211. Browne, H. P., S. C. Forster, B. O. Anonye, N. Kumar, B. A. Neville, M. D. Stares, D. Goulding and T. D. Lawley 2016. Culturing of ‘unculturable’ human microbiota reveals novel taxa and extensive sporulation. Nature 533(7604): 543-546. Bult, C., O. White, G. Olsen, L. Zhou, R. Fleischmann, G. Sutton, J. Blake, L. FitzGerald, R. Clayton and J. Gocayne 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273(5278): 1058. Burke, G. R. and N. A. Moran 2011. Massive genomic decay in Serratia symbiotica, a recently evolved symbiont of aphids. Genome biology and evolution 3: 195- 208. Case, R. J., Y. Boucher, I. Dahllöf, C. Holmström, W. F. Doolittle and S. Kjelleberg 2007. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Applied and environmental microbiology 73(1): 278-288.

56 Cavalier-Smith, T. 1975. The origin of nuclei and of eukaryotic cells. Nature 256(5517): 463-468. Cavicchioli, R., P. M. Curmi, N. Saunders and T. Thomas 2003. Pathogenic archaea: do they exist? Bioessays 25(11): 1119-1128. Chacinska, A., C. M. Koehler, D. Milenkovic, T. Lithgow and N. Pfanner 2009. Importing mitochondrial proteins: machineries and mechanisms. Cell 138(4): 628-644. Clarke, K. J., B. J. Finlay, G. Esteban, B. E. Guhl and T. M. Embley 1993. Cyclidium porcatum n. sp.: a free-living anaerobic scuticociliate containing a stable complex of hydrogenosomes, eubacteria and archaeobacteria. European journal of protistology 29(2): 262-270. Crosby, L. D. and C. S. Criddle 2003. Understanding bias in microbial community analysis techniques due to rrn operon copy number heterogeneity. Biotechniques 34(4): 790-803. Dale, C., S. A. Young, D. T. Haydon and S. C. Welburn 2001. The insect endosymbiont Sodalis glossinidius utilizes a type III secretion system for cell invasion. Proceedings of the National Academy of Sciences 98(4): 1883-1888. De Bourcy, C. F., I. De Vlaminck, J. N. Kanbar, J. Wang, C. Gawad and S. R. Quake 2014. A quantitative comparison of single-cell whole genome amplification methods. PloS one 9(8): e105585. de Lipthay, J. R., C. Enzinger, K. Johnsen, J. Aamand and S. J. Sørensen 2004. Impact of DNA extraction method on bacterial community composition measured by denaturing gradient gel electrophoresis. Soil Biology and Biochemistry 36(10): 1607-1614. Dekker, C. 2007. Solid-state nanopores. Nature nanotechnology 2(4): 209-215. Delmont, T. O., P. Robe, S. Cecillon, I. M. Clark, F. Constancias, P. Simonet, P. R. Hirsch and T. M. Vogel 2011. Accessing the soil metagenome for studies of microbial diversity. Applied and Environmental Microbiology 77(4): 1315- 1324. Dodsworth, J. A., P. C. Blainey, S. K. Murugapiran, W. D. Swingley, C. A. Ross, S. G. Tringe, P. S. Chain, M. B. Scholz, C.-C. Lo and J. Raymond 2013. 
Single- cell and metagenomic analyses indicate a fermentative and saccharolytic lifestyle for members of the OP9 lineage. Nature communications 4: 1854. Doolittle, W. F. 1998. You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends in Genetics 14(8): 307-311. Douglas, S., S. Zauner, M. Fraunholz, M. Beaton, S. Penny, L.-T. Deng, X. Wu, M. Reith, T. Cavalier-Smith and U.-G. Maier 2001. The highly reduced genome of an enslaved algal nucleus. Nature 410(6832): 1091-1096. Drancourt, M., V. D. Nkamga, N. A. Lakhe, J.-M. Régis, H. Dufour, P.-E. Fournier, Y. Bechah, W. Michael Scheld and D. Raoult 2017. Evidence of Archaeal Methanogens in Brain Abscess. Clinical Infectious Diseases 65(1): 1-5. Dridi, B., M. Henry, A. El Khechine, D. Raoult and M. Drancourt 2009. High prevalence of Methanobrevibacter smithii and Methanosphaera stadtmanae detected in the human gut using an improved DNA detection protocol. PloS one 4(9): e7063. Duhaime, M. B., L. Deng, B. T. Poulos and M. B. Sullivan 2012. Towards quantitative metagenomics of wild viruses and other ultra‐low concentration DNA samples: a rigorous assessment and optimization of the linker amplification method. Environmental Microbiology 14(9): 2526-2537. Eckburg, P. B., P. W. Lepp and D. A. Relman 2003. Archaea and their potential role in human disease. Infection and immunity 71(2): 591-596.

57 Ellegaard, K. M., L. Klasson and S. G. Andersson 2013. Testing the reproducibility of multiple displacement amplification on genomes of clonal endosymbiont populations. PloS one 8(11): e82319. Embley, T. M. and B. J. Finlay 1993. Systematic and morphological diversity of endosymbiotic methanogens in anaerobic ciliates. Antonie van Leeuwenhoek 64(3): 261-271. Embley, T. M., B. J. Finlay, P. L. Dyal, R. P. Hirt, M. Wilkinson and A. G. Williams 1995. Multiple origins of anaerobic ciliates with hydrogenosomes within the radiation of aerobic ciliates. Proceedings of the Royal Society of London B: Biological Sciences 262(1363): 87-93. Epstein, S. 2013. The phenomenon of microbial uncultivability. Current opinion in microbiology 16(5): 636-642. Esteban, J. A., M. Salas and L. Blanco 1993. Fidelity of phi 29 DNA polymerase. Comparison between protein-primed initiation and DNA polymerization. Journal of Biological Chemistry 268(4): 2719-2726. Evans, P. N., D. H. Parks, G. L. Chadwick, S. J. Robbins, V. J. Orphan, S. D. Golding and G. W. Tyson 2015. Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science 350(6259): 434-438. Fenchel, T. and B. Finlay 1991. Endosymbiotic methanogenic bacteria in anaerobic ciliates: significance for the growth efficiency of the host. Journal of Eukaryotic Microbiology 38(1): 18-22. Fiers, W., G. Haegeman, D. Iserentant and W. Min Jou 1972. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature(237). Finlay, B., T. Embley and T. Fenchel 1993. A new polymorphic methanogen, closely related to Methanocorpusculum parvum, living in stable symbiosis within the anaerobic ciliate Trimyema sp. Microbiology 139(2): 371-378. Fisher, R. M., L. M. Henry, C. K. Cornwallis, E. T. Kiers and S. A. West 2017. The evolution of host-symbiont dependence. Nature Communications 8. Fleige, S. and M. W. Pfaffl 2006. RNA integrity and the effect on the real-time qRT- PCR performance. 
Molecular aspects of medicine 27(2): 126-139. Fleischmann, R. D., M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J.-F. Tomb, B. A. Dougherty and J. M. Merrick 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. science 269(5223): 496-512. Frumkin, D., A. Wasserstrom, S. Itzkovitz, A. Harmelin, G. Rechavi and E. Shapiro 2008. Amplification of multiple genomic loci from single cells isolated by laser micro-dissection of tissues. BMC biotechnology 8(1): 17. Gasser, S., G. Daum and G. Schatz 1982. Import of proteins into mitochondria. Energy-dependent uptake of precursors by isolated mitochondria. Journal of Biological Chemistry 257(21): 13034-13041. Gawryluk, R. M., R. Kamikawa, C. W. Stairs, J. D. Silberman, M. W. Brown and A. J. Roger 2016. The earliest stages of mitochondrial adaptation to low oxygen revealed in a novel rhizarian. Current Biology 26(20): 2729-2738. Giordano, F., L. Aigrain, M. A. Quail, P. Coupland, J. K. Bonfield, R. M. Davies, G. Tischler, D. K. Jackson, T. M. Keane and J. Li 2017. De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms. Scientific reports 7:3935. Giovannoni, S. J., T. B. Britschgi, C. L. Moyer and K. G. Field 1990. Genetic diversity in Sargasso Sea bacterioplankton. Nature 345(6270): 60-63.

58 Goffeau, A., B. G. Barrell, H. Bussey, R. Davis, B. Dujon, H. Feldmann, F. Galibert, J. Hoheisel, C. Jacq and M. Johnston 1996. Life with 6000 genes. Science 274(5287): 546-567. Goodwin, S., J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. C. Schatz and W. R. McCombie 2015. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome research 25(11): 1750-1756. Goodwin, S., J. D. McPherson and W. R. McCombie 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics 17(6): 333-351. Goosen, N. K., A. M. Horemans, S. J. Hillebrand, C. K. Stumm and G. D. Vogels 1988. Cultivation of the sapropelic ciliate Plagiopyla nasuta Stein and isolation of the endosymbiont Methanobacterium formicicum. Archives of microbiology 150(2): 165-170. Gray, M. W. 2012. Mitochondrial evolution. Cold Spring Harbor perspectives in biology 4(9): a011403. Greninger, A. L., K. Messacar, T. Dunnebacke, S. N. Naccache, S. Federman, J. Bouquet, D. Mirsky, Y. Nomura, S. Yagi and C. Glaser 2015. Clinical metagenomic identification of Balamuthia mandrillaris encephalitis and assembly of the draft genome: the continuing case for reference genome sequencing. Genome medicine 7(1): 113. Gutiérrez, G. 2012. Draft genome sequence of Methanobacterium formicicum DSM 3637, an archaebacterium isolated from the methane producer amoeba Pelomyxa palustris. Journal of bacteriology 194(24): 6967-6968. Gutiérrez, G., L. V. Chistyakova, E. Villalobo, A. Y. Kostygov and A. O. Frolov 2017. Identification of Pelomyxa palustris Endosymbionts. Protist 168(4): 408- 424. Guy, L. and T. J. Ettema 2011. The archaeal ‘TACK’superphylum and the origin of eukaryotes. Trends in microbiology 19(12): 580-587. Haas, B. J., D. Gevers, A. M. Earl, M. Feldgarden, D. V. Ward, G. Giannoukos, D. Ciulla, D. Tabbaa, S. K. Highlander and E. Sodergren 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. 
Genome research 21(3): 494-504. Hackstein, J. H. 2010. (endo) symbiotic methanogenic archaea, Springer Science & Business Media. Hammond, M., F. Homa, H. Andersson-Svahn, T. J. Ettema and H. N. Joensson 2016. Picodroplet partitioned whole genome amplification of low biomass samples preserves genomic diversity for metagenomic analysis. Microbiome 4(1): 52. Harismendy, O., P. C. Ng, R. L. Strausberg, X. Wang, T. B. Stockwell, K. Y. Beeson, N. J. Schork, S. S. Murray, E. J. Topol and S. Levy 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome biology 10(3): R32. Hayashi, T., K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C.-G. Han, E. Ohtsubo, K. Nakayama and T. Murata 2001. Complete genome sequence of enterohemorrhagic Eschelichia coli O157: H7 and genomic comparison with a laboratory strain K-12. DNA research 8(1): 11-22. Heather, J. M. and B. Chain 2016. The sequence of sequencers: the history of sequencing DNA. Genomics 107(1): 1-8. Hentschel, U., M. Steinert and J. Hacker 2000. Common molecular mechanisms of symbiosis and pathogenesis. Trends in microbiology 8(5): 226-231.

High, N., J. Mounier, M. C. Prevost and P. J. Sansonetti 1992. IpaB of Shigella flexneri causes entry into epithelial cells and escape from the phagocytic vacuole. The EMBO journal 11(5): 1991.
Hilgenboecker, K., P. Hammerstein, P. Schlattmann, A. Telschow and J. H. Werren 2008. How many species are infected with Wolbachia?–a statistical analysis of current data. FEMS microbiology letters 281(2): 215-220.
Hoenen, T., A. Groseth, K. Rosenke, R. J. Fischer, A. Hoenen, S. D. Judson, C. Martellaro, D. Falzarano, A. Marzi and R. B. Squires 2016. Nanopore sequencing as a rapidly deployable Ebola outbreak tool. Emerging infectious diseases 22(2): 331.
Holley, R. W., J. Apgar, G. A. Everett, J. T. Madison, M. Marquisee, S. H. Merrill, J. R. Penswick and A. Zamir 1965. Structure of a ribonucleic acid. Science: 1462-1465.
Huber, H., M. J. Hohn, R. Rachel, T. Fuchs, V. C. Wimmer and K. O. Stetter 2002. A new phylum of Archaea represented by a nanosized hyperthermophilic symbiont. Nature 417(6884): 63-67.
Huet, J., R. Schnabel, A. Sentenac and W. Zillig 1983. Archaebacteria and eukaryotes possess DNA-dependent RNA polymerases of a common type. The EMBO journal 2(8): 1291.
Hug, L. A., B. J. Baker, K. Anantharaman, C. T. Brown, A. J. Probst, C. J. Castelle, C. N. Butterfield, A. W. Hernsdorf, Y. Amano and K. Ise 2016. A new view of the tree of life. Nature Microbiology 1: 16048.
Hurwitz, B. L. and M. B. Sullivan 2013. The Pacific Ocean Virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PloS one 8(2): e57355.
Huse, S. M., J. A. Huber, H. G. Morrison, M. L. Sogin and D. M. Welch 2007. Accuracy and quality of massively parallel DNA pyrosequencing. Genome biology 8(7): R143.
Husnik, F. and J. P. McCutcheon 2016. Repeated replacement of an intrabacterial symbiont in the tripartite nested mealybug symbiosis. Proceedings of the National Academy of Sciences: 201603910.
Idury, R. M. and M. S. Waterman 1995. A new algorithm for DNA sequence assembly. Journal of computational biology 2(2): 291-306.
Ingman, M., H. Kaessmann, S. Pääbo and U. Gyllensten 2000. Mitochondrial genome variation and the origin of modern humans. Nature 408(6813): 708-713.
Inoue, K., T. Tanikawa and T. Arai 2008. Micro-manipulation system with a two-fingered micro-hand and its potential application in bioscience. Journal of biotechnology 133(2): 219-224.
International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
Iwabe, N., K.-i. Kuma, M. Hasegawa, S. Osawa and T. Miyata 1989. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proceedings of the National Academy of Sciences 86(23): 9355-9359.
Kader, T., D. L. Goode, S. Q. Wong, J. Connaughton, S. M. Rowley, L. Devereux, D. Byrne, S. B. Fox, G. M. Arnau and R. W. Tothill 2016. Copy number analysis by low coverage whole genome sequencing using ultra low-input DNA from formalin-fixed paraffin embedded tumor tissue. Genome medicine 8(1): 121.

Karlsson, E., A. Lärkeryd, A. Sjödin, M. Forsman and P. Stenberg 2015. Scaffolding of a bacterial genome using MinION nanopore sequencing. Scientific reports 5: 11996.
Karst, S. M., M. S. Dueholm, S. J. McIlroy, R. H. Kirkegaard, P. H. Nielsen and M. Albertsen 2016. Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life. bioRxiv: 070771.
Kaul, S., H. L. Koo, J. Jenkins, M. Rizzo, T. Rooney, L. J. Tallon, T. Feldblyum, W. Nierman, M. I. Benito and X. Lin 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814): 796-815.
Kikuchi, Y. 2009. Endosymbiotic bacteria in insects: their diversity and culturability. Microbes and Environments 24(3): 195-204.
Kispal, G., P. Csere, C. Prohl and R. Lill 1999. The mitochondrial proteins Atm1p and Nfs1p are essential for biogenesis of cytosolic Fe/S proteins. The EMBO journal 18(14): 3981-3989.
Klindworth, A., E. Pruesse, T. Schweer, J. Peplies, C. Quast, M. Horn and F. O. Glöckner 2013. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic acids research 41(1): e1-e1.
Koren, S. and A. M. Phillippy 2015. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current opinion in microbiology 23: 110-120.
Koren, S., M. C. Schatz, B. P. Walenz, J. Martin, J. T. Howard, G. Ganapathy, Z. Wang, D. A. Rasko, W. R. McCombie and E. D. Jarvis 2012. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology 30(7): 693-700.
Kuleshov, V., D. Xie, R. Chen, D. Pushkarev, Z. Ma, T. Blauwkamp, M. Kertesz and M. Snyder 2014. Whole-genome haplotyping using long reads and statistical methods. Nature biotechnology 32(3): 261-266.
Kuriwada, T., T. Hosokawa, N. Kumano, K. Shiromoto, D. Haraguchi and T. Fukatsu 2010. Biological role of Nardonella endosymbiont in its weevil host. PLoS One 5(10): e13101.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle and W. FitzHugh 2001. Initial sequencing and analysis of the human genome. Nature 409(6822): 860-921.
Langer, D., J. Hain, P. Thuriaux and W. Zillig 1995. Transcription in archaea: similarity to that in eucarya. Proceedings of the National Academy of Sciences 92(13): 5768-5772.
Langille, M. G., J. Zaneveld, J. G. Caporaso, D. McDonald, D. Knights, J. A. Reyes, J. C. Clemente, D. E. Burkepile, R. L. V. Thurber and R. Knight 2013. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature biotechnology 31(9): 814-821.
Lasken, R. S. and T. B. Stockwell 2007. Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC biotechnology 7(1): 19.
Laver, T., J. Harrison, P. O'Neill, K. Moore, A. Farbos, K. Paszkiewicz and D. J. Studholme 2015. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomolecular detection and quantification 3: 1-8.
Lee, M. J., P. J. Schreurs, A. C. Messer and S. H. Zinder 1987. Association of methanogenic bacteria with flagellated protozoa from a termite hindgut. Current Microbiology 15(6): 337-341.

Leplae, R., D. Geeraerts, R. Hallez, J. Guglielmini, P. Drèze and L. Van Melderen 2011. Diversity of bacterial type II toxin–antitoxin systems: a comprehensive search and functional analysis of novel families. Nucleic acids research 39(13): 5513-5525.
Lepp, P. W., M. M. Brinig, C. C. Ouverney, K. Palm, G. C. Armitage and D. A. Relman 2004. Methanogenic Archaea and human periodontal disease. Proceedings of the National Academy of Sciences of the United States of America 101(16): 6176-6181.
Li, Z., Y. Chen, D. Mu, J. Yuan, Y. Shi, H. Zhang, J. Gan, N. Li, X. Hu and B. Liu 2012. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Briefings in functional genomics 11(1): 25-37.
Lindsay, S. 2016. The promises and challenges of solid-state sequencing. Nature nanotechnology 11(2): 109-111.
Liu, L., Y. Li, S. Li, N. Hu, Y. He, R. Pong, D. Lin, L. Lu and M. Law 2012. Comparison of next-generation sequencing systems. BioMed Research International 2012.
López-García, P., A. López-López, D. Moreira and F. Rodríguez-Valera 2001a. Diversity of free-living prokaryotes from a deep-sea site at the Antarctic Polar Front. FEMS Microbiology Ecology 36(2-3): 193-202.
López-García, P., D. Moreira, A. López-López and F. Rodríguez-Valera 2001b. A novel haloarchaeal-related lineage is widely distributed in deep oceanic regions. Environmental microbiology 3(1): 72-78.
Lu, Y. Y., T. Chen, J. A. Fuhrman and F. Sun 2017. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge. Bioinformatics 33(6): 791-798.
Luo, C., D. Tsementzi, N. Kyrpides, T. Read and K. T. Konstantinidis 2012. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PloS one 7(2): e30087.
Maddox, B. 2003. The double helix and the 'wronged heroine'. Nature 421(6921): 407-408.
Manor, O., R. Levy and E. Borenstein 2014. Mapping the inner workings of the microbiome: genomic- and metagenomic-based study of metabolism and metabolic interactions in the human microbiome. Cell metabolism 20(5): 742-752.
Marcy, Y., T. Ishoey, R. S. Lasken, T. B. Stockwell, B. P. Walenz, A. L. Halpern, K. Y. Beeson, S. M. Goldberg and S. R. Quake 2007. Nanoliter reactors improve multiple displacement amplification of genomes from single cells. PLoS genetics 3(9): e155.
Margulies, M., M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y.-J. Chen and Z. Chen 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057): 376-380.
Margulis, L. 1970. Origin of eukaryotic cells: Evidence and research implications for a theory of the origin and evolution of microbial, plant and animal cells on the precambrian Earth, Yale University Press.
McCoy, R. C., R. W. Taylor, T. A. Blauwkamp, J. L. Kelley, M. Kertesz, D. Pushkarev, D. A. Petrov and A.-S. Fiston-Lavier 2014. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PloS one 9(9): e106689.
McCutcheon, J. P. and N. A. Moran 2012. Extreme genome reduction in symbiotic bacteria. Nature Reviews Microbiology 10(1): 13-26.

McIlroy, S. J., R. H. Kirkegaard, M. S. Dueholm, E. Fernando, S. M. Karst, M. Albertsen and P. H. Nielsen 2017. Culture-Independent Analyses Reveal Novel Anaerolineaceae as Abundant Primary Fermenters in Anaerobic Digesters Treating Waste Activated Sludge. Frontiers in microbiology 8: 1134.
McKernan, K. J., H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R. Clouser, C. Duncan, J. K. Ichikawa and C. C. Lee 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research 19(9): 1527-1541.
Merhej, V., M. Royer-Carenzi, P. Pontarotti and D. Raoult 2009. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biology direct 4(1): 13.
Miller, D., J. Bryant, E. Madsen and W. Ghiorse 1999. Evaluation and optimization of DNA extraction and purification procedures for soil and sediment samples. Applied and environmental microbiology 65(11): 4715-4724.
Miller, J. R., S. Koren and G. Sutton 2010a. Assembly algorithms for next-generation sequencing data. Genomics 95(6): 315-327.
Miller, W. J., L. Ehrman and D. Schneider 2010b. Infectious speciation revisited: impact of symbiont-depletion on female fitness and mating behavior of Drosophila paulistorum. PLoS Pathogens 6(12): e1001214.
Mitra, R. D. and G. M. Church 1999. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Research 27(24): e34-e39.
Moran, N. A. 1996. Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proceedings of the National Academy of Sciences 93(7): 2873-2878.
Moran, N. A. and P. Baumann 2000. Bacterial endosymbionts in animals. Current opinion in microbiology 3(3): 270-275.
Morozova, O. and M. A. Marra 2008. Applications of next-generation sequencing technologies in functional genomics. Genomics 92(5): 255-264.
Mourier, T., A. J. Hansen, E. Willerslev and P. Arctander 2001. The Human Genome Project reveals a continuous transfer of large mitochondrial fragments to the nucleus. Molecular Biology and Evolution 18(9): 1833-1837.
Moya, A., J. Peretó, R. Gil and A. Latorre 2008. Learning how to live together: Genomic insights into prokaryote–animal symbioses. Nature Reviews Genetics 9(3).
Muir, P., S. Li, S. Lou, D. Wang, D. J. Spakowicz, L. Salichos, J. Zhang, G. M. Weinstock, F. Isaacs and J. Rozowsky 2016. The real cost of sequencing: scaling computation to keep pace with data generation. Genome biology 17(1): 53.
Müller, M., M. Mentel, J. J. van Hellemond, K. Henze, C. Woehle, S. B. Gould, R.-Y. Yu, M. van der Giezen, A. G. Tielens and W. F. Martin 2012. Biochemistry and evolution of anaerobic energy metabolism in eukaryotes. Microbiology and Molecular Biology Reviews 76(2): 444-495.
Nakabachi, A. and H. Ishikawa 1999. Provision of riboflavin to the host aphid, Acyrthosiphon pisum, by endosymbiotic bacteria, Buchnera. Journal of Insect Physiology 45(1): 1-6.
Narayanan, N., B. Krishnakumar, V. N. Anupama and V. B. Manilal 2009. Methanosaeta sp., the major archaeal endosymbiont of Metopus es. Research in microbiology 160(8): 600-607.
National Research Council (1988). Mapping and sequencing the human genome, National Academies Press.

Niedringhaus, T. P., D. Milanova, M. B. Kerby, M. P. Snyder and A. E. Barron 2011. Landscape of next-generation sequencing technologies. Analytical chemistry 83(12): 4327-4341.
Nishikawa, Y., M. Hosokawa, T. Maruyama, K. Yamagishi, T. Mori and H. Takeyama 2015. Monodisperse picoliter droplets for low-bias and contamination-free reactions in single-cell whole genome amplification. PloS one 10(9): e0138733.
Normand, E., S. Qdaisat, W. Bi, C. Shaw, I. Van den Veyver, A. Beaudet and A. Breman 2016. Comparison of three whole genome amplification methods for detection of genomic aberrations in single cells. Prenatal diagnosis 36(9): 823-830.
Nyrén, P., B. Pettersson and M. Uhlén 1993. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Analytical biochemistry 208(1): 171-175.
O'Leary, N. A., M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput, B. Robbertse, B. Smith-White and D. Ako-Adjei 2015. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research 44(D1): D733-D745.
O'Neill, S. L., J. Werren and A. Hoffmann 1997. Influential passengers: inherited microorganisms and arthropod reproduction.
Odelson, D. A. and J. A. Breznak 1985. Nutrition and growth characteristics of Trichomitopsis termopsidis, a cellulolytic protozoan from termites. Applied and Environmental Microbiology 49(3): 614-621.
Oulhen, N., B. J. Schulz and T. J. Carrier 2016. English translation of Heinrich Anton de Bary’s 1878 speech, ‘Die Erscheinung der Symbiose’ (‘De la symbiose’). Symbiosis 69(3): 131-139.
Padmanabhan, R., E. Jay and R. Wu 1974. Chemical synthesis of a primer and its use in the sequence analysis of the lysozyme gene of bacteriophage T4. Proceedings of the National Academy of Sciences 71(6): 2510-2514.
Paez, J. G., M. Lin, R. Beroukhim, J. C. Lee, X. Zhao, D. J. Richter, S. Gabriel, P. Herman, H. Sasaki and D. Altshuler 2004. Genome coverage and sequence fidelity of ϕ29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Research 32(9): e71-e71.
Palmer, J. D. 1997. Organelle Genomes--Going, Going, Gone. Science 275(5301): 790-790.
Parks, D. H., C. Rinke, M. Chuvochina, P.-A. Chaumeil, B. J. Woodcroft, P. N. Evans, P. Hugenholtz and G. W. Tyson 2017. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature microbiology 2(11): 1533.
Pinard, R., A. de Winter, G. J. Sarkis, M. B. Gerstein, K. R. Tartaro, R. N. Plant, M. Egholm, J. M. Rothberg and J. H. Leamon 2006. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC genomics 7(1): 216.
Polz, M. F. and C. M. Cavanaugh 1998. Bias in template-to-product ratios in multitemplate PCR. Applied and environmental Microbiology 64(10): 3724-3730.
Ponce-de-Leon, M., D. Tamarit, J. Calle-Espinosa, M. Mori, A. Latorre, F. Montero and J. Pereto 2017. A genome-scale study of metabolic complementation in endosymbiotic consortia: the case of the cedar aphid. bioRxiv: 179127.
Pride, D. T., R. J. Meinersmann, T. M. Wassenaar and M. J. Blaser 2003. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome research 13(2): 145-158.

Probst, A. J., A. K. Auerbach and C. Moissl-Eichinger 2013. Archaea on human skin. PloS one 8(6): e65388.
Pühler, G., H. Leffers, F. Gropp, P. Palm, H.-P. Klenk, F. Lottspeich, R. A. Garrett and W. Zillig 1989. Archaebacterial DNA-dependent RNA polymerases testify to the evolution of the eukaryotic nuclear genome. Proceedings of the National Academy of Sciences 86(12): 4569-4573.
Quail, M. A., I. Kozarewa, F. Smith, A. Scally, P. J. Stephens, R. Durbin, H. Swerdlow and D. J. Turner 2008. A large genome center's improvements to the Illumina sequencing system. Nat Meth 5(12): 1005-1010.
Razumov, A. 1932. The direct method of calculation of bacteria in water: comparison with the Koch method. Mikrobiologija 1: 131-146.
Reeve, J. N. 1999. Archaebacteria then… Archaes now (are there really no archaeal pathogens?). Journal of bacteriology 181(12): 3613-3617.
Remy, W., T. N. Taylor, H. Hass and H. Kerp 1994. Four hundred-million-year-old vesicular arbuscular mycorrhizae. Proceedings of the National Academy of Sciences 91(25): 11841-11843.
Rhoads, A. and K. F. Au 2015. PacBio sequencing and its applications. Genomics, proteomics & bioinformatics 13(5): 278-289.
Richardson, E. J. and M. Watson 2012. The automatic annotation of bacterial genomes. Briefings in bioinformatics 14(1): 1-12.
Ring, J. D., K. Sturk-Andreaggi, M. A. Peck and C. Marshall 2017. A performance evaluation of Nextera XT and KAPA HyperPlus for rapid Illumina library preparation of long-range mitogenome amplicons. Forensic Science International: Genetics 29: 174-180.
Rinke, C., J. Lee, N. Nath, D. Goudeau, B. Thompson, N. Poulton, E. Dmitrieff, R. Malmstrom, R. Stepanauskas and T. Woyke 2014. Obtaining genomes from uncultivated environmental microorganisms using FACS–based single-cell genomics. Nature protocols 9(5): 1038-1048.
Rinke, C., S. Low, B. J. Woodcroft, J.-B. Raina, A. Skarshewski, X. H. Le, M. K. Butler, R. Stocker, J. Seymour and G. W. Tyson 2016. Validation of picogram- and femtogram-input DNA libraries for microscale metagenomics. PeerJ 4: e2486.
Rinke, C., P. Schwientek, A. Sczyrba, N. N. Ivanova, I. J. Anderson, J.-F. Cheng, A. Darling, S. Malfatti, B. K. Swan and E. A. Gies 2013. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499(7459): 431-437.
Ronaghi, M., S. Karamohamed, B. Pettersson, M. Uhlén and P. Nyrén 1996. Real-time DNA sequencing using detection of pyrophosphate release. Analytical biochemistry 242(1): 84-89.
Rusch, D. B., A. L. Halpern, G. Sutton, K. B. Heidelberg, S. Williamson, S. Yooseph, D. Wu, J. A. Eisen, J. M. Hoffman and K. Remington 2007. The Sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS biology 5(3): e77.
Sabree, Z. L., S. Kambhampati and N. A. Moran 2009. Nitrogen recycling and nutritional provisioning by Blattabacterium, the cockroach endosymbiont. Proceedings of the National Academy of Sciences 106(46): 19521-19526.
Sagan, L. 1967. On the origin of mitosing cells. Journal of theoretical biology 14(3): 225-274.
Saiki, R. K., S. Scharf, F. Faloona, K. B. Mullis, G. T. Horn, H. A. Erlich and N. Arnheim 1985. Enzymatic amplification of β-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230(4732): 1350-1354.

Sanger, F., G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, J. C. Fiddes, C. Hutchison, P. M. Slocombe and M. Smith 1977a. Nucleotide sequence of bacteriophage φX174 DNA. Nature 265(5596): 687-695.
Sanger, F., S. Nicklen and A. R. Coulson 1977b. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74(12): 5463-5467.
Sboner, A., X. J. Mu, D. Greenbaum, R. K. Auerbach and M. B. Gerstein 2011. The real cost of sequencing: higher than you think! Genome biology 12(8): 125.
Schadt, E. E., S. Turner and A. Kasarskis 2010. A window into third-generation sequencing. Human molecular genetics 19(R2): R227-R240.
Schrader, C., A. Schielke, L. Ellerbroek and R. Johne 2012. PCR inhibitors–occurrence, properties and removal. Journal of applied microbiology 113(5): 1014-1026.
Shinzato, N., I. Watanabe, X.-Y. Meng, Y. Sekiguchi, H. Tamaki, T. Matsui and Y. Kamagata 2007. Phylogenetic analysis and fluorescence in situ hybridization detection of archaeal and bacterial endosymbionts in the anaerobic ciliate Trimyema compressum. Microbial ecology 54(4): 627-636.
Smith, T. A., T. Driscoll, J. J. Gillespie and R. Raghavan 2015. A Coxiella-like endosymbiont is a potential vitamin source for the Lone Star tick. Genome biology and evolution 7(3): 831-838.
Smith, W. A., K. F. Oakeson, K. P. Johnson, D. L. Reed, T. Carter, K. L. Smith, R. Koga, T. Fukatsu, D. H. Clayton and C. Dale 2013. Phylogenetic analysis of symbionts in feather-feeding lice of the genus Columbicola: evidence for repeated symbiont replacements. BMC Evolutionary Biology 13(1): 109.
Somvanshi, V. S., B. Kaufmann-Daszczuk, K. s. Kim, S. Mallon and T. A. Ciche 2010. Photorhabdus phase variants express a novel fimbrial locus, mad, essential for symbiosis. Molecular microbiology 77(4): 1021-1038.
Spang, A., J. Martijn, J. H. Saw, A. E. Lind, L. Guy and T. J. Ettema 2013. Close encounters of the third domain: the emerging genomic view of archaeal diversity and evolution. Archaea 2013: Article ID 202358.
Spang, A., J. H. Saw, S. L. Jørgensen, K. Zaremba-Niedzwiedzka, J. Martijn, A. E. Lind, R. van Eijk, C. Schleper, L. Guy and T. J. Ettema 2015. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521(7551): 173-179.
Spribille, T., V. Tuovinen, P. Resl, D. Vanderpool, H. Wolinski, M. C. Aime, K. Schneider, E. Stabentheiner, M. Toome-Heller and G. Thor 2016. Basidiomycete yeasts in the cortex of ascomycete macrolichens. Science 353(6298): 488-492.
Stairs, C. W., M. M. Leger and A. J. Roger 2015. Diversity and origins of anaerobic metabolism in mitochondria and related organelles. Phil. Trans. R. Soc. B 370(1678): 20140326.
Staley, J. T. and A. Konopka 1985. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annual Reviews in Microbiology 39(1): 321-346.
Sunagawa, S., L. P. Coelho, S. Chaffron, J. R. Kultima, K. Labadie, G. Salazar, B. Djahanschiri, G. Zeller, D. R. Mende and A. Alberti 2015. Structure and function of the global ocean microbiome. Science 348(6237): 1261359.
Takahashi, S., J. Tomita, K. Nishioka, T. Hisada and M. Nishijima 2014. Development of a prokaryotic universal primer for simultaneous analysis of bacteria and archaea using next-generation sequencing. PloS one 9(8): e105592.
Tamm, S. L. 1982. Flagellated ectosymbiotic bacteria propel a eucaryotic cell. The Journal of cell biology 94(3): 697-709.

Teeling, H., A. Meyerdierks, M. Bauer, R. Amann and F. O. Glöckner 2004. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental microbiology 6(9): 938-947.
The C. elegans Sequencing Consortium 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science: 2012-2018.
Toft, C. and S. G. Andersson 2010. Evolutionary microbial genomics: insights into bacterial host adaptation. Nature Reviews Genetics 11(7): 465-475.
Tokura, M., M. Ohkuma and T. Kudo 2000. Molecular phylogeny of methanogens associated with flagellated protists in the gut and with the gut epithelium of termites. FEMS microbiology ecology 33(3): 233-240.
Tokura, M., K. Tajima and K. Ushida 1999. Isolation of Methanobrevibacter sp. as a ciliate-associated ruminal methanogen. The Journal of general and applied microbiology 45(1): 43-47.
Turnbaugh, P. J., R. E. Ley, M. Hamady, C. Fraser-Liggett, R. Knight and J. I. Gordon 2007. The human microbiome project: exploring the microbial part of ourselves in a changing world. Nature 449(7164): 804.
Utturkar, S. M., D. M. Klingeman, M. L. Land, C. W. Schadt, M. J. Doktycz, D. A. Pelletier and S. D. Brown 2014. Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences. Bioinformatics 30(19): 2709-2716.
Van Bruggen, J., G. Van Rens, E. Geertman, K. Zwart, C. Stumm and G. Vogels 1988. Isolation of a methanogenic endosymbiont of the sapropelic amoeba Pelomyxa palustris Greeff. The Journal of protozoology 35(1): 20-23.
Van Bruggen, J., K. Zwart, J. Hermans, E. Van Hove, C. Stumm and G. Vogels 1986. Isolation and characterization of Methanoplanus endosymbiosus sp. nov., an endosymbiont of the marine sapropelic ciliate Metopus contortus Quennerstedt. Archives of Microbiology 144(4): 367-374.
van Bruggen, J. J., C. K. Stumm and G. D. Vogels 1983. Symbiosis of methanogenic bacteria and sapropelic protozoa. Archives of microbiology 136(2): 89-95.
van Bruggen, J. J., K. B. Zwart, R. M. van Assema, C. K. Stumm and G. G. Vogels 1984. Methanobacterium formicicum, an endosymbiont of the anaerobic ciliate Metopus striatus McMurrich. Archives of microbiology 139(1): 1-7.
Van Hoek, A., T. A. van Alen, V. Sprakel, J. Hackstein and G. D. Vogels 1998. Evolution of anaerobic ciliates from the gastrointestinal tract: phylogenetic analysis of the ribosomal repeat from Nyctotherus ovalis and its relatives. Molecular biology and evolution 15(9): 1195-1206.
van Hoek, A. H., T. A. van Alen, V. S. Sprakel, J. A. Leunissen, T. Brigge, G. D. Vogels and J. H. Hackstein 2000. Multiple acquisition of methanogenic archaeal symbionts by anaerobic ciliates. Molecular Biology and Evolution 17(2): 251-258.
Veatch, J. R., M. A. McMurray, Z. W. Nelson and D. E. Gottschling 2009. Mitochondrial dysfunction leads to nuclear genome instability via an iron-sulfur cluster defect. Cell 137(7): 1247-1258.
Venkatesan, B. M. and R. Bashir 2011. Nanopore sensors for nucleic acid analysis. Nature nanotechnology 6(10): 615-624.
Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans and R. A. Holt 2001. The sequence of the human genome. Science 291(5507): 1304-1351.
Viprey, V., A. Del Greco, W. Golinowski, W. J. Broughton and X. Perret 1998. Symbiotic implications of type III protein secretion machinery in Rhizobium. Molecular microbiology 28(6): 1381-1389.

Wagener, S., C. Bardele and N. Pfennig 1990. Functional integration of Methanobacterium formicicum into the anaerobic ciliate Trimyema compressum. Archives of microbiology 153(5): 496-501.
Watson, J. D. and F. H. Crick 1953. Molecular structure of nucleic acids. Nature 171(4356): 737-738.
Weirather, J. L., M. de Cesare, Y. Wang, P. Piazza, V. Sebastiano, X.-J. Wang, D. Buck and K. F. Au 2017. Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research 6.
Wetterstrand, K. A. (2017). DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Retrieved 2017-11-01, from https://www.genome.gov/sequencingcostsdata.
Williams, K. P., B. W. Sobral and A. W. Dickerman 2007. A robust species tree for the alphaproteobacteria. Journal of bacteriology 189(13): 4578-4586.
Williams, T. A., P. G. Foster, C. J. Cox and T. M. Embley 2013. An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504(7479): 231-236.
Woese, C. R. 1987. Bacterial evolution. Microbiological reviews 51(2): 221.
Woese, C. R. and G. E. Fox 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences 74(11): 5088-5090.
Woese, C. R., O. Kandler and M. L. Wheelis 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National Academy of Sciences 87(12): 4576-4579.
Worm, P., N. Müller, C. M. Plugge, A. J. M. Stams and B. Schink (2010). Syntrophy in Methanogenic Degradation. (Endo)symbiotic Methanogenic Archaea. J. H. P. Hackstein. New York, Springer: 143-174.
Xi, Z., L. Gavotte, Y. Xie and S. L. Dobson 2008. Genome-wide analysis of the interaction between the endosymbiotic bacterium Wolbachia and its Drosophila host. BMC Genomics 9(1): 1.
Yamada, K., Y. Kamagata and K. Nakamura 1997. The effect of endosymbiotic methanogens on the growth and metabolic profile of the anaerobic free-living ciliate Trimyema compressum. FEMS microbiology letters 149(1): 129-132.
Yatsunenko, T., F. E. Rey, M. J. Manary, I. Trehan, M. G. Dominguez-Bello, M. Contreras, M. Magris, G. Hidalgo, R. N. Baldassano and A. P. Anokhin 2012. Human gut microbiome viewed across age and geography. Nature 486(7402): 222-227.
Zaremba-Niedzwiedzka, K., E. F. Caceres, J. H. Saw, D. Bäckström, L. Juzokaite, E. Vancaester, K. W. Seitz, K. Anantharaman, P. Starnawski and K. U. Kjeldsen 2017. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541(7637): 353-358.
Zerbino, D. R. and E. Birney 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18(5): 821-829.


Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1615 Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-336794 2018