Examination of the Nematostella vectensis Holobiont by Comparative Bacterial Genomics and Metatranscriptomics

by

Timothy J. Helbig

B.S. Biological Sciences Carnegie Mellon University, 2010

SUBMITTED TO THE DEPARTMENT OF MICROBIOLOGY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2013

© 2013 Timothy J. Helbig. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Department of Microbiology, August 29th, 2013

Certified by: Janelle R. Thompson, Doherty Assistant Professor in Ocean Utilization, Thesis Supervisor

Accepted by: Michael T. Laub, Whitehead Career Development Associate Professor of Biology, Chairman, Committee for Microbiology Graduate Students

Examination of the Nematostella vectensis Holobiont by Comparative Bacterial Genomics and Metatranscriptomics

by

Timothy J. Helbig

Submitted to the Department of Microbiology on August 30, 2013 in Partial Fulfillment of the Requirements for the Degree of Master of Science

ABSTRACT

Previous work has shown that similar microbial populations are associated with the starlet sea anemone Nematostella vectensis across distinct times and geographic locations; however, the functions these populations may perform within their anemone hosts and the mechanisms by which the bacteria may adapt are unknown. To address these issues, comparative genomic analysis was performed on ten newly sequenced bacterial isolates from four bacterial populations (Pseudomonas oleovorans, Agrobacterium tumefaciens, Limnobacter thiooxidans and Stappia stellulata) that are associated with Nematostella in the laboratory and/or its natural salt marsh habitat, and whole metatranscriptomes of lab-raised N. vectensis were sequenced and analyzed. Comparative genomic analysis revealed the isolates from these bacterial populations to likely be non-clonal, with no evidence that holobiont-specific orthologous groups (i.e. gene orthologs found only in N. vectensis-associated bacterial genomes and absent in closely related genomes of the same genus/family) were shared across the populations examined. Further, no evidence of lateral gene transfer or shared phage or mobile elements among the isolates was observed. Isolate genomes did, however, reveal conserved holobiont-specific orthologs within members of the same bacterial population that could be reflective of the ecology of the anemone holobiont; for instance, three of the four P. oleovorans isolate genomes showed evidence of holobiont-specific antibiotic production, the three A. tumefaciens isolates all shared common ion scavenging proteins and both L. thiooxidans isolates had a holobiont-specific antibiotic resistance protein. Whole anemone metatranscriptomic analysis based on BLASTx annotation of sequenced transcripts revealed bacterial expression of housekeeping genes such as those for replication, ribosomal structure and ATP synthesis dominated by , in particular .
Further recruitment of the transcripts to sequenced Nematostella associates revealed an active and foraging Limnobacter population expressing genes for signaling, movement, iron scavenging and carbon storage in the form of PHA granules. The similarly high expression of iron regulators by Limnobacter and the host anemone suggests that iron may be a source of structuring within the anemone holobiont and a promising area for further study.

Thesis Supervisor: Janelle R. Thompson Title: Doherty Assistant Professor in Ocean Utilization

Acknowledgements

Firstly I would like to thank the members of the Thompson Lab and everyone in Parsons for creating a wonderful and jubilant work environment. Particularly, I'd like to thank Sonia Timberlake who provided both invaluable programming and bioinformatics advice and great life advice in general when I needed it.

Further, I would like to thank my advisor Janelle Thompson for patience and the ability to talk my life out of dark places in my times of most need. She was a wonderful mentor and one of the most thoughtful and caring people I have come across in my life.

I would also like to thank my parental unit, who has provided a constant beam of unconditional love throughout my time in grad school.

Finally, I would like to thank and give love to my crucial support factor in this interesting past year and a half, my boyfriend Adam. Adam, you are a true rara avis and the most special person in my life. Here's to more of the happiest times of my life together now that this thesis is over.

Table of Contents

Figure Key

Introduction

Methods

Comparative Genomics Results and Discussion

Metatranscriptomics Results and Discussion

Conclusion

References

Appendices

Figure Key

Comparative Genomics of Nematostella vectensis Associated Bacteria

1. Map and distribution of microbial diversity in field and lab-raised N. vectensis as determined through 16S clone libraries (Har, MS Thesis) - p. 23

2. 16S rRNA Tree of Symbiont Phylogenetic Relatedness - p. 25

3. Shared Gene Contents of Nematostella Associated Isolates - p. 31

4. Holobiont-specific orthologous groups found within multiple members of the Nematostella isolated populations - p. 33

5. MEGAN Visualization of N. vectensis Isolate Operational Core Genomes - p. 35

6. MEGAN Visualization of N. vectensis Isolate Operational Flexible Genomes - p. 36

7. Analysis of N. vectensis Genome Scaffolds containing and not containing Pseudomonas DNA - p. 37

8. Shared Phage Elements of N. vectensis Isolates within and among Populations - p. 38

Nematostella vectensis Metatranscriptome Analysis

1. Ribosomal and Contaminant Read Breakdown of Initial Read Pairs - p. 42

2. MEGAN Taxonomic Breakdown of Filtered Reads - p. 44

3. MEGAN Analysis of N. vectensis Metatranscriptome Diversity - p. 45

4. Top 15 most highly represented orthologous groups among transcripts of Bacterial and Cnidarian binned reads - p. 47

5. Read mapping to second highest annotated "Cnidarian" Read Category, opiNOG08261 - p. 48

6. Read mapping to representative sequence of highest expressed "Bacterial" orthologous group NOG323497 - p. 49

7. Genus level assignment of SSU rRNA sequences from the unprocessed control sample - p. 51

8. Species-level assignment of SSU rRNA sequences from all Metatranscriptome samples - p. 52

9. COG Category Distribution of Reads Binned By MEGAN as Proteobacteria and Reads Mapped to Sequenced Limnobacter Genomes - p. 57

10. Orthologous Group comparison of those present in "Bacterial" reads as determined through MEGAN analysis and those present in the symbiont genome mapping analysis - p. 58

Introduction

Significance of host-associated microbial communities: Emerging evidence

Multicellular life emerged in a world teeming with microorganisms. Rather than an obstacle to overcome, this was an opportunity for each type of life to make use of the other's unique physiological and enzymatic capabilities in order to better adapt to its surrounding environment. The success of these partnerships between multicellular organisms and microorganisms is well illustrated by the fact that all studied mammals, lower vertebrates, invertebrates and plants are colonized by distinct and active microbial communities (Bosch, 2013; Nyholm et al., 2012; Hooper et al., 2001). Microbial inhabitants of multicellular organisms have recently come to light as powerful contributors to the well-being and success of their hosts. They are critical in the digestion and absorption of nutrients: they have been found to break down complex plant polymers and polysaccharides in ruminants, termites and humans (Warnecke et al., 2007; Xu et al., 2003; Mahowald et al., 2009), synthesize essential amino acids in insects such as aphids (Moran, 2007) and produce vitamins in mice and humans (Chaucheyras-Durand et al., 2010). They have also been found to be imperative in the development of particular organs and systems within animals, such as the immune system of mice and humans (Dobber et al., 1992; O'Hara et al., 2004), the gut of zebrafish (Rawls et al., 2004) and the light organ of the Hawaiian bobtail squid (Rader et al., 2012). Further, they are known to affect complex physiological processes such as obesity in humans and mice (Backhed et al., 2004; Ley et al., 2005; Ley et al., 2006). Finally, microbial constituents are known to be passive and active deterrents of pathogens within both vertebrates and invertebrates (Reshef et al., 2006; Bosch, 2013).
The composition, structuring and development of these microbial communities within their hosts have been found to be species specific in both vertebrates and invertebrates, including hydra, mice, zebrafish, humans and other hominids (Ley et al., 2008; Ochman et al., 2010; Rawls et al., 2006). In a few cases critical components of the host structuring mechanism are known, such as the roles of particular cationic antimicrobial peptides in Drosophila and Hydra (Ryu et al., 2008; Fraune et al., 2010) and the use of specific surface attachment sites in nematodes and the squid Euprymna scolopes (Ruby, 2008). However, most of the mechanisms remain unknown, leading researchers to speculate on single or synergistic effects of nutrients, microbe-microbe interactions (both chemical and frequency dependent), external environmental effects or host-derived factors as the source of control over bacterial community composition (Bosch, 2013; Bevins et al., 2011). These communities of micro- and macro-organisms and their multicellular hosts have become collectively known as "holobionts" (Rohwer et al., 2002). Holobionts are relevant because it is becoming clear that the microbial associates of a multicellular organism are key to its success and adaptation in diverse environments (Rohwer et al., 2002; Vega-Thurber et al., 2009). Indeed, the holobiont theory of evolution suggests that the collective genetic material of a multicellular host and its associates is the unit of selection on which evolution acts (Singh et al., 2013).

For most of the defined roles of microbial associates, extreme phenotypes of the host, such as a termite's digestion of wood or the glowing light organ of the squid Euprymna scolopes, provided the clues as to what the inhabiting microorganisms were likely doing (Warnecke et al., 2007; Rader et al., 2012). Other evidence of functional roles of the associates came from the creation and manipulation of gnotobiotic animal models, i.e. those capable of being raised with or without microbes; powerful model organisms of this type include mouse, hydra and zebrafish (Ruby, 2008). For organisms without a distinctive phenotype or a gnotobiotic form, methods for generating hypotheses about the function of microbial associates (if any) have included correlating behavioral and physiological observations with monitored changes in complex microbial systems, comparative genomics of full genome sequences of the associates, or inference from metagenomic datasets (Ley et al., 2006; Mandel et al., 2009; Vega-Thurber et al., 2009). More recently, advances in nucleic acid preparation techniques and next-generation sequencing technologies have made population genomics and metatranscriptomics viable tools for generating hypotheses regarding the roles of microbial associates within multicellular hosts (Coleman and Chisholm, 2010; Stewart et al., 2010).

Model System: Nematostella vectensis

One such multicellular host is the starlet sea anemone Nematostella vectensis, an animal of Phylum Cnidaria, Class Anthozoa, Subclass Hexacorallia. A sedentary carnivore, this anemone resides exclusively in estuaries (Hand and Uhlinger, 1994), including those of extreme salinity (Sheader et al., 1997), temperature (Williams, 1983; Kneib, 1988) and sulfide fluxes (Howes et al., 1985; Bart and Hartman, 2000). Recently, due to its tractability in the lab, easily induced sexual and asexual reproduction, a sequenced genome replete with innate immunity genes (Putnam et al., 2007; Genikhovich and Technau, 2009) and a host of molecular tools designed for it (Renfer et al., 2010), Nematostella has become a popular model of evolution and development (Stefanik et al., 2013; Reitzel et al., 2012). Previous experimental work has shown that similar microbial populations are associated with Nematostella over distinct geographic locations and timescales (Har et al., MS Thesis). However, how these microbes associate with the host, their function within it and the factors that structure the community are unknown. In order to address these questions, we have used next-generation sequencing tools to generate hypotheses about the functional roles of the microbial community and the possible ways in which it is structured. The motivation for this work was threefold:

1. To compare the specifics of Nematostella-microbe interactions to other symbiosis models to get a better understanding of canonical host-microbe interaction principles.

2. To learn the specifics of anthozoan host-microbe interactions in order to provide clues for understanding the health of anthozoans.

3. To understand the role and adaptations of microbial associates to N. vectensis.

Based on an assessment of strength of association, we selected ten bacterial genomes for sequencing: four classified as Pseudomonas oleovorans, three as Agrobacterium tumefaciens, two as Limnobacter thiooxidans and one as Stappia stellulata. To these sequenced genomes we have applied the principles of comparative population genomics in order to understand the unique ways in which they may be functioning and surviving within the anemone.

Comparative population genomics as a tool for exploring bacterial function

Comparative population genomics can be used in two ways to elucidate microbial genes and functions related to host association. In the first, the genome(s) of a strain or set of related strains experiencing an environment of interest is compared to the genome(s) of related strains not experiencing that environment, and the presence or absence of genomic signatures such as genes, non-coding RNAs, regulatory regions and mobile genetic elements such as integrons and phages is recorded. This type of approach led to the discovery of a single gene in Vibrio fischeri that establishes its host specificity for squid (Mandel et al., 2009), genes of potential pathogenicity in Mycobacterium tuberculosis (Zakham et al., 2012) and repeat elements influencing plant pathogenicity in the fungus Pyrenophora tritici-repentis (Manning et al., 2013). The second approach is to compare genomes of phylogenetically diverse microbes experiencing the same environment to see whether they hold clues to a shared, unknown pressure. This approach was used to successfully infer that populations of Pelagibacter and Prochlorococcus were both under phosphate limitation in the Atlantic Ocean (Coleman and Chisholm, 2010). Because our collection comprises ten genomes from four distinct populations, we have employed both methods.

Metatranscriptomic analysis provides insights into gene expression in complex communities

In addition to the comparative genomics methods described above, we used metatranscriptomics, the large-scale sequencing of RNA from a mixed community, in order to understand the expressed functions of anemone microbes. Metatranscriptomics has previously been of limited use due to the cost of sequencing and the low amounts of desired bacterial mRNA relative to the prodigious amounts of rRNA and of mRNA from the host and other organisms of the system (Poretsky et al., 2009; Moran et al., 2013; Gilbert et al., 2008). However, advances in rRNA subtraction and bacterial mRNA enrichment techniques have made it possible to perform metatranscriptomics on complex communities and gain insights into the bacteria residing within them (Stewart et al., 2010; Shi et al., 2009). For instance, studies examining such complex hosts as humans (Gosalbes et al., 2011), insects (Xie et al., 2012), mice (Xiong et al., 2012) and sponges (Radax et al., 2012) have all effectively used metatranscriptomics to understand the ecology of the microorganisms within the host. Further, metatranscriptomics has practical applications: metatranscriptome data were used to design a medium for culturing a symbiont of the medicinal leech Hirudo verbana (Bomar et al., 2011). We have thus employed metatranscriptomics on whole, lab-raised anemones, using a variety of rRNA depletion and mRNA enrichment techniques, not only to gain an understanding of the microbial ecology of the anemone holobiont based on the genes expressed and the types of microbes expressing them, but also to apply the results to the creation of lab-testable hypotheses for known and unknown inhabitants of the anemone microbiome.

Taken together, the following comparative genomics and metatranscriptomics data reveal complex microbial associations with Nematostella, indicating that antibiotics and elemental iron may be key agents in the structuring of the anemone microbiome.

Comparative Genomics Methods

N. vectensis Bacterial Isolate Genome Sample Preprocessing and Assembly

Barcode Sort and Removal: Raw 102bp paired-end Fastq reads of all 10 N. vectensis associated genomes were obtained via Illumina-GAII (Illumina, Inc., San Diego, CA) and received in one data file. To sort out the reads of the individual genomes, 6 base-pair barcodes on the sequences were identified, those sequences were moved to specified files and the barcode sequence as well as a linking "T" nucleotide was trimmed from the forward and reverse sequence of each pair; all of this was performed using the perl script "bcsortfastq-pe.v3.pl" (Obtained from former Thompson Lab post-doc Samodha Fernando).

Read Preprocessing and Adaptor Removal: Contaminating Illumina adaptor sequences were trimmed from the reads of each paired-end genome fastq file using the "fastx_clipper" tool of the FASTX-Toolkit (Hannon Lab open-source software). This tool also removed pairs of reads that contained ambiguous "N" nucleotides. Trimmed reads shorter than 30 nt were removed, and trimmed reads longer than 30 nt were moved to a separate file, as they were now single-ended; this made it easier to import them into CLC Genomics Workbench (CLC Bio, Cambridge, MA). The exact command used for this step was:

fastx_clipper -a GAGATCGGA -l 30 -i input_file_name -o output_file_name

Trimmed, now single-end sequences over 30 nt were moved to separate files by the custom script "fastq-purge.py".

Preprocessing was completed by importing both the file of untrimmed paired-end fastq sequences and the file of trimmed > 30nt fastq single end sequences into CLC Genomics Workbench and using the "Quality Trimming" function on default parameters in order to remove low-quality bases before the assembly; reads < 30 base pairs were removed at this point.

Adaptor Discovery: The presence of proprietary Illumina adaptor sequences in the reads was discovered upon mapping the raw reads back to an alternative assembly via the "Map Reads to Reference" functionality of CLC Workbench. (That alternative assembly will not be discussed, as its preprocessing pipeline was found to be inferior to the one described above.) On manual inspection of this mapping, particular 25-80 nt regions of scaffolds had thousands-fold higher coverage than surrounding regions, and these regions were found to contain identical sequences later identified as the Illumina adaptor sequence.

Assembly and Assembly Check: Preprocessed trimmed and untrimmed reads were submitted to the CLC Workbench "De Novo Assembly" Function on default parameters. Assemblies were checked via manual inspection of the raw reads mapped back to the constructed scaffolds using CLC Genomics Workbench.

Structural and Functional Annotation of Isolate Genomes

Open reading frames (ORFs) were identified by uploading fasta files of the genome assemblies to RAST (Aziz et al. 2008) on default settings. Average nucleotide identities (ANI) among isolate genomes were calculated based on pairs of ORFs sharing bi-directional best hits. ORFs were annotated by using BLASTp (Altschul et al. 1997) to compare them to the COG and NOG subsets of the eggNOG database (v3.0) (Powell et al. 2011). An ORF was given a particular COG/NOG designation if its BLASTp alignment to a protein of that COG or NOG had an e-value < 1e-20 and the alignment included the functional portion of the protein (as designated in the COG/NOG database). Only one orthologous group annotation was given per ORF, so if an ORF had two or more significant hits, only the NOG/COG with the lowest e-value was kept.
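
The one-annotation-per-ORF rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in this work: hits are assumed to be already parsed into (ORF, orthologous group, e-value) tuples, the check that the alignment covers the functional portion of the protein is omitted, and the function name is hypothetical.

```python
def assign_orthologous_groups(hits, max_evalue=1e-20):
    """Keep, for each ORF, the single COG/NOG hit with the lowest e-value.

    hits: iterable of (orf_id, og_id, evalue) tuples (assumed input format).
    Hits whose e-value is not below max_evalue are ignored.
    """
    best = {}
    for orf_id, og_id, evalue in hits:
        if evalue >= max_evalue:
            continue  # not significant under the cutoff
        if orf_id not in best or evalue < best[orf_id][1]:
            best[orf_id] = (og_id, evalue)
    return {orf_id: og_id for orf_id, (og_id, _) in best.items()}


# Hypothetical example hits
example_hits = [
    ("orf1", "COG0001", 1e-50),
    ("orf1", "NOG12345", 1e-80),  # lower e-value, so this one is kept
    ("orf2", "COG0002", 1e-5),    # fails the 1e-20 cutoff
]
annotation = assign_orthologous_groups(example_hits)
```

ORFs with no significant hit, such as "orf2" here, simply receive no orthologous group.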

N. vectensis Bacterial Isolate 16S rRNA Gene Phylogenetic Analysis

Partial sequences of the 16S rRNA genes were obtained for each of the 10 isolates from Ju Hyoung Lim and Jia Yi Har (a former post-doc and graduate student of the Thompson lab, respectively). These partial sequences, along with full 16S rRNA gene sequences of the Alphaproteobacteria C. crescentus NA1000 and S. meliloti 1021, the Betaproteobacteria N. multiformis ATCC 25196 and V. paradoxus, the Gammaproteobacteria P. putida F1 and P. aeruginosa PAO1 and the Firmicute B. subtilis BPD-13 (all obtained from MicrobesOnline (Dehal et al. 2009)), were aligned using the software MUSCLE on default parameters (Edgar 2004), and a maximum-likelihood tree was constructed from the alignment using PhyML (Guindon et al. 2010); 100 non-parametric bootstraps were calculated and reported. B. subtilis served as the outgroup.

Symbiont Population Shared and Unshared Ortholog Determination

The annotated ORFs of genomes from the same population were compared to determine the orthologous groups shared and not shared among them; this was carried out using custom python scripts. Results were manually arranged into Venn diagrams.
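
A comparison of this kind can be sketched with Python sets. The sketch below is illustrative only and is not the custom script itself; the function name, genome labels and orthologous group identifiers are hypothetical. It computes, for every combination of genomes, the orthologous groups found in exactly those genomes, which correspond to the regions of a Venn diagram.

```python
from itertools import combinations


def venn_regions(og_sets):
    """Map each combination of genomes to the OGs present in exactly those genomes.

    og_sets: dict of genome name -> set of orthologous group IDs.
    """
    names = sorted(og_sets)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # OGs present in every genome of the combination...
            inside = set.intersection(*(og_sets[n] for n in combo))
            # ...minus OGs present in any genome outside it
            outside = set().union(*(og_sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                regions[combo] = exclusive
    return regions


# Hypothetical population of three genomes
example_sets = {
    "A": {"COG1", "COG2", "COG3"},
    "B": {"COG1", "COG2"},
    "C": {"COG1", "NOG9"},
}
regions = venn_regions(example_sets)
```

The number of regions grows as 2^n - 1, which is manageable for populations of two to four genomes as in this study.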

Holobiont Specific Gene Determination

Holobiont-specific genes were defined as gene orthologs found in N. vectensis-associated bacterial genomes that were absent in closely related genomes of the same genus/family. To determine holobiont-specific gene orthologs, a list was first made, for each population, of all unique COG and NOG groups present in any of that population's genomes; this was done via a custom python script. Next, the fasta files of proteins from all sequenced close relatives at a particular taxonomic level were obtained from MicrobesOnline for each population (Dehal et al. 2009). In particular, all available Pseudomonas genomes (n=27) were used for analysis of the Pseudomonas oleovorans anemone isolates, all available sequences of Rhizobiaceae (n=19) were obtained for analysis of the Agrobacterium tumefaciens isolates, 30 members of the family were used for assessment of the Limnobacter thiooxidans isolates and all available Rhodobacteraceae (n=7) as well as the Rhizobiaceae were references for the Stappia isolate.

All of the relatives' protein files were annotated using the COG and NOG subsets of the eggNOG database (v3.0) as described in the "Structural and Functional Annotation of Isolate Genomes" section above. The unique orthologous group lists of all the relatives were then compared to those of the populations to determine which orthologous groups were holobiont-population specific. A list of all holobiont-specific orthologous groups and the genomes in which they are present can be found in Appendix I.
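
The definition of a holobiont-specific orthologous group amounts to a set difference, which can be sketched as below. This is a minimal illustration with hypothetical names and identifiers, assuming each genome has already been reduced to a set of COG/NOG IDs; it is not the custom script used here.

```python
def holobiont_specific(isolate_ogs, relative_ogs):
    """Return OGs found in any isolate of a population but in no relative.

    isolate_ogs:  dict of isolate genome -> set of OG IDs.
    relative_ogs: dict of relative genome -> set of OG IDs.
    The result maps each holobiont-specific OG to the isolates carrying it.
    """
    population = set().union(*isolate_ogs.values())
    relatives = set().union(*relative_ogs.values())
    specific = population - relatives
    return {
        og: sorted(name for name, ogs in isolate_ogs.items() if og in ogs)
        for og in specific
    }


# Hypothetical example: two isolates and two sequenced relatives
isolates = {"iso1": {"COG1", "NOG7"}, "iso2": {"COG1", "NOG7", "NOG8"}}
relatives = {"rel1": {"COG1"}, "rel2": {"NOG8", "COG5"}}
specific_ogs = holobiont_specific(isolates, relatives)
```

Recording which isolates carry each group makes it straightforward to see whether a holobiont-specific group is conserved across the members of a population.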

Determination of population-specific 'operational core' orthologs and flexible orthologs

Core genes are defined as the set of genes shared by all members of a population, while flexible genes are present in a subset of population genomes and may represent recent evolutionary events (e.g. horizontal gene transfer). To investigate the distribution and origin of core and flexible genes from N. vectensis-associated isolates, while accounting for the partial genome assembly of isolates, we operationally defined "core orthologous groups" as those that were present in the majority of isolates of a population (i.e. three or more of the four Pseudomonas strains, two or more of the three Agrobacterium strains or both Limnobacter strains). Use of orthologous groups as the basis of comparison, rather than reciprocal hits of genes, allowed genes split by partial assembly to be scored as a match. Flexible orthologous groups were the remaining set that did not meet those criteria. The Stappia strain was left out of the core/flexible analysis as it was the only sequenced member of its population.
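
The operational definition above can be expressed as a strict-majority rule, sketched here in Python. The function name and example data are hypothetical; a strict majority reproduces the stated thresholds (three of four, two of three, both of two).

```python
from collections import Counter


def split_core_flexible(og_sets):
    """Split a population's OGs into operational core and flexible sets.

    og_sets: dict of genome name -> set of OG IDs.
    Core OGs are present in a strict majority of the genomes.
    """
    n_genomes = len(og_sets)
    counts = Counter(og for ogs in og_sets.values() for og in ogs)
    core = {og for og, count in counts.items() if count > n_genomes / 2}
    flexible = set(counts) - core
    return core, flexible


# Hypothetical population of four strains
strains = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"A", "B", "D"},
    "P4": {"A", "C"},
}
core, flexible = split_core_flexible(strains)
```

In this example "C" is found in only two of four strains, so it falls in the flexible set.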

MEGAN Analysis of Operational Core and Flexible Ortholog Groups: In order to observe the phylogenetic origins of the core and flexible orthologous groups, the genes annotated as particular orthologous groups were classified by performing a BLASTp against the NCBI NR database (Altschul et al. 1997) with default parameters but with tabular (-m8) output. The BLASTp results were imported into MEGAN (MEtaGenome ANalyzer) using Lowest Common Ancestor (LCA) parameters of at least 3 hits to a clade for significance and a minimum bit score cutoff of 50 (Huson et al. 2007).

Nematostella Horizontal Gene Transfer Analysis

To assess the likelihood of horizontal gene transfer between Nematostella and the Pseudomonas, Agrobacterium, Stappia and Limnobacter populations of symbionts, a nucleotide BLAST was performed between the contigs of all 10 assembled N. vectensis associate genomes and the scaffolds of the complete N. vectensis genome (Putnam et al. 2007). Matches were accepted using a stringent expected value cutoff of 1e-30. N. vectensis scaffolds bearing bacteria-like DNA were manually inspected and analyzed with custom python scripts to determine their GC content, ambiguous base composition and size, which were used to assess the likelihood of horizontal gene transfer.
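
The per-scaffold statistics can be computed along the following lines. This is a sketch rather than the custom script: it assumes the scaffold is available as a plain string, treats every character other than A, C, G or T as ambiguous, and reports GC content over unambiguous bases only; the function name is hypothetical.

```python
def scaffold_stats(sequence):
    """Return the size, GC fraction and ambiguous-base fraction of a scaffold."""
    seq = sequence.upper()
    length = len(seq)
    unambiguous = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return {
        "length": length,
        # GC content is computed over called (A/C/G/T) bases only
        "gc_fraction": gc / unambiguous if unambiguous else 0.0,
        "ambiguous_fraction": (length - unambiguous) / length if length else 0.0,
    }


stats = scaffold_stats("GGCCAATTNN")
```

A short scaffold with a high fraction of ambiguous bases and bacteria-like GC content would be more consistent with assembly contamination than with genuine horizontal transfer.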

Inter-Population Lateral Gene Transfer Analysis

In order to assess whether there was evidence of horizontal gene transfer among the different symbiont populations, the genes of every pair of genomes from different populations were compared by BLASTn and checked for any genes sharing >95% nucleotide identity, as that would be a strong indicator of recent gene transfer.
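
A filter of this kind can be sketched as below, assuming the BLASTn results were written in the standard 12-column tabular format (query, subject, percent identity, alignment length, and so on). The use of tabular output for this step and the function name are assumptions made for illustration.

```python
def high_identity_pairs(tabular_lines, min_identity=95.0):
    """Return (query, subject, percent identity) for hits above min_identity.

    tabular_lines: lines of BLAST tabular output, where the third
    column holds the percent identity of the alignment.
    """
    pairs = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        identity = float(fields[2])
        if identity > min_identity:
            pairs.append((fields[0], fields[1], identity))
    return pairs


# Hypothetical hits between genes of two genomes from different populations
example_lines = [
    "geneA\tgeneX\t98.50\t900\t13\t0\t1\t900\t1\t900\t0.0\t1500",
    "geneB\tgeneY\t82.10\t400\t70\t2\t1\t400\t5\t404\t1e-90\t330",
]
candidates = high_identity_pairs(example_lines)
```

In practice one might also require a minimum alignment length so that short conserved stretches are not counted as whole-gene matches.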

Symbiont Shared Phage and Prophage Element Analysis

To test whether any of the populations shared similar phage or prophage elements, the proteome of each isolate was compared by BLASTp against the PHAST phage and prophage database (Zhou et al. 2011), and the top hits for each gene were compared among the N. vectensis associates to look for any that were identical among different populations of associates. Top hits were compared using a custom python script.
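
The comparison of top hits can be sketched as follows. The sketch assumes that each isolate has already been reduced to the set of database entries that were top hits for its genes; the function name, isolate labels and phage identifiers are hypothetical, and this is not the custom script itself.

```python
from collections import defaultdict


def phage_hits_shared_across_populations(top_hits, population_of):
    """Return top-hit phage entries that occur in more than one population.

    top_hits:      dict of isolate -> set of top-hit database entry IDs.
    population_of: dict of isolate -> population name.
    """
    populations_by_hit = defaultdict(set)
    for isolate, subjects in top_hits.items():
        for subject in subjects:
            populations_by_hit[subject].add(population_of[isolate])
    return {
        subject: sorted(populations)
        for subject, populations in populations_by_hit.items()
        if len(populations) > 1
    }


# Hypothetical example with three isolates from two populations
example_top_hits = {
    "Pseudo1": {"phageA", "phageB"},
    "Pseudo2": {"phageA"},
    "Agro1": {"phageB", "phageC"},
}
example_populations = {
    "Pseudo1": "Pseudomonas",
    "Pseudo2": "Pseudomonas",
    "Agro1": "Agrobacterium",
}
shared = phage_hits_shared_across_populations(example_top_hits, example_populations)
```

Here "phageA" is shared only within the Pseudomonas population, so it is not reported as shared among populations.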

Transcriptomics Methods

RNA Extraction, rRNA Subtraction Methods and Illumina Library Preparation

Isolation of RNA, ribosomal RNA depletion, cDNA synthesis and preparation of libraries for Illumina sequencing were carried out by Dr. Samodha Fernando according to methods adapted from and described in detail elsewhere (Timberlake et al., in prep; Penn et al., submitted). In brief, 20 lab-raised N. vectensis were washed with saline and homogenized, and their RNA was extracted with TRIzol (Life Technologies) according to the manufacturer's instructions, including treatment with DNase followed by phenol-chloroform extraction. The RNA was divided among six samples that were each subjected to various combinations of rRNA depletion protocols as an initial screen of protocol effectiveness (Table 1). These depletion methods, designed to eliminate eukaryotic and bacterial rRNAs, included treatment with RNaseH (personal comm., Samodha Fernando), use of the MICROBEnrich™ Kit (Ambion Part No. AM1901) and MICROBExpress™ Kit (Ambion Part No. AM1905) and treatment with duplex-specific nuclease (personal comm., Chisholm Lab). Following depletion of rRNA, samples were transcribed to cDNA (SuperScript Double-Stranded cDNA Synthesis Kit, Catalog # 11917-020).

Table 1. Bacterial mRNA enrichment techniques and their principles

* Poly(A)Purist - A kit that relies on oligo(dT) cellulose to preferentially bind the poly(A) tails of eukaryotic mRNA; this is used to remove unwanted eukaryotic mRNAs from our samples.
* RNaseH - Endonuclease that specifically degrades RNA in RNA:DNA hybrids; DNA oligos that bind to specific conserved regions of rRNA are added with it to selectively remove rRNAs.
* mRNA-ONLY - A kit that relies on an exonuclease that selectively degrades RNAs with 5'-monophosphates; rRNAs have this feature while mRNAs do not, so rRNAs are selectively degraded.
* MICROBEnrich/MICROBExpress - A pair of kits that rely on a novel capture hybridization protocol to selectively remove eukaryotic rRNA and bacterial rRNA, respectively.
* Duplex-specific nuclease - A nuclease that selectively degrades double-stranded DNA and DNA in DNA:RNA hybrids.

To prepare the Illumina libraries, cDNA for each sample was sheared into pieces of 100-300 base pairs, purified and ligated to proprietary Illumina adaptor sequences (Illumina, Inc., San Diego, CA) carrying unique 6 base-pair barcode sequences to designate samples for multiplexing within a single lane. Barcoded, adaptor-ligated fragments were then subjected to size selection to remove self-ligated adaptors. Cleaned and merged adaptor-ligated fragments were then submitted to MIT personnel, who sequenced them using the Illumina GAII platform as described in Timberlake et al., in prep; Penn et al., submitted. Sequence and sequence quality data in the form of FastQ files obtained from the Illumina platform form the basis of the dataset analyzed in this thesis.

Read Filtering

Barcode Sort and Removal: Raw reads obtained from the Illumina GAII were sorted into their respective samples by barcode, and the barcode and linking "T" nucleotide were subsequently removed via the perl script "bcsort_fastqpev3.pl" (obtained from former Thompson Lab post-doc Samodha Fernando).

Removal of rRNAs: Raw read pairs were compared against the Silva large and small subunit rRNA databases (Quast et al., 2013) using BLASTn and those read pairs having one or both ends matching a database sequence with a bitscore > 50.0 were removed and categorized as the type of rRNA matching its highest hit. For removal of 5S rRNA and Internal Transcribed Spacers (ITS), remaining non-rRNA reads were compared against custom databases of bacterial and Nematostella 5S rRNA and ITS using BLASTn, and again, those read pairs having one or both ends matching a sequence within one of these databases with bit score > 50.0 were removed and classified as the type of rRNA according to the identity of the highest hit.
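
The pair-level filtering rule can be sketched as below. This is an illustration under stated assumptions, not the pipeline itself: BLASTn hits for either end of a pair are assumed to be parsed into (pair ID, rRNA type, bit score) tuples, and the function name and labels are hypothetical.

```python
def classify_rrna_pairs(pair_ids, hits, min_bitscore=50.0):
    """Split read pairs into rRNA (with type) and putative non-rRNA pairs.

    pair_ids: iterable of read pair IDs.
    hits:     iterable of (pair_id, rrna_type, bitscore) for either end.
    A pair is rRNA if any end has a hit with bit score > min_bitscore;
    it is labelled with the rRNA type of its highest-scoring hit.
    """
    best = {}
    for pair_id, rrna_type, bitscore in hits:
        if bitscore > min_bitscore and (
            pair_id not in best or bitscore > best[pair_id][1]
        ):
            best[pair_id] = (rrna_type, bitscore)
    rrna = {pair_id: rrna_type for pair_id, (rrna_type, _) in best.items()}
    non_rrna = [pair_id for pair_id in pair_ids if pair_id not in rrna]
    return rrna, non_rrna


# Hypothetical example
example_pairs = ["p1", "p2", "p3"]
example_rrna_hits = [
    ("p1", "bacterial 16S", 120.0),
    ("p1", "eukaryotic 18S", 60.0),
    ("p2", "bacterial 23S", 45.0),  # below the cutoff, so p2 is retained
]
rrna_pairs, retained_pairs = classify_rrna_pairs(example_pairs, example_rrna_hits)
```

The same function could be applied a second time with hits against the 5S rRNA and ITS databases.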

Trimming of adaptor-contaminated reads: Visual inspection of the reads showed that some contained adaptor sequence. Adaptor contamination was removed from rRNA-free reads by searching for the pattern "AGATCGG[ACTGN]+?NNN[ACGTN]+?CCGATCT" (python regular expression) within merged read pairs, each formed by concatenating the forward read, three ambiguous nucleotides "NNN" and the reverse complement of the reverse read. If the pattern was found in the merged read, the merged read was truncated immediately before the start of the initial adaptor sequence "AGATCGG". This produced an adaptor-free single-ended sequence. If this sequence was longer than 25 nucleotides, it was kept for analysis.
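
The trimming step can be sketched in Python using the regular expression quoted above. The helper names and example sequences are hypothetical, and the treatment of pairs without adaptor (returned as None so that they can remain paired) is an assumption made for illustration.

```python
import re

# Pattern quoted in the text: adaptor read-through on both ends of a merged pair
ADAPTOR_PATTERN = re.compile(r"AGATCGG[ACTGN]+?NNN[ACGTN]+?CCGATCT")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def trim_adaptor(forward, reverse):
    """Return the adaptor-free sequence, or None if no adaptor is found."""
    merged = forward + "NNN" + reverse_complement(reverse)
    match = ADAPTOR_PATTERN.search(merged)
    if match is None:
        return None  # no contamination detected; the pair stays paired-end
    return merged[:match.start()]  # keep everything before "AGATCGG"


# Hypothetical 30 nt insert that is shorter than the read length, so both
# reads run through the insert into the adaptor
insert = "ACGTTGCATTGACCATGGTACCTTAGGCAT"
forward_read = insert + "AGATCGGAAGAGC"
reverse_read = reverse_complement(insert) + "AGATCGGAAGAGC"

trimmed = trim_adaptor(forward_read, reverse_read)
keep = trimmed is not None and len(trimmed) > 25  # length filter from the text
```

The non-greedy quantifiers ensure that the match spans the shortest stretch from the first adaptor, across the "NNN" junction, to the reverse-complemented adaptor on the other end.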

Removal of reads with tandem repeats: Read pairs containing tandem repeats were removed via a custom perl script. A read pair was considered to contain a tandem repeat if a 6-mer was present six or more times.
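
One possible reading of this criterion is sketched below in Python (the original filter was a perl script that is not reproduced here). The sketch assumes that occurrences of a 6-mer are counted over all overlapping windows of a single read and that a pair is flagged if either read meets the threshold; whether the original script required the copies to be strictly consecutive is not stated, so this is an assumption.

```python
from collections import Counter


def has_repeated_kmer(seq, k=6, min_count=6):
    """True if any k-mer occurs at least min_count times in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return any(count >= min_count for count in counts.values())


def pair_has_tandem_repeat(forward, reverse, k=6, min_count=6):
    """Flag a read pair if either end contains a highly repeated k-mer."""
    return has_repeated_kmer(forward, k, min_count) or has_repeated_kmer(
        reverse, k, min_count
    )


low_complexity = "AC" * 10                      # "ACACAC" occurs 8 times
ordinary = "ACGTTGCATTGACCATGGTACCTTAGGCAT"     # no 6-mer repeated 6 times
flagged = pair_has_tandem_repeat(low_complexity, ordinary)
```

Counting overlapping windows means that short-period repeats such as dinucleotide runs are flagged as well as true 6 nt tandem repeats.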

Merging overlapping paired ends: Paired-end reads with overlapping sequence were merged using the software program SHERA with a confidence metric >= 0.7 (Rodrigue et al. 2010).

Read filtering results: Putative mRNA reads that passed the filtering steps existed in three forms: 1) single-ended because of adaptor trimming; 2) single-ended because overlapping paired ends were merged with SHERA; 3) paired-end because the ends did not overlap and no adaptor was present. For further analysis, each of these forms counted as one read pair unit.

Taxonomic binning of Read Pair Units through MEGAN

Read pair units were compared against all sequences in the NCBI database using BLASTx (Altschul et al. 1997) with parameters (-m 8 -W 3 -e 20 -Q 11 -F "m S"). The BLASTx results were imported into MEGAN with a bit score cutoff of 40.0 and a lowest common ancestor cutoff of 2 matches (Huson et al. 2007). Read pairs with ends matching different domains of life were discarded; those matching the same domain of life were annotated with the end having the more specific taxonomic assignment. Reads binned at least at the level of Bacteria or Cnidaria were exported for annotation to understand the functions of the anemone host and its bacterial associates. The reads binned as "Virus", "Archaea" and "Non-Cnidarian Eukaryote" have been exported but have yet to be annotated and analyzed.
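The rule for reconciling the two ends of a read pair can be sketched as below. This is only an illustration: it assumes each end already carries a lineage (for example exported from MEGAN) as a tuple ordered from domain to most specific taxon, and it treats the longer lineage as the more specific one. The function name and data layout are hypothetical.

```python
def reconcile_pair(lineage_a, lineage_b):
    """Pick the taxonomic assignment for a read pair, or None to discard it.

    Each lineage is a tuple ordered from domain to most specific taxon,
    e.g. ("Bacteria", "Proteobacteria", "Gammaproteobacteria").
    """
    if lineage_a[0] != lineage_b[0]:
        return None  # ends disagree at the domain level: discard the pair
    # Assumption: a deeper lineage is the more specific assignment.
    return lineage_a if len(lineage_a) >= len(lineage_b) else lineage_b

same_domain = reconcile_pair(
    ("Bacteria", "Proteobacteria"),
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria"),
)
conflicting = reconcile_pair(("Bacteria",), ("Eukaryota", "Cnidaria"))
```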

EggNOG Annotations

Read pair units of interest exported from MEGAN were annotated by BLASTx against the eggNOG database version 3.0 (Powell et al. 2011); "Bacterial" reads were compared against the NOG and COG subsets of eggNOG while "Cnidarian" reads were compared against the opiNOG subset. Single-ended read pair units were annotated with the orthologous group and function of their top hit if that hit had a bit score > 50.0, and were counted as two read counts. Paired-end read units had each end compared separately by BLASTx against the desired subset of the eggNOG database. Using a custom python script, bit scores were recalculated for pairs whose ends hit the same member of the database (Timberlake et al. in prep); this recalculation accounts for the added confidence that both ends belong to a particular orthologous group when they align to the same target sequence. If the merged bit score was greater than all other scores for the two ends and greater than 50.0, the read pair unit was annotated with the identity and function of the orthologous group of the top merged hit and counted as two read counts. If the individual ends had higher bit scores than the merged hits, the ends were treated and annotated as single ends with one read count each; if both ends then fell within the same orthologous group, that group received two counts. This counting system was used in order to match the functional annotation counting scheme of MEGAN.
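The counting scheme can be summarized in a short sketch. The recalculation of merged bit scores is described in Timberlake et al. (in prep) and is not reproduced here; the sketch simply takes the merged score as an input. All function names and the example orthologous group identifiers are hypothetical.

```python
from collections import Counter

CUTOFF = 50.0  # bit score threshold stated in the text

def count_single(top_hit, counts):
    """Single-ended unit: top_hit is (orthologous_group, bitscore)."""
    group, score = top_hit
    if score > CUTOFF:
        counts[group] += 2

def count_pair(best_a, best_b, merged, counts):
    """Paired-end unit.

    best_a, best_b: (group, bitscore) top hit of each end, or None.
    merged: (group, merged_bitscore) for the best target hit by both ends,
            or None. The merged score is assumed to be computed elsewhere.
    """
    singles = [hit for hit in (best_a, best_b) if hit is not None]
    top_single = max((score for _, score in singles), default=0.0)
    if merged is not None and merged[1] > top_single and merged[1] > CUTOFF:
        counts[merged[0]] += 2
        return
    # Otherwise each end is treated as a single end worth one count.
    for group, score in singles:
        if score > CUTOFF:
            counts[group] += 1

counts = Counter()
count_single(("COG0001", 80.0), counts)
count_pair(("COG0002", 60.0), ("COG0002", 55.0), ("COG0002", 100.0), counts)
count_pair(("COG0003", 70.0), ("COG0004", 65.0), None, counts)
count_pair(("COG0005", 90.0), ("COG0005", 85.0), ("COG0006", 80.0), counts)
```

In the last call the merged hit scores lower than an individual end, so the ends are counted separately; because both fall in the same group, that group still receives two counts.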

Symbiont Genome Read Mapping

Reads that had been filtered through the metatranscriptomics preprocessing pipeline were merged into one file and imported into CLC Genomics Workbench (CLC Bio, Cambridge, MA). Using the "map reads to reference sequence" functionality of CLC with parameters "Similarity = 0.9" and "Length Fraction = 0.5", the reads were aligned to the annotated symbiont reference genomes. Results were filtered by "Consensus length > 50", which meant that at least half of a full-length read had to align to the reference in order to be counted. Further, genes with > 5x coverage were manually inspected to ensure an even distribution of reads over the length of the gene; genes with coverage deemed too skewed were excluded from further analysis. The filtered results were exported and eggNOG version 3 annotations were added as described in the "EggNOG Annotations" section above.

Comparative Genomics of Nematostella vectensis Associated Bacteria

Introduction and Initial Selection of Bacterial Symbionts for Sequencing

While previous experimental work has shown that similar microbial populations are associated with Nematostella over distinct temporal and geographic locations (Figure 1), the functions these bacteria may be performing within their anemone hosts and the mechanisms the bacteria may be using to associate remain unknown. To begin to address those questions, cultured bacteria found to be associated with Nematostella were whole-genome sequenced. The genome sequences serve two purposes: comparative genomics provides clues to the ecology of the anemone symbionts, and the sequences act as references for quantitative metatranscriptomics work. This section focuses exclusively on the comparative genomics analysis.

[Figure 1 image: map with 16S clone library pie charts for Mahone Bay, Nova Scotia; Great Sippewissett Marsh, Massachusetts; Clinton Harbor, Connecticut; and the MIT laboratory (July 2008, November 2008, and November 2008 sediment). Key: CFB, Chloroflexi, Chloroplasts, Cyanobacteria, Deferribacteres, OD1, Planctomycetes, Spirochetes, Tenericutes, Verrucomicrobia, Alphaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Gammaproteobacteria, Unknown.]

Figure 1 Map and distribution of microbial diversity in field and lab-raised N. vectensis as determined through 16S clone libraries (Har, MS Thesis). Each Pie-graph represents a distinct 16S clone library from a particular location and time as labeled. Colors on the graphs correspond to classes of bacteria as specified in the key.

Bacterial isolates cultured from field or laboratory raised N. vectensis were identified by 16S rRNA gene sequences (Har, MS Thesis). Isolates were selected for genome sequencing if they appeared to be stably associated with the anemone host based on recovery of the 16S rRNA sequence type (>99% nucleotide identity) in multiple samples (Table 1). Genome sequences were prepared for four Pseudomonas oleovorans strains (B4, Gab, Isu and 47) representing a 16S rRNA sequence type recovered from anemones collected in the field and from two laboratories over a two-year timespan, and for two Limnobacter thiooxidans strains (F1, FCMA) that also shared a 16S rRNA sequence with populations detected in both field-collected and lab-raised anemones. In addition, three Agrobacterium tumefaciens strains (Isu, D5 and D8) were included in the sequencing effort because they were observed in both culture and clone libraries of laboratory-raised anemones over a two-year period. A single Stappia stellulata strain was also included; it is a close relative (at the genus level) of previously described associates of Eastern Oysters, Crassostrea virginica, and corals (Boettcher et al. 2000, Uchino et al. 1998). It should be noted that only culturable populations could be included in the current genome sequencing effort. While two 16S rRNA sequence types similar to strains of Endozoicomonas and Campylobacterales, respectively, have been observed in N. vectensis at multiple field sites and have been described as stable associates, neither has been recovered in culture despite substantial effort (JH Lim, pers. communication, data not shown). The phylogenetic relationship among the 10 strains chosen for genome sequencing is depicted in Figure 2; further, a comparison of the strains of each population to their closest type strain shows they are more related to each other than to any other sequenced genome (see Appendix I). The Pseudomonas and Limnobacter populations fall within the Gammaproteobacteria and Betaproteobacteria classes respectively, while the Agrobacterium and Stappia populations both fall within the Alphaproteobacteria class.

[Table 1: only a caption fragment survives: "... and Detection of Bacterial Strains".]

[Figure 2 image: phylogenetic tree. Legible labels include Bacillus subtilis, Caulobacter crescentus NA1000, Stappia stellulata, Sinorhizobium meliloti 1021 and Agrobacterium tumefaciens strains; clades are keyed as Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria.]

Figure 2 16S rRNA Tree of Symbiont Phylogenetic Relatedness. Maximum-likelihood phylogenetic tree constructed with the 16S rRNA sequences of strains isolated from Nematostella vectensis (highlighted in red) and a collection of other Proteobacteria. Different color blocks represent different classes of Proteobacteria as detailed in the key. The gram-positive Bacillus subtilis serves as the out-group. One hundred non-parametric bootstraps were calculated; branches with bootstrap support of > 50 are labeled. (Scale is average substitutions per site.)

Genome Assembly, Annotation, and Illumina Adaptor Artifact Detection

The selected genomes were prepared and sequenced using the Illumina GA-II platform, which, after sorting by and removal of barcodes, yielded files of paired-end 102 bp sequences with quality score data for each strain. To preprocess the samples for better assembly, all read pairs containing "Ns" were eliminated and the reads were trimmed at 60 nucleotides due to the known decline in quality near the ends of the reads. This resulted in post-processed read pair counts that ranged among the ten isolates from 689,036 to 1,854,925, with no correlation to population. The number and average length of these read pairs allowed the expected coverage of every genome to be calculated; it ranged from 13.3x-23.1.8x for the Pseudomonas and Agrobacterium strains to 29.4x and 64.1x for the Limnobacters, which had smaller genome sizes (Table 2). Clean reads were assembled into scaffolds using the software Velvet (Zerbino et al. 2008) along with the optimizing script velvetoptimiser.pl (on default parameters); repetitive sequence errors were corrected using the functionalities of the AMOS software package. The results of the initial assemblies suggested most genomes could be assembled into fewer than 1,000 scaffolds, which was on par with assemblies used in other comparative genomics studies of incomplete genomes (O'Brien et al. 2011) (Table 2). However, upon visually inspecting the genomes in CLC Genomics Workbench it became clear that there were problems with the assemblies. The initial reads used to assemble the genomes were mapped back to the genome scaffolds, and multiple small regions of extremely high coverage (almost 1000-fold higher than neighboring regions) were observed that were identical to the Illumina adaptor sequence. The Illumina adaptor is a proprietary sequence construct, into which the DNA of interest is ligated, that supports the basic mechanics of the sequencing reaction. We determined that the adaptor was sequenced in cases where the ligated DNA insert was shorter than the read length the machine was programmed to sequence, causing the Illumina sequencer to read into the adaptor sequence. Thus, it appeared that the presence of this internal adaptor sequence caused contigs to be joined artificially.
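The expected coverage quoted earlier in this section follows from the number of read pairs, the trimmed read length and the genome size. The sketch below works through one case. It assumes that the garbled table row for Ss_F1 lists 1,107,834 read pairs and an assembled genome size of 4,778,618 bp, and that each end contributes the trimmed length of 60 nucleotides; these readings are assumptions, although the result agrees with the 27.8x printed in that row.

```python
def expected_coverage(read_pairs, read_length, genome_size):
    """Expected fold coverage: total sequenced bases divided by genome size."""
    total_bases = read_pairs * 2 * read_length  # two ends per read pair
    return total_bases / genome_size

# Values read from the Ss_F1 table row (interpretation is an assumption).
coverage = expected_coverage(read_pairs=1_107_834,
                             read_length=60,
                             genome_size=4_778_618)
print(f"{coverage:.1f}x")
```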

Statistics of Adaptor Containing Genomic Reads

Ss_F1 1,107,834 4,778,618 735 12,606 51,920 5026 (27.8x) (65.4%)

717,642 RPs4,253 Ss_F1 231,719 ATRs 4,40,534 3,410 1,932 13,551 5,487 1,756 (25.3x)

In order to fix this misassembly, the original reads were subjected to a modified QC procedure that added a step to remove internal adaptor sequences using the FASTX-Toolkit (Hannon Lab open access software). Non-adaptor containing read pairs (RPs) and adaptor trimmed sequences (ATRs) were imported into CLC Genomics Workbench (column 2 of Table 3); once in CLC, the read ends were trimmed using the CLC quality trimmer function set on default parameters and assembled using the CLC assembler (CLC Bio, Cambridge, MA). This cleaner pre-processing method resulted in more fragmented assemblies: the Pseudomonas and Agrobacterium strains assembled into 4,429 to 6,720 contigs, roughly 3-10 times the 408 to 2,149 scaffolds of the previous assemblies (Table 2 and 3). However, the assemblies were now free from artifacts caused by contaminating adaptor sequences, as ascertained by mapping reads back to scaffolds for each sample. Following assembly, open reading frames (ORFs) were identified by uploading the genome assemblies to RAST (Aziz et al. 2008), and the genes associated with ORFs were annotated using the COG and NOG subsets of the eggNOG Database v3.0 (Powell et al. 2011).

Diversity within Holobiont-Associated Microbial Populations

Since the bacterial genomes within each population isolated in this study share >99% identity of the 16S rRNA gene, we wanted to know whether the genomes within those populations were clonal or had distinct genomic repertoires. This question was addressed first by calculating the average nucleotide identity (ANI) of each pair of genomes within each population, based on orthologs identified by best bi-directional BLASTn in RAST (Aziz et al., 2008). For the Pseudomonas population the ANI between strains ranged from 94.91%-97.08%, for the Agrobacterium strains it ranged from 95.5%-97.46%, and the ANI between the two Limnobacter isolates was 92.5%. This revealed that although the strains within our populations are quite similar, they are not clonal, as there are nucleotide differences among their shared genes. An ANI greater than 94-96% has been cited as evidence that two strains belong to the same species (Richter and Rosselló-Móra, 2009). In our study the Pseudomonas and Agrobacterium strains were each related above this threshold and may be part of the same ecological population, while the ANI of 92.5% between the two Limnobacter strains may indicate they are not the same ecological species. However, when compared to the most closely related genome sequences in the RAST database, it is clear that the isolates from N. vectensis are more closely related to each other than to near phylogenetic neighbors (Appendix I). A further test of the diversity of the populations was an analysis of shared and unshared gene contents. Normally this would be tested by performing a reciprocal-best BLAST hit or reciprocal smallest distance analysis between ORFs from each pair of genomes within a population, and then merging orthologous groups so that shared and unshared genes can be visualized in a Venn diagram.
However, because the relatively low coverage of the genomes prevented complete assemblies, structural gene annotation likely produced artificially high numbers of genes, since a gene split between the ends of two non-contiguous contigs is annotated as two distinct genes. Reciprocal best BLAST hit analysis would therefore fail to group split genes together and would make clonal strains appear artificially distinct. Artificially high gene numbers do indeed appear to be the case for the poorly assembled Nematostella isolate genomes, as the protein coding gene density (number of protein coding genes per kilobase pair) of the four isolated Pseudomonas strains is ~1.23, whereas the density for four fully assembled reference strains of Pseudomonas is ~0.90 (Table 4). Heightened coding densities are also observed for the Agrobacterium strains (data not shown).

Table 4 Comparison of gene densities of partially sequenced Pseudomonas genomes isolated from Nematostella and fully assembled Pseudomonas reference genomes

Strain                        Protein-coding genes   Genome size (bp)   Genes per kb   Unique COG/NOGs   COG/NOGs per kb
Po_B4                         6,781                  5,410,491          1.25           2,206             0.41
Po_Gab                        6,372                  5,196,558          1.23           2,252             0.43
Po_Isu                        6,553                  5,288,085          1.24           2,292             0.43
Po_47                         6,359                  5,254,749          1.21           2,203             0.42
Pseudomonas mendocina ymp     [illegible]            [illegible]        [illegible]    [illegible]       [illegible]
Pseudomonas aeruginosa PA1    [illegible]            [illegible]        [illegible]    [illegible]       [illegible]
Pseudomonas stutzeri A1501    4,128                  4,567,418          [illegible]    2,462             0.54
Pseudomonas putida F1         [illegible]            5,959,964          [illegible]    [illegible]       [illegible]

Information for reference strains (the last four rows above) was obtained using the web reference: www.microbesonline.com (Dehal et al. 2009)
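The density figures for the four isolates can be recomputed from the gene counts and genome sizes listed in Table 4. The sketch below does this under the assumption that the isolate columns are, in order, protein-coding genes, genome size in base pairs, and number of unique COG/NOG groups; the dictionary and function names are illustrative.

```python
# strain -> (protein-coding genes, genome size in bp, unique COG/NOG groups)
ISOLATES = {
    "Po_B4":  (6781, 5_410_491, 2206),
    "Po_Gab": (6372, 5_196_558, 2252),
    "Po_Isu": (6553, 5_288_085, 2292),
    "Po_47":  (6359, 5_254_749, 2203),
}

def per_kb(count, genome_bp):
    """Number of features per kilobase pair of genome."""
    return count / genome_bp * 1000

gene_density = {s: per_kb(g, bp) for s, (g, bp, _) in ISOLATES.items()}
cog_density = {s: per_kb(c, bp) for s, (_, bp, c) in ISOLATES.items()}

mean_gene_density = sum(gene_density.values()) / len(gene_density)
mean_cog_density = sum(cog_density.values()) / len(cog_density)
```

Rounded to two decimals, the means reproduce the isolate averages of about 1.23 genes per kb and 0.42 unique COG/NOGs per kb quoted in the text.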

In order to perform a preliminary comparative genome analysis and to avoid artificial inter-strain differences due to the high numbers of split genes in poorly assembled genomes, the genes were functionally annotated by one-way best BLAST hits to the eggNOG Database (v3.0) of COGs (supervised clusters of orthologous genes) or NOGs (non-supervised clusters of orthologous genes) (Powell et al. 2011). This analysis classified each full or split gene within an orthologous group. Shared orthologous groups were then determined between two genomes simply by checking for the presence or absence of a particular COG/NOG group; thus, a gene artificially split into two fragments in one genome and present as a single intact gene in another genome would in both cases register as a single match to the same COG/NOG and be counted as shared. Evidence that this is a better approach can be seen by comparing the N. vectensis Pseudomonas isolates and the fully sequenced reference Pseudomonas in Table 4 in terms of unique COG/NOG density: the isolate and reference averages of 0.42 and 0.49 are much closer than the protein coding gene densities compared above. Analysis of shared COG/NOG groups was carried out within the three Nematostella associated populations of Pseudomonas, Agrobacterium, and Limnobacter (Figure 3); the Stappia strain was excluded, as it was the only strain of its population. For all three of the assessed groups, the genomes within populations do not appear to be clonal, as each genome has from 60 to 279 unique COG/NOG orthologous groups. It is also possible that some COG/NOGs appear unique only because the corresponding genes were not sequenced in the other genomes, and confirmation by PCR is necessary before strain heterogeneity can be finally confirmed. Nevertheless, given the multi-fold sequencing depths of this study, it is unlikely that such high numbers of unique COG/NOGs would be detected by chance.
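The presence/absence comparison can be sketched with sets, which makes clear why a split gene does not inflate the differences between strains. The function name, strain labels and group identifiers below are hypothetical.

```python
def compare_groups(annotations):
    """Compare genomes by presence/absence of orthologous groups.

    `annotations` maps strain -> list of COG/NOG ids, one per annotated gene.
    Duplicates (e.g. two fragments of one split gene) collapse in the set.
    """
    sets = {strain: set(groups) for strain, groups in annotations.items()}
    shared_by_all = set.intersection(*sets.values())
    unique = {}
    for strain, groups in sets.items():
        others = set().union(*(g for s, g in sets.items() if s != strain))
        unique[strain] = groups - others
    return shared_by_all, unique

annotations = {
    # COG0001 appears twice: a gene split across two contigs.
    "strain_1": ["COG0001", "COG0001", "COG0002", "NOG00010"],
    "strain_2": ["COG0001", "COG0002", "NOG00020"],
}
shared, unique = compare_groups(annotations)
```

Both fragments of the split gene map to COG0001, so the group is counted once and reported as shared.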

Additional evidence for genomic heterogeneity among strains of the same 16S rRNA gene sequence type is that pooling sequences from multiple strains failed to improve assembly. If the strains were clonal, we would expect pooling the sequences to improve the assembly; however, for the Pseudomonas, Agrobacterium, and Limnobacter populations the opposite occurred and the N50 decreased. The bacteria isolated from Nematostella thus seem to have richly diverse genomic repertoires.
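For reference, N50 is the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the contig lengths are invented for illustration.

```python
def n50(lengths):
    """Length of the contig at which half the assembly size is reached."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")

example_n50 = n50([80, 70, 50, 40, 30, 20])  # total 290, half 145
```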

Comparison of Symbiont Genomes

One of the primary goals of sequencing the ten bacterial isolates from Nematostella was to use the sequences to gain knowledge about the ecology of the anemone holobiont. The first method employed was to use fully sequenced close phylogenetic relatives of the Nematostella isolates to determine whether the anemone strains contained any unique protein coding genes and whether those unique genes were present among the four populations. To do this, reference genomes from fully sequenced phylogenetic neighbors were selected for each holobiont strain group from the MicrobesOnline database (Dehal et al. 2009). In particular, all Pseudomonas genomes available (n=27) were used for analysis of the Pseudomonas anemone isolates, all available sequences of Rhizobiaceae (n=19) were obtained for analysis of the Agrobacterium population, 30 members of the Burkholderiaceae family were used for assessment of the Limnobacters, and all available Rhodobacteraceae (n=7) as well as the Rhizobiaceae were references for the Stappia. These reference genomes were annotated just as the isolates were, by one-way BLAST against the COG and NOG databases (Powell et al. 2011). To determine whether a gene within an isolate genome was unique, its COG or NOG designation was compared against all COGs and NOGs within the reference strains. The N. vectensis holobiont Pseudomonas strains collectively contained 116 distinct orthologous groups that were not present in any of the reference Pseudomonas genomes. The majority of these holobiont-specific orthologous groups were hypotheticals (COG Category S) and were observed in only one of the four genomes. Similarly, the three Agrobacterium strains contained 100 holobiont-specific COG/NOGs, the Limnobacters 58 and the Stappia 59, with the majority of these corresponding to "hypothetical" orthologous groups.
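The definition of a holobiont-specific orthologous group amounts to a set difference between the groups found in the isolates and the pooled groups of all reference genomes. The sketch below also records how many isolate genomes carry each such group; names and identifiers are hypothetical.

```python
from collections import Counter

def holobiont_specific(isolate_groups, reference_groups):
    """Return {group: number of isolate genomes carrying it} for groups
    that are absent from every reference genome."""
    reference_pool = set().union(*reference_groups.values())
    carriers = Counter()
    for groups in isolate_groups.values():
        carriers.update(set(groups) - reference_pool)
    return dict(carriers)

isolates = {
    "isolate_1": {"COG0001", "NOG00009"},
    "isolate_2": {"COG0001", "NOG00009", "NOG00007"},
}
references = {
    "reference_1": {"COG0001"},
    "reference_2": {"COG0002"},
}
specific = holobiont_specific(isolates, references)
```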

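[Figure 3 image: Venn diagrams of shared COG/NOG orthologous groups. Panel a: L. thiooxidans F1 and FCMA (one legible count: 279); panel b: A. tumefaciens Isu, D5 and D8; panel c: P. oleovorans strains (legible labels: 47, Gab). The remaining counts are not legible.]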

Figure 3 Shared Gene Contents of Nematostella Associated Isolates. Populations are assessed individually: a = Limnobacter Population, b = Agrobacteria Population, c = Pseudomonas Population. Shared gene contents determined by comparing the presence or absence of particular NOG/COG orthologous groups of genomes annotated by one-way best BLAST hit to the COG/NOG subset of eggNOG Database version 3.0. Bolded numbers represent those genes in the operational core.

However, some holobiont-specific orthologous groups did have functions that may reflect ecological significance, and they were recovered from two or more members of a particular strain group (Figure 4). For instance, three of the isolated Pseudomonas have both a unique efflux transporter and an antibiotic synthesis monooxygenase. If antibiotic production by members of the anemone microbiome is necessary for establishment of a community, these genes could be critical factors for association. Further, two of the isolated Pseudomonas strains have a plasmid maintenance protein; plasmids are often associated with adaptive functions such as antibiotic resistance and nutrient utilization, so this too could be a target of interest for further study of interactions within the holobiont. The Agrobacterium and Limnobacter populations also have unique ortholog groups of interest present in multiple members. All three of the Agrobacterium strains have an isochorismate synthase, a branch point enzyme that can lead to the creation of several types of siderophores (Figure 4); its presence might reflect a unique way this population scavenges metal ions from its surroundings. As further evidence of the importance of metals within this population, all three strains have a mercuric transport protein and two of the three have a mercuric periplasmic transport protein, suggesting the importance of a system for efflux of the toxic mercury ion. Perhaps most interesting of all, both Limnobacters have a holobiont-specific ABC-type multidrug transport system, which again may reflect the relevance of antibiotics as a structuring agent of the anemone microbiome. Finally, we tested whether any of these holobiont-specific orthologous groups were shared among the four populations, as this would provide strong evidence of a common holobiont-structuring factor to which the various bacterial populations were adapting similarly. There are three such sharing events (data not shown). However, each is between one Pseudomonas strain and one Agrobacterium strain, and all three of the orthologous groups are hypothetical (COG Category S), revealing little useful ecological information about the anemone microbiome.

[Figure 4 image: chart titled "Holobiont-specific OGs" with columns "Ortholog found in 2 Genomes", "Ortholog found in 3 Genomes" and "Ortholog found in 4 Genomes"; the row labels and inset tables are not legible.]

Figure 4 Holobiont-specific orthologous groups found within multiple members of the Nematostella isolated populations, not found in any tested closely related reference strain. Horizontal axis represents # of genomes of a population a particular orthologous group is found within. Vertical axis represents the different populations. Inset tables within colored boxes are COG/NOG functional descriptions. Additional Information of Holobiont-Specific Genes can be found in Appendix II.

Exploring Phylogenetic Origins of Core and Flexible Genomes of N. vectensis Isolates

The comparison of shared and unshared orthologous groups within the four populations of bacterial associates isolated from N. vectensis permitted their assortment into "core" and "flexible" groups and raised new questions about the phylogenetic origins and functional characteristics of these categories. Core orthologous groups were operationally defined as those present in three or more Pseudomonas strains and those found in two or more Agrobacterium or Limnobacter strains; flexible orthologous groups were the remaining set that did not meet those criteria. The Stappia strain was left out of the core/flexible analysis as it was the only sequenced member of its population. It was hypothesized that, since they were present in fewer members of the populations, flexible orthologous groups likely had much more diverse phylogenetic origins than those of the core. This idea is relevant to test because the diversity and identities of the phylogenetic origins of orthologous groups could reveal insights into the community and structure of the anemone holobiont, and differences between the origins of core and flexible genes could reveal the magnitude of these structuring effects. In order to observe the phylogenetic origins of the core and flexible orthologous groups, the genes annotated as particular orthologous groups were classified by performing a BLASTp against the NCBI NR Database (Altschul et al. 1997). The BLASTp results were imported into MEGAN (MEtaGenome ANalyzer), which binned each gene taxonomically using the Lowest Common Ancestor algorithm to the closest known sequence (Figures 5 and 6) (Huson et al. 2007). Examining the core genome (Figure 5), it is evident that the majority of the orthologous groups bin within the expected class of bacteria (i.e. genes of the Pseudomonas bin into Gammaproteobacteria, those of the Agrobacterium bin into Alphaproteobacteria and Limnobacter orthologous groups bin into Betaproteobacteria). The small exception is 17 orthologous groups that classify within the Delta and Epsilon subdivisions of Proteobacteria. Although the expected binning is still present to a large extent, the flexible genomes of the isolates have a much higher tendency to have orthologous groups binned in other classes of Proteobacteria (Figure 6). This may represent either heavy amounts of lateral gene transfer among the Proteobacteria or errors in taxonomic binning resulting from the limited sequences available within the NCBI NR Database. Diverse bacterial phyla are present in both the core and flexible orthologous groups (OGs); however, they are confined to particular isolates. For instance, L. thiooxidans FCMA has two flexible OGs classified as Cyanobacteria and A. tumefaciens Isu has three core OGs classified as Actinobacteria. These results should be viewed with skepticism, both because of the limited number of genomes available in the NR database and because, although they were core OGs, only one of the three Agrobacterium strains had these particular OGs assigned an Actinobacterial origin. Also of note are four OGs from P. oleovorans B4 that were classified as Eukaryota; upon uncollapsing the MEGAN hierarchy, these OGs were more specifically binned as being of Nematostella origin. This has significant implications in terms of either host-symbiont horizontal gene transfer or host genome DNA contamination and will be explored further in a section below. Finally, in almost all of the Pseudomonas and Agrobacterium strains, OGs in either the core or flexible groups were classified as viral in origin. This may imply that these isolates have similar viral or mobile genetic elements; this too will be explored further in another section below.
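The operational definition of core and flexible orthologous groups can be expressed as a threshold on the number of strains carrying each group. The sketch below uses the thresholds stated in the text; the function name, the dictionary layout and the example identifiers are hypothetical.

```python
from collections import Counter

# Minimum number of strains that must carry a group for it to count as core.
CORE_MIN_STRAINS = {"Pseudomonas": 3, "Agrobacterium": 2, "Limnobacter": 2}

def split_core_flexible(population, strain_groups):
    """Split a population's orthologous groups into (core, flexible) sets.

    `strain_groups` maps strain -> iterable of orthologous group ids.
    """
    carriers = Counter()
    for groups in strain_groups.values():
        carriers.update(set(groups))  # count each group once per strain
    threshold = CORE_MIN_STRAINS[population]
    core = {g for g, n in carriers.items() if n >= threshold}
    flexible = set(carriers) - core
    return core, flexible

pseudomonas = {
    "B4":  ["COG0001", "COG0002", "NOG00005"],
    "Gab": ["COG0001", "COG0002"],
    "Isu": ["COG0001", "COG0002", "NOG00006"],
    "47":  ["COG0001", "NOG00005"],
}
core, flexible = split_core_flexible("Pseudomonas", pseudomonas)
```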

[Figure 5 image: "MEGAN Visualization of Isolate Core Genomes", a MEGAN tree with per-isolate pie charts. Legible node labels: Root, Cellular Organisms, Bacteria, Proteobacteria, Alphaproteobacteria, Betaproteobacteria, Delta/Epsilon Subdivisions, Unclassified Proteobacteria, Actinobacteria, Environmental Samples, Eukaryota, Virus. Key: L. thiooxidans F1 and FCMA; P. oleovorans Gab, B4, 47 and Isu; A. tumefaciens Isu, D8 and D5.]

Figure 5 MEGAN Visualization of N. vectensis Isolate Core Genomes. Genes fulfilling the definition of operational core for each population were compared with BLASTp to the NCBI nr Database (September 2011) and those results were imported into MEGAN (Huson et al. 2007). MEGAN binned the genes at levels of taxonomy using the LCA algorithm with parameters of at least 3 significant hits per clade and minimum bit scores of 50 for significant hits. The size of the pie graph at each node is proportional to the total number of genes binned at or below that node from every genome (i.e. the size of the root node is proportional to the total number of genes used in the analysis from all genomes). The colors of the pie graph sections correspond to the colors assigned to the symbionts in the key found in the top left, and the size of each pie graph section is, again, proportional to the total number of genes binned at or below that level of taxonomy from the specified isolate's core genome.

[Figure 6 image: "MEGAN Visualization of Isolate Flexible Genome", a MEGAN tree with per-isolate pie charts. Legible node labels: Cellular Organisms, Bacteria, Proteobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Cyanobacteria, Virus. Key: L. thiooxidans F1 and FCMA; P. oleovorans Gab, B4, 47 and Isu; A. tumefaciens Isu, D8 and D5.]

Figure 6 MEGAN Visualization of N. vectensis Isolate Flexible Genomes. Genes not fulfilling the definition of operational core (the flexible genome) for each population were compared with BLASTp to the NCBI nr Database (September 2011) and those results were imported into MEGAN. MEGAN binned the genes at levels of taxonomy using the LCA algorithm with parameters of at least 3 significant hits per clade and minimum bit scores of 50 for significant hits. For interpretation of the figure please see the description below Figure 5.

Surprisingly, in the analysis above four genes from the Nematostella associated strains, specifically those of P. oleovorans B4, were annotated via MEGAN as deriving from the anemone genome. Horizontal gene transfer of the shikimic acid pathway from bacteria to the N. vectensis genome has previously been suggested (Starcevic et al. 2008), and it may be possible that anemone-associated bacteria have in turn incorporated anemone sequences. However, the most parsimonious explanation is that the Nematostella genome is contaminated with Pseudomonas DNA incorrectly annotated as Cnidarian, which is causing these genes to have a false phylogenetic origin. To investigate these possibilities, a nucleotide BLAST was performed between the assembled P. oleovorans genomes and the scaffolds of the complete N. vectensis genome (Putnam et al. 2007). Using a stringent expected value cutoff of 1e-30 for accepting a match, Pseudomonas DNA was found in 101 scaffolds of the anemone genome. Those matching scaffolds were examined in terms of GC content, ambiguous base composition and size (Figure 7) in order to assess the likelihood of HGT.
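The per-scaffold quantities examined here (size, GC content and proportion of ambiguous bases) can be computed as in the sketch below. The text does not state the denominators that were used; this sketch assumes GC content is taken over unambiguous bases only and the ambiguous fraction over the full scaffold length. The function name and example sequence are hypothetical.

```python
def scaffold_stats(sequence):
    """Length, GC fraction and ambiguous-base fraction of one scaffold."""
    seq = sequence.upper()
    length = len(seq)
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    ambiguous = length - gc - at  # N and any other non-ACGT symbol
    return {
        "length": length,
        # Assumption: GC fraction ignores ambiguous positions.
        "gc_fraction": gc / (gc + at) if (gc + at) else None,
        "ambiguous_fraction": ambiguous / length if length else None,
    }

stats = scaffold_stats("GGCCATNNNN")
```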

[Figure 7 image: panel (a), histogram of average GC content of all N. vectensis genome scaffolds; panel (b), histogram of GC content of N. vectensis scaffolds containing Pseudomonas DNA.]

(c.)                           All scaffolds    Scaffolds containing Pseudomonas DNA
Total Number of Nucleotides    3.556 x 10^8     1.046 x 10^6
Average Size of Scaffold       33,008           10,361
GC%                            46.00%           60.10%
Percentage Ambiguous Bases     16.60%           77.80%

Figure 7: Analysis of N. vectensis Genome Scaffolds containing and not containing Pseudomonas DNA. (a.) Histogram of all Nematostella genome scaffolds and their average GC% (b.) Histogram of Nematostella scaffolds containing Pseudomonas DNA (c.) Table comparing salient details of all N. vectensis genome scaffolds and those just containing Pseudomonas DNA as determined by BLASTn analysis with threshold 1e-30.

Several observations point to the likelihood that the N. vectensis genome is contaminated with bacterial DNA, including Pseudomonas DNA. The Nematostella scaffolds containing Pseudomonas DNA are on average about 20,000 base pairs shorter than the genome-wide average, contain almost 80% ambiguous nucleotides and have an average GC% that closely matches that of the sequenced Pseudomonas symbionts (Table 2) and is a full 14 percentage points above the average GC% of the anemone genome. Because of this shortness, high percentage of ambiguous bases and skewed GC%, the genes identified as Nematostella within the isolates are likely the result not of HGT but of bacterial contamination in the Nematostella genome. Further tests with the genomes of the Agrobacteria, Limnobacter and Stappia revealed no evidence of lateral gene transfer with the anemone but additional evidence of bacterial contamination in the anemone genome. However, unlike the Pseudomonas, which matched 101 scaffolds, the Agrobacteria, Limnobacter and Stappia populations matched 6, 4 and 3 Nematostella scaffolds respectively; all Nematostella scaffolds identified as bacterial are indicated in Appendix III.

A final screen for evidence of potential horizontal gene transfer in the holobiont was done within the bacterial isolates themselves to examine the hypothesis that these cultured populations interacted with a common gene pool in the holobiont. The genes of every pair of genomes from different populations were compared using BLASTn. No gene pairs from different populations matched at or above 95% nucleotide identity, indicating that there has not been any recent horizontal transfer of protein-coding genes among the four populations.
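The cross-population screen at 95% nucleotide identity can be illustrated with a short filter over BLASTn results. This is only a sketch under stated assumptions: it assumes the hits are available in the standard 12-column tabular BLAST layout (query, subject, percent identity, ...), and the gene identifiers, the `population_of` lookup and the function name are hypothetical.

```python
def cross_population_hits(blast_lines, population_of, min_identity=95.0):
    """Keep hits between genes of different populations at or above min_identity.

    blast_lines: tab-separated lines whose first three fields are
    query ID, subject ID and percent identity (assumed tabular BLAST layout).
    population_of: hypothetical dict mapping each gene ID to its population.
    """
    hits = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if population_of[query] == population_of[subject]:
            continue  # same population: not evidence of transfer between populations
        if identity >= min_identity:
            hits.append((query, subject, identity))
    return hits


# Toy input, not real data.
population_of = {"Po_g1": "Pseudomonas", "Po_g2": "Pseudomonas", "At_g7": "Agrobacterium"}
toy_lines = [
    "Po_g1\tAt_g7\t82.10\t900\t150\t3\t1\t900\t5\t904\t1e-50\t400",
    "Po_g1\tPo_g2\t99.00\t900\t9\t0\t1\t900\t1\t900\t0.0\t1600",
]
print(cross_population_hits(toy_lines, population_of))
```

An empty result for every pair of genomes from different populations corresponds to the outcome reported here.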

Shared Phage and Prophage Element Analysis

Since MEGAN visualization of gene annotations revealed viral origins for some genes of the Pseudomonas and Agrobacterial isolates (Figures 5 and 6), it was hypothesized that the bacterial populations might share the same phage or other mobile genetic elements. To test this, the proteome of each isolate was compared using BLASTP against the PHAST phage and prophage database (Zhou et al. 2011), and the top hits for each gene were compared among the N. vectensis associates.
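The comparison of top hits among isolates amounts to asking, for each phage database sequence, how many isolates have it as the best hit of at least one gene. A minimal sketch of that bookkeeping follows; the data layout, identifiers and function names are hypothetical, and the E-value cutoff of 1e-5 is the significance threshold stated for this analysis.

```python
from collections import defaultdict


def best_hits(hits, max_evalue=1e-5):
    """Map each gene to its best (lowest E-value) significant phage hit.

    hits: iterable of (gene_id, phage_id, evalue) tuples.
    """
    best = {}
    for gene, phage, evalue in hits:
        if evalue >= max_evalue:
            continue  # not a significant match
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (phage, evalue)
    return {gene: phage for gene, (phage, _) in best.items()}


def isolates_per_phage(hits_by_isolate):
    """Count how many isolates share each phage sequence as a best hit."""
    seen = defaultdict(set)
    for isolate, hits in hits_by_isolate.items():
        for phage in best_hits(hits).values():
            seen[phage].add(isolate)
    return {phage: len(isolates) for phage, isolates in seen.items()}


# Toy input, not real data.
toy = {
    "isolate_A": [("a1", "phage_X", 1e-30), ("a1", "phage_Y", 1e-10), ("a2", "phage_Z", 0.5)],
    "isolate_B": [("b1", "phage_X", 1e-12)],
}
print(isolates_per_phage(toy))
```

Phage sequences with a count of one are found in a single isolate only, while higher counts indicate best hits shared within or among populations.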

[Figure 8: stacked bar chart. Horizontal axis: Alphaproteobacteria, Gammaproteobacteria, Betaproteobacteria, All Isolates. Vertical axis scaled from 50 to 100. Legend: 1 Genome, 2 Genomes, 3 Genomes, >3 Genomes.]

Figure 8: Shared Phage Elements of N. vectensis Isolates within and among Populations. Best BLASTP hits for genes that significantly matched a sequence in the PHAST phage database (significant match = E-value < 1e-5) were compared within and among all microbial populations isolated to determine if any were the same phage sequence. The three columns to the left represent comparisons within populations of N. vectensis isolates; colors represent finding a best BLASTP hit to the same phage sequence in the database in one to all of the isolates within that population. The right-most column represents a comparison among all ten strains.

Comparisons of the best BLAST hits to the PHAST phage database (Figure 8) reveal that a large proportion of mobile genetic elements are found only in a single isolate, suggesting dynamic evolution of the strains. Further, while two matches were found among the populations of bacteria, which might suggest lateral transfer of phage elements within the holobiont, manual inspection determined that they were likely of bacterial rather than phage origin, as they were annotated as an ATP-dependent Clp protease ATP-binding subunit and a TldD protease, both highly conserved bacterial proteins. Manual curation of the rest of the matches did, however, reveal phage-specific genes in each of the four populations, suggesting that although phage elements may not be shared among the populations, they play an important role in all of the N. vectensis-associated isolates.

Conclusion

This study focused on 10 sequenced isolates from four populations of bacteria (P. oleovorans, A. tumefaciens, L. thiooxidans and S. stellulata) known to be associated with the sea anemone Nematostella vectensis in order to gain an understanding of the ecology of the anemone holobiont. A first look into the relative diversity within the populations illustrated that the strains within each population are likely not clonal but quite diverse, as evidenced by their ANI and apparent strain-specific OGs, implying that the populations are capable of adapting to multiple environments including the anemone host. This diversity did not, however, preclude an understanding of potential ecological structuring factors of the holobiont. By comparing the shared or "core" orthologs of each population to an abundance of sequenced relatives, holobiont-specific orthologs were found for each population. While none of these holobiont-specific orthologous groups were shared among the populations, some common functional themes emerged which may reflect adaptation to the holobiont environment. For instance, three Pseudomonas isolates had a holobiont-specific antibiotic biosynthesis monooxygenase and a holobiont-specific efflux transporter, while both Limnobacter isolates had a holobiont-specific multidrug resistance efflux pump. This shared holobiont-specific function of efflux across the Limnobacter and Pseudomonas populations, as well as the antibiotic biosynthesis gene in the Pseudomonads, may indicate the relevance of antibiotic production and resistance as a microbial survival and adaptation factor within Nematostella. Although there are these hints of common functional themes, there are few strong signals from the holobiont-specific genes indicating adaptation of the populations to similar ecological factors.
This is likely because, as indicated by the ease with which they were cultured and their diverse genomic repertoires, the 10 isolates used for this study are adapted to many environments and not just the anemone. Better results would likely be obtained by applying these same approaches to obligate, less free-living members of the anemone holobiont, which may be strains that cannot be cultured and can only be studied through single-cell genomics. While MEGAN analysis of "flexible" and "core" orthologs hinted at possible horizontal transfer of genes between the microbes and the anemone host, further inspection revealed that this signal is the result of heavy Pseudomonas contamination within the genome sequence of Nematostella itself (Putnam et al. 2007). However, the MEGAN analysis, along with a BLAST comparison to the PHAST phage and prophage element database, revealed the relevance of phage and mobile elements to the evolution of each population and the complexities yet to be illuminated in the anemone holobiont.

Nematostella vectensis Metatranscriptome Analysis

With the overall goal of this project being to understand the microbial ecology of the anemone holobiont, a general picture was sought of what was happening at the transcriptional level for both the anemone host and the microbial associates. This was obtained by high-throughput metatranscriptomic sequencing of lab-raised anemones. Although primarily descriptive, this analysis allows for the creation of lab-testable hypotheses of associate adaptation within the host. It also provides further evidence of the relevance of the bacterial culturing and comparative genomics work described in previous sections, given the detection of transcripts from those bacteria, particularly those of the Limnobacter population.

Metatranscriptome Preparation

Lab-raised N. vectensis were prepared for metatranscriptomic analysis by a postdoctoral associate in the Thompson laboratory using an initial screen of several different methods designed to deplete rRNA and thus enrich for bacterial mRNA. This initial screen of ribosomal depletion methods was carried out to guide selection of an rRNA depletion strategy for future work (Samodha Fernando, pers. comm.). Ribosomal RNA depletion was done because the expected amount of bacterial mRNA was small compared to the host mRNAs and the rRNAs of both host and associates that would be sequenced (Stewart et al., 2010). A summary of the rationale and biological mechanism of each bacterial mRNA enrichment technique can be found in Table 1. Five anemone metatranscriptomes processed with combinations of bacterial mRNA enrichment techniques and one unprocessed control were sequenced to a target depth of 1 million reads using the Illumina GA-II. The metatranscriptome datasets that are the basis of the analysis presented in this section consisted of FastQ files including both sequence data and quality scores.

Table 1 Bacterial mRNA enrichment techniques and their principles
- Poly(A)purist - A kit that relies on use of oligo(dT) cellulose to preferentially bind Poly(A) tails of eukaryotic mRNA; this is used to remove unwanted eukaryotic mRNAs from our samples.
- RNaseH - Endonuclease that specifically degrades RNA in RNA:DNA hybrids; DNA oligos that bind to specific conserved regions of rRNA are added with it to selectively remove rRNAs.
- mRNAOnly - A kit that relies on an endonuclease that selectively degrades RNAs with 5'-monophosphates; rRNAs have this feature while mRNAs do not, so rRNAs are selectively degraded.
- MICROBEnrich/MICROBExpress - A pair of kits that rely on a novel capture hybridization protocol to selectively degrade eukaryotic RNA and bacterial rRNA respectively.
- Duplex-Specific Nuclease - A nuclease that specifically degrades dsDNA and DNA in DNA:RNA hybrids;

Evaluation of rRNA Depletion Techniques

For each of these samples, the raw sequencing reads were filtered for internal adaptor contamination (as described in section Illumina Adaptor Artifact Detection), tandem repeats and the presence of rRNAs. While all samples started with the same amount of initial total RNA, the final yield of Illumina sequence reads varied by less than a factor of 3 (0.6 to 1.6 million read pairs), and this variability increased to span more than three orders of magnitude after removal of ribosomal contamination and QC filtering (i.e. 117 to 2.47 x 10^5 reads), suggesting very different yields for the different processes (Table 2). In brief, the sample processed with the MICROBEnrich/MICROBExpress kits performed best in terms of the putative mRNA remaining relative to the initial number of reads; 37.7% of its original reads remained (246,506 out of 653,926 reads) as putative mRNAs, compared with 8.24% for the unprocessed control (90,721 out of 1,100,418 reads) (Table 2). This is consistent with the success of the MICROBEnrich/MICROBExpress kits in the creation of previous metatranscriptomes (Leimena et al., 2013). An unexpected result was that the samples processed with RNaseH did worse than the unprocessed control, with a 2-10 fold reduction in the fraction of putative mRNAs remaining after filtering (Table 2). The reason for this is unknown; however, strong conclusions should not be drawn because of the low number of samples. Performing worst of all, the Duplex-Specific Nuclease sample retained only ~0.1% of the original reads (117 out of 107,964 reads), with the bulk of the removed reads (95,649 out of 107,964 reads) being classified as large subunit rRNAs (Figure 1 and Table 2). This was quite unexpected, as anecdotal evidence from Prochlorococcus transcriptome preparations suggested it to be an extremely effective method of rRNA depletion. Because of the extremely low number of remaining reads in the Duplex-Specific Nuclease treated sample, it was excluded from further analysis.

Table 2 Filtering results of N. vectensis metatranscriptome sample reads for ribosomal RNAs and other contaminants
Columns: Initial Read Pairs; Reads Filtered as Large Subunit rRNA; Reads Filtered as Small Subunit rRNA; Reads Filtered as 5S or ITS rRNA; Reads Filtered as Tandem Repeats; Reads Filtered as Adaptor Contaminants; Read Pair Units* Remaining After All Filtering Steps (% Original).
Abbreviations: ITS, Internal Transcribed Spacer. *A read pair unit can be one of three things: 1) A read pair whose ends have both made it through filtering. 2) A pair of reads merged into one read because of shared overlapping sequence. 3) A pair of reads clipped to one read because of adaptor contamination.

[Figure 1: stacked bar chart, 0% to 100% of initial read pairs per sample. Samples: poly(A)purist + RNaseH; poly(A)purist + mRNAonly; poly(A)purist + RNaseH + mRNAonly; poly(A)purist + MICROBEnrich + MICROBExpress + mRNAonly; poly(A)purist + RNaseH + DSN; unprocessed total RNA. Legend: Reads Remaining After Filtering; Adaptor Contaminated Reads; Tandem Repeat Reads; 5S or ITS rRNA; SSU rRNA Reads; LSU rRNA Reads.]

Figure 1 Ribosomal and Contaminant Read Breakdown of Initial Read Pairs. Initial read pairs were filtered against the SILVA large and small subunit rRNA database and a hand-made database of N. vectensis 5S and Internal Transcribed Spacer rRNA (database match = BLASTn bit score > 50.0) as well as filtered for the presence of tandem repeats and adaptor contamination. Remaining reads (shown in orange) were to be subject to functional and taxonomic analysis.
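The retention percentages discussed in this section follow directly from the read counts quoted in the text. As a quick check, they can be recomputed with a few lines of Python; the function name and dictionary layout are illustrative only, and only the counts are taken from the text.

```python
def percent_remaining(remaining: int, initial: int) -> float:
    """Percentage of initial reads that survive all filtering steps."""
    return 100.0 * remaining / initial


# (reads remaining after filtering, initial reads), as quoted in the text.
samples = {
    "MICROBEnrich/MICROBExpress": (246_506, 653_926),
    "Unprocessed control": (90_721, 1_100_418),
    "Duplex-Specific Nuclease": (117, 107_964),
}

for name, (remaining, initial) in samples.items():
    pct = percent_remaining(remaining, initial)
    print(f"{name}: {pct:.2f}% of initial reads remain")
```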

MEGAN Taxonomic Binning and Efficiency of Bacterial mRNA Enrichment

After removal of ribosomal RNA, low complexity sequences and Illumina adaptors, the remaining "filtered" reads from the metatranscriptome samples were compared against all known sequences in the NCBI non-redundant (NR) protein database using BLASTx (Altschul et al. 1997), and those results were imported into the software package MEGAN (Huson et al. 2007) in order to bin the reads taxonomically; the reads were first separated into the broad phylogenetic categories of Bacteria, Archaea, Cnidaria, non-Cnidarian Eukaryota and Virus. While each of the samples contained reads from all five of these broad groups, the majority of reads that could be categorized (17.69-43.16%) were classified as "Cnidarian" (Table 3 and Figure 2), implying that the most abundant type of mRNA present was that of the host anemone. MEGAN also classified some of the reads into the relevant but less informative categories of "Root and Cellular Organisms", "Unassigned Reads" and "Low Complexity" (Table 3). Reads binned into "Root and Cellular Organisms" have significant matches to so many domains of life that the best MEGAN can do is classify them as potentially being any kind of cellular organism. Reads classified as "Unassigned" have significant hits to sequences in the NCBI database; however, these matches do not meet the criteria of MEGAN's algorithm for assigning a taxonomy. Finally, reads considered "Low Complexity" are deemed to have too much of one specific nucleotide, which MEGAN judges more likely to be a sequencing artifact than a relevant read.

Table 3 MEGAN Taxonomic Analysis of Metatranscriptome Diversity
A read pair unit can be one of three things: 1) A read pair whose ends have both made it through filtering. 2) A pair of reads merged into one read because of shared overlapping sequence. 3) A pair of reads clipped to one read because of adaptor contamination.

[Figure 2: stacked bar chart of filtered reads per sample. Legend: Viral Binned Reads; Archaeal Binned Reads; Bacterial Binned Reads; Other Eukaryotic Binned Reads; Cnidarian Binned Reads; Low Complexity Reads; MEGAN Unassigned Reads; Root and Cellular Organisms Binned Reads; No significant NR Database Hit.]

Figure 2 MEGAN Taxonomic Breakdown of Filtered Reads. Filtered reads were compared using BLASTx to the NCBI non-redundant protein database; the results were imported into MEGAN and binned taxonomically according to the lowest-common ancestor algorithm. The categories above describe the taxonomic groups within which they have been sorted. In particular, "No significant NR database hit" = Read pairs having no hit to the NR database of bit score > 40.0, "Low Complexity Reads" = Reads determined by MEGAN to be suspicious based on low nucleotide diversity, "MEGAN Unassigned Reads" = Reads with bit score hits to NR > 40.0 but without enough information to be classified according to the specified parameters of the lowest common ancestor algorithm, "Root and Cellular Binned Organisms" = Those reads with BLAST results that can only place them taxonomically at the root of all known sequences or of all known cellular organisms.

[Figure 3: MEGAN taxonomic tree with pie graphs at each node. Legend (samples): Poly(A)purist + RNaseH; Poly(A)purist + mRNAonly; Poly(A)purist + RNaseH + mRNAonly; Poly(A)purist + MICROBEnrich + MICROBExpress + mRNAonly; Unprocessed RNA.]

Figure 3 MEGAN Analysis of N. vectensis Metatranscriptome Diversity: BLASTx results of filtered reads against the NCBI NR database were imported into the MEGAN software package and binned taxonomically at the most specific node possible on a universal phylogenetic tree. The radius of the pie graphs is non-linearly proportional to the number of reads summarized at and below the current node (see Table 3 for distribution between Eukaryote, Cnidarian and Bacterial reads). The values within the pie graphs for each sample are total reads binned at that level normalized to the total number of reads with significant BLASTx hits to the NR database (bit score > 40.0) in that sample. Bacterial results were uncollapsed to the Class level if possible (no reads in Bacteroidetes could be placed at the Class level; reads from Cyanobacteria could only be classified into Orders). The remaining groups were left at the domain level, and nodes for "Low Complexity", "Not Assigned" and "No Hits" were removed for image clarity.
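The idea behind lowest-common-ancestor binning, as used for the MEGAN analysis, can be illustrated with a simplified sketch: a read is placed at the deepest taxon shared by the lineages of all of its significant hits. This is not MEGAN's implementation, which applies additional score and support parameters; the function name and the example lineages are illustrative assumptions.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each listed root to leaf)."""
    if not lineages:
        return None
    shared = []
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break  # lineages diverge below the previous rank
    return shared[-1] if shared else None


# Hypothetical lineages of the significant hits of two reads.
vibrio = ["cellular organisms", "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrio"]
stappia = ["cellular organisms", "Bacteria", "Proteobacteria", "Alphaproteobacteria", "Stappia"]
anemone = ["cellular organisms", "Eukaryota", "Metazoa", "Cnidaria", "Nematostella"]

print(lowest_common_ancestor([vibrio, stappia]))   # hits disagree below the phylum
print(lowest_common_ancestor([vibrio, anemone]))   # hits span two domains
```

A read whose hits span several domains can only be placed at "cellular organisms", which corresponds to the "Root and Cellular Organisms" category described in the text.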

While the MEGAN software was able to bin some of the reads, the largest fraction of filtered reads had no significant BLASTx hit to the NCBI non-redundant database (Figure 2); the fraction of these non-annotatable filtered reads ranged from 44.39% (109,422 out of 246,506 reads) to 73.81% (9,822 out of 13,307 reads) of our samples (Table 3). This unidentifiable group of putative mRNAs may indicate the extent of the biological novelty of this anemone holobiont, in that these reads may be transcripts from organisms for which neither they nor phylogenetically close relatives have yet been sequenced. However, it cannot be discounted that they could be a number of other things, including non-coding RNAs or unknown and currently undetectable sequencing errors.

In terms of the success of the different processing steps used to enrich the metatranscriptome samples for bacterial mRNA, different samples were most successful depending on whether relative or absolute numbers of bacterial reads are considered. In absolute terms, the "mRNAonly + MICROBEnrich/MICROBExpress" treatment, which yielded the highest number of filtered sequences, also yielded the highest number of bacterial reads, with about five-fold more reads than the unprocessed control (i.e. 728 bacterial reads representing 0.30% of filtered reads). In relative terms, all of the treatments (i.e. rRNA depletion strategies) had about 2-fold or higher proportions of bacterial mRNA relative to the no-treatment control, which had a bacterial proportion of 0.15% (138 out of 90,721 reads). The highest efficiency of bacterial mRNA recovery was achieved by the "RNaseH + mRNAonly" treatment, which yielded 0.69% bacterial reads (472 out of 68,143 reads) (Table 3). It should be noted that without replication the differences observed cannot be interpreted as statistically significant. However, the data from this initial screen supported adoption of the mRNA-only + Microbe Express/Enrich protocol for preparation of metatranscriptomes in diverse samples (e.g. Penn et al., in prep; Timberlake et al., in prep), yielding similar proportions of sequences and successful enrichment of bacterial mRNAs among complex targets (J. Thompson, pers. comm.).
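The distinction between absolute and relative enrichment can be made concrete with the read counts quoted in the text. The sketch below is illustrative only (the function name and data layout are assumptions): absolute enrichment compares bacterial read counts with the control, while relative enrichment compares bacterial proportions of filtered reads.

```python
def bacterial_fraction(bacterial_reads: int, filtered_reads: int) -> float:
    """Fraction of filtered reads that were binned as bacterial."""
    return bacterial_reads / filtered_reads


# (bacterial reads, filtered reads), as quoted in the text.
control = (138, 90_721)
treatments = {
    "mRNAonly + MICROBEnrich/MICROBExpress": (728, 246_506),
    "RNaseH + mRNAonly": (472, 68_143),
}

for name, (bacterial, filtered) in treatments.items():
    absolute_fold = bacterial / control[0]
    relative_fold = bacterial_fraction(bacterial, filtered) / bacterial_fraction(*control)
    print(f"{name}: {absolute_fold:.1f}-fold in absolute reads, "
          f"{relative_fold:.1f}-fold in bacterial proportion")
```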

Descriptive Analysis of Anemone Transcriptome

The reads within each of the five metatranscriptome samples (Table 3) classified as "Cnidarian" were pooled and annotated using BLASTx comparisons against the eggNOG version 3 database of opiNOGs (Powell et al. 2011), a set of orthologous groups from sequenced Opisthokonts (Metazoans/Fungi); opiNOGs, instead of the entire eggNOG database, were used to ensure that reads of the same orthologous group were not artificially separated owing to the redundancy of the eggNOG database. To ensure accurate annotations, a read was only considered to be part of a particular opiNOG if its sequence matched a functionally relevant portion of one of the proteins of that opiNOG. As a further check, top opiNOGs were manually curated to confirm that the reads assigned to them were evenly distributed across the length of a protein when mapped to a particular protein of that opiNOG; for instance, the reads of the second most abundant function of the anemone transcriptome (opiNOG08261 - Proteins involved in cellular iron ion homeostasis) map evenly across the N. vectensis protein 45351.JGI237552, a constituent of that particular opiNOG (Figure 5). The orthologous groups were then counted and ranked to determine the most expressed transcripts of the anemone host (Figure 4).
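The evenness check described here was done by manual inspection of read mappings. A simple numerical proxy, offered only as a hypothetical sketch and not as the method used in this work, is to build a per-position coverage profile from the mapped read intervals and report the fraction of positions covered by at least one read.

```python
def coverage_profile(reference_length, alignments):
    """Per-position read depth along a reference.

    alignments: iterable of (start, end) positions, 1-based and inclusive.
    """
    depth = [0] * reference_length
    for start, end in alignments:
        first = max(start, 1)
        last = min(end, reference_length)
        for position in range(first, last + 1):
            depth[position - 1] += 1
    return depth


def covered_fraction(depth):
    """Fraction of reference positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


# Toy example, not real data: two overlapping reads on a 10-position reference.
toy_depth = coverage_profile(10, [(1, 4), (3, 6)])
print(toy_depth, covered_fraction(toy_depth))
```

A high covered fraction with no single dominant pile-up would be consistent with reads mapping evenly along the protein, whereas reads stacked on one short region would suggest a match to a single shared domain.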

[Figure 4: table with two panels, "Top Cnidarian Functions" and "Top Bacterial Functions", giving orthologous group, COG category, description and percentage of annotated reads. Legible entries:]

Top Cnidarian Functions
opiNOG00002      NA   Protein involved in microtubule-based process                        1.79
opiNOG08261      P    Protein involved in cellular iron ion homeostasis                    1.74
opiNOG06546      S    Glycoprotein 2 (zymogen granule membrane)                            1.24
(ID illegible)   NA   Protein involved in negative regulation of inclusion body assembly
opiNOG22116      S    Hypothetical

Top Bacterial Functions
COG0050          J    GTPases - translation elongation factors                             1.4
NOG12793         S    Calcium ion binding protein                                          1.35
COG3203          M    Outer membrane protein (porin)                                       0.92
COG0749          L    DNA polymerase I - 3'-5' exonuclease and polymerase domains          0.86
COG0188          L    Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), A subunit

Figure 4 Top 15 most highly represented orthologous groups among transcripts of Bacterial and Cnidarian binned reads annotated with eggNOG version 3.0 database (Cnidarian reads with opiNOG subset and Bacterial with COG/NOG subset). For top 15 function determination, annotated read counts of all five samples were merged and ranked. COG Categories: W - Extracellular structures, P - Inorganic ion transport and metabolism, S - Poorly characterized, I - Lipid transport and metabolism, Z - Cytoskeleton, J - Translation, ribosomal structure and biogenesis, M - Cell wall/membrane/envelope biogenesis, L - Replication, recombination and repair, C - Energy production and conversion, U - Intracellular trafficking, secretion and vesicular transport, E - Amino acid transport and metabolism, T - Signal transduction mechanisms, O - Posttranslational modification, protein turnover, chaperones, NA - Not placed into a functional category.

The top annotation, making up 1.79% of Cnidarian classified reads (3,873 reads out of 215,900), is opiNOG00002, which consists of proteins involved in microtubule processes and tubulin itself (Huson et al. 2007). This result, along with some of the other top expressed Cnidarian functions including actin binding proteins (opiNOG02967) and cell adhesion factors (opiNOG11729), not only reflects how cell structure and cytoskeletal maintenance are among the most active processes within the anemone but also gives confidence in our transcriptomics pipeline, in that the top functions of other metazoan whole-organism transcriptomes, including zebrafish and mouse, are known to be of this variety (Francis et al., 2013). Also of interest is the second highest transcript, which is a protein involved in iron homeostasis (1.74% of all opiNOG annotated reads). The high level of expression of this type of regulator likely indicates the importance of iron in the microenvironment of the anemone.

[Figure 5: read-mapping plot. Horizontal axis: nucleotide position (200 to 800) along 45351.JGI237552. Tracks: consensus coverage histogram and individual reads.]

Figure 5 Read mapping to second highest annotated "Cnidarian" Read Category, opiNOG08261. "Cnidarian" reads assigned to this opiNOG were mapped to the N. vectensis protein 45351.JGI237552 (a ferritin and member of opiNOG08261) to determine the validity of this assignment. Numbers at the top represent the nucleotide position of the mRNA; the gray histogram represents the coverage of a particular nucleotide position. Physical representations of reads are shown below the histogram; green reads are the "forward" read of paired end reads while red are the "reverse".

Descriptive Analysis of Bacterial Transcriptome Top Functions

Reads that were annotated as "Bacterial" were pooled among the samples (Total 1853 bacterial mRNA sequences) and annotated into orthologous groups using BLASTx comparisons versus the COG and NOG subsets of the eggNOG database v3.0 (Huson et al. 2007). These identified, expressed orthologous groups were then ranked to determine the top bacterial functions within the anemone holobiont (Figure 4).
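Ranking the expressed orthologous groups is a counting exercise over the per-read annotations. A minimal sketch is given below; the function name, the `total_reads` option and the toy identifiers are illustrative assumptions rather than the pipeline actually used.

```python
from collections import Counter


def rank_orthologous_groups(assignments, top_n=15, total_reads=None):
    """Rank orthologous groups by the number of reads assigned to them.

    assignments: iterable with one orthologous-group ID per annotated read.
    total_reads: denominator for percentages; defaults to the number of
    annotated reads (assumption: pass the pooled read count instead if
    percentages should be relative to all binned reads).
    """
    counts = Counter(assignments)
    denominator = total_reads if total_reads is not None else sum(counts.values())
    return [
        (group, n, 100.0 * n / denominator)
        for group, n in counts.most_common(top_n)
    ]


# Toy input, not real data.
toy_assignments = ["OG_A"] * 5 + ["OG_B"] * 3 + ["OG_C"] * 2
for group, n, pct in rank_orthologous_groups(toy_assignments, top_n=2):
    print(f"{group}: {n} reads ({pct:.1f}%)")
```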

The most expressed orthologous group, containing 1.73% of the "Bacterial" read counts (32 out of 1853 read counts), is NOG323497, which contains hypothetical conserved proteins. To determine the validity of this classification, the reads assigned to this orthologous group were mapped to the transcript of the protein member of the group that was most similar to our reads, protein 216895.VV1_0587 from the genome of Vibrio vulnificus CMCP6 (Figure 6); this was identical to the method used to verify the top "Cnidarian" opiNOG assignments. The reads distribute along the length of the gene, lending support to our classification of these reads within this particular orthologous group. Further, this highly expressed bacterial transcript from a hypothetical gene is an intriguing target for determining bacterial association and community structuring factors within the anemone and could be explored further.

[Figure 6: read-mapping plot. Horizontal axis: nucleotide position (100 to 400) along 216895.VV1_0587. Tracks: consensus coverage histogram and individual reads.]

Figure 6 Read mapping to representative sequence of highest expressed "Bacterial" orthologous group NOG323497. To determine the evenness of spread of the reads assigned to this orthologous group, the reads were mapped against the nucleotide coding sequence of the protein member most similar to our reads (Protein 216895.VV1_0587 from the genome of Vibrio vulnificus CMCP6). Numbers at the top represent the nucleotide position of the mRNA; the gray and pink histogram represents the coverage of a particular nucleotide position. Physical representations of reads are shown below the histogram; green reads are the "forward" read of paired end reads while red are the "reverse".

A majority of the rest of the top expressed bacterial functions are "housekeeping" genes, including those that code for translation elongation factors, DNA polymerases, topoisomerase, ribosomal proteins, ATPases, alkaline phosphatase and aconitase (Figure 4). While they provide little information about the interactions of bacterial associates within the host, the presence of these functions indicates that bacteria are not just residing within the anemone but actively growing. Several top functions may indicate activities that mediate acclimation and/or interaction within the holobiont. One function that may indicate bacterial adaptation to the holobiont is COG0841, an orthologous group that contains many acriflavin resistance genes and accounts for 9 out of 1853 of our read counts. Acriflavin is a known antibacterial compound; if Nematostella were to use it as a microbiome structuring agent, resistance would imply an adaptation required to inhabit the anemone space. Another interesting observation is that both lists of top "Bacterial" and "Cnidarian" functions contain orthologous groups annotated as calcium ion binding proteins (Figure 4). This, however, does not provide any evidence of the ecology of the anemone, as calcium ion binding proteins are involved in many basic cellular pathways and processes. More broadly, COG0841 consists of cation/multidrug efflux proteins; this may represent a mechanism of antibiotic resistance used by the bacteria for adaptation within the anemone microbiome. Another possible mechanism of antibiotic resistance or general stress resistance may come from COG3203 and COG0477, orthologous groups containing outer membrane porins and permeases respectively. Both of these types of proteins permit the diffusion of solutes across the cell wall and thus allow a bacterium to interact with the solutes of the external environment. These three cell-wall/membrane associated orthologous groups, along with COG1289, a potential membrane protein, make up four of the fifteen top expressed bacterial functions and illustrate how important it likely is for bacterial members of the holobiont to regulate the barrier between themselves and the anemone.

Bacterial Transcriptome Taxonomic Analysis and Comparison to 16S rRNA gene Clone Library Data

The MEGAN analysis was able to group the "Bacterial" reads not only at the domain level but also at more precise levels of taxonomy, from phylum all the way to species (Figure 3). Further, the putative SSU rRNA reads filtered from the metatranscriptome samples during the preprocessing steps (Table 2) could also be phylogenetically analyzed. This presented opportunities to validate both the MEGAN taxonomic classification and earlier work on the bacterial inhabitants of N. vectensis that included constructed 16S rRNA clone libraries (Har, MS Thesis), in the hope of gaining an even fuller understanding of the bacterial associates of Nematostella. In order to classify the putative SSU rRNAs filtered from the metatranscriptomes in the pre-processing steps, the sequences were uploaded and compared to known SSU rRNA sequences in the RDP database using MG-RAST. They were classified at the genus level with a threshold of at least 70 nt matching and an e-value of less than 1e-20, and at the species level with a threshold of at least 90 nt matching and an e-value of less than 1e-40. The unprocessed metatranscriptome sample was assessed first as it was the least likely to be biased by one of the rRNA subtraction protocols. At the genus level, only 60 reads could be classified, with the majority (58 of the 60 reads; 97%) annotated as a genus of Proteobacteria and the bulk of those 58 identified as "Vibrio" or "unclassified Alphaproteobacteria" (Figure 7). This large proportion of Proteobacteria matches the MEGAN taxonomic binning of mRNA, as 1,616 of 1,853 binned "Bacterial" reads (87%) were further grouped as "Proteobacteria". The large proportion of Proteobacteria among the SSU rRNA reads analyzed with MG-RAST is also partly supported by the previous 16S clone library work, as some clone libraries, particularly those generated from lab-raised anemones, tended to be dominated by Proteobacteria (Har, MS Thesis).
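The two classification thresholds described above can be expressed as a simple rule on alignment length and e-value. The sketch below only illustrates that rule; it is not MG-RAST code, and the function name is hypothetical.

```python
def classification_level(alignment_nt: int, evalue: float):
    """Most specific level at which a read-to-database alignment qualifies.

    Thresholds are those stated in the text: species requires at least 90 nt
    and an e-value below 1e-40; genus requires at least 70 nt and an e-value
    below 1e-20. Returns None if neither threshold is met.
    """
    if alignment_nt >= 90 and evalue < 1e-40:
        return "species"
    if alignment_nt >= 70 and evalue < 1e-20:
        return "genus"
    return None


# Toy alignments, not real data.
for length, evalue in [(95, 1e-50), (80, 1e-30), (95, 1e-30), (60, 1e-50)]:
    print(length, evalue, classification_level(length, evalue))
```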

One of the non-proteobacterial reads was classified as Synechococcus, which is likely not aberrant as Synechococcus species were also found in the previous 16S clone library work (Har, MS Thesis). Synechococcus was not found in the MEGAN phylogenetic results as the Cyanobacterial results were only robust at the phylum level.

[Figure 7: horizontal bar chart. Horizontal axis: # of Reads (0 to 16). Vertical axis (genera): unclassified (derived from Alphaproteobacteria), Vibrio, unclassified (derived from unclassified sequences), Spiroplasma, Maricaulis, Burkholderia, Thiobacillus, unclassified (derived from Alteromonadaceae), Aquimarina, Croceibacter, Cyanothece, Synechococcus, Pirellula, Cohaesibacter, Marivita, unclassified (derived from Rhodobacteraceae), Desulfuromusa, Shewanella, Alcanivorax, Pseudomonas, Solimonas.]

Figure 7 Genus level assignment of SSU rRNA sequences from the unprocessed control sample. Putative SSU rRNA reads filtered from the "unprocessed control" metatranscriptome sample were compared to the RDP database of SSU rRNA sequences using MEGAN; reads were annotated at the genus level if they aligned with a sequence in the database with minimum length 70 nt and e-value less than 1e-20. Genus names are on the vertical axis; counts of reads are on the horizontal.

To leverage the power of greater read depth, SSU rRNA filtered reads from all six samples were classified at the species level using a minimum alignment size of 90 nucleotides and an e-value cutoff of 1e-40. Although SSU reads from the rRNA depleted samples are most likely biased by the selective removal of rRNAs based on sequence homology to proprietary probes (Microbe Enrich/Express) or amplified rDNA (RNaseH or DSN), the richness and taxonomic groups revealed can still be used to gain insight into the composition of the N. vectensis associated microbial community (Figure 8). As in the analysis with only the unprocessed reads, the majority of the reads from samples treated for rRNA depletion were classified as Proteobacterial, with Vibrio in particular as the dominant type. These criteria revealed sequence types most similar to Vibrio parahaemolyticus and Maricaulis maris as present in multiple samples. While the previously made clone libraries did not include these particular species of bacteria, they did include related strains. Several species of

[Figure 8 image: stacked horizontal bar chart in two panels. Horizontal axis: counts of reads (0 to 24 in the first panel, 0 to 28 in the second). Vertical axis: species names grouped by bacterial order; the second panel shows the order Vibrionales. Key: 1) RNAseH, 2) mRNA only, 3) RNAseH + mRNA only, 4) Microbe Enrich/Express, 5) Unprocessed, 6) Double Stranded Nuclease.]

Figure 8 Species level assignment of SSU rRNA sequences from all metatranscriptome samples. Putative SSU rRNA reads filtered from the metatranscriptome samples were compared to the RDP database of SSU rRNA sequences using MEGAN; reads were annotated at the species level if they aligned with a sequence in the database with minimum length 90 nt and e-value less than 1e-40. Species names are on the vertical axis; counts of reads are on the horizontal. The different metatranscriptome samples are colored as indicated by the key and the relative abundances of each species are displayed as a stacked bar chart. Bacterial order names are the leftmost labels of the grey and white boxes.

Vibrio were present in the 16S rRNA gene libraries, as was an unspecified Rhodobacteraceae, which is in the same family as Maricaulis maris. These two species are also not specifically found in the MEGAN taxonomic analysis; however, as in the clone libraries, reads classified as Vibrio and Rhodobacteraceae are present. Of particular interest is the presence, in two of the six metatranscriptome samples, of SSU rRNAs classified by MG-RAST as Stappia stellulata. Although this strain is not found in the clone libraries, it has been found in culture-based analysis of Nematostella and is one of the ten sequenced isolates used for comparative genomic analysis. While the SSU rRNA reads analyzed by MG-RAST were mostly grouped as Proteobacteria, the MEGAN taxonomic analysis revealed the small but relevant presence of other bacterial phyla, including Bacteroidetes, Planctomycetes, Firmicutes, Cyanobacteria and Actinobacteria. Two of these phyla (Actinobacteria and Firmicutes) were found in four or more of the five MEGAN imported samples, lending weight to their true presence within the anemone holobiont (Figure 3); however, these phyla were not detected within the 16S clone libraries. Further, while Planctomycetes and Bacteroidetes were found in only one or two MEGAN samples, these phyla make up a significant portion of the libraries, illustrating the complexity and continuum of N. vectensis microbial inhabitants.
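The species-level thresholds used above (alignment length of at least 90 nucleotides and an e-value cutoff of 1e-40) can be illustrated with a minimal sketch. This is not MEGAN's assignment algorithm; it is a simplified best-hit filter, and the `Hit` record, the function names and the choice to treat the e-value cutoff as inclusive are all assumptions made for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One alignment of a read against a reference SSU rRNA sequence (hypothetical record)."""
    read_id: str
    taxon: str
    align_len: int  # alignment length in nucleotides
    evalue: float


def passes(hit: Hit, min_len: int, max_evalue: float) -> bool:
    """True when the hit meets both thresholds (cutoff treated as inclusive, an assumption)."""
    return hit.align_len >= min_len and hit.evalue <= max_evalue


def count_by_taxon(hits: Iterable[Hit], min_len: int = 90,
                   max_evalue: float = 1e-40) -> Dict[str, int]:
    """Count reads per taxon, using each read's lowest e-value hit that passes."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not passes(hit, min_len, max_evalue):
            continue
        current = best.get(hit.read_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.read_id] = hit
    counts: Dict[str, int] = {}
    for hit in best.values():
        counts[hit.taxon] = counts.get(hit.taxon, 0) + 1
    return counts
```

Relaxing `min_len` and `max_evalue` admits more reads at the cost of less specific assignments, which is the trade-off behind using looser criteria at the genus level than at the species level.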

N. vectensis Bacterial Associate Genome Recruitment Mapping

Filtered reads were also recruited, under stringent conditions (90% identity over 50% of the read length), to the sequenced and annotated genomes of the 10 cultured N. vectensis associated bacteria in order to determine whether those particular bacteria were present and active within the metatranscriptomes of the lab anemones. The Pseudomonas and Limnobacter populations recruited 18 and 127 reads respectively, suggesting they may be relevant and active members of the microbial community. The Agrobacteria and Stappia genomes had zero and one reads mapped to them respectively, providing inconclusive evidence regarding whether they are active within the anemone holobiont. In the Pseudomonas genomes, the reads mapped primarily to conserved housekeeping genes (Table 4); these included translation initiation and elongation factors, ribosomal proteins, and synthases of ATP and nucleotides. Two genes of potential ecological relevance were detected: a chemotaxis protein and a multidrug-transport protein. However, each of these genes had only one read recruited to it, making it hard to draw strong conclusions from their presence. The recruited reads of the Limnobacter population, however, do offer some insight into its ecology within the anemone (Table 5). The top function for the Limnobacter population is a member of the phasin protein family, a group of proteins responsible for the synthesis and structure of PHA granules, which are insoluble spherical inclusions of polyesters used as energy and carbon storage reserves within bacterial cells. PHA granules have recently been determined, through direct genetic manipulation of symbiont bacteria, to play a critical role in

allowing a species of Burkholderia to establish a community within the bean bug Riptortus pedestris (Kim et al., 2013); when the PHA synthesis genes were knocked out, the density of the Burkholderia communities within the bean bugs declined sharply and host size was reduced. Further, the PHA-deficient bacteria were more vulnerable to general environmental stresses. Given the strong expression of this particular phasin protein, the presence of COG3243, a synthase of the polyester units that compose the granules (Table 5), and this recent work with bean bug symbionts, it is possible to hypothesize that PHA granules play a similarly important role in the adaptation of Limnobacter to the anemone holobiont.
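The recruitment criterion used for the mapping above (90% identity over at least 50% of the read length) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used; the tuple layout, the function names and the rule that each read is counted once (for the first genome it satisfies) are assumptions.

```python
from typing import Dict, Iterable, Tuple

# (read_id, genome, percent_identity, alignment_length, read_length); hypothetical layout
Alignment = Tuple[str, str, float, int, int]


def is_recruited(pct_identity: float, align_len: int, read_len: int,
                 min_identity: float = 90.0, min_cover: float = 0.5) -> bool:
    """True when the alignment has >= 90% identity over >= 50% of the read."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    return pct_identity >= min_identity and align_len / read_len >= min_cover


def recruit_counts(alignments: Iterable[Alignment]) -> Dict[str, int]:
    """Count recruited reads per genome, counting each read at most once."""
    seen = set()
    counts: Dict[str, int] = {}
    for read_id, genome, pct_identity, align_len, read_len in alignments:
        if read_id in seen or not is_recruited(pct_identity, align_len, read_len):
            continue
        seen.add(read_id)
        counts[genome] = counts.get(genome, 0) + 1
    return counts
```

Requiring coverage of half the read as well as high identity prevents short, highly conserved fragments from being counted as evidence that a particular associate is present.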

Table 4 Top functions of metatranscriptomic reads recruited to the Pseudomonas population of the sequenced anemone associates.

COG/NOG         # Reads Mapped  COG category  Function
COG0050         6               J             GTPases - translation elongation factors
COG0090         3               J             Ribosomal protein L2
Hypothetical_1  1               NA            -
COG0093         1               J             Ribosomal protein L14
COG0644         1               C             Dehydrogenases (flavoproteins)
COG0290         1               J             Translation initiation factor 3 (IF-3)
COG0842         1               V             ABC-type multidrug transport system, permease component
COG0209         1               F             Ribonucleotide reductase, alpha subunit
COG0840         1               T             Methyl-accepting chemotaxis protein
Hypothetical_2  1               NA
COG0055         1               C             F0F1-type ATP synthase, beta subunit

Reads were mapped to the gene sequences of the four sequenced Pseudomonas associates that were annotated using the COG and NOG subsets of the eggNOG database (version 3.0). Mapped reads to transcripts unable to be annotated in the above manner are labeled "Hypothetical_#". NA = Not classified into a COG category. See Figure 4 for relevant COG category definitions.

Other top mapped functions of the Limnobacter population include phosphate and iron transporters and flagellar motility proteins (Table 5). The expression of these transcripts portrays the Limnobacter population as an active, foraging member of the anemone microbiome. Further, the high expression of iron regulatory genes, similar to that seen in the annotated "Cnidarian" reads, provides additional evidence of iron as a potential structuring factor within the anemone.

Table 5 Top 40 functions of metatranscriptomic reads recruited to the Limnobacter population of the sequenced anemone associates.

COG/NOG         # Reads Mapped  COG category  Function
NOG45042        22              S             Phasin family protein
Hypothetical_1  10              NA
COG3211         6               R             Predicted phosphatase
COG1629         6               P             Outer membrane receptor proteins, mostly Fe transport
Hypothetical_2  6               NA
COG0226         4               P             ABC-type phosphate transport system, periplasmic component
COG3604         4               T             Transcriptional regulator containing GAF, AAA-type ATPase, and DNA binding domains
COG2885         4               M             Outer membrane protein and related peptidoglycan-associated (lipo)proteins
Hypothetical_3  3               NA
NOG47765        3               S
Hypothetical_4  3               NA
Hypothetical_5  3               NA
COG3203         3               M             Outer membrane protein (porin)
COG2908         3               S             Uncharacterized protein conserved in bacteria
NOG69967        2               S             Galactose oxidase
NOG268346       2               S
COG1386         2               K             Predicted transcriptional regulator containing the HTH domain
COG0183         2               I             Acetyl-CoA acetyltransferase
COG2063         2               N             Flagellar basal body L-ring protein
NOG12793        2               S             Calcium ion binding protein
Hypothetical_6  2               NA
Hypothetical_7  2               NA
COG1960         2               I             Acyl-CoA dehydrogenases
Hypothetical_8  2               NA
COG0834         2               T             ABC-type amino acid transport/signal transduction systems, periplasmic component/domain
COG4774         2               P             Outer membrane receptor for monomeric catechols
COG0784         2               T             FOG: CheY-like receiver
COG1426         2               S             Uncharacterized protein conserved in bacteria
NOG16078        2               S
COG2084                         I             3-hydroxyisobutyrate dehydrogenase and related beta-hydroxyacid dehydrogenases
COG1278                         K             Cold shock proteins
COG1999                         R             involved in biogenesis of respiratory and photosynthetic systems
COG1520                         S             FOG: WD40-like repeat
COG0234                         O             Co-chaperonin GroES (HSP10)
COG0194                         F             Guanylate kinase
COG4105                         R             DNA uptake lipoprotein
COG0773                         M             UDP-N-acetylmuramate-alanine ligase
COG0102                         J             Ribosomal protein L13
COG3243                         I             Poly(3-hydroxyalkanoate) synthetase
NOG08290                        E             L-Ectoine synthase

Reads were mapped to the gene sequences of the two sequenced Limnobacter associates that were annotated using the COG and NOG subsets of the eggNOG database (version 3.0). Mapped reads to transcripts unable to be annotated in the above manner are labeled "Hypothetical_#". NA = Not classified into a COG category. See Figure 4 for relevant COG category definitions.

Bacterial Life Strategy Assessment Limnobacter vs. Other Proteobacteria

The analysis of top expressed "Bacterial" reads annotated by MEGAN, together with those annotated through read mapping to sequenced associate genomes, suggests the presence of two distinct bacterial strategies for residing within the anemone holobiont. In the first, seen in the top "Bacteria" expressed functions (Figure 4), mapped Pseudomonas reads (Table 4) and top "Proteobacteria" expressed functions (Table 6), the bacteria are actively regulating solute levels with the external environment, synthesizing ribosomes and ATP, and replicating. In the second, seen in the top expressed functions of Limnobacter mapped reads, the bacteria are foraging, adapting and interacting within the holobiont. To explore this divergence of strategies further, the relative proportions of reads annotated into particular COG categories were compared between reads classified as "Proteobacteria" by MEGAN and reads classified as Limnobacter by genome mapping (Figure 9). The results show that the Limnobacter population is expressing disproportionately higher amounts (by at least two-fold) of COG categories T (Signal Transduction Mechanisms), P (Inorganic Ion Transport and Metabolism), N (Cell Motility) and I (Lipid Transport and Metabolism), whereas the general "Proteobacteria" are expressing at least two-fold higher amounts of categories J (Translation, Ribosomal Structure and Biogenesis), C (Energy Production and Conversion), G (Carbohydrate Transport and Metabolism) and L (Replication, Recombination and Repair).

Table 6 Top 10 Annotated NOG/COG Orthologous Groups of "Proteobacteria" Binned Reads

COG/NOG     COG category  Function
NOG323497   NA            Hypothetical                                                              1.98
NOG12793    S             Calcium ion binding protein                                               1.55
COG0050     J             GTPases - translation elongation factors                                  1.24
COG3203     M             Outer membrane protein (porin)                                            1.05
COG0749     L             DNA polymerase I - 3'-5' exonuclease and polymerase domains               0.99
COG0055     C             F0F1-type ATP synthase, beta subunit                                      0.80
COG0090     J             Ribosomal protein L2                                                      0.74
COG0188     L             Type IIA topoisomerase (DNA gyrase/topo II, topoisomerase IV), A subunit  0.74
COG0477     R             Permeases of the major facilitator superfamily                            0.68
COG0056     C             F0F1-type ATP synthase, alpha subunit                                     0.62

Reads binned as "Bacteria" by MEGAN analysis were further subdivided into Bacterial phyla. The 1,616 out of 1,853 total "Bacterial" reads that were classified as "Proteobacteria" had their annotations filtered from the total "Bacterial" pool and are seen above. See Figure 4 for relevant COG Category Information.

This suggests that the Limnobacter population may be adapting to the anemone in its own distinct way: signaling, moving, and scavenging lipids and ions (mostly iron) to withstand the microenvironment of the host. In addition, MEGAN classification of BLASTx annotations shows that a more generalized proteobacterial population is successfully inhabiting the anemone by producing energy and dividing.

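[Figure 9 image: bar chart with two series, "Mapped Limnobacter Reads" and "MEGAN Binned Proteobacteria Reads". Horizontal axis: COG Category (NA, C, E, D, G, F, I, H, K, J, M, L, O, N, Q, P, S, R, U, T, V). Vertical axis ticks run from 0 to 40.]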

Figure 9 COG Category Distribution of Reads Binned By MEGAN as Proteobacteria and Reads Mapped to Sequenced Limnobacter Genomes. Total count of reads mapped to Limnobacter genomes = 127; total count of reads binned by MEGAN as Proteobacteria = 1616. See Figure 4 for COG category descriptions.
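The comparison summarized in Figure 9 rests on converting per-category read counts into proportions of each read set and then flagging categories that differ by at least two-fold. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the handling of categories absent from one set (treated as a proportion of zero) is an assumption rather than the procedure used in this work.

```python
from typing import Dict, List, Tuple


def proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category read counts into fractions of the read set."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("read set is empty")
    return {category: n / total for category, n in counts.items()}


def fold_enriched(a_counts: Dict[str, int], b_counts: Dict[str, int],
                  fold: float = 2.0) -> Tuple[List[str], List[str]]:
    """Return (categories >= fold higher in A, categories >= fold higher in B)."""
    prop_a, prop_b = proportions(a_counts), proportions(b_counts)
    higher_in_a: List[str] = []
    higher_in_b: List[str] = []
    for category in sorted(set(prop_a) | set(prop_b)):
        a = prop_a.get(category, 0.0)  # absent category counts as zero (assumption)
        b = prop_b.get(category, 0.0)
        if a > 0 and a >= fold * b:
            higher_in_a.append(category)
        elif b > 0 and b >= fold * a:
            higher_in_b.append(category)
    return higher_in_a, higher_in_b
```

Working with proportions rather than raw counts matters here because the two read sets differ greatly in size (127 mapped Limnobacter reads against 1,616 Proteobacteria reads).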

Assessment of Sequenced Symbiont Genome Annotation Benefit

One of the initial goals of sequencing the genomes of bacteria associated with Nematostella was to use them as reference genomes for metatranscriptomic studies of the anemone holobiont. It was hypothesized that, because those microbes were found on the anemone, their transcripts would likely be present within its metatranscriptome, and that having their genome sequences might be the only way of identifying them. It was therefore desirable to assess how much these genomes helped to annotate metatranscriptomic reads. To do so, orthologous groups were compared between the "Bacterial" reads identified through MEGAN analysis and COG/NOG annotation and the reads that mapped to any of the 10 symbiont genomes at 90% similarity over 50% of the read. Whereas MEGAN-determined bacterial reads were classified into 719 distinct NOG or COG orthologous groups, reads mapping to symbiont genomes represented only 49 distinct orthologous groups, with 31 shared between both methods of assessment (Figure 10). The symbiont genomes thus added only 18 orthologous groups to the metatranscriptome assessment, all 18 of them gained from the Limnobacter genomes. While the number of unique orthologous groups gained by symbiont mapping is not large, when considered along with orthologous groups with at least 2-fold higher expression in symbiont mapping than in MEGAN analysis, these groups represent most of the unique and interesting ecological insights into activities in the anemone holobiont discussed previously (Table 7). For instance, the majority of the strongly expressed NOG45042 reads and all of the COG3243 reads, both related to the synthesis and regulation of PHA granules (as described in the genome recruitment mapping section above), were identified through read mapping. Further, the flagellar reads (COG2063) and the iron-scavenging TonB-dependent siderophore receptor reads (COG4774) were annotated only in the presence of the Limnobacter genomes. So, while the symbiont genomes added relatively few reads and orthologous group annotations, the ones they did add gave strong evidence of the ways in which bacteria of the Limnobacter population are adapting and surviving within the anemone.

[Figure 10 image: two-circle Venn diagram. Circle labels: "OGs from MEGAN Analysis" and "OGs from Symbiont Genome Mappings".]

Figure 10 Orthologous Group comparison of those present in "Bacterial" reads as determined through MEGAN analysis (Orange circle) and those present in the symbiont genome mapping analysis (Yellow circle). Numbers represent the number of unique orthologous groups present within each particular analysis or shared between them (overlap of circles).
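The comparison behind Figure 10 and Table 7 amounts to a set overlap followed by a fold-change filter. The sketch below illustrates both steps under stated assumptions: the function names are hypothetical, and the fold comparison is made on raw read counts, which is one plausible reading of the total counts shown in Table 7.

```python
from typing import Dict, Iterable, List, Set, Tuple


def compare_groups(megan_ogs: Iterable[str],
                   mapped_ogs: Iterable[str]) -> Dict[str, Set[str]]:
    """Split orthologous groups into MEGAN-only, shared and mapping-only sets."""
    megan, mapped = set(megan_ogs), set(mapped_ogs)
    return {
        "megan_only": megan - mapped,
        "shared": megan & mapped,
        "mapped_only": mapped - megan,
    }


def mapping_favoured(megan_counts: Dict[str, int], mapped_counts: Dict[str, int],
                     fold: float = 2.0) -> List[Tuple[str, int, int]]:
    """Groups seen only in mapping, or with >= fold more mapped reads than MEGAN reads."""
    rows: List[Tuple[str, int, int]] = []
    for og, n_mapped in mapped_counts.items():
        if n_mapped <= 0:
            continue
        n_megan = megan_counts.get(og, 0)
        if n_megan == 0 or n_mapped >= fold * n_megan:
            rows.append((og, n_megan, n_mapped))
    # Highest mapped count first; ties broken by group identifier.
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

With the totals reported above (719 groups from MEGAN analysis, 49 from symbiont mapping, 31 shared), the mapping-only set has 49 - 31 = 18 members and the MEGAN-only set has 719 - 31 = 688.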

Conclusion

Metatranscriptomic analysis of whole Nematostella vectensis animals was performed in order to understand the transcriptional activity of the host and microbes within the anemone holobiont. Because the sought bacterial mRNA is such a small component of the total RNA pool within a complex milieu of organisms such as the anemone holobiont, various methods of eliminating unwanted components, particularly rRNAs, were tested (Stewart et al., 2010). In absolute terms, use of the kits MicrobENRICH/MicrobEXPRESS and mRNAOnly resulted in the greatest number of recovered bacterial mRNA reads, while on a relative scale, use of RNAseH and mRNAOnly was superior. However, with a sample size of one for each of the enrichment/subtraction techniques, it is difficult to draw strong conclusions.

Table 7 Orthologous Groups present in Metatranscriptome Data Unique to Limnobacter Mapping Analysis or Present at Least 2-Fold Higher in Limnobacter Mapping Analysis

COG/NOG     COG category  Total counts,   Total counts,                 Function
                          MEGAN analysis  Limnobacter mapping analysis
NOG45042    S             2               22                            Phasin family protein
COG3211     R             2               6                             Predicted phosphatase
COG3604     T             0               4                             Transcriptional regulator containing GAF, AAA-type ATPase, and DNA binding domains
NOG47765    S             0               3
NOG69967    S             0               2                             Galactose oxidase
COG2063     N             0               2                             Flagellar basal body L-ring protein
COG4774     P             0               2                             Outer membrane receptor for monomeric catechols
COG1426     S             0               2                             Uncharacterized protein conserved in bacteria
NOG16078    S             0               2                             --
COG1520     S             0               2                             FOG: WD40-like repeat
COG3243     I             0               2                             Poly(3-hydroxyalkanoate) synthetase
COG3490     S             0               2                             Uncharacterized protein conserved in bacteria
COG0664     T             0               2                             cAMP-binding proteins - catabolite gene activator and regulatory subunit of cAMP-dependent protein kinases
COG2262     R             0               1                             GTPases
COG3046     R             0               1                             Uncharacterized protein related to deoxyribodipyrimidine photolyase
COG1905     C             0               1                             NADH:ubiquinone oxidoreductase 24 kD subunit
NOG44894    S             0               1
COG0067     E             0               1                             Glutamate synthase domain 1
NOG13288    S             0               1
COG0739     M             0               1                             Membrane proteins related to metalloendopeptidases

Orthologous groups present in the "Bacterial" reads analyzed by MEGAN and in the reads mapping to sequenced symbiont genomes were compared; those present only in the mapped reads, or at least 2-fold higher in the mapped reads, are displayed in the table above. See Figure 4 for relevant COG category definitions.

Assessment of the highest Cnidarian expression activity revealed mostly transcripts coding for cell structure regulation and components of the cytoskeleton, as well as other "housekeeping" functions such as ribosome synthesis and energy production; of possible interest was a highly expressed orthologous group for iron regulation and metabolism. The bacteria, dominated by the phylum Proteobacteria as determined by MEGAN taxonomic analysis (Huson et al., 2007) and classification of SSU rRNAs using MG-RAST, had high expression of many ecologically relevant cell-wall/membrane related functions, including efflux pumps and permeases, as well as a variety of replication and ribosomal synthesis genes. While these top functions indicated the presence of an actively growing, dividing and interacting population of bacteria, genome recruitment mapping to sequenced N. vectensis associates revealed a specialized population of the genus Limnobacter expressing motility, iron regulation, nutrient scavenging and antibiotic resistance genes. Of particular interest within the Limnobacter population is that their most highly expressed gene regulates PHA granules: large, insoluble, intracellular stores of carbon known to be associated with symbiosis in insects (Kim et al., 2013). Metatranscriptomics can be used as a descriptive method to generate more explicit, lab-testable hypotheses. For the anemone holobiont, metatranscriptomics has revealed the possible importance of iron as an environmental factor, as both the Cnidarian host and the bacteria expressed genes to respond to and regulate it. Further, it has revealed evidence of specialization of Limnobacter within the microbiome and the potential importance of PHA granules to their ability to persist within the holobiont.

Conclusions

The starlet sea anemone Nematostella vectensis is an emerging model of evolution and development which, like all multicellular organisms, can be viewed as a "holobiont" consisting of a host organism (the anemone) and microbial associates that influence the host's physiology and evolution (Stefanik et al., 2013; Reitzel et al., 2012; Rohwer et al., 2002; Vega-Thurber et al., 2009). Previous experimental work has shown that similar microbial populations are associated with Nematostella over distinct geographic locations and timescales (Har et al., MS Thesis). However, how these microbes associate with the host, their function within the host and the factors that structure the community are unknown. To address these questions, comparative genomics of known bacterial associates and metatranscriptomics of whole lab-raised anemones were performed.

Comparative genomic analysis was carried out on 10 isolates from four bacterial populations (P. oleovorans, A. tumefaciens, L. thiooxidans and S. stellulata) known to be associated with the anemone. While no genes, phage or mobile genetic elements were found to be horizontally transferred among these populations or between these populations and the anemone host, analysis of holobiont-specific genes (i.e. gene orthologs found only in N. vectensis-associated bacterial genomes and absent in closely related genomes of the same genus/family) within the populations revealed common functional themes that may indicate adaptation of bacteria within the holobiont. Specifically, two of the populations, P. oleovorans and L. thiooxidans, contained a holobiont-specific efflux transporter and multi-drug efflux pump respectively, and the P. oleovorans population contained a holobiont-specific antibiotic biosynthesis monooxygenase. This provides some evidence that antibiotic synthesis/resistance is a mechanism of adaptation within the microbial communities of the anemone, and provides a starting point for future lab-testable work on the factors structuring the anemone holobiont.

Metatranscriptomic analysis of whole lab-raised anemones revealed that the bacteria, dominated by the phylum Proteobacteria as determined by MEGAN taxonomic analysis (Huson et al., 2007) and classification of SSU rRNAs using MG-RAST, had high expression of many ecologically relevant cell-wall/membrane related functions, including efflux pumps and permeases, as well as a variety of replication and ribosomal synthesis genes. These functions indicate the presence of an actively growing, dividing and interacting population of bacteria, but they do little to describe particular bacterial adaptations to the holobiont. Genome recruitment mapping to sequenced N. vectensis associates, unlike the broader MEGAN analysis, revealed a specialized population of the genus Limnobacter expressing motility, iron regulation, nutrient scavenging and antibiotic resistance genes. Of particular interest within this Limnobacter population is that their most highly expressed gene regulates PHA granules: large, insoluble, intracellular stores of carbon known to be associated with symbiosis in insects (Kim et al., 2013). Discovery of these specialized transcripts from the Limnobacter population among the anemone metatranscriptomes illustrates the power of known, sequenced microbial associates in the analysis of a holobiont.

To better understand bacterial adaptation within the anemone, more work should be done to culture and sequence less free-living, more anemone-specific strains, as these, unlike the 10 isolates used in this analysis, will contain clearer genomic signals of the selective environment of the anemone holobiont. Not only will their genomes likely provide a reference for annotating more reads from the metatranscriptome, but assessment of their holobiont-specific genes using the methods established here will provide stronger evidence of microbial community structuring factors within the anemone.

References

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman. "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs." Nucleic Acids Res 25, no. 17 (1997): 3389-402.

Aziz, R. K., D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S. Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke and O. Zagnitko. "The Rast Server: Rapid Annotations Using Subsystems Technology." BMC Genomics 9, (2008): 75.

Backhed, F., H. Ding, T. Wang, L. V. Hooper, G. Y. Koh, A. Nagy, C. F. Semenkovich and J. I. Gordon. "The Gut Microbiota as an Environmental Factor That Regulates Fat Storage." Proc Natl Acad Sci U S A 101, no. 44 (2004): 15718-23.

Bevins, C. L. and N. H. Salzman. "The Potter's Wheel: The Host's Role in Sculpting Its Microbiota." Cell Mol Life Sci 68, no. 22 (2011): 3675-85.

Boettcher, K. J., B. J. Barber and J. T. Singer. "Additional Evidence That Juvenile Oyster Disease Is Caused by a Member of the Roseobacter Group and Colonization of Nonaffected Animals by Stappia Stellulata-Like Strains." Appl Environ Microbiol 66, no. 9 (2000): 3924-30.

Bomar, L., M. Maltz, S. Colston and J. Graf. "Directed Culturing of Microorganisms Using Metatranscriptomics." MBio 2, no. 2 (2011): e00012-11.

Bosch, T. C. "Cnidarian-Microbe Interactions and the Origin of Innate Immunity in Metazoans." Annu Rev Microbiol, (2013).

Chapman, J. A., E. F. Kirkness, O. Simakov, S. E. Hampson, T. Mitros, T. Weinmaier, T. Rattei, P. G. Balasubramanian, J. Borman, D. Busam, K. Disbennett, C. Pfannkoch, N. Sumin, G. G. Sutton, L. D. Viswanathan, B. Walenz, D. M. Goodstein, U. Hellsten, T. Kawashima, S. E. Prochnik, N. H. Putnam, S. Shu, B. Blumberg, C. E. Dana, L. Gee, D. F. Kibler, L. Law, D. Lindgens, D. E. Martinez, J. Peng, P. A. Wigge, B. Bertulat, C. Guder, Y. Nakamura, S. Ozbek, H. Watanabe, K. Khalturin, G. Hemmrich, A. Franke, R. Augustin, S. Fraune, E. Hayakawa, S. Hayakawa, M. Hirose, J. S. Hwang, K. Ikeo, C. Nishimiya-Fujisawa, A. Ogura, T. Takahashi, P. R. Steinmetz, X. Zhang, R. Aufschnaiter, M. K. Eder, A. K. Gorny, W. Salvenmoser, A. M. Heimberg, B. M. Wheeler, K. J. Peterson, A. Bottger, P. Tischler, A. Wolf, T. Gojobori, K. A. Remington, R. L. Strausberg, J. C. Venter, U. Technau, B. Hobmayer, T. C. Bosch, T. W. Holstein, T. Fujisawa, H. R. Bode, C. N. David, D. S. Rokhsar and R. E. Steele. "The Dynamic Genome of Hydra." Nature 464, no. 7288 (2010): 592-6.

Chaucheyras-Durand, F. and H. Durand. "Probiotics in Animal Nutrition and Health." Benef Microbes 1, no. 1 (2010): 3-9.

Coleman, M. L. and S. W. Chisholm. "Ecosystem-Specific Selection Pressures Revealed through Comparative Population Genomics." Proc Natl Acad Sci U S A 107, no. 43 (2010): 18634-9.

Dehal, P. S., M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, D. Chivian, G. D. Friedland, K. H. Huang, K. Keller, P. S. Novichkov, I. L. Dubchak, E. J. Alm and A. P. Arkin. "Microbesonline: An Integrated Portal for Comparative and Functional Genomics." Nucleic Acids Res 38, no. Database issue (2010): D396-400.

Dobber, R., A. Hertogh-Huijbregts, J. Rozing, K. Bottomly and L. Nagelkerken. "The Involvement of the Intestinal Microflora in the Expansion of Cd4+ T Cells with a Naive Phenotype in the Periphery." Dev Immunol 2, no. 2 (1992): 141-50.

Edgar, R. C. "Muscle: Multiple Sequence Alignment with High Accuracy and High Throughput." Nucleic Acids Res 32, no. 5 (2004): 1792-7.

Franzenburg, S., S. Fraune, P. M. Altrock, S. Kunzel, J. F. Baines, A. Traulsen and T. C. Bosch. "Bacterial Colonization of Hydra Hatchlings Follows a Robust Temporal Pattern." ISME J 7, no. 4 (2013): 781-90.

Fraune, S., R. Augustin, F. Anton-Erxleben, J. Wittlieb, C. Gelhaus, V. B. Klimovich, M. P. Samoilovich and T. C. Bosch. "In an Early Branching Metazoan, Bacterial Colonization of the Embryo Is Controlled by Maternal Antimicrobial Peptides." Proc Natl Acad Sci U S A 107, no. 42 (2010): 18067-72.

Fraune, S. and T. C. Bosch. "Long-Term Maintenance of Species-Specific Bacterial Microbiota in the Basal Metazoan Hydra." Proc Natl Acad Sci U S A 104, no. 32 (2007): 13146-51.

Fraune, S. and T. C. Bosch. "Why Bacteria Matter in Animal Development and Evolution." Bioessays 32, no. 7 (2010): 571-80.

Frias-Lopez, J., Y. Shi, G. W. Tyson, M. L. Coleman, S. C. Schuster, S. W. Chisholm and E. F. Delong. "Microbial Community Gene Expression in Ocean Surface Waters." Proc Natl Acad Sci U S A 105, no. 10 (2008): 3805-10.

Gan, H. M., A. O. Hudson, A. Y. Rahman, K. G. Chan and M. A. Savka. "Comparative Genomic Analysis of Six Bacteria Belonging to the Genus Novosphingobium: Insights into Marine Adaptation, Cell-Cell Signaling and Bioremediation." BMC Genomics 14, (2013): 431.

Gilbert, J. A., D. Field, Y. Huang, R. Edwards, W. Li, P. Gilna and I. Joint. "Detection of Large Numbers of Novel Sequences in the Metatranscriptomes of Complex Marine Microbial Communities." PLoS One 3, no. 8 (2008): e3042.

Gosalbes, M. J., A. Durban, M. Pignatelli, J. J. Abellan, N. Jimenez-Hernandez, A. E. Perez-Cobas, A. Latorre and A. Moya. "Metatranscriptomic Approach to Analyze the Functional Human Gut Microbiota." PLoS One 6, no. 3 (2011): e17447.

Har, J. Y. "Introducing the starlet sea anemone Nematostella vectensis as a model for investigating microbial mediation of health and disease in hexacorals." Civil and Environmental Engineering. Cambridge, MA, Massachusetts Institute of Technology (2009): 115.

Hayashi, T., K. Makino, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han, E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami, T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori and H. Shinagawa. "Complete Genome Sequence of Enterohemorrhagic Escherichia Coli O157:H7 and Genomic Comparison with a Laboratory Strain K-12." DNA Res 8, no. 1 (2001): 11-22.

Helbling, D. E., M. Ackermann, K. Fenner, H. P. Kohler and D. R. Johnson. "The Activity Level of a Microbial Community Function Can Be Predicted from Its Metatranscriptome." ISME J 6, no. 4 (2012): 902-4.

Hislop, N. R., D. de Jong, D. C. Hayward, E. E. Ball and D. J. Miller. "Tandem Organization of Independently Duplicated Homeobox Genes in the Basal Cnidarian Acropora Millepora." Dev Genes Evol 215, no. 5 (2005): 268-73.

Hooper, L. V. and J. I. Gordon. "Commensal Host-Bacterial Relationships in the Gut." Science 292, no. 5519 (2001): 1115-8.

Huson, D. H., A. F. Auch, J. Qi and S. C. Schuster. "Megan Analysis of Metagenomic Data." Genome Res 17, no. 3 (2007): 377-86.

Konstantinidis, K. T., A. Ramette and J. M. Tiedje. "The Bacterial Species Definition in the Genomic Era." Philos Trans R Soc Lond B Biol Sci 361, no. 1475 (2006): 1929-40.

Kusserow, A., K. Pang, C. Sturm, M. Hrouda, J. Lentfer, H. A. Schmidt, U. Technau, A. von Haeseler, B. Hobmayer, M. Q. Martindale and T. W. Holstein. "Unexpected Complexity of the Wnt Gene Family in a Sea Anemone." Nature 433, no. 7022 (2005): 156-60.

Leimena, M. M., J. Ramiro-Garcia, M. Davids, B. van den Bogert, H. Smidt, E. J. Smid, J. Boekhorst, E. G. Zoetendal, P. J. Schaap and M. Kleerebezem. "A Comprehensive Metatranscriptome Analysis Pipeline and Its Validation Using Human Small Intestine Microbiota Datasets." BMC Genomics 14, (2013): 530.

Ley, R. E., F. Backhed, P. Turnbaugh, C. A. Lozupone, R. D. Knight and J. I. Gordon. "Obesity Alters Gut Microbial Ecology." Proc Natl Acad Sci U S A 102, no. 31 (2005): 11070-5.

Ley, R. E., M. Hamady, C. Lozupone, P. J. Turnbaugh, R. R. Ramey, J. S. Bircher, M. L. Schlegel, T. A. Tucker, M. D. Schrenzel, R. Knight and J. I. Gordon. "Evolution of Mammals and Their Gut Microbes." Science 320, no. 5883 (2008): 1647-51.

Ley, R. E., P. J. Turnbaugh, S. Klein and J. I. Gordon. "Microbial Ecology: Human Gut Microbes Associated with Obesity." Nature 444, no. 7122 (2006): 1022-3.

Mahowald, M. A., F. E. Rey, H. Seedorf, P. J. Turnbaugh, R. S. Fulton, A. Wollam, N. Shah, C. Wang, V. Magrini, R. K. Wilson, B. L. Cantarel, P. M. Coutinho, B. Henrissat, L. W. Crock, A. Russell, N. C. Verberkmoes, R. L. Hettich and J. I. Gordon. "Characterizing a Model Human Gut Microbiota Composed of Members of Its Two Dominant Bacterial Phyla." Proc Natl Acad Sci U S A 106, no. 14 (2009): 5859-64.

Mandel, M. J., M. S. Wollenberg, E. V. Stabb, K. L. Visick and E. G. Ruby. "A Single Regulatory Gene Is Sufficient to Alter Bacterial Host Range." Nature 458, no. 7235 (2009): 215-8.

Manning, V. A., I. Pandelova, B. Dhillon, L. J. Wilhelm, S. B. Goodwin, A. M. Berlin, M. Figueroa, M. Freitag, J. K. Hane, B. Henrissat, W. H. Holman, C. D. Kodira, J. Martin, R. P. Oliver, B. Robbertse, W. Schackwitz, D. C. Schwartz, J. W. Spatafora, B. G. Turgeon, C. Yandava, S. Young, S. Zhou, Q. Zeng, I. V. Grigoriev, L. J. Ma and L. M. Ciuffetti. "Comparative Genomics of a Plant-Pathogenic Fungus, Pyrenophora Tritici-Repentis, Reveals Transduplication and the Impact of Repeat Elements on Pathogenicity and Population Divergence." G3 (Bethesda) 3, no. 1 (2013): 41-63.

Meyer, F., D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening and R. A. Edwards. "The Metagenomics Rast Server - a Public Resource for the Automatic Phylogenetic and Functional Analysis of Metagenomes." BMC Bioinformatics 9, (2008): 386.

Miller, D. J., E. E. Ball and U. Technau. "Cnidarians and Ancestral Genetic Complexity in the Animal Kingdom." Trends Genet 21, no. 10 (2005): 536-9.

Mira, A., H. Ochman and N. A. Moran. "Deletional Bias and the Evolution of Bacterial Genomes." Trends Genet 17, no. 10 (2001): 589-96.

Moran, M. A., B. Satinsky, S. M. Gifford, H. Luo, A. Rivers, L. K. Chan, J. Meng, B. P. Durham, C. Shen, V. A. Varaljay, C. B. Smith, P. L. Yager and B. M. Hopkinson. "Sizing up Metatranscriptomics." ISME J 7, no. 2 (2013): 237-43.

Moran, N. A. "Symbiosis as an Adaptive Process and Source of Phenotypic Complexity." Proc Natl Acad Sci U S A 104 Suppl 1, (2007): 8627-33.

Nyholm, S. V. and J. Graf. "Knowing Your Friends: Invertebrate Innate Immunity Fosters Beneficial Bacterial Symbioses." Nat Rev Microbiol 10, no. 12 (2012): 815-27.

O'Brien, H. E., Y. Gong, P. Fung, P. W. Wang and D. S. Guttman. "Use of Low-Coverage, Large-Insert, Short-Read Data for Rapid and Accurate Generation of Enhanced-Quality Draft Pseudomonas Genome Sequences." PLoS One 6, no. 11 (2011): e27199.

O'Hara, A. M. and F. Shanahan. "The Gut Flora as a Forgotten Organ." EMBO Rep 7, no. 7 (2006): 688-93.

Ochman, H., M. Worobey, C. H. Kuo, J. B. Ndjango, M. Peeters, B. H. Hahn and P. Hugenholtz. "Evolutionary Relationships of Wild Hominids Recapitulated by Gut Microbial Communities." PLoS Biol 8, no. 11 (2010): e1000546.

Penn, K., J. Wang, S. C. Fernando and J. R. Thompson. (Submitted). "Secondary Metabolite Gene Expression and Interplay of Bacterial Functions in a Freshwater Cyanobacterial Bloom."

Poretsky, R. S., S. Gifford, J. Rinta-Kanto, M. Vila-Costa and M. A. Moran. "Analyzing Gene Expression from Marine Microbial Communities Using Environmental Transcriptomics." J Vis Exp, no. 24 (2009).

Powell, S., D. Szklarczyk, K. Trachana, A. Roth, M. Kuhn, J. Muller, R. Arnold, T. Rattei, I. Letunic, T. Doerks, L. J. Jensen, C. von Mering and P. Bork. "Eggnog V3.0: Orthologous Groups Covering 1133 Organisms at 41 Different Taxonomic Ranges." Nucleic Acids Res 40, no. Database issue (2012): D284-9.

Putnam, N. H., M. Srivastava, U. Hellsten, B. Dirks, J. Chapman, A. Salamov, A. Terry, H. Shapiro, E. Lindquist, V. V. Kapitonov, J. Jurka, G. Genikhovich, I. V. Grigoriev, S. M. Lucas, R. E. Steele, J. R. Finnerty, U. Technau, M. Q. Martindale and D. S. Rokhsar. "Sea Anemone Genome Reveals Ancestral Eumetazoan Gene Repertoire and Genomic Organization." Science 317, no. 5834 (2007): 86-94.

Quast, C., E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies and F. O. Glockner. "The Silva Ribosomal Rna Gene Database Project: Improved Data Processing and Web-Based Tools." Nucleic Acids Res 41, no. Database issue (2013): D590-6.

Radax, R., T. Rattei, A. Lanzen, C. Bayer, H. T. Rapp, T. Urich and C. Schleper. "Metatranscriptomics of the Marine Sponge Geodia Barretti: Tackling Phylogeny and Function of Its Microbial Community." Environ Microbiol 14, no. 5 (2012): 1308-24.

Rader, B. A. and S. V. Nyholm. "Host/Microbe Interactions Revealed through "Omics" in the Symbiosis between the Hawaiian Bobtail Squid Euprymna Scolopes and the Bioluminescent Bacterium Vibrio Fischeri." Biol Bull 223, no. 1 (2012): 103-11.

Rawls, J. F., M. A. Mahowald, R. E. Ley and J. I. Gordon. "Reciprocal Gut Microbiota Transplants from Zebrafish and Mice to Germ-Free Recipients Reveal Host Habitat Selection." Cell 127, no. 2 (2006): 423-33.

Rawls, J. F., B. S. Samuel and J. I. Gordon. "Gnotobiotic Zebrafish Reveal Evolutionarily Conserved Responses to the Gut Microbiota." Proc Natl Acad Sci U S A 101, no. 13 (2004): 4596-601.

Reitzel, A. M., J. F. Ryan and A. M. Tarrant. "Establishing a Model Organism: A Report from the First Annual Nematostella Meeting." Bioessays 34, no. 2 (2012): 158-61.

Renfer, E., A. Amon-Hassenzahl, P. R. Steinmetz and U. Technau. "A Muscle-Specific Transgenic Reporter Line of the Sea Anemone, Nematostella Vectensis." Proc Natl Acad Sci U S A 107, no. 1 (2010): 104-8.

Reshef, L., O. Koren, Y. Loya, I. Zilber-Rosenberg and E. Rosenberg. "The Coral Probiotic Hypothesis." Environ Microbiol 8, no. 12 (2006): 2068-73.

Richter, M. and R. Rossello-Mora. "Shifting the Genomic Gold Standard for the Prokaryotic Species Definition." Proc Natl Acad Sci U S A 106 (2009): 19126-31.

Rodrigue, S., A. C. Materna, S. C. Timberlake, M. C. Blackburn, R. R. Malmstrom, E. J. Alm and S. W. Chisholm. "Unlocking Short Read Sequencing for Metagenomics." PLoS One 5, no. 7 (2010): e11840.

Ruby, E. G. "Symbiotic Conversations Are Revealed under Genetic Interrogation." Nat Rev Microbiol 6, no. 10 (2008): 752-62.

Ryu, J. H., S. H. Kim, H. Y. Lee, J. Y. Bai, Y. D. Nam, J. W. Bae, D. G. Lee, S. C. Shin, E. M. Ha and W. J. Lee. "Innate Immune Homeostasis by the Homeobox Gene Caudal and Commensal-Gut Mutualism in Drosophila." Science 319, no. 5864 (2008): 777-82.

Sanders, J. G., R. A. Beinart, F. J. Stewart, E. F. Delong and P. R. Girguis. "Metatranscriptomics Reveal Differences in in Situ Energy and Nitrogen Metabolism among Hydrothermal Vent Snail Symbionts." ISME J 7, no. 8 (2013): 1556-67.

Shapiro, B. J. and E. Alm. "The Slow:Fast Substitution Ratio Reveals Changing Patterns of Natural Selection in Gamma-Proteobacterial Genomes." ISME J 3, no. 10 (2009): 1180-92.

Shapiro, B. J., L. A. David, J. Friedman and E. J. Alm. "Looking for Darwin's Footprints in the Microbial World." Trends Microbiol 17, no. 5 (2009): 196-204.

Shi, Y., G. W. Tyson and E. F. DeLong. "Metatranscriptomics Reveals Unique Microbial Small Rnas in the Ocean's Water Column." Nature 459, no. 7244 (2009): 266-9.

Singh, Y., J. Ahmad, J. Musarrat, N. Z. Ehtesham and S. E. Hasnain. "Emerging Importance of Holobionts in Evolution and in Probiotics." Gut Pathog 5, no. 1 (2013): 12.

Starcevic, A., S. Akthar, W. C. Dunlap, J. M. Shick, D. Hranueli, J. Cullum and P. F. Long. "Enzymes of the Shikimic Acid Pathway Encoded in the Genome of a Basal Metazoan, Nematostella Vectensis, Have Microbial Origins." Proc Natl Acad Sci U S A 105, no. 7 (2008): 2533-7.

Stefanik, D. J., L. E. Friedman and J. R. Finnerty. "Collecting, Rearing, Spawning and Inducing Regeneration of the Starlet Sea Anemone, Nematostella Vectensis." Nat Protoc 8, no. 5 (2013): 916-23.

Stewart, F. J., E. A. Ottesen and E. F. DeLong. "Development and Quantitative Analyses of a Universal Rrna-Subtraction Protocol for Microbial Metatranscriptomics." ISME J 4, no. 7 (2010): 896-907.

Timberlake, S. C., S. C. Fernando, K. Penn, F. L. Thompson, and J. R. Thompson. (In Preparation). "Holobiont metatranscriptomics in Brazilian reef-building corals (gen. Mussismilia): Unraveling the functional dynamics of Coral host, Symbiodinium, and Microbiota during health and disease."

Uchino, Y., A. Hirata, A. Yokota and J. Sugiyama. "Reclassification of Marine Agrobacterium Species: Proposals of Stappia Stellulata Gen. Nov., Comb. Nov., Stappia Aggregata Sp. Nov., Nom. Rev., Ruegeria Atlantica Gen. Nov., Comb. Nov., Ruegeria Gelatinovora Comb. Nov., Ruegeria Algicola Comb. Nov., and Ahrensia Kieliense Gen. Nov., Sp. Nov., Nom. Rev." J Gen Appl Microbiol 44, no. 3 (1998): 201-210.

Vega Thurber, R. L., K. L. Barott, D. Hall, H. Liu, B. Rodriguez-Mueller, C. Desnues, R. A. Edwards, M. Haynes, F. E. Angly, L. Wegley and F. L. Rohwer. "Metagenomic Analysis Indicates That Stressors Induce Production of Herpes-Like Viruses in the Coral Porites Compressa." Proc Natl Acad Sci U S A 105, no. 47 (2008): 18413-8.

Vega Thurber, R., D. Willner-Hall, B. Rodriguez-Mueller, C. Desnues, R. A. Edwards, F. Angly, E. Dinsdale, L. Kelly and F. Rohwer. "Metagenomic Analysis of Stressed Coral Holobionts." Environ Microbiol 11, no. 8 (2009): 2148-63.

Warnecke, F., P. Luginbuhl, N. Ivanova, M. Ghassemian, T. H. Richardson, J. T. Stege, M. Cayouette, A. C. McHardy, G. Djordjevic, N. Aboushadi, R. Sorek, S. G. Tringe, M. Podar, H. G. Martin, V. Kunin, D. Dalevi, J. Madejska, E. Kirton, D. Platt, E. Szeto, A. Salamov, K. Barry, N. Mikhailova, N. C. Kyrpides, E. G. Matson, E. A. Ottesen, X. Zhang, M. Hernandez, C. Murillo, L. G. Acosta, I. Rigoutsos, G. Tamayo, B. D. Green, C. Chang, E. M. Rubin, E. J. Mathur, D. E. Robertson, P. Hugenholtz and J. R. Leadbetter. "Metagenomic and Functional Analysis of Hindgut Microbiota of a Wood-Feeding Higher Termite." Nature 450, no. 7169 (2007): 560-5.

Whitaker, R. J. and J. F. Banfield. "Population Genomics in Natural Microbial Communities." Trends Ecol Evol 21, no. 9 (2006): 508-16.

Wilmes, P., S. L. Simmons, V. J. Denef and J. F. Banfield. "The Dynamic Genetic Repertoire of Microbial Communities." FEMS Microbiol Rev 33, no. 1 (2009): 109-32.

Xie, W., Q. S. Meng, Q. J. Wu, S. L. Wang, X. Yang, N. N. Yang, R. M. Li, X. G. Jiao, H. P. Pan, B. M. Liu, Q. Su, B. Y. Xu, S. N. Hu, X. G. Zhou and Y. J. Zhang. "Pyrosequencing the Bemisia Tabaci Transcriptome Reveals a Highly Diverse Bacterial Community and a Robust System for Insecticide Resistance." PLoS One 7, no. 4 (2012): e35181.

Xiong, X., D. N. Frank, C. E. Robertson, S. S. Hung, J. Markle, A. J. Canty, K. D. McCoy, A. J. Macpherson, P. Poussier, J. S. Danska and J. Parkinson. "Generation and Analysis of a Mouse Intestinal Metatranscriptome through Illumina Based Rna-Sequencing." PLoS One 7, no. 4 (2012): e36009.

Xu, J., M. K. Bjursell, J. Himrod, S. Deng, L. K. Carmichael, H. C. Chiang, L. V. Hooper and J. I. Gordon. "A Genomic View of the Human-Bacteroides Thetaiotaomicron Symbiosis." Science 299, no. 5615 (2003): 2074-6.

Zakham, F., O. Aouane, D. Ussery, A. Benjouad and M. M. Ennaji. "Computational Genomics-Proteomics and Phylogeny Analysis of Twenty One Mycobacterial Genomes (Tuberculosis & Non Tuberculosis Strains)." Microb Inform Exp 2, no. 1 (2012): 7.

Zerbino, D. R. and E. Birney. "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs." Genome Res 18, no. 5 (2008): 821-9.

Zhao, Z., H. Liu, C. Wang and J. R. Xu. "Comparative Analysis of Fungal Genomes Reveals Different Plant Cell Wall Degrading Capacity in Fungi." BMC Genomics 14, (2013): 274.

Appendix I: Comparison of N. vectensis isolate genomes with the closest genome-sequenced type strains

[Figure 1: three circular genome-comparison diagrams, one per population. Subject genomes: Pseudomonas oleovorans 47_CLC (4455 orthologs), Agrobacterium tumefaciens D5 (5407 orthologs) and Limnobacter thiooxidans strain F1 (3850 orthologs).]

Compared strains, outer to inner ring; (T) marks a type strain:
Pseudomonas oleovorans 47_CLC: Pseudomonas oleovorans B4_CLC; Pseudomonas oleovorans GabCLC; Pseudomonas oleovorans IsCLC; (T) Pseudomonas mendocina str. ymp
Agrobacterium tumefaciens D5: Agrobacterium tumefaciens D8; Agrobacterium tumefaciens IsCLC; Stappia stellulata FlCLC; (T) Agrobacterium tumefaciens C58
Limnobacter thiooxidans strain F1: Limnobacter thiooxidans FCMA; (T) Burkholderia cenocepacia AU 1054; (T) Ralstonia solanacearum GMI1000

Color scale: percent protein sequence identity, from 100 down to 10, shown separately for bidirectional and unidirectional best hits.

Figure 1: Comparison of closest genome-sequenced type strains (T) to holobiont-isolates.

Assemblies of the N. vectensis isolate genomes were imported into RAST and annotated with its built-in ORF finder (Aziz et al. 2008). Nucleotide sequences of shared genes (determined by bidirectional and unidirectional BLAST analysis) were compared between the isolates of each population and the closest available type strain. In the above figure, one isolate from each population serves as the subject and is named at the top of its circle diagram (Po_47, At_D5 and Lt_F1). Each ring represents the average nucleotide identities of the shared genes, with the color scale indicating the range of this value. The closest sequenced type strain, shown as the innermost ring of each diagram, has visibly lower average nucleotide identity to the subject than the other isolates of the same population do, which illustrates how closely related the isolates within each population are.
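The distinction between bidirectional and unidirectional best hits can be illustrated with a minimal sketch. It is not RAST's implementation; it assumes BLAST results are already available as tuples of (query gene, subject gene, percent identity, bit score), and every function name and gene name in it is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One BLAST hit: (query gene, subject gene, percent identity, bit score).
Hit = Tuple[str, str, float, float]
# One reported pair: (gene in genome A, gene in genome B, percent identity).
Pair = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """For every query gene keep the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for query, subject, pident, bits in hits:
        # Ties keep the hit that was seen first.
        if query not in best or bits > best[query][2]:
            best[query] = (subject, pident, bits)
    return {q: (s, p) for q, (s, p, _) in best.items()}


def classify_hits(
    a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]
) -> Tuple[List[Pair], List[Pair]]:
    """Split the best hits of genome A into bidirectional and unidirectional."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    bidirectional: List[Pair] = []
    unidirectional: List[Pair] = []
    for gene_a, (gene_b, pident) in sorted(forward.items()):
        # Bidirectional: the best hit of gene_b points back to gene_a.
        if gene_b in reverse and reverse[gene_b][0] == gene_a:
            bidirectional.append((gene_a, gene_b, pident))
        else:
            unidirectional.append((gene_a, gene_b, pident))
    return bidirectional, unidirectional


# Toy data with made-up gene names (not taken from the thesis).
a_vs_b = [
    ("a1", "b1", 99.5, 500.0),
    ("a1", "b2", 80.0, 200.0),
    ("a2", "b2", 90.0, 300.0),
    ("a3", "b1", 70.0, 100.0),
]
b_vs_a = [
    ("b1", "a1", 99.5, 500.0),
    ("b2", "a1", 80.0, 210.0),
    ("b2", "a2", 90.0, 300.0),
]
bidirectional, unidirectional = classify_hits(a_vs_b, b_vs_a)
print(bidirectional)
print(unidirectional)
```

In this toy case a3 finds b1 as its best hit, but b1 points back to a1, so the a3 pair is reported as unidirectional only.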

Appendix II: List of Holobiont-Specific Orthologous Groups

All designated as COG/NOG: COG category: Functional Description

Pseudomonas - Present in All 4 Strains
NOG09865   NA  Protein involved in cGMP biosynthetic process
NOG81651   R   Alpha/Beta protein

Pseudomonas - Present in 3 Strains
NOG127992  U   Efflux transporter, RND family, MFP subunit
COG3305    S   Predicted membrane protein
COG01658   R   Alpha/beta superfamily hydrolase
NOG238032  R   Antibiotic biosynthesis monooxygenase

Pseudomonas - Present in 2 Strains
NOG43354   S   Accessory processing protein
NOG82489   NA  Nuclear transport factor 2
NOG150660  L   Protein involved in plasmid maintenance
COG1223    R   Predicted ATPase (AAA+ superfamily)
COG5331    S   Uncharacterized protein conserved in bacteria
COG3623    G   Putative L-xylulose-5-phosphate 3-epimerase

Pseudomonas - Present in 1 Strain
NOG76531   R   Phospholipase D/transphosphatidylase
NOG67790   L   Integrase
COG3723    L   Recombinational DNA repair protein (RecE pathway)
NOG138108  P   Small multidrug resistance protein
NOG260384  S   Ring-Infected erythrocyte surface antigen protein
COG5588    S   Uncharacterized conserved protein
COG01389   M   Glycosyltransferase
NOG236196  NA  Photosystem I reaction center subunit VIII
NOG42892   R   Plasmid stabilization system protein
COG5322    R   Predicted dehydrogenase
COG5256    J   Translation elongation factor EF-1alpha (GTPase)
COG5486    S   Predicted metal-binding integral membrane protein
COG4231    C   Indolepyruvate ferredoxin oxidoreductase, alpha and beta subunits
NOG68061   R   Unsaturated glucuronyl hydrolase
NOG136402  NA  Major facilitator protein
NOG80712   R   Methyltransferase
COG0615    I   Cytidylyltransferase
COG1887    M   Putative glycosyl/glycerophosphate transferases
COG4922    S   Uncharacterized protein conserved in bacteria
NOG72023   R   Radical S-adenosyl methionine domain containing protein

Agrobacterium - Present in All 3 Strains
NOG41617   S   Helix-Turn-Helix protein, CopG family
COG5283    S   Phage-related tail protein
NOG45800   P   Mercuric transport protein
COG1021    Q   Peptide arylation enzymes
COG1169    Q   Isochorismate synthase

Agrobacterium - Present in 2 Strains
COG00181   E   ABC-type dipeptide/oligopeptide/nickel transport system, ATPase component
NOG115657  S   Prevent-Host-Death protein
NOG150960  S   Endonuclease
NOG27742   R   Udp-N-Acetylmuramoylalanyl-D-Glutamyl-2,6-Diaminopimelate-D-Alanyl-D-Alanine ligase
NOG145827  L   DNA methylase N-4/N-6 domain-containing protein
NOG86690   M   Cdp-Glycerol glycerophosphotransferase
NOG127640  NA  Bifunctional DNA primase/polymerase
NOG81601   P   Mercuric transport protein periplasmic component
COG3498    R   Phage tail tube protein FII
NOG47648   R   Gcn5-Related N-acetyltransferase

Agrobacterium - Present in 1 Strain
NOG241729  O   Tripartite motif-containing 63 protein
NOG47954   L   Crispr-Associated protein, VVA1548 family
NOG09685   R   Nucleotidyltransferase substrate binding protein
NOG68654   NA  Diacylglycerol kinase, catalytic region
COG1337    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
COG1336    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
NOG236395  S   Parallel beta-helix repeats
NOG38892   R   DNA polymerase beta domain protein region
NOG39129   R   N-Acetyltransferase
NOG79506   S   DNA primase
COG1367    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
NOG145870  NA  Helicase-Like
COG1353    R   Predicted hydrolase of the HD superfamily (permuted catalytic motifs)
NOG44923   L   Crispr-Associated protein, NE0113 family
COG2859    S   Uncharacterized protein conserved in bacteria

Limnobacter - Present in Both Strains
COG01467   V   ABC-type multidrug transport system, permease component
COG4270    S   Predicted membrane protein
NOG124444  R   Gcn5-Related N-acetyltransferase

Limnobacter - Present in 1 Strain
NOG241729  O   Tripartite motif-containing 63 protein
NOG47954   L   Crispr-Associated protein, VVA1548 family
NOG09685   R   Nucleotidyltransferase substrate binding protein
NOG68654   NA  Diacylglycerol kinase, catalytic region
COG1337    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
COG1336    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
NOG236395  S   Parallel beta-helix repeats
NOG38892   R   DNA polymerase beta domain protein region
NOG39129   R   N-Acetyltransferase
NOG79506   S   DNA primase
COG1367    L   Uncharacterized protein predicted to be involved in DNA repair (RAMP superfamily)
NOG145870  NA  Helicase-Like
COG1353    R   Predicted hydrolase of the HD superfamily (permuted catalytic motifs)
NOG44923   L   Crispr-Associated protein, NE0113 family
COG2859    S   Uncharacterized protein conserved in bacteria

Stappia - Present in 1 Strain
NOG131267  K   Transcriptional regulator protein
NOG125598  NA  Chorismate mutase
COG00976   O   Protein-L-isoaspartate carboxylmethyltransferase
COG01364   C   Radical SAM superfamily enzyme
NOG127558  NA  Major facilitator superfamily MFS_1 protein
NOG05352   NA  Udp-N-Acetylglucosamine-Lysosomal-Enzyme N-acetylglucosaminephosphotransferase
COG2364    S   Predicted membrane protein
COG04817   M   CMP-N-acetylneuraminic acid synthetase
COG08079   R   AAA+ ATPase domain containing protein
NOG255023  NA  Plays a central role in 2-thiolation of mcm(5)SU at tRNA wobble positions of tRNA, tRNA and tRNA. May act by forming a heterodimer with protei
COG00516   R   Predicted flavin-nucleotide-binding protein
NOG139698  S   TRAP-T family transporter, DctQ (4 TMs) subunit
NOG123944  K   Helix-Turn-Helix domain protein
NOG77058   NA  3-Hydroxyanthranilate 3,4-dioxygenase
COG04213   I   Myo-inositol-1-phosphate synthase
COG01054   C   Pyruvate/2-oxoglutarate dehydrogenase complex, dehydrogenase (E1) component, eukaryotic type, alpha subunit
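The grouping used in this appendix (orthologous groups found in holobiont isolates but absent from related reference genomes, binned by how many isolates carry them) can be expressed as simple set operations. The sketch below is only an illustration under that reading of the definition; the function name, genome names, and group labels are hypothetical and do not reproduce the thesis pipeline.

```python
from collections import Counter
from typing import Dict, Set


def holobiont_specific_groups(
    isolates: Dict[str, Set[str]],
    references: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Return {group: number of isolates carrying it} for groups that occur
    in at least one isolate and in none of the reference genomes."""
    # Every group seen in any reference genome is excluded.
    in_references: Set[str] = set().union(*references.values())
    # Count in how many isolate genomes each group occurs.
    counts = Counter(group for groups in isolates.values() for group in groups)
    return {g: n for g, n in counts.items() if g not in in_references}


# Toy input with made-up genome and group labels (not thesis data).
isolates = {
    "isolate_1": {"OG_A", "OG_B", "OG_C"},
    "isolate_2": {"OG_A", "OG_B"},
    "isolate_3": {"OG_A", "OG_D"},
}
references = {
    "reference_1": {"OG_A", "OG_X"},
    "reference_2": {"OG_D"},
}
specific = holobiont_specific_groups(isolates, references)
for group, n_isolates in sorted(specific.items()):
    print(group, "present in", n_isolates, "isolate(s)")
```

Sorting the result by the isolate count would give the same "present in N strains" bins that the table uses.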

Appendix III - List of scaffolds from the JGI assembly of Nematostella vectensis that are likely bacterial contaminants based on BLASTn alignments to the 10 sequenced N. vectensis-associated isolates

jgi|Nemve1|156831|e_gw.12797.7.1
jgi|Nemve1|156351|e_gw.10270.1.1
jgi|Nemve1|144785|e_gw.1158.3.1
jgi|Nemve1|3316|gw.7009.1.1
jgi|Nemve1|144756|e_gw.1152.1.1
jgi|Nemve1|148716|e_gw.2574.7.1
jgi|Nemve1|60966|gw.2574.4.1
jgi|Nemve1|148718|e_gw.2574.12.1
jgi|Nemve1|62952|gw.1345.3.1
jgi|Nemve1|68884|gw.9505.6.1
jgi|Nemve1|68384|gw.9505.4.1
jgi|Nemve1|156025|e_gw.9580.4.1
jgi|Nemve1|156293|e_gw.10013.4.1
jgi|Nemve1|9216|gw.10013.1.1
jgi|Nemve1|157196|e_gw.14966.6.1
jgi|Nemve1|145135|e_gw.1325.6.1
jgi|Nemve1|152225|e_gw.4762.1.1
jgi|Nemve1|61676|gw.3793.4.1
jgi|Nemve1|149172|e_gw.2834.1.1
jgi|Nemve1|145019|e_gw.1269.1.1
jgi|Nemve1|46382|gw.9892.1.1
jgi|Nemve1|144255|e_gw.981.2.1
jgi|Nemve1|155448|e_gw.8594.3.1
jgi|Nemve1|68429|gw.12503.2.1
jgi|Nemve1|153358|e_gw.6033.2.1
jgi|Nemve1|144964|e_gw.1244.3.1
jgi|Nemve1|68529|gw.1228.1.1
jgi|Nemve1|157171|e_gw.14678.2.1
jgi|Nemve1|157272|e_gw.15471.3.1
jgi|Nemve1|73238|gw.4271.1.1
jgi|Nemve1|78763|gw.18200.1.1
jgi|Nemve1|62331|gw.863.2.1
jgi|Nemve1|144785|e_gw.1158.3.1
jgi|Nemve1|3316|gw.7009.1.1
jgi|Nemve1|144756|e_gw.1152.1.1
jgi|Nemve1|148716|e_gw.2574.7.1
jgi|Nemve1|60966|gw.2574.4.1
jgi|Nemve1|148720|e_gw.2574.3.1
jgi|Nemve1|148718|e_gw.2574.12.1
jgi|Nemve1|68384|gw.9505.4.1
jgi|Nemve1|156025|e_gw.9580.4.1
jgi|Nemve1|9216|gw.10013.1.1
jgi|Nemve1|157196|e_gw.14966.6.1
jgi|Nemve1|157428|e_gw.16429.2.1
jgi|Nemve1|152225|e_gw.4762.1.1
jgi|Nemve1|61676|gw.3793.4.1
jgi|Nemve1|149172|e_gw.2834.1.1
jgi|Nemve1|46382|gw.9892.1.1
jgi|Nemve1|144255|e_gw.981.2.1
jgi|Nemve1|68429|gw.12503.2.1
jgi|Nemve1|156988|e_gw.13707.1.1
jgi|Nemve1|153358|e_gw.6033.2.1
jgi|Nemve1|144964|e_gw.1244.3.1
jgi|Nemve1|68529|gw.1228.1.1
jgi|Nemve1|157171|e_gw.14678.2.1
jgi|Nemve1|157272|e_gw.15471.3.1
jgi|Nemve1|73238|gw.4271.1.1
jgi|Nemve1|78763|gw.18200.1.1
jgi|Nemve1|62331|gw.863.2.1
jgi|Nemve1|144785|e_gw.1158.3.1
jgi|Nemve1|3316|gw.7009.1.1
jgi|Nemve1|144756|e_gw.1152.1.1
jgi|Nemve1|148716|e_gw.2574.7.1
jgi|Nemve1|60966|gw.2574.4.1
jgi|Nemve1|148720|e_gw.2574.3.1
jgi|Nemve1|148718|e_gw.2574.12.1
jgi|Nemve1|68384|gw.9505.4.1
jgi|Nemve1|156293|e_gw.10013.4.1
jgi|Nemve1|9216|gw.10013.1.1
jgi|Nemve1|145135|e_gw.1325.6.1
jgi|Nemve1|157428|e_gw.16429.2.1
jgi|Nemve1|152225|e_gw.4762.1.1
jgi|Nemve1|61676|gw.3793.4.1
jgi|Nemve1|149172|e_gw.2834.1.1
jgi|Nemve1|46382|gw.9892.1.1
jgi|Nemve1|144255|e_gw.981.2.1
jgi|Nemve1|155448|e_gw.8594.3.1
jgi|Nemve1|68429|gw.12503.2.1
jgi|Nemve1|156988|e_gw.13707.1.1
jgi|Nemve1|153358|e_gw.6033.2.1
jgi|Nemve1|68529|gw.1228.1.1
jgi|Nemve1|157171|e_gw.14678.2.1
jgi|Nemve1|73238|gw.4271.1.1
jgi|Nemve1|78763|gw.18200.1.1
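As a rough illustration of how such a list could be derived from BLASTn output, the sketch below keeps every host sequence with at least one sufficiently long, high-identity alignment to an isolate genome. The identity and length cutoffs are placeholders chosen for the example, not the criteria used for this appendix, and the function and sequence names are hypothetical.

```python
from typing import Iterable, List, Tuple

# One BLASTn alignment:
# (host sequence id, isolate contig id, percent identity, alignment length).
BlastRow = Tuple[str, str, float, int]


def flag_likely_contaminants(
    rows: Iterable[BlastRow],
    min_identity: float = 95.0,  # placeholder cutoff, an assumption
    min_length: int = 200,       # placeholder cutoff, an assumption
) -> List[str]:
    """Return host sequence ids with at least one alignment that passes
    both cutoffs, in order of first appearance and without repeats."""
    flagged: List[str] = []
    seen = set()
    for host_id, _isolate_contig, pident, length in rows:
        passes = pident >= min_identity and length >= min_length
        if passes and host_id not in seen:
            seen.add(host_id)
            flagged.append(host_id)
    return flagged


# Toy alignments with made-up ids (not thesis data).
rows = [
    ("host_seq_1", "isolate_contig_4", 99.1, 850),
    ("host_seq_2", "isolate_contig_1", 88.0, 900),   # identity too low
    ("host_seq_3", "isolate_contig_2", 98.5, 120),   # alignment too short
    ("host_seq_1", "isolate_contig_9", 97.0, 400),   # already flagged
    ("host_seq_4", "isolate_contig_2", 96.2, 300),
]
likely_contaminants = flag_likely_contaminants(rows)
print(likely_contaminants)
```

Stricter or looser cutoffs change which sequences are flagged, so any real use would need thresholds justified against the data.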
