Genome evolution of Symbiodiniaceae

Raúl Augusto González Pech

MSc in Evolution, Ecology and Systematics

BSc in Biology

A thesis submitted for the degree of Doctor of Philosophy at

The University of Queensland in 2020

Institute for Molecular Bioscience

Abstract

Symbiotic interactions between dinoflagellates (Symbiodiniaceae) and corals give rise to the ecological complexity and biodiversity of reef ecosystems. Comparative genomic studies can aid in tracing the evolutionary history of these dinoflagellates, and thus elucidate the evolutionary forces that drove their diversification and adaptation as predominantly symbiotic lineages. However, genome data from these ecologically important organisms remain scarce, largely due to their immense genome sizes and idiosyncratic genomic features. The incorporation of genome-scale data from diverse lineages in a comprehensive comparative analysis is essential to better understand the molecular and evolutionary mechanisms that underpin the diversification of Symbiodiniaceae.

In this thesis, I first review and discuss in depth the state of the art of Symbiodiniaceae genomics, highlighting the genetic and ecological diversity of these dinoflagellates. In addition, I present a theoretical framework, based on our current knowledge of intracellular bacterial symbionts and parasites, to approach the study of genome evolution in Symbiodiniaceae along the broad spectrum of symbiotic associations they can establish. I also summarise and explain how common methods in comparative genomics can be implemented to improve our understanding of Symbiodiniaceae evolution.

Using available genome and transcriptome data in a comparative analysis (Chapter 3), I identified gene functions that distinguish Symbiodiniaceae from other dinoflagellates in Order Suessiales, as well as functions specific to the major lineages within the family. These results show that gene functions shared by all lineages in Symbiodiniaceae are relevant to adaptation to the environment, as well as to the establishment and maintenance of symbiosis. I also determined functions specific to each lineage and highlight their potential use in future research to understand niche specialisation. The basal genus Symbiodinium consists of both free-living and symbiotic forms, and most dinoflagellates external to Symbiodiniaceae are free-living. Genome data from Symbiodinium therefore represent a key analysis platform to assess genome features related to the evolutionary transition from a free-living to a symbiotic lifestyle.

Next, I generated and compared high-quality de novo genome assemblies from two Symbiodinium isolates (Chapter 4): the symbiotic Symbiodinium tridacnidorum CCMP2592 and the free-living Symbiodinium natans CCMP2548; these assemblies were generated using both short- and long-read sequence data. My results reveal extensive genome-sequence divergence between these two species, and suggest that increased structural rearrangements in the genome of S. tridacnidorum, characterised as distinct types of gene duplication and transposable elements, contribute to this divergence. The distinguishing genome features of these two isolates are potentially associated with their evolution towards distinct lifestyles. The results also agree with the notion that the symbiotic lifestyle is a derived trait in Symbiodinium, and that the free-living lifestyle is ancestral.

To further assess the divergence within this genus and within family Symbiodiniaceae, I generated de novo genome assemblies from five additional Symbiodinium isolates, encompassing diverse ecological niches. In a comprehensive analysis (Chapter 5) that incorporated all other available genome data of Suessiales (a total of 15 genomes, nine of which are from Symbiodinium), I assessed, for the first time, genome-sequence divergence within Order Suessiales, within Family Symbiodiniaceae, within Genus Symbiodinium, and among isolates of individual species (i.e. Symbiodinium microadriaticum and Symbiodinium tridacnidorum). Whole-genome comparisons reveal extensive sequence divergence, with no sequence regions common to all 15 genomes. Based on similarity of k-mers from whole-genome sequences, the distances among Symbiodinium isolates are similar to those between isolates of distinct genera. Gene functions related to symbiosis and stress response exhibit similar abundance in all analysed genomes. These results suggest that structural rearrangements contribute to genome-sequence divergence in Symbiodiniaceae even within the same species, but that gene functions have remained largely conserved in Suessiales.

This thesis work is the most comprehensive assessment to date of genome evolution of Symbiodiniaceae, and of the basal genus Symbiodinium. The thesis includes, for the first time, comparisons at the intra-generic and intra-specific levels using extensive whole-genome sequence data. Through this thesis research, seven de novo genome assemblies from diverse Symbiodinium isolates, as well as their corresponding transcriptomes and predicted protein-coding genes, were generated. Customised and novel bioinformatic methods were implemented to accommodate the complexity and idiosyncrasy of dinoflagellate genomes. Knowledge generated from this body of research provides novel insights into genome evolution of Symbiodiniaceae linked to their transition to symbiosis, and the molecular mechanisms that underpin the diversification of the family. The data and analytic workflows from this research can be readily applied in comparative genomic studies of other dinoflagellates and microbial eukaryotes.

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, financial support and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my higher degree by research candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis and have sought permission from co-authors for any jointly authored works included in the thesis.

Publications included in this thesis

González-Pech RA, Ragan MA & Chan CX. (2017). Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodinium. Scientific Reports 7(1), 15021. DOI: 10.1038/s41598-017-15029-w

González-Pech RA, Bhattacharya D, Ragan MA & Chan CX. (2019). Genome evolution of reef symbionts as intracellular residents. Trends in Ecology and Evolution 34(9), 799-806. DOI: 10.1016/j.tree.2019.04.010

González-Pech RA, Stephens TG, Chen Y, Mohamed AR, Cheng Y, Burt DW, Bhattacharya D, Ragan MA & Chan CX. (2019). Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium. bioRxiv, 783902. DOI: 10.1101/783902

González-Pech RA, Chen Y, Stephens TG, Shah S, Mohamed AR, Lagorce R, Bhattacharya D, Ragan MA & Chan CX. (2019). Genomes of Symbiodiniaceae reveal extensive sequence divergence but conserved functions at family and genus levels. bioRxiv, 800482. DOI: 10.1101/800482

Submitted manuscripts included in this thesis

No manuscripts submitted for publication.

Other publications during candidature

Peer-reviewed papers and pre-prints:

González-Pech RA, Vargas S, Francis WR & Wörheide G. (2017). Transcriptomic resilience of the Montipora digitata holobiont to low pH. Frontiers in Marine Science 4, 403. DOI: 10.3389/fmars.2017.00403

Voigt O, Erpenbeck D, González-Pech RA, Al-Aidaroos AM, Berumen ML & Wörheide G. (2017). Calcinea of the Red Sea: providing a DNA barcode inventory with description of four new species. Marine Biodiversity, 1-26. DOI: 10.1007/s12526-017-0671-x

Liu H, Stephens TG, González-Pech RA, Beltran VH, Lapeyre B, Bongaerts P, Cooke I, Aranda M, Bourne DG, Forêt S, Miller DJ, van Oppen MJH, Voolstra CR, Ragan MA & Chan CX. (2018). Symbiodinium genomes reveal adaptive evolution of functions related to coral-dinoflagellate symbiosis. Communications Biology 1, 95. DOI: 10.1038/s42003-018-0098-3

González-Pech RA, Stephens TG & Chan CX. (2018). Commonly misunderstood parameters of NCBI BLAST and important considerations for users. Bioinformatics 35(15), 2697-2698. DOI: 10.1093/bioinformatics/bty1018

Stephens TG, González-Pech RA, Cheng Y, Mohamed AR, Bhattacharya D, Ragan MA & Chan CX. (2019). Polarella glacialis genomes encode tandem repeats of single-exon genes with functions critical to adaptation of dinoflagellates. bioRxiv, 704437. DOI: 10.1101/704437

Chen Y, González-Pech RA, Stephens TG, Bhattacharya D & Chan CX. (2020). Evidence that inconsistent gene prediction can mislead analysis of dinoflagellate genomes. Journal of Phycology 56(1), 6-10. DOI: 10.1111/jpy.12947

Conference posters and oral presentations:

González-Pech RA, Ragan MA & Chan CX. (2017). Unveiling signatures of Symbiodinium using functional genomics. Oral presentation. Australian Coral Reef Society 2017 Conference. Townsville, Australia.

González-Pech RA, Bhattacharya D, Ragan MA & Chan CX. (2017). Comparative genomics of Symbiodinium: the evolutionary transition from free-living to symbiosis. Poster presentation. EMBL Australia Postgraduate Symposium 2017. Sydney, Australia.

González-Pech RA, Bhattacharya D, Ragan MA & Chan CX. (2017). Genome comparison of free-living and symbiotic Symbiodinium reveals signatures of evolutionary transition to symbiosis. Oral presentation. European Coral Reef Symposium 2017. Oxford, UK.

González-Pech RA, Vargas S, Francis WR & Wörheide G. (2017). Transcriptomic resilience of a coral holobiont to low pH. Oral presentation. European Coral Reef Symposium 2017. Oxford, UK.

González-Pech RA, Bhattacharya D, Ragan MA & Chan CX. (2018). Genome of a free-living Symbiodinium: evolutionary transition to coral-dinoflagellate symbiosis. Poster presentation. Society for Molecular Biology and Evolution (SMBE) Annual Meeting 2018. Yokohama, Japan.

González-Pech RA. (2018). The evolutionary genomics of Symbiodinium. Oral presentation. Joint Academic Microbiology Seminars (JAMS) Brisbane. Brisbane, Australia.

Contributions by others to the thesis

Dr Cheong Xin Chan and Prof Mark Ragan provided feedback and revised drafts of this thesis. The contribution of co-authors for published work presented as part of this thesis is detailed in the relevant chapters.

Statement of parts of the thesis submitted to qualify for the award of another degree

No works submitted towards another degree have been included in this thesis.

Research involving human or animal subjects

No animal or human subjects were involved in this research.

Acknowledgements

This thesis is not only the result of my work but also of the support, guidance and advice of people around me.

First of all, thanks to my supervisors, Dr Cheong Xin Chan (better known as CX) and Emeritus Professor Mark Ragan, for seeing potential in me to join their team and work on this project. Thanks to both of them also for their guidance, not only in research matters but also regarding my career path. Thanks for being a professional and academic role model and for providing all that was in your hands to boost the development of my career. Thanks to CX, for looking after me for these three years, for opening my eyes to the challenges that a scientific career implies and for teaching me how to overcome those challenges. Thanks to Mark, for inspiring me to address the research questions that my passion and interest dictate; thanks also for all the philosophical conversations and all the lessons about the English language.

Thanks to all members of the past Group Ragan and of the current CX Team. Thanks to Lanna Wong for her help figuring out and going through all the bureaucratic and organisational processes of the University and the Institute. Thanks to Dr Guillaume Bernard, Dr Huanle Liu and MSc Sarah Shah for the heated discussions about science and many other topics. Thanks to Dr Atefeh (Ati) Fard, for being not only a colleague but also a confidant and an advisor. Thanks to the students I had the opportunity to co-supervise, who let me develop a different facet as a scientist, including MSc Clarisse Louvard, BSc Christopher Wrona, Ing Rémi Lagorce, MSc Yibi Chen and MSc Pierre Youssef. Thanks to them I was able to explore topics of interest to me that were outside the scope of this thesis work. A very special acknowledgment to BSc Timothy (Tim) Stephens, my “PhD brother”, with whom I shared most of these years. Tim helped me to set up, run and troubleshoot all types of tools and analyses, and engaged in the interpretation of my results and the formulation of hypotheses, as crazy as they might have been. Tim was not only a complement to my formation as a researcher but also the best companion throughout the whole PhD experience. Thanks to everyone for creating a friendly and supportive environment to do fantastic research!

Dr Amanda Carozzi, Cody Mudgway and Olga Chaourova, members of the Postgraduate Team, were particularly helpful during my time at IMB to go through all the requirements of my Higher Degree by Research (HDR) program. Thanks to the support staff from IMB IT, UQ Research Computing Centre and the National Computational Infrastructure (NCI) National Facility for running and maintaining the computational facilities that most of my project relied upon.

The research presented here was enriched by the contribution of external collaborators. Professor Debashish Bhattacharya is a never-ending source of ideas and scientific questions, some of which have been captured and addressed within this thesis, that I am particularly grateful for. I would also like to acknowledge all the effort put in by Dr Amin Mohamed and Dr Yuanyuan Cheng to generate the data analysed here.

I am also grateful for the many friendships I have made during my time in Brisbane. Thanks to the Colombian crowd that welcomed me very warmly when I first arrived. Thanks to my “PhD friends”, with whom I had lunch and coffee many times on campus and shared the innate struggle of the course, and who were an incredible emotional support. These include Marcos Soto-Pérez, MSc Amu Faiz, Dr Alvin Chandra, Dr Daniel Demant and BSc Rhys White. I thank my friends who are my personal and academic mentors, especially Dr Cecilia González-Tokman, Dr Gonzalo Martínez-Fernández and Dr Angelo Tedoldi. Other close friends helped me to clear my mind and relax in the hardest moments, whether at the gym, playing board games or even watching Game of Thrones; Jeremy Juhas (Oso), Margarita Navarro and Miguel Alvarado deserve my most appreciative acknowledgement.

My partner, Mauricio Guillén-Rodríguez, was by my side almost from the beginning of the PhD. Thanks to him, for being my unconditional support all these years. Thanks for standing strong and being there for me whenever I felt frustrated, stressed or lost. Thanks for cheering me on to work hard and move forward. Thanks for listening patiently to all the details about my research and for the innumerable times you helped me to prepare talks for conferences and seminars. Thanks for understanding and for encouraging me to pursue an academic career.

Finally, I would like to say thanks to my family. Mom and dad, María Isabel and Agustín, are my inspiration. All that I have been able to accomplish is thanks to them. The admiration I feel for my sisters (Natalia and Karla), because of their strength and wisdom, drives my growth as a scientist. I therefore dedicate this work to them.

Financial support

This research was supported by an Australian Government Research Training Program Scholarship, an Australian Research Council grant (DP150101875), and the computational resources of the National Computational Infrastructure (NCI) National Facility systems through the NCI Merit Allocation Scheme (Project d85).

Keywords

Symbiodinium, Symbiodiniaceae, dinoflagellates, free-living, symbiosis, comparative genomics, evolutionary genomics, transcriptome analysis

Australian and New Zealand Standard Research Classifications (ANZSRC)

ANZSRC code: 060408, Genomics, 60%

ANZSRC code: 060701 Phycology (incl. Marine Grasses), 20%

ANZSRC code: 060303 Biological Adaptation, 20%

Fields of Research (FoR) Classification

FoR code: 0604, Genetics, 60%

FoR code: 0603, Evolutionary Biology, 20%

FoR code: 0607, Plant Biology, 20%

Table of contents

Chapter 1. Introduction

1.1. Hypotheses

1.2. Aims

1.3. Scope and limitations

1.4. Significance

1.5. Outline

1.6. References

Chapter 2. Symbiodiniaceae diversity and evolution, and genomics methods for their study

2.1. Family Symbiodiniaceae

2.1.1. Morphology

2.1.2. Ecological niches

2.1.3. Phylogeny and systematics

2.1.4. Evolutionary rates and timescale

2.1.5. Evolution of symbiosis

2.1.6. Genomics

2.1.7. Genome evolution as intracellular residents

2.2. Methods in genomics

2.2.1. Genome assembly

2.2.2. Genome annotation

2.2.3. Comparative genomics

2.3. Concluding remarks

2.4. References

Chapter 3. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodiniaceae

3.1. Abstract

3.2. Introduction

3.3. Results and discussion

3.3.1. Genome and transcriptome data

3.3.2. Delineation of gene families

3.3.3. Functional annotation of gene families

3.3.4. Dynamics of gene families among Symbiodiniaceae clades

3.3.5. What makes Symbiodiniaceae Symbiodiniaceae?

3.3.6. Lineage-specific enrichment of function

3.3.7. Impact of taxon sampling and data amount on gene-family analysis

3.4. Concluding remarks

3.5. Methods

3.5.1. Data collection and preparation

3.5.2. Homolog clusters

3.5.3. Functional analysis of gene families

3.6. References

3.7. Supplementary figures

3.8. Supplementary tables

Chapter 4. Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium

4.1. Abstract

4.2. Introduction

4.3. Results and discussion

4.3.1. Genome sequences and predicted genes of S. tridacnidorum and S. natans

4.3.2. Genomes of S. natans and S. tridacnidorum are highly divergent

4.3.3. Duplication events and transposable elements contribute to the divergence between S. tridacnidorum and S. natans genomes

4.3.4. High divergence among gene copies counteracts gene-family expansion in S. tridacnidorum

4.3.5. Gene functions of S. tridacnidorum and S. natans are relevant to their lifestyle

4.3.6. Are features underpinning genome divergence in Symbiodiniaceae ancestral or derived?

4.4. Concluding remarks

4.5. Methods

4.5.1. Symbiodinium cultures

4.5.2. Nucleic acid extraction

4.5.3. Genome sequence data generation and de novo assembly

4.5.4. Removal of putative microbial contaminants

4.5.5. RNA sequence data generation and transcriptome assembly

4.5.6. Full-length transcript evidence for gene prediction

4.5.7. Genome annotation and gene prediction

4.5.8. Gene-function annotation and enrichment analyses

4.5.9. Comparative genomic analyses

4.6. References

4.7. Supplementary figures

4.8. Supplementary tables

Chapter 5. Genomes of Symbiodiniaceae reveal extensive sequence divergence but conserved functions at family and genus levels

5.1. Abstract

5.2. Introduction

5.3. Results and discussion

5.3.1. Genome sequences of Symbiodiniaceae

5.3.2. Isolates of Symbiodiniaceae and Symbiodinium exhibit extensive genome divergence

5.3.3. Remnants of transposable elements were lost in more-recently diverged lineages of Symbiodiniaceae

5.3.4. Diversity of gene features within Suessiales

5.3.5. Gene families of Symbiodiniaceae

5.3.6. Core genes of Symbiodiniaceae and of Symbiodinium encode similar functions

5.3.7. Functions related to symbiosis and stress response are conserved in Suessiales

5.4. Concluding remarks

5.5. Methods

5.5.1. Symbiodinium cultures

5.5.2. Nucleic acid extraction

5.5.3. Genome sequence data generation and de novo genome assembly

5.5.4. Removal of putative microbial contaminants

5.5.5. Generation and assembly of transcriptome data

5.5.6. Gene prediction and function annotation

5.5.7. Comparison of genome sequences and analysis of conserved synteny

5.5.8. Genic features, gene families and function enrichment

5.6. References

5.7. Supplementary figures

5.8. Supplementary tables

Chapter 6. General discussion, conclusions and outlook

6.1. General discussion

6.1.1. Symbiodiniaceae gene functions are associated with adaptation and diversification

6.1.2. Genome features of Symbiodinium likely relate to transition from free-living to symbiotic

6.1.3. Genome-sequence divergence among Symbiodiniaceae exceeds known genetic diversity

6.2. Concluding remarks and future perspectives

6.3. References

List of figures

Fig. 1.1 Recent systematic revision of the group

Fig. 1.2 Clade A is consistently the most basal lineage in the family

Fig. 2.1 Morphotypes of Symbiodiniaceae

Fig. 2.2 Global distribution of clades in Symbiodiniaceae

Fig. 2.3 Divergence time of Symbiodiniaceae lineages

Fig. 2.4 Key features of nuclear and organellar genomes of dinoflagellates

Fig. 2.5 Expected genome features of Symbiodiniaceae across the spectrum of symbiotic associations

Fig. 2.6 Estimated genome sizes of dinoflagellates

Fig. 2.7 De novo genome assembly

Fig. 2.8 A typical pipeline of de novo genome assembly

Fig. 2.9 Process of a hybrid genome assembly

Fig. 3.1 G+C content and codon usage in coding sequences

Fig. 3.2 Prevalent protein domains in gene families of Symbiodiniaceae

Fig. 3.3 Gene families along the phylogeny of Symbiodiniaceae

Fig. 3.4 Gene families shared among lineages of Symbiodiniaceae

Fig. 3.5 Rarefaction analyses of taxa and gene families

Fig. 4.1 Comparison of S. tridacnidorum and S. natans genomes

Fig. 4.2 Contribution of genomic features to distinct composition of S. tridacnidorum and S. natans genomes

Fig. 4.3 Overrepresented functions in retroposed and RT-genes

Fig. 4.4 Interspersed repeat landscapes of S. tridacnidorum and S. natans

Fig. 4.5 Relative gene-family sizes in S. tridacnidorum and S. natans

Fig. 4.6 Proportion of distinct elements in genomes of S. tridacnidorum, S. natans and P. glacialis

Fig. 5.1 Genome divergence among Symbiodiniaceae

Fig. 5.2 Repeat composition of Suessiales genomes

Fig. 5.3 PCA of gene features in Symbiodiniaceae

Fig. 5.4 Number of gene families along the phylogeny of Symbiodiniaceae

Fig. 5.5 Relative abundance of symbiosis-related functions in genes of Suessiales

Fig. 5.6 Relative abundance of selected functions in genes of Suessiales

Fig. 6.1 Genomics of Symbiodiniaceae in the last decade

List of tables

Table 3-1 Summary of the selected datasets for analysis in the present study

Table 3-2 Gene families in which each lineage is represented

Table 4-1 Statistics of hybrid genome assemblies

Table 4-2 Gene statistics

Table 5-1 Symbiodinium isolates for which genome data were generated and genome assembly statistics

List of abbreviations

A Adenine

A+T Adenine-thymine pairs

Ab initio From the beginning (Latin)

ATP Adenosine triphosphate

BLAST Basic Local Alignment Search Tool

bp Base pair(s)

BUSCO Benchmarking Universal Single-Copy Orthologs

C Cytosine

Ca Calcium

CCS Circular consensus sequences

cDNA Complementary DNA

CDS Coding sequences

CEGMA Core Eukaryotic Genes Mapping Approach

Cl Chlorine

cob Cytochrome b gene

COGs Clusters of orthologous groups

coI Mitochondrially encoded cytochrome c oxidase I gene

cox1 Mitochondrially encoded cytochrome c oxidase I gene

cp23S Chloroplast-encoded 23S ribosomal RNA gene

CpG Regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide

CTAB Cetyl trimethyl ammonium bromide

D Adenine, guanine or thymine

DAPI 4′,6-diamidino-2-phenylindole

De novo From the beginning (Latin)

DinoSL Dinoflagellate spliced leader

DNA Deoxyribonucleic acid

e.g. Exempli gratia (Latin): for example

EDTA Ethylenediaminetetraacetic acid

elf2 E74-like factor 2 gene

et al. Et alii (Latin): and others

Ex hospite Outside the host (Latin)

g Gravity/gravities

G Guanine

G+C Guanine-cytosine pairs

Gbp Gigabase pair

GO Gene Ontology

GTP Guanosine-5'-triphosphate

h Hour(s)

HLPs Histone-like proteins

HMM Hidden Markov Model

HSP Heat shock protein

i.e. Id est (Latin): that is

In hospite Inside the host (Latin)

In situ Locally (Latin)

Inter alia Among others (Latin)

ITS2 Internal transcribed spacer 2 gene

kbp Kilobase pair

KEGG Kyoto Encyclopedia of Genes and Genomes

KO KEGG Orthology

LGT Lateral genetic transfer

LINE Long interspersed nuclear element

LSU Large subunit rRNA gene

LTR Long terminal repeat

m Metre

MAA Mycosporine-like amino acid

MAPQ Mapping quality

Mbp Megabase pair

mg Milligram

Mg Magnesium

min Minute(s)

mL Millilitre

mm Millimetre

mM Millimolar

MMETSP Marine Microbial Transcriptome Sequencing Project

mRNA Messenger RNA

MYA Million years ago

Na Sodium

NJ Neighbour joining

nr28S Nuclear 28S ribosomal RNA gene

PacBio Pacific Biosciences

PCA Principal Component Analysis

PCP Peridinin-chlorophyll a-binding protein

PCR Polymerase chain reaction

PPR Pentatricopeptide repeat

psbA Photosystem II protein D1 precursor gene

rad24 Checkpoint RAD24 protein gene

RAPD Random amplified polymorphic DNA

rDNA Ribosomal DNA

RNA Ribonucleic acid

RNA-Seq RNA-Sequencing

ROS Reactive oxygen species

rRNA Ribosomal RNA

RT Retrotransposition

s Second(s)

sd Standard deviation

Sensu stricto In the narrow sense (Latin)

SINE Short interspersed nuclear element

SL Spliced leader

SMRT Single-Molecule Real-Time

sp. Species (singular)

spp. Species (plural)

T Thymine

TE Transposable element

tRNA Transfer RNA

U Uracil

UV Ultraviolet

v/v Volume by volume

Versus Against (Latin)

Via Through (Latin)

w/v Weight by volume

µE Microeinstein

µg Microgram

µL Microlitre

Chapter 1. Introduction

The ecological complexity and biodiversity of coral reefs arise from the symbiotic interaction between reef-building corals and their associated microalgae (dinoflagellates in the family Symbiodiniaceae)1. The coral hosts offer protection from grazers and inorganic substrates (such as carbon dioxide and ammonium) to the symbiotic algae. In return, the algae provide an organic source of carbon (i.e. translocated photosynthates) to the coral, which is essential for its metabolic requirements, in particular for highly energy-demanding processes like calcification2. A sound understanding of the biology and evolution of these algae is key to determining the current state and predicting the future of reef ecosystems3-5.

In addition to coral symbionts, the family Symbiodiniaceae encompasses symbionts of other marine invertebrates including flatworms, sponges, jellyfishes and clams, and even of microbial eukaryotes such as foraminifera and ciliates6. Despite their importance and remarkable abundance in reef ecosystems, the overall biomass of Symbiodiniaceae is relatively small and they are therefore considered keystone species7. Dinoflagellates in family Symbiodiniaceae also display extensive genetic diversity6. Formerly, the family Symbiodiniaceae was known as the genus Symbiodinium, which was divided into nine major phylogenetic groups (Clades A through I) based on molecular markers8-14. However, the genetic divergence among these lineages can be as large as that found between orders of other dinoflagellates15. This extensive divergence prompted the recent reclassification of the genus as a family, and seven of the clades have been assigned, partially or completely, to distinct genera (Fig. 1.1)14. Genetic variability at the genus (or clade) level is less well known, although functional diversification within a single genus is well acknowledged and often reflected in physiology16-19. There is also evidence that suggests a molecular basis for this intra-genus functional diversity. For example, species within Breviolum (Clade B) with different dominance in coral hosts display distinct gene expression profiles20.

Symbiodiniaceae with lifestyles other than the more-acknowledged mutualistic one have been discovered. The physiology and metabolism of some Symbiodiniaceae, and their impact on the host fitness, can resemble (or transition into) those of parasites rather than of mutualistic symbionts21-26. Some others have been found in environmental samples with no evident association with any host, suggesting a potential free-living habit9,27-30. Whether these lifestyles represent permanent or transient conditions remains to be verified31.

Fig. 1.1 Recent systematic revision of the group

(A) Large subunit ribosomal DNA (LSU rDNA) phylogeny of genera in the Order Suessiales. Symbols next to terminal branches indicate species whose transcriptomes (triangles) or genomes have been sequenced before (dark blue circles) or as part of this project (light blue squares), taxa that form resting cysts (black circles), and taxa other than Symbiodiniaceae (asterisks). Bootstrap values (1000 replicates) and Bayesian posterior probabilities are indicated for each internal branch. Ancestral traits are indicated on the tree: (B) most recent common ancestor possessing an eyespot lens in a “type E” brick-like arrangement, (C) mastigote cells possessing only seven latitudinal series of amphiesmal vesicles, (D) cells having a single pyrenoid surrounded by starch deposits and an internal matrix devoid of thylakoid extensions, and (E) a cell life cycle with a predominant coccoid phase. (F) Genera in Symbiodiniaceae are most typically symbiotic with various metazoan phyla, including: (1) Cnidaria, (2) Mollusca, (3) Porifera, (4) Platyhelminthes, (5) Foraminifera and (6) ciliates. (G) Unrooted LSU rDNA phylogeny based on maximum likelihood of the major groups of dinoflagellates in comparison to Symbiodiniaceae. Modified from LaJeunesse et al. (2018)14.

Interestingly, in phylogenetic trees inferred from standard marker genes, the whole genus Effrenium (Clade E) and monophyletic clades within Symbiodinium (non-temperate Clade A) consist exclusively of such non-symbiotic isolates14,32,33. Clade A is considered the most-ancestral lineage within Symbiodiniaceae in accordance with its consistent appearance as a basal lineage in phylogenetic trees inferred using different molecular markers (Fig. 1.2)34. Genome analysis of the genus Symbiodinium can therefore yield valuable insights regarding the evolution of Symbiodiniaceae and their transition from free-living to symbiotic.

The study of symbiodiniacean evolution is, however, limited by the lack of complete fossil records for dinoflagellates, their complex genetic organisation35,36, and their scant morphological variation10. The selection and development of appropriate molecular markers have led to a well-resolved phylogeny37,38, and reasonable estimates of evolutionary rates and divergence times of the taxon14,39, but the evolutionary processes underlying the diversification of Symbiodiniaceae remain poorly understood. Additionally, the few studies on the evolution of Symbiodiniaceae have mainly focused on the symbiotic (more-recently diverged) lineages39,40. Fundamental questions regarding the evolution of the group, such as the transition from a free-living to a symbiotic lifestyle, are practically unexplored.

Fig. 1.2 Clade A is consistently the most basal lineage in the family. Phylogenetic reconstructions of Symbiodiniaceae using different genetic markers. Each clade is highlighted in a different colour to denote its monophyly. Note the basal position of Clade A (in red) in all reconstructions. From Pochon et al. (2012)34.

Dinoflagellate genomes are known to be large (3–250 Gbp) and complex41. Until recently, sequencing of these algal genomes was not feasible. Comparative genomics is a powerful tool in evolutionary biology, in which distinct features (e.g. protein-coding capacity, gene organisation, gene families, transposable elements, regulatory features and genome structure) can be analysed at a multi-genome scale. Comparative analyses of these genome features aid in tracing evolutionary history, and thus elucidate the drivers of lineage diversification42. Genomic techniques are increasingly being used to investigate non-model organisms, providing valuable insights into their biology. Likewise, these approaches can be applied to the study of dinoflagellates43. In this thesis research, I use comparative genomics to gain a better understanding of the diversification and genome evolution of Symbiodiniaceae, focusing on the transition of these dinoflagellates from a free-living to a symbiotic lifestyle.

1.1. Hypotheses

I. Symbiodiniaceae genomes encode gene functions that are associated with adaptation to diverse ecological niches, and diversification of these lineages.

II. Genomes of taxa among the basal genus Symbiodinium exhibit features related to the early evolutionary transition of Symbiodiniaceae from free-living to symbiotic.

III. Genome-sequence divergence among Symbiodiniaceae, and within genus Symbiodinium, is comparable to the observed genetic diversity based on known molecular markers.

1.2. Aims

▪ Aim 1: to assess the variation of gene functions in symbiodiniacean lineages through the analysis of genome-scale data (Hypothesis I).
▪ Aim 2: to characterise features associated with the evolutionary transition of Symbiodiniaceae from free-living to symbiotic in genomes of Symbiodinium (Hypothesis II).
▪ Aim 3: to assess the divergence of genome sequences and variation of encoded gene functions among Symbiodinium and among Symbiodiniaceae (Hypothesis III).

1.3. Scope and limitations

This thesis research represents the most comprehensive genome-scale comparison of the family Symbiodiniaceae and of the genus Symbiodinium. It represents the first systematic comparative analysis of genomes from taxa within a single genus in the family. The overall research is based on both newly generated data from this thesis work, and other available data from previous studies. The generation of de novo genome and transcriptome data in this thesis work follows current best practice for non-model eukaryote genomes, using both short- and long-read sequencing technologies. The gene-prediction workflow adopted in this thesis research was specifically customised for dinoflagellates. The generated draft genome and transcriptome assemblies, as well as predicted gene models, are thus the best available for any symbiodiniacean taxon to date.

Genome-scale data generated from this thesis work provide a foundational resource for future research to understand how Symbiodiniaceae evolved into the most successful and abundant symbionts in coral reef ecosystems. The sampling of Symbiodiniaceae taxa (and Symbiodinium taxa) in this thesis research was carefully designed to include diverse lifestyles, hosts and phylogenetic subclades within Symbiodinium, leveraging available genome-scale data from previous studies. Because the number of sampled taxa remains restricted, the results presented in this thesis work were interpreted within the context of these sampled taxa. To comprehensively elucidate adaptation mechanisms of other species to their distinct ecological niches, generation of additional genome data will be required, especially given the high genetic divergence among symbiodiniacean lineages.

1.4. Significance

Coral reefs are valuable for multiple reasons: they are biodiversity hotspots, protect coastlines from storms, allow the establishment of associated ecosystems (e.g. mangroves and marine lagoons), provide goods to human societies and support recreational activities that attract tourism44,45. The economic value of the goods and benefits provided by coral reefs worldwide has been estimated at between $4B and $500B AUD per year46-49. The Great Barrier Reef alone has been valued at $56B AUD and contributes $6.4B AUD yearly to the Australian economy46. However, coral reefs worldwide are under threat from both local (e.g. pollution, destruction by maritime traffic, overfishing) and global factors (mainly ocean acidification and global warming)50-53. These threats can result in the in situ loss of pigments and death of the symbiodiniacean symbionts, or in the symbionts being ejected from the coral hosts: a phenomenon known as coral bleaching3,54. The resilience and adaptability of coral reefs to global climate change rely on the coral-dinoflagellate symbiosis55,56. It is therefore important to have a good understanding of this association. There has been extensive research exploring the relationship of corals and Symbiodiniaceae in diverse environmental conditions6,57-62. However, the origin, evolution and establishment of this symbiosis from the molecular perspective remain little known.

Using novel genome data, this project contributes specifically to this aspect by characterising molecular and evolutionary mechanisms that enabled Symbiodiniaceae to establish symbiosis with corals and other organisms and to diversify into the most successful coral reef symbionts at present. This is the first systematic study of the evolution of symbiodiniacean genomes. The computational methods and workflows implemented in this work are transferable to other types of symbiosis. The available whole-genome sequences of Symbiodiniaceae (from both this thesis research and previous studies) can be employed for the development of phylogenetic and population genetic markers to reliably delineate species boundaries in combination with ecological and physiological attributes63,64. Additionally, outcomes from this thesis research, in combination with the current knowledge and genomic resources of corals65-72, provide a foundational reference for characterising the molecular basis of the adaptation of coral-dinoflagellate assemblages to stress conditions. This knowledge can in turn be utilised to identify reefs susceptible to harsh environmental conditions and guide risk-mitigation and conservation strategies73-75. Furthermore, the integrated knowledge can be used to pinpoint targets for genetic engineering of coral holobionts that are resilient to global environmental change76-81.

1.5. Outline

This thesis is composed of six chapters. In Chapter 1, I introduce the family Symbiodiniaceae, their functional diversity and the knowledge gaps regarding their evolutionary transition to symbiosis, the impacts of this transition on the evolution of their genomes, and their mechanisms of adaptation to the different environments that they inhabit. Next, I present my working hypotheses and the specific aims to address them in this thesis research. The scope and limitations of the study, and its significance, are also presented.

In Chapter 2, I summarise our current knowledge on the diversity and evolution of Symbiodiniaceae. I also review the state-of-the-art of Symbiodiniaceae genomics and describe the expected signatures of the evolutionary transition to symbiosis based on what we know from other intracellular residents. In addition, I go through some of the most common methods in comparative genomics and how the implementation of these methods can help us improve our understanding of the evolutionary history of the family.

In Chapter 3, I survey the gene functions that distinguish Symbiodiniaceae from other lineages in the Order Suessiales, as well as functions specific to most of the major clades within the family. These results are based on the analysis of available genome and transcriptome data from symbiodiniacean taxa. The gene functions shared by all lineages in Symbiodiniaceae are relevant to the habitats in which they are usually found as well as to the establishment and maintenance of symbiosis. I also show that some functions are rather specific to lineages within the family and highlight the need to further investigate their role in adaptation.

In Chapter 4, I assess impacts of the evolution into symbiosis on the genome of a symbiotic species versus a free-living relative, both within genus Symbiodinium (Clade A). The study is based on the comparison of newly sequenced de novo genomes. Genome features pertinent to facultative intracellular residents were identified. Enriched gene functions associated with their respective lifestyles were also identified in each species.

In Chapter 5, I assess whole-genome sequence divergence among isolates within genus Symbiodinium, and more broadly in Symbiodiniaceae, in a comparative analysis based on newly sequenced de novo genomes from five other isolates in Symbiodinium, the genomes of the two species from Chapter 4 and other available genomes of Order Suessiales. The availability of datasets for different isolates within the same species from this and previous studies, i.e. for Symbiodinium microadriaticum and Symbiodinium tridacnidorum, allowed me to assess the genome variation at the intra-specific level.

In Chapter 6, I discuss the results and insights gained from this work, within the context of the hypotheses and aims outlined in Chapter 1. I finally conclude this thesis by presenting an outlook for future research, highlighting outstanding questions regarding the evolution of Symbiodiniaceae into a diverse and predominantly symbiotic family.

1.6. References

1 Muscatine, L. & Porter, J. W. Reef corals: mutualistic symbioses adapted to nutrient-poor environments. Bioscience 27, 454-460 (1977).
2 Yellowlees, D., Rees, T. A. V. & Leggat, W. Metabolic interactions between algal symbionts and invertebrate hosts. Plant Cell Environ. 31, 679-694 (2008).
3 Hoegh-Guldberg, O. Climate change, and the future of the world's coral reefs. Mar. Freshwater Res. 50, 839-866 (1999).
4 Cowen, R. The role of algal symbiosis in reefs through time. Palaios, 221-227 (1988).
5 Cooper, T., Gilmour, J. & Fabricius, K. Bioindicators of changes in water quality on coral reefs: review and recommendations for monitoring programmes. Coral Reefs 28, 589-606 (2009).
6 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).
7 Zook, D. P. in Symbiosis Ch. Prioritizing symbiosis to sustain biodiversity: are symbionts keystone species?, 3-12 (Springer, 2001).
8 Rowan, R. & Powers, D. A. Molecular genetic identification of symbiotic dinoflagellates (zooxanthellae). Mar. Ecol. Prog. Ser. 71, 65-73 (1991).
9 Carlos, A. A., Baillie, B. K., Kawachi, M. & Maruyama, T. Phylogenetic position of Symbiodinium (Dinophyceae) isolates from tridacnids (Bivalvia), cardiids (Bivalvia), a sponge (Porifera), a soft coral (Anthozoa), and a free‐living strain. J. Phycol. 35, 1054-1062 (1999).

10 LaJeunesse, T. C. Investigating the biodiversity, ecology, and phylogeny of endosymbiotic dinoflagellates in the genus Symbiodinium using the ITS region: in search of a “species” level marker. J. Phycol. 37, 866-880 (2001).
11 Pochon, X., Pawlowski, J., Zaninetti, L. & Rowan, R. High genetic diversity and relative specificity among Symbiodinium-like endosymbiotic dinoflagellates in soritid foraminiferans. Mar. Biol. 139, 1069-1078 (2001).
12 Pochon, X., LaJeunesse, T. & Pawlowski, J. Biogeographic partitioning and host specialization among foraminiferan dinoflagellate symbionts (Symbiodinium; Dinophyta). Mar. Biol. 146, 17-27 (2004).
13 Pochon, X. & Gates, R. D. A new Symbiodinium clade (Dinophyceae) from soritid foraminifera in Hawai’i. Mol. Phylogenet. Evol. 56, 492-497 (2010).
14 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral . Curr. Biol. 28, 2570-2580 (2018).
15 Rowan, R. & Powers, D. A. Ribosomal RNA sequences and the diversity of symbiotic dinoflagellates (zooxanthellae). Proc. Natl. Acad. Sci. U. S. A. 89, 3639-3643 (1992).
16 Suggett, D. J. et al. Functional diversity of photobiological traits within the genus Symbiodinium appears to be governed by the interaction of cell size with cladal designation. New Phytol. 208, 370-381 (2015).
17 Goyen, S. et al. A molecular physiology basis for functional diversity of hydrogen peroxide production amongst Symbiodinium spp. (Dinophyceae). Mar. Biol. 164, 46 (2017).
18 Suggett, D. J., Warner, M. E. & Leggat, W. Symbiotic dinoflagellate functional diversity mediates coral survival under ecological crisis. Trends Ecol. Evol. 32, 735-745 (2017).
19 Warner, M. E. & Suggett, D. J. in The Cnidaria, past, present and future: the world of Medusa and her sisters (eds Stefano Goffredo & Zvy Dubinsky) 489-509 (Springer International Publishing, 2016).
20 Parkinson, J. E. et al. Gene expression variation resolves species and individual strains among coral-associated dinoflagellates within the genus Symbiodinium. Genome Biol. Evol. 8, 665-680 (2016).
21 Stat, M., Morris, E. & Gates, R. D. Functional diversity in coral–dinoflagellate symbiosis. Proc. Natl. Acad. Sci. U. S. A. 105, 9256-9261 (2008).
22 Lesser, M., Stat, M. & Gates, R. The endosymbiotic dinoflagellates (Symbiodinium sp.) of corals are parasites and mutualists. Coral Reefs 32, 603-611 (2013).
23 Fang, J. K. H., Schönberg, C. H. L., Hoegh-Guldberg, O. & Dove, S. Symbiotic plasticity of Symbiodinium in a common excavating sponge. Mar. Biol. 164, 104 (2017).

24 Sachs, J. L. & Wilcox, T. P. A shift to parasitism in the jellyfish symbiont Symbiodinium microadriaticum. Proc. R. Soc. Lond. B Biol. Sci. 273, 425-429 (2006).
25 Morris, L. A., Voolstra, C. R., Quigley, K. M., Bourne, D. G. & Bay, L. K. Nutrient availability and metabolism affect the stability of coral–Symbiodiniaceae symbioses. Trends Microbiol. 27, 678-689 (2019).
26 Baker, D. M., Freeman, C. J., Wong, J. C. Y., Fogel, M. L. & Knowlton, N. Climate change promotes parasitism in a coral symbiosis. ISME J. 12, 921-930 (2018).
27 Littman, R. A., van Oppen, M. J. & Willis, B. L. Methods for sampling free-living Symbiodinium (zooxanthellae) and their distribution and abundance at Lizard Island (Great Barrier Reef). J. Exp. Mar. Biol. Ecol. 364, 48-53 (2008).
28 Manning, M. M. & Gates, R. D. Diversity in populations of free-living Symbiodinium from a Caribbean and Pacific reef. Limnol. Oceanogr. 53, 1853 (2008).
29 Decelle, J. et al. Worldwide occurrence and activity of the reef-building coral symbiont Symbiodinium in the open ocean. Curr. Biol. 28, 3625-3633.e3623 (2018).
30 Granados-Cifuentes, C., Neigel, J., Leberg, P. & Rodriguez-Lanetty, M. Genetic diversity of free-living Symbiodinium in the Caribbean: the importance of habitats and seasons. Coral Reefs 34, 927-939 (2015).
31 Coffroth, M. A., Lewis, C. F., Santos, S. R. & Weaver, J. L. Environmental populations of symbiotic dinoflagellates in the genus Symbiodinium can initiate symbioses with reef cnidarians. Curr. Biol. 16, R985-R987 (2006).
32 Hirose, M., Reimer, J. D., Hidaka, M. & Suda, S. Phylogenetic analyses of potentially free-living Symbiodinium spp. isolated from coral reef sand in Okinawa, Japan. Mar. Biol. 155, 105-112 (2008).
33 Yamashita, H. & Koike, K. Genetic identity of free‐living Symbiodinium obtained over a broad latitudinal range in the Japanese coast. Phycol. Res. 61, 68-80 (2013).
34 Pochon, X., Putnam, H. M., Burki, F. & Gates, R. D. Identifying and characterizing alternative molecular markers for the symbiotic and free-living dinoflagellate genus Symbiodinium. PLoS ONE 7, e29816 (2012).
35 Sarjeant, W. A. S. Fossil and living dinoflagellates. (Elsevier, 2013).
36 Moreno Díaz de la Espina, S., Alverca, E., Cuadrado, A. & Franca, S. Organization of the genome and gene expression in a nuclear environment lacking histones and nucleosomes: the amazing dinoflagellates. Eur. J. Cell Biol. 84, 137-149 (2005).
37 Pochon, X., Putnam, H. M. & Gates, R. D. Multi-gene analysis of Symbiodinium dinoflagellates: a perspective on rarity, symbiosis, and evolution. PeerJ 2, e394 (2014).

38 LaJeunesse, T. C. & Thornhill, D. J. Improved resolution of reef-coral endosymbiont (Symbiodinium) species diversity, ecology, and evolution through psbA non-coding region genotyping. PLoS ONE 6, e29013 (2011).
39 Pochon, X., Montoya-Burgos, J. I., Stadelmann, B. & Pawlowski, J. Molecular phylogeny, evolutionary rates, and divergence timing of the symbiotic dinoflagellate genus Symbiodinium. Mol. Phylogenet. Evol. 38, 20-30 (2006).
40 Stat, M., Carter, D. & Hoegh-Guldberg, O. The evolutionary history of Symbiodinium and scleractinian hosts—symbiosis, diversity, and the effect of climate change. Perspect. Plant Ecol. 8, 23-43 (2006).
41 Lin, S. Genomic understanding of dinoflagellates. Res. Microbiol. 162, 551-569 (2011).
42 Xia, X. in Comparative Genomics Ch. Comparative Genomics and the Comparative Methods, 21-47 (Springer, 2013).
43 Murray, S. A. et al. Unravelling the functional genetics of dinoflagellates: a review of approaches and opportunities. Perspect. Phycol., 37-52 (2016).
44 Brander, L. M., Van Beukering, P. & Cesar, H. S. The recreational value of coral reefs: a meta-analysis. Ecol. Econ. 63, 209-218 (2007).
45 Cesar, H., Burke, L. & Pet-Soede, L. The economics of worldwide coral reef degradation. (2003).
46 O'Mahoney, J. et al. At what price? The economic, social and icon value of the Great Barrier Reef, (2017).
47 Costanza, R. et al. The value of the world's ecosystem services and natural capital. Nature 387, 253-260 (1997).
48 Crossman, N. D., Stoeckl, N., Sangha, K. K. & Costanza, R. Economic values of the Northern Territory marine and coastal environments. Australian Marine Conservation Society: Darwin, Australia (2018).
49 Stoeckl, N. et al. The Great Barrier Reef World Heritage Area: its 'value' to residents and tourists, and the effect of world prices on it. (Reef and Rainforest Research Centre Limited, 2014).
50 Bryant, D., Burke, L., McManus, J. & Spalding, M. Reefs at risk: a map-based indicator of threats to the worlds coral reefs. (World Resources institute, Washington D. C., 1998).
51 Hughes, T. P. et al. Climate change, human impacts, and the resilience of coral reefs. Science 301, 929-933 (2003).

52 Hoegh-Guldberg, O. et al. Coral reefs under rapid climate change and ocean acidification. Science 318, 1737-1742 (2007).
53 Pandolfi, J. M. et al. Global trajectories of the long-term decline of coral reef ecosystems. Science 301, 955-958 (2003).
54 Brown, B. Coral bleaching: causes and consequences. Coral Reefs 16, S129-S138 (1997).
55 McCulloch, M., Falter, J., Trotter, J. & Montagna, P. Coral resilience to ocean acidification and global warming through pH up-regulation. Nat. Clim. Change 2, 623-627 (2012).
56 Berkelmans, R. & van Oppen, M. J. The role of zooxanthellae in the thermal tolerance of corals: a ‘nugget of hope’ for coral reefs in an era of climate change. Proc. R. Soc. Lond. B Biol. Sci. 273, 2305-2312 (2006).
57 Rodriguez-Lanetty, M., Wood-Charlson, E. M., Hollingsworth, L. L., Krupp, D. A. & Weis, V. M. Temporal and spatial infection dynamics indicate recognition events in the early hours of a dinoflagellate/coral symbiosis. Mar. Biol. 149, 713-719 (2006).
58 Baker, D. M., Andras, J. P., Jordán-Garza, A. G. & Fogel, M. L. Nitrate competition in a coral symbiosis varies with temperature among Symbiodinium clades. ISME J. 7, 1248-1251 (2013).
59 Rowan, R. & Knowlton, N. Intraspecific diversity and ecological zonation in coral-algal symbiosis. Proc. Natl. Acad. Sci. U. S. A. 92, 2850-2853 (1995).
60 Mayfield, A. B. Uncovering spatio-temporal and treatment-derived differences in the molecular physiology of a model coral-dinoflagellate mutualism with multivariate statistical approaches. J. Mar. Sci. Eng. 4, 63 (2016).
61 Davies, S. W., Marchetti, A., Ries, J. B. & Castillo, K. D. Thermal and pCO2 stress elicit divergent transcriptomic responses in a resilient coral. Front. Mar. Sci. 3, 112 (2016).
62 González-Pech, R. A., Ragan, M. A. & Chan, C. X. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodinium. Sci. Rep. 7, 15021 (2017).
63 Wham, D. C. & LaJeunesse, T. C. Symbiodinium population genetics: testing for species boundaries and analysing samples with mixed genotypes. Mol. Ecol. 25, 2699-2712 (2016).
64 Howells, E., Willis, B., Bay, L. & van Oppen, M. Microsatellite allele sizes alone are insufficient to delineate species boundaries in Symbiodinium. Mol. Ecol. (2016).
65 Dixon, G. B. et al. Genomic determinants of coral heat tolerance across latitudes. Science 348, 1460-1462 (2015).
66 Bhattacharya, D. et al. Comparative genomics explains the evolutionary success of reef-forming corals. eLife 5, e13288 (2016).

67 Baumgarten, S. et al. The genome of Aiptasia, a sea anemone model for coral symbiosis. Proc. Natl. Acad. Sci. U. S. A. 112, 11893-11898 (2015).
68 Thomas, L. & Palumbi, S. R. The genomics of recovery from coral bleaching. Proceedings of the Royal Society B: Biological Sciences 284, 20171790 (2017).
69 Ying, H. et al. Comparative genomics reveals the distinct evolutionary trajectories of the robust and complex coral lineages. Genome Biol. 19, 175 (2018).
70 Ying, H. et al. The whole-genome sequence of the coral Acropora millepora. Genome Biol. Evol. 11, 1374-1379 (2019).
71 Fuller, Z. L. et al. Population genetics of the coral Acropora millepora: towards a genomic predictor of bleaching. bioRxiv, 867754 (2019).
72 Bay, R. A., Rose, N. H., Logan, C. A. & Palumbi, S. R. Genomic models predict successful coral adaptation if future ocean warming rates are reduced. Science Advances 3, e1701413 (2017).
73 Forêt, S. et al. Genomic and microarray approaches to coral reef conservation biology. Coral Reefs 26, 475 (2007).
74 Robbins, S. J. et al. A genomic view of the reef-building coral Porites lutea and its microbial symbionts. Nature Microbiology (2019).
75 Weis, V. M. Cell biology of coral symbiosis: foundational study can inform solutions to the coral reef crisis. Integr. Comp. Biol. 59, 845-855 (2019).
76 Miller, D. J., Ball, E. E., Forêt, S. & Satoh, N. Coral genomics and transcriptomics — Ushering in a new era in coral biology. J. Exp. Mar. Biol. Ecol. 408, 114-119 (2011).
77 Cleves, P. A., Strader, M. E., Bay, L. K., Pringle, J. R. & Matz, M. V. CRISPR/Cas9-mediated genome editing in a reef-building coral. Proc. Natl. Acad. Sci. U. S. A. 115, 5235-5240 (2018).
78 Cleves, P. A., Shumaker, A., Lee, J., Putnam, H. M. & Bhattacharya, D. Unknown to known: advancing knowledge of coral gene function. Trends Genet. 36, 93-104 (2020).
79 van Oppen, M. J. H., Oliver, J. K., Putnam, H. M. & Gates, R. D. Building coral reef resilience through assisted evolution. Proc. Natl. Acad. Sci. U. S. A. 112, 2307-2313 (2015).
80 Levin, R. A. et al. Engineering strategies to decode and enhance the genomes of coral symbionts. Frontiers in Microbiology 8 (2017).
81 Knowlton, N. & Leray, M. in Coral Reefs in the Anthropocene (ed Charles Birkeland) 117-132 (Springer Netherlands, 2015).

Chapter 2. Symbiodiniaceae diversity and evolution, and genomics methods for their study

Although dinoflagellates are mostly free-living, the family Symbiodiniaceae is predominantly constituted of symbiotic species. However, the evolutionary mechanisms that underpinned the diversification of this family as multiple symbiotic lineages remain largely unexplored. In this chapter, I review the current knowledge regarding diversity and evolution of Symbiodiniaceae. I then revisit some of the most commonly used approaches in comparative genomics.

Part of this chapter (Section 2.1.7) summarises our expectation of genome features in symbiotic Symbiodiniaceae relative to their evolutionary transition to a symbiotic lifestyle, with reference to existing research in other intracellular symbionts and parasites. This section has been published in Trends in Ecology & Evolution1 (DOI: 10.1016/j.tree.2019.04.010), and was featured on the journal cover of the September 2019 issue. As the first author of this publication, I conceived the topic, reviewed all the relevant literature, wrote the manuscript and prepared all figures.

Systematics of the Symbiodiniaceae family was revised in August 20182. Much of the literature before then referred to the distinct lineages of Symbiodiniaceae as clades, i.e. Clades A through I. In this Chapter, I refer to these lineages as clades, following the earlier literature, and relate them to the more-recent revised genera where applicable.

2.1. Family Symbiodiniaceae

2.1.1. Morphology

Symbiodiniaceae is a specialised family of dinoflagellates (phylum Dinoflagellata), which together with ciliates (Ciliophora) and apicomplexans (Apicomplexa) form the alveolates (supergroup Alveolata)3. Symbiodiniaceae are typically known for their capability to establish mutualistic symbiosis with other organisms, including corals, jellyfish, clams, foraminifera and ciliates4. While most dinoflagellates are free-living5, Symbiodiniaceae represents one of the few predominantly symbiotic lineages. However, symbiotic Symbiodiniaceae typically exhibit a free-living stage as part of their normal life cycle5, i.e. they are not obligate symbionts. The free-living (or mastigote) cells are motile and morphologically distinguishable from the symbiotic (or coccoid) forms (Fig. 2.1)6. Transition between the two habits occurs quickly and the cellular processes involved are not known in detail (but see Fitt & Trench 19837). Likewise, the evolutionary novelties that allowed the initial establishment of symbiosis between Symbiodiniaceae and other organisms remain little known.

2.1.2. Ecological niches

Although Symbiodiniaceae do not display great morphological diversity, their free-living origin and symbiotic plasticity have allowed them to occupy a broad spectrum of ecological niches. One of the most surprising findings regarding Symbiodiniaceae ecology is the existence of strictly free-living forms. Since 1999, numerous Symbiodiniaceae have been recovered from environmental samples without having been previously observed in any host8. The sources of these samples range from surface seawater to sediments, from the Caribbean to the Pacific, and across different latitudes4,9-14. This broad distribution suggests that free-living Symbiodiniaceae can occupy a very large number of niches. Moreover, they might represent reservoirs of reef symbionts15-22. Despite their genetic diversity, many of these isolates belong to Symbiodinium sensu stricto, and some of them form a monophyletic subclade10,18. If Clade A is indeed the most ancestral symbiodiniacean lineage (see Chapter 1), this monophyletic group would likely have preserved the ancestral free-living lifestyle to some extent, and would therefore represent a key target for studying the evolutionary transition from one lifestyle to the other. However, whether the free-living habit of most of these environmental Symbiodiniaceae is permanent or transient remains controversial17.

Fig. 2.1 Morphotypes of Symbiodiniaceae. (A) Light micrographs of mastigote (motile) and (B) coccoid (spherical) cells of Effrenium voratum rt-383 (Clade E). Modified from Jeong et al. (2014)23. (C) Mastigote cell of Symbiodinium natans; the transverse (tf) and the longitudinal (lf) flagella can be observed. Modified from Hansen and Daugbjerg (2009)20. (D) Scanning-electron micrograph of a freeze-fractured internal mesentery from a coral polyp (Porites porites) showing the symbiosomes with the coccoid symbiodiniacean cells. Modified from LaJeunesse et al. (2012)6.

In addition to symbiotic and free-living lifestyles, the metabolism and physiology of some Symbiodiniaceae resemble those of parasites rather than those of mutualistic symbionts, as reflected in the reduced fitness of their hosts24-29. Remarkably, many of these isolates are part of the genus Symbiodinium. A potentially opportunistic and necrotrophic species (Symbiodinium necroappetens) has also been described in this genus30. The functional diversity of the genus Symbiodinium is thus in accordance with its putative ancestral condition.

Amongst the mutualistic symbiodiniacean forms, a broad spectrum of ecological niches is expected. Apart from the diversity in hosts, Symbiodiniaceae can be found along environmental gradients including light, depth and temperature. For instance, light-dependent distribution of Symbiodiniaceae has been linked to physiological constraints31. Mycosporine-like amino acids (MAAs) act as a mechanism of protection against UV in free-living dinoflagellates, e.g. in Alexandrium excavatum32. MAAs have also been found to be synthesised by several Symbiodinium isolates in culture, but not by Breviolum, Cladocopium, Effrenium and Fugacium isolates33, consistent with the dominance of Symbiodinium species in shallow habitats in the Caribbean34,35.

Depth zonation is another example of ecological niche diversification among clades. Coral colonies of Montastraea spp. in the Caribbean display distinct symbiont associations with Symbiodinium and Breviolum in shallow water (0–3 m), Symbiodinium and Cladocopium at medium depths (3–6 m) and Cladocopium in deeper water (6–14 m)36. Other reefs in the Caribbean show a similar pattern with Cladocopium in deep water, but Durusdinium (part of former Clade D) replace Symbiodinium and Breviolum at lower depth37. Durusdinium were also found with Cladocopium in deep waters, leading to the conclusion that Durusdinium are symbionts capable of outgrowing other strains in harsh environmental conditions38,39. In the Pacific, however, the zonation patterns for Symbiodinium, Breviolum and Cladocopium do not hold4; instead, depth zonation occurs among lineages within Cladocopium4,40,41.

Members of Durusdinium are considered stress-resistant isolates. Durusdinium have been observed in high-temperature coral reefs and in association with corals recovering from bleaching42,43. However, the most abundant Symbiodiniaceae in the warmest sea of the planet (the Persian-Arabian Gulf) belong to the genus Cladocopium (Clade C)44, and the corals they are associated with show strong local adaptation to high temperature and salinity, which restricts their distribution45,46. These resistant forms of Symbiodiniaceae might be key to the resilience of coral reefs in the face of global change47,48. On the other hand, some of these isolates may play an opportunistic role, in which case their presence serves as an indicator of compromised coral reefs39. A latitudinal distribution pattern of Symbiodiniaceae has also been described, in which Clades A, B and F occupy habitats at higher latitudes and Clade C dominates more tropical environments4 (Fig. 2.2). Finally, a coral host at a specific location can exhibit changes in types of Symbiodiniaceae with time as a consequence of fluctuating environmental factors49.

2.1.3. Phylogeny and systematics

A wide range of molecular markers has been used to infer the phylogeny of Symbiodiniaceae, including allozymes50, random-amplified-polymorphic DNA patterns (RAPD)51, microsatellites52, chloroplast-encoded genes53-55, mitochondrial genes56, nuclear ribosomal DNA2,8,10,11,55,57-61 and combinations of them62,63. In most cases, the major monophyletic clades are recovered4,19. Rowan & Powers60 first named three of these major clades (A to C) in 1991. This classification was then expanded (up to Clade I)8,37,64-68 and was broadly used in the literature until recently. Back then, all Symbiodiniaceae were classified within the genus Symbiodinium. It was not until a revision of the group that the genus was reclassified as a family, and most of the major clades were assigned to genera2. However, some of the new genera do not encompass whole clades; for instance, Symbiodinium sensu stricto excludes the temperate Clade A species. Furthermore, the clades that have not yet been reclassified as genera preserve their conventional names. For these reasons, I adopt the current naming scheme of Symbiodiniaceae where pertinent in this document; however, in cases where the revised naming scheme does not apply, I refer to the distinct clades as such following the earlier literature, to avoid confusion.

Fig. 2.2 Global distribution of clades in Symbiodiniaceae Distribution of Symbiodiniaceae associated with corals in shallow seawater (<7 m depth). The pie charts show the proportional distribution per clade at each location; the diameters are proportional to the square root of the number of species sampled (see scale). The numbers next to the pie charts are dataset identifiers. From Baker (2003)4.

One of the most comprehensive phylogenies of Symbiodiniaceae is based on the concatenation of two nuclear (nr28S and elf2), two chloroplast-encoded (psbA and cp23S) and two mitochondrial (coI and cob) genes, and distinguishes all nine major clades (A through I)62. Clade A appears as the basal group to all other Symbiodiniaceae clades, and has therefore been described as the most ancestral lineage in various studies40,66,69. However, conclusive evidence supporting the ancestry of Clade A remains elusive. On the other hand, the more-recently derived clades include F, sister lineages to Clades C and H. Clade C comprises, to our knowledge, the largest genetic diversity of any clade in Symbiodiniaceae70.

2.1.4. Evolutionary rates and timescale

Early research on evolutionary rates and divergence times of symbiodiniacean clades was based on nuclear (nr28S, elf2), plastid (cp23S, psbA) and mitochondrial (coI, cob) genes62,63. In these studies, the chloroplast markers exhibit a slower effective evolutionary rate relative to the others, as evidenced by the long branches in Durusdinium and Clade I. Mitochondrial genes appear to evolve at half the rate of nuclear and plastid genes, and seem to be evolving the slowest in Symbiodinium. In an earlier study, divergence times were estimated for each clade using a Bayesian relaxed molecular clock63. These molecular markers yielded similar results when analysed either individually or as concatenated datasets. According to these estimations, Clade A diverged from the other clades about 50 million years ago (MYA). The rest of the clades (B to I) were estimated to have originated about 25 MYA; however, the diversification within each clade appears to have started only approximately 15 MYA. These diversification times of Symbiodiniaceae match the decrease of temperature in oceanic waters since the Eocene71. Seawater cooling started about 50 MYA, just when Clade A split from the other lineages. Another important drop of seawater temperature occurred near the end of the Eocene and beginning of the Oligocene (approximately 35 MYA), which is consistent with the divergence of Clade B from other more-recently diverged lineages (Clades C, F and H). Finally, the diversification time of lineages within each clade matches the last pronounced decrease in seawater temperature, which commenced in the mid-Miocene. The origin of Symbiodiniaceae 50 MYA also coincides with the decreasing sea level and the global increase in coastlines, which had a significant impact on marine environments and biodiversity72,73.

A more-recent study suggests that the divergence time of Symbiodiniaceae occurred further back in the past, during the Jurassic (about 160 MYA, Fig. 2.3)2. This investigation applied a strict molecular clock on a Bayesian phylogenetic tree for the nuclear large ribosomal subunit (LSU rDNA) that was calibrated with fossil records, plate-tectonic evidence and biogeographic data. According to this estimation, the divergence of symbiodiniacean lineages matched the ancient radiation of scleractinian corals, which suggests that the diversification of Symbiodiniaceae was tightly coupled with the establishment of symbiosis with cnidarian hosts. This greater evolutionary timescale implies a higher chance for symbiodiniacean dinoflagellates to have undergone macroevolutionary processes leading to their diversification; this presents research opportunities in evolution of symbiosis74.
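As a rough illustration of the strict-clock logic only (not the Bayesian inference used in the cited study), a single substitution rate calibrated at one dated node can be used to convert genetic distances into divergence times. The function names and all numerical values below are hypothetical and chosen purely for illustration.

```python
def calibrate_rate(distance, node_age_myr):
    """Substitution rate (per site per Myr) from one calibrated node.

    `distance` is the pairwise distance (substitutions per site) between two
    lineages that split `node_age_myr` million years ago. Each lineage
    accumulates half of that distance, hence the factor of two.
    """
    return distance / (2.0 * node_age_myr)


def divergence_time(distance, rate):
    """Divergence time (Myr) under a strict clock with a single rate."""
    return distance / (2.0 * rate)


# Hypothetical values for illustration only (not taken from any dataset).
rate = calibrate_rate(distance=0.10, node_age_myr=50.0)
age = divergence_time(distance=0.06, rate=rate)
print(f"rate = {rate:.4f} substitutions/site/Myr, age = {age:.1f} Myr")
```

A relaxed clock, as used in the earlier study, would instead allow the rate to vary among branches, so a single division of this kind no longer applies.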

Fig. 2.3 Divergence time of Symbiodiniaceae lineages (A) Divergence time estimation for the different Symbiodiniaceae clades based on a Bayesian strict clock inference with host fossil records (a) and plate tectonics data (b) as calibration points. Expected ages of nodes i-v are based on phylogeographic, paleontological and plate tectonic evidence. (B) Timeline showing the fossil records for the origin of dinoflagellates similar to extant Suessiales and the adaptive radiation of stony corals in the Jurassic. Modified from LaJeunesse et al.2

2.1.5. Evolution of symbiosis

Host-symbiont assemblages can expand the resource usage of both the host and the symbiont, thus enhancing their capacity to occupy other ecological niches and leading to diversification75,76. Several evolutionary mechanisms of symbiosis have been postulated, depending on the nature of the symbiotic relationship and of the partners. Coevolution, for instance, is a major driver in host-symbiont evolution77. So far, no evidence of coevolution sensu stricto has been found in associations between Symbiodiniaceae and corals40,60,78. However, the way that symbionts are transmitted among hosts can be highly correlated to host specificity. Vertically transmitted Symbiodiniaceae (i.e. those transmitted intracellularly in eggs or larvae from parents to offspring, e.g. in brooding corals and foraminifera) display higher host-specificity compared to those transmitted horizontally (i.e. those acquired from the environment)38,72,79,80; this could be indicative of coevolution81. Nevertheless, exceptions to the correlation between transmission mode and host specificity are known. An example is the brooding coral Seriatopora hystrix that exhibits a mixed transmission mode, in which a fraction of the symbionts is acquired from the parental colonies and another from the environment82.

In the case of vertical transmission, short-term coevolutionary processes driving speciation could have occurred in periods of relative environmental stability34,83, and resulted in a strong specificity between symbiodiniacean species and hosts. Furthermore, estimates of the radiation of Symbiodiniaceae during the Jurassic coincide with the radiation of reef-building corals2, which makes the occurrence of potential co-evolving symbiotic associations more likely. Additional evidence of coevolution is provided by the study of cellular functions in Symbiodiniaceae relative to the signal of interaction or complementarity with their hosts84,85. Less-evident gene functions, such as those implicated in host recognition, phagocytosis, trans-membrane transport and calcification86, can also be indicative of coevolutionary events, and often leave footprints on the transcriptomes and genomes of the symbionts87-92. On the other hand, horizontal transmission of symbionts is more complex and more variable, as evidenced by the flexibility of symbiotic associations in Symbiodiniaceae. The source of this variation has been summarised in three factors93: (i) selective pressure on the symbiont to benefit from the host, (ii) adaptive advantages to changing environments given the availability of different symbiont genotypes, and (iii) irregular acquisition of symbionts from the free-living surroundings. The two latter factors are rather host-oriented.

The evolution of symbiodiniacean symbioses from the symbionts’ perspective can be explained by two complementary hypotheses: the magnesium inhibition and the arrested phagosome hypotheses94. The magnesium inhibition hypothesis posits that changes in Mg/Ca concentration ratio led to the transition of calcitic to aragonitic seas over the last 80–100 million years, and caused inhibition of the transport of Ca2+ into the dinoflagellate cells. This then triggered the invasion of the intracellular environment of diverse hosts, initiating the symbiotic niche of Symbiodiniaceae. However, this hypothesis does not match the most-recent estimate of the origin of symbiodiniacean dinoflagellates2. On the other hand, the arrested phagosome hypothesis postulates how Symbiodiniaceae managed to reside within the host cells. According to this hypothesis, Symbiodiniaceae actively release by-products typical of ongoing lysosomal digestion, simulating continuously operating phagosomes.

2.1.6. Genomics

The most-representative features of the nuclear and organellar genomes of dinoflagellates, including Symbiodiniaceae, are shown in Fig. 2.4. Dinoflagellates possess amongst the largest eukaryotic genomes, up to the estimated size of 245 Gbp in the haploid genome of Prorocentrum micans, for instance95,96. These large genomes, however, are not an outcome of extraordinary polyploidy or genome duplication events97. Interestingly, genome sizes of Symbiodiniaceae are in the smaller range compared to those of other dinoflagellates, with estimates between 1.5 and 5 Gbp. Reduction in genome size is not likely driven solely by symbiosis because the trend of genome-size reduction started in earlier diverging free-living taxa from Suessiales98.

The estimated numbers of chromosomes are quite variable in dinoflagellates, ranging from five to almost 300. However, this variation might be in part a consequence of miscounts due to the condensed and fragile nature of the chromosomes, and/or their abundant numbers99,100. Additional variation in chromosome counts could be caused by the length of the period during which the isolates had been maintained in culture, which has been proposed to correlate positively with chromosome number101; this hypothesis is yet to be tested formally. Similarly, the number of chromosomes in Symbiodiniaceae is quite variable, with estimates ranging from fewer than ten up to 100 chromosomes87,102,103.

Fig. 2.4 Key features of nuclear and organellar genomes of dinoflagellates (a) Representation of chromosome structure in the nucleus. (b) Diagram of the trans-splicing mechanism of nuclear mRNA. (c) Illustration of the mitochondrial genome structure. (d) Model of plastid minicircles. From Wisecaver & Hackett (2011)97.

Genome organisation of these microalgae was (mistakenly) described as intermediate between prokaryotes and eukaryotes, and the nucleus has been called a mesokaryon or, more recently, dinokaryon104,105. Chromosomes in dinoflagellates are typically fibrillar, permanently condensed and attached to the nuclear envelope, with a liquid crystal appearance99. Despite the absence of the characteristic heterochromatin banding pattern106, the presence of telomeres supports their linearity107. These chromosomes lack nucleosomes and histones108. Instead, chromosome compaction is determined by histone-like proteins (HLPs) that are homologous to proteobacterial proteins109,110. HLPs are distributed uniformly in dividing chromosomes, which might be an indication of their relevance to chromosome structure. During the interphase, HLPs are located in extra-chromosomal loops and in the nucleolus111. The concentration of HLPs was shown to correlate with chromosome condensation112. These observations suggest that HLPs may be involved in the regulation of gene expression (Fig. 2.4a). Although nucleosomes are lacking, all core histone genes of the nucleosome are present and transcribed in Symbiodiniaceae, and have a potential function in the regulation of transcription through methylation113. Another distinctive genomic feature of dinoflagellates is that their nuclear DNA includes a so-called fifth nucleotide, 5-hydroxymethyluracil, that replaces between 7 and 70% of the thymine114. This nucleotide appears to be a vestigial trait without current functionality115, but it is known to affect DNA stability, e.g. by lowering the temperature at which DNA melts116.

Transposable elements (TEs) and tandem repeats have been estimated to constitute less than 30% of the genome region, based on earlier analysis of Symbiodiniaceae genomes87,90, compared to >50% in the dinoflagellate Alexandrium ostenfeldii117. The reported overall G+C content ranges between 43% and 50% in genomes of Symbiodiniaceae, and between 51% and 57% in the corresponding protein-coding sequences (CDS)87-91,113. A very high proportion of CpG is methylated in the genomes of Symbiodiniaceae and other dinoflagellates, which may relate to regulation of gene expression118. This genome feature has been proposed to have evolved through the activity of mobile genome elements carrying methyltransferase domains119. Transcriptional regulation appears to be reduced in dinoflagellates114, e.g. no TATA box has been found in any dinoflagellate genes120. Furthermore, gene-expression profiling in dinoflagellates has revealed that the majority (~70%) of genes are constitutively expressed regardless of growth conditions121-123, and that background expression levels for most genes are largely invariant124. Therefore, most of the regulation processes of gene expression are thought to be post-transcriptional114.

Organellar genomes of dinoflagellates are also atypical relative to those of other eukaryotes. Mitochondrial DNA in dinoflagellates carries only three protein-coding genes (cob, cox1 and cox3) and two fragmented ribosomal DNAs (rDNAs)114, compared to the average 40–50 protein-coding genes in eukaryotes125. However, despite this gene reduction, genome complexity remains high. Firstly, the five genes occur in multiple copies. Secondly, coding regions are highly recombined and often fragmented126. Thirdly, they contain abundant non-coding sequences and repetitive regions with frequent inverted elements127. Furthermore, these genes encode non-conventional start and stop codons, and stop codons embedded in coding regions have been demonstrated to undergo RNA-editing or avoidance128,129. Mitochondrial transcripts are poly-adenylated and require trans-splicing to be translated129,130. Finally, mitochondrial genomes were found to comprise several non-identical copies of DNA sequences of up to approximately 30 kbp127,131. Dinoflagellate mitochondrial genomes are the most complex known to date (Fig. 2.4c).

Plastid genomes of dinoflagellates (Fig. 2.4d) are unconventional compared to those in other algae and plants132. Plastid genes are located in circular plasmid-like DNA molecules known as minicircles. These minicircles are usually between 2 and 3 kbp in size and carry up to four genes each, although instances of empty minicircles have also been discovered48,91,133. All the minicircles of the same species share a conserved core sequence in the noncoding region that may comprise transcription initiation sites; genes display the same orientation with respect to the core134,135. Transcription and replication appear to occur continuously around the circular DNA molecule in a “rolling circle” fashion. The transcripts are then cleaved and poly-uridylylated at the 3′-end136. The most complete symbiodiniacean plastid genome available so far belongs to a Cladocopium isolate (subclade C3). It consists of 13 minicircles (11 including protein-coding and two rRNA-coding genes) with one gene each137.

2.1.7. Genome evolution as intracellular residents

Intracellular residents (e.g. parasites and symbionts) undergo similar evolutionary trajectories, the stages of which include initial invasion into the host cell, permanence over generations in the intracellular environment, and transmission between hosts. Each of these stages impacts the evolution of resident genomes, leading to features that collectively constitute the so-called resident genome syndrome138,139 (see below). According to this notion, resident genomes pass through a highly dynamic and unstable phase characterised by extensive structural rearrangements during the initial transition to an intracellular lifestyle140. Confinement of the resident to the intracellular space then eventually leads to a more-stable, reduced genome141,142. In this stage, the genomes are generally small and A+T rich, and the genes display high evolutionary rates. These symptoms have largely been described in the genomes of intracellular bacteria138,139.

Currently available genome data from Symbiodiniaceae reveal signatures of symbiosis-related gene functions89-91 but the impact of the evolutionary transition to intracellularity on these genomes remains little explored. Here, I discuss genome evolution of Symbiodiniaceae across the broad spectrum of symbiotic associations (Fig. 2.5) focusing on the connecting scenarios of free-living species, facultative symbionts, and obligate symbionts.

Fig. 2.5 Expected genome features of Symbiodiniaceae across the spectrum of symbiotic associations Expected genome features of coral reef symbionts across the spectrum connecting the scenarios of (A) free-living species, and (B) facultative and (C) obligate symbionts. The meter arrows show the level (or amount) of a set of genome features expected to be seen in a scenario relative to the others.

2.1.7.1 Evolutionary implications of intracellular confinement

The resident genome syndrome posits a set of genome features (or symptoms) of intracellular residents that have arisen from long-term spatial confinement in the host cell. During the initial transition to confinement, the genomes of the residents are highly dynamic and unstable, with increased structural rearrangements and activity of mobile elements. Spatial confinement eventually leads to a reduced and more stable genome140,143-145. The reduced capacity of intracellular residents to undergo genetic recombination and the repeated bottlenecks they experience in transmission to other hosts result in a small effective population size (Ne)138. A small Ne hastens the fixation of newly emerging alleles (arising from mutation) regardless of their impact on the fitness of the residents; the subsequent accumulation of deleterious mutations is known as Muller’s ratchet146. The underlying driver of Muller’s ratchet, genetic drift147, is reinforced by the relaxation of selective pressure on genes in the resident that encode functions redundant or neutral for the host148.
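The effect of a small Ne on the fate of deleterious alleles can be sketched with Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below assumes a haploid population in which census and effective sizes are equal (initial frequency 1/Ne) and a constant selection coefficient s; the function name and the parameter values are hypothetical.

```python
import math


def fixation_probability(ne, s):
    """Fixation probability of a new mutation (diffusion approximation).

    Assumes a haploid population of effective size `ne` with the mutation
    starting as a single copy (frequency 1/ne) and selection coefficient `s`
    (negative for deleterious alleles).
    """
    if s == 0:
        return 1.0 / ne  # neutral expectation
    exponent = -2.0 * ne * s
    if exponent > 700.0:  # exp() would overflow; probability is effectively zero
        return 0.0
    # (1 - exp(-2s)) / (1 - exp(-2*ne*s)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


# Hypothetical example: the same mildly deleterious allele in a small
# and a large population, relative to the neutral expectation 1/ne.
for ne in (20, 1000):
    p = fixation_probability(ne, s=-0.01)
    print(ne, p, p * ne)
```

Under these assumptions, the deleterious allele fixes almost as readily as a neutral one in the small population, whereas fixation is vanishingly unlikely in the large one, which is the behaviour that underlies Muller's ratchet.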

Accumulation of deleterious mutations often results in genes with lost function (i.e. pseudogenes) that are prone to differential removal from the genome (i.e. deletion bias). Deletions can implicate one or a few bases or one or more genes. Together with the accumulation of substitutions in coding sequences, deletion biases lead to gene loss in resident genomes149,150. In diverse intracellular bacteria and some intracellular eukaryotes, genes encoding DNA repair functions are lost143,151, contributing to the further degradation of their genomes. Accelerated mutation rate and reduced DNA repair capacity make underlying mutational biases evident. In coding regions of resident genomes, neutral accumulation of mutations is reflected in reduced preference of codons used138.

Genome-size reduction can be driven by selective advantages of small genomes, deletion of mutated DNA, and/or a reduced chance of incorporating exogenous DNA138,149,150. Potential selective advantages of smaller genomes include reduced metabolic costs for the maintenance and replication of DNA and faster DNA replication (and thus shorter life cycles)138,152. Lateral genetic transfer (LGT), the major force counteracting genome reduction in bacteria, is limited in the restrictive intracellular environment139,149. However, host cells infected with multiple residents simultaneously open the possibility of extending the gene inventories of these residents. This proposed ‘intracellular arena’ hypothesis153 was supported by previous studies of Wolbachia endosymbionts154 and microsporidian parasites155.

2.1.7.2 Free-living species

At one end of the spectrum, some Symbiodiniaceae species have not been found to be associated with a host. These free-living taxa include some species in the Symbiodinium genus (e.g. the type species S. natans and S. pilosum), the exclusively free-living Effrenium genus (Clade E), and Fugacium (Clade F)2,4,22.

A free-living lifestyle presents opportunities for the exchange of genetic material (e.g. recombination via sexual reproduction) with conspecifics, facilitates LGT due to exposure to other organisms, and avoids the bottlenecks of transmission between hosts, all of which counteract the effects of confinement to the intracellular space140,147 (Fig. 2.5A). Fluctuating environmental conditions and the access to different habitat types often require a broader gene repertoire and increased selective pressure for maintaining a range of metabolic functions139. A recent study demonstrates that conserved lineage-specific genes of unknown function in dinoflagellates might play a role in niche specialisation156. We expect these features to be common in the genomes of free-living Symbiodiniaceae as well as in other dinoflagellates.

Although the genome sequence of the free-living Fugacium kawagutii89,91 is available, the rates of recombination and LGT have not been systematically assessed. High genetic diversity and a near-complete meiotic gene set have been reported in symbiodiniacean genomes91,157, suggesting the capacity for sexual reproduction7. However, direct observations of sexual reproduction in these taxa have not been possible; thus, the recombination rate cannot be determined. Likewise, assessing LGT in symbiodiniacean genomes is challenging because of the acquisition of exogenous genes from multiple endosymbiotic events involving prokaryote and eukaryote sources, as demonstrated by the complex history of plastid origin in dinoflagellates158,159. Fragmented genome assemblies derived from short-read sequence data limit our capacity to corroborate the origin of sequences sharing substantial similarity with bacterial and viral genomes, which have thus far largely been regarded as contaminants90,91. This challenge can be overcome by incorporating long-read sequence data in genome assembly.

Genomes from other species in the Order Suessiales, including the free-living Polarella glacialis and the symbiotic Pelagodinium béii, represent important outgroup references to understand the evolutionary transition from a free-living to a symbiotic lifestyle and the origin of Symbiodiniaceae. Comparative genomics of organisms in these taxa will reveal structural (e.g. shared synteny and interspersed repeat landscapes) and functional features (e.g. gene content, gene duplication, metabolic pathways) that are unique to Symbiodiniaceae.

2.1.7.3 Facultative symbionts

Most Symbiodiniaceae are symbiotic, representing a broad spectrum of symbiotic associations and a range of host specificity. Their genomes would have experienced the phase of genome instability during the early transition stages to an intracellular lifestyle (and symbiosis). Genomes during this phase may be larger than those of well-established residents and have accumulated extensive structural rearrangements, mobile elements, and pseudogenes (Fig. 2.5B), as observed in other facultative and recently established residents140,143-145,160.

As facultative symbionts, ex hospite stages (cells outside a host) are common in the life cycle of these species. Corals are known to adjust symbiont density regularly by expelling Symbiodiniaceae to the external environment161. On expulsion, the viable symbiodiniacean cells may reproduce sexually with conspecifics ex hospite in a cell-dense environment, boosting the recombination rate that may be even higher than that of their free-living relatives. However, the low viability of these ex hospite cells162 argues against this notion. These competing hypotheses remain to be systematically tested.

Cladocopium goreaui is a host generalist. Reported from >150 coral species in Australia’s Great Barrier Reef163, C. goreaui is largely horizontally transmitted. C. goreaui shows the highest level of genome-fragment duplication (implicating ~15.3% of its genes) compared with Symbiodinium microadriaticum, Breviolum minutum and Fugacium kawagutii (implicating <6% of the genes of each)87,90,91. S. microadriaticum and B. minutum generally exhibit a narrower host range164,165 than C. goreaui, while F. kawagutii is free-living2.

Only modest synteny is shared among symbiodiniacean genomes of different genera, suggesting a high extent of structural rearrangements87,90,91. Some structural rearrangements can be attributed to the activity of transposable elements89. A recent study91 revealed an ancient burst of mobile elements in the genomes of all four analysed species of distinct symbiodiniacean genera, including a member of the most-basal lineage Symbiodinium, S. microadriaticum. These data, albeit limited, indicate that the burst of mobile element activity is likely to have predated the radiation of Symbiodiniaceae and may be associated with the early evolutionary transition to intracellularity of these lineages. However, most interspersed repetitive elements in these genomes remain uncharacterized. Conservation of these elements within the symbiodiniacean lineages may elucidate their roles in the transition to intracellularity.

Whether these observed features are a consequence of a facultative symbiotic lifestyle in Symbiodiniaceae remains an open question. All published genome assemblies87-91, derived mostly from short-read sequence data, are highly fragmented. The implementation of emerging genomic technologies across isolates from the same genus (e.g. Cladocopium spp.) will enable researchers to address this question more effectively. Specifically, long-read (~20–50 kbp) sequence data can span larger indels and resolve repetitive genomic regions more effectively than short-read data (typically 100–150 bp in length). In addition, duplication and translocation of genome regions can be better characterised by optical mapping of long (>250 kbp) DNA molecules and by genome phasing166,167.

2.1.7.4 Obligate symbionts

At the other end of the spectrum, some Symbiodiniaceae may be obligate symbionts. These taxa are rarely, if at all, found in the environment or reported in culture. However, one cannot dismiss that brief ex hospite stages may still occur due to regular adjustments of symbiont density by the hosts. In the scenario of strict obligate symbionts (Fig. 2.5C), genomes are expected to follow the evolutionary trajectory postulated in the resident genome syndrome.

The Ne measures the impact of genetic drift on the evolution of a lineage. Despite the existence of a theoretical framework in population genetics for vertically transmitted symbionts168 and access to a range of methods to estimate Ne from genetic data169,170, estimates of Ne for populations of Symbiodiniaceae are currently lacking. Nonetheless, small Ne (together with low gene flow, dominance of clonal reproduction, and local adaptation) has been inferred to result in genetic differentiation among symbiont populations inhabiting the same coral species in different reef locations162. Population studies can also contribute to our understanding (inter alia) of the rate of sexual versus asexual reproduction in Symbiodiniaceae, the adaptation potential of reefs based on the diversity of symbiont genotypes available both inside and outside the coral hosts and the functional role of those genotypes, the differences in evolutionary patterns between different types (e.g. host specialists versus host generalists), and the delimitation of cryptic species162,171,172. I look forward to future population-genetic/genomic studies of different symbiodiniacean ecotypes, particularly in exploring bottleneck effects during the between-host transmission on population dynamics, similar to earlier studies of prokaryotic residents173,174.

Muller’s ratchet can lead to loss of phylogenetic signal as homologous sequences become more dissimilar. In microsporidia, a specialised group of intracellular parasitic fungi, accelerated mutation rate in protein-coding sequences is uncoupled from genome architecture142 and complicates their positioning in phylogenetic trees175. In light of the high sequence divergence within Symbiodiniaceae176, these lineages should be scrutinised further to identify pseudogenes and other factors potentially contributing to the extensive differences observed in gene families92.

Across a wide range of intracellular bacteria and some intracellular eukaryotes, genes encoding functions in DNA repair are lost143,151, contributing to the further degradation of their genomes. Conversely, functions associated with nucleotide-excision DNA repair are enriched in the core genes of Symbiodiniaceae. This observation may reflect adaptation of Symbiodiniaceae to high UV environments92 and reveal a mechanism that counteracts genome reduction caused by spatial confinement.

Mutation bias (towards high A+T) is well known in the genomes of intracellular bacteria177,178 but is less evident in intracellular eukaryotes. Nonetheless, the highest A+T and G+C contents in eukaryote genomes (to my knowledge) are observed in Plasmodium falciparum and Chlorella variabilis NC64A, respectively; both are known to occur as intracellular residents160,179. Base composition also varies substantially among different regions of these genomes. In the C. variabilis genome, the regions with the lowest G+C content are repeat poor and contain genes with shorter introns and exons, and lower codon-usage bias; G+C content in this green alga also correlates with gene expression and intron size160. In the nucleomorph genome of a cryptomonad (Guillardia theta)180, G+C content varies from 46% in terminal repeats to 35% in tRNAs and plastid genes, and 23% in housekeeping genes. These examples suggest that the impact of confinement to the intracellular space on genome evolution of eukaryotes also varies from one type of genomic region to another, depending in part on the potential of these regions to vary in base composition while maintaining function. Symbiodiniacean genomes do not display substantial mutational biases; rather, their G+C content resembles that of other dinoflagellates114 both globally (between 43.0% and 51.5%) and in the protein-coding regions (between 50.4% and 58.6%). Reduced codon-usage preferences in symbiodiniacean genomes compared with other dinoflagellates may reflect genetic drift acting on coding sequences, although a higher G+C content in the third codon positions opposes this notion92,113. In addition, smaller genome sizes in Symbiodiniaceae relative to those of other dinoflagellates are probably not solely due to their endosymbiotic lifestyle, because the trend of genome reduction is observed in earlier-diverging free-living lineages (Fig. 2.6)98,181.
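The base-composition measures discussed above are simple to compute. The following minimal sketch calculates the overall G+C content of a sequence and the G+C content at third codon positions (GC3) of an in-frame coding sequence; the function names and the toy sequence are hypothetical, and ambiguous bases (e.g. N) are ignored by assumption.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def gc3_content(cds):
    """G+C fraction at third codon positions of an in-frame CDS."""
    third_positions = cds[2::3]  # every third base, starting at codon position 3
    return gc_content(third_positions)


# Toy coding sequence for illustration only (codons: ATG GCC GCG TTC TAA).
toy_cds = "ATGGCCGCGTTCTAA"
print(f"G+C = {gc_content(toy_cds):.3f}, GC3 = {gc3_content(toy_cds):.3f}")
```

Comparing the two values across genes is one simple way to check whether third codon positions are enriched in G+C relative to the sequence as a whole.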

Fig. 2.6 Estimated genome sizes of dinoflagellates Estimated genome sizes of dinoflagellates, shown on the dinoflagellate phylogeny156. Genome sizes estimated based on sequencing data are marked with an asterisk; all other estimates are based on 4′,6-diamidino-2-phenylindole (DAPI) fluorescence staining98.

2.2. Methods in genomics

Advances in sequencing technologies are enabling the generation of genome sequence data at a higher throughput and lower cost182. The capacity of including multi-genome data in comparative analyses of ecologically important taxa has revolutionised the field of evolutionary biology. For instance, comparative genomic analysis can reveal evolution of gene families linking to adaptation, whereas surveys of genomic diversity can uncover molecular mechanisms that underpin the evolution of genotype (and phenotype), and the diversification of the taxa of interest to distinct ecological niches and even speciation183. In this section, I review some of the commonly used methods in genomics that are relevant to this project.

2.2.1. Genome assembly

Whole-genome assembly is the reconstruction of the DNA sequence or sequences contained in the complete genome of any organism184. There are two major genome assembly approaches: comparative and de novo. Comparative genome assembly (also known as reference-guided assembly) is carried out when there is a reference genome from the same or a closely related species. In this approach, sequences from the genome of interest are mapped against a reference that guides the assembly. In de novo genome assembly, no reference is available to help guide the assembly process (Fig. 2.7)185. While some draft genomes from Symbiodiniaceae are available, they are remarkably divergent91. For instance, <1% of the sequence reads in Fugacium kawagutii mapped to the draft genome assembly of Breviolum minutum89. De novo assembly is therefore necessary to generate symbiodiniacean genome data.

De novo assembly of genomes is computationally intensive and requires extensive memory186. Single and paired reads are the most common types of high-throughput sequencing data. Single reads are short sequences from single ends of DNA fragments. Overlapping single reads are assembled into longer, contiguous sequences (contigs). Paired reads consist of reads from each of the 5′- and 3′-ends of the sequenced DNA fragments. The benefit of using paired-read data in genome assembly is that the known distance between the two reads of a pair (as deduced from the fragment length) can be used to join or link contigs together into longer sequences (scaffolds), and to help in resolving repetitive regions186.
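The idea of joining overlapping reads into a contig can be sketched as follows. This is a minimal Python illustration, assuming error-free reads and exact suffix-prefix overlaps; the function names and the minimum overlap length are hypothetical, and real assemblers tolerate sequencing errors and handle millions of reads.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for n in range(max_len, min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0  # no overlap of at least min_len bases


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads into one contig if they overlap; otherwise return None."""
    n = overlap_length(left, right, min_len)
    if n == 0:
        return None
    return left + right[n:]


if __name__ == "__main__":
    # Hypothetical toy reads sharing the bases TGCA.
    print(merge_reads("ACGTTGCA", "TGCAGGT"))
```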


Fig. 2.7 De novo genome assembly. Overview of the de novo genome assembly procedure. From Baker (2012)187.

A typical de novo genome assembly pipeline (Fig. 2.8) starts with a pre-processing step, in which low-quality and erroneous reads are either corrected or discarded185. FastQC (bioinformatics.babraham.ac.uk/projects/fastqc) is commonly used to assess the quality of the sequence reads, and Trimmomatic188 is used to filter out adapter sequences and/or trim reads with low quality. A graph model (e.g. de Bruijn graph189) is then constructed, in which the short reads are organised based on the overlaps. Next, the graph is simplified by reducing the nodes (i.e. the reads) and edges (i.e. overlaps), and removing errors (the so-called “bubbles”). A post-processing filtering follows, in which contigs are built and misassembled sequences are detected. In this step, the paired reads are used to extend linked contigs into scaffolds in the reduced graph. The graph-processing steps are then repeated at least once using the updated graph to further detect misassembled contigs and problematic repeats.
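The graph-construction step can be sketched with a small de Bruijn graph. This is a minimal Python illustration under simplifying assumptions (error-free reads, a single strand, no graph simplification); the function names are hypothetical. In this formulation each node is a (k-1)-mer and each k-mer in a read contributes one edge.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the sorted (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def walk_unbranched(graph: Dict[str, List[str]], start: str) -> str:
    """Spell a contig by following edges while the current node has one successor.

    Simplification: only out-degree is checked, and the walk stops at a revisited node.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    g = de_bruijn_graph(["ACGT", "CGTA"], k=3)  # hypothetical toy reads
    print(g)
    print(walk_unbranched(g, "AC"))
```

Branching nodes, at which the walk stops, correspond to the ambiguities (errors or repeats) that the simplification steps described above aim to resolve.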

Another assembly strategy is the so-called hybrid assembly (Fig. 2.9). In this type of genome assembly, different types of sequence data are used as input, usually combining both short and long reads. Integration of the data can occur at the read level, in which short reads are used to correct the long reads that are generally more error-prone190. Another way to combine the data is to use the long reads, or contigs into which long reads are merged, to bridge the neighbouring short-read-assembled contigs (i.e. in the scaffolding step), to close the gapped regions.

Fig. 2.8 A typical pipeline of de novo genome assembly. G: graph, N: nodes, E: edges, ”: corresponding simplifications. From El-Metwally et al. (2013)185.

An example of a hybrid assembler is MaSuRCA191,192. MaSuRCA integrates the computational efficiency of the de Bruijn graphs and the flexibility, in terms of read length and sequencing errors, of overlap-layout-consensus. The workflow proceeds with an initial assembly of short reads into super-reads by extending the reads as much as possible with exclusively non-conflicting k-mers. No information is lost because all original reads are contained in super-reads and, because many of the initial reads will result in the same super-read, the data get considerably reduced. If mate-pair data are available, MaSuRCA incorporates them at this point with a modified overlap-based assembly algorithm of the CABOG assembler193. The super-reads are then used to create a database of 15-mers, which are in turn employed for alignment between long reads and the super-reads. The long reads serve as reference for merging overlapping super-reads into pre-mega-reads; inconsistently aligning super-reads are discarded. Long-read sequences are used again to merge pre-mega-reads into mega-reads. Artificial mate pairs, named linking pairs, are then created from mega-reads that cannot be merged due to gaps. In the last step, the final assembly is done using mega-reads and linking pairs with the CABOG assembler193.

Fig. 2.9 Process of a hybrid genome assembly. Hybrid assembly workflow displaying the different alternatives to follow for integrating short- and long-read data. From Xiao et al. (2016)194.


2.2.2. Genome annotation

Genome annotation is the computational process by which biological attributes of genomic sequences are determined184, and a wide range of approaches is available. Here, I review the common steps in a typical genome annotation workflow for eukaryote genomes.

The first step is usually the identification and masking of repeats195. These repeats include low-complexity regions (e.g. homopolymeric series of nucleotides) and transposable elements (e.g. viral sequences, long interspersed nuclear elements, short interspersed nuclear elements)196. Repeats can be identified based on shared similarity to known repeat sequences, or on de novo prediction197. An example of a tool implementing search of repeats by sequence similarity is RepeatMasker198, and an example of a de novo repeat identification tool is RepeatModeler199. Once repeats are identified, they are (hard-) masked, i.e. each nucleotide in the repeat is replaced with a placeholder character such as ‘N’ or ‘X’. In the process known as soft-masking, each masked nucleotide is instead represented by its corresponding lower-case letter (i.e. ‘a’, ‘t’, ‘c’, ‘g’)184.
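The difference between hard- and soft-masking can be sketched as follows. This is a minimal Python illustration that assumes repeat coordinates are already known (0-based, end-exclusive); the function name, its parameters and the example coordinates are hypothetical and do not correspond to the interface of any repeat-masking tool.

```python
from typing import Iterable, Tuple


def mask_repeats(seq: str,
                 intervals: Iterable[Tuple[int, int]],
                 soft: bool = False,
                 hard_char: str = "X") -> str:
    """Return `seq` with the given intervals masked.

    Hard-masking replaces each base with `hard_char`;
    soft-masking converts each base to lower case.
    """
    bases = list(seq.upper())
    for start, end in intervals:
        # Clamp coordinates so that out-of-range intervals are handled safely.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else hard_char
    return "".join(bases)


if __name__ == "__main__":
    sequence = "ACGTACGT"   # hypothetical toy sequence
    repeats = [(2, 5)]      # hypothetical repeat coordinates
    print(mask_repeats(sequence, repeats))             # hard-masked
    print(mask_repeats(sequence, repeats, soft=True))  # soft-masked
```

Soft-masking retains the underlying sequence, so downstream tools can still extend alignments or gene models into masked regions, whereas hard-masking discards that information.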

The next step is evidence alignment, in which transcripts (e.g. RNA-Seq data or full-length Iso-Seq transcripts) are mapped to the genome197. Some annotation pipelines already integrate transcript data directly as evidence, such as PASA200 and MAKER201. Other types of evidence include known nucleotide or protein sequences from the species of interest and/or closely related species, when available. The use of protein sequences allows for identification of divergent coding sequences while minimising the complication of codon degeneracy. A reliable source of protein data is UniProt-SwissProt, a database of manually curated proteins. Further ‘polishing’ is sometimes done by manually defining known splice sites and intron-exon boundaries, or by implementing a splice-site-aware program such as Exonerate202.

An alternative (or sometimes parallel) approach to evidence-based annotation is the use of ab initio gene predictors that identify open reading frames in the genome203, commonly based on a set of reference sequences (i.e. the training set). However, these programs present several limitations. First, most gene predictors report only the most likely CDS without the untranslated regions (UTRs) or alternatively spliced transcripts197. To address this issue, some programs (e.g. AUGUSTUS) use a generalized Hidden Markov Model (HMM) to predict not only genes but also transcript isoforms204. Second, the predictions are sensitive to the choice of the reference (training) set, which details genomic features specific to an organism (such as codon usage and intron-exon boundaries) to distinguish coding sequences from non-coding sequences. In the case of non-model organisms, an appropriate training set is often not available. Third, although the precision of ab initio prediction tools at identifying genes can reach almost 100%, their accuracy drops to about 60-70% when detecting intron-exon boundaries197. Evidence-based annotation and ab initio prediction are not mutually exclusive, and can in fact be combined in an evidence-driven workflow.
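The notion of an open reading frame can be sketched with a naive scan. This is a minimal Python illustration and deliberately much simpler than the HMM-based predictors mentioned above: it searches only the forward strand, assumes the standard stop codons, ignores introns, and uses no training set. The function name and the minimum-length parameter are hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs, 0-based and end-exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon;
    `min_codons` counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after this stop codon
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGGCCAAATAAGG"  # hypothetical toy sequence
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

Because such a scan cannot recognise splice sites, it illustrates why intron-exon boundaries are the difficult part of eukaryote gene prediction.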

2.2.3. Comparative genomics

Comparative genomics is a field in biology focused on the computational comparison of genome-scale data from distinct organisms to identify similarities and differences that may be biologically and evolutionarily relevant205. Comparative genomics comprises three main foci: genome structure, coding sequences and non-coding regions206. Intuitively, a genome comparison begins with whole-genome alignment. This step is a major challenge due not only to the large sizes of the data, but also to long insertions and deletions, large-scale genomic rearrangements, gene duplications and repetitive elements that require fast, robust and efficient algorithms207. Most algorithms start by identifying large conserved regions pairwise, and then extend the alignment (seed-and-extend approach) to include other regions208. See Armstrong et al. (2019)209 for a review of whole-genome alignment tools.
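The seed-and-extend idea can be sketched as follows. This is a minimal Python illustration under strong simplifications: seeds are exact k-mer matches and extension is ungapped and stops at the first mismatch, whereas whole-genome aligners use scoring schemes that tolerate mismatches and gaps. The function name, the seed length and the example sequences are hypothetical.

```python
from typing import List, Tuple


def seed_and_extend(a: str, b: str, k: int = 4) -> List[Tuple[int, int, int]]:
    """Return matches as (start_in_a, start_in_b, length), longest first."""
    # Index every k-mer of b by its start positions (the seeds).
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)

    hits = set()
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            # Extend the seed to the left while the bases agree.
            s_a, s_b = i, j
            while s_a > 0 and s_b > 0 and a[s_a - 1] == b[s_b - 1]:
                s_a -= 1
                s_b -= 1
            # Extend the seed to the right while the bases agree.
            e_a, e_b = i + k, j + k
            while e_a < len(a) and e_b < len(b) and a[e_a] == b[e_b]:
                e_a += 1
                e_b += 1
            hits.add((s_a, s_b, e_a - s_a))  # seeds on one diagonal collapse
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


if __name__ == "__main__":
    # Hypothetical toy sequences sharing the region ACGTACGG.
    for hit in seed_and_extend("TTTACGTACGGAAA", "CCACGTACGGCC"):
        print(hit)
```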

Genome-structure comparison targets features such as nucleotide composition, conserved synteny and gene organisation that inform genome evolution and uncover exclusive attributes of individual genomes206. These features include metrics of genome size, proportion of nucleotides, overall G+C content, as well as the usage biases of codons and amino acids. Synteny analyses are based on the identification of long conserved sequences defined by a set of parameters, such as length of the conserved region, percentage of sequence identity, fraction of the genome(s) found in syntenic regions, distribution of syntenic regions along the genome(s) and content of repetitive elements. Tools like SyMAP210 perform pairwise comparisons of genomic regions, identify syntenic blocks, and allow for visualisation of these blocks in multiple formats. Gene-order conservation is of particular interest since it relates to evolutionary distance. Programs such as MCScanX211 allow for detection of genes in collinear syntenic blocks within and between genomes, facilitating the study of groups of homologous genes between genomes and segmental duplications within a genome.

Comparison of coding sequences (CDS) relies on the quality and consistency of genome annotation212 (see above). The most representative metric of this type of analysis is the total number of genes. Other informative statistics include proportion of the genome that codes for genes, coding regions distribution (i.e. gene density), average gene length, number and average lengths of exons (and of introns), and codon usage206. Comparison of CDS can be extended to other levels, such as comparative proteomics and analysis of gene families, which can be addressed from a functional approach213,214 (for instance, by assigning them Gene Ontology215 terms) or by shared sequence similarity (linking them to KEGG ortholog groups216 or inferring ortholog groups with tools like OrthoFinder217).
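Some of the summary statistics listed above can be sketched as follows. This is a minimal Python illustration that assumes each gene model is given as a list of non-overlapping exon coordinates (0-based, end-exclusive) on a genome of known length; the function name, the dictionary keys and the example gene models are hypothetical, and a real analysis would parse these coordinates from an annotation file.

```python
from statistics import mean
from typing import Dict, List, Tuple

Exon = Tuple[int, int]


def annotation_summary(genes: List[List[Exon]], genome_length: int) -> Dict[str, float]:
    """Summarise gene models; each gene is a list of (start, end) exons."""
    # Gene length spans from the first exon start to the last exon end.
    gene_lengths = [max(e for _, e in exons) - min(s for s, _ in exons)
                    for exons in genes]
    exon_lengths = [e - s for exons in genes for s, e in exons]
    return {
        "gene_count": len(genes),
        "mean_gene_length": mean(gene_lengths),
        "mean_exons_per_gene": mean(len(exons) for exons in genes),
        "mean_exon_length": mean(exon_lengths),
        "exonic_fraction": sum(exon_lengths) / genome_length,
        "genes_per_kb": len(genes) / (genome_length / 1000),
    }


if __name__ == "__main__":
    toy_genes = [[(0, 100), (200, 300)], [(1000, 1300)]]  # hypothetical models
    for name, value in annotation_summary(toy_genes, genome_length=2000).items():
        print(name, value)
```

Such statistics are only comparable across genomes when the underlying annotations were produced consistently, as noted above.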

Non-coding sequences are also commonly analysed when comparing genomes because they harbour sites involved in the regulation of transcription, DNA replication and chromosome structure. Additionally, non-coding DNA can be informative from an evolutionary perspective218. Identification of conserved non-coding regions across genomes from different organisms is driven by the hypothesis that selective pressure slows their evolutionary rates206. Finding highly conserved non-coding DNA sequences can thus be applied to the identification of regulatory elements219.

2.3. Concluding remarks

In this chapter, I have reviewed the state-of-the-art of Symbiodiniaceae genomics, and how common methods of comparative genomics can be utilised to study genome evolution of these ecologically important taxa. In the following chapters, I adopt some of these methods to compare and analyse available and newly generated genome and transcriptome data from Symbiodiniaceae, specifically to address the aims outlined in this thesis (Chapter 1, Section 1.2).

2.4. References

1 González-Pech, R. A., Bhattacharya, D., Ragan, M. A. & Chan, C. X. Genome evolution of coral reef symbionts as intracellular residents. Trends Ecol. Evol. (2019).
2 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral endosymbionts. Curr. Biol. 28, 2570-2580 (2018).
3 Maddison, D. R., Schulz, K.-S. & Maddison, W. P. The tree of life web project. Zootaxa 1668 (2007).
4 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).
5 Gómez, F. A list of free-living dinoflagellate species in the world’s oceans. Acta Bot. Croat. 64, 129-212 (2005).
6 LaJeunesse, T., Parkinson, J. E. & Trench, R. K. Morphological description of the genus Symbiodinium, (2012).
7 Fitt, W. K. & Trench, R. The relation of diel patterns of cell division to diel patterns of motility in the symbiotic dinoflagellate Symbiodinium microadriaticum Freudenthal in culture. New Phytol. 94, 421-432 (1983).
8 Carlos, A. A., Baillie, B. K., Kawachi, M. & Maruyama, T. Phylogenetic position of Symbiodinium (Dinophyceae) isolates from tridacnids (Bivalvia), cardiids (Bivalvia), a sponge (Porifera), a soft coral (Anthozoa), and a free‐living strain. J. Phycol. 35, 1054-1062 (1999).
9 Manning, M. M. & Gates, R. D. Diversity in populations of free-living Symbiodinium from a Caribbean and Pacific reef. Limnol. Oceanogr. 53, 1853 (2008).
10 Hirose, M., Reimer, J. D., Hidaka, M. & Suda, S. Phylogenetic analyses of potentially free-living Symbiodinium spp. isolated from coral reef sand in Okinawa, Japan. Mar. Biol. 155, 105-112 (2008).
11 Gou, W. et al. Phylogenetic analysis of a free-living strain of Symbiodinium isolated from Jiaozhou Bay, PR China. J. Exp. Mar. Biol. Ecol. 296, 135-144 (2003).
12 Pochon, X. et al. Comparison of endosymbiotic and free-living Symbiodinium (Dinophyceae) diversity in a Hawaiian reef environment. J. Phycol. 46, 53-65 (2010).
13 Decelle, J. et al. Worldwide occurrence and activity of the reef-building coral symbiont Symbiodinium in the open ocean. Curr. Biol. 28, 3625-3633.e3623 (2018).
14 Rodriguez-Lanetty, M., Loh, W., Carter, D. & Hoegh-Guldberg, O. Latitudinal variability in symbiont specificity within the widespread scleractinian coral Plesiastrea versipora. Mar. Biol. 138, 1175 (2001).
15 Quigley, K., Bay, L. K. & Willis, B. Temperature and water quality-related patterns in sediment-associated Symbiodinium communities impact symbiont uptake and fitness of juveniles in the genus Acropora. Front. Mar. Sci. 4, 401 (2017).
16 Littman, R. A., van Oppen, M. J. & Willis, B. L. Methods for sampling free-living Symbiodinium (zooxanthellae) and their distribution and abundance at Lizard Island (Great Barrier Reef). J. Exp. Mar. Biol. Ecol. 364, 48-53 (2008).
17 Coffroth, M. A., Lewis, C. F., Santos, S. R. & Weaver, J. L. Environmental populations of symbiotic dinoflagellates in the genus Symbiodinium can initiate symbioses with reef cnidarians. Curr. Biol. 16, R985-R987 (2006).
18 Yamashita, H. & Koike, K. Genetic identity of free‐living Symbiodinium obtained over a broad latitudinal range in the Japanese coast. Phycol. Res. 61, 68-80 (2013).
19 Pochon, X., Putnam, H. M., Burki, F. & Gates, R. D. Identifying and characterizing alternative molecular markers for the symbiotic and free-living dinoflagellate genus Symbiodinium. PLoS ONE 7, e29816 (2012).
20 Hansen, G. & Daugbjerg, N. Symbiodinium natans sp. nov.: a "free-living" dinoflagellate from Tenerife (Northeast-Atlantic Ocean). J. Phycol. 45, 251-263 (2009).


21 Takabayashi, M., Adams, L., Pochon, X. & Gates, R. Genetic diversity of free-living Symbiodinium in surface water and sediment of Hawai‘i and Florida. Coral Reefs 31, 157-167 (2012).
22 Granados-Cifuentes, C., Neigel, J., Leberg, P. & Rodriguez-Lanetty, M. Genetic diversity of free-living Symbiodinium in the Caribbean: the importance of habitats and seasons. Coral Reefs 34, 927-939 (2015).
23 Jeong, H. J. et al. Genetics and morphology characterize the dinoflagellate Symbiodinium voratum, n. sp., (Dinophyceae) as the sole representative of Symbiodinium clade E. J. Eukaryot. Microbiol. 61, 75-94 (2014).
24 Stat, M., Morris, E. & Gates, R. D. Functional diversity in coral–dinoflagellate symbiosis. Proc. Natl. Acad. Sci. U. S. A. 105, 9256-9261 (2008).
25 Lesser, M., Stat, M. & Gates, R. The endosymbiotic dinoflagellates (Symbiodinium sp.) of corals are parasites and mutualists. Coral Reefs 32, 603-611 (2013).
26 Sachs, J. L. & Wilcox, T. P. A shift to parasitism in the jellyfish symbiont Symbiodinium microadriaticum. Proc. R. Soc. Lond. B Biol. Sci. 273, 425-429 (2006).
27 Fang, J. K. H., Schönberg, C. H. L., Hoegh-Guldberg, O. & Dove, S. Symbiotic plasticity of Symbiodinium in a common excavating sponge. Mar. Biol. 164, 104 (2017).
28 Morris, L. A., Voolstra, C. R., Quigley, K. M., Bourne, D. G. & Bay, L. K. Nutrient availability and metabolism affect the stability of coral–Symbiodiniaceae symbioses. Trends Microbiol. 27, 678-689 (2019).
29 Baker, D. M., Freeman, C. J., Wong, J. C. Y., Fogel, M. L. & Knowlton, N. Climate change promotes parasitism in a coral symbiosis. ISME J. 12, 921-930 (2018).
30 LaJeunesse, T. C., Lee, S. Y., Gil-Agudelo, D. L., Knowlton, N. & Jeong, H. J. Symbiodinium necroappetens sp. nov. (Dinophyceae): an opportunist ‘zooxanthella’ found in bleached and diseased tissues of Caribbean reef corals. Eur. J. Phycol. 50, 223-238 (2015).
31 Iglesias-Prieto, R. & Trench, R. K. Acclimation and adaptation to irradiance in symbiotic dinoflagellates. I. Responses of the photosynthetic unit to changes in photon flux density. Mar. Ecol. Prog. Ser. 113, 163-175 (1994).
32 Carreto, J., Carignan, M., Daleo, G. & De Marco, S. Occurrence of mycosporine-like amino acids in the red-tide dinoflagellate Alexandrium excavatum: UV-photoprotective compounds? J. Plankton Res. 12, 909-921 (1990).
33 Banaszak, A. T., LaJeunesse, T. C. & Trench, R. K. The synthesis of mycosporine-like amino acids (MAAs) by cultured, symbiotic dinoflagellates. J. Exp. Mar. Biol. Ecol. 249, 219-233 (2000).


34 LaJeunesse, T. Diversity and community structure of symbiotic dinoflagellates from Caribbean coral reefs. Mar. Biol. 141, 387-400 (2002).
35 Finney, J. C. et al. The relative significance of host–habitat, depth, and geography on the ecology, endemism, and speciation of coral endosymbionts in the genus Symbiodinium. Microb. Ecol. 60, 250-263 (2010).
36 Rowan, R. & Knowlton, N. Intraspecific diversity and ecological zonation in coral-algal symbiosis. Proc. Natl. Acad. Sci. U. S. A. 92, 2850-2853 (1995).
37 Toller, W. W., Rowan, R. & Knowlton, N. Zooxanthellae of the Montastraea annularis species complex: patterns of distribution of four taxa of Symbiodinium on different reefs and across depths. Biol. Bull. 201, 348-359 (2001).
38 Fabina, N. S., Putnam, H. M., Franklin, E. C., Stat, M. & Gates, R. D. Transmission mode predicts specificity and interaction patterns in coral-Symbiodinium networks. PLoS ONE 7, e44970 (2012).
39 Stat, M. & Gates, R. D. Clade D Symbiodinium in scleractinian corals: a “nugget” of hope, a selfish opportunist, an ominous sign, or all of the above? J. Mar. Biol. 2011 (2010).
40 van Oppen, M. J., Palstra, F. P., Piquet, A. M.-T. & Miller, D. J. Patterns of coral–dinoflagellate associations in Acropora: significance of local availability and physiology of Symbiodinium strains and host–symbiont selectivity. Proc. R. Soc. Lond. B Biol. Sci. 268, 1759-1767 (2001).
41 LaJeunesse, T. C. et al. Low symbiont diversity in southern Great Barrier Reef corals, relative to those of the Caribbean. Limnol. Oceanogr. 48, 2046-2054 (2003).
42 Rowan, R. Coral bleaching: thermal adaptation in reef coral symbionts. Nature 430, 742-742 (2004).
43 Baker, A. C., Starger, C. J., McClanahan, T. R. & Glynn, P. W. Coral reefs: corals' adaptive response to climate change. Nature 430, 741-741 (2004).
44 Hume, B. C. et al. Symbiodinium thermophilum sp. nov., a thermotolerant symbiotic alga prevalent in corals of the world's hottest sea, the Persian/Arabian Gulf. Sci. Rep. 5, 8562 (2015).
45 D'Angelo, C. et al. Local adaptation constrains the distribution potential of heat-tolerant Symbiodinium from the Persian/Arabian Gulf. ISME J. (2015).
46 Gegner, H. M. et al. High salinity conveys thermotolerance in the coral model Aiptasia. Biology Open 6, 1943-1948 (2017).


47 Berkelmans, R. & van Oppen, M. J. The role of zooxanthellae in the thermal tolerance of corals: a ‘nugget of hope’ for coral reefs in an era of climate change. Proc. R. Soc. Lond. B Biol. Sci. 273, 2305-2312 (2006).
48 Howells, E. et al. Coral thermal tolerance shaped by local adaptation of photosymbionts. Nat. Clim. Change 2, 116-120 (2012).
49 Chen, C. A., Wang, J.-T., Fang, L.-S. & Yang, Y.-W. Fluctuating algal symbiont communities in Acropora palifera (Scleractinia: Acroporidae) from Taiwan. Mar. Ecol. Prog. Ser. 295, 113-121 (2005).
50 Baillie, B., Monje, V., Silvestre, V., Sison, M. & Belda-Baillie, C. Allozyme electrophoresis as a tool for distinguishing different zooxanthellae symbiotic with giant clams. Proc. R. Soc. Lond. B Biol. Sci. 265, 1949-1956 (1998).
51 Baillie, B. et al. Genetic variation in Symbiodinium isolates from giant clams based on random-amplified-polymorphic DNA (RAPD) patterns. Mar. Biol. 136, 829-836 (2000).
52 Santos, S., Shearer, T., Hannes, A. & Coffroth, M. Fine‐scale diversity and specificity in the most prevalent lineage of symbiotic dinoflagellates (Symbiodinium, Dinophyceae) of the Caribbean. Mol. Ecol. 13, 459-469 (2004).
53 Santos, S. R., Taylor, D. J., Kinzie III, R. A., Sakaj, K. & Coffroth, M. A. Evolution of length variation and heteroplasmy in the chloroplast rDNA of symbiotic dinoflagellates (Symbiodinium, Dinophyta) and a novel insertion in the universal core region of the large subunit rDNA. Phycologia 41, 311-318 (2002).
54 Santos, S., Gutierrez-Rodriguez, C., Lasker, H. & Coffroth, M. Symbiodinium sp. associations in the gorgonian Pseudopterogorgia elisabethae in the Bahamas: high levels of genetic variability and population structure in symbiotic dinoflagellates. Mar. Biol. 143, 111-120 (2003).
55 Takishita, K., Ishikura, M., Koike, K. & Maruyama, T. Comparison of phylogenies based on nuclear-encoded SSU rDNA and plastid-encoded psbA in the symbiotic dinoflagellate genus Symbiodinium. Phycologia 42, 285-291 (2003).
56 Takabayashi, M., Santos, S. R. & Cook, C. B. Mitochondrial DNA phylogeny of the symbiotic dinoflagellates (Symbiodinium, Dinophyta). J. Phycol. 40, 160-164 (2004).
57 Rowan, R. & Powers, D. A. Molecular genetic identification of symbiotic dinoflagellates (zooxanthellae). Mar. Ecol. Prog. Ser. 71, 65-73 (1991).
58 LaJeunesse, T. C. Investigating the biodiversity, ecology, and phylogeny of endosymbiotic dinoflagellates in the genus Symbiodinium using the ITS region: in search of a “species” level marker. J. Phycol. 37, 866-880 (2001).


59 Rodriguez-Lanetty, M., Wood-Charlson, E. M., Hollingsworth, L. L., Krupp, D. A. & Weis, V. M. Temporal and spatial infection dynamics indicate recognition events in the early hours of a dinoflagellate/coral symbiosis. Mar. Biol. 149, 713-719 (2006).
60 Rowan, R. & Powers, D. A. A molecular genetic classification of zooxanthellae and the evolution of animal-algal symbioses. Science 251, 1348-1351 (1991).
61 Santos, S. R., Taylor, D. J. & Coffroth, M. A. Genetic comparisons of freshly isolated versus cultured symbiotic dinoflagellates: implications for extrapolating to the intact symbiosis. J. Phycol. 37, 900-912 (2001).
62 Pochon, X., Putnam, H. M. & Gates, R. D. Multi-gene analysis of Symbiodinium dinoflagellates: a perspective on rarity, symbiosis, and evolution. PeerJ 2, e394 (2014).
63 Pochon, X., Montoya-Burgos, J. I., Stadelmann, B. & Pawlowski, J. Molecular phylogeny, evolutionary rates, and divergence timing of the symbiotic dinoflagellate genus Symbiodinium. Mol. Phylogenet. Evol. 38, 20-30 (2006).
64 Pochon, X., Pawlowski, J., Zaninetti, L. & Rowan, R. High genetic diversity and relative specificity among Symbiodinium-like endosymbiotic dinoflagellates in soritid foraminiferans. Mar. Biol. 139, 1069-1078 (2001).
65 Pochon, X., LaJeunesse, T. & Pawlowski, J. Biogeographic partitioning and host specialization among foraminiferan dinoflagellate symbionts (Symbiodinium; Dinophyta). Mar. Biol. 146, 17-27 (2004).
66 Pochon, X. & Gates, R. D. A new Symbiodinium clade (Dinophyceae) from soritid foraminifera in Hawai’i. Mol. Phylogenet. Evol. 56, 492-497 (2010).
67 LaJeunesse, T. & Trench, R. Biogeography of two species of Symbiodinium (Freudenthal) inhabiting the intertidal sea anemone Anthopleura elegantissima (Brandt). Biol. Bull. 199, 126-134 (2000).
68 Rodriguez-Lanetty, M. Evolving lineages of Symbiodinium-like dinoflagellates based on ITS1 rDNA. Mol. Phylogenet. Evol. 28, 152-168 (2003).
69 LaJeunesse, T. C. & Thornhill, D. J. Improved resolution of reef-coral endosymbiont (Symbiodinium) species diversity, ecology, and evolution through psbA non-coding region genotyping. PLoS ONE 6, e29013 (2011).
70 Stat, M., Carter, D. & Hoegh-Guldberg, O. The evolutionary history of Symbiodinium and scleractinian hosts—symbiosis, diversity, and the effect of climate change. Perspect. Plant Ecol. 8, 23-43 (2006).
71 Zachos, J., Pagani, M., Sloan, L., Thomas, E. & Billups, K. Trends, rhythms, and aberrations in global climate 65 Ma to present. Science 292, 686-693 (2001).


72 Pochon, X. & Pawlowski, J. Evolution of the soritids-Symbiodinium symbiosis. Symbiosis 42, 77-88 (2006).
73 Bice, K. L., Scotese, C. R., Seidov, D. & Barron, E. J. Quantifying the role of geographic change in Cenozoic ocean heat transport using uncoupled atmosphere and ocean models. Palaeogeogr. Palaeocl. 161, 295-310 (2000).
74 Simpson, C. Evolution: serving up light. Curr. Biol. 28, R873-R875 (2018).
75 Joy, J. B. Symbiosis catalyses niche expansion and diversification. Proc. R. Soc. Lond. B Biol. Sci. 280, 20122820 (2013).
76 Smith, J. M. Evolution: generating novelty by symbiosis. Nature 341, 284 (1989).
77 Thompson, J. N. Specific hypotheses on the geographic mosaic of coevolution. Am. Nat. 153, S1-S14 (1999).
78 van Oppen, M., Mieog, J., Sanchez, C. & Fabricius, K. Diversity of algal endosymbionts (zooxanthellae) in octocorals: the roles of geography and host relationships. Mol. Ecol. 14, 2403-2417 (2005).
79 Garcia-Cuetos, L., Pochon, X. & Pawlowski, J. Molecular evidence for host–symbiont specificity in soritid foraminifera. Protist 156, 399-412 (2005).
80 Bongaerts, P. et al. Sharing the slope: depth partitioning of agariciid corals and associated Symbiodinium across shallow and mesophotic habitats (2-60 m) on a Caribbean reef. BMC Evol. Biol. 13, 205 (2013).
81 Thompson, J. N. Interaction and coevolution. (University of Chicago Press, 2014).
82 Quigley, K. M., Warner, P. A., Bay, L. K. & Willis, B. L. Unexpected mixed-mode transmission and moderate genetic regulation of Symbiodinium communities in a brooding coral. Heredity 121, 524-536 (2018).
83 LaJeunesse, T. C. “Species” radiations of symbiotic dinoflagellates in the Atlantic and Indo-Pacific since the Miocene-Pliocene transition. Mol. Biol. Evol. 22, 570-581 (2005).
84 Shinzato, C., Inoue, M. & Kusakabe, M. A snapshot of a coral “holobiont”: a transcriptome assembly of the scleractinian coral, Porites, captures a wide variety of genes from both the host and symbiotic zooxanthellae. PLoS ONE 9, e85182 (2014).
85 Rosic, N. et al. Unfolding the secrets of coral–algal symbiosis. ISME J. 9, 844-856 (2015).
86 Davy, S. K., Allemand, D. & Weis, V. M. Cell biology of cnidarian-dinoflagellate symbiosis. Microbiol. Mol. Biol. Rev. 76, 229-261 (2012).
87 Shoguchi, E. et al. Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr. Biol. 23, 1399-1408 (2013).


88 Shoguchi, E. et al. Two divergent Symbiodinium genomes reveal conservation of a gene cluster for sunscreen biosynthesis and recently lost genes. BMC Genomics 19, 458 (2018).
89 Lin, S. et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science 350, 691-694 (2015).
90 Aranda, M. et al. Genomes of coral dinoflagellate symbionts highlight evolutionary adaptations conducive to a symbiotic lifestyle. Sci. Rep. 6, 39734 (2016).
91 Liu, H. et al. Symbiodinium genomes reveal adaptive evolution of functions related to coral-dinoflagellate symbiosis. Commun. Biol. 1, 95 (2018).
92 González-Pech, R. A., Ragan, M. A. & Chan, C. X. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodinium. Sci. Rep. 7, 15021 (2017).
93 Douglas, A. Host benefit and the evolution of specialization in symbiosis. Heredity 81, 599-603 (1998).
94 Hill, M. & Hill, A. The magnesium inhibition and arrested phagosome hypotheses: new perspectives on the evolution and ecology of Symbiodinium symbioses. Biol. Rev. 87, 804-821 (2012).
95 Hou, Y. & Lin, S. Distinct gene number-genome size relationships for eukaryotes and non-eukaryotes: gene content estimation for dinoflagellate genomes. PLoS ONE 4, e6978 (2009).
96 Veldhuis, M. J., Cucci, T. L. & Sieracki, M. E. Cellular DNA content of marine phytoplankton using two new fluorochromes: taxonomic and ecological implications. J. Phycol. 33, 527-541 (1997).
97 Wisecaver, J. H. & Hackett, J. D. Dinoflagellate genome evolution. Annu. Rev. Microbiol. 65, 369-387 (2011).
98 LaJeunesse, T. C., Lambert, G., Andersen, R. A., Coffroth, M. A. & Galbraith, D. W. Symbiodinium (Pyrrhophyta) genome sizes (DNA content) are smallest among dinoflagellates. J. Phycol. 41, 880-886 (2005).
99 Spector, D. L. Dinoflagellate nuclei. Dinoflagellates, 107-147 (1984).
100 Dodge, J. D. Chromosome numbers in some marine dinoflagellates. Bot. Mar. 5, 121-128 (1963).
101 Holt, J. R. & Pfiester, L. A. A technique for counting chromosomes of armored dinoflagellates, and chromosome numbers of six freshwater dinoflagellate species. Am. J. Bot., 1165-1168 (1982).
102 Udy, J. W., Hinde, R. & Vesk, M. Chromosomes and DNA in Symbiodinium from Australian hosts. J. Phycol. 29, 314-320 (1993).


103 Trench, R. K. & Blank, R. J. Symbiodinium microadriaticum Freudenthal, S. goreauii sp. nov., S. kawagutii sp. nov. and S. pilosum sp. nov.: gymnodinioid dinoflagellate symbionts of marine invertebrates. J. Phycol. 23, 469-481 (1987).
104 Hamkalo, B. A. & Rattner, J. The structure of a mesokaryote chromosome. Chromosoma 60, 39-47 (1977).
105 Herzog, M., Von Boletzky, S. & Soyer, M.-O. Ultrastructural and biochemical nuclear aspects of eukaryote classification: independent evolution of the dinoflagellates as a sister group of the actual eukaryotes? Orig. Life 13, 205-215 (1984).
106 Haapala, O. & Soyer, M. Absence of longitudinal differentiation of dinoflagellate (Prorocentrum micans) chromosomes. Hereditas 78, 141-145 (1974).
107 Alverca, E., Cuadrado, A., Jouve, N., Franca, S. & Moreno Diaz de la Espina, S. Telomeric DNA localization on dinoflagellate chromosomes: structural and evolutionary implications. Cytogenet. Genome Res. 116, 224-231 (2007).
108 Bodansky, S., Mintz, L. B. & Holmes, D. S. The mesokaryote Gyrodinium cohnii lacks nucleosomes. Biochem. Biophys. Res. Commun. 88, 1329-1336 (1979).
109 Rizzo, P. & Nooden, L. Partial characterization of dinoflagellate chromosomal proteins. BBA-Nucleic Acid. Prot. Synt. 349, 415-427 (1974).
110 Chan, Y., Kwok, A., Tsang, J. S. & Wong, J. T. Alveolata histone‐like proteins have different evolutionary origins. J. Evol. Biol. 19, 1717-1721 (2006).
111 Géraud, M. L., Sala‐Rovira, M., Herzog, M. & Soyer‐Gobillard, M. O. Immunocytochemical localization of the DNA‐binding protein HCc during the cell cycle of the histone‐less dinoflagellate protoctista Crypthecodinium cohnii B. Biol. Cell 71, 123-134 (1991).
112 Chan, Y.-H. & Wong, J. T. Concentration-dependent organization of DNA by the dinoflagellate histone-like protein HCc3. Nucleic Acids Res. 35, 2573-2583 (2007).
113 Bayer, T. et al. Symbiodinium transcriptomes: genome insights into the dinoflagellate symbionts of reef-building corals. PLoS ONE 7, e35269 (2012).
114 Lin, S. Genomic understanding of dinoflagellates. Res. Microbiol. 162, 551-569 (2011).
115 Rae, P. M. & Steele, R. E. Modified bases in the DNAs of unicellular eukaryotes: an examination of distributions and possible roles, with emphasis on hydroxymethyluracil in dinoflagellates. Biosystems 10, 37-53 (1978).
116 Rae, P. M. 5-Hydroxymethyluracil in the DNA of a dinoflagellate. Proc. Natl. Acad. Sci. U. S. A. 70, 1141-1145 (1973).
117 Jaeckisch, N. et al. Comparative genomic and transcriptomic characterization of the toxigenic marine dinoflagellate Alexandrium ostenfeldii. PLoS ONE 6, e28012 (2011).

47

118 ten Lohuis, M. R. & Miller, D. J. Hypermethylation at CpG‐motifs in the dinoflagellates carterae (Dinophyceae) and Symbiodinium microadriaticum (Dinophyceae): evidence from restriction analyses, 5‐azacytidine and ethionine treatment. J. Phycol. 34, 152- 159 (1998). 119 de Mendoza, A. et al. Recurrent acquisition of cytosine methyltransferases into eukaryotic retrotransposons. Nat. Commun. 9, 1341 (2018). 120 Guillebault, D. et al. A new class of transcription initiation factors, intermediate between TATA box-binding proteins (TBPs) and TBP-like factors (TLFs), is present in the marine unicellular organism, the dinoflagellate Crypthecodinium cohnii. J. Biol. Chem. 277, 40881- 40886 (2002). 121 Lidie, K. B., Ryan, J. C., Barbier, M. & Van Dolah, F. M. Gene expression in Florida red tide dinoflagellate brevis: analysis of an expressed sequence tag library and development of DNA microarray. Mar. Biotechnol. 7, 481-493 (2005). 122 Moustafa, A. et al. Transcriptome profiling of a toxic dinoflagellate reveals a gene-rich protist and a potential impact on gene expression due to bacterial presence. PLoS ONE 5, e9688 (2010). 123 Erdner, D. L. & Anderson, D. M. Global transcriptional profiling of the toxic dinoflagellate Alexandrium fundyense using massively parallel signature sequencing. BMC Genomics 7, 1 (2006). 124 Lin, S., Zhang, H., Zhuang, Y., Tran, B. & Gill, J. Spliced leader–based metatranscriptomic analyses lead to recognition of hidden genomic features in dinoflagellates. Proc. Natl. Acad. Sci. U. S. A. 107, 20033-20038 (2010). 125 Burger, G., Gray, M. W. & Franz Lang, B. Mitochondrial genomes: anything goes. Trends Genet. 19, 709-716 (2003). 126 Waller, R. F. & Jackson, C. J. Dinoflagellate mitochondrial genomes: stretching the rules of molecular biology. Bioessays 31, 237-245 (2009). 127 Nash, E. A. et al. Organization of the mitochondrial genome in the dinoflagellate Amphidinium carterae. Mol. Biol. Evol. 24, 1528-1536 (2007). 
128 Lin, S., Zhang, H., Spencer, D. F., Norman, J. E. & Gray, M. W. Widespread and extensive editing of mitochondrial mRNAs in dinoflagellates. J. Mol. Biol. 320, 727-739 (2002). 129 Jackson, C. J. et al. Broad genomic and transcriptional analysis reveals a highly derived genome in dinoflagellate mitochondria. BMC Biol. 5, 41 (2007). 130 Chaput, H., Wang, Y. & Morse, D. Polyadenylated transcripts containing random gene fragments are expressed in dinoflagellate mitochondria. Protist 153, 111-122 (2002).

48

131 Nash, E. A., Nisbet, R. E. R., Barbrook, A. C. & Howe, C. J. Dinoflagellates: a mitochondrial genome all at sea. Trends Genet. 24, 328-335 (2008). 132 Howe, C. J., Nisbet, R. E. R. & Barbrook, A. C. The remarkable chloroplast genome of dinoflagellates. J. Exp. Bot. 59, 1035-1045 (2008).

133 Hiller, R. G. ‘Empty’minicircles and petB/atpA and psbD/psbE (cytb 559 α) genes in tandem in Amphidinium carterae plastid DNA1. FEBS letters 505, 449-452 (2001). 134 Zhang, Z., Green, B. & Cavalier-Smith, T. Single gene circles in dinoflagellate chloroplast genomes. Nature 400, 155-159 (1999). 135 Zhang, Z., Cavalier-Smith, T. & Green, B. R. Evolution of dinoflagellate unigenic minicircles and the partially concerted divergence of their putative replicon origins. Mol. Biol. Evol. 19, 489-500 (2002). 136 Dang, Y. & Green, B. R. Long transcripts from dinoflagellate chloroplast minicircles suggest “rolling circle” transcription. J. Biol. Chem. 285, 5196-5203 (2010). 137 Barbrook, A. C., Voolstra, C. R. & Howe, C. J. The chloroplast genome of a Symbiodinium sp. clade C3 isolate. Protist 165, 1-13 (2014). 138 Andersson, S. G. & Kurland, C. G. Reductive evolution of resident genomes. Trends Microbiol. 6, 263-268 (1998). 139 Moran, N. A. & Wernegreen, J. J. Lifestyle evolution in symbiotic bacteria: insights from genomics. Trends Ecol. Evol. 15, 321-326 (2000). 140 Moran, N. A. & Plague, G. R. Genomic changes following host restriction in bacteria. Curr. Opin. Genet. Dev. 14, 627-633 (2004). 141 Tamas, I. et al. 50 million years of genomic stasis in endosymbiotic bacteria. Science 296, 2376-2379 (2002). 142 Slamovits, C. H., Fast, N. M., Law, J. S. & Keeling, P. J. Genome compaction and stability in microsporidian intracellular parasites. Curr. Biol. 14, 891-896 (2004). 143 Wernegreen, J. J. For better or worse: genomic consequences of intracellular mutualism and parasitism. Curr. Opin. Genet. Dev. 15, 572-583 (2005). 144 Gil, R., Latorre, A. & Moya, A. Bacterial endosymbionts of insects: insights from comparative genomics. Environ. Microbiol. 6, 1109-1122 (2004). 145 McCutcheon, J. P. & Moran, N. A. Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 10, 13-16 (2011). 146 Muller, H. J. The relation of recombination to mutational advance. Mutat. Res. 
Fund. Mol. Mech. Mut. 1, 2-9 (1964).

49

147 Wernegreen, J. & Moran, N. Evidence for genetic drift in endosymbionts (Buchnera): analyses of protein-coding genes. Mol. Biol. Evol. 16, 83-97 (1999). 148 Pettersson, M. E. & Berg, O. G. Muller’s ratchet in symbiont populations. Genetica 130, 199 (2006). 149 Mira, A., Ochman, H. & Moran, N. A. Deletional bias and the evolution of bacterial genomes. Trends Genet. 17, 589-596 (2001). 150 Andersson, J. O. & Andersson, S. G. E. Insights into the evolutionary process of genome degradation. Curr. Opin. Genet. Dev. 9, 664-671 (1999). 151 Gill, E. E. & Fast, N. M. Stripped-down DNA repair in a highly reduced parasite. BMC Mol. Biol. 8, 24 (2007). 152 Dufresne, A., Garczarek, L. & Partensky, F. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol. 6, R14 (2005). 153 Bordenstein, S. R. & Reznikoff, W. S. Mobile DNA in obligate intracellular bacteria. Nat. Rev. Microbiol. 3, 688–699 (2005). 154 Masui, S., Kamoda, S., Sasaki, T. & Ishikawa, H. Distribution and evolution of bacteriophage WO in Wolbachia, the endosymbiont causing sexual alterations in arthropods. J. Mol. Evol. 51, 491-497 (2000). 155 Corradi, N. Microsporidia: eukaryotic intracellular parasites shaped by gene loss and horizontal gene transfers. Annu. Rev. Microbiol. 69, 167-183 (2015). 156 Stephens, T. G., Ragan, M. A., Bhattacharya, D. & Chan, C. X. Core genes in diverse dinoflagellate lineages include a wealth of conserved dark genes with unknown functions. Sci. Rep. 8, 17175 (2018). 157 Chi, J., Parrow, M. W. & Dunthorn, M. Cryptic sex in Symbiodinium (Alveolata, Dinoflagellata) is supported by an inventory of meiotic genes. J. Eukaryot. Microbiol. 61, 322-327 (2014). 158 Yoon, H. S., Hackett, J. D. & Bhattacharya, D. A single origin of the peridinin- and fucoxanthin-containing plastids in dinoflagellates through tertiary endosymbiosis. Proc. Natl. Acad. Sci. U. S. A. 99, 11724-11729 (2002). 159 Reyes-Prieto, A., Weber, A. P. & Bhattacharya, D. 
The origin and establishment of the plastid in algae and plants. Annu. Rev. Genet. 41, 147-168 (2007). 160 Blanc, G. et al. The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex. Plant Cell 22, 2943-2955 (2010). 161 Cunning, R. et al. Dynamic regulation of partner abundance mediates response of reef coral symbioses to environmental change. Ecology 96, 1411-1420 (2015).

50

162 Thornhill, D. J., Howells, E. J., Wham, D. C., Steury, T. D. & Santos, S. R. Population genetics of reef coral endosymbionts (Symbiodinium, Dinophyceae). Mol. Ecol. 26, 2640- 2659 (2017). 163 Bongaerts, P. et al. Prevalent endosymbiont zonation shapes the depth distributions of scleractinian coral species. Roy. Soc. Open Sci. 2, 140297 (2015). 164 Thornhill, D. J., Xiang, Y., Pettay, D. T., Zhong, M. & Santos, S. R. Population genetic data of a model symbiotic cnidarian system reveal remarkable symbiotic specificity and vectored introductions across ocean basins. Mol. Ecol. 22, 4499-4515 (2013). 165 Lee, S. Y. et al. Symbiodinium tridacnidorum sp. nov., a dinoflagellate common to Indo- Pacific giant clams, and a revised morphological description of Symbiodinium microadriaticum Freudenthal, emended Trench & Blank. Eur. J. Phycol. 50, 155-172 (2015). 166 van Dijk, E. L., Jaszczyszyn, Y., Naquin, D. & Thermes, C. The third revolution in sequencing technology. Trends Genet. 34, 666-681 (2018). 167 Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews. Genetics 19, 329-346 (2018). 168 Jaenike, J. Population genetics of beneficial heritable symbionts. Trends Ecol. Evol. 27, 226- 232 (2012). 169 Wham, F. C. The origin, meaning, and detection of clusters in population genetic data. Thesis dissertation Doctor of Philosophy thesis, The Pennsylvania State University, (2015). 170 Wang, J., Santiago, E. & Caballero, A. Prediction and estimation of effective population size. Heredity 117, 193 (2016). 171 Thornhill, D. J., Lewis, A. M., Wham, D. C. & LaJeunesse, T. C. Host-specialist lineages dominate the adaptive radiation of reef coral endosymbionts. Evolution 68, 352-367 (2014). 172 Wham, D. C. & LaJeunesse, T. C. Symbiodinium population genetics: testing for species boundaries and analysing samples with mixed genotypes. Mol. Ecol. 25, 2699-2712 (2016). 
173 Kaltenpoth, M., Goettler, W., Koehler, S. & Strohm, E. Life cycle and population dynamics of a protective insect symbiont reveal severe bottlenecks during vertical transmission. Evol. Ecol. 24, 463-477 (2010). 174 Mira, A. & Moran, N. A. Estimating population size and transmission bottlenecks in maternally transmitted endosymbiotic bacteria. Microb. Ecol. 44, 137-143 (2002). 175 Keeling, P. J. & Fast, N. M. Microsporidia: biology and evolution of highly reduced intracellular parasites. Annu. Rev. Microbiol. 56, 93-116 (2002).

51

176 Rowan, R. & Powers, D. A. Ribosomal RNA sequences and the diversity of symbiotic dinoflagellates (zooxanthellae). Proc. Natl. Acad. Sci. U. S. A. 89, 3639-3643 (1992). 177 Hershberg, R. & Petrov, D. A. Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet. 6, e1001115 (2010). 178 Lind, P. A. & Andersson, D. I. Whole-genome mutational biases in bacteria. Proc. Natl. Acad. Sci. U. S. A. 105, 17878-17883 (2008). 179 Gardner, M. J. et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498–511 (2002). 180 Douglas, S. et al. The highly reduced genome of an enslaved algal nucleus. Nature 410, 1091– 1096 (2001). 181 Hackett, J. D. & Bhattacharya, D. in Genomics and evolution of microbial eukaryotes (eds Laura A Katz & Debashish Bhattacharya) Ch. The genomes of dinoflagellates, 48-63 (Oxford University Press, 2006). 182 Ekblom, R. & Galindo, J. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107, 1 (2010). 183 Ellegren, H. Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29, 51-63 (2014). 184 Ekblom, R. & Wolf, J. B. A field guide to whole‐genome sequencing, assembly and annotation. Evol. Appl. 7, 1026-1042 (2014). 185 El-Metwally, S., Hamza, T., Zakaria, M. & Helmy, M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9, e1003345 (2013). 186 Simpson, J. T. & Pop, M. The theory and practice of genome sequence assembly. Annu. Rev. Genomics Hum. Genet. 16, 153-172 (2015). 187 Baker, M. De novo genome assembly: what every biologist should know. Nat. Methods 9, 333 (2012). 188 Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170 (2014). 189 Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29, 987-991 (2011). 190 Koren, S. et al. 
Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693-700 (2012). 191 Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669-2677 (2013).

52

192 Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. (2017). 193 Miller, J. R. et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818-2824 (2008). 194 Xiao, W. et al. Challenges, solutions, and quality metrics of personal genome assembly in advancing precision medicine. Pharmaceutics 8, 15 (2016). 195 Stein, L. Genome annotation: from sequence to biology. Nat. Rev. Genet. 2, 493-503 (2001). 196 Treangen, T. J. & Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36-46 (2012). 197 Yandell, M. & Ence, D. A beginner's guide to eukaryotic genome annotation. Nat. Rev. Genet. 13, 329-342 (2012). 198 Tarailo‐Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinf., 4.10. 11-14.10. 14 (2009). 199 Lerat, E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104, 520-533 (2010). 200 Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654-5666 (2003). 201 Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188-196 (2008). 202 Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005). 203 Brent, M. R. Genome annotation past, present, and future: how to define an ORF at each locus. Genome Res. 15, 1777-1786 (2005). 204 Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435-W439 (2006). 205 Xia, X. in Comparative Genomics Ch. What is comparative genomics?, 1-20 (Springer, 2013). 206 Wei, L., Liu, Y., Dubchak, I., Shon, J. 
& Park, J. Comparative genomics approaches to study organism similarities and differences. J. Biomed. Inform. 35, 142-150 (2002). 207 Dubchak, I. & Pachter, L. The computational challenges of applying comparative-based computational methods to whole genomes. Brief. Bioinform. 3, 18-22 (2002). 208 Dewey, C. N. in Evolutionary genomics: statistical and computational Methods Vol. 1 (ed Maria Anisimova) Ch. Whole-genome alignment, 237-257 (Humana Press, 2012).

53

209 Armstrong, J., Fiddes, I. T., Diekhans, M. & Paten, B. Whole-genome alignment and comparative annotation. Annu. Rev. Anim. Biosci. 7 (2019). 210 Soderlund, C., Bomhoff, M. & Nelson, W. M. SyMAP v3. 4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res., gkr123 (2011). 211 Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49-e49 (2012). 212 Chen, Y., González‐Pech, R. A., Stephens, T. G., Bhattacharya, D. & Chan, C. X. Evidence that inconsistent gene prediction can mislead analysis of dinoflagellate genomes. J. Phycol., 56, 6-10 (2020). 213 Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204-2215 (2000). 214 Thornton, J. W. & DeSalle, R. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. Genomics Hum. Genet. 1, 41-73 (2000). 215 Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25- 29 (2000). 216 Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., gkr988 (2011). 217 Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015). 218 Ludwig, M. Z. Functional evolution of noncoding DNA. Curr. Opin. Genet. Dev. 12, 634- 639 (2002). 219 Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005).

54

Chapter 3. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodiniaceae

The molecular mechanisms that underpin the diversification of Symbiodiniaceae remain poorly understood, largely because genome-scale data are limited. However, an earlier global transcriptome sequencing effort, the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), generated a large volume of transcriptome data from a broad range of microbial eukaryotes, including dinoflagellates. Studies of gene expression changes associated with exposure to distinct conditions and types of stress have also contributed transcriptome data from the major clades in Symbiodiniaceae. In addition, the first genome sequences of Symbiodiniaceae were available at the time this work was carried out. These datasets thus provide a valuable platform to assess gene functions that are common or specific to the distinct lineages of Symbiodiniaceae.

In this chapter, I present a comparative analysis using available genome and transcriptome data to assess gene functions associated with the diversification of Symbiodiniaceae. The chapter is presented in the form of a manuscript and addresses Aim 1 (Chapter 1, Section 1.2). Specifically, I assessed gene functions that are common or specific to distinct lineages, including key genera of Symbiodiniaceae (Symbiodinium, Breviolum, Cladocopium, Durusdinium, Effrenium and Fugacium). The manuscript was published in Scientific Reports1 (DOI: 10.1038/s41598-017-15029-w). Given the systematic revision of Symbiodiniaceae after this manuscript was published, and for consistency of presentation throughout this thesis, I revised the manuscript to follow the new genus names where applicable, including in figures and tables, and reformatted the text. As the first author of this paper, I conceived the study, designed and conducted all computational analyses, and interpreted the results. I wrote the first draft of the manuscript and generated all figures and tables.

3.1. Abstract

Symbiodiniaceae are best known as the photosynthetic symbionts of corals, but some lineages in the family are symbiotic in other organisms or include free-living forms. Identifying similarities and differences among these lineages can help us understand their relationship with corals, and thereby inform measures to manage coral reefs in a changing environment. Here, using sequences from 24 publicly available transcriptomes and genomes of Symbiodiniaceae, we assessed 78,389 gene families in genera of Symbiodiniaceae and the outgroup Polarella glacialis, and identified putative overrepresented functions in gene families that (1) distinguish Symbiodiniaceae from other members of the order Suessiales, (2) are shared by all of the symbiodiniacean genera with data available at the time, and (3) based on available information, are specific to each clade. Our findings indicate that transmembrane transport, mechanisms of response to reactive oxygen species, and protection against UV radiation are functions enriched in all symbiodiniacean genera but not in P. glacialis. Enrichment of these functions indicates the capability of Symbiodiniaceae to establish and maintain symbiosis, and to respond and adapt to their environment. The observed differences in lineage-specific gene families imply extensive genetic divergence among clades. Our results provide a platform for future investigation of lineage- or genus-specific adaptation of Symbiodiniaceae to their environment.

3.2. Introduction

Symbiodiniacean dinoflagellates are known for their mutualistic relationships with corals and other marine organisms. Association with Symbiodiniaceae enables corals to inhabit nutrient-poor tropical waters, grow and build up coral reefs; breakdown of the relationship leads to coral bleaching and, unless the relationship is re-established, death. Reef ecosystems in turn provide diverse benefits and services both to the environment and to the economy of nearby communities2. A clear understanding of the relationship between Symbiodiniaceae and corals is thus indispensable if we are to take a knowledge-driven approach to protecting and managing these valuable ecosystems in the face of global environmental change.

Symbiodiniaceae are classified into nine distinct groups, clades A through I (some of them now genera)3-5. Studies based on genome and transcriptome data (generated, at the time of this work, only for Symbiodinium sensu stricto, Breviolum, Cladocopium, Durusdinium, Effrenium and Fugacium, formerly clades A through F) have contributed substantially to understanding the biology of each of these groups. One of those studies, based on transcriptome data, revealed that Symbiodinium and Breviolum (clades A and B) use a smaller number of transcription factors than do other eukaryotes, implying particular gene regulation mechanisms6. Other studies report the genetic basis of thermal tolerance in Cladocopium and Durusdinium (clades C and D)7, and gene homologs and pathways shared among Symbiodinium, Breviolum, Cladocopium and Durusdinium (clades A, B, C and D)8. Another transcriptome-based study revealed that divergence within a single genus (Breviolum, or clade B, in this case) can be mirrored as extensive differences in gene expression9. At the time this work was published, three draft genomes of Symbiodiniaceae were available, revealing the presence of unique splice sites and a unidirectional gene arrangement in B. minutum (clade B)10, and that retrotransposition and gene duplication are the main drivers of gene family expansion in F. kawagutii (clade F)11. Comparative analysis of the S. microadriaticum (clade A) genome with the two others, together with additional sequence data from other dinoflagellates, supports the hypothesis that the symbiotic lifestyle of Symbiodiniaceae was predisposed by an abundance of membrane transporters in all dinoflagellates, rather than being an adaptive novelty12.

Although genome and transcriptome data are available from representatives of Symbiodiniaceae, little is known of how gene content or biological function may differ within and between clades. Key questions about symbiodiniacean biology remain largely unexplored, including what features distinguish them from other dinoflagellates, and what attributes are shared by all Symbiodiniaceae or are exclusive to one or a few genera (or clades). To link genomic information to functions of cells, organisms and ecosystems, a comparative approach using gene families can be adopted13. Here, using available genome10-12 and transcriptome6,7,9,14-17 data from dinoflagellates within Order Suessiales (Symbiodiniaceae and Polarella glacialis), we assessed the gene families and inferred biological functions that are represented in one or more of the symbiodiniacean clades A through F, and investigated whether these functions are overrepresented in each analysed group. This represents the first comprehensive analysis of shared intra- and inter-cladal gene families in Symbiodiniaceae.

3.3. Results and discussion

3.3.1. Genome and transcriptome data

For this study we assembled 24 datasets (Table 3-1) of Symbiodiniaceae genomes (3) and transcriptomes (19), and P. glacialis transcriptomes (2), with a total of 1,300,300 sequences (total length 1333.87 Mbp; Supplementary Table S1). The N50 length of each set of predicted coding sequences (CDS) ranges between 219 and 3987 bp (average 1480 bp). Fewer than 3% of the sequences from each dataset have significant BLASTn matches (E ≤ 10⁻¹⁰) against bacterial genomes, implying that there is little bacterial contamination. The completeness of each dataset was examined by comparison against core eukaryote genes in CEGMA18 and BUSCO19 (see Methods). On average, 72% of the 234 alveolate-stramenopile BUSCO genes and 89% of the 458 CEGMA genes were recovered by BLASTx from the datasets (Supplementary Table S1).
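For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows one common way to compute it: sort the sequence lengths in decreasing order and report the length at which the running sum first reaches half of the total. The function name and the toy lengths are illustrative assumptions, not part of the original analysis.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is taken here as the length L such that sequences of length >= L
    together cover at least half of the total length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths (hypothetical, not taken from any dataset in Table 3-1).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # 400: 500 + 400 = 900 >= 750
```

Because N50 is weighted by length, a dataset dominated by short fragments yields a low value even if it contains a few long sequences.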

Where available, CDS predictions from the original source were used; otherwise CDS were predicted with TransDecoder v2.0.1 (see Methods). The overall sequence data yielded a total of 1,131,289 CDS with a total length of 1.21 Gbp; these correspond to the same number of predicted protein sequences. Completeness analyses returned results similar to those of the overall data (Supplementary Table S1). Overall G+C content ranges from 50.43% to 58.62% over all lineages (where lineage is defined as any clade of Symbiodiniaceae or P. glacialis: Fig. 3.1A), G+C content in the third codon position between 50.81% and 70.86% (Fig. 3.1B), and the effective number of codons between 49.76 and 56.85 (Fig. 3.1C). These values differ between clades but fall within a relatively narrow range within each clade.
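As an illustration of two of the composition statistics above, the sketch below computes overall G+C content and G+C content at third codon positions for a single in-frame CDS. It is a simplified example under stated assumptions (the CDS starts in frame, ambiguous bases are ignored, and the function names are hypothetical); the effective number of codons requires a more involved calculation and is not shown.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_fraction(cds):
    """G+C fraction at third codon positions of an in-frame CDS."""
    third_positions = cds[2::3]  # indices 2, 5, 8, ... are codon position 3
    return gc_fraction(third_positions)


# Toy CDS (hypothetical): codons ATG GCC AAG TAA
toy_cds = "ATGGCCAAGTAA"
print(round(gc_fraction(toy_cds), 4))  # 0.4167 (5 of 12 bases)
print(gc3_fraction(toy_cds))           # 0.75 (G, C, G, A at position 3)
```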

Due to the heterogeneity and the incomplete (and fragmented) nature of the transcriptome data, we restricted our analysis to the clade (rather than species or isolate) level. However, we note that most genes in dinoflagellates, including Symbiodiniaceae, have been found to be constitutively expressed irrespective of growth conditions21,22. After pooling datasets by clade and removing redundant sequences (see Methods), our final datasets consist of 584,272 predicted proteins (Supplementary Tables S2 and S3). Details on the contribution of each individual dataset to the clade pools are shown in Fig. S3.1.

Table 3-1 Summary of the selected datasets for analysis in the present study

| Species (isolate) [figure code] | Type | Number of sequences | N50 (bp) | Total length (Mbp) | Data type |
|---|---|---|---|---|---|
| Symbiodinium microadriaticum (CCMP2467)12 [E] | A | 49,109 | 3987 | 166.722 | Genome |
| Symbiodinium sp. (CassKB8)6 [d] | A | 72,152 | 1087 | 61.921 | Transcriptome |
| Symbiodinium sp. (CCMP2430)17 [c] | A | 44,733 | 1356 | 42.483 | Transcriptome |
| Breviolum aenigmaticum (mac04-487)9 [h] | B | 45,343 | 1355 | 44.628 | Transcriptome |
| Breviolum sp. (SSB01)14 [k] | B | 59,669 | 1752 | 71.172 | Transcriptome |
| Breviolum pseudominutum (rt146)9 [f] | B | 47,411 | 1508 | 51.270 | Transcriptome |
| Breviolum psygmophilum (HIAp, Mf10.14b.02, PurPFlex, rt141)9 [i] | B | 50,745 | 1618 | 51.37 | Transcriptome |
| Breviolum minutum (Mac703, Mf1.05b, rt002, rt351)9 [g] | B1 | 51,199 | 1597 | 57.248 | Transcriptome |
| Breviolum minutum (Mf1.05b)10 [L] | B1 | 47,014 | 2675 | 97.202 | Genome |
| Breviolum minutum (Mf1.05b)6 [j] | B1 | 76,284 | 741 | 45.335 | Transcriptome |
| Cladocopium sp.7 [m] | C | 26,986 | 534 | 12.546 | Transcriptome |
| Cladocopium sp.20 [o] | C | 55,588 | 687 | 30.570 | Transcriptome |
| Cladocopium sp.15 [s] | C | 65,838 | 1746 | 97.581 | Transcriptome |
| Cladocopium sp.17 [p] | C1 | 45,782 | 1443 | 45.706 | Transcriptome |
| Cladocopium sp. (MI-SCF055)16 [t] | C1 | 116,479 | 1323 | 106.160 | Transcriptome |
| Cladocopium sp. (WSY)116 [r] | C1 | 131,066 | 1239 | 113.375 | Transcriptome |
| Cladocopium sp.17 [q] | C15 | 37,277 | 1299 | 33.008 | Transcriptome |
| Durusdinium sp.7 [u] | D | 23,777 | 920 | 16.609 | Transcriptome |
| Durusdinium sp.17 [v] | D1a | 43,662 | 804 | 25.956 | Transcriptome |
| Effrenium voratum (CCMP421)17 [w] | E2 | 71,624 | 1701 | 86.612 | Transcriptome |
| Fugacium kawagutii (CCMP2468)11 [Y] | F | 36,850 | 1467 | 38.379 | Genome |
| Fugacium kawagutii (CCMP2468)17 [x] | F | 11,679 | 219 | 2.666 | Transcriptome |
| Polarella glacialis (CCMP1383)17 [b] | - | 57,865 | 1581 | 57.733 | Transcriptome |
| Polarella glacialis (CCMP2088)17 [a] | - | 32,168 | 1161 | 21.755 | Transcriptome |

Fig. 3.1 G+C content and codon usage in coding sequences. Overall G+C content (A), G+C content in third codon positions (B) and effective number of codons (C) are shown for the complete CDS of each dataset.

3.3.2. Delineation of gene families

Functions of proteins from each clade pool were assessed based on similarity search against the UniProt database, following Aranda et al.12 (see Methods). Of the 584,272 inferred proteins, 228,391 have significant (E ≤ 10⁻¹⁰) matches against sequences in UniProt (Swiss-Prot + TrEMBL); of these, 139,188 have a top match against an entry in Swiss-Prot, and a further 89,203 have a top match against an entry in TrEMBL. The matched UniProt identifiers were used to retrieve their associated KEGG Orthology (KO)23 and Gene Ontology (GO)24 terms. We define a UniProt Homolog Group (UP-HoG) as a set of proteins that share a common UniProt top match that has not been assigned a KO term, and a KEGG Homolog Group (KO-HoG) as those proteins for which the UniProt top match(es) have the same assigned KO term. We clustered proteins that show no significant (E ≤ 10⁻¹⁰) match against any UniProt entry using orthAgogue25 and MCL26, and define each of the resulting groups as an orthAgogue-MCL Homolog Group (OM-HoG).
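The grouping rules defined above can be sketched in a few lines of Python. This is only an illustration of the definitions, not the pipeline used in this study: the function name, the input dictionaries and all identifiers (p1, U1, KO_1, etc.) are hypothetical, and proteins without a UniProt match are simply set aside here rather than clustered with orthAgogue and MCL.

```python
from collections import defaultdict


def group_homologs(top_hits, ko_of):
    """Assign proteins to KO-HoGs and UP-HoGs.

    top_hits: protein id -> UniProt accession of its top match (None if no match)
    ko_of:    UniProt accession -> KO term (only for accessions with a KO term)
    """
    ko_hogs = defaultdict(set)   # KO term -> proteins whose top match has that KO
    up_hogs = defaultdict(set)   # UniProt accession (no KO) -> proteins
    unmatched = set()            # candidates for OM-HoG clustering
    for protein, accession in top_hits.items():
        if accession is None:
            unmatched.add(protein)
        elif accession in ko_of:
            ko_hogs[ko_of[accession]].add(protein)
        else:
            up_hogs[accession].add(protein)
    return dict(ko_hogs), dict(up_hogs), unmatched


# Hypothetical toy input.
toy_hits = {"p1": "U1", "p2": "U2", "p3": "U3", "p4": "U3", "p5": None}
toy_ko = {"U1": "KO_1", "U2": "KO_1"}
ko_hogs, up_hogs, unmatched = group_homologs(toy_hits, toy_ko)
print(ko_hogs)    # p1 and p2 share KO_1 although their top matches differ
print(up_hogs)    # p3 and p4 share the top match U3, which has no KO term
print(unmatched)  # p5 has no UniProt match
```

Note that a KO-HoG can merge proteins with different UniProt top matches, whereas a UP-HoG is always keyed on a single UniProt entry.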

The 228,391 proteins with UniProt matches were grouped into 40,688 UP-HoGs (mean size 3.85, sd 15.87) and 5679 KO-HoGs (mean size 15.39, sd 32.18), while those with no matches in the database were clustered into 37,483 OM-HoGs (mean size 3.47, sd 2.99); see Supplementary Table S4 for further details on size of the gene sets in each category. Because some dinoflagellate genes are similar in sequence to bacterial genes27,28 we carefully filtered these groups to minimise bacterial contamination (Fig. S3.2; see Methods) while attempting to retain true dinoflagellate genes. Most (83%) of the clusters with functional annotation (KO-HoGs and UP-HoGs) show no significant match against any bacterial sequence. Of those that do match a bacterial sequence, nearly one-third are eukaryote-like, with evidence of a multi-exonic CDS. We identified 4296 protein sets as having evidence of putative bacterial contamination, and excluded them from subsequent analysis. Of the 37,483 OM-HoGs, 36,318 (96.9%) have more than one representative in each lineage, and were retained for subsequent analyses. These steps yielded 78,389 protein sets (5331 KO-HoGs, 36,740 UP-HoGs and 36,318 OM-HoGs) for subsequent analysis. We provisionally refer to these sets as gene families.
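The counts in the paragraph above can be cross-checked with simple arithmetic. The short tally below uses only numbers stated in the text; the variable names are illustrative. The reported figures are consistent with all 4296 excluded sets being KO-HoGs or UP-HoGs, although this is an inference from the totals rather than an explicit statement in the text.

```python
# Counts as reported in the text.
up_hogs, ko_hogs, om_hogs = 40_688, 5_679, 37_483
removed_as_bacterial = 4_296
retained_ko, retained_up, retained_om = 5_331, 36_740, 36_318

# Annotated groups before and after removing putative contaminants.
annotated_before = up_hogs + ko_hogs                        # 46,367
annotated_after = annotated_before - removed_as_bacterial   # 42,071

total_families = retained_ko + retained_up + retained_om
om_retained_pct = round(100 * retained_om / om_hogs, 1)

print(annotated_after == retained_ko + retained_up)  # True
print(total_families)                                # 78389
print(om_retained_pct)                               # 96.9
```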

3.3.3. Functional annotation of gene families

For KO-HoGs and UP-HoGs, function was annotated at the protein level based on 62,339 distinct UniProt matches and their associated Gene Ontology terms (see Methods). For all gene families including OM-HoGs, we also searched for Pfam protein domains as additional support. In total, 33,766 of 78,389 (43%) families were annotated with 48,669 Pfam domains. Fig. 3.2A shows the ranked distribution of 4532 distinct Pfam domains found across all gene families, and the identity of those found in >300 families. Ankyrin repeat (3 copies) (PF12796), protein kinase (PF00069) and EF-hand domain pair (PF13499) domains were found in 968, 869 and 713 gene families, respectively. These functions are known to be prevalent in Symbiodiniaceae10,14. Ankyrins are important for protein-protein interaction (and potentially host-symbiont recognition), and EF-hand domains are involved in calcium-binding and metabolism10,29. Membrane transport also appears prevalent in these gene families: ion transport protein (PF00520) and major facilitator superfamily (PF07690) domains were found in 472 and 344 families, respectively. DnaJ domains, found in 408 families, are known to be involved in the response of Symbiodiniaceae to photo- and thermal stress16,30,31. Similarly, reverse transcriptase (PF07727 and PF00078) domains, found in 370 and 377 families respectively, may be involved in stress-response mechanisms32-35. The presence of C-5 cytosine-specific DNA methylase (PF00145) domains in 344 families agrees with the hypermethylated state of DNA in Symbiodiniaceae36, which might also be related to the regulation of gene expression in dinoflagellates37.
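The per-domain family counts discussed above amount to counting, for each Pfam accession, the number of gene families in which it occurs at least once. A minimal sketch of that tally follows, assuming a simple mapping from family identifiers to domain hits; the function names, the family identifiers and the toy threshold are hypothetical, and only the three Pfam accessions are taken from the text.

```python
from collections import Counter


def domain_prevalence(family_domains):
    """Count, for each Pfam accession, the families in which it occurs.

    family_domains: family id -> iterable of Pfam accessions (repeats allowed).
    A domain is counted once per family, however many copies the family has.
    """
    counts = Counter()
    for domains in family_domains.values():
        counts.update(set(domains))
    return counts


def prevalent(counts, threshold):
    """Domains found in more than `threshold` families, most frequent first."""
    hits = [(domain, n) for domain, n in counts.items() if n > threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical toy families.
toy_families = {
    "fam1": ["PF12796", "PF00069", "PF12796"],
    "fam2": ["PF12796"],
    "fam3": ["PF00069", "PF13499"],
}
counts = domain_prevalence(toy_families)
print(prevalent(counts, 1))  # [('PF00069', 2), ('PF12796', 2)]
```

With the real data, a threshold of 300 families would correspond to the cut-off marked in Fig. 3.2A.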

Fig. 3.2 Prevalent protein domains in gene families of Symbiodiniaceae. Number of gene families (y-axis) in which each Pfam domain (x-axis) was found in (A) all gene families, and (B) only OM-HoGs. The dashed red line separates the most-prevalent domains, >300 for all gene families and >80 in OM-HoGs, in each case. Identities of these domains are given in the top-right inset of each plot.

We used Pfam domains as a proxy of putative function of OM-HoGs. We recovered 1097 distinct domains distributed among 6283 OM-HoGs. Of these domains, ten were found in >80 families (Fig. 3.2B). These prevalent domains are largely similar to what we observed in the overall gene families (Fig. 3.2A). RNase H (PF00075) and RNA recognition (PF00076) domains, found in 134 and 137 families respectively, have been shown to regulate reverse transcription38,39 and splicing40. Two types of ankyrin repeat domains, PF12796 and PF13857, were found in 178 and 94 OM-HoGs respectively. As the functions implicated by these domains are critical to growth and survival, we included OM-HoGs in further analyses.

3.3.4. Dynamics of gene families among Symbiodiniaceae clades

To explore the dynamics of gene families among clades of Symbiodiniaceae, we numbered each node (N1 through N6: Fig. 3.3) on the accepted phylogeny5,41 and counted the families inferred to be represented at each node (Table 3-2). We infer a family to be part of Node-total at a node if a member of that family is identified in any lineage descendant from that node, regardless of whether or not the family is represented elsewhere in our dataset. Node-specific families are a subset of these, represented in one or more descendant clades but otherwise not observed in our dataset. The N1-total and N1-specific gene family sets are by this definition identical (Fig. 3.3); for simplicity we refer to these as N1-total. The membership of all gene sets is given in Supplementary Table S5.
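The Node-total and Node-specific definitions above reduce to two set operations, sketched below on toy data. The lineage names, node names and topology are hypothetical placeholders, not those of Fig. 3.3, and the function names are introduced only for this illustration.

```python
# Toy presence/absence data: gene family -> lineages in which it is represented.
# Lineage and node names are hypothetical and do not reproduce Fig. 3.3.
families = {
    "fam1": {"out", "a", "b", "c"},
    "fam2": {"a", "b"},
    "fam3": {"b", "c"},
    "fam4": {"c"},
    "fam5": {"out"},
}

# Each node is described by the set of tip lineages that descend from it.
nodes = {
    "root": {"out", "a", "b", "c"},
    "n2": {"a", "b", "c"},
    "n3": {"b", "c"},
}


def node_total(descendants, families):
    """Families with a member in at least one lineage descending from the node."""
    return {f for f, present in families.items() if present & descendants}


def node_specific(descendants, families):
    """Node-total families that are not observed outside the node's descendants."""
    return {f for f, present in families.items() if present and present <= descendants}


for name, descendants in nodes.items():
    total = node_total(descendants, families)
    specific = node_specific(descendants, families)
    print(name, len(total), len(specific))
```

At the root of the toy tree every family is both total and specific, which mirrors why the N1-total and N1-specific sets coincide.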

Fig. 3.3 Gene families along the phylogeny of Symbiodiniaceae. Changes in gene family numbers in Suessiales shown for the Symbiodiniaceae phylogeny (simplified cladogram based on Pochon et al.41), with Polarella glacialis as outgroup. The notation in this diagram is used throughout the text. Numbers of total and specific gene families at each node are shown to the left of the node in question. Numbers of specific and absent gene families for each lineage are correspondingly shown at the tips (right).

The count of gene families differs substantially among clades (Table 3-2). These results may reflect actual genome dynamics (e.g. changes in genome size, gene content and/or sequence divergence) in the various lineages. However, for transcriptomes that lack genome data support, biases arising from the amount or quality of data (Fig. S3.3), taxon sampling, or details of data generation or processing (Supplementary Table S1) cannot be dismissed. To further explore the differences in gene family number among clades, we define lineage-specific gene families (L-specific, where L is an identified lineage, e.g. L = S denotes Symbiodinium) as those represented in only that lineage, and L-absent families as those represented in all these lineages except L. The latter have either been lost from L, or are present but were not recovered in these data. Notation of gene families in individual lineages is given in Supplementary Table S6, and the number of shared gene families among all lineages is shown in Fig. 3.4A. The number of gene families specific to each lineage does not necessarily resemble the changes in gene family number displayed by the nodes. We assessed the effects of unbalanced taxon sampling and of differences in amount of data on the number of gene families inferred as specific to each clade (see below). Although we observed that taxon sampling could bias our results, the natural diversity of Symbiodiniaceae could also contribute to the observed patterns. The amount of data, on the other hand, seems to impact our results less.
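As a minimal sketch of the L-specific and L-absent definitions, the two sets can be derived from a family-by-lineage presence table as follows. The toy families are invented, and the single-letter codes other than S (spelled out in the text) are assumed by analogy.

```python
# Single-letter lineage codes; only "S" (Symbiodinium) is defined explicitly in the
# text, the remaining letters are assumed by analogy (P = Polarella glacialis).
LINEAGES = ["S", "B", "C", "D", "E", "F", "P"]

# Toy data: gene family -> lineages in which it is represented.
families = {
    "fam1": set("SBCDEFP"),
    "fam2": set("BCDEFP"),   # every lineage except S
    "fam3": {"S"},
    "fam4": {"C"},
    "fam5": {"B", "C"},
    "fam6": set("SBCDEF"),   # every lineage except P
}


def lineage_specific(lineage, families):
    """Families represented in this lineage only."""
    return {f for f, present in families.items() if present == {lineage}}


def lineage_absent(lineage, families, all_lineages=LINEAGES):
    """Families represented in every lineage except this one."""
    expected = set(all_lineages) - {lineage}
    return {f for f, present in families.items() if present == expected}


for lineage in LINEAGES:
    print(lineage,
          len(lineage_specific(lineage, families)),
          len(lineage_absent(lineage, families)))
```

An L-absent call cannot by itself distinguish true loss from non-recovery in the data, so the counts are best read as upper bounds on loss.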

Table 3-2 Gene families in which each lineage is represented. Number of gene families in which each lineage is found, shown for annotated (KO-HoGs or UP-HoGs) and non-annotated (OM-HoGs) gene families.

Lineage                   Total     KO-HoGs/UP-HoGs   OM-HoGs
Symbiodinium (Clade A)    30,409    15,399            15,010
Breviolum (Clade B)       35,152    24,506            10,646
Cladocopium (Clade C)     43,412    23,121            20,291
Durusdinium (Clade D)     20,833    12,434            8399
Effrenium (Clade E)       17,481    10,910            6571
Fugacium (Clade F)        14,967    7658              7309
Polarella glacialis       15,920    9195              6725

As transcriptome data are inherently incomplete, inferring the gain or loss of genes based on these potentially biased data is not straightforward. Here we discuss our results focusing on Symbiodinium, Breviolum (for which genome data are available) and Cladocopium (the most data-rich lineage, with seven transcriptome datasets). C-specific gene families (9227) are the most abundant overall (Fig. 3.3), compared to 3577 S-specific and 3589 B-specific families. In contrast, C-absent gene families (17) are the fewest overall, compared to S-absent (71) and B-absent (72) families. The numbers of gene families specific to and absent from Symbiodinium and Breviolum are similar despite the higher number of gene families in Breviolum (Table 3-2), suggesting no drastic gain of gene families between them.

Fig. 3.4 Gene families shared among lineages of Symbiodiniaceae. Gene families shared by (A) all individual lineages within Suessiales, and (B) Symbiodinium, Breviolum and Cladocopium when compared among each other. The bars represent the number of gene families shared exclusively by the lineages marked below in the box with dots and connected by lines. In (A), lineage-specific gene families, those shared by all lineages in Suessiales (SuesCore) and SymCore-specific gene families are highlighted according to the colour code at the top right. The simplified topology shown at the bottom left depicts phylogenetic relationships among lineages.

In an independent analysis of shared gene families among Symbiodinium, Breviolum and Cladocopium (Fig. 3.4B), the more-recently diverged Cladocopium shares more gene families (11,723) with Breviolum than with the basal Symbiodinium (4581). Interestingly, S-specific families are more abundant than those shared by Symbiodinium with Breviolum, with Cladocopium or with both, suggesting either a substantial gain of gene families in Symbiodinium, or an extensive loss of gene families between nodes N2 and N5 (Fig. 3.3). The latter alternative is supported by fewer families shared between Symbiodinium and Breviolum than between Breviolum and Cladocopium. Under this scenario, our results suggest that Cladocopium has retained more gene families and undergone further functional diversification than has Breviolum. Similar gene-family dynamics are observed in the numbers of L-specific and L-absent families in all other lineages (Fig. 3.3), although at this broad scale we cannot dismiss the impact of systematic and data biases such as poor taxon sampling in Durusdinium, Effrenium and Fugacium (Supplementary Table S1 and Fig. S3.3), which could also contribute to this observation.

We categorised the gene families according to the lineages in which they are represented, defining those common to all Symbiodiniaceae and P. glacialis as SuesCore, and those shared by all Symbiodiniaceae clades (regardless of their presence or absence in P. glacialis) as SymCore. Families shared by all Symbiodiniaceae but not P. glacialis, i.e. those exclusive to Symbiodiniaceae, were annotated as SymCore-specific, and in this dataset are equivalent to P-absent. Given the possible combinations of gene-family sharing among these lineages (Fig. 3.4A), it is remarkable that families exclusive to one lineage are always amongst the major fractions, and more abundant than SymCore-specific families. This could be explained by extensive divergence among lineages caused by differential recruitment (or preservation coupled with loss in other lineages), or alternatively by an extent of sequence variation so great as to prevent family members from clustering together with the strategies we employed. The low level of variation (sd ≤ 5×10⁻¹¹) among E values, relative to the top Swiss-Prot match within each UP-HoG, renders the latter alternative less likely. Interestingly, the number of gene families shared by two lineages does not necessarily correlate with their phylogenetic proximity. For instance, Cladocopium and Fugacium are closely related (Fig. 3.3, N6) but exclusively share fewer gene families (2302) than do Cladocopium and Breviolum (6923), despite Breviolum diverging from the Cladocopium-Fugacium lineage at a more-ancestral node (N5). However, Fugacium is the lineage with the fewest high-quality predicted proteins and the least-complete dataset (Fig. S3.1, Supplementary Table S2).
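The categories defined above, and the exclusive-sharing counts of the kind plotted in Fig. 3.4, can be sketched with simple set logic. The toy families are invented, the lineage letter codes other than S are assumed, and `classify` and `exclusive_sharing` are names introduced only for this illustration.

```python
from collections import Counter

# Lineage codes are assumed (S, B, C, D, E, F for the six genera; P for P. glacialis).
SYMBIODINIACEAE = frozenset("SBCDEF")
OUTGROUP = "P"


def classify(present):
    """Return the core labels that apply to a family's set of lineages."""
    labels = []
    if SYMBIODINIACEAE <= present:
        labels.append("SymCore")
        # SymCore families either include the outgroup (SuesCore) or do not.
        labels.append("SuesCore" if OUTGROUP in present else "SymCore-specific")
    return labels


def exclusive_sharing(families):
    """Count families by the exact combination of lineages in which they occur."""
    return Counter(frozenset(present) for present in families.values())


# Toy data: gene family -> lineages in which it is represented.
families = {
    "fam1": set("SBCDEFP"),
    "fam2": set("SBCDEF"),
    "fam3": {"C", "F"},
    "fam4": {"B", "C"},
    "fam5": {"B", "C"},
    "fam6": {"C"},
}

for name, present in families.items():
    print(name, classify(present))
print(exclusive_sharing(families).most_common())
```

Counting by exact combination, rather than by pairwise overlap, is what makes each family contribute to exactly one bar.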

3.3.5. What makes Symbiodiniaceae Symbiodiniaceae?

For a functional overview at ordinal rank, we tested first for generality versus specificity by comparing SuesCore against all gene families in Suessiales (see Methods). GO terms enriched in SuesCore correspond to a wide variety of biological processes related to cytoplasmic translation, response to environmental factors (salt, temperature, nutrients, bacteria), and regulation of transcription and life cycle (Supplementary Table S7). Among the most-significantly overrepresented Pfam domains in SuesCore we found pentatricopeptide domains (PPRs), ankyrins, domains of AAA chaperone-like ATPases, several dynein domains and a kinesin motor domain (Supplementary Table S8); proteins carrying the latter three types of domain are necessary for movement and assembly of eukaryotic flagella42,43.

To determine what functions characterise Symbiodiniaceae, and distinguish Symbiodiniaceae from Polarella within Order Suessiales, we tested for enrichment of Pfam domains and GO terms in the SymCore and SymCore-specific gene families. GO terms and Pfam domains in SymCore were very similar to those enriched in SuesCore; this was expected, given that P. glacialis contributes few sequences to the latter gene family set (Supplementary Tables S9 and S10). Amongst SymCore-specific gene families, enriched GO terms describe biological processes that are required for the maintenance of Symbiodiniaceae symbiosis, including transmembrane transport of ions, amino acids and proteins12, mechanisms of response to reactive oxygen species (ROS), and protection against ultraviolet radiation (Supplementary Table S11). Reef habitats of Symbiodiniaceae are typically characterised by high photon flux, and large amounts of ROS are generated during photosynthesis44. Mechanisms involved in nucleotide-excision DNA repair are overrepresented (in SuesCore and SymCore), suggesting the critical involvement of this process in counteracting the mutagenic effects of UV radiation and free radicals in Symbiodiniaceae. Enriched protein domains included those associated with transmembrane transport, protein-protein interaction potentially involved in host recognition (ankyrin and leucine-rich repeats)45-47, DNA repair, and protection from free radicals (Supplementary Table S12).

Since the enrichment tests compare general versus specific attributes, we expect lineage-specific functions to be underrepresented. For instance, multiple copies of genes encoding components of reverse transcription pathways have so far been reported only in Fugacium kawagutii11, and several domains annotated with that function are underrepresented in our SuesCore, SymCore and SymCore-specific gene families. Our results further suggest that certain protein domains considered as abundant in Symbiodiniaceae may be dominant in specific genomes or clades; for instance, the domains involved in DNA methylation and transmembrane amino acid transport were underrepresented in SuesCore and SymCore, as is an EF-hand domain in SymCore-specific gene families.

3.3.6. Lineage-specific enrichment of function

To assess lineage-specific attributes, we systematically identified gene families that are exclusive to, or absent from, each lineage (Supplementary Table S6). The G+C content distribution of the CDS in lineage-specific gene families resembles that of all gene families for that lineage (Fig. S3.4), suggesting non-exogenous origins (i.e. there is no evidence for systematic lateral gene transfer). GO terms and Pfam domains enriched in gene families exclusive to each lineage are not necessarily lineage-specific. For example, retrotransposition facilitated by reverse transcription has been reported in Fugacium, a conclusion supported by our results (Supplementary Tables S13 and S14). However, gene families exclusive to Symbiodinium and Breviolum also display enriched GO terms and protein domains related to reverse transcription and retrotransposition (Supplementary Tables S15 to S18). Although viruses have been found in tight relationship with some Symbiodiniaceae isolates in culture48, our results are not obviously the result of recent viral contamination since the G+C content of CDS associated with retrotransposition and reverse transcription does not differ from that of all CDS in Suessiales (Fig. S3.5). In addition, some retrotransposons are known to be activated under stress conditions in Symbiodiniaceae32 and other eukaryotes including diatoms33 and plants34,35; our findings may reflect functions relevant to stress-response mechanisms in Symbiodiniaceae. Other examples of GO terms and protein domains enriched in lineage-specific gene families that are not exclusive to a certain lineage include DNA methylation in Symbiodinium and Fugacium (Supplementary Tables S15 and S19), and amino acid transmembrane transport in Breviolum and Effrenium (Supplementary Tables S18 and S20).

Among the biological processes annotated in Symbiodinium-specific gene families (S-specific), mechanisms related to adaptation to light conditions and avoidance of photodamage were enriched, including the GO terms Chloroplast avoidance movement, Chloroplast localization, Establishment of plastid localization, Plastid localization, Chloroplast relocation and Phototropism. Free-living species have been described in Symbiodinium49,50. These capabilities could be beneficial for free-living as well as symbiotic lifestyles, or for the ability to switch between the two. On the other hand, gene families absent only from Symbiodinium (S-absent) are rich in ribosomal protein domains and translational functions (Supplementary Tables S21 and S22).

Free-living isolates have been reported in Effrenium as well, including the only isolate in this study. However, adaptive thermal regulation is the only biological process enriched in gene families exclusive to this clade that is obviously associated with the free-living habit (Supplementary Table S20). Many of the enriched functions are related to transmembrane transport, and the most-enriched protein domain in the exclusive gene families was the major facilitator superfamily (Supplementary Tables S20 and S23), a diverse family of membrane transporters implicated in the transport of metabolites and nutrients, including nitrate and nitrite51. Although membrane transport is a characteristic process of symbiodiniacean symbioses and members of the major facilitator superfamily have already been reported for other Symbiodiniaceae isolates30,52, this superfamily seems to have functions of particular relevance in this isolate from Effrenium.

Several of the most-enriched biological processes and protein domains in C-specific gene families are linked to GTPase activity or its regulation, more specifically to the Rho GTPase family (Supplementary Tables S24 and S25). Rho GTPases function as molecular switches that activate responses to a wide variety of stimuli including changes in the cytoskeleton, regulation of gene expression, control of the cell cycle and transmembrane trafficking53. Rho GTPases have been implicated in the rapid evolution of the Atlantic killifish Fundulus heteroclitus, by facilitating adaptation to the presence of toxic compounds in the environment54. We therefore hypothesise that the overrepresentation of proteins with Rho GTPase-related functions, and the subsequent capability to respond effectively to different stimuli, could have contributed to the great genetic diversity observed in Cladocopium and its dominance in the Indo-Pacific ocean55,56.

Durusdinium are known for their high tolerance to thermal stress57,58. The molecular basis of this resilience has been linked to high proportions of unsaturated fatty acids in the cell membranes, protein folding, and chloroplast proteins involved in photosynthesis or constituents of the thylakoid membrane7. In this study we did not find any overrepresentation in D-specific of GO terms or Pfam domains annotated with plastid-related functions. However, the GO term Unsaturated fatty acid elongation is overrepresented in D-specific gene families. Among the overrepresented protein domains are a transcription factor DNA binding domain that regulates expression of heat shock proteins, and a heat shock protein (HSP20), both involved in protein folding in response to thermal stress (Supplementary Tables S26 and S27).

3.3.7. Impact of taxon sampling and data amount on gene-family analysis

Unbalanced taxon sampling is a universal issue in comparative microbial genomics and transcriptomics. Some taxa are intrinsically species-poor (or poorly known), others are species-rich (or well-sampled), and genomes are often of different sizes and complexity. More is usually gained by learning from the world as we find it than by excluding real data to fit an idealised model. Among the datasets used in this study, Cladocopium, Breviolum, Symbiodinium, Durusdinium, Fugacium, and Effrenium are represented by 7, 4, 2, 2, 1, and 1 species respectively. To assess the impact of taxon sampling on our gene-family analysis, we systematically assessed the number of lineage-specific families across the 78,389 gene families by rarefying the Cladocopium and Breviolum datasets to two species at a time. With seven species in Cladocopium and four in Breviolum, there are a total of 126 (⁷C₂ × ⁴C₂ = 21 × 6) possible combinations, and for each combination the numbers of species representing Cladocopium, Breviolum, Symbiodinium, Durusdinium, Fugacium, Effrenium and Polarella glacialis are 2, 2, 2, 2, 1, 1 and 1 respectively. The results for these 126 assessments are shown in Fig. 3.5; the number observed based on the non-rarefied data for each lineage is denoted by a red asterisk.
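The 126 species-level subsamples can be enumerated directly, as in the minimal sketch below. The species labels are hypothetical placeholders for a seven-species lineage (X) and a four-species lineage (Y), and the toy example models only the two rarefied lineages; in the analysis itself the remaining lineages stay unchanged.

```python
from itertools import combinations, product
from math import comb

# Placeholder species labels (hypothetical): lineage X has seven species, Y has four.
lineage_x = [f"X_sp{i}" for i in range(1, 8)]
lineage_y = [f"Y_sp{i}" for i in range(1, 5)]
species_lineage = {**{s: "X" for s in lineage_x}, **{s: "Y" for s in lineage_y}}

# Every way of keeping two species from X together with two species from Y.
subsamples = [
    set(x_pair) | set(y_pair)
    for x_pair, y_pair in product(combinations(lineage_x, 2), combinations(lineage_y, 2))
]
assert len(subsamples) == comb(7, 2) * comb(4, 2) == 126


def specific_counts(family_species, species_lineage, kept_species):
    """Count lineage-specific families using only the species in kept_species."""
    counts = {}
    for species in family_species.values():
        lineages = {species_lineage[s] for s in species if s in kept_species}
        if len(lineages) == 1:
            (lineage,) = lineages
            counts[lineage] = counts.get(lineage, 0) + 1
    return counts


# Toy data: gene family -> species in which it is represented.
family_species = {
    "fam1": {"X_sp1", "Y_sp1"},
    "fam2": {"X_sp1"},
    "fam3": {"X_sp3", "Y_sp2"},
}
per_subsample = [specific_counts(family_species, species_lineage, kept) for kept in subsamples]
print(len(per_subsample))
```

Dropping a species can turn a shared family into a lineage-specific one, which is one way rarefied counts can exceed the non-rarefied ones.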

Fig. 3.5 Rarefaction analyses of taxa and gene families. Rarefication analyses for number of species per lineage and number of gene families. The red asterisks show the number of gene families specific to each lineage based on non-rarefied datasets, and the turquoise circles the results based on the rarefied data. In the species rarefication (A), each point corresponds to one of 126 possible combinations with the shown number of species per lineage. Rarefication of gene families was done at three different thresholds Z = 18,000 (B), 24,000 (C) and 30,000 (D), and each point represents one of the 500 replicates; the number of gene families after rarefication is shown in brackets for each lineage.

In general, the number of lineage-specific families based on the rarefied data is higher than for the non-rarefied data, except for Breviolum and Cladocopium. The data points from Breviolum (and probably Symbiodinium) form two distinct clouds, while those in Cladocopium range across a single broad distribution (Fig. 3.5A). This result points to biases in phylogenetic coverage and/or diversity, especially for the hyperdiverse Cladocopium. Therefore, rarefying the data, while statistically sensible, does not adequately capture the diversity of Symbiodiniaceae either within or among clades.

To further assess biases related to the amount of data (i.e. number of proteins) from each genus we rarefied clade representation in the overall families. Here, we used Effrenium (found in 17,481 families; Table 3-2) as the lower-bound reference, and Symbiodinium (found in 30,409 families; Table 3-2) as the upper-bound reference for rarefication. We denote Z as the maximum number of gene families to contain a specific clade, and set Z = 18,000 (lower bound), 24,000, and 30,000 (upper bound). For instance, at Z = 18,000 we first removed for each clade (Symbiodinium through Fugacium), members from any gene families at random until there were no more than 18,000 families that contain representatives of that clade, before assessing the number of lineage-specific families; we did this in 500 replicates. The data points for these 500 replicates are shown in Fig. 3.5B (Z = 18,000), Fig. 3.5C (Z = 24,000) and Fig. 3.5D (Z = 30,000); the results based on non-rarefied data are shown in red asterisks.
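One way to read the procedure above is sketched below: for each clade, representation is removed from randomly chosen families until the clade occurs in at most Z families, after which lineage-specific families are recounted. This is an interpretation on toy data rather than the original implementation; the function names, the fixed random seed and the toy value of Z are assumptions.

```python
import random


def rarefy(families, lineages, z, rng):
    """Return a copy of families in which each listed lineage occurs in at most z families."""
    rarefied = {f: set(present) for f, present in families.items()}
    for lineage in lineages:
        containing = sorted(f for f, present in rarefied.items() if lineage in present)
        excess = len(containing) - z
        if excess > 0:
            # Drop the lineage from randomly chosen families until only z remain.
            for f in rng.sample(containing, excess):
                rarefied[f].discard(lineage)
    return rarefied


def specific_counts(families):
    """Number of families represented in exactly one lineage, per lineage."""
    counts = {}
    for present in families.values():
        if len(present) == 1:
            (lineage,) = present
            counts[lineage] = counts.get(lineage, 0) + 1
    return counts


# Toy data: six families shared by S and C, four families found in C only.
toy = {f"fam{i}": {"S", "C"} for i in range(6)}
toy.update({f"fam{i}": {"C"} for i in range(6, 10)})

rng = random.Random(0)  # fixed seed (assumed) so the sketch is repeatable
replicates = [specific_counts(rarefy(toy, ["S", "C"], z=4, rng=rng)) for _ in range(500)]
print(len(replicates))
```

Families left with no representative after rarefaction are simply ignored when counting.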

In contrast to what we observed in Fig. 3.5A, the number of families specific to each lineage (i.e. each case of Fig. 3.5B, C and D) is largely consistent across replicates, with little variation. The numbers based on rarefied data are higher than in the non-rarefied data, except for Cladocopium. As expected, the numbers based on rarefied data are more similar to those based on non-rarefied data as Z is increased from 18,000 to 30,000. Interestingly, for Symbiodinium and Breviolum at Z = 24,000 these numbers converged. These observations strongly suggest that our results are less sensitive to biases in the amount of data than to species selection in each lineage.

3.4. Concluding remarks

The study of Symbiodiniaceae from a genomic perspective, using both transcriptome and genome data, has broadened our understanding of their evolution, their capability to establish symbiosis and their response to a wide variety of conditions. Here we examined the gene families of six Symbiodiniaceae genera (Symbiodinium, Breviolum, Cladocopium, Durusdinium, Effrenium and Fugacium) to identify functional attributes either shared among, or exclusive to, each of them. We also used data from the closely related species Polarella glacialis to determine which features are characteristic of Symbiodiniaceae within the order Suessiales. Gene families shared among all these Symbiodiniaceae are enriched in functions essential to the establishment and maintenance of symbiosis, and survival in a high-energy environment. At the same time, lineage-specific differences in the presence or absence of gene families, and in the enrichment of functions, offer potential for members of distinct genera to specialise in diverse environments. Our results provide a foundation for future investigation of lineage- or genus-specific adaptation of Symbiodiniaceae to their environment, and emphasise the need for more high-quality genome data from understudied Symbiodiniaceae lineages and closely related species (such as Polarella glacialis). In the next chapter, I present an in-depth genome comparison between a free-living and a symbiotic species within the basal genus Symbiodinium to assess features that are relevant to the early evolutionary transition of Symbiodiniaceae to symbiosis.

3.5. Methods

3.5.1. Data collection and preparation

We collected a total of 30 datasets (Fig. S3.1), of which 24 were selected for this study (Table 3-1) based on quality of assembled sequences and certainty of taxonomic assignment (Fig. S3.6), including the published genomes of B. minutum10, F. kawagutii11 and S. microadriaticum12, and 21 transcriptomes (19 from Symbiodiniaceae and two from Polarella glacialis) from previous studies6,7,9,14-16,20 and from the Marine Microbial Eukaryote Transcriptome Sequencing Project database (MMETSP)17. Characteristics of the datasets are summarised in Supplementary Table S1. Because different methods yield different estimates of completeness59, we compared each dataset with the 458 CEGMA genes18 (BLASTx, E ≤ 10⁻¹⁰) and the BUSCO19 datasets for eukaryotes, alveolates-stramenopiles, and protists (using BUSCO v3.0.2b and by BLASTx, E ≤ 10⁻¹⁰). Sequences in each dataset were additionally searched (BLASTn, E ≤ 10⁻¹⁰) against all bacterial genomes in RefSeq release 76 to assess the proportion of sequences from bacterial sources (putative contaminants).

Where available (i.e. for the three genomes and the transcriptomes from MMETSP), the predicted CDS and proteins were used for the analyses. For the other transcriptome data, we used TransDecoder v2.0.1 (github.com/TransDecoder) to predict CDS and proteins at default settings. Completeness of the protein datasets was assessed with CEGMA18 and BUSCO19 genes, as for the original data but using BLASTp instead of BLASTx. Details for each CDS/protein dataset are shown in Supplementary Table S1. Codon usage of full-length CDS (i.e. CDS that begin with a start codon and end with a stop codon) in each dataset was assessed using chips and cusp from the EMBOSS software suite (emboss.sourceforge.net). Proteins from Symbiodiniaceae isolates within the same genus were pooled together: three datasets in Symbiodinium, seven in Breviolum, seven in Cladocopium, two in Durusdinium, one in Effrenium and two in Fugacium. Similarly, all proteins from the two P. glacialis isolates were pooled as one. Redundant sequences from each clade pool were removed using CD-HIT60 to cluster similar sequences at default settings (sequence identity threshold = 0.90); the longest sequence in each group was kept as representative.

3.5.2. Homolog clusters

To assess protein functions, we followed Aranda et al.12 using BLASTp search (E ≤ 10⁻¹⁰) against the UniProt database (release 2016_01). Briefly, protein sequences were first searched against Swiss-Prot, and those with no matches were subsequently searched against the TrEMBL database. The UniProt identifier of the best match for each protein was used to retrieve its associated KEGG Orthology (KO)23 term using UniProtKB ID mapping release 2015_03 (ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping) and the Gene Ontology (GO)24 terms in UniProt-GOA release 163 (ftp.ebi.ac.uk/pub/databases/GO/goa). Proteins without functional annotation were clustered using orthAgogue25 v1.0.3 (e-value cut-off = 10⁻¹⁰) and MCL26 (I = 1.2, scheme 7), as recommended for extensive genetic divergence (expected among Symbiodiniaceae lineages2); here we define a group of two or more such proteins as an OM-HoG.
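The two-tier search described above (Swiss-Prot first, TrEMBL only for proteins without a Swiss-Prot match) can be sketched as a simple fallback. The hit tables, the accession strings and the use of the lowest E-value to define the best match are assumptions made for this illustration; real input would be parsed from BLASTp output.

```python
E_CUTOFF = 1e-10  # E-value threshold stated in the Methods


def best_hit(hits):
    """Return the (accession, evalue) hit with the lowest E-value at or below the cut-off."""
    passing = [hit for hit in hits if hit[1] <= E_CUTOFF]
    return min(passing, key=lambda hit: hit[1]) if passing else None


def annotate(protein, swissprot_hits, trembl_hits):
    """Try Swiss-Prot first; fall back to TrEMBL only if Swiss-Prot gives no match."""
    for source, table in (("Swiss-Prot", swissprot_hits), ("TrEMBL", trembl_hits)):
        hit = best_hit(table.get(protein, []))
        if hit is not None:
            return source, hit[0]
    return None  # unannotated: a candidate for orthAgogue/MCL clustering


# Hypothetical hit tables: protein -> list of (accession, E-value).
swissprot = {"p1": [("SP_A", 1e-30), ("SP_B", 1e-50)], "p2": [("SP_C", 1e-5)]}
trembl = {"p2": [("TR_D", 1e-12)], "p3": [("TR_E", 1e-3)]}

for protein in ("p1", "p2", "p3"):
    print(protein, annotate(protein, swissprot, trembl))
```

Proteins for which `annotate` returns `None` correspond to those passed on to clustering.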

To minimise the inclusion of sequences from potential bacterial sources (i.e. contaminants) in our analysis, we carefully selected a high-confidence set of putative homolog groups of Symbiodiniaceae and Polarella glacialis for subsequent analysis based on the schema detailed in Fig. S3.2. All KO-HoGs and UP-HoGs in which no member matched a bacterial sequence were included in subsequent analysis. For groups in which one or more members matched a bacterial sequence, we referred to the genome data of the corresponding isolate where available. We took the presence of multiple exons in these CDS as evidence of eukaryote origin; homolog groups containing any protein with such evidence were retained for subsequent analyses. For homolog groups in which one or more members matched a bacterial sequence but no genome data were available, and those for which some members had bacterial hits but no multi-exon evidence, we kept any groups in which the bacterial hits are present in two or more lineages; these are potential real dinoflagellate proteins that may have arisen through lateral genetic transfer from a bacterial source27,28. Among the non-annotated OM-HoGs, we considered only those containing proteins from two or more of the original 24 datasets.
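The selection rules in this paragraph can be written as a small decision function. This sketch follows the text rather than Fig. S3.2 itself, so the order of the checks, the `Member` record and its field names are assumptions; in particular, multi-exon evidence from any member is taken as sufficient to retain a group.

```python
from dataclasses import dataclass


@dataclass
class Member:
    lineage: str
    dataset: str
    bacterial_hit: bool = False
    multi_exon: bool = False  # only meaningful where genome data exist for the isolate


def keep_annotated_group(members):
    """Decide whether a KO-HoG or UP-HoG is retained."""
    flagged = [m for m in members if m.bacterial_hit]
    if not flagged:
        return True  # no member matches a bacterial sequence
    if any(m.multi_exon for m in members):
        return True  # multiple exons taken as evidence of eukaryote origin
    # Otherwise keep only if the bacterial hits occur in two or more lineages.
    return len({m.lineage for m in flagged}) >= 2


def keep_om_hog(members):
    """Decide whether an OM-HoG is retained: proteins from two or more datasets."""
    return len({m.dataset for m in members}) >= 2


# Hypothetical groups.
clean = [Member("S", "d1"), Member("C", "d2")]
with_exons = [Member("B", "d3", bacterial_hit=True, multi_exon=True)]
one_lineage = [Member("C", "d4", bacterial_hit=True), Member("C", "d5", bacterial_hit=True)]
two_lineages = [Member("C", "d4", bacterial_hit=True), Member("D", "d6", bacterial_hit=True)]

for group in (clean, with_exons, one_lineage, two_lineages):
    print(keep_annotated_group(group))
```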

3.5.3. Functional analysis of gene families

For our purposes here, we refer to the selected homolog clusters as gene families. We based our functional annotation on the Gene Ontology (GO) terms. Pfam domains61 were annotated for the proteins corresponding to each gene family using PfamScan62 (E ≤ 0.001). For each category in Supplementary Table S6, GO and Pfam-domain enrichment analyses were performed against N1-total as the reference background; here we consider a gene family as the unit of analysis. GO enrichment analysis was performed using the topGO Bioconductor package63 implemented in R v3.2.1, applying Fisher’s Exact test with the ‘elimination’ method to correct for the dependence structure among GO terms. A one-tailed Fisher’s Exact test was used to assess over- and under-representation of Pfam protein domains independently, with adjustment of p-values for multiple tests following Benjamini and Hochberg64.
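For the Pfam-domain test, a one-tailed Fisher's Exact test on a 2×2 table is equivalent to a hypergeometric tail probability, and the Benjamini-Hochberg adjustment is a short step-up procedure. The sketch below uses only the Python standard library and toy counts; it illustrates the statistics and is not the code used in this study. The argument names (k, n, K, N) are introduced here, and the topGO 'elimination' analysis for GO terms is not covered.

```python
from math import comb


def fisher_one_tailed(k, n, K, N, alternative="greater"):
    """One-tailed Fisher's Exact test via the hypergeometric distribution.

    N: families in the background; K: background families carrying the domain;
    n: families in the tested set; k: tested families carrying the domain.
    """
    lo, hi = max(0, n - (N - K)), min(n, K)
    if alternative == "greater":   # over-representation: P(X >= k)
        xs = range(max(k, lo), hi + 1)
    else:                          # under-representation: P(X <= k)
        xs = range(lo, min(k, hi) + 1)
    return sum(comb(K, x) * comb(N - K, n - x) for x in xs) / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):   # walk from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 background families, 4 carry the domain; 3 tested families, 2 carry it.
p_over = fisher_one_tailed(2, 3, 4, 10, "greater")
p_under = fisher_one_tailed(2, 3, 4, 10, "less")
print(p_over, p_under, benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Testing each tail separately matches the independent assessment of over- and under-representation described above.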

3.6. References

1 González-Pech, R. A., Ragan, M. A. & Chan, C. X. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodinium. Sci. Rep. 7, 15021 (2017).
2 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).
3 Pochon, X., LaJeunesse, T. & Pawlowski, J. Biogeographic partitioning and host specialization among foraminiferan dinoflagellate symbionts (Symbiodinium; Dinophyta). Mar. Biol. 146, 17-27 (2004).
4 Pochon, X. & Gates, R. D. A new Symbiodinium clade (Dinophyceae) from soritid foraminifera in Hawai’i. Mol. Phylogenet. Evol. 56, 492-497 (2010).
5 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral endosymbionts. Curr. Biol. 28, 2570-2580 (2018).
6 Bayer, T. et al. Symbiodinium transcriptomes: genome insights into the dinoflagellate symbionts of reef-building corals. PLoS ONE 7, e35269 (2012).
7 Ladner, J. T., Barshis, D. J. & Palumbi, S. R. Protein evolution in two co-occurring types of Symbiodinium: an exploration into the genetic basis of thermal tolerance in Symbiodinium clade D. BMC Evol. Biol. 12, 217 (2012).
8 Rosic, N. et al. Unfolding the secrets of coral–algal symbiosis. ISME J. 9, 844-856 (2015).
9 Parkinson, J. E. et al. Gene expression variation resolves species and individual strains among coral-associated dinoflagellates within the genus Symbiodinium. Genome Biol. Evol. 8, 665-680 (2016).
10 Shoguchi, E. et al. Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr. Biol. 23, 1399-1408 (2013).
11 Lin, S. et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science 350, 691-694 (2015).
12 Aranda, M. et al. Genomes of coral dinoflagellate symbionts highlight evolutionary adaptations conducive to a symbiotic lifestyle. Sci. Rep. 6, 39734 (2016).
13 Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res., gkr988 (2011).
14 Xiang, T., Nelson, W., Rodriguez, J., Tolleter, D. & Grossman, A. R. Symbiodinium transcriptome and global responses of cells to immediate changes in light intensity when grown under autotrophic or mixotrophic conditions. Plant J. 82, 67-80 (2015).
15 Davies, S. W., Marchetti, A., Ries, J. B. & Castillo, K. D. Thermal and pCO2 stress elicit divergent transcriptomic responses in a resilient coral. Front. Mar. Sci. 3, 112 (2016).
16 Levin, R. A. et al. Sex, scavengers, and chaperones: transcriptome secrets of divergent Symbiodinium thermal tolerances. Mol. Biol. Evol. 33, 2201-2215 (2016).
17 Keeling, P. J. et al. The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLoS Biol. 12, e1001889 (2014).
18 Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-1067 (2007).
19 Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210-3212 (2015).
20 González-Pech, R. A., Vargas, S., Francis, W. R. & Wörheide, G. Transcriptomic resilience of the Montipora digitata holobiont to low pH. Front. Mar. Sci. 4 (2017).
21 Moustafa, A. et al. Transcriptome profiling of a toxic dinoflagellate reveals a gene-rich protist and a potential impact on gene expression due to bacterial presence. PLoS ONE 5, e9688 (2010).
22 Liew, Y. J., Li, Y., Baumgarten, S., Voolstra, C. R. & Aranda, M. Condition-specific RNA editing in the coral symbiont Symbiodinium microadriaticum. PLoS Genet. 13, e1006619 (2017).
23 Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000).
24 Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25-29 (2000).
25 Ekseth, O. K., Kuiper, M. & Mironov, V. OrthAgogue: an agile tool for the rapid prediction of orthology relations. Bioinformatics 30, 734-736 (2014).
26 Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575-1584 (2002).
27 Beauchemin, M. et al. Dinoflagellate tandem array gene transcripts are highly conserved and not polycistronic. Proc. Natl. Acad. Sci. U. S. A. 109, 15793-15798 (2012).
28 Chan, C. X. et al. Analysis of Alexandrium tamarense (Dinophyceae) genes reveals the complex evolutionary history of a microbial eukaryote. J. Phycol. 48, 1130-1142 (2012).
29 Bourne, D. G. & Webster, N. S. in The prokaryotes: prokaryotic communities and ecophysiology (eds Eugene Rosenberg et al.) 163-187 (Springer Berlin Heidelberg, 2013).
30 Leggat, W., Hoegh-Guldberg, O., Dove, S. & Yellowlees, D. Analysis of an EST library from the dinoflagellate (Symbiodinium sp.) symbiont of reef-building corals. J. Phycol. 43, 1010-1021 (2007).
31 Baumgarten, S. et al. Integrating microRNA and mRNA expression profiling in Symbiodinium microadriaticum, a dinoflagellate symbiont of reef-building corals. BMC Genomics 14, 704 (2013).
32 Chen, J. E., Cui, G., Wang, X., Liew, Y. J. & Aranda, M. Recent expansion of heat-activated retrotransposons in the coral symbiont Symbiodinium microadriaticum. ISME J. 12, 639-643 (2018).
33 Maumus, F. et al. Potential impact of stress activated retrotransposons on genome evolution in a marine diatom. BMC Genomics 10, 624 (2009).
34 Ramallo, E., Kalendar, R., Schulman, A. H. & Martínez-Izquierdo, J. A. Reme1, a Copia retrotransposon in melon, is transcriptionally induced by UV light. Plant Mol. Biol. 66, 137 (2007).
35 Ito, H. et al. A stress-activated transposon in Arabidopsis induces transgenerational abscisic acid insensitivity. Sci. Rep. 6, 23181 (2016).
36 ten Lohuis, M. R. & Miller, D. J. Hypermethylation at CpG‐motifs in the dinoflagellates Amphidinium carterae (Dinophyceae) and Symbiodinium microadriaticum (Dinophyceae): evidence from restriction analyses, 5‐azacytidine and ethionine treatment. J. Phycol. 34, 152- 159 (1998). 37 ten Lohuis, M. R. & Miller, D. J. Light-regulated transcription of genes encoding peridinin chlorophyll a proteins and the major intrinsic light-harvesting complex proteins in the

76

dinoflagellate Amphidinium carterae Hulburt (Dinophycae). Changes in Cytosine Methylation Accompany Photoadaptation 117, 189-196 (1998). 38 Goedken, E. R. & Marqusee, S. Folding the ribonuclease H domain of Moloney murine leukemia virus reverse transcriptase requires metal binding or a short N-terminal extension. Proteins 33, 135-143 (1998). 39 Lemay, J. et al. HuR interacts with human immunodeficiency virus type 1 reverse transcriptase, and modulates reverse transcription in infected cells. Retrovirology 5, 47 (2008). 40 Birney, E., Kumar, S. & Krainer, A. R. Analysis of the RNA-recognition motif and RS and RGG domains: conservation in metazoan pre-mRNA splicing factors. Nucleic Acids Res. 21, 5803-5816 (1993). 41 Pochon, X., Putnam, H. M. & Gates, R. D. Multi-gene analysis of Symbiodinium dinoflagellates: a perspective on rarity, symbiosis, and evolution. PeerJ 2, e394 (2014). 42 Lindemann, C. B. & Lesich, K. A. Flagellar and ciliary beating: the proven and the possible. J. Cell Sci. 123, 519-528 (2010). 43 Blaineau, C. et al. A novel microtubule-depolymerizing kinesin involved in length control of a eukaryotic flagellum. Curr. Biol. 17, 778-782 (2007). 44 Dykens, J. A., Shick, J. M., Benoit, C., Buettner, G. R. & Winston, G. W. Oxygen radical production in the sea anemone Anthopleura elegantissima and its endosymbiotic algae. J. Exp. Biol. 168, 219-241 (1992). 45 Schwarz, J. A. et al. Coral life history and symbiosis: functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata. BMC Genomics 9, 97 (2008). 46 Jernigan, K. K. & Bordenstein, S. R. Ankyrin domains across the Tree of Life. PeerJ 2, e264 (2014). 47 Nguyen, M. T. H. D., Liu, M. & Thomas, T. Ankyrin-repeat proteins from sponge symbionts modulate amoebal phagocytosis. Mol. Ecol. 23, 1635-1645 (2014). 48 Lawrence, S. A., Wilson, W. H., Davy, J. E. & Davy, S. K. Latent virus-like infections are present in a diverse range of Symbiodinium spp. (Dinophyta). 
J. Phycol. 50, 984-997 (2014). 49 Hirose, M., Reimer, J. D., Hidaka, M. & Suda, S. Phylogenetic analyses of potentially free- living Symbiodinium spp. isolated from coral reef sand in Okinawa, Japan. Mar. Biol. 155, 105-112 (2008). 50 Yamashita, H. & Koike, K. Genetic identity of free‐living Symbiodinium obtained over a broad latitudinal range in the Japanese coast. Phycol. Res. 61, 68-80 (2013).

77

51 Quistgaard, E. M., Löw, C., Guettou, F. & Nordlund, P. Understanding transport by the major facilitator superfamily (MFS): structures pave the way. Nature Reviews Molecular Cell Biology 17 (2016). 52 Mohamed, A. R. et al. Transcriptomic insights into the establishment of coral-algal symbioses from the symbiont perspective. bioRxiv, 652131 (2019). 53 Moon, S. Y. & Zheng, Y. Rho GTPase-activating proteins in cell regulation. Trends Cell Biol. 13, 13-22 (2003). 54 Nacci, D., Proestou, D., Champlin, D., Martinson, J. & Waits, E. R. Genetic basis for rapidly evolved tolerance in the wild: adaptation to toxic pollutants by an estuarine fish species. Mol. Ecol. 25, 5467-5482 (2016). 55 Stat, M., Carter, D. & Hoegh-Guldberg, O. The evolutionary history of Symbiodinium and scleractinian hosts—symbiosis, diversity, and the effect of climate change. Perspect. Plant Ecol. 8, 23-43 (2006). 56 Pochon, X. & Pawlowski, J. Evolution of the soritids-Symbiodinium symbiosis. Symbiosis 42, 77-88 (2006). 57 Rowan, R. Coral bleaching: thermal adaptation in reef coral symbionts. Nature 430, 742-742 (2004). 58 Berkelmans, R. & van Oppen, M. J. The role of zooxanthellae in the thermal tolerance of corals: a ‘nugget of hope’ for coral reefs in an era of climate change. Proc. R. Soc. Lond. B Biol. Sci. 273, 2305-2312 (2006). 59 Veeckman, E., Ruttink, T. & Vandepoele, K. Are we there yet? Reliably estimating the completeness of plant genome sequences. Plant Cell 28, 1759-1768 (2016). 60 Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006). 61 Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 32, D138-D141 (2004). 62 Li, W. et al. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 43, W580-W584 (2015). 63 topGO: enrichment analysis for Gene Ontology v.2 (2010). 64 Benjamini, Y. & Hochberg, Y. 
Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289-300 (1995).

78

3.7. Supplementary figures

Fig. S3.1 Contribution of individual datasets to clade pools. Each lineage is represented by a separate bar, and the individual datasets in each lineage pool are shown in different shades. Datasets with the largest contribution to the lineage pool are at the base of each bar, with contribution decreasing towards the top. Individual datasets are represented with letters (a-y, excluding n) following the code in Table 3-1. ID codes of genomic datasets are shown in upper case at the base of their respective lineage bar, indicative of their full-length proteins. The total number of proteins in each clade pool is displayed at the top of the bars.


Fig. S3.2 Gene-family selection criteria. Breakdown of the selection criteria for the homolog clusters found in Symbiodiniaceae and P. glacialis. The number in parentheses indicates the number of clusters that fell into each category. Red boxes represent the clusters selected as gene families.


Fig. S3.3 Correlation between amount of input data and distinct gene-family sets. Correlation of the number of proteins used as input with the total (A), specific (B) and missing (C) number of gene families for each lineage.


Fig. S3.4 G+C content distribution in coding sequences of Symbiodiniaceae. G+C content in Suessiales CDS of lineage-specific gene families (following notation in main text) compared to the CDS of all gene families for the corresponding lineage. Different colours represent different lineages.


Fig. S3.5 G+C content distribution in sequences coding for functions related to retrotransposition. G+C content in Suessiales CDS with functions involved in retrotransposition and reverse transcription compared to all CDS. Different colours represent different lineages, and grey represents all lineages combined. CDS with retrotransposition functions are noted as RT.


Fig. S3.6 Taxonomic verification of outlier datasets. (A) Overall G+C content, (B) G+C content in third codon positions and (C) effective number of codons for the complete CDS of each dataset. The shape of each data point indicates the corresponding lineage. The outlier datasets from each lineage, marked with a red arrowhead, were removed from this study. (D) Verification of genus identity for the outlier datasets with ambiguous clade assignment, based on mapping of transcripts onto published genomes of S. microadriaticum (Clade A), B. minutum (Clade B) and F. kawagutii (Clade F). As a reference for each dataset, the number of total transcripts is shown (black bar). The mapping of the CassKB8 transcriptome (Clade A) is included as a reference.
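As a minimal sketch of the first two metrics in Fig. S3.6, overall G+C content and G+C content at third codon positions can be computed from an in-frame coding sequence roughly as follows. The function names and the toy sequence are illustrative assumptions and are not taken from the original analysis.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc3_content(cds: str) -> float:
    """G+C percentage at third codon positions of an in-frame CDS."""
    return gc_content(cds[2::3])


# Toy CDS (hypothetical): ATG GCC AAG TGA
toy_cds = "ATGGCCAAGTGA"
print(gc_content(toy_cds), gc3_content(toy_cds))
```

Ambiguous bases such as N are ignored in the denominator here; a different convention would shift the values slightly.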


3.8. Supplementary tables

Supplementary tables are available from DOI: 10.1038/s41598-017-15029-w (nature.com/articles/s41598-017-15029-w#additional-information).


Chapter 4. Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium

In Chapter 3, I investigated gene functions that are common or unique to distinct lineages of Symbiodiniaceae using available genome-scale data; many of these functions are relevant to the adaptation of these organisms to distinct ecological niches, and potentially to their symbiotic lifestyle. The basal genus Symbiodinium consists of both free-living and symbiotic forms, and most dinoflagellates external to Symbiodiniaceae are free-living. Genome data from Symbiodinium therefore represent a key analysis platform to assess genome features related to the evolutionary transition from a free-living to a symbiotic lifestyle.

In this chapter, I present distinct genome features between Symbiodinium tridacnidorum (symbiotic) and Symbiodinium natans (free-living) that likely relate to their evolution into distinct lifestyles. These results are based on the comparison of newly generated high-quality de novo genome assemblies from these two species. The chapter is presented as a manuscript and addresses Aim 2 (Chapter 1, Section 1.2). The manuscript has been published as a pre-print in bioRxiv1 (DOI: 10.1101/783902) and reformatted for this thesis. Data associated with this manuscript are publicly available at cloudstor.aarnet.edu.au/plus/s/095Tqepmq2VBztd. As the lead author of this paper, I conceived the study, designed, led and conducted all computational analyses, and interpreted the results. I prepared the first draft of the manuscript and generated all figures and tables.


4.1. Abstract

Symbiodiniaceae are predominantly symbiotic dinoflagellates in corals and other reef organisms. Although previous genome studies have focused on adaptation of these dinoflagellates to symbiosis, the impacts of the transition to a symbiotic lifestyle on genome evolution of Symbiodiniaceae remain unclear. The basal genus Symbiodinium encompasses symbiotic and free-living taxa; thus, genomes of Symbiodinium isolates provide a useful analytic platform to address this knowledge gap. Here, we present de novo genome assemblies, incorporating both long- and short-read sequencing data, for the symbiotic Symbiodinium tridacnidorum CCMP2592 (genome size 1.29 Gbp) and the free-living Symbiodinium natans CCMP2548 (genome size 0.74 Gbp). These genomes display extensive sequence divergence, sharing only ~1.5% conserved regions (≥90% identity). We predicted 45,474 and 35,270 genes for S. tridacnidorum and S. natans, respectively; of the 58,541 homologous gene families, 28.46% are common to both genomes. We recovered a greater extent of gene duplication and higher abundance of repeats, transposable elements and pseudogenes in the genome of S. tridacnidorum than in that of S. natans; these features, common in other facultative symbionts, likely reflect the impact of evolutionary transition to symbiosis on the genome evolution of S. tridacnidorum, and of other symbiotic lineages of Symbiodiniaceae. These findings demonstrate that genome structural rearrangements are pertinent to distinct lifestyles in Symbiodinium and may contribute to the vast genetic diversity within the genus, and more broadly in Symbiodiniaceae. Moreover, the results from our whole-genome comparisons against a free-living outgroup support the notion that the symbiotic lifestyle is a derived trait in, and that the free-living lifestyle is ancestral to, Symbiodinium.

4.2. Introduction

Symbiodiniaceae are dinoflagellates (Order Suessiales) crucial for coral reefs because of their symbiotic relationship with corals and diverse marine organisms. Although these dinoflagellates display subtle morphological diversity, their extensive genetic variation is well-recognised, prompting the recent systematic revision to family status2,3. Sexual reproduction stages have not been directly observed in Symbiodiniaceae, but the presence of a complete meiotic gene repertoire suggests that they are able to reproduce sexually4-6. The potential sexual reproduction of Symbiodiniaceae has been used to explain their extensive genetic variation7-11. The genetic diversity in Symbiodiniaceae is in line with their broad range of symbiotic associations with other organisms, covering a broad spectrum depending on host specificity, transmission mode and permanence in the host12,13.


Furthermore, some taxa are considered free-living because they have been found only in environmental samples, and in experiments fail to infect potential hosts14-17.

The basal lineage of Symbiodiniaceae (formerly clade A) consists of two monophyletic groups, one of which has been revised as Symbiodinium sensu stricto3,18. Symbiodinium (as revised) includes a wide range of mutualistic, opportunistic and free-living forms. Symbiodinium tridacnidorum, for instance, encompasses isolates in ITS2-type A3 that are predominantly symbionts of giant clams in the Indo-Pacific Ocean3. Although the nature of this symbiosis is extracellular, they can also establish intracellular symbiosis with cnidarian hosts both in experimental settings and in nature19. On the other hand, Symbiodinium natans (the type species of the genus) is free-living. S. natans occurs frequently in environmental samples, exhibits a widespread distribution and, thus far, has not been shown to colonise cnidarian hosts3,20.

Symbiosis, or the lack thereof, has been predicted to impact genome evolution of Symbiodiniaceae13. Most symbiotic Symbiodiniaceae are thought to be facultative to some extent, with the potential to shift between a free-living motile stage (i.e. mastigote form) and a spherical symbiotic stage (i.e. coccoid form). The genomes of facultative and recent intracellular symbionts and parasites are usually very unstable, with extensive structural rearrangements, intensified activity of transposable elements (TEs) and exacerbated gene duplication that leads to the accumulation of pseudogenes21,22. Symbiotic Symbiodiniaceae are thus expected to display similar genomic features.

In this study, we compared newly generated genome data of S. tridacnidorum CCMP2592 and S. natans CCMP2548 to identify features that are potentially pertinent to their evolution into different lifestyles. We found extensive genome-sequence divergence and few shared families of predicted genes between the two species. A greater extent of gene duplication, and the higher abundance of TEs and pseudogenes in S. tridacnidorum relative to S. natans suggest that duplication and transposition underpin genome divergence between these two species. These genome features may reflect impacts of the early evolutionary transition of Symbiodinium (and more broadly of Symbiodiniaceae) to an intracellular lifestyle, similar to those observed in other facultative symbionts.


4.3. Results and discussion

4.3.1. Genome sequences and predicted genes of S. tridacnidorum and S. natans

The genome sequences of S. tridacnidorum CCMP2592 and S. natans CCMP2548 were assembled de novo using both short- and long-read sequence data (Table 4-1, Supplementary Table 1). The estimated genome size is 1.29 Gbp for S. tridacnidorum and 0.74 Gbp for S. natans (Supplementary Table 2); the latter is the smallest reported for any Symbiodiniaceae genome to date. Using an integrative gene-prediction workflow tailored for dinoflagellate genomes (see Methods), we predicted 45,474 high-quality gene models in S. tridacnidorum and 35,270 in S. natans (Table 4-2). The gene repertoire for each genome is more complete (85.15% and 83.41% recovery of core conserved eukaryote genes23 in S. tridacnidorum and S. natans, respectively) than other Symbiodinium genomes (<79% recovery; Fig. S4.1).

Table 4-1 Statistics of hybrid genome assemblies. Statistics of the hybrid de novo genome assemblies of S. natans CCMP2548 and S. tridacnidorum CCMP2592.

Metric                        S. tridacnidorum    S. natans
Overall G+C (%)               51.01               51.79
Number of scaffolds           6245                2855
Assembly length (bp)          1,103,301,044       761,619,964
N50 scaffold length (bp)      651,264             610,496
Max. scaffold length (Mbp)    4.01                3.40
Number of contigs             7913                4262
N50 contig length (bp)        356,695             358,021
Max. contig length (Mbp)      2.96                2.90
Gap (%)                       0.02                0.02
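For readers less familiar with the N50 values in Table 4-1, the statistic can be sketched as follows: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy scaffold lengths (hypothetical), total 100 bp
print(n50([40, 30, 15, 10, 5]))  # cumulative 40, then 70 >= 50, so N50 is 30
```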


Table 4-2 Gene statistics. Statistics of predicted genes from S. tridacnidorum and S. natans genomes.

Statistic                                                      S. tridacnidorum    S. natans
Genes
  Number of genes                                              45,474              35,270
  Mean gene (exons + introns) length (bp)                      10647.95            8779.96
  Mean CDS length (bp)                                         2033.50             1660.13
  Gene content (total gene length/total assembly length, %)    43.87               40.66
  CDS G+C (%)                                                  57.32               58.16
  Supported by transcript data (%)                             61.73               82.99
Exons
  Average number per gene                                      16.15               15.66
  Average length (bp)                                          125.89              106.00
  Total length (bp)                                            92,471,373          58,552,877
Introns
  Number of genes with introns                                 40,282              30,171
  Average length (bp)                                          568.48              485.61
  Total length (bp)                                            391,733,376         251,116,222
  G+C (%)                                                      50.20               51.33
Intron-exon boundaries
  5′-donor splice sites (%)
    GC (canonical)                                             56.38               58.04
    GT (non-canonical)                                         25.71               23.60
    GA (non-canonical)                                         17.91               18.36
  Nucleotide after the AG 3′-acceptor splice sites (%)
    G                                                          96.53               97.09
    A                                                          1.98                1.75
    T                                                          0.92                0.78
    C                                                          0.57                0.38
Intergenic regions
  Average length (bp)                                          11,467.68           11,585.13
  G+C (%)                                                      50.20               51.50


4.3.2. Genomes of S. natans and S. tridacnidorum are highly divergent

The genomes of S. tridacnidorum and S. natans are highly dissimilar from one another (Fig. 4.1). Only 14.70 Mbp (1.33%) of the genome sequence of S. tridacnidorum aligned to 11.84 Mbp (1.55%) of that of S. natans at 90% identity or greater. Most aligned genomic regions are short (<100 bp, Fig. 4.1A). About half of these regions represent repeats, and another ~40% represent genic regions that are common to both species (Fig. 4.1B). We observed a low mapping rate (<15%) of read pairs from each genome dataset against the genome assembly of the counterpart (Fig. 4.1C). Using all predicted genes, we inferred 58,541 gene families (including 26,649 single-copy genes), many of which are exclusive to one species (Fig. 4.1D), e.g. 25,700 are specific to S. tridacnidorum. However, the predominant gene functions are conserved, as shown by the top ten most abundant protein domains encoded in the genes from both species (Fig. 4.1E). The composition of repetitive elements differs between the two genomes. Simple repeats and long interspersed nuclear elements (LINEs), for instance, make up a smaller proportion of the genome of S. tridacnidorum than of that of S. natans (Fig. 4.1F). Conversely, long terminal repeats (LTRs) and DNA transposons are more prominent in S. tridacnidorum.
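The aligned proportions quoted above can be reproduced from the aligned lengths and the assembly lengths in Table 4-1. This sketch assumes, as the numbers suggest, that the percentages are relative to assembly length rather than to estimated genome size; the variable and function names are illustrative.

```python
# Assembly lengths (bp) from Table 4-1
assembly_bp = {"S. tridacnidorum": 1_103_301_044, "S. natans": 761_619_964}
# Aligned lengths (Mbp) at >=90% identity, as reported in the text
aligned_mbp = {"S. tridacnidorum": 14.70, "S. natans": 11.84}


def aligned_percent(aligned_mbp_value: float, assembly_length_bp: int) -> float:
    """Percentage of an assembly covered by the aligned regions."""
    return 100.0 * aligned_mbp_value * 1e6 / assembly_length_bp


for species, mbp in aligned_mbp.items():
    print(species, round(aligned_percent(mbp, assembly_bp[species]), 2))
```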

4.3.3. Duplication events and transposable elements contribute to the divergence between S. tridacnidorum and S. natans genomes

We further assessed the distinct genome features in each species that may have contributed to the discrepancy in genome sizes. Specifically, we assessed, for each feature, the ratio (Δ) of the total length of the implicated sequence regions in S. tridacnidorum to the equivalent length in S. natans (Fig. 4.2). The genome size estimate for S. tridacnidorum is 1.74 times that for S. natans (Supplementary Table 2); we use this ratio as a reference for comparison. Most of the examined genome features span a larger region in the genome of S. tridacnidorum, as expected. The Δ for each inspected genic feature (even for exons and introns separately) approximates 1.74. However, six features related to duplicated genes and repetitive elements have Δ > 1.74. This observation suggests that gene duplication and repeats likely expanded in S. tridacnidorum (and/or contracted in S. natans), contributing to the genome-size discrepancy.
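The ratio Δ described above can be illustrated with the exon and intron totals from Table 4-2 and the estimated genome sizes given in the text. This is only a sketch under those inputs: the values plotted in Fig. 4.2 may have been derived from differently defined regions, and the names used here are illustrative.

```python
import math

# Total feature lengths (bp) from Table 4-2: (S. tridacnidorum, S. natans)
feature_lengths = {
    "exons": (92_471_373, 58_552_877),
    "introns": (391_733_376, 251_116_222),
}
# Reference ratio from the estimated genome sizes (Gbp) reported in the text
reference_delta = 1.29 / 0.74


def delta(length_st: float, length_sn: float) -> float:
    """Ratio of the total feature length in S. tridacnidorum to that in S. natans."""
    return length_st / length_sn


for name, (st, sn) in feature_lengths.items():
    d = delta(st, sn)
    flag = "above" if d > reference_delta else "at or below"
    print(f"{name}: delta = {d:.2f} (log2 = {math.log2(d):.2f}), {flag} the reference")
```

A feature with Δ above the reference would be flagged as a likely contributor to the genome-size difference, mirroring the reasoning in the paragraph above.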


Fig. 4.1 Comparison of S. tridacnidorum and S. natans genomes. (A) Density polygon of the similarity between aligned genome sequences of S. tridacnidorum and S. natans as a function of the length of the aligned region in the query sequence. (B) Proportion of distinct genome features (by sequence length) among the aligned regions between the two genomes, i.e. the overlap of the sequences that are similar between both genomes with predicted genes and repetitive elements. (C) Mapping rate of filtered read pairs generated for each species against its own assembled genome sequence and against that of the counterpart. ‘St’: S. tridacnidorum, ‘Sn’: S. natans. (D) Homologous gene families for the two genomes, showing the number of shared families and those that are exclusive to each genome. (E) Top ten most-abundant protein domains recovered, sorted in decreasing relative abundance (from bottom to top) among proteins of S. tridacnidorum (left) and those of S. natans (right). The abundance for each domain in both genomes is shown in each chart for comparison. Domains common among the top ten most abundant for both species are connected with a line between the charts. ‘MORN’: MORN repeat, ‘RCC1’: Regulator of chromosome condensation repeat, ‘RVT’: reverse transcriptase, ‘DUF’: domain of unknown function, ‘PPR’: pentatricopeptide repeat, ‘EFH’: EF-hand, ‘IonTr’: ion transporter, ‘Pkin’: protein kinase, ‘Ank’: ankyrin repeat, ‘DNAmet’: C-5 cytosine-specific DNA methylase. (F) Composition of sequence features for each of the two genomes, showing the percentage of sequences (by length) associated with distinct types of repetitive elements. ‘St’: S. tridacnidorum, ‘Sn’: S. natans.


Tandem duplication of exons and genes is common in dinoflagellates, and may serve as an adaptive mechanism to enhance functions relevant for their biology24,25. Whereas genes in tandem arrays can have hundreds of copies in some dinoflagellates, e.g. up to 5000 copies of the peridinin-chlorophyll a-binding protein (PCP) gene in Lingulodinium polyedra26, these arrays are not as prominent in the genomes of S. tridacnidorum and S. natans (Fig. S4.2), with the largest array comprising 10 and 13 gene copies, respectively. In the 13-gene array of S. natans, one copy encodes a full-length alpha amylase, whereas the remaining 12 copies are fragments of this gene and likely not functional. On the other hand, the 10-gene block in S. tridacnidorum contains genes encoding PCP; of these, seven contain duplets of PCP domains, lending support to the previous finding of the origin of a PCP form by duplication in Symbiodiniaceae27; the remaining three copies contain 1, 6 and 14 PCP domains, respectively. An additional gene, not part of the tandem array, contains another PCP duplet. The total of 37 individual PCP domains (35 in the gene cluster and two in a separate duplet) supports the earlier size estimate (36 ± 12) of the PCP family in a genome of Symbiodiniaceae28. In stark contrast, we recovered only one duplet of PCP domains among all predicted proteins of S. natans.

The length of duplicated gene blocks is drastically longer in S. tridacnidorum than in S. natans (Δ = 6.32; Fig. 4.2). This observation, and the number of gene-block duplicates in each of the two species, suggests that segmental duplication has occurred more frequently during the course of genome evolution of S. tridacnidorum. We found 23 syntenic collinear blocks within the S. tridacnidorum genome (i.e. within-genome duplicated gene blocks) implicating 242 genes in total. Of these genes, 20 encode protein kinase functions (Supplementary Table 3) that are associated with distinct signalling pathways. In comparison, only five syntenic collinear blocks implicating 62 genes were found in the S. natans genome; these genes largely encode functions of cation transmembrane transport, relevant for the maintenance of pH homeostasis. Ankyrin and pentatricopeptide repeats are common in the predicted protein products of duplicated genes in both genomes.

Retroposition is another gene-duplication mechanism known to impact genome evolution of Symbiodiniaceae and other dinoflagellates24,29. To survey retroposition in genomes of S. tridacnidorum and S. natans, we searched for relicts of the dinoflagellate spliced-leader (DinoSL) sequence in upstream regions of all predicted genes. Since the DinoSL is attached to transcribed genes by trans-splicing30, genes containing these relicts represent the primary evidence of retroposition into the genome. We found 412 and 252 genes with conserved DinoSL relicts in S. tridacnidorum and S. natans, respectively. Genes with higher expression levels have been assumed to be more prone to be retroposed into the genome31. The identified retroposed genes in the two species encode distinct functions based on the annotated Gene Ontology (GO) terms (Fig. 4.3A). This observation may be attributed to the preferential expression of functions that are (or were) relevant to each species. For instance, peptide antigen binding (GO:0042605) might be important for host recognition in S. tridacnidorum32.
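To illustrate the idea of screening upstream regions for DinoSL relicts, the sketch below scans a sequence for exact matches to the 22-nt DinoSL consensus written as DNA. It is not the procedure used in this study (which is described in the Methods); the consensus string, the exact-match criterion, the function name and the toy sequence are all assumptions made for illustration.

```python
import re

# 22-nt DinoSL consensus written as DNA; D is the IUPAC code for A, G or T.
# Treat this consensus as an assumption to be checked against the Methods.
DINOSL = "DCCGTAGCCATTTTGGCTCAAG"
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "D": "[AGT]"}


def find_dinosl(upstream: str, consensus: str = DINOSL):
    """Return 0-based start positions of exact consensus matches."""
    pattern = re.compile("".join(IUPAC[base] for base in consensus))
    return [match.start() for match in pattern.finditer(upstream.upper())]


# Hypothetical upstream region with one embedded DinoSL copy
upstream_region = "ttgaca" + "TCCGTAGCCATTTTGGCTCAAG" + "acgtatg"
print(find_dinosl(upstream_region))  # the embedded copy starts at index 6
```

Real relicts are often truncated or degenerate, so an alignment-based search with a mismatch allowance would be more sensitive than this exact-match scan.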

Fig. 4.2 Contribution of genomic features to distinct composition of S. tridacnidorum and S. natans genomes. Each genome feature was assessed based on the ratio (Δ) of the total length of the implicated sequence region in S. tridacnidorum to the equivalent length in S. natans, shown in log2-scale. The ratio of the estimated genome sizes is shown as reference (marked with a dashed line). The untransformed Δ for each feature is shown in its corresponding bar. A genome feature with Δ greater than the reference likely contributed to the discrepancy of genome sizes. Bars are coloured based on the genome in which they are more abundant as shown in the legend.


Fig. 4.3 Overrepresented functions in retroposed and RT-genes. GO molecular functions enriched in (A) genes with conserved DinoSL relicts in their upstream regions and (B) genes coding for reverse transcriptase domains (RT-genes).

Both retroposition and retrotransposition have been reported to contribute to gene-family expansion in Symbiodiniaceae33. Protein domains with functions related to retrotransposition were overrepresented in gene products of S. tridacnidorum relative to those of S. natans (Supplementary Table 4). However, reverse transcriptase domains (PF00078 and PF07727) are abundant in both; they were found in 1313 predicted proteins in S. tridacnidorum and 591 in S. natans.

Retrotransposons can accelerate mutation rate34 and alter the architecture of genes in their flanking regions35, and may explain the emergence of genes coding for reverse transcriptase domains (RT-genes) in these genomes. Other domains found in these proteins are involved in diverse cellular processes including ubiquitin-mediated proteolysis, DNA methylation, transmembrane transport and photosynthesis (Fig. 4.3B, Supplementary Table 5). The lack of overlap between functions enriched in genes containing DinoSL relicts and those in RT-genes indicates that retroposition and retrotransposition are independent processes. The abundance of repeats characteristic of TEs (such as LINEs and LTRs; Fig. 4.2) further supports the enhanced activity of retrotransposition in S. tridacnidorum. Although LINEs display high sequence divergence (Kimura distance36 20-30), potentially a remnant from an ancient burst of this type of element common to all Suessiales5,24, most LTRs and DNA transposons are largely conserved (Kimura distance < 5), suggesting that they may be active (Fig. 4.4). We note that these conserved LTRs and DNA transposons were recovered only in our hybrid genome assemblies incorporating both short- and long-read sequence data, and not in our preliminary genome assemblies based solely on short-read data (Fig. S4.3, Supplementary Table 6). This indicates that these conserved, repetitive regions can be resolved only using long-read sequence data (Fig. S4.4), highlighting the importance of long-read data in generating and assembling dinoflagellate genomes.
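For orientation, repeat landscapes commonly report the Kimura two-parameter distance, K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitions and transversions between a repeat copy and its consensus, usually expressed as a percentage. The sketch below assumes this standard formula; any further adjustment applied in this study (for example for CpG sites) is not specified in this section, and the function name and counts are hypothetical.

```python
import math


def kimura_2p(transitions: int, transversions: int, aligned_sites: int) -> float:
    """Kimura two-parameter distance (substitutions per site)."""
    p = transitions / aligned_sites    # proportion of transitions
    q = transversions / aligned_sites  # proportion of transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical repeat copy: 1000 aligned sites, 30 transitions, 15 transversions
k = kimura_2p(30, 15, 1000)
print(f"Kimura distance: {100 * k:.1f}%")
```

Under this reading, copies with a distance below 5 have diverged little from their consensus, which is why they are interpreted as recent and possibly active insertions.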

Fig. 4.4 Interspersed repeat landscapes of S. tridacnidorum and S. natans. Interspersed repeat landscapes of S. natans (A) and S. tridacnidorum (B). The colour code of the different repeat classes is shown at the bottom of the charts.

4.3.4. High divergence among gene copies counteracts gene-family expansion in S. tridacnidorum

Duplicated genes can experience distinct fates37,38. These fates can result in different scenarios depending on the divergence accumulated in the sequences. First, if the function remains the same or changes slightly (e.g. through subfunctionalisation), the duplicated gene sequences will remain similar, resulting in gene-family expansion. We assessed the difference in gene-family sizes between S. tridacnidorum and S. natans using Fisher’s exact test (see Methods), and consider those with an adjusted p ≤ 0.05 as significantly different (Fig. 4.5). Although events contributing to the increase of gene-copy numbers appear more prevalent in S. tridacnidorum, gene families are not drastically larger than those in S. natans; only 20 families are significantly larger. Of these 20 families, one (OG0000004) putatively encodes protein kinases and glycosyltransferases that are necessary for the biosynthesis of glycoproteins, and another (OG0000013) encodes ankyrin and transport proteins (Supplementary Table 7). These functions are important for the recognition of and interaction with the host among symbiodiniacean symbionts39-41. In comparison, five gene families were significantly larger in S. natans than in S. tridacnidorum, of which one (OG0000003) encodes a sodium transporter and another (OG0000034) a transmembrane protein. Many genes in the expanded families encode retrotransposition functions in both genomes, lending support to the contributing role of retrotransposons in gene-family expansion in Symbiodiniaceae33. Although the functions of many other genes in these families could not be determined due to the lack of known similar sequences, they might be relevant for adaptation to specific ecological niches as previously proposed for dinoflagellates42.
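The family-size comparison can be sketched as a two-sided Fisher's exact test per gene family followed by a multiple-testing adjustment. The details below are assumptions rather than the documented procedure (see Methods): the 2 x 2 table contrasts genes inside and outside a family in each species, the adjustment is Benjamini-Hochberg, and the family counts in the example are hypothetical. Only the total gene numbers come from the text.

```python
from math import exp, lgamma


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell
        return exp(_log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Total predicted genes per species (from the text)
TOTAL_ST, TOTAL_SN = 45_474, 35_270
# Hypothetical family sizes: (genes in S. tridacnidorum, genes in S. natans)
families = {"family_1": (30, 5), "family_2": (12, 10), "family_3": (4, 18)}

raw = [fisher_exact_two_sided(st, sn, TOTAL_ST - st, TOTAL_SN - sn)
       for st, sn in families.values()]
for name, p_adj in zip(families, benjamini_hochberg(raw)):
    print(name, "significant" if p_adj <= 0.05 else "not significant")
```

Log-gamma terms are used instead of exact binomial coefficients so that the test stays fast with gene totals in the tens of thousands.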

Second, if novel beneficial functions of the gene copies emerge (i.e. neofunctionalisation), the sequence divergence between gene copies may become too large for them to be recognised as the same family. This scenario could, at least partially, explain the higher number of single-copy genes exclusive to S. tridacnidorum (25,649) than of those exclusive to S. natans (16,137). Whereas 13,320 (82.54%) of the 16,137 single-copy genes of S. natans are supported by transcriptome evidence, only 13,189 (51.42%) of the 25,649 in S. tridacnidorum are. It remains unclear whether the latter represent functional genes. Moreover, the annotated functions of these single-copy genes exclusive to each genome are similar in both species (Supplementary Table 8), suggesting the presence of highly diverged homologs.

Finally, duplicated genes can undergo loss of function (i.e. nonfunctionalisation or pseudogenisation). Pseudogene screening in both genomes (see Methods) identified 183,516 putative pseudogenes in S. tridacnidorum and 48,427 in S. natans. The nearly four-fold difference in the number of pseudogenes between the two genomes further supports the notion that more-frequent duplication events occur in S. tridacnidorum, and may explain the lower proportion of genes with transcript support in this species (Table 4-2). Our results suggest that the high sequence divergence of duplicated genes, potentially due to the accumulation of mutations as a consequence of pseudogenisation, perhaps together with neofunctionalisation, may hinder gene family expansion in the genome of S. tridacnidorum.
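The proportions and the fold difference quoted in the two preceding paragraphs can be re-derived from the reported counts. The following is a minimal sketch for checking that arithmetic; the variable and function names are illustrative only and are not part of the analysis pipeline.

```python
# Counts as reported in the text: (transcript-supported, total) single-copy genes.
single_copy = {
    "S. natans": (13_320, 16_137),
    "S. tridacnidorum": (13_189, 25_649),
}
# Putative pseudogenes reported for each genome.
pseudogenes = {"S. tridacnidorum": 183_516, "S. natans": 48_427}


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Percentage of exclusive single-copy genes with transcript support.
support = {species: percent(*counts) for species, counts in single_copy.items()}

# Ratio of pseudogene counts (S. tridacnidorum over S. natans).
fold = pseudogenes["S. tridacnidorum"] / pseudogenes["S. natans"]

print(support)
print(round(fold, 2))
```

The ratio of pseudogene counts falls just below four, consistent with the "nearly four-fold" wording above.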

Fig. 4.5 Relative gene-family sizes in S. tridacnidorum and S. natans. Volcano plot comparing gene-family sizes against Fisher’s exact test significance (p-value). The colour of the circles indicates the species in which those gene families are larger according to the top-right legend. The number of gene families with the same ratio and significance is represented with the circle size following the bottom-right legend. Filled circles represent size differences that are considered statistically significant (adjusted p ≤ 0.05).

4.3.5. Gene functions of S. tridacnidorum and S. natans are relevant to their lifestyle

According to our analysis of enriched gene functions in S. tridacnidorum relative to S. natans based on annotated GO terms, methylation and the biosynthesis of histidine and peptidoglycan were among the most significant (Supplementary Table 9). The enrichment of methylation is not surprising because retrotransposons of Symbiodiniaceae are known to have acquired methyltransferase domains, likely contributing to the hypermethylated nuclear genomes of these dinoflagellates43. The link between the extent of methylation in symbiodiniacean genomes and its representation among predicted genes can be further assessed using methylation sequencing.

Although some corals can synthesise histidine de novo, metazoans generally lack this capacity44. The enrichment of histidine biosynthesis in S. tridacnidorum may be a result of host-symbiont coevolution or, alternatively, may explain why this species is a preferred symbiont over others (e.g. S. natans). Biosynthesis of peptidoglycans is also important for symbiosis, because these molecules, on the cell surface of Symbiodiniaceae, interact with host lectins as part of the symbiont recognition process32,41.

On the other hand, S. natans displays a wider range of enriched functions related to cellular processes (Supplementary Table 9), as expected for free-living Symbiodiniaceae13. One of the most significantly overrepresented gene functions is the transmembrane transport of sodium. Whereas this function is likely related to pH (osmotic) homeostasis with the extracellular environment, the occurrence of a sodium:phosphate symporter (PF02690) in tandem, exclusive to S. natans, and the abundance of a sodium:chloride symporter (PF00209) among the RT-genes (Supplementary Table 5) suggest that S. natans makes use of the Na+ differential gradient (caused by the higher Na+ concentration in seawater) for nutrient uptake in a similar fashion to the assimilation of inorganic phosphate by the malaria parasite (Plasmodium falciparum) in the Na+-rich cytosol of the host’s erythrocytes45.

4.3.6. Are features underpinning genome divergence in Symbiodiniaceae ancestral or derived?

To assess whether the genome features found in S. tridacnidorum were ancestral or derived relative to S. natans, we compared the genome sequences from both species with those from the outgroup Polarella glacialis CCMP138324, a psychrophilic free-living species closely related to Symbiodiniaceae (also in Order Suessiales). A greater genome sequence proportion of S. natans (3.38%) than that of S. tridacnidorum (0.85%) aligned to the P. glacialis genome assembly. Interestingly, the aligned regions in both cases implicate only ~5 Mbp (~0.18%) of the P. glacialis genome sequence. This observation is likely due to duplicated genome regions of S. natans that have remained highly conserved. Similarly, the average percent identity of the best-matching sequences between either of the two Symbiodinium genomes and P. glacialis is very similar (i.e. 92.13% and 92.56% for S. tridacnidorum and S. natans, respectively). Nonetheless, regions occupied by duplicated genes are recovered in larger proportions in Symbiodinium than in P. glacialis (Fig. 4.6). On the other hand, LTR retrotransposons are evidently more prominent in P. glacialis. However, these LTRs are more diverged (Kimura distances 3-8)24 than those in the two Symbiodinium (Kimura distances < 5; Fig. 4.4), indicating an independent, more-ancient burst of these elements in P. glacialis.

Fig. 4.6 Proportion of distinct elements in genomes of S. tridacnidorum, S. natans and P. glacialis. Proportion (in percentage of the sequence length) covered by different types of genome features in the hybrid assemblies of S. tridacnidorum, S. natans and P. glacialis.

4.4. Concluding remarks

We report for the first time, based on whole-genome sequence data, evidence of structural rearrangements and TEs contributing to the extensive genomic divergence between the symbiotic S. tridacnidorum and the free-living S. natans, including the discrepancy in genome sizes. In comparison, structural rearrangements and TE activity are less prominent in the genomes of S. natans and the outgroup species P. glacialis. Structural rearrangements, abundance of pseudogenes, and enhanced activity of TEs are common in facultative and recent intracellular symbionts and parasites21,22, and are expected in symbiotic Symbiodiniaceae13. Our results support this hypothesis. In this regard, our results agree with the notion that the symbiotic lifestyle is a derived trait in Symbiodinium, and that the free-living lifestyle is likely ancestral. Under this assumption, the genome proportion spanned by TEs and duplicated genes in S. natans is expected to be similar to (if not smaller than) that in the outgroup P. glacialis. However, we found the proportion of duplicated genes to be larger in S. natans (Fig. 4.6), prompting two possible explanations. First, the pervasive simple repeats in the P. glacialis genome24, independently expanded along this lineage or possibly an ancestral trait in Suessiales, drastically diminish the proportion of genic regions in the genome. Second, the free-living lifestyle of S. natans may be a derived trait in Symbiodinium, having passed through a symbiotic phase earlier in its evolutionary history. However, the robust placement of S. natans in the basal position alongside Symbiodinium pilosum (another free-living species) in the Symbiodinium phylogeny3 contradicts this less-parsimonious explanation. Additional high-quality genome data from free-living and symbiotic taxa are thus required to gain a clearer understanding of the evolutionary transition(s) between free-living and symbiotic lifestyles in Symbiodiniaceae.

The results presented in this chapter provide novel insights into genomes of Symbiodinium that might be relevant to the evolution into distinct ecological niches and lifestyles. More importantly, the genome sequences from this same genus are extremely divergent. To better understand the evolutionary transition between free-living and symbiotic lifestyles, and to assess the divergence within the basal genus Symbiodinium, more genome data from this taxon are necessary. In the next chapter, I further explore the extent of genome divergence within genus Symbiodinium, and more broadly across Symbiodiniaceae, adopting a more-comprehensive comparative analysis that incorporates newly generated genome data from other Symbiodinium isolates.

4.5. Methods

4.5.1. Symbiodinium cultures

Single-cell monoclonal cultures of two Symbiodinium (formerly Clade A) species were obtained from the Bigelow National Center for Marine Algae and Microbiota. Symbiodinium natans (strain CCMP2548) was originally collected from open ocean water in Hawaii, USA. Symbiodinium tridacnidorum (sub-type A3, strain CCMP2592) was originally recovered from a stony coral (Heliofungia actiniformis) on the Great Barrier Reef, Australia. The cultures were maintained in multiple 100-mL batches (in 250-mL Erlenmeyer flasks) in f/2 (without silica) medium (0.2 µm filter-sterilized) under a 14:10 h light-dark cycle (90 µE/m2/s) at 25 °C. The medium was supplemented with antibiotics (ampicillin [10 mg/mL], kanamycin [5 mg/mL] and streptomycin [10 mg/mL]) to reduce bacterial growth.

4.5.2. Nucleic acid extraction

Genomic DNA was extracted following the 2×CTAB protocol with modifications. Symbiodinium cells were first harvested during exponential growth phase (before reaching 10⁶ cells/mL) by centrifugation (3000 g, 15 min, room temperature (RT)). Upon removal of residual medium, the cells were snap-frozen in liquid nitrogen prior to DNA extraction, or stored at −80 °C. For DNA extraction, the cells were suspended in a lysis extraction buffer (400 µL; 100 mM Tris-Cl pH 8, 20 mM EDTA pH 8, 1.4 M NaCl), before silica beads were added. In a freeze-thaw cycle, the mixture was vortexed at high speed (2 min), and immediately snap-frozen in liquid nitrogen; the cycle was repeated 5 times. The final volume of the mixture was made up to 2% w/v CTAB (from 10% w/v CTAB stock; kept at 37 °C). The mixture was treated with RNase A (Invitrogen; final concentration 20 µg/mL) at 37 °C (30 min), and Proteinase K (final concentration 120 µg/mL) at 65 °C (2 h). The lysate was then subjected to standard extractions using equal volumes of phenol:chloroform:isoamyl alcohol (25:24:1 v/v; centrifugation at 14,000 g, 5 min, RT), and chloroform:isoamyl alcohol (24:1 v/v; centrifugation at 14,000 g, 5 min, RT). DNA was precipitated using pre-chilled isopropanol (gentle inversions of the tube, centrifugation at 18,000 g, 15 min, 4 °C). The resulting pellet was washed with pre-chilled ethanol (70% v/v), before being stored in Tris-HCl (100 mM, pH 8) buffer. DNA concentration was determined with NanoDrop (Thermo Scientific), and DNA with A230:260:280 ≈ 1.0:2.0:1.0 was considered appropriate for sequencing. Total RNA was isolated using the RNeasy Plant Mini Kit (Qiagen) following directions of the manufacturer. RNA quality and concentration were determined with an Agilent 2100 BioAnalyzer.

4.5.3. Genome sequence data generation and de novo assembly

In total, we generated 1021.63 Gbp (6.77 billion reads) of genome sequence data for S. natans and 259.57 Gbp (1.48 billion reads) for S. tridacnidorum (Supplementary Table 1). Short-read sequence data (2 × 150 bp reads) were generated using multiple paired-end (for both species) and mate-pair (for S. natans only) libraries on the Illumina HiSeq 2500 and 4000 platforms at the Australian Genome Research Facility (Melbourne) and the Translational Research Institute Australia (Brisbane). One of the paired-end libraries for S. natans (of insert length 250 bp) was designed such that the read-pairs of 2 × 150 bp would overlap. Genome size and sequence read coverage were estimated based on k-mer frequency analysis (Supplementary Table 2) as counted with Jellyfish46 v2.2.6, using only paired-end data.
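The text does not spell out how genome size was derived from the k-mer counts. As an illustration only, a commonly used estimate divides the total number of counted k-mers by the depth at the main peak of the k-mer frequency histogram. The sketch below assumes a histogram given as a mapping from k-mer depth to the number of distinct k-mers at that depth, and a hypothetical `min_depth` cut-off to exclude low-depth k-mers that mostly arise from sequencing errors; neither the function name nor the cut-off comes from this study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size (in k-mers) from a k-mer frequency histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth is an assumed cut-off that discards likely error k-mers.
    Returns (estimated_size, peak_depth).
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers approximates the sequencing coverage.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by coverage approximates the genome size.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (invented numbers) with an error tail at depths 1-2.
toy = {1: 5000, 2: 800, 18: 50, 19: 90, 20: 120, 21: 90, 22: 50, 40: 10}
size, coverage = estimate_genome_size(toy)
```

In this simple form, repeats (k-mers at multiples of the peak depth) are counted in proportion to their depth, which is why the toy entry at depth 40 contributes twice as much per k-mer as an entry at depth 20.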

Quality assessment of the raw paired-end data was done with FastQC v0.11.5 (bioinformatics.babraham.ac.uk/projects/fastqc), and subsequent processing with Trimmomatic47 v0.36. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: ILLUMINACLIP:[AdapterFile]:2:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:4:25 MINLEN:100 AVGQUAL:30; CROP and HEADCROP were run (prior to LEADING and TRAILING) when required to remove read ends with nucleotide biases. Overlapping read pairs from the library with insert size of 250 bp were merged with FLASh48 v1.2.11. Library adapters from the mate-pair data were removed with NxTrim49 v0.41. A preliminary de novo genome assembly per species was done for genome-guided transcriptome assembly (see below) with CLC Genomics Workbench v7.5.1 (qiagenbioinformatics.com) using default parameters and the merged pairs (for S. natans), the unmerged read pairs and the trim-surviving unpaired reads. The preliminary assembly of S. natans was further scaffolded with SSPACE50 v3.0 and the mate-pair filtered data.

Additionally, long-read sequence data were generated on a PacBio Sequel system at the Ramaciotti Centre for Genomics (Sydney). These data and the paired-end libraries (adding up to a coverage of 152-fold for S. natans and 200-fold for S. tridacnidorum) were used for hybrid de novo genome assembly (Supplementary Table 1) with MaSuRCA51,52 v3.3.0, following the procedure described in the manual. Except for the PacBio sub-reads, filtered to a minimum length of 5 kbp, all sequence data were input without being pre-processed, as recommended by the developer. The genome assemblies were further scaffolded with transcriptome data generated in this study (see below) using L_RNA_scaffolder53.

4.5.4. Removal of putative microbial contaminants

To identify putative sequences from bacteria, archaea and viruses in the genome scaffolds we followed the approach of Liu et al.5. In brief, we first searched the scaffolds (BLASTn) against a database of bacterial, archaeal and viral genomes from RefSeq (release 88); hits with E ≤ 10−20 and alignment bit score ≥ 1000 were considered as significant. We then calculated the proportion of bases in each scaffold covered by significant hits. Next, we assessed the added length of implicated genome scaffolds across different thresholds of these proportions, and the corresponding gene models in these scaffolds as predicted from available transcripts using PASA54 v2.3.3 (see below), with a modified script (available at github.com/chancx/dinoflag-alt-splice) that recognises an additional donor splice site (GA), and TransDecoder54 v5.2.0. This preliminary gene prediction was done on the repeat-masked genome using clean transcripts, as described below. The most-stringent sequence coverage (≥5%) was selected as the threshold for all samples, i.e. any scaffold with significant bacterial, archaeal or viral hits covering ≥5% of its length was considered as contaminant and removed from the assembly (Fig. S4.5).
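The coverage criterion described above can be sketched as follows. This is a minimal illustration, not the script used in this study: it assumes that the hits passed in have already been filtered for significance (E-value and bit score), that coordinates are 1-based and inclusive, and that overlapping hits are counted once by taking their union. The function names are hypothetical.

```python
def covered_fraction(scaffold_length, hits):
    """Fraction of a scaffold covered by the union of hit intervals.

    hits is an iterable of (start, end) pairs, 1-based and inclusive;
    reverse-strand hits given as (end, start) are normalised first.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if current_end is None or start > current_end + 1:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / scaffold_length


def is_contaminant(scaffold_length, hits, threshold=0.05):
    """Flag a scaffold whose significant hits cover at least `threshold` of it."""
    return covered_fraction(scaffold_length, hits) >= threshold
```

For example, a 10,000-bp scaffold with hits at 1-300, 200-400 and 900-1000 has 501 covered bases (5.01%) and would be flagged, whereas the same scaffold with a single hit at 1-100 (1%) would be retained.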

4.5.5. RNA sequence data generation and transcriptome assembly

We generated transcriptome sequence data for both S. tridacnidorum and S. natans (Supplementary Table 10). Short-read sequence data (2 × 150 bp reads) were generated using paired-end libraries on the Illumina NovaSeq 6000 platform at the Australian Genome Research Facility (Melbourne). Quality assessment of the raw paired-end data was done with FastQC v0.11.4 (bioinformatics.babraham.ac.uk/projects/fastqc), and subsequent processing with Trimmomatic47 v0.35. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: HEADCROP:10 ILLUMINACLIP:[AdapterFile]:2:30:10 CROP:125 SLIDINGWINDOW:4:13 MINLEN:50. The surviving read pairs were further trimmed with QUADTrim v2.0.2 (bitbucket.org/arobinson/quadtrim) with the flags -m 2 and -g to remove homopolymeric guanine repeats at the end of the reads (a systematic error of Illumina NovaSeq 6000).

Transcriptome assembly was done with Trinity55 v2.1.1 in two modes: de novo and genome-guided. De novo transcriptome assembly was done using default parameters and the trimmed read pairs. For genome-guided assembly, high-quality read pairs were aligned to the preliminary de novo genome assembly using Bowtie56 v2.2.7. Transcriptomes were then assembled with Trinity in the genome-guided mode using the alignment information, and setting the maximum intron size to 100,000 bp. Both de novo and genome-guided transcriptome assemblies from each sample were used for scaffolding (see above) and gene prediction (see below).

4.5.6. Full-length transcript evidence for gene prediction

Full-length transcripts for S. tridacnidorum and S. natans were sequenced using the PacBio Iso-Seq technology. All sequencing was conducted using the PacBio Sequel platform at the Institute for Molecular Bioscience (IMB) Sequencing Facility, The University of Queensland (Brisbane, Australia; Supplementary Table 10). Full-length cDNA was first synthesised and amplified using the TeloPrime Full-Length cDNA Amplification Kit (Lexogen) and TeloPrime PCR Add-on Kit (Lexogen) following the protocols provided in the product manuals. One synthesis reaction was performed for each sample using 821 ng from S. tridacnidorum and 1.09 µg from S. natans of total RNA as starting material. Next, 25 (S. tridacnidorum) and 23 (S. natans) PCR cycles were carried out for cDNA amplification. PCR products were divided into two fractions, which were purified using 0.5× (for S. tridacnidorum) and 1× (for S. natans) AMPure PB beads (Pacific Biosciences), and then pooled with equimolar quantities. The recovered 699 ng (S. tridacnidorum) and 761 ng (S. natans) of cDNA were used for sequencing library preparation with the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences). The cDNA from these libraries was sequenced in two SMRT cells. To generate the dinoflagellate spliced-leader (DinoSL) specific transcript library, 12 PCR cycles were carried out for both samples using the conserved DinoSL fragment (5′-CCGTAGCCATTTTGGCTCAAG-3′) as forward primer, the TeloPrime PCR 3′-primer as reverse primer, and the fraction of full-length cDNA purified with 0.5× (for S. tridacnidorum) and 1× (for S. natans) AMPure PB beads. The above-described PCR purification and sequencing library preparation methods were used for the DinoSL transcript libraries; cDNA from these libraries was sequenced in one SMRT cell per sample.

Due to the abundance of undesired 5′-5′ and 3′-3′ pairs, and to recover as much transcript evidence as possible for gene prediction, we followed two approaches (Fig. S4.6). First, the Iso-Seq 3.1 workflow (github.com/PacificBiosciences/IsoSeq3/blob/master/README_v3.1.md) was followed. Briefly, circular consensus sequences (CCS) were generated from the subreads of each SMRT cell with ccs v3.1.0 without polishing, and setting the minimum number of subreads to generate CCS (--minPasses) to 1. Removal of primers was done with lima v1.8.0 in the Iso-Seq mode, with a subsequent refinement step using isoseq v3.1.0. At this stage, the refined full-length transcripts of all SMRT cells (excluding those from the DinoSL library) were combined to be then clustered by similarity and polished with isoseq v3.1.0. High- and low-quality transcripts resulting from this approach were further used for gene prediction (see below).

For the second approach, we repeated the Iso-Seq workflow with some modifications. We polished the subreads with the Arrow algorithm and used at least three subreads per CCS with ccs v3.1.0 to generate high-accuracy CCS. Primer removal and refinement were done as explained above. The subsequent clustering and polishing steps were skipped. The resulting polished CCS and full-length transcripts were also used for gene prediction. Iso-Seq data from the DinoSL library were processed separately following the same two approaches.

4.5.7. Genome annotation and gene prediction

We adopted the same comprehensive ab initio gene prediction approach reported in Chen et al.57, using available genes and transcriptomes of Symbiodiniaceae as guiding evidence. A de novo repeat library was first derived for the genome assembly using RepeatModeler v1.0.11 (repeatmasker.org/RepeatModeler). All repeats (including known repeats in RepeatMasker database release 20180625) were masked using RepeatMasker v4.0.7 (repeatmasker.org). As direct transcript evidence, we used the de novo and genome-guided transcriptome assemblies from Illumina short-read sequence data, as well as the PacBio Iso-Seq full-length transcript data (see above). We concatenated all the transcript datasets per sample and “cleaned” them with SeqClean (sourceforge.net/projects/seqclean) and the UniVec database build 10.0. We used PASA54 v2.3.3, customised to recognise dinoflagellate alternative splice donor sites (see above), and TransDecoder54 v5.2.0 to predict coding sequences (CDS). These CDS were searched (BLASTp, E ≤ 10−20) against a protein database that consists of RefSeq proteins (release 88) and a collection of available and predicted (with TransDecoder54 v5.2.0) proteins of Symbiodiniaceae (total of 111,591,828 sequences; Supplementary Table 11). We used the analyze_blastPlus_topHit_coverage.pl script from Trinity55 v2.6.6 to retrieve only those CDS having a hit with >70% coverage of the database protein sequence (i.e. nearly full-length) for subsequent analyses.

The near full-length gene models were checked for TEs using HHblits v2.0.16 (probability = 80% and E-value = 10−5), searching against the JAMg transposon database (sourceforge.net/projects/jamg/files/databases), and TransposonPSI (transposonpsi.sourceforge.net). Gene models containing TEs were removed from the gene set, and redundancy reduction was conducted using cd-hit58,59 v4.6 (ID = 75%). The remaining gene models were processed using the prepare_golden_genes_for_predictors.pl script from the JAMg pipeline (altered to recognise GA donor splice sites; jamg.sourceforge.net). This script produces a set of “golden genes” that was used as training set for the ab initio gene-prediction tools AUGUSTUS60 v3.3.1 (customised to recognise the non-canonical splice sites of dinoflagellates, following the changes made to that available at github.com/chancx/dinoflag-alt-splice) and SNAP61 v2006-07-28. Independently, the soft-masked genome sequences were passed to GeneMark-ES62 v4.32 for unsupervised training and gene prediction. UniProt Swiss-Prot proteins (downloaded on 27 June 2018) and the predicted proteins of Symbiodiniaceae (Supplementary Table 11) were used to produce a set of gene predictions using MAKER63 v2.31.10 protein2genome; the custom repeat library was used by RepeatMasker as part of MAKER prediction. A primary set of predicted genes was produced using EvidenceModeler64 v1.1.1, modified to recognise GA donor splice sites. This package combined the gene predictions from PASA, SNAP, AUGUSTUS, GeneMark-ES and MAKER protein2genome into a single set of evidence-based predictions. The weightings used for the package were: PASA 10, MAKER protein 8, AUGUSTUS 6, SNAP 2 and GeneMark-ES 2. Only gene models with transcript evidence (i.e. predicted by PASA) or supported by at least two ab initio prediction programs were kept. We assessed completeness by querying the predicted protein sequences in a BLASTp similarity search (E ≤ 10−5, ≥50% query/target sequence cover) against the 458 core eukaryotic genes from CEGMA23. Transcript data support for the predicted genes was determined by BLASTn (E ≤ 10−5) similarity search, querying the transcript sequences against the predicted CDS from each genome. Genes for which the transcripts aligned to their CDS with at least 50% of sequence cover and 90% identity were considered as supported by transcript data.
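The rule used to retain gene models after evidence combination can be expressed compactly. The sketch below is illustrative only: it assumes that each combined gene model carries the set of predictors that support it, and it treats AUGUSTUS, SNAP and GeneMark-ES as the ab initio programs (MAKER protein2genome being protein-based evidence), which is an interpretation of the text rather than a documented setting. The function name is hypothetical.

```python
# Assumed classification of the predictors named in the text.
AB_INITIO = {"AUGUSTUS", "SNAP", "GeneMark-ES"}


def keep_gene_model(supporting_sources):
    """Return True if a gene model passes the retention rule.

    A model is kept when it has transcript evidence (predicted by PASA)
    or when at least two ab initio programs support it.
    """
    sources = set(supporting_sources)
    return "PASA" in sources or len(sources & AB_INITIO) >= 2


# Hypothetical gene models and the predictors supporting each of them.
models = {
    "gene1": ["PASA"],
    "gene2": ["AUGUSTUS", "SNAP"],
    "gene3": ["AUGUSTUS", "MAKER"],
    "gene4": ["GeneMark-ES"],
}
kept = sorted(name for name, sources in models.items() if keep_gene_model(sources))
```

Under these assumptions only `gene1` and `gene2` are retained; a model supported by one ab initio program plus protein evidence (`gene3`) is discarded.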

4.5.8. Gene-function annotation and enrichment analyses

Annotation of the predicted genes was done based on sequence similarity searches against known proteins following the same approach as Liu et al.5, in which the predicted protein sequences were used as query (BLASTp, E ≤ 10−5, minimum query or target cover of 50%) against Swiss-Prot first, and those with no Swiss-Prot hits subsequently against TrEMBL (both databases from UniProt, downloaded on 27 June 2018). The best UniProt hit with associated Gene Ontology (GO, geneontology.org) terms was used to annotate the query protein with those GO terms using the UniProt-GOA mapping (downloaded on 03/06/2019). Pfam domains65 were searched in the predicted proteins of both Symbiodinium species using PfamScan66 (E ≤ 0.001) and the Pfam-A database (release 30 August 2018). Tests for enrichment of Pfam domains were done with one-tailed Fisher’s exact tests, independently for over- and under-represented features; domains with Benjamini-Hochberg67 adjusted p ≤ 0.05 were considered significant. Enrichment of GO terms was performed using the topGO Bioconductor package (bioconductor.org/packages/release/bioc/html/topGO.html) implemented in R v3.5.1, applying Fisher’s exact test with the ‘elimination’ method to correct for the dependence structure among GO terms. GO terms with a p ≤ 0.01 were considered significant.

4.5.9. Comparative genomic analyses

Whole-genome sequence alignment was carried out with nucmer68 v4.0.0 with the hybrid genome assembly of S. natans as reference and that of S. tridacnidorum as query, and using anchor matches that are unique in the sequences from both species (--mum). Sequences from both Symbiodinium genomes were queried in the same way against the genome sequence of P. glacialis CCMP138324. Filtered read pairs (see above, Supplementary Table 1) from both species were aligned to their corresponding and counterpart genome sequences using BWA69 v0.7.13, and rates of mapping with different quality scores were calculated with SAMStat70 v1.5.1.

Groups of homologous sequences from the two Symbiodinium genomes were inferred with OrthoFinder71 v2.3.1, and considered gene families. The significance of size differences of the gene families shared by S. tridacnidorum and S. natans was assessed with a two-tailed Fisher’s exact test, correcting p-values for multiple testing with the Benjamini-Hochberg method67; difference in size was considered significant for gene families with adjusted p ≤ 0.05.
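The test described above can be illustrated with a dependency-free sketch. The 2×2 table layout is an assumption, because the text does not specify it: for each family, the number of genes inside versus outside the family is contrasted between the two genomes. The family names, counts and totals below are invented for illustration, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical input: family -> (copies in genome A, copies in genome B).
family_sizes = {"famX": (30, 8), "famY": (12, 11), "famZ": (2, 15)}
total_a, total_b = 1000, 1000  # invented totals of genes assigned to families

names = list(family_sizes)
raw = [fisher_exact_two_tailed(n_a, total_a - n_a, n_b, total_b - n_b)
       for n_a, n_b in family_sizes.values()]
adjusted = benjamini_hochberg(raw)
significant = [name for name, q in zip(names, adjusted) if q <= 0.05]
```

The exact enumeration is adequate for illustration; for genome-scale tables a statistics library would normally be preferred for speed.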

We used the predicted genes and their associated genomic positions to identify potential segmental genome duplications in both Symbiodinium species, as well as in P. glacialis. First, we used BLASTp (E ≤ 10−5) to search for similar proteins within each genome; the hit pairs were filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. Next, we ran MCScanX72 in intra-specific mode (-b 1) to identify collinear syntenic blocks of at least five genes and genes arranged in tandem within each genome separately.
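The hit-pair filter applied before collinearity analysis can be sketched as below. This is an illustration under stated assumptions: each hit is represented as a dictionary whose keys mirror BLAST+ tabular field names (`qstart`, `qend`, `qlen`, `sstart`, `send`, `slen`), coordinates are 1-based and inclusive, and cover is computed from the aligned span on each protein. The function name is hypothetical.

```python
def covers_half(hit, min_cover=0.5):
    """Return True if the alignment spans at least min_cover of the query
    or of the matched (subject) protein."""
    query_cover = (hit["qend"] - hit["qstart"] + 1) / hit["qlen"]
    subject_cover = (hit["send"] - hit["sstart"] + 1) / hit["slen"]
    return query_cover >= min_cover or subject_cover >= min_cover


# Invented examples: the first passes on query cover, the second fails on both,
# the third passes on subject cover only.
hits = [
    {"qstart": 1, "qend": 60, "qlen": 100, "sstart": 1, "send": 60, "slen": 300},
    {"qstart": 1, "qend": 40, "qlen": 100, "sstart": 1, "send": 40, "slen": 300},
    {"qstart": 1, "qend": 40, "qlen": 100, "sstart": 11, "send": 50, "slen": 70},
]
retained = [hit for hit in hits if covers_half(hit)]
```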

Identification of genes with DinoSL and pseudogenes was done in a similar way to Song et al. (2017)29. We queried the original DinoSL sequence (DCCGUAGCCAUUUUGGCUCAAG)30, excluding the first ambiguous position, against the upstream regions (up to 500 bp) of all genes in a BLASTn search, keeping the default values of all alignment parameters but with word size set to 9 (-word_size 9). Pseudogene detection was done with tBLASTn, with the predicted protein for each genome as query against the genome sequence, with the regions covered by the predicted genes masked, as target. Matched regions with ≥75% identity were considered part of pseudogenes and surrounding matching fragments were considered as part of the same pseudogene as long as they were at a maximum distance of 1 kbp from another pseudogene fragment and in the same orientation.
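The grouping of matched fragments into pseudogenes can be sketched as follows. This is a simplified illustration, not the procedure used in this study: it applies the ≥75% identity cut-off to every fragment, measures distance as the gap between the end of the growing pseudogene and the start of the next fragment, and ignores which query protein produced each match (a real implementation may need to track this). The function name and tuple layout are hypothetical.

```python
from itertools import groupby


def merge_pseudogene_fragments(fragments, min_identity=75.0, max_gap=1000):
    """Group tBLASTn-like fragments into putative pseudogenes.

    fragments: iterable of (scaffold, start, end, strand, percent_identity),
    with start <= end. Fragments on the same scaffold and strand that lie
    within max_gap bp of each other are merged into one pseudogene.
    Returns a list of (scaffold, start, end, strand).
    """
    kept = [f for f in fragments if f[4] >= min_identity]
    kept.sort(key=lambda f: (f[0], f[3], f[1]))
    pseudogenes = []
    for (scaffold, strand), group in groupby(kept, key=lambda f: (f[0], f[3])):
        current_start = current_end = None
        for _, start, end, _, _ in group:
            if current_end is not None and start - current_end <= max_gap:
                # Close enough to the previous fragment: extend the pseudogene.
                current_end = max(current_end, end)
            else:
                if current_end is not None:
                    pseudogenes.append((scaffold, current_start, current_end, strand))
                current_start, current_end = start, end
        if current_end is not None:
            pseudogenes.append((scaffold, current_start, current_end, strand))
    return pseudogenes


# Invented fragments on one scaffold.
example = [
    ("scf1", 100, 400, "+", 80.0),
    ("scf1", 900, 1200, "+", 90.0),   # 500 bp after the first: same pseudogene
    ("scf1", 5000, 5300, "+", 78.0),  # too far away: separate pseudogene
    ("scf1", 1300, 1500, "-", 85.0),  # opposite orientation: separate
    ("scf1", 2000, 2100, "+", 60.0),  # below the identity cut-off: discarded
]
merged = merge_pseudogene_fragments(example)
```

With these inputs the first two fragments are joined into a single pseudogene spanning 100-1200, and the remaining two retained fragments each form their own.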

4.6. References

1 González-Pech, R. A. et al. Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium. bioRxiv, 783902 (2019).

2 Rowan, R. & Powers, D. A. Ribosomal RNA sequences and the diversity of symbiotic dinoflagellates (zooxanthellae). Proc. Natl. Acad. Sci. U. S. A. 89, 3639-3643 (1992).

3 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral endosymbionts. Curr. Biol. 28, 2570-2580 (2018).

4 Chi, J., Parrow, M. W. & Dunthorn, M. Cryptic sex in Symbiodinium (Alveolata, Dinoflagellata) is supported by an inventory of meiotic genes. J. Eukaryot. Microbiol. 61, 322-327 (2014).

5 Liu, H. et al. Symbiodinium genomes reveal adaptive evolution of functions related to coral-dinoflagellate symbiosis. Commun. Biol. 1, 95 (2018).

6 Morse, D. A transcriptome-based perspective of meiosis in dinoflagellates. Protist (2019).

7 Baillie, B. et al. Genetic variation in Symbiodinium isolates from giant clams based on random-amplified-polymorphic DNA (RAPD) patterns. Mar. Biol. 136, 829-836 (2000).

8 Baillie, B., Monje, V., Silvestre, V., Sison, M. & Belda-Baillie, C. Allozyme electrophoresis as a tool for distinguishing different zooxanthellae symbiotic with giant clams. Proc. R. Soc. Lond. B Biol. Sci. 265, 1949-1956 (1998).

9 LaJeunesse, T. Diversity and community structure of symbiotic dinoflagellates from Caribbean coral reefs. Mar. Biol. 141, 387-400 (2002).

10 Pettay, D. T. & LaJeunesse, T. C. Long-range dispersal and high-latitude environments influence the population structure of a “stress-tolerant” dinoflagellate endosymbiont. PLoS ONE 8, e79208 (2013).

11 Thornhill, D. J., Lewis, A. M., Wham, D. C. & LaJeunesse, T. C. Host-specialist lineages dominate the adaptive radiation of reef coral endosymbionts. Evolution 68, 352-367 (2014).

12 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).

13 González-Pech, R. A., Bhattacharya, D., Ragan, M. A. & Chan, C. X. Genome evolution of coral reef symbionts as intracellular residents. Trends Ecol. Evol. (2019).

14 Quigley, K., Bay, L. K. & Willis, B. Temperature and water quality-related patterns in sediment-associated Symbiodinium communities impact symbiont uptake and fitness of juveniles in the genus Acropora. Front. Mar. Sci. 4, 401 (2017).

15 LaJeunesse, T. C. Investigating the biodiversity, ecology, and phylogeny of endosymbiotic dinoflagellates in the genus Symbiodinium using the ITS region: in search of a “species” level marker. J. Phycol. 37, 866-880 (2002).

16 Nitschke, M. R., Davy, S. K., Cribb, T. H. & Ward, S. The effect of elevated temperature and substrate on free-living Symbiodinium cultures. Coral Reefs 34, 161-171 (2015).

17 Granados-Cifuentes, C., Neigel, J., Leberg, P. & Rodriguez-Lanetty, M. Genetic diversity of free-living Symbiodinium in the Caribbean: the importance of habitats and seasons. Coral Reefs 34, 927-939 (2015).

18 Pochon, X., Montoya-Burgos, J. I., Stadelmann, B. & Pawlowski, J. Molecular phylogeny, evolutionary rates, and divergence timing of the symbiotic dinoflagellate genus Symbiodinium. Mol. Phylogenet. Evol. 38, 20-30 (2006).

19 Lee, S. Y. et al. Symbiodinium tridacnidorum sp. nov., a dinoflagellate common to Indo-Pacific giant clams, and a revised morphological description of Symbiodinium microadriaticum Freudenthal, emended Trench & Blank. Eur. J. Phycol. 50, 155-172 (2015).

20 Hansen, G. & Daugbjerg, N. Symbiodinium natans sp. nov.: a "free-living" dinoflagellate from Tenerife (Northeast-Atlantic Ocean). J. Phycol. 45, 251-263 (2009).

21 Moran, N. A. & Plague, G. R. Genomic changes following host restriction in bacteria. Curr. Opin. Genet. Dev. 14, 627-633 (2004).

22 McCutcheon, J. P. & Moran, N. A. Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 10, 13-16 (2011).

23 Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-1067 (2007).

24 Stephens, T. G. et al. Polarella glacialis genomes encode tandem repeats of single-exon genes with functions critical to adaptation of dinoflagellates. bioRxiv, 704437 (2019).

25 Bachvaroff, T. R. & Place, A. R. From stop to start: tandem gene arrangement, copy number and trans-splicing sites in the dinoflagellate Amphidinium carterae. PLoS ONE 3, e2929 (2008).

26 Le, Q. H., Markovic, P., Hastings, J. W., Jovine, R. V. M. & Morse, D. Structure and organization of the peridinin-chlorophyll a-binding protein gene in polyedra. Molecular and General Genetics MGG 255, 595-604 (1997).

27 Norris, B. J. & Miller, D. J. Nucleotide sequence of a cDNA clone encoding the precursor of the peridinin-chlorophyll a-binding protein from the dinoflagellate Symbiodinium sp. Plant Mol. Biol. 24, 673-677 (1994).

28 Reichman, J. R., Wilcox, T. P. & Vize, P. D. PCP gene family in Symbiodinium from Hippopus hippopus: low levels of concerted evolution, isoform diversity, and spectral tuning of chromophores. Mol. Biol. Evol. 20, 2143-2154 (2003).

29 Song, B. et al. Comparative genomics reveals two major bouts of gene retroposition coinciding with crucial periods of Symbiodinium evolution. Genome Biol. Evol. 9, 2037-2047 (2017).

30 Zhang, H. et al. Spliced leader RNA trans-splicing in dinoflagellates. Proc. Natl. Acad. Sci. U. S. A. 104, 4618-4623 (2007).

31 Slamovits, C. H. & Keeling, P. J. Widespread recycling of processed cDNAs in dinoflagellates. Curr. Biol. 18, R550-R552 (2008).

32 Hurst, C. J. The mechanistic benefits of microbial symbionts. Vol. 2 (Springer, 2016).

33 Lin, S. et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science 350, 691-694 (2015).

34 Quadrana, L. et al. Transposition favors the generation of large effect mutations that may facilitate rapid adaption. Nat. Commun. 10, 3421 (2019).

35 Cordaux, R. & Batzer, M. A. The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10, 691 (2009).

36 Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120 (1980).

37 Prince, V. E. & Pickett, F. B. Splitting pairs: the diverging fates of duplicated genes. Nat. Rev. Genet. 3, 827-837 (2002).

38 Lynch, M. & Conery, J. S. The evolutionary fate and consequences of duplicate genes. Science 290, 1151-1155 (2000).

39 Mohamed, A. R. et al. Transcriptomic insights into the establishment of coral-algal symbioses from the symbiont perspective. bioRxiv, 652131 (2019).

40 Davy, S. K., Allemand, D. & Weis, V. M. Cell biology of cnidarian-dinoflagellate symbiosis. Microbiol. Mol. Biol. Rev. 76, 229-261 (2012).

41 Weis, V. M. Cell biology of coral symbiosis: foundational study can inform solutions to the coral reef crisis. Integr. Comp. Biol. (2019).

42 Stephens, T. G., Ragan, M. A., Bhattacharya, D. & Chan, C. X. Core genes in diverse dinoflagellate lineages include a wealth of conserved dark genes with unknown functions. Sci. Rep. 8, 17175 (2018).

43 de Mendoza, A. et al. Recurrent acquisition of cytosine methyltransferases into eukaryotic retrotransposons. Nat. Commun. 9, 1341 (2018).

44 Ying, H. et al. Comparative genomics reveals the distinct evolutionary trajectories of the robust and complex coral lineages. Genome Biol. 19, 175 (2018).

45 Saliba, K. J. et al. Sodium-dependent uptake of inorganic phosphate by the intracellular malaria parasite. Nature 443, 582-585 (2006).

46 Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764-770 (2011).

47 Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170 (2014).

48 Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957-2963 (2011).

49 O’Connell, J. et al. NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics 31, 2035-2037 (2015).

50 Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578-579 (2011).

51 Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669-2677 (2013).

52 Zimin, A. V. et al. Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. (2017).

53 Xue, W. et al. L_RNA_scaffolder: scaffolding genomes with transcripts. BMC Genomics 14, 604 (2013).

54 Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654-5666 (2003).

55 Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644-652 (2011).

56 Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).

57 Chen, Y., González-Pech, R. A., Stephens, T. G., Bhattacharya, D. & Chan, C. X. Evidence that inconsistent gene prediction can mislead analysis of dinoflagellate genomes. J. Phycol. 56, 6-10 (2020).

58 Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).

59 Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).

60 Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435-W439 (2006).

61 Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 1 (2004).

62 Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494-6506 (2005).

63 Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).

64 Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, 1 (2008).

65 Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 32, D138-D141 (2004).

66 Li, W. et al. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 43, W580-W584 (2015).

67 Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 289-300 (1995).

68 Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

69 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760 (2009).

70 Lassmann, T., Hayashizaki, Y. & Daub, C. O. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics 27, 130-131 (2011).

71 Emms, D. M. & Kelly, S. OrthoFinder2: fast and accurate phylogenomic orthology analysis from gene sequences. bioRxiv, 466201 (2018).

72 Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49-e49 (2012).

73 Shoguchi, E. et al. Two divergent Symbiodinium genomes reveal conservation of a gene cluster for sunscreen biosynthesis and recently lost genes. BMC Genomics 19, 458 (2018).

74 Aranda, M. et al. Genomes of coral dinoflagellate symbionts highlight evolutionary adaptations conducive to a symbiotic lifestyle. Sci. Rep. 6, 39734 (2016).

75 Shoguchi, E. et al. Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr. Biol. 23, 1399-1408 (2013).

4.7. Supplementary figures

Fig. S4.1 Completeness of symbiodiniacean genomes Recovery of CEGMA Clusters of Orthologous Groups (COGs) by BLASTp similarity search (see Methods) in the predicted protein sequences of S. natans CCMP2548 and S. tridacnidorum CCMP2592, as well as in the most recently predicted protein sequences57 of other available genomes of Symbiodiniaceae: S. tridacnidorum73, Symbiodinium microadriaticum74, Breviolum minutum75, Cladocopium goreaui5, Cladocopium sp. C9273 and F. kawagutii5. The fraction of COGs for each genome, out of the 458 in CEGMA, is shown on top of its corresponding bar.

Fig. S4.2 Genes of S. tridacnidorum and S. natans in tandem arrays (A) Number of blocks of genes arranged in tandem, and (B) fraction of genes (out of the total number of genes per genome) implicated in those blocks, found in genomes of S. tridacnidorum and S. natans. (C) Breakdown of the number of genes by the number of exons they contain. (D) Gene fraction (out of the total number of genes in tandem) with transcript support. (E) Block length distribution of genes in tandem as a function of the number of gene copies implicated in the blocks. (F) Fraction of genes with functional annotation based either on sequence similarity of their protein products with sequences in UniProt Swiss-Prot or identified Pfam protein domains, or both.

Fig. S4.3 Interspersed repeat landscapes from preliminary genome assemblies of S. natans and S. tridacnidorum Interspersed repeat landscapes from the preliminary genome assemblies (based on short-read data only) of S. natans (A) and S. tridacnidorum CCMP2592 (B). The colour code of the different repeat classes (including multiple-copy genes) is shown at the bottom of the two charts.

Fig. S4.4 Genome features of other Symbiodiniaceae relative to those from S. tridacnidorum and S. natans Ratio (Δ) of the sequence length covered by genic features (A, C) and repetitive elements (B, D) from other Symbiodiniaceae genome assemblies (based on short-read sequence data) relative to those from the hybrid assemblies of S. tridacnidorum (A, B) and S. natans (C, D). The ratio between estimated genome sizes is given as reference, except for S. tridacnidorum and Cladocopium sp. C92 (Shoguchi et al. 2018)73, which lack estimates and for which total assembly length was used instead.

Fig. S4.5 Removal of contaminant sequences from genome assemblies Added length (A, D) and count (B, E) of putatively contaminant sequences, and number of implicated genes (C, F), in genome assemblies from S. tridacnidorum (A-C) and S. natans (D-F), based on different sequence cover thresholds (x-axis) of significant hits against bacterial, archaeal and viral genomes. The bars in each chart are coloured according to their corresponding query cover threshold following the colour code at the bottom.

Fig. S4.6 Workflow to generate evidence for gene prediction from full-length transcripts Diagram showing the detailed steps followed to process PacBio Iso-Seq data to generate full-length transcript evidence for gene prediction. The traditional IsoSeq 3.1 workflow (blue arrows) was followed to obtain low- and high-quality transcripts. In an alternative approach (purple arrows), circular consensus sequences (CCS) were called and polished simultaneously. These polished CCS were further trimmed and refined into full-length transcripts, skipping the clustering step. Iso-Seq sequences from the DinoSL library were processed separately from the other libraries. Boxes in dark blue represent the transcript evidence subsequently used for gene prediction, and the values in parentheses show the corresponding number of sequences from the standard (left) + the DinoSL (right) libraries for both S. tridacnidorum (St) and S. natans (Sn).

4.8. Supplementary tables

Supplementary tables are available from DOI: 10.1101/783902 (biorxiv.org/content/10.1101/783902v1.supplementary-material).

Chapter 5. Genomes of Symbiodiniaceae reveal extensive sequence divergence but conserved functions at family and genus levels

The results from this thesis work thus far have shown gene functions that are relevant to diversification of Symbiodiniaceae into distinct ecological niches (Chapter 3), and that structural rearrangements and repetitive elements are key contributors to the remarkably high divergence between genomes from two Symbiodinium species with distinct (symbiotic and free-living) lifestyles (Chapter 4). It remains unclear whether the extreme genome divergence found between the two species compared in the previous chapter represents a rare case, or whether this extent of divergence is common between taxa within genus Symbiodinium and family Symbiodiniaceae. Therefore, a more-comprehensive comparative analysis (incorporating genome data from different taxa) is necessary to better understand genome divergence and evolution of Symbiodiniaceae. In particular, data from multiple isolates of a species, and/or multiple species, would provide an excellent analysis platform to tease apart intra-genus and intra-species genome divergence in these ecologically important organisms.

Here, I assess genome-sequence divergence among Symbiodinium and among genera of Symbiodiniaceae by comparing genome data from 15 isolates: nine Symbiodinium isolates, four of other genera and two of the outgroup species Polarella glacialis. Of these datasets, five were generated from Symbiodinium isolates encompassing diverse ecological niches (in addition to the two assemblies presented in Chapter 4); in total, seven genome assemblies of Symbiodinium were generated as part of this thesis work. Genome-sequence divergence was assessed comprehensively within Order Suessiales, within Family Symbiodiniaceae and within Genus Symbiodinium, and between isolates of individual species (i.e. Symbiodinium microadriaticum and Symbiodinium tridacnidorum). In this way, this chapter addresses Aims 1, 2 and 3 of this thesis (Chapter 1, Section 1.2). The chapter is presented as a manuscript that has been published as a pre-print article in bioRxiv1 (DOI: 10.1101/800482) and reformatted for this thesis. Data generated from this work are publicly available at cloudstor.aarnet.edu.au/plus/s/095Tqepmq2VBztd. As the first author of this paper, I conceived the study, designed, led and conducted all computational analyses, and interpreted the results. I prepared the first draft of the manuscript and generated all figures and tables.

5.1. Abstract

Dinoflagellates of the family Symbiodiniaceae (Order Suessiales) are predominantly symbiotic, and many are known for their association with corals. The genetic and functional diversity among Symbiodiniaceae is well acknowledged, but the genome-wide sequence divergence among these lineages remains little known. Here, we present de novo genome assemblies of five isolates from the basal genus Symbiodinium, encompassing distinct ecological niches. Incorporating existing data from Symbiodiniaceae and other Suessiales (15 genome datasets in total), we determined pairwise whole-genome sequence similarity using alignment and alignment-free (based on k-mers) methods. We also investigated genome features that are common or unique to these Symbiodiniaceae, to genus Symbiodinium, and to the individual species S. microadriaticum and S. tridacnidorum. Our whole-genome comparisons reveal extensive sequence divergence, with no sequence regions common to all 15 datasets. Distances (estimated from k-mer similarity) among Symbiodinium isolates are similar to those between isolates of distinct genera. We observed extensive structural rearrangements among symbiodiniacean genomes; those from two distinct Symbiodinium species share the most (853) syntenic gene blocks. Functions enriched in genes core to Symbiodiniaceae are also enriched in those core to Symbiodinium. Gene functions related to symbiosis and stress response exhibit similar relative abundance in all analysed genomes. Our findings show that genetic diversity of Symbiodiniaceae extends beyond the punctual nucleotide substitutions of phylogenetic markers to structural rearrangements, even among isolates of the same species. Despite the extensive genome-sequence diversity, gene functions have remained largely conserved in Suessiales. This is the first comprehensive comparison of Symbiodiniaceae based on whole-genome sequence data, including comparisons at the intra-genus and intra-species levels.

5.2. Introduction

Symbiodiniaceae is a family of dinoflagellates (Order Suessiales) that diversified largely as symbiotic lineages, many of which are crucial symbionts for corals. However, the diversity of Symbiodiniaceae extends beyond symbionts of diverse coral reef organisms, to other putative parasitic, opportunistic and free-living forms2-5. The broad spectrum of ecological niches that Symbiodiniaceae can occupy agrees with the extensive physiological differences found among Symbiodiniaceae, not only across genera but also within6-10. Likewise, genetic divergence among Symbiodiniaceae is known to be extensive, in some cases comparable to that among members of distinct dinoflagellate orders11,12. This remarkable multi-factor diversity prompted the recent systematic revision of the group as family Symbiodiniaceae, with seven delineated genera13.

Conventionally, genetic divergence among Symbiodiniaceae has been estimated based on sequence-similarity comparison of a few conserved marker genes. However, individual genes evolve at distinct rates, as evidenced by the resolution of the Symbiodiniaceae phylogeny when these genes are used as independent markers14-20; some markers are encoded in multiple copies in a genome21. While multi-locus markers may be used to assess the high genetic diversity among Symbiodiniaceae, their use can result in phylogenetic inconsistencies, as reported for allozymes and random-amplified-polymorphic DNA (RAPD) patterns22,23. Other markers, such as microsatellites, can help tease apart finer-scale diversity among species and even populations24-32.

The diversity of whole-genome sequences among Symbiodiniaceae is less known. An early study based on DNA/DNA hybridisation detected between 30% and 70% of sequence homology among Symbiodiniaceae isolates12. A comparative study (Chapter 3) using predicted genes from available transcriptome and genome data uncovered extensive differences in gene-family numbers among the major lineages (or clades) of the family33. A more-recent investigation34 (Chapter 4) revealed little similarity between the whole-genome sequences of a symbiotic and a free-living Symbiodinium species. However, whether this sequence divergence is an isolated case within the genus (or the family), or is associated with the distinct lifestyles, remains to be investigated using more genome-scale data.

Comparative genomics studies at intra-genus and/or intra-species levels can yield novel insights into the biology of Symbiodiniaceae. For instance, a transcriptomic survey of four species (with multiple isolates per species) of Breviolum (formerly Clade B) revealed differential gene expression that is potentially associated with their prevalence in the host35. Another example is the comparison of a symbiotic and a free-living Symbiodinium mentioned above, which showed how structural rearrangements and gene duplication can contribute to the divergence between species within the same genus34. Thus, comparison of genome data from multiple isolates of the same genus, and/or of the same species, can lead to the identification of molecular mechanisms that underpin the diversification of Symbiodiniaceae at a finer resolution.

In this study, we report genome-sequence variation across Symbiodiniaceae taxa, focusing on genus Symbiodinium. Five newly generated genome datasets of Symbiodinium, encompassing distinct ecological niches (free-living, symbiotic and opportunistic) and two distinct isolates of Symbiodinium microadriaticum, were included in this study. Comparing these genomes against those available from other Symbiodinium, other Symbiodiniaceae and the outgroup species Polarella glacialis (15 datasets in total), we investigated genome features that are common or unique to the distinct lineages within a single species, within a single genus, and within Family Symbiodiniaceae. This is the most comprehensive comparative analysis to date of Symbiodiniaceae based on whole-genome sequence data.

5.3. Results and discussion

5.3.1. Genome sequences of Symbiodiniaceae

We generated draft genome assemblies de novo for Symbiodinium microadriaticum CassKB8, Symbiodinium microadriaticum 04-503SCI.03, Symbiodinium necroappetens CCMP2469, Symbiodinium linucheae CCMP2456 and Symbiodinium pilosum CCMP2461. These five assemblies, generated using only short-read sequence data, are of similar quality to previously published genomes of Symbiodiniaceae (Table 5-1 and Supplementary Table 1). The number of assembled scaffolds ranges from 37,772 for S. linucheae to 104,583 for S. necroappetens; the corresponding N50 scaffold lengths are 58,075 and 14,528 bp, respectively. The fraction of the genome recovered in the assemblies ranged from 54.64% (S. pilosum) to 76.26% (S. necroappetens) of the corresponding genome size estimated based on k-mers (Supplementary Table 2). The overall G+C content of all analysed Symbiodinium genomes is ~50% (Fig. S5.1), with the lowest (48.21%) in S. pilosum CCMP2461 and the highest (51.91%) in S. microadriaticum CassKB8.
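As an illustration of two of the assembly statistics quoted above, the following minimal sketch computes an N50 and the assembled fraction of a genome. The function names and the toy scaffold lengths are hypothetical; only the S. pilosum values are taken from Table 5-1, and the sketch is not the software used to produce the reported statistics.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


def assembled_fraction(assembly_bp, estimated_genome_bp):
    """Percentage of the estimated genome size recovered in the assembly."""
    return 100.0 * assembly_bp / estimated_genome_bp


# Toy scaffold lengths (hypothetical, for illustration only).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # total is 300; 80 + 70 reaches half, so N50 = 70

# Assembly length and estimated genome size of S. pilosum CCMP2461 (Table 5-1).
print(round(assembled_fraction(1_089_424_773, 1_993_912_458), 2))  # 54.64
```

The second call reproduces the 54.64% reported for S. pilosum from the assembly length and the k-mer-based genome-size estimate.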

For a comprehensive comparison, we included in our analysis all available genome data from Symbiodiniaceae and the outgroup species Polarella glacialis (Supplementary Table 1). These data comprise nine Symbiodinium isolates (three of the species S. microadriaticum and two of S. tridacnidorum), Breviolum minutum, two Cladocopium isolates, Fugacium kawagutii, and two Polarella glacialis isolates34,36-40 (i.e. a total of 15 datasets of Suessiales, of which 13 are of Symbiodiniaceae); we used the revised genome assemblies from Chen et al.41 where applicable. Of the 15 genome assemblies, four were generated using both short- and long-read data (those of S. natans CCMP2548, S. tridacnidorum CCMP2592 and the two P. glacialis isolates)34,40; all others were generated largely using short-read data.

Table 5-1 Symbiodinium isolates for which genome data were generated and genome assembly statistics
Details on the Symbiodinium isolates for which genome data were generated in this study, and their corresponding genome assembly statistics.

| Isolate details / assembly statistic | S. microadriaticum CassKB8 | S. microadriaticum 04-503SCI.03 | S. necroappetens CCMP2469 | S. linucheae CCMP2456 | S. pilosum CCMP2461 |
| --- | --- | --- | --- | --- | --- |
| ITS2-subtype | A1 | A1 | A13 | A4 | A2 |
| Lifestyle | Symbiotic | Symbiotic | Opportunistic | Symbiotic | Free-living |
| Host | Cassiopea sp. (jellyfish) | Orbicella faveolata (stony coral) | Condylactis gigantea (anemone) | Plexaura homomalla (octocoral) | Zoanthus sociatus (zoanthid) |
| Collection site | Hawaii (Pacific) | Florida (Atlantic) | Jamaica (Caribbean) | Bermuda (Atlantic) | Jamaica (Caribbean) |
| Overall G+C (%) | 51.91 | 50.46 | 50.85 | 50.36 | 48.21 |
| Number of scaffolds | 67,937 | 57,558 | 104,583 | 37,772 | 48,302 |
| Assembly length (bp) | 813,744,491 | 775,008,844 | 767,953,253 | 694,902,460 | 1,089,424,773 |
| N50 scaffold length (bp) | 42,989 | 49,975 | 14,528 | 58,075 | 62,444 |
| Max. scaffold length (Mbp) | 0.38 | 1.08 | 1.34 | 0.46 | 1.34 |
| Number of contigs | 167,159 | 162,765 | 157,685 | 141,380 | 142,969 |
| N50 contig length (bp) | 10,400 | 11,136 | 11,420 | 11,147 | 17,506 |
| Max. contig length (Mbp) | 0.15 | 1.05 | 1.34 | 0.19 | 1.34 |
| Gap (%) | 1.15 | 1.44 | 0.56 | 1.35 | 0.79 |
| Estimated genome size (bp) | 1,120,150,369 | 1,052,668,212 | 1,007,022,374 | 914,781,885 | 1,993,912,458 |
| Assembled fraction of genome (%) | 72.65 | 73.62 | 76.26 | 75.96 | 54.64 |

5.3.2. Isolates of Symbiodiniaceae and Symbiodinium exhibit extensive genome divergence

We assessed genome-sequence similarity based on pairwise whole-genome sequence alignment (WGA). In each pairwise comparison, we assessed the overall percentage of the query genome sequence that aligned to the reference (Q), and the average percent identity of the reciprocal best one-to-one aligned sequences (I); see Methods for detail. Our results revealed extensive sequence divergence among the compared genomes at the order (Suessiales), family (Symbiodiniaceae) and genus (Symbiodinium) levels (Fig. 5.1A). As expected, the genome-pairs that exhibit the highest sequence similarity are isolates from the same species, e.g. S. microadriaticum CassKB8 and 04-503SCI.03 (Q = 87.44%, I = 99.72%; CassKB8 as query), and the two P. glacialis isolates (Q = 97.10%, I = 98.59%; CCMP1383 as query). In contrast, genome sequences of the two S. tridacnidorum isolates appear more divergent (Q = 30.07%, I = 87.18%; CCMP2592 as query). Remarkably, some genomes within Symbiodinium are as divergent as those of distinct genera: for instance, Q = 1.10% and I = 91.88% for S. pilosum compared against S. natans as reference, and Q = 1.03% and I = 92.15% for S. tridacnidorum CCMP2592 against Cladocopium sp. C92. The genome sequences of S. microadriaticum CCMP2467 share the most genome regions with all analysed isolates (Fig. 5.1A). When compared against these sequences as reference, we did not recover any genome regions that are conserved (alignment length ≥24 bp, with >70% identity) in all analysed isolates (Fig. 5.1). At most, six isolates have genome regions aligned against the reference, all of which belong to the same genus: S. microadriaticum CassKB8, S. microadriaticum 04-503SCI.03, S. linucheae, S. tridacnidorum CCMP2592, S. natans and S. pilosum. However, the total length of the region common in these genomes is only 89 bp (Fig. 5.1B).
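To make the two WGA metrics concrete, the sketch below derives Q and I from a list of aligned query regions. It rests on assumptions that are not stated in this section: each alignment is represented as a hypothetical (query_start, query_end, percent_identity) tuple with half-open coordinates on a single concatenated query, and I is computed as a length-weighted mean. The exact definitions used in this chapter are those given in Methods.

```python
def merge_intervals(intervals):
    """Merge overlapping half-open intervals [start, end) so that no
    query base is counted more than once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def aligned_percent(alignments, query_length):
    """Q: percentage of the query sequence covered by at least one alignment."""
    spans = merge_intervals((start, end) for start, end, _ in alignments)
    covered = sum(end - start for start, end in spans)
    return 100.0 * covered / query_length


def mean_identity(alignments):
    """I: mean percent identity, weighted by aligned length (an assumption)."""
    total = sum(end - start for start, end, _ in alignments)
    return sum((end - start) * ident for start, end, ident in alignments) / total


# Hypothetical alignments on a 1,000-bp query.
toy_alignments = [(0, 100, 99.0), (50, 150, 97.0), (400, 500, 90.0)]
print(aligned_percent(toy_alignments, 1000))  # 250 of 1,000 bases covered
print(mean_identity(toy_alignments))
```

Merging intervals before summing matters whenever alignments overlap on the query; without it, Q would be overestimated.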

For each possible genome-pair, we also assessed the extent of shared k-mers (short subsequences of defined length k) between them (optimised k = 21; see Methods) from which a pairwise distance (d) was derived (Supplementary Table 3). These distances were used to infer the phylogenetic relationship of these genomes as a neighbour-joining (NJ) tree (Fig. 5.1C) and as a similarity network (Fig. S5.2). As shown in Fig. 5.1C, the most distant genome-pair (i.e. the pair with the highest d) is S. tridacnidorum CCMP2592 and B. minutum (d = 7.56). Symbiodinium isolates are about as distant from the other Symbiodiniaceae (d̄ = 7.24) as they are from the outgroup P. glacialis (d̄ = 7.23). This is surprising, in particular because P. glacialis isolates have shorter distances with the other Symbiodiniaceae (d̄ = 6.84) and Symbiodinium is considered to be more ancestral than all other genera in Symbiodiniaceae13. However, this observation may be biased by the greater representation of Symbiodinium isolates compared to any other genera of Symbiodiniaceae.
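The idea of comparing genomes through shared k-mers can be sketched as follows. This is a simplified stand-in, not the measure used in this chapter: it uses the Jaccard distance on sets of canonical k-mers, which is bounded between 0 and 1, whereas the distance d reported here is derived as described in Methods and takes values above 1. The function names are hypothetical, and k = 3 is used only to keep the toy example readable.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(sequence, k):
    """Return the set of canonical k-mers in a sequence: for each k-mer,
    the lexicographically smaller of itself and its reverse complement,
    so that both strands are treated identically."""
    sequence = sequence.upper()
    kmers = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases such as N
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, reverse_complement))
    return kmers


def jaccard_distance(seq_a, seq_b, k):
    """1 - |A and B| / |A or B| over the canonical k-mer sets of two sequences."""
    a, b = canonical_kmers(seq_a, k), canonical_kmers(seq_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# A sequence and its reverse complement share all canonical k-mers.
print(jaccard_distance("ACGTAC", "GTACGT", 3))  # 0.0
# Sequences with no k-mers in common are maximally distant.
print(jaccard_distance("AAAA", "CCCC", 3))      # 1.0
```

A matrix of such pairwise distances is the kind of input from which an NJ tree can be inferred.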

Fig. 5.1 Genome divergence among Symbiodiniaceae (A) Similarity between Symbiodiniaceae (and the outgroup P. glacialis) based on pairwise whole-genome sequence alignments. The colour of the square depicts the average percent identity of the best reciprocal one-to-one aligned regions (I) between each genome pair and the size of the square is proportional to the percent of the query genome that aligned to the reference (Q), as shown in the legend. The tree topologies on the left and bottom indicate the known phylogenetic relationship13 among the isolates. Isolates in Symbiodinium are highlighted in grey. (B) Total sequence length (y-axis) of genomic regions aligning to the reference genome assembly of S. microadriaticum CCMP2467 shared by different numbers of the datasets used in this study (x-axis). Data points represent distinct combinations of datasets, ranging from one (an individual genome dataset) to six (six datasets aligning to the same regions of the reference), and are coloured to show the genera to which they correspond; only one combination includes distinct genera (S. tridacnidorum Sh18 and Cladocopium sp. C92). (C) NJ tree based on 21-mers shared by genomes of Suessiales; branch lengths are proportional to the estimated distances (see Methods). The shortest and longest distances (d) in the tree, as well as average distances (d̄) among representative clades are shown following the bottom-left colour code. ‘Clade BCF’: clade including B. minutum, F. kawagutii and the two Cladocopium isolates. (D) Number of collinear syntenic gene blocks shared by pairs of genomes of Suessiales. Gene blocks shared by more than two isolates are not shown.

The largest distance among genome-pairs within Symbiodinium is between two free-living species, S. natans and S. pilosum (d = 5.64). These two isolates are also the most divergent from all others in the genus (d > 4.50 between either of them and any other Symbiodinium; Supplementary Table 3). The distance between S. natans and S. pilosum is similar to that observed between F. kawagutii and C. goreaui (d = 5.74), members of distinct genera. Similar to our WGA results, the shortest distances are between isolates of the same species, e.g. d = 0.77 between P. glacialis CCMP1383 and CCMP2088, and d̄ = 0.86 among S. microadriaticum isolates. However, the distance between the two S. tridacnidorum isolates (CCMP2592 and Sh18; d = 2.87) is larger than that between S. necroappetens and S. linucheae (d = 2.66). The divergence among Symbiodinium isolates is further supported by the mapping rate of paired reads (Fig. S5.3).

We used the same gene-prediction workflow, customised for dinoflagellates, for the five Symbiodinium genome assemblies generated in this study as for the other ten assemblies included in our analyses34,40,41 (Table 5-1). The number of predicted genes in these genomes ranged between 23,437 (in S. pilosum CCMP2461) and 42,652 (in S. microadriaticum CassKB8), which is similar to the number of genes (between 25,808 and 45,474) predicted in the other Symbiodiniaceae genomes (Supplementary Table 4). To further assess genome divergence, we identified conserved synteny based on collinear syntenic gene blocks (see Methods). Fig. 5.1D illustrates the gene blocks shared between any possible genome-pairs; those blocks shared by more than two genomes are not shown. S. microadriaticum CCMP2467 and S. tridacnidorum CCMP2592 share the most gene blocks (853 blocks, implicating 8589 genes). Although the two P. glacialis genomes share 346 gene blocks (2524 genes), no blocks were recovered between the genome of either P. glacialis isolate and any of S. microadriaticum CassKB8, S. microadriaticum 04-503SCI.03, S. necroappetens, C. goreaui, Cladocopium sp. C92 or F. kawagutii. The collinear gene blocks shared by P. glacialis CCMP1383 and S. microadriaticum CCMP2467 (3 blocks, 19 genes) represent the most abundant between any P. glacialis and any Symbiodiniaceae isolate. Although we cannot dismiss the impact of contiguity and completeness of the genome assemblies (Supplementary Table 1, Fig. S5.4) on our observations here (and results from the WGA and k-mer analyses above), these results provide the first comprehensive overview of genome divergence at the resolution of species, genus and family levels.

5.3.3. Remnants of transposable elements were lost in more-recently diverged lineages of Symbiodiniaceae

Fig. 5.2A shows the composition of repeats for each of the 15 genomes. The repeat composition of P. glacialis is distinct from that of Symbiodiniaceae genomes, largely due to the known prevalence of simple repeats34,40. Long interspersed nuclear elements (LINEs) in Symbiodiniaceae and in P. glacialis are highly diverged, with Kimura distance42 centred between 15 and 40; these elements likely represent remnants of LINEs from an ancient burst pre-dating the diversification of Suessiales34,37,40. Interestingly, the proportion of these elements is substantially larger in the genomes of Symbiodinium (the basal lineage) and P. glacialis (the outgroup) than in those of other Symbiodiniaceae (Fig. 5.2B). For instance, LINEs comprise between 74.10 Mbp (S. tridacnidorum Sh18) and 96.9 Mbp (S. linucheae) in each of the Symbiodinium genomes, except for those in S. pilosum that cover almost twice as much (171.31 Mbp). In comparison, LINEs cover on average 7.49 Mbp in the genomes of other Symbiodiniaceae (Fig. S5.5). This result suggests that the remnants of LINEs were lost in the more-recently diverged lineages of Symbiodiniaceae.
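For reference, the Kimura two-parameter distance between two aligned sequences is K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites showing transitions and transversions, respectively (this Q is unrelated to the alignment metric Q of Section 5.3.2). The sketch below is a minimal implementation under the assumption that the two sequences are already aligned and of equal length; the function name is hypothetical. Repeat landscapes commonly express this distance as a percentage, which is assumed to be the scale of the values 15 to 40 quoted above.

```python
import math


def kimura_2p(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.
    Sites with gaps or ambiguous bases in either sequence are ignored."""
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= purines or {a, b} <= pyrimidines:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too diverged
    # (saturated) for the distance to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: K = -0.5 * ln(0.8).
k = kimura_2p("AAAAAAAAAA", "GAAAAAAAAA")
print(round(100 * k, 2))  # distance expressed as a percentage
```

Because the correction grows faster than the raw proportion of differing sites, a copy at Kimura distance 40 differs from its consensus at fewer than 40% of sites.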

The genome of the free-living S. pilosum is an outlier among the Symbiodinium genomes. In addition to the nearly two-fold greater abundance of LINEs, the estimated genome size for S. pilosum (1.99 Gbp) is also nearly two-fold larger than the estimate for any other Symbiodinium genome (Supplementary Table 2). This suggests whole-genome duplication or potentially a more-dominant diploid stage, but we found no evidence to support either scenario (Fig. S5.6). The prevalence of repetitive regions in S. pilosum, however, would explain in part why the total assembled bases of the genome constitute only 54.64% of the estimated genome size (Supplementary Table 1).


Fig. 5.2 Repeat composition of Suessiales genomes (A) Percentage of sequence regions comprising the major classes of repetitive elements, shown for each genome assembly analysed in this study. (B) Interspersed repeat landscape for each assembled genome. Both (A) and (B) follow the colour code shown in the bottom legend.


5.3.4. Diversity of gene features within Suessiales

Differences among predicted genes of Symbiodiniaceae have been attributed to phylogenetic relationship and to the implementation of distinct gene prediction approaches41. Our Principal Component Analysis (PCA), based on metrics of consistently predicted genes (Supplementary Table 4), revealed substantial variation within the genus Symbiodinium (Fig. 5.3). We noticed that the observed variation can be associated with three main factors: (1) phylogenetic relationship, (2) the type of sequence data used for genome assembly and the consequent assembly quality, and (3) lifestyle of the isolates. The variation resulting from the phylogenetic relationship among the genomes is illustrated by the separation of the distinct genera along PC2 (explaining 24.82% of the variance). The metrics contributing the most to PC2 are associated with proportion of splice donors and acceptors (Fig. S5.7). The type of sequence data used for genome assembly and assembly quality are reflected along PC1 (explaining 42.79% of the variance). For instance, taxa for which hybrid assemblies were made (those incorporating both short-read and long-read sequence data), i.e. the free-living S. natans and P. glacialis, and the symbiotic S. tridacnidorum CCMP2592, are distributed between –4.5 and 0.1 along PC1. The distribution of the symbiotic Symbiodinium is limited (between 0.5 and 1.5 of PC1), with the exception of the two S. tridacnidorum isolates, for which the genome assemblies are of distinct quality (i.e. the high-quality hybrid assembly of CCMP2592 compared to the draft assembly of Sh18 that is fragmented and incomplete; Supplementary Table 1 and Fig. S5.4). In addition, the opportunistic S. necroappetens and free-living S. pilosum are distributed at >2 along PC1. These observations suggest that the distinct lifestyles may contribute to differences in gene architecture.
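To make the notion of "variance explained" by a principal component concrete, the following is a minimal sketch of a PCA on a small matrix of gene metrics. It is an illustration only, not the software used for Fig. 5.3: the toy values, the z-score standardisation, the use of power iteration and all function names are assumptions introduced here.

```python
import math

def standardise(rows):
    """Z-score every column (metric) so that all metrics share one scale."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def covariance(z):
    """Sample covariance matrix of the (already centred) rows in z."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, iterations=500):
    """Power iteration: eigenvalue and unit eigenvector of the leading axis."""
    p = len(cov)
    v = [float(i + 1) for i in range(p)]  # arbitrary, non-symmetric start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return eigenvalue, v

# Hypothetical toy matrix: 4 genomes (rows) x 3 gene metrics (columns).
metrics = [[1, 2, 5], [2, 4, 3], [3, 6, 6], [4, 8, 2]]
z = standardise(metrics)
cov = covariance(z)
eigenvalue, axis = leading_component(cov)
# Proportion of total variance captured by PC1 (eigenvalue over the trace).
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
# Coordinate of each genome along PC1.
scores = [sum(a * x for a, x in zip(axis, row)) for row in z]
print(f"PC1 explains {explained:.2%} of the variance")
```

The percentages reported for PC1 and PC2 above correspond to the `explained` ratio computed for the first and second axes; the `scores` are the coordinates at which each genome would be plotted.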

The predicted coding sequences (CDS) among Symbiodinium taxa exhibit biases in nucleotide composition of codon positions (Fig. S5.8) and in codon usage (Fig. S5.9). The G+C content among CDS (Supplementary Table 4) and among third codon positions (Fig. S5.8) varies slightly, but is generally higher relative to the overall G+C content (Fig. S5.1, Supplementary Table 1). This is consistent with the results previously reported for genomes and transcriptomes of Symbiodiniaceae33,43. Of all Symbiodinium isolates, S. microadriaticum CassKB8 and 04-503SCI.03 have the most CDS with a strong codon preference; S. microadriaticum CCMP2467 has the least (Fig. S5.9). These observations highlight the genetic diversity within a single genus, and within a single species.


Fig. 5.3 PCA of gene features in Symbiodiniaceae PCA displaying the diversity of predicted genes among the analysed genomes based on gene metrics (Supplementary Table 4). Data points are coloured by genus and shaped by lifestyles according to the legends to the right. Data points enclosed in a light blue area correspond to isolates with hybrid genome assemblies. Smi: S. microadriaticum, Sne: S. necroappetens, Sli: S. linucheae, Str: S. tridacnidorum, Sna: S. natans, Spi: S. pilosum, Bmi: B. minutum, Cgo: C. goreaui, Csp: Cladocopium sp. C92, Fka: F. kawagutii, Pgl: P. glacialis. Isolate name is shown in subscript for those species with more than one isolate.

5.3.5. Gene families of Symbiodiniaceae

Using all 555,682 predicted protein sequences from the 15 genomes, we inferred 42,539 homologous sets (of size ≥ 2; see Methods); here we refer to these sets as gene families. Of the 42,539 families, 18,453 (43.38%) contain genes specific to Symbiodiniaceae (Fig. 5.4). Interestingly, more (8828) gene families are specific to sequenced isolates of Symbiodinium than to sequenced isolates of the other Symbiodiniaceae combined (2043 specific to Breviolum, Cladocopium and Fugacium isolates). Although the simplest explanation is that substantially more gene families have been gained (or preserved) in Symbiodinium than in the other three genera, we cannot dismiss potential biases caused by our more-comprehensive taxon sampling for this genus. In contrast, a previous study reported substantially more gene families specific to the clade encompassing Breviolum, Cladocopium and Fugacium (26,474) than specific to Symbiodinium (3577)33. It is difficult to compare these two results because the previous study used predominantly transcriptomic data (which are fragmented and include transcript isoforms), proteins predicted with distinct and inconsistent methods, and a different approach to delineate gene families.

Fig. 5.4 Number of gene families along the phylogeny of Symbiodiniaceae Species tree inferred based on 28,116 gene families containing at least 4 genes from any Suessiales isolate using STAG44 and STRIDE45 (part of the conventional OrthoFinder pipeline46), rooted with P. glacialis as outgroup. At each node, the total number of families that include genes from one or more diverging isolates is shown in dark blue, those exclusive to one or more diverging isolates in light blue. The numbers shown for each isolate (on the right) represent numbers of gene families that include genes from (dark blue) and exclusive to (light blue) that isolate. The proportion of gene trees supporting each node is shown. Branch lengths are proportional to the number of substitutions per site.


Of all families, 2500 (5.88%) contain genes from all 15 Suessiales isolates; 4677 (10.99%) represent 14 or more isolates. We consider these 4677 as the core gene families of Suessiales. Only 406 gene families are exclusive and common to all 13 Symbiodiniaceae isolates; 914 represent 12 or more isolates. Similarly, 193 are exclusive and common to all nine Symbiodinium isolates; 539 represent eight or more isolates. We define these 914 and 539 families as the core gene families for Symbiodiniaceae and for Symbiodinium, respectively. Despite the variable quality and completeness of the genome assemblies analysed here (Supplementary Table 1, Fig. S5.4), we consider these results more reliable than those based largely on transcriptome data33, in which transcript isoforms, in addition to quality and completeness of the datasets, can result in overestimation of gene numbers and introduce noise and bias to the data. The smaller number of gene families shared among Symbiodiniaceae found here (i.e. 18,453 compared to 76,087 in the earlier study33) likely reflects our more-conservative approach. Nonetheless, our observations support the notion that evolution of gene families has contributed to the diversification of Symbiodiniaceae33.
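The core-family definition used above (families exclusive to a group and represented in all but at most one of its isolates) can be sketched as follows. The family identifiers, isolate labels and the function name are hypothetical, and the input is assumed to be a simple mapping from each gene family to the set of isolates that contribute genes to it.

```python
def core_families(families, group, min_isolates):
    """Return IDs of families that are exclusive to `group` and are
    represented in at least `min_isolates` of its members."""
    core = []
    for family_id, isolates in families.items():
        present = set(isolates)
        exclusive = present <= group          # no gene from outside the group
        if exclusive and len(present) >= min_isolates:
            core.append(family_id)
    return sorted(core)

# Hypothetical toy data: three in-group isolates (A, B, C) and one outgroup (X).
families = {
    "fam1": {"A", "B", "C"},        # exclusive, in all three isolates
    "fam2": {"A", "B"},             # exclusive, missing from one isolate
    "fam3": {"A", "B", "C", "X"},   # shared with the outgroup
    "fam4": {"C"},                  # exclusive, but in a single isolate only
}
group = {"A", "B", "C"}

strict_core = core_families(families, group, min_isolates=len(group))
relaxed_core = core_families(families, group, min_isolates=len(group) - 1)
print(strict_core, relaxed_core)
```

Under this scheme, the 406 and 914 Symbiodiniaceae families correspond to thresholds of 13 and 12 isolates, and the 193 and 539 Symbiodinium families to thresholds of nine and eight isolates. Tolerating one missing isolate is a pragmatic allowance for incomplete assemblies.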

5.3.6. Core genes of Symbiodiniaceae and of Symbiodinium encode similar functions

To identify gene functions characteristic of Symbiodiniaceae and Symbiodinium, we carried out enrichment analyses based on Gene Ontology (GO)47 of the annotated gene functions in the corresponding core families. Among the core genes of Symbiodiniaceae, the most significantly overrepresented GO terms relate to retrotransposition, components of the membrane (including ABC transporters), cellulose binding, and reduction and oxidation reactions of the electron transport chain (Supplementary Table 5). Retrotransposition has been shown to contribute to gene-family expansion and changes in the gene structure of Symbiodiniaceae34,48. The enrichment of this function in Symbiodiniaceae may be due to a common origin of genes that encode remnant protein domains from past retrotransposition events (e.g. genes encoding reverse transcriptase, as previously reported34). Proteins integrated in the cell membrane are relevant to symbiosis49,50. For instance, ABC transporters may play a major role in the exchange of nutrients between host and symbiotic Symbiodiniaceae51. The enrichment of cellulose-binding may be related to the changes in the cell wall during the transition between the mastigote and coccoid stages common in symbiotic Symbiodiniaceae52. The overrepresentation of electron transport chain functions may be associated with the high plasticity required within the electron transport chain (and associated metabolic pathways) to deal with rapidly fluctuating light environments inherent to the shallow benthic systems that Symbiodiniaceae typically inhabit53,54.


Similarly, among core genes of Symbiodinium, the most significantly enriched functions are related to retrotransposition (Supplementary Table 6). This is likely a reflection of the higher content of LINEs in Symbiodinium genomes (and perhaps also of LTRs in S. tridacnidorum CCMP2592 and S. natans CCMP2548) compared to the other Symbiodiniaceae isolates (Fig. 5.2 and Fig. S5.5). Nevertheless, the presence of retrotransposition among the functions overrepresented in the cores of both Symbiodiniaceae and Symbiodinium supports the notion of substantial divergence, potentially the result of pseudogenisation or neofunctionalisation, accumulated between gene homologs that prevents the clustering of these homologs within the same gene family33,34.

5.3.7. Functions related to symbiosis and stress response are conserved in Suessiales

We further examined the functions annotated for the predicted genes of all 15 Suessiales isolates based on the annotated GO terms and protein domains. A recent study, focusing on the transcriptomic changes in Cladocopium sp. following establishment of symbiosis with coral larvae51, compiled a list of symbiosis-related gene functions in Symbiodiniaceae. We searched for these functions, and found that they are conserved in Symbiodiniaceae regardless of the lifestyle (e.g. the free-living S. natans, S. pilosum and F. kawagutii, or the opportunistic S. necroappetens), and even in the outgroup P. glacialis (Fig. 5.5). This result supports the notion that genomes of dinoflagellates encode gene functions conducive to adaptation to a symbiotic lifestyle36. However, we observed a trend of reduced abundance of these functions in genes of B. minutum, C. goreaui and Cladocopium sp. C92, with the exception of genes encoding ankyrin and tetratricopeptide repeat domains. Although multiple Pfam domains of ankyrin or tetratricopeptide repeats exist, all isolates exhibit consistently higher abundance for specific types (PF12796 and PF13424, respectively). Interestingly, despite the presence of ABC transporters in the enriched functions of the core genes of Symbiodiniaceae (Supplementary Table 5), they appear to occur in low abundance. The conservation of these gene functions may also indicate other unknown biological processes that are also relevant to symbiosis. A phylogenomic approach integrating genome-scale data from both hosts and symbionts presents a more-powerful strategy to delineate the molecular mechanisms of symbiotic interactions, and to assess impacts of symbiosis on genome evolution of the hosts55.


Fig. 5.5 Relative abundance of symbiosis-related functions in genes of Suessiales Heat map showing the relative abundance (α) of GO terms (relative to the total number of genes) and protein domains (relative to the total number of identified domains) that are related to symbiosis shown for each genome. The transformed values of α are shown in the form of 3α.


The abundance of functions associated with response to distinct types of stress, cell division, DNA damage repair, photobiology and motility also appears to be conserved across Suessiales (Fig. 5.6). The abundance of genes annotated with DNA repair functions is consistent with the previously reported overrepresentation of these functions in genomes and transcriptomes of Suessiales33 and the presence of gene orthologs involved in a wide range of DNA damage responses in dinoflagellates56. Likewise, the relatively high abundance of functions related to DNA recombination may represent further support for the potential of sexual reproduction in these dinoflagellates37,57, and for the contribution of sexual recombination to genetic diversity of Symbiodiniaceae22,23,58-61. Moreover, the higher abundance of a cold-shock DNA-binding domain and bacteriorhodopsin in P. glacialis compared to the Symbiodiniaceae isolates highlights the adaptation of this species to extreme cold and low-light environments, and is consistent with the highly duplicated genes encoding these functions in P. glacialis genomes40.

5.4. Concluding remarks

Our results suggest that whereas gene functions appear to be largely conserved across isolates from the same order (Suessiales), family (Symbiodiniaceae) and genus (Symbiodinium), there is substantial genome-sequence divergence among these isolates. However, what drives this divergence remains an open question. Although sexual recombination probably contributes to the extensive genetic diversity in Symbiodiniaceae22,23,58-61, its limitation to homologous regions renders its contribution as the sole driver of divergence unlikely. The evolutionary transition from a free-living to a symbiotic lifestyle can contribute to the loss of conserved synteny as a consequence of large- and small-scale structural rearrangements43,62,63. The enhanced activity of mobile elements in the early stages of this transition can further disrupt synteny, impact gene structure and accelerate mutation rate64,65. However, S. natans and S. pilosum, for which the free-living lifestyle has been postulated to be ancestral34, are still quite divergent from each other (Fig. 5.1). Ancient events, such as geological changes or emergence of hosts, are thought to influence diversification of Symbiodiniaceae13,19,66 and may help explain the divergence of the extant lineages. For example, in a hypothetical scenario, drastic changes in environmental conditions could have split the ancestral Symbiodiniaceae population into multiple sub-populations with very small population sizes. This would have enabled rapid divergence among the sub-populations that, in turn, could have evolved and diversified independently into the extant taxa.


Fig. 5.6 Relative abundance of selected functions in genes of Suessiales Heat map showing the relative abundance (α) of GO terms (relative to the total number of genes) and protein domains (relative to the total number of identified domains) that are associated with key functions shown for each genome. The transformed values of α are shown in the form of 3α.

Although genome data generally provide a comprehensive view of gene functions, we cannot dismiss artefacts that may have been introduced by the type of sequence data used to generate the genome assemblies analysed here. Genes encoding functions critical to dinoflagellates often occur in multiple copies, and those of Symbiodiniaceae are no exception34,36,40. Incorporation of long-read sequence data in the genome assembly is important to resolve repetitive elements (including genes occurring in multiple copies) and allow for more-accurate analysis of abundance or enrichment of gene functions. On the other hand, accurate inference of gene families can be challenging especially for gene homologs with an intricate evolutionary history. Moreover, a good taxa representation can aid the inference of homology67,68. Data that better resolve multi-copy genes (e.g. through the incorporation of long-read sequences in the assembly process34) will allow better understanding of gene loss and innovation along the genome evolution of Symbiodiniaceae.

This work reports the first whole-genome comparison at multiple taxonomic levels within dinoflagellates: within Order Suessiales, within Family Symbiodiniaceae, within Genus Symbiodinium, and separately for the species S. microadriaticum and S. tridacnidorum. We show that whereas genome sequences can diverge substantially among Symbiodiniaceae, gene functions nonetheless remain largely conserved even across Suessiales. Our understanding of the evolution of this remarkably divergent family would benefit from more-narrowly scoped studies at the intra-generic and intra-specific levels. Even so, our work demonstrates the value of comprehensive surveys to unveil macro-evolutionary processes that led to the diversification of Symbiodiniaceae.

5.5. Methods

5.5.1. Symbiodinium cultures

Single-cell monoclonal cultures of S. microadriaticum CassKB8 and S. microadriaticum 04-503SCI.03 were acquired from Mary Alice Coffroth (Buffalo University, New York, USA), and those of S. necroappetens CCMP2469, S. linucheae CCMP2456 and S. pilosum CCMP2461 were purchased from the National Center for Marine Algae and Microbiota at the Bigelow Laboratory for Ocean Sciences, Maine, USA (Table 5-1). The cultures were maintained in multiple 100-mL batches (in 250-mL Erlenmeyer flasks) in f/2 (without silica) medium (0.2 μm filter-sterilized) under a 14:10 h light-dark cycle (90 μE/m²/s) at 25 °C. The medium was supplemented with antibiotics (ampicillin [10 mg/mL], kanamycin [5 mg/mL] and streptomycin [10 mg/mL]) to reduce bacterial growth.

5.5.2. Nucleic acid extraction

Genomic DNA was extracted following the 2×CTAB protocol with modifications. Symbiodinium cells were first harvested during exponential growth phase (before reaching 10⁶ cells/mL) by centrifugation (3000 g, 15 min, room temperature (RT)). Upon removal of residual medium, the cells were snap-frozen in liquid nitrogen prior to DNA extraction, or stored at –80 °C. For DNA extraction, the cells were suspended in a lysis extraction buffer (400 μL; 100 mM Tris-Cl pH 8, 20 mM EDTA pH 8, 1.4 M NaCl), before silica beads were added. In a freeze-thaw cycle, the mixture was vortexed at high speed (2 min), and immediately snap-frozen in liquid nitrogen; the cycle was repeated 5 times. The final volume of the mixture was made up to 2% w/v CTAB (from 10% w/v CTAB stock; kept at 37 °C). The mixture was treated with RNase A (Invitrogen; final concentration 20 μg/mL) at 37 °C (30 min), and Proteinase K (final concentration 120 μg/mL) at 65 °C (2 h). The lysate was then subjected to standard extractions using equal volumes of phenol:chloroform:isoamyl alcohol (25:24:1 v/v; centrifugation at 14,000 g, 5 min, RT), and chloroform:isoamyl alcohol (24:1 v/v; centrifugation at 14,000 g, 5 min, RT). DNA was precipitated using pre-chilled isopropanol (gentle inversions of the tube, centrifugation at 18,000 g, 15 min, 4 °C). The resulting pellet was washed with pre-chilled ethanol (70% v/v), before being stored in Tris-HCl (100 mM, pH 8) buffer. DNA concentration was determined with NanoDrop (Thermo Scientific), and DNA with A230:260:280 ≈ 1.0:2.0:1.0 was considered appropriate for sequencing. Total RNA was isolated using the RNeasy Plant Mini Kit (Qiagen) following directions of the manufacturer. RNA quality and concentration were determined using Agilent 2100 BioAnalyzer.

5.5.3. Genome sequence data generation and de novo genome assembly

All genome sequence data generated for the five Symbiodinium isolates are detailed in Supplementary Table 7. Short-read sequence data (2 × 150 bp reads, insert length 350 bp) were generated using paired-end libraries on the Illumina HiSeq 2500 and 4000 platforms at the Australian Genome Research Facility (Melbourne) and the Translational Research Institute Australia (Brisbane). For all samples, except for S. pilosum CCMP2461, an additional paired-end library (insert length 250 bp) was designed such that the read-pairs of 2 × 150 bp would overlap. Quality assessment of the raw paired-end data was done with FastQC v0.11.5 (bioinformatics.babraham.ac.uk/projects/fastqc), and subsequent processing with Trimmomatic69 v0.36. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: ILLUMINACLIP:[AdapterFile]:2:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:4:25 MINLEN:100 AVGQUAL:30; CROP and HEADCROP were run (prior to LEADING and TRAILING) when required to remove read ends with nucleotide biases. Genome size and sequence read coverage were estimated from the trimmed read pairs based on k-mer frequency analysis (Supplementary Table 2) as counted with Jellyfish70 v2.2.6; proportion of the single-copy regions of the genome and heterozygosity were computed with GenomeScope71 v1.0. De novo genome assembly was performed for all isolates with CLC Genomics Workbench v7.5.1 (qiagenbioinformatics.com) at default parameters, and using the filtered read pairs and single-end reads. The genome assemblies of S. microadriaticum 04-503SCI.03, S. microadriaticum CassKB8, S. linucheae CCMP2456 and S. pilosum CCMP2461 were further scaffolded with transcriptome data (see below) using L_RNA_scaffolder72. Short sequences (<1000 bp) were removed from the assemblies.
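The principle behind estimating genome size from k-mer frequencies can be sketched as below. This is a simplified peak-based estimate and not the model fitted by GenomeScope; the histogram values, the depth cut-off used to discard likely sequencing errors and the function name are assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Peak-based genome size estimate from a k-mer frequency histogram.

    histogram maps a depth (how many times a k-mer was seen in the reads)
    to the number of distinct k-mers observed at that depth.
    """
    # Low-depth k-mers are assumed to arise mostly from sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # The main peak approximates the sequencing depth of single-copy regions.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences contributed by the genome.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth

# Hypothetical toy histogram with an error spike at depth 1-2,
# a main peak at depth 30 and a small two-copy peak at depth 60.
toy_histogram = {1: 1_000_000, 2: 5_000,
                 28: 100, 29: 400, 30: 1_000, 31: 400, 32: 100,
                 60: 50}
size, depth = estimate_genome_size(toy_histogram)
print(size, depth)
```

In practice the histogram would come from a k-mer counter such as Jellyfish, and a model-based tool is preferable because it also accounts for heterozygosity and repeats.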


5.5.4. Removal of putative microbial contaminants

To identify putative sequences from bacteria, archaea and viruses in the genome scaffolds, we followed the approach of Chen et al.41. In brief, we first searched the scaffolds (BLASTn) against a database of bacterial, archaeal and viral genomes from RefSeq (release 88), and identified those with significant hits (E ≤ 10⁻²⁰ and bit score ≥ 1000). We then examined the sequence cover of these regions in each scaffold, and identified the percentage (in length) contributed by these regions relative to the scaffold length. We assessed the added length of implicated genome scaffolds across different thresholds of percentage sequence cover in the alignment, and the corresponding gene models in these scaffolds as predicted from available transcripts (see below) using PASA73 v2.3.3, with a modified script (github.com/chancx/dinoflag-alt-splice) that recognises an additional donor splice site (GA), and TransDecoder73 v5.2.0. Any scaffold with significant bacterial, archaeal or viral hits covering ≥5% of its length was considered as a putative contaminant and removed from the assembly (Fig. S5.10). Additionally, the length of the remaining scaffolds was plotted against their G+C content; scaffolds (>100 kbp) with irregular G+C content (in this case, G+C ≤ 45% or ≥60%) were considered as putative contaminant sequences and removed (Fig. S5.11).
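The two filtering criteria described above can be sketched as a single decision function. This is an illustrative sketch, not the script used in this study: the function names and the representation of hits as 1-based inclusive intervals are assumptions, and the hits are assumed to have already passed the E-value and bit-score thresholds.

```python
def covered_fraction(length, hits):
    """Fraction of a scaffold covered by hit intervals (1-based, inclusive).
    Overlapping hits are merged so that no base is counted twice."""
    covered = 0
    last_end = 0
    for start, end in sorted(hits):
        start = max(start, last_end + 1)   # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / length

def is_putative_contaminant(length, hits, gc,
                            min_cover=0.05, gc_min_len=100_000,
                            gc_low=0.45, gc_high=0.60):
    """Apply the hit-coverage criterion, then the G+C criterion."""
    if covered_fraction(length, hits) >= min_cover:
        return True
    # The G+C criterion is applied only to scaffolds longer than 100 kbp.
    return length > gc_min_len and (gc <= gc_low or gc >= gc_high)

# Hypothetical scaffolds: (length in bp, significant hits, G+C fraction).
print(is_putative_contaminant(10_000, [(1, 300), (200, 600)], 0.50))
print(is_putative_contaminant(10_000, [(1, 400)], 0.70))
print(is_putative_contaminant(200_000, [], 0.40))
print(is_putative_contaminant(200_000, [], 0.50))
```

Merging overlapping hits before computing the covered fraction prevents a scaffold from being flagged merely because several database entries match the same region.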

5.5.5. Generation and assembly of transcriptome data

We generated transcriptome sequence data for the Symbiodinium isolates, except for S. necroappetens CCMP2469 for which the extraction of total RNAs failed (Supplementary Table 8). Short-read sequence data (2 × 150 bp reads) were generated using paired-end libraries on the Illumina NovaSeq 6000 platform at the Australian Genome Research Facility (Melbourne). Quality assessment of the raw paired-end data was done with FastQC v0.11.4 (bioinformatics.babraham.ac.uk/projects/fastqc), and subsequent processing with Trimmomatic69 v0.35. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: HEADCROP:10 ILLUMINACLIP:[AdapterFile]:2:30:10 CROP:125 SLIDINGWINDOW:4:13 MINLEN:50. The surviving read pairs were further trimmed with QUADTrim v2.0.2 (bitbucket.org/arobinson/quadtrim) with the flags -m2 and -g to remove homopolymeric guanine repeats at the end of the reads (a systematic error of Illumina NovaSeq 6000).

Transcriptome assembly was performed with Trinity74 v2.1.1 in two modes: de novo and genome-guided. De novo transcriptome assembly was done using default parameters and the trimmed read pairs. For genome-guided assembly, high-quality read pairs were aligned to their corresponding de novo genome assembly (prior to scaffolding) using Bowtie75 v2.2.7. Transcriptomes were then assembled with Trinity in the genome-guided mode using the alignment information, and setting the maximum intron size to 100,000 bp. Both de novo and genome-guided transcriptome assemblies from each of the four samples were used for scaffolding (see above) and gene prediction (see below) in their corresponding genome.

5.5.6. Gene prediction and function annotation

We adopted the same comprehensive ab initio gene prediction approach reported in Chen et al.41, using available genes and transcriptomes of Symbiodiniaceae as supporting evidence. A de novo repeat library was first derived for the genome assembly using RepeatModeler v1.0.11 (repeatmasker.org/RepeatModeler). All repeats (including known repeats in RepeatMasker database release 20180625) were masked using RepeatMasker v4.0.7 (repeatmasker.org). As direct transcript evidence, we used the de novo and genome-guided transcriptome assemblies from Illumina short-read sequence data (see above). For S. necroappetens CCMP2469, we used transcriptome data of the other four Symbiodinium isolates for gene prediction, as well as other available transcriptome datasets of Symbiodinium: S. microadriaticum CassKB876, S. microadriaticum CCMP246736, S. tridacnidorum Sh1838, and S. tridacnidorum CCMP2592 and S. natans CCMP254834. We also combined the S. microadriaticum CassKB8 transcriptome data generated here with those from a previous study76. We concatenated all the transcript datasets per sample, and vector sequences were discarded using SeqClean (sourceforge.net/projects/seqclean) based on shared similarity to sequences in the UniVec database build 10.0. We used PASA73 v2.3.3, customised to recognise dinoflagellate alternative splice donor sites (github.com/chancx/dinoflag-alt-splice), and TransDecoder73 v5.2.0 to predict CDS. These CDS were searched (BLASTp, E ≤ 10⁻²⁰) against a protein database that consists of RefSeq proteins (release 88) and a collection of available and predicted proteins (using TransDecoder73 v5.2.0) of Symbiodiniaceae (total of 111,591,828 sequences; Supplementary Table 9). We used the analyze_blastPlus_topHit_coverage.pl script from Trinity74 v2.6.6 to retrieve only those CDS having an alignment >70% to a protein (i.e. nearly full-length) in the database for subsequent analyses.

The near full-length gene models were checked for transposable elements (TEs) using HHblits v2.0.16 (probability = 80% and E-value = 10⁻⁵), searching against the JAMg transposon database (sourceforge.net/projects/jamg/files/databases), and TransposonPSI (transposonpsi.sourceforge.net). Gene models containing TEs were removed from the gene set, and redundancy reduction was conducted using cd-hit77,78 v4.6 (ID = 75%). The remaining gene models were processed using the prepare_golden_genes_for_predictors.pl script from the JAMg pipeline (altered to recognise GA donor splice sites; jamg.sourceforge.net). This script produces a set of “golden genes” that were used as training set for the ab initio gene-prediction tools AUGUSTUS79 v3.3.1 (customised to recognise the non-canonical splice sites of dinoflagellates; github.com/chancx/dinoflag-alt-splice) and SNAP80 v2006-07-28. Independently, the soft-masked genome sequences were used for gene prediction using GeneMark-ES81 v4.32. Swiss-Prot proteins (downloaded on 27 June 2018) and the predicted proteins of Symbiodiniaceae (Supplementary Table 9) were used as supporting evidence for gene prediction using MAKER82 v2.31.10 protein2genome; the custom repeat library was used by RepeatMasker as part of MAKER prediction. A primary set of predicted genes was produced using EvidenceModeler83 v1.1.1, modified to recognise GA donor splice sites. This package combined the gene predictions from PASA, SNAP, AUGUSTUS, GeneMark-ES and MAKER protein2genome into a single set of evidence-based predictions. The weightings used for the package were: PASA 10, Maker protein 8, AUGUSTUS 6, SNAP 2 and GeneMark-ES 2. Only gene models with transcript evidence (i.e. predicted by PASA) or supported by at least two ab initio prediction programs were kept. We assessed completeness by querying the predicted protein sequences in a BLASTp similarity search (E ≤ 10⁻⁵, ≥50% query/target sequence cover) against the 458 core eukaryotic genes from CEGMA84. Transcript data support for the predicted genes was determined by BLASTn (E ≤ 10⁻⁵), querying the transcript sequences against the predicted CDS from each genome. Genes for which the transcripts aligned to their CDS with at least 50% of sequence cover and 90% identity were considered as supported by transcript data.

Functional annotation of the predicted genes was conducted based on sequence similarity searches against known proteins following the same approach as Liu et al.37, in which the predicted protein sequences were first searched (BLASTp, E ≤ 10⁻⁵, minimum query or target cover of 50%) against the manually curated Swiss-Prot database, and those with no Swiss-Prot hits were subsequently searched against TrEMBL (both databases from UniProt, downloaded on 27 June 2018). The best UniProt hit with associated GO terms (geneontology.org) was used to annotate the query protein with those GO terms using the UniProt-GOA mapping (downloaded on 03 June 2019). Pfam domains85 were searched in the predicted proteins of all samples using PfamScan86 (E ≤ 0.001) and the Pfam-A database (release 30 August 2018).

5.5.7. Comparison of genome sequences and analysis of conserved synteny

We compared the genome data of 15 isolates in Order Suessiales (Supplementary Table 1): the five for which we generated genome assemblies in this study (S. microadriaticum CassKB8, S. microadriaticum 04-503SCI.03, S. necroappetens CCMP2469, S. linucheae CCMP2456 and S. pilosum CCMP2461), three generated by Shoguchi and collaborators (B. minutum, S. tridacnidorum Sh18 and Cladocopium sp. C92)38,39, two from González-Pech et al. (S. tridacnidorum CCMP2592 and S. natans CCMP2548)34, two from Liu et al. (C. goreaui and F. kawagutii)37, two from Stephens et al. (P. glacialis CCMP1383 and CCMP2088)40, and one from Aranda et al. (S. microadriaticum CCMP2467)36. Genes were consistently predicted from all genomes using the same workflow34,40,41.

Whole-genome sequence alignment was carried out for all possible genome pairs (225 combinations counting each genome as both reference and query) with nucmer87 v4.0.0, using anchor matches that are unique in the sequences from both reference and query sequences (--mum). Here, the similarity between two genomes was assessed based on the proportion of the total bases in the genome sequences of the query that aligned to the reference genome sequences (Q) and the average percent identity of one-to-one alignments (i.e. the reciprocal best one-to-one aligned sequences for the implicated region between the query and the reference; I). For example, if two genomes are identical, both Q and I would have a value of 100%. Filtered read pairs (see above, Supplementary Table 7) from all isolates were aligned to each other’s (and against their own) assembled genome scaffolds using BWA88 v0.7.13; mapping rates relative to base quality scores were calculated with SAMStat89 v1.5.1. For each possible genome-pair, we further assessed sequence similarity of the repeat-masked genome assemblies based on the similarity between their k-mer profiles. To determine the appropriate k-mer size to use, we extracted and counted k-mers using Jellyfish70 v2.2.6 at multiple k values (between 11 and 101, step size = 2); k = 21 was found to capture an adequate level of uniqueness among these genomes as inferred based on the proportion of distinct and unique k-mers90 (Fig. S5.12). We then computed pairwise $D_2^S$ distances (d) for the 15 isolates following Bernard et al.91. The calculated distances were used to build a NJ tree with Neighbor (PHYLIP v3.697, evolution.genetics.washington.edu/phylip.html) at default settings. For deriving an alignment-free similarity network, pairwise similarity was calculated as 10 – d92.
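The proportions of distinct and unique k-mers used to choose k can be illustrated with a minimal sketch. It assumes that both proportions are expressed relative to the total number of k-mers counted, that k-mers containing masked (lowercase) or ambiguous bases are skipped, and that only the given strand is counted; these choices and the function names are introduced here for illustration and need not match the Jellyfish settings used.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def kmer_profile(sequences, k):
    """Count every k-mer made only of unmasked A, C, G or T."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= VALID_BASES:   # skips N and soft-masked bases
                counts[kmer] += 1
    return counts

def distinct_and_unique(counts):
    """Proportions of distinct k-mers (different k-mers observed) and of
    unique k-mers (observed exactly once), relative to all k-mers counted."""
    total = sum(counts.values())
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct / total, unique / total

# Hypothetical toy sequence; real input would be repeat-masked scaffolds.
profile = kmer_profile(["ACGTACGT"], k=4)
p_distinct, p_unique = distinct_and_unique(profile)
print(p_distinct, p_unique)
```

Repeating this calculation over a range of k values gives the curves from which a suitable k can be chosen: both proportions increase with k until most k-mers occur once, beyond which larger k adds little resolution.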

To assess conserved synteny, we identified collinear syntenic gene blocks common to each genome pair based on the predicted genes and their associated genomic positions. Following Liu et al.37, we define a syntenic gene block as a region conserved in two genomes in which five or more genes are coded in the same order and orientation. First, we concatenated the sequences of all predicted proteins to conduct an all-versus-all BLASTp search (E ≤ 10⁻⁵) for similar proteins between each genome pair. The hit pairs were then filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. Next, we ran MCScanX93 in inter-specific mode (-b 2) to identify blocks of at least five genes shared by each genome pair. We independently searched for collinear syntenic blocks within each genome (i.e. duplicated gene blocks). Likewise, we conducted a BLASTp search (E ≤ 10⁻⁵) for similar proteins within each genome; the hit pairs were filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. We then ran MCScanX in intra-specific mode (-b 1).
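The hit filter and the block definition above can be sketched in Python as follows. This is a deliberately simplified illustration and not a re-implementation of MCScanX: the function names and input layout are hypothetical, only strictly adjacent genes in the forward direction are chained (MCScanX tolerates gaps and scores blocks differently), and gene positions are assumed to be given as gene-order indices on a single scaffold pair.

```python
def passes_coverage(aln_len_query, query_len, aln_len_subject, subject_len,
                    min_cov=0.5):
    """True if the alignment covers at least half of either protein."""
    return (aln_len_query / query_len >= min_cov or
            aln_len_subject / subject_len >= min_cov)


def collinear_blocks(anchors, min_genes=5):
    """Find runs of homologous gene pairs kept in the same order and orientation.

    anchors: list of (index_a, index_b, same_strand) tuples, where the indices
    are gene-order positions in genomes A and B, and same_strand records whether
    the two genes share the same relative orientation.
    Returns a list of blocks, each a list of anchors with >= min_genes members.
    """
    blocks, current = [], []
    for anchor in sorted(anchors):
        if current:
            prev = current[-1]
            adjacent = (anchor[0] - prev[0] == 1 and anchor[1] - prev[1] == 1)
            same_orientation = anchor[2] == prev[2]
            if not (adjacent and same_orientation):
                # The run is broken: keep it only if it is long enough
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(anchor)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

With this definition, a run of six adjacent homologous pairs forms one block, whereas a run of four is discarded.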

5.5.8. Genic features, gene families and function enrichment

We examined differences among the predicted genes of all Suessiales isolates with a Principal Component Analysis (PCA; Fig. 5.3) using relevant metrics (Supplementary Table 4), following Chen et al.41. We calculated G+C content in the third position of synonymous codons and effective number of codons used (Nc) with CodonW (codonw.sourceforge.net) for complete CDS (defined as those with both start and stop codons) of all isolates. Groups of homologous sequences from all genomes were inferred with OrthoFinder46 v2.3.1 and considered as gene families. A rooted species tree was inferred using 28,116 families encompassing at least 4 genes from any isolate using STAG44 and STRIDE45, following the standard OrthoFinder pipeline. GO enrichment of genes in families core to Symbiodiniaceae and Symbiodinium (defined as those common to all isolates in, and exclusive to, each group) was conducted using the topGO Bioconductor package (bioconductor.org/packages/release/bioc/html/topGO.html) executed in R v3.5.1, implementing Fisher’s Exact test and the ‘elimination’ method; the GO terms associated with the genes of all isolates surveyed here were used as the background for comparison. We considered p ≤ 0.01 as significant.
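The statistical core of the enrichment test can be illustrated with a short Python sketch of a one-sided Fisher's Exact test for a single GO term. The function name and argument layout are hypothetical, and the sketch does not reproduce the ‘elimination’ method of topGO, which additionally accounts for the GO graph structure; it only shows how the over-representation p-value for one term follows from the hypergeometric distribution.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's Exact test p-value for over-representation.

    k: genes in the test set annotated with the GO term
    n: size of the test set
    K: genes in the background annotated with the GO term
    N: size of the background
    Returns P(X >= k) for X following a hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example (hypothetical counts): 5 of 5 test genes carry a term
# that annotates 5 of 10 background genes.
p_value = fisher_enrichment_p(k=5, n=5, K=5, N=10)
is_significant = p_value <= 0.01  # threshold used in this study
```

In this toy case the p-value is 1/252, which falls below the 0.01 threshold.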

5.6. References

1 González-Pech, R. A. et al. Genomes of Symbiodiniaceae reveal extensive sequence divergence but conserved functions at family and genus levels. bioRxiv, 800482 (2019).

2 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).

3 Lesser, M., Stat, M. & Gates, R. The endosymbiotic dinoflagellates (Symbiodinium sp.) of corals are parasites and mutualists. Coral Reefs 32, 603-611 (2013).

4 LaJeunesse, T. C., Lee, S. Y., Gil-Agudelo, D. L., Knowlton, N. & Jeong, H. J. Symbiodinium necroappetens sp. nov. (Dinophyceae): an opportunist ‘zooxanthella’ found in bleached and diseased tissues of Caribbean reef corals. Eur. J. Phycol. 50, 223-238 (2015).

5 Hansen, G. & Daugbjerg, N. Symbiodinium natans sp. nov.: a "free-living" dinoflagellate from Tenerife (Northeast-Atlantic Ocean). J. Phycol. 45, 251-263 (2009).

6 Suggett, D. J. et al. Functional diversity of photobiological traits within the genus Symbiodinium appears to be governed by the interaction of cell size with cladal designation. New Phytol. 208, 370-381 (2015).

7 Goyen, S. et al. A molecular physiology basis for functional diversity of hydrogen peroxide production amongst Symbiodinium spp. (Dinophyceae). Mar. Biol. 164, 46 (2017).

8 Lawson, C. A., Possell, M., Seymour, J. R., Raina, J.-B. & Suggett, D. J. Coral endosymbionts (Symbiodiniaceae) emit species-specific volatilomes that shift when exposed to thermal stress. Sci. Rep. 9, 17395 (2019).

9 Warner, M. E. & Suggett, D. J. in The Cnidaria, past, present and future: the world of Medusa and her sisters (eds Stefano Goffredo & Zvy Dubinsky) 489-509 (Springer International Publishing, 2016).

10 Stat, M., Morris, E. & Gates, R. D. Functional diversity in coral–dinoflagellate symbiosis. Proc. Natl. Acad. Sci. U. S. A. 105, 9256-9261 (2008).

11 Rowan, R. & Powers, D. A. Ribosomal RNA sequences and the diversity of symbiotic dinoflagellates (zooxanthellae). Proc. Natl. Acad. Sci. U. S. A. 89, 3639-3643 (1992).

12 Blank, R. J. & Huss, V. A. R. DNA divergency and speciation in Symbiodinium (Dinophyceae). Plant Syst. Evol. 163, 153-163 (1989).

13 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral endosymbionts. Curr. Biol. 28, 2570-2580 (2018).

14 Takishita, K., Ishikura, M., Koike, K. & Maruyama, T. Comparison of phylogenies based on nuclear-encoded SSU rDNA and plastid-encoded psbA in the symbiotic dinoflagellate genus Symbiodinium. Phycologia 42, 285-291 (2003).

15 Takabayashi, M., Santos, S. R. & Cook, C. B. Mitochondrial DNA phylogeny of the symbiotic dinoflagellates (Symbiodinium, Dinophyta). J. Phycol. 40, 160-164 (2004).

16 LaJeunesse, T. C. Investigating the biodiversity, ecology, and phylogeny of endosymbiotic dinoflagellates in the genus Symbiodinium using the ITS region: in search of a “species” level marker. J. Phycol. 37, 866-880 (2001).

17 Rowan, R. & Powers, D. A. A molecular genetic classification of zooxanthellae and the evolution of animal-algal symbioses. Science 251, 1348-1351 (1991).

18 Pochon, X., Putnam, H. M. & Gates, R. D. Multi-gene analysis of Symbiodinium dinoflagellates: a perspective on rarity, symbiosis, and evolution. PeerJ 2, e394 (2014).

19 Pochon, X., Montoya-Burgos, J. I., Stadelmann, B. & Pawlowski, J. Molecular phylogeny, evolutionary rates, and divergence timing of the symbiotic dinoflagellate genus Symbiodinium. Mol. Phylogenet. Evol. 38, 20-30 (2006).

20 Pochon, X., Putnam, H. M., Burki, F. & Gates, R. D. Identifying and characterizing alternative molecular markers for the symbiotic and free-living dinoflagellate genus Symbiodinium. PLoS ONE 7, e29816 (2012).

21 Hume, B. C. C. et al. SymPortal: a novel analytical framework and platform for coral algal symbiont next-generation sequencing ITS2 profiling. Mol. Ecol. Resour. 19, 1063-1080 (2019).

22 Baillie, B., Monje, V., Silvestre, V., Sison, M. & Belda-Baillie, C. Allozyme electrophoresis as a tool for distinguishing different zooxanthellae symbiotic with giant clams. Proc. R. Soc. Lond. B Biol. Sci. 265, 1949-1956 (1998).

23 Baillie, B. et al. Genetic variation in Symbiodinium isolates from giant clams based on random-amplified-polymorphic DNA (RAPD) patterns. Mar. Biol. 136, 829-836 (2000).

24 Santos, S., Shearer, T., Hannes, A. & Coffroth, M. Fine‐scale diversity and specificity in the most prevalent lineage of symbiotic dinoflagellates (Symbiodinium, Dinophyceae) of the Caribbean. Mol. Ecol. 13, 459-469 (2004).

25 Wham, D. C. & LaJeunesse, T. C. Symbiodinium population genetics: testing for species boundaries and analysing samples with mixed genotypes. Mol. Ecol. 25, 2699-2712 (2016).

26 Pettay, D. T. & LaJeunesse, T. C. Microsatellite loci for assessing genetic diversity, dispersal and clonality of coral symbionts in ‘stress-tolerant’ clade D Symbiodinium. Mol. Ecol. Resour. 9, 1022-1025 (2009).

27 Wham, D. C., Pettay, D. T. & LaJeunesse, T. C. Microsatellite loci for the host-generalist “zooxanthella” and other Clade D Symbiodinium. Conserv. Genet. Resour. 3, 541-544 (2011).

28 Wham, D. C., Carmichael, M. & LaJeunesse, T. C. Microsatellite loci for Symbiodinium goreaui and other Clade C Symbiodinium. Conserv. Genet. Resour. 6, 127-129 (2014).

29 Pinzón, J. H., Devlin-Durante, M. K., Weber, M. X., Baums, I. B. & LaJeunesse, T. C. Microsatellite loci for Symbiodinium A3 (S. fitti) a common algal symbiont among Caribbean Acropora (stony corals) and Indo-Pacific giant clams (Tridacna). Conserv. Genet. Resour. 3, 45-47 (2011).

30 Bay, L. K., Howells, E. J. & van Oppen, M. J. H. Isolation, characterisation and cross amplification of thirteen microsatellite loci for coral endo-symbiotic dinoflagellates (Symbiodinium clade C). Conserv. Genet. Resour. 1, 199 (2009).

31 Howells, E. J., van Oppen, M. J. H. & Willis, B. L. High genetic differentiation and cross-shelf patterns of genetic diversity among Great Barrier Reef populations of Symbiodinium. Coral Reefs 28, 215-225 (2009).

32 Santos, S., Gutierrez-Rodriguez, C., Lasker, H. & Coffroth, M. Symbiodinium sp. associations in the gorgonian Pseudopterogorgia elisabethae in the Bahamas: high levels of genetic variability and population structure in symbiotic dinoflagellates. Mar. Biol. 143, 111-120 (2003).

33 González-Pech, R. A., Ragan, M. A. & Chan, C. X. Signatures of adaptation and symbiosis in genomes and transcriptomes of Symbiodinium. Sci. Rep. 7, 15021 (2017).

34 González-Pech, R. A. et al. Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium. bioRxiv, 783902 (2019).

35 Parkinson, J. E. et al. Gene expression variation resolves species and individual strains among coral-associated dinoflagellates within the genus Symbiodinium. Genome Biol. Evol. 8, 665-680 (2016).

36 Aranda, M. et al. Genomes of coral dinoflagellate symbionts highlight evolutionary adaptations conducive to a symbiotic lifestyle. Sci. Rep. 6, 39734 (2016).

37 Liu, H. et al. Symbiodinium genomes reveal adaptive evolution of functions related to coral-dinoflagellate symbiosis. Commun. Biol. 1, 95 (2018).

38 Shoguchi, E. et al. Two divergent Symbiodinium genomes reveal conservation of a gene cluster for sunscreen biosynthesis and recently lost genes. BMC Genomics 19, 458 (2018).

39 Shoguchi, E. et al. Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr. Biol. 23, 1399-1408 (2013).

40 Stephens, T. G. et al. Polarella glacialis genomes encode tandem repeats of single-exon genes with functions critical to adaptation of dinoflagellates. bioRxiv, 704437 (2019).

41 Chen, Y., González-Pech, R. A., Stephens, T. G., Bhattacharya, D. & Chan, C. X. Evidence that inconsistent gene prediction can mislead analysis of dinoflagellate genomes. J. Phycol. 56, 6-10 (2020).

42 Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111-120 (1980).

43 González-Pech, R. A., Bhattacharya, D., Ragan, M. A. & Chan, C. X. Genome evolution of coral reef symbionts as intracellular residents. Trends Ecol. Evol. (2019).

44 Emms, D. M. & Kelly, S. STAG: Species Tree Inference from All Genes. bioRxiv, 267914 (2018).

45 Emms, D. M. & Kelly, S. STRIDE: Species Tree Root Inference from Gene Duplication Events. Mol. Biol. Evol. 34, 3267-3278 (2017).

46 Emms, D. M. & Kelly, S. OrthoFinder2: fast and accurate phylogenomic orthology analysis from gene sequences. bioRxiv, 466201 (2018).

47 Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25-29 (2000).

48 Lin, S. et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science 350, 691-694 (2015).

49 Davy, S. K., Allemand, D. & Weis, V. M. Cell biology of cnidarian-dinoflagellate symbiosis. Microbiol. Mol. Biol. Rev. 76, 229-261 (2012).

50 Weis, V. M. Cell biology of coral symbiosis: foundational study can inform solutions to the coral reef crisis. Integr. Comp. Biol. (2019).

51 Mohamed, A. R. et al. Transcriptomic insights into the establishment of coral-algal symbioses from the symbiont perspective. bioRxiv, 652131 (2019).

52 Fujise, L., Yamashita, H. & Koike, K. Application of calcofluor staining to identify motile and coccoid stages of Symbiodinium (Dinophyceae). Fisheries Sci. 80, 363-368 (2014).

53 Hennige, S. J., Suggett, D. J., Warner, M. E., McDougall, K. E. & Smith, D. J. Photobiology of Symbiodinium revisited: bio-physical and bio-optical signatures. Coral Reefs 28, 179-195 (2009).

54 Behrenfeld, M. J., Prasil, O., Kolber, Z. S., Babin, M. & Falkowski, P. G. Compensatory changes in Photosystem II electron turnover rates protect photosynthesis from photoinhibition. Photosynth. Res. 58, 259-268 (1998).

55 Delaux, P.-M. et al. Comparative phylogenomics uncovers the impact of symbiotic associations on host genome evolution. PLoS Genet. 10, e1004487 (2014).

56 Li, C. & Wong, J. T. Y. DNA damage response pathways in dinoflagellates. Microorganisms 7, 191 (2019).

57 Chi, J., Parrow, M. W. & Dunthorn, M. Cryptic sex in Symbiodinium (Alveolata, Dinoflagellata) is supported by an inventory of meiotic genes. J. Eukaryot. Microbiol. 61, 322-327 (2014).

58 LaJeunesse, T. Diversity and community structure of symbiotic dinoflagellates from Caribbean coral reefs. Mar. Biol. 141, 387-400 (2002).

59 Pettay, D. T. & LaJeunesse, T. C. Long-range dispersal and high-latitude environments influence the population structure of a “stress-tolerant” dinoflagellate endosymbiont. PLoS ONE 8, e79208 (2013).

60 Brian, J. I., Davy, S. K. & Wilkinson, S. P. Multi-gene incongruence consistent with hybridisation in Cladocopium (Symbiodiniaceae), an ecologically important genus of coral reef symbionts. PeerJ 7, e7178 (2019).

61 Thornhill, D. J., Lewis, A. M., Wham, D. C. & LaJeunesse, T. C. Host-specialist lineages dominate the adaptive radiation of reef coral endosymbionts. Evolution 68, 352-367 (2014).

62 Wernegreen, J. J. For better or worse: genomic consequences of intracellular mutualism and parasitism. Curr. Opin. Genet. Dev. 15, 572-583 (2005).

63 Moran, N. A. & Plague, G. R. Genomic changes following host restriction in bacteria. Curr. Opin. Genet. Dev. 14, 627-633 (2004).

64 Cordaux, R. & Batzer, M. A. The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10, 691 (2009).

65 Quadrana, L. et al. Transposition favors the generation of large effect mutations that may facilitate rapid adaption. Nat. Commun. 10, 3421 (2019).

66 Stat, M., Carter, D. & Hoegh-Guldberg, O. The evolutionary history of Symbiodinium and scleractinian hosts—symbiosis, diversity, and the effect of climate change. Perspect. Plant Ecol. 8, 23-43 (2006).

67 Trachana, K. et al. Orthology prediction methods: a quality assessment using curated protein families. BioEssays 33, 769-780 (2011).

68 Kuzniar, A., van Ham, R. C. H. J., Pongor, S. & Leunissen, J. A. M. The quest for orthologs: finding the corresponding gene across genomes. Trends in Genetics 24, 539-551 (2008).

69 Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170 (2014).

70 Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764-770 (2011).

71 Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202-2204 (2017).

72 Xue, W. et al. L_RNA_scaffolder: scaffolding genomes with transcripts. BMC Genomics 14, 604 (2013).

73 Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654-5666 (2003).

74 Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644-652 (2011).

75 Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).

76 Bayer, T. et al. Symbiodinium transcriptomes: genome insights into the dinoflagellate symbionts of reef-building corals. PLoS ONE 7, e35269 (2012).

77 Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).

78 Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).

79 Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435-W439 (2006).

80 Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 1 (2004).

81 Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494-6506 (2005).

82 Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).

83 Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, 1 (2008).

84 Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061-1067 (2007).

85 Bateman, A. et al. The Pfam protein families database. Nucleic Acids Res. 32, D138-D141 (2004).

86 Li, W. et al. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 43, W580-W584 (2015).

87 Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

88 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760 (2009).

89 Lassmann, T., Hayashizaki, Y. & Daub, C. O. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics 27, 130-131 (2011).

90 Greenfield, P. & Roehm, U. Answering biological questions by querying k-mer databases. Concurr. Comp. Pract. E. 25, 497-509 (2013).

91 Bernard, G., Greenfield, P., Ragan, M. A. & Chan, C. X. k-mer similarity, networks of microbial genomes, and taxonomic rank. mSystems 3, e00257-00218 (2018).

92 Bernard, G., Ragan, M. & Chan, C. Recapitulating phylogenies using k-mers: from trees to networks [version 2; peer review: 2 approved]. F1000Research 5 (2016).

93 Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49-e49 (2012).

5.7. Supplementary figures

Fig. S5.1 G+C content distribution in genome sequences of Symbiodinium spp. Distribution of per-scaffold G+C content for each assembled genome of Symbiodinium.

Fig. S5.2 Alignment-free similarity network Similarity network constructed based on shared 21-mers among the genome sequences of the Suessiales isolates. Each panel shows the network at a similarity threshold (T; panels range from T = 2.5 to T = 9.5 in steps of 0.5), at which edges (connexions) with a similarity value below the threshold are removed; see Bernard et al. (2018)91 for more detail. Data points representing each dataset are coloured by genus (Symbiodinium, Breviolum, Cladocopium, Fugacium and Polarella) following the bottom legend. S. microadriaticum isolates are the last Symbiodinium clique at T = 9.0.

Fig. S5.3 Mapping rate among Symbiodinium genomes Mapping rate of filtered paired reads that we generated for each Symbiodinium isolate against the assembled genomes of itself (grey background) and of all other Symbiodinium isolates. The tree topologies on the left and bottom indicate the known phylogenetic relationship13 among the isolates.

Fig. S5.4 Completeness of Suessiales genome assemblies Recovery of 458 conserved eukaryote genes among the predicted genes in each genome, based on CEGMA Clusters of Orthologous Groups84. Isolates for which genome data were generated in this study are indicated with an asterisk. The percentage of recovered CEGMA genes is shown for each bar.

Fig. S5.5 Absolute length covered by LINEs in genomes of Suessiales Total length of each analysed genome that comprises LINEs, sorted in decreasing order.

Fig. S5.6 No evidence of genome duplication in S. pilosum CCMP2461 (A) GenomeScope71 profile based on 19-mers. (B) Number of the duplicated gene blocks found within each genome (x-axis) against the number of implicated genes in those blocks (y-axis). The size of the dot is proportional to the added sequence length comprising the duplicated gene blocks, as shown in the top-left legend.

Fig. S5.7 Contribution of gene metrics to PCA Loading plot showing the contribution of the distinct gene metrics employed for the PCA (Supplementary Table 4, Fig. 5.3) to PC1 and to PC2.

Fig. S5.8 Nucleotide composition of codon positions in CDS of Symbiodinium G+C content of third codon position (x-axis) against average G+C content of first and second positions (y-axis) of full-length CDS. The grey diagonal line indicates the values where the nucleotide composition is the same in both metrics, indicative of neutral evolution. The red line represents the regression line estimated from the CDS data; the corresponding slope (m), Pearson’s correlation coefficient (r) and statistical significance (p) are shown in red. To highlight overall patterns of nucleotide composition, the plot is split into four quadrants with coordinates at 50% from both axes. Dots falling into each quadrant are coloured in a different shade of blue; the corresponding number of CDS for each quadrant is shown. Panels (one per isolate): S. microadriaticum CassKB8, S. microadriaticum 04-503SCI.03, S. microadriaticum CCMP2467, S. necroappetens CCMP2469, S. linucheae CCMP2456, S. tridacnidorum CCMP2592, S. tridacnidorum Sh18, S. natans CCMP2548 and S. pilosum CCMP2461.

Fig. S5.9 Codon usage in Symbiodinium isolates Effective number of codons used (Nc, y-axis) as a function of the G+C content (%) of synonymous third codon positions (x-axis). The curved line represents the neutral expectation of Nc. CDS with an Nc 25% smaller than expected are considered to display strong codon usage preference and are highlighted in a darker shade of blue; the corresponding CDS count is shown in each graph. Panels (one per isolate): S. microadriaticum CassKB8, S. microadriaticum 04-503SCI.03, S. microadriaticum CCMP2467, S. necroappetens CCMP2469, S. linucheae CCMP2456, S. tridacnidorum CCMP2592, S. tridacnidorum Sh18, S. natans CCMP2548 and S. pilosum CCMP2461.
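The criterion for strong codon usage preference can be sketched in Python as follows. The caption does not state which formula underlies the neutral expectation; the sketch assumes the widely used approximation of Wright (1990), Nc = 2 + s + 29 / (s² + (1 − s)²), where s is the G+C fraction at synonymous third codon positions. The function names and the default margin argument are hypothetical.

```python
def expected_nc(gc3s):
    """Expected Nc under no selection (Wright's approximation, assumed here).

    gc3s: G+C content at synonymous third codon positions, as a fraction (0-1).
    """
    s = gc3s
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) ** 2)


def strong_codon_bias(nc_observed, gc3s, margin=0.25):
    """True if the observed Nc is more than `margin` below the expectation."""
    return nc_observed < (1.0 - margin) * expected_nc(gc3s)


# A CDS with G+C content of 50% at synonymous third positions
neutral_nc = expected_nc(0.50)               # 60.5 under this approximation
flagged = strong_codon_bias(40.0, 0.50)      # 40 < 0.75 * 60.5, so flagged
```

Values on the x-axis of the figure are percentages and would need to be divided by 100 before being passed to these functions.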

Fig. S5.10 Removal of contaminant sequences from genome assemblies Identification of putative contaminant sequences in each assembled genome of Symbiodinium. The total length of the genome sequences with shared significant similarity to known bacterial, archaeal and viral genome sequences (left panels), the number of these implicated sequences (middle panels), and the number of genes located in these implicated sequences (right panels) are shown, across different thresholds of the percentage of sequence cover relative to aligned regions. The threshold of 5% (bars with red outline) is the chosen threshold in this analysis, and the implicated sequences were considered as contaminants and removed from the corresponding assembly.

Fig. S5.11 G+C and length sequence outliers (A-E) Distribution of G+C content in assembled genome scaffolds relative to the scaffold lengths, shown for each Symbiodinium genome assembly generated in this study. Data points in a darker shade of blue represent scaffolds that were identified as outliers and removed from the assembly. (F) Summary of the identified outlier scaffolds relative to their total length.

Fig. S5.12 Determination of k for alignment-free tree and network Proportion of distinct (A) and unique (B) k-mers at different k-values in each isolate (coloured according to the bottom legend).

5.8. Supplementary tables

Supplementary tables are available from DOI: 10.1101/800482 (biorxiv.org/content/10.1101/800482v1.supplementary-material).

Chapter 6. General discussion, conclusions and outlook

Prior to this thesis research, studies on Symbiodiniaceae genomics had been restricted to two species, Breviolum minutum1 and Fugacium kawagutii2, and focused mainly on the characterisation of genome features, such as gene structure and organisation and gene number expansion. As whole-genome sequencing technology became more accessible, and during the course of this thesis research, newly generated genome data for Symbiodiniaceae allowed for comparative analyses of gene functions addressing key biological questions, such as adaptation to symbiosis and protection against irradiation3-5. Nonetheless, the molecular mechanisms that underpinned the diversification of Symbiodiniaceae remained largely unknown, as did the implications of the evolutionary transition from a free-living to a symbiotic lifestyle for genome evolution of Symbiodiniaceae. Likewise, genome variation among Symbiodiniaceae within the same genus or same species remained uncharacterised.

To address these knowledge gaps, in this thesis I (i) identified gene functions that are common and specific to distinct lineages of Symbiodiniaceae using available genome-scale data; (ii) formulated a framework for probing genome evolution of Symbiodiniaceae relative to the evolutionary transition to symbiosis, building on our current understanding of other intracellular symbionts and parasites; (iii) generated de novo genome assemblies from seven isolates within the basal genus Symbiodinium, encompassing distinct ecological niches; (iv) assessed genome features of a free-living and a symbiotic Symbiodinium species based on high-quality genome assemblies; and (v) comprehensively assessed and compared all available genome data of Symbiodiniaceae and the free-living outgroup species Polarella glacialis, including multiple isolates within the same genus and the same species. In this Chapter, I review the key findings from this thesis work relative to the working hypotheses (Section 1.1) and specific aims (Section 1.2), and discuss the contributions of this thesis work to Symbiodiniaceae genomics and perspectives for future research. A summary diagram is shown in Fig. 6.1.

Fig. 6.1 Genomics of Symbiodiniaceae in the last decade Contributions of genome-scale data from this thesis research relative to other studies of Symbiodiniaceae1-8 from 2010.

6.1. General discussion

This thesis research is the first comprehensive assessment of genome diversity not only of Symbiodiniaceae but also of dinoflagellates. This is also the first study that makes use of alignment-free methods to estimate sequence similarity between eukaryotic genomes. Through this thesis research, I discovered gene functions that distinguish Symbiodiniaceae from other dinoflagellates in Order Suessiales (Chapters 3, 4 and 5). These functions, related to transmembrane transport and response to reactive oxygen species, are associated with the capability of Symbiodiniaceae to establish and maintain symbiotic associations, and with their adaptation to the highly dynamic light regimes that they are typically exposed to. I also discovered gene functions exclusive to each lineage that can serve as an analysis platform for future investigations aimed at understanding adaptation to more-specialised ecological niches. These results addressed Aim 1.

De novo genome assemblies from seven Symbiodinium isolates were generated through this thesis work. These include two high-quality assemblies incorporating both short- and long-read sequence data, and five draft assemblies generated using short-read sequence data. These data were incorporated in two independent comparative analyses, one focusing on genomes of free-living versus symbiotic species (Chapter 4), another focusing more broadly on all Symbiodiniaceae genomes (Chapter 5).

Genome comparison between a symbiotic and a free-living Symbiodinium species (Chapter 4) revealed extensive sequence divergence that is largely explained by a high extent of structural rearrangements and gene duplication in the symbiotic S. tridacnidorum7. These results, together with the subsequent comprehensive comparison (Chapter 5), revealed genome features that may be related to the evolutionary transition of Symbiodiniaceae from a free-living to a symbiotic lifestyle, and genome divergence within the genus Symbiodinium and within individual Symbiodinium species6. In these ways, these results addressed Aims 2 and 3. However, high-quality genome data from more free-living and symbiotic taxa are required to confidently conclude that the observed differences are indeed attributable to the distinct lifestyles. Here, I discuss these results relative to the working hypotheses outlined in Chapter 1 (Section 1.1).

6.1.1. Symbiodiniaceae gene functions are associated with adaptation and diversification

My analysis of transcriptome and genome data (Chapter 3) uncovered gene families that are exclusive to major lineages of Symbiodiniaceae. Gene functions encoded in the genomes of Symbiodiniaceae are related to adaptation of these lineages to the tropical shallow benthic environments that they usually inhabit. Lineage-specific gene families can arise from highly diverged homologs because they are inferred based on sequence similarity, as supported by the conservation of gene functions across Symbiodiniaceae6. However, gene functions were also annotated based on sequence similarity; thus, the potential for functional innovation in these families cannot be dismissed. A clear understanding of the contribution of lineage-specific gene families to physiological differences among Symbiodiniaceae demands experimental characterisation of the functions encoded by these gene families. It is also possible that, to some extent, distinct physiological attributes among Symbiodiniaceae are the result of epigenetic and/or post-transcriptional regulation of gene expression9-11. These features represent an area of opportunity for future research.

6.1.2. Genome features of Symbiodinium likely relate to transition from free-living to symbiotic

The results from this thesis research strongly support this hypothesis. I formulated a research framework for studying genome evolution of Symbiodiniaceae based on known features of intracellular bacterial symbionts (Chapter 2), and adopted this framework in the subsequent analyses presented in Chapters 4 and 5. I identified genome features that distinguish the symbiotic S. tridacnidorum from the free-living S. natans7, which largely align with our expectations in the framework. These features include a high abundance of transposable elements, structural rearrangements, gene duplication and pseudogenes, and conservation of gene functions relevant to symbiosis. This body of research also highlighted the challenges of resolving repetitive elements in dinoflagellate genomes, as evidenced by the genome assemblies generated using only short-read sequence data. The inclusion of the outgroup species Polarella glacialis in the genome comparison also allowed confirmation of the ancestral condition of the free-living lifestyle in genus Symbiodinium. More high-quality genome data (incorporating long-read sequences) of other symbiotic and free-living Symbiodiniaceae are thus needed to further test Hypothesis II in a more-robust experimental design. For instance, these data will provide an excellent analysis platform to assess whether the free-living lifestyle in other Symbiodiniaceae lineages is ancestral or derived; a similar comparative genomics study has demonstrated that nodulation in plants did not evolve multiple times but was due to parallel gene loss12.

6.1.3. Genome-sequence divergence among Symbiodiniaceae exceeds known genetic diversity

My results strongly support this hypothesis. The comparison between the symbiotic S. tridacnidorum and the free-living S. natans (Chapter 4) revealed extensive genome sequence divergence attributed mostly to structural rearrangements and gene duplication. This suggests that symbiosis could be a trigger of genetic diversification in Symbiodinium and, possibly, in Symbiodiniaceae. I also found that within-genus genome divergence is not restricted to Symbiodinium with distinct lifestyles, and that substantial genome divergence exists even within the same species (e.g. between the two S. tridacnidorum isolates presented in Chapter 5)13. Although symbiosis could drive part of the observed genome divergence, other processes may also underpin extreme genome divergence in Symbiodinium (e.g. the substantial divergence between the two free-living species S. natans and S. pilosum) and more generally in Symbiodiniaceae. Although this investigation was targeted at the genus Symbiodinium, the results also revealed genome variation between two Cladocopium isolates, shedding light on the remarkable divergence in other symbiodiniacean lineages. Because Cladocopium is considered the most genetically diverse genus of Symbiodiniaceae14, it would be interesting to assess whether the genome divergence in Cladocopium is comparable to that in Symbiodinium. Results from this thesis work reveal extensive genome-sequence divergence among Symbiodiniaceae, even within a genus and/or a species. This genome diversity extends beyond nucleotide changes to structural rearrangements, including variable composition of repetitive elements.

6.2. Concluding remarks and future perspectives

This thesis work represents the first and most comprehensive assessment of genome evolution of Symbiodiniaceae, and of the basal lineage Symbiodinium, using extensive whole-genome sequence data. This research contributes seven de novo genome assemblies (two of them incorporating long-read sequence data) from diverse Symbiodinium isolates, together with their corresponding transcriptomes and predicted protein-coding genes. Due to the complexity and idiosyncrasy of these dinoflagellate genomes, sophisticated, customised and novel methods in bioinformatics and computational biology were adopted in this thesis work. The knowledge and genome data generated from this body of research provide a foundational analysis platform for future research in comparative genomics of dinoflagellates and other microbial eukaryotes, in the biology of Symbiodiniaceae and in the evolution of eukaryote-eukaryote symbioses. The results from this research also provide valuable insights into the genome evolution and diversification of Symbiodiniaceae.

The recent availability of draft genomes from Symbiodiniaceae1,3-5,7 and the systematic revision of these taxa15 are allowing investigators to venture deeper into the evolutionary history of these ecologically important organisms. In this thesis work, I emphasised genome evolution of Symbiodiniaceae with different lifestyles (mainly symbiotic versus free-living). However, the highly intricate ecology of Symbiodiniaceae must be acknowledged, even amongst the symbiotic forms. For example, host specificity does not always correlate with transmission mode between hosts (horizontal or vertical) or with the facultativeness of a symbiotic association16. Nevertheless, a narrow focus on one of these convoluted aspects of Symbiodiniaceae biology (e.g. transmission mode or host specificity), such as a comparison of horizontally versus vertically transmitted symbionts, can provide useful models for understanding the evolution and diversification of Symbiodiniaceae, despite the simplification of these potentially non-discrete categories.

Although a comprehensive representation of taxa generally informs comparative genomic studies better, the identification and targeting of key species remain important. For instance, the use of genome data from the closely related species Polarella glacialis17 as an outgroup in this research helped elucidate the potentially ancestral origin of the free-living lifestyle in Symbiodinium and the conservation of gene functions in Symbiodiniaceae. To determine how far back this gene-function conservation goes in dinoflagellates, genome data of earlier-diverging lineages are needed. I also foresee genome comparisons among diverse Symbiodiniaceae lineages becoming more specialised, using genome data from more species in each of the newly established genera, as I did here for Symbiodinium, to address biological questions that are more narrowly focused. Such questions would, for instance, aim to decipher gene functions that contribute to heat tolerance in Durusdinium spp. (the former Clade D) or genome features implicated in the hyperdiversity in Cladocopium (the former Clade C). Available genome data of Symbiodiniaceae, including those generated as part of this thesis research, can be used to assess local adaptation via selection tests. Population-scale genomic analysis represents a powerful tool to explore genetic diversity in distinct populations and for each species.

Symbiodiniaceae studies can also benefit from the implementation of cutting-edge genomic technologies. Evidence of this was provided in this thesis work: owing to the incorporation of long-read sequence data, repetitive regions of the genome were better resolved in the assemblies of S. tridacnidorum CCMP2592 and S. natans CCMP2548 than in genome assemblies of other Symbiodiniaceae. A more detailed characterisation of large-scale duplication and translocation in genome sequences would require the application of technologies like optical mapping and genome phasing18,19. The use of full-length transcript sequencing technologies (e.g. PacBio Iso-Seq) will help elucidate transcript isoforms among Symbiodiniaceae, as demonstrated in an earlier study of P. glacialis genomes17. Single-cell genomics can be useful to profile Symbiodiniaceae communities in individual coral colonies with improved accuracy and to outline specific coral-dinoflagellate assemblages. The emerging field of spatial single-cell transcriptomics20 can also be applied to directly assess the biological processes occurring at the coral-symbiont interface (i.e. the symbiosome), and may thereby inform efforts to counter coral bleaching.

Gene-function annotation for non-model organisms is typically based on sequence similarity to genes/proteins in public databases. However, these databases contain predominantly data from model organisms, most of which are distantly related to Symbiodiniaceae. Consequently, a considerable proportion of dinoflagellate genes lack any sort of functional annotation but are likely important for adaptation and niche specialisation21. In this research, a gene-prediction workflow tailored for idiosyncratic gene features of dinoflagellates was implemented17,22. However, analyses based on annotated gene functions of Symbiodiniaceae should be interpreted cautiously. Curation of gene functions can be done only experimentally, and the implementation of novel tools to do so, such as CRISPR/Cas9 gene editing, should be an essential focus of future research23. Experimental studies can also corroborate the contribution of transposable-element activity to genome divergence among Symbiodiniaceae. Determining the frequency and extent of transposition events is key to explaining genetic diversity within Symbiodiniaceae, as transposition appears to be an important driver of genome evolution in this group2,7,24,25. Empirical estimations of mutation and recombination rates are also fundamental for a deeper understanding of symbiodiniacean evolution and for attempts at assisted evolution to develop resilient coral-symbiont assemblages26. Genetic engineering through genome editing may be used to generate coral holobionts that are resilient to harsh environmental conditions27,28. Empirical studies (based on physiology, gene expression or metabolomics, for instance) can identify key biological processes for adaptation or response to specific conditions. The actual targets for engineering (e.g. genes or regulatory elements) require detailed knowledge of the genome sequence of the target organism. For example, genome resources of Symbiodiniaceae generated as part of this thesis work can inform us whether the sequence to be edited occurs in multiple copies, and/or whether some of these copies are indeed functional (i.e. not pseudogenes).

Research efforts in Symbiodiniaceae should leverage and integrate current knowledge of other symbiotic organisms that share similar evolutionary trajectories to answer fundamental but open questions about the biology and evolution of this group. For example, in this thesis I made use of the knowledge regarding genome evolution of bacterial symbionts and parasites to draw expectations about genome features of Symbiodiniaceae with distinct lifestyles13. Evidently, the biology of Symbiodiniaceae deviates from that of bacteria, and the impacts of their biology on genome evolution represent areas of opportunity for research. Symbiodiniaceae research will increasingly rely on other widely scoped surveys beyond traditional comparative genomics and transcriptomics, to incorporate proteomics, metabolomics, epigenomics, post-transcriptional regulation (e.g. by small RNAs), and their associated experimental validation. Concerted multidisciplinary collaborations will become the norm. Disentangling the molecular basis of the diverse physiology of Symbiodiniaceae29-32 and of their adaptation to a broad spectrum of ecological niches15,33,34 should be a high-priority focus of future research. This research can be built on genomic resources, such as those generated as part of this thesis work, and a combination of some of the broader-scope techniques mentioned above. Finally, extending beyond Symbiodiniaceae and in combination with genome-scale data from the associated hosts and microbiome, hologenomics35 represents a promising approach to gain a comprehensive snapshot of the various symbiotic associations that are critical to coral reefs. This was demonstrated by a recent genomic study of a coral holobiont8, including its Symbiodiniaceae symbionts and microbiome, which revealed how gene functions of the distinct biotic components contribute to sustaining a healthy ecological unit.

Genomes of dinoflagellates were once inaccessible due to their immense sizes and idiosyncratic features. Data and analytic workflows generated from this thesis work (and elsewhere) are changing the landscape of genome research in Symbiodiniaceae, and can be readily applied more broadly to study other dinoflagellates, e.g. bloom-forming and toxin-producing species. With the continuing advancement of genomic technologies, the future of dinoflagellate genome research has never been more exciting.

6.3. References

1 Shoguchi, E. et al. Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr. Biol. 23, 1399-1408 (2013).
2 Lin, S. et al. The Symbiodinium kawagutii genome illuminates dinoflagellate gene expression and coral symbiosis. Science 350, 691-694 (2015).
3 Aranda, M. et al. Genomes of coral dinoflagellate symbionts highlight evolutionary adaptations conducive to a symbiotic lifestyle. Sci. Rep. 6, 39734 (2016).
4 Liu, H. et al. Symbiodinium genomes reveal adaptive evolution of functions related to coral-dinoflagellate symbiosis. Commun. Biol. 1, 95 (2018).
5 Shoguchi, E. et al. Two divergent Symbiodinium genomes reveal conservation of a gene cluster for sunscreen biosynthesis and recently lost genes. BMC Genomics 19, 458 (2018).
6 González-Pech, R. A. et al. Genomes of Symbiodiniaceae reveal extensive sequence divergence but conserved functions at family and genus levels. bioRxiv, 800482 (2019).
7 González-Pech, R. A. et al. Structural rearrangements drive extensive genome divergence between symbiotic and free-living Symbiodinium. bioRxiv, 783902 (2019).
8 Robbins, S. J. et al. A genomic view of the reef-building coral Porites lutea and its microbial symbionts. Nature Microbiology (2019).
9 Leggat, W., Yellowlees, D. & Medina, M. Recent progress in Symbiodinium transcriptomics. J. Exp. Mar. Biol. Ecol. 408, 120-125 (2011).
10 Liew, Y. J., Li, Y., Baumgarten, S., Voolstra, C. R. & Aranda, M. Condition-specific RNA editing in the coral symbiont Symbiodinium microadriaticum. PLoS Genet. 13, e1006619 (2017).
11 Baumgarten, S. et al. Integrating microRNA and mRNA expression profiling in Symbiodinium microadriaticum, a dinoflagellate symbiont of reef-building corals. BMC Genomics 14, 704 (2013).

12 van Velzen, R. et al. Comparative genomics of the nonlegume Parasponia reveals insights into evolution of nitrogen-fixing rhizobium symbioses. Proc. Natl. Acad. Sci. U. S. A. 115, E4700-E4709 (2018).
13 González-Pech, R. A., Bhattacharya, D., Ragan, M. A. & Chan, C. X. Genome evolution of coral reef symbionts as intracellular residents. Trends Ecol. Evol. (2019).
14 Stat, M., Carter, D. & Hoegh-Guldberg, O. The evolutionary history of Symbiodinium and scleractinian hosts—symbiosis, diversity, and the effect of climate change. Perspect. Plant Ecol. 8, 23-43 (2006).
15 LaJeunesse, T. C. et al. Systematic revision of Symbiodiniaceae highlights the antiquity and diversity of coral endosymbionts. Curr. Biol. 28, 2570-2580 (2018).
16 Quigley, K. M., Warner, P. A., Bay, L. K. & Willis, B. L. Unexpected mixed-mode transmission and moderate genetic regulation of Symbiodinium communities in a brooding coral. Heredity 121, 524-536 (2018).
17 Stephens, T. G. et al. Polarella glacialis genomes encode tandem repeats of single-exon genes with functions critical to adaptation of dinoflagellates. bioRxiv, 704437 (2019).
18 van Dijk, E. L., Jaszczyszyn, Y., Naquin, D. & Thermes, C. The third revolution in sequencing technology. Trends Genet. 34, 666-681 (2018).
19 Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329-346 (2018).
20 Burgess, D. J. Spatial transcriptomics coming of age. Nat. Rev. Genet. 20, 317-317 (2019).
21 Stephens, T. G., Ragan, M. A., Bhattacharya, D. & Chan, C. X. Core genes in diverse dinoflagellate lineages include a wealth of conserved dark genes with unknown functions. Sci. Rep. 8, 17175 (2018).
22 Chen, Y., González-Pech, R. A., Stephens, T. G., Bhattacharya, D. & Chan, C. X. Evidence that inconsistent gene prediction can mislead analysis of dinoflagellate genomes. J. Phycol. 56, 6-10 (2020).
23 Cleves, P. A., Shumaker, A., Lee, J., Putnam, H. M. & Bhattacharya, D. Unknown to known: advancing knowledge of coral gene function. Trends Genet. 36, 93-104 (2020).
24 de Mendoza, A. et al. Recurrent acquisition of cytosine methyltransferases into eukaryotic retrotransposons. Nat. Commun. 9, 1341 (2018).
25 Chen, J. E., Cui, G., Wang, X., Liew, Y. J. & Aranda, M. Recent expansion of heat-activated retrotransposons in the coral symbiont Symbiodinium microadriaticum. ISME J. 12, 639-643 (2018).

26 van Oppen, M. J. H., Oliver, J. K., Putnam, H. M. & Gates, R. D. Building coral reef resilience through assisted evolution. Proc. Natl. Acad. Sci. U. S. A. 112, 2307-2313 (2015).
27 Cleves, P. A., Strader, M. E., Bay, L. K., Pringle, J. R. & Matz, M. V. CRISPR/Cas9-mediated genome editing in a reef-building coral. Proc. Natl. Acad. Sci. U. S. A. 115, 5235-5240 (2018).
28 Levin, R. A. et al. Engineering strategies to decode and enhance the genomes of coral symbionts. Frontiers in Microbiology 8 (2017).
29 Suggett, D. J. et al. Functional diversity of photobiological traits within the genus Symbiodinium appears to be governed by the interaction of cell size with cladal designation. New Phytol. 208, 370-381 (2015).
30 Goyen, S. et al. A molecular physiology basis for functional diversity of hydrogen peroxide production amongst Symbiodinium spp. (Dinophyceae). Mar. Biol. 164, 46 (2017).
31 Lawson, C. A., Possell, M., Seymour, J. R., Raina, J.-B. & Suggett, D. J. Coral endosymbionts (Symbiodiniaceae) emit species-specific volatilomes that shift when exposed to thermal stress. Sci. Rep. 9, 17395 (2019).
32 Warner, M. E. & Suggett, D. J. in The Cnidaria, past, present and future: the world of Medusa and her sisters (eds Stefano Goffredo & Zvy Dubinsky) 489-509 (Springer International Publishing, 2016).
33 Stat, M., Morris, E. & Gates, R. D. Functional diversity in coral–dinoflagellate symbiosis. Proc. Natl. Acad. Sci. U. S. A. 105, 9256-9261 (2008).
34 Baker, A. C. Flexibility and specificity in coral-algal symbiosis: diversity, ecology, and biogeography of Symbiodinium. Annu. Rev. Ecol. Evol. Syst., 661-689 (2003).
35 Bordenstein, S. R. & Theis, K. R. Host biology in light of the microbiome: ten principles of holobionts and hologenomes. PLoS Biol. 13, e1002226 (2015).
