bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 First genome of , an opportunistic pathogen, reveals novel insight

2 into marine phylogeny, ecology and CAZyme cell-wall degradation

3 Mun Hua Tan1,2,3,4, Stella Loke1,2, Laurence J. Croft1,2, Frank H. Gleason5, Lene Lange6, Bo

4 Pilgaard7, Stacey M. Trevathan-Tackett1*

5

6 1Centre of Integrative Ecology, School of Life and Environmental Sciences Deakin

7 University, Geelong, Australia

8 2Deakin Genomics Centre, Deakin University, Geelong, Australia

9 3School of BioSciences, Bio21 Institute, University of Melbourne, Australia

10 4Department of Microbiology and Immunology, University of Melbourne, Bio21 Institute,

11 Melbourne, Australia

12 5School of Life and Environmental Sciences, University of Sydney, Sydney, 2006, New

13 South Wales, Australia

14 6BioEconomy, Research & Advisory, Valby, 2500 Copenhagen, Denmark

15 7Protein Chemistry and Enzyme Technology, Department of Bioengineering, Technical

16 University of Denmark, Kgs. Lyngby, Denmark

17

18

19 * Corresponding Author: Dr Stacey Trevathan-Tackett, [email protected],

20 Deakin University, 221 Burwood Hwy, Burwood, Victoria, Australia 3125

21

22 Keywords: evolution; ion regulation; mitochondrial genome; saprobe; ;

23 virulence

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

24 Abstract

25 Labyrinthula spp. are saprobic, marine that also act as opportunistic pathogens and

26 are the causative agents of seagrass wasting disease (SWD). Despite the threat of local- and

27 large-scale SWD outbreaks, there are currently gaps in our understanding of the drivers of

28 SWD, particularly surrounding Labyrinthula virulence and ecology. Given these

29 uncertainties, we investigated Labyrinthula from a novel genomic perspective by presenting

30 the first draft genome and predicted proteome of a pathogenic isolate of Labyrinthula

31 SR_Ha_C, generated from a hybrid assembly of Nanopore and Illumina sequences.

32 Phylogenetic and cross-phyla comparisons revealed insights into the evolutionary history of

33 Stramenopiles. Genome annotation showed evidence of glideosome-type machinery and an

34 apicoplast protein typically found in protist pathogens and parasites. Proteins involved in

35 Labyrinthula’s actin-myosin mode of transport, as well as carbohydrate degradation were

36 also prevalent. Further, CAZyme functional predictions revealed a repertoire of enzymes

37 involved in breakdown of cell-wall and carbohydrate storage compounds common to

38 . The relatively low number of CAZymes annotated from the genome of

39 Labyrinthula SR_Ha_C compared to other Labyrinthulea species may reflect the

40 conservative annotation parameters, a specialised substrate affinity and the scarcity of

41 characterised protist enzymes. Inherently, there is high probability for finding both unique

42 and novel enzymes from Labyrinthula spp. This study provides resources for further

43 exploration of Labyrinthula ecology and evolution, and will hopefully be the catalyst for new

44 hypothesis-driven SWD research revealing more details of molecular interactions between

45 Labyrinthula species and its host substrate.

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

46 Introduction

47 Sequencing capabilities of both short and long read technologies have shown marked

48 progress in recent years, resulting in the generation of a large number of contiguous and high

49 quality genome assemblies reported for a myriad of living organisms at an unprecedented

50 pace. As a result of genome sequencing projects at local-to-global scales, we are better

51 resolving linages across the Tree of Life (Lewin et al. 2018), revealing new diversity for

52 underrepresented or unculturable microorganisms (e.g. aquatic and and

53 zoosporic fungi; Sibbald and Archibald 2017, Lange et al. 2019a) and informing the

54 ecological and functional roles of organisms in the environment (Kyrpides et al. 2014). In

55 marine ecosystems, genomic and transcriptomic sequencing is producing vital information on

56 the ecological roles microorganisms play. For example, genomes of putative pathogens are

57 revealing genes involved in virulence and attachment to the host, as well as mechanisms

58 involved in degrading the host substrate and switching between non-pathogenic and

59 pathogenic lifestyles (Fernandes et al. 2011). These new insights are especially important as

60 marine emergent diseases and opportunistic pathogens are predicted to increase globally in

61 the coming decades due to the suboptimal living conditions for marine biota caused by

62 human- and climate-induced changes (Harvell et al. 1999).

63

64 Seagrass meadows are one such marine ecosystem susceptible to a decline in health and

65 increased infection, as they are subjected to dynamic pressures at the land-sea interface, such

66 as elevated temperatures in shallow regions and run-off-associated turbidity, eutrophication

67 and hyposalinity conditions (Sullivan et al. 2018). There are currently four pathogens that

68 cause disease to multiple seagrass genera: Labyrinthula, Phytophthora, Halophytophthora,

69 and Phytomyxea (Sullivan et al. 2018). Labyrinthula, a heterotrophic, halophytic protist in

70 the lineage, is the causative agent of seagrass wasting disease (SWD). To-date,

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

71 it is also the most well-studied seagrass pathogen after being discovered in the early twentieth

72 century when it decimated 90% of the North Atlantic population of the seagrass Zostera

73 marina (Sullivan et al. 2018). Labyrinthula isolates from seagrass leaves have shown variable

74 levels of virulence through laboratory infection trials and associated phylogenetic analyses

75 (Martin et al. 2016, Brakel et al. 2017). Virulent isolates are able to invade the leaf cells of

76 living plants and subsequently cause black leaf lesions, the diagnostic trait of SWD (Sullivan

77 et al. 2018). However, Labyrinthula, is also a ubiquitous commensal saprobe that

78 decomposes marine plants and litter, with one species parasitic to turfgrass (Martin et

79 al. 2016). Because of the flexibility of its ecological function, Labyrinthula spp. are

80 considered opportunistic pathogens that in some cases utilise a mild strategy

81 (Brakel et al. 2017).

82

83 Since its discovery, the advances in our understanding of Labyrinthula species include

84 phylogenetic reclassifications of Labyrinthula (Leander and Porter 2001) and molecular

85 techniques that facilitate detection of Labyrinthula without the need for culturing (Duffin et

86 al. 2020, Lohan et al. 2020). Despite this progress, we still do not completely understand the

87 pathosystem of SWD. For example, it is still under debate whether the conditions that

88 promote SWD rely more on seagrass health and defence capabilities or the virulence and

89 enzymatic capabilities of Labyrinthula isolates and haplotypes, or a combination of host and

90 pathogen characteristics (Martin et al. 2016, Duffin et al. 2020). Much of the new knowledge

91 on the pathosystem has come from the immune and stress response of the seagrass using

92 genomic markers or assays (Brakel et al. 2014, Duffin et al. 2020), as opposed to pathogen

93 virulence (Martin et al. 2016). Furthermore, despite our molecular advances, we also know

94 little about the role of specific organelles, such as the bothrosome, in the hunt and

95 consumption of food sources (Collier and Rest 2019). Next-generation sequencing

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

96 technologies are helping us produce new hypotheses for Labyrinthula and SWD. For

97 example, Lohan et al. (2020) questioned how the variation in Labyrinthula diversity links

98 with intra-species diversity of the seagrass host.

99

100 This set of ecological uncertainties constitutes the basis for our interest in sequencing the

101 genome of a Labyrinthula isolate. So far the genomes of only three different genera of the

102 Labyrinthulea class have been sequenced, which has been valuable in revealing novel

103 ecological insights and biotechnology applications (Song et al. 2018). In this study, using a

104 hybrid assembly approach combining both short and long read sequence data, we assemble

105 the first known draft genome of a pathogenic haplotype of Labyrinthula and report a highly

106 contiguous assembly. Through sequencing and annotating a draft Labyrinthula genome, our

107 aim is to further resolve the genus’ lineage and function, particularly its capacity to degrade

108 seagrass carbohydrates, and compare those functions in closely-related but non-pathogenic,

109 saprobic members of Labyrinthulea. We expect that this work will also provide a launching

110 pad to deeply investigate and understand the basic ecology and evolutionary history of this

111 ubiquitous marine protist, as well as to unlock the potential of their understudied and

112 unexploited CAZyme diversity.

113

114 Materials and methods

115 Labyrinthula sp. isolate description, sample preparation and sequencing

116 The Labyrinthula sp. (order Labyrinthulida) isolate used for sequencing was cultured on

117 seawater-serum agar plates (Trevathan-Tackett et al. 2018) at room temperature prior to

118 preparing DNA and RNA for sequencing. Isolated from a seagrass meadow in Victoria,

119 Australia, isolate SR_Ha_C (GenBank: MF872134.1; referred simply as Labyrinthula

120 SR_Ha_C herein) was previously shown to be highly pathogenic to both Zostera muelleri and

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

121 Heterozostera nigricaulis (Trevathan-Tackett et al. 2018). Genomic DNA for short-read

122 Illumina sequencing was extracted from the Labyrinthula SR_Ha_C culture with the

123 ZymoBIOMICS Quick DNA kit according to the manufacturers’ instructions. Total RNA

124 was prepared using the Monarch Total RNA isolation kit according to the standard protocol

125 for hard to lyse cells. The short-read genome library was prepared with the Nextera Flex

126 DNA library preparation kit (Illumina, San Diego) and sequenced on the MiSeq with 2 x

127 300bp V3 chemistry. The short-read RNAseq library was prepared with the Nugen mRNA

128 ovation kit and sequenced on the MiniSeq with 2x150bp chemistry. The full-length

129 transcriptome libraries were prepared for long-read sequencing with Lexogen’s teloprime kit

130 and Oxford Nanopore’s SQK-LSK109 library prep kit (ONT, UK). Sequencing was

131 performed on one Flongle (FLO) flow cell for 24 h and one MinION R9.4.1 flow cell for 24

132 h. See Electronic Supplementary Material 1 for detailed isolate, extraction, assembly and

133 annotation methodology.

134

135 Genome assembly

136 Illumina reads were trimmed for quality and adapters with Trimmomatic, whereas adapters

137 were trimmed from Nanopore MinION reads with Porechop. Haploid genome size, repetitive

138 content and heterozygosity size for Labyrinthula SR_Ha_C was estimated with

139 GenomeScope based on the frequency of k-mers in short reads (Vurture et al. 2017).

140 MaSuRCA (Zimin et al. 2017) was used to perform a hybrid assembly of the genome,

141 combining short and long read data. Resulting contigs were corrected and polished with three

142 iterations of Racon and Pilon. Since bimodal k-mer distributions were observed from the

143 GenomeScope analysis, indicating high levels of heterozygosity, the Purge Haplotigs pipeline

144 was used to remove haplotigs (Roach et al. 2018). Checks were put into place to scan for

145 possible bacterial contaminant contigs. The first was to search for any contigs with first hits

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

146 to bacterial sequences in NCBI’s non-redundant nucleotide database. Secondly, RNAmmer

147 was used to identify contigs containing bacterial ribosomal RNAs. In both searches, none of

148 the contigs met these criteria and therefore all were retained. The final assembly was

149 compared to those of other Stramenopiles based on assembly quality and completeness with

150 Quast (Mikheenko et al. 2018) and BUSCO (Seppey et al. 2019), respectively. A list of

151 species included in this comparison is available in Table S1.

152

153 Mitogenome recovery and phylogenetic analysis

154 The complete Labyrinthula SR_Ha_C mitogenome was extracted from the assembly through

155 blastn homology with other protist mitogenomes. The mitogenome was annotated with

156 MFannot and MITOS and manually adjusted according to proteins in NCBI’s non-redundant

157 database. Phylogenetic reconstruction was performed based on thirteen genes typically found

158 in mitogenomes (atp6, atp8, cox1-3, nad1-6, nad4l, cytb) (Boore 1999). These

159 were extracted from 127 Stramenopile mitogenomes and four from outgroup species (Table

160 S2). Protein sequences were aligned with MAFFT, trimmed with trimAl and concatenated

161 into a superalignment with FASconCAT-G. IQ-TREE was used to find the best-fit partition

162 model and to infer a maximum-likelihood tree. The tree was rooted with Reclinomonas

163 americana, which has a mitogenome that most closely resembles the ancestral proto-

164 mitochondrial genome (Lang et al. 1997).

165

166 Gene prediction and functional annotation

167 Two types of “hints” were used to predict genes for this non-model organism. The first type

168 consists of Labyrinthula transcripts generated in this study by sequencing (Illumina) and

169 assembling the transcriptome (Haas et al. 2013) to generate an Illumina-based transcriptome

170 (Data S1), in addition to long Nanopore-sequenced cDNA reads. The second type included

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

171 known peptide sequences from 235 protists downloaded from Ensembl Protists (release 46;

172 Howe et al. 2019) and curated sequences from UniProtKB/Swiss-Prot (Consortium 2018).

173

174 Gene prediction was performed with the MAKER2 annotation pipeline (Holt and Yandell

175 2011). A first iteration was run to identify a preliminary set of genes based on provided hints.

176 The predictions were used to train two ab initio gene predictors, Augustus and SNAP, after

177 which the resulting gene models were used in a second MAKER2 iteration. Since protist

178 genomes are diverse and therefore contribute highly divergent gene sequences, the

179 keep_preds option was activated in this iteration to report all predictions regardless of

180 concordance to supporting hints. InterProScan was used to scan these predictions for the

181 presence PFAM protein domains and any gene models that have physical evidence

182 (annotation edit distance, AED < 1) and/or protein domains were retained. To annotate

183 function, DIAMOND (Buchfink et al. 2015) was used to find homology between predicted

184 genes and proteins in NCBI’s non-redundant protein database (e-value cutoff 1e-5).

185

186 Ortholog discovery within the Stramenopila (i.e. Chromista) was performed using

187 protein sequences obtained from GenBank for two commonly-studied protists: 1.

188 (BioProject: PRJNA148), 2. Phytophthora sojae (BioProject:

189 PRJNA262907). Pairwise searches of protein sequence sets were performed with DIAMOND

190 (Buchfink et al. 2015) and orthologs between Labyrinthula and the two species were

191 identified through Reciprocal Best Hits (RBH).

192

193 Identification and peptide-based functional annotation of putative carbohydrate-active

194 enzymes

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

195 The dbCAN2 meta server (Zhang et al. 2018) was used to annotate carbohydrate-active

196 enzymes (CAZymes) based on enzyme families available in the CAZy database (Cantarel et

197 al. 2008). Due to the divergence of Labyrinthula from other , the HMMER search

198 option was selected (E-Value < 1e-15, coverage > 0.35) to search against representative

199 hidden Markov models of signature domains within the five enzyme classes: Glycoside

200 Hydrolases (GH), Glycosyl Transferases (GT), Polysaccharide Lyases (PL), Carbohydrate

201 Esterases (CE), and Auxiliary Activities (AA). Counts of putative Labyrinthula SR_Ha_C

202 CAZyme sequences were compared to data available on the CAZy database for 36 other

203 Stramenopile species.

204

205 The predicted amino acid sequences of Labyrinthula SR_Ha_C were analysed by the new

206 non-alignment based method, Conserved Unique Peptide Patterns (CUPP; Barrett and Lange

207 2019). CUPP automatically, annotates proteins to families and subfamilies. Further, if

208 characterised members of the CUPP group query is already assigned to a specific CAZyme

209 function, CUPP assigns that function to the query protein. For the purpose of comparison,

210 predicted proteins of three other members of the class Labyrinthulea (a.k.a.

211 ) were analysed: kerguelense (PBS07; family

212 Aplanochytridae, order Labyrinthulida), as well as Aurantiochytrium limacinum (ATCC

213 MYA-1381) and aggregatum (ATCC 28209) (family Thraustochytriidae,

214 order Thraustochytrida).

215

216 Results & Discussion

217 A contiguous Labyrinthula sp. genome

218 This study reports the first assembled genome of a Labyrinthula species, available on

219 GenBank (BioProject: PRJNA607370, WGS version: JAALGZ01). This 50.9 Mbp assembly

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

220 contained 159 contigs (Table 1), consistent with the kmer-based method’s estimate of a 51

221 Mbp haploid genome size for this isolate (Figure S1). The majority of short reads (99.07%)

222 were successfully aligned back to the assembly. Contiguity and completeness of this

223 assembly were substantially improved by the inclusion of Nanopore long reads in a hybrid

224 assembly, demonstrating the value of strategic incorporation of reads from different

225 platforms. Though the SR_Ha_C isolate was maintained in culture media that included

226 streptomycin and penicillin, further careful checks were performed to exclude bacterial

227 contigs. No bacterial contaminants were found in the genome, ensuring that findings reported

228 in this study accurately reflect the genetics of this Labyrinthula SR_Ha_C isolate. The ‘clean’

229 genome will also prevent inaccurate interpretations in other larger-scale studies looking into

230 investigations of bacteria-to-animal lateral gene transfer (Sibbald and Archibald 2017).

231

232

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

233 Table 1. Data generation, genome assembly and annotation of the Labyrinthula sp.

234 isolate SR_Ha_C. nr refers to the non-redundant protein database. aa = amino acids.

Raw sequencing data and depth (based on 51 Mbp genome size estimate) Illumina (MiSeq) 21.5 Gbp (421.79×) Oxford Nanopore (MinION) 612 Mbp (12.00×) Mitogenome assembly Assembly size (circular) 26,822 bp # protein coding genes 23 # ribosomal RNAs 2 # transfer RNAs 36 Genome assembly Number of contigs 159 Assembly size 50,948,119 bp Contig N50 (median) size 1,265,572 bp Average contig size 320,428.42 bp Longest contig 3,302,845 bp Shortest contig 4,362 bp BUSCO completeness (lineage: stramenopiles_odb10, n:100) Complete 60.0% Fragmented 9.0% Missing 31.0% BUSCO completeness (lineage: eukaryota_odb10, n:255) Complete 42.8% Fragmented 9.0% Missing 48.2% Genome annotation Number of ab initio proteins predicted by gene predictors 17,041 with AED+ < 1 and/or with PFAM domain 8,488 with homology to NCBI’s nr database (1e-5) 6,351 with homology to NCBI’s nr database (1e-50) 1,372 Average protein length 470 aa Average number of exons/gene 2.03 Longest protein* 15,723 aa 235 + Annotation edit distance (AED) ranges from 0 to 1, a small value shows lesser variance between the 236 predicted gene and provided evidence. 237 * Homologous to “Receptor-type tyrosine-protein phosphatase F” [Hondaea fermentalgiana] 238

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

239

240 Genome completeness was assessed with the BUSCO algorithm, which reported a 60.0% and

241 42.8% completeness for this reported assembly, based on stramenopiles_odb10 and

242 eukaryota_odb10 ortholog sets, respectively (Table 1, Figure 1). The former ortholog set (27

243 species, 100 BUSCOs) contained only three species from the phylum compared to the

244 Oomycota (14) and (9) phyla. The latter (70 species, 255 BUSCOs) was derived

245 from data of 37 metazoan species, 22 fungi, 10 plants and one Rhodophyta, the single protist

246 representative in this dataset. While the Labyrinthula SR_Ha_C assembly reported relatively

247 high values of longest contig and N50 lengths, BUSCO-assessed completeness ranked lower

248 than that of most other Stramenopiles (Figures 1a, S2). Higher completeness was generally

249 observed for species in the phylum Oomycota but appeared to be more variable in the other

250 two phyla (Figure 1a). In contrast, QUAST analysis showed Labyrinthula SR_Ha_C

251 outperforming the Bigyra phylum average (green cells in Figure 1b) across all measured

252 metrics (Figure 1c). Similarly, high quality QUAST outcomes for hominis were

253 not reflected in BUSCO completeness values (Labyrinthula SR_Ha_C 60%, Blastocystis

254 hominis 45%).

255

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

256

257 Figure 1: Comparison of genome assemblies for Stramenopiles in three phyla: Bigyra,

258 Ochrophyta and Oomycota. a. (Left) N50 length and longest contig/scaffold for each

259 assembly (cutoff 20 Mbp); (Right) Percentages of complete and fragmented BUSCOs based

260 on 100 BUSCOs in stramenopiles_odb10. Number of assemblies per phylum are listed in

261 brackets. A detailed version and sources are available in Figure S2 and Table S1. b. Further

262 details on assemblies with sequence (contig/scaffold) length distribution for species in the

263 phylum Bigyra. Longest sequence length, N50 and N75 are converted to percentages out of

264 assembly length for more accurate comparisons across species with different genome sizes.

265 Green cells are for values that outperform the average and red for values poorer than average,

266 subject to the context of each column. c. QUAST-generated Figures showing a cumulative

267 length plot, Nx values and GC content for species in the phylum Bigyra (colour legend in

268 panel b). Assembly sources: Ensembl Protists, NCBI, JGI.

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

269

270 Though a surge of new genomes are being made available in more recent times, taxonomic

271 sampling of protists are largely skewed to select ‘popular’ lineages with relevance to medical,

272 agricultural and aquaculture interests. However, since gene model prediction, quality

273 assessment and other comparative genomic analyses largely rely on the support of closely-

274 related sequences; this patchy and uneven representation presents gaps in current knowledge

275 and challenges efforts to better characterise and understand genomes of less-studied

276 organisms. In the case of quality assessments, while the BUSCO algorithm has been shown

277 to be an effective and popular choice of tool for assessing the ‘completeness’ of genome

278 assemblies, its shortcomings have also been discussed especially in relation to lineages

279 sampled in its reference ortholog datasets (ExpositoAlonso et al. 2020). Subject to

280 characteristics of an assessed genome, there can be technical limitations in estimating

281 proportions of fragmented and missing BUSCOs, and in species that are highly divergent

282 from the reference ortholog dataset, higher degrees of reported BUSCO ‘missingness’ may

283 reflect the organism’s evolutionary divergence and history rather than the completeness of its

284 assembly (ExpositoAlonso et al. 2020). Our results show that the use of multiple

285 assessment qualities is important to more accurately assess the quality of assemblies,

286 particularly those of highly divergent, non-model organisms (ExpositoAlonso et al. 2020).

287

288 Labyrinthula sp. gene annotation

289 A transcriptome with 1.2 million transcripts was generated and assisted in the annotation of

290 the genome, resulting in the prediction of >17,000 protein sequences (Table 1; annotation

291 results in Data S1). We applied a conservative filter to retain only gene models that were

292 either supported by existing sequence evidence or those that contain PFAM protein domains,

293 in order to avoid false gene predictions albeit at the cost of removing some real, but clade-

14 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

294 specific, novel genes. Since the annotation process typically relies on the homology of

295 sequences to those in other well-annotated reference genomes (ExpositoAlonso et al. 2020),

296 the absence of a genome available as a close relative of Labyrinthula SR_Ha_C resulted in a

297 reduced final number of 8,488 proteins reported in this study (Table 1; Data S1). This is

298 likely an underestimation of the true number owing to the limited sequenced genomes in this

299 clade and further highlights the novelty of the Labyrinthula genome.

300

301 A survey of the annotated functions of these predicted peptides revealed rhodanese-like

302 proteins related to stress response and sulphur metabolism (Cipollone et al. 2007), as well as

303 a rhodopsin protein (Data S1). While light-activated functions have not been explored in

304 Labyrinthula species, phototaxis and rhodopsin-related functions have been increasingly

305 identified for estuarine zoosporic fungi and protists (Kazama 1972, Tian et al. 2018). We

306 noted proteins involved in sodium, potassium and calcium exchange and transport suggestive

307 of the pathways involved in ion homeostasis and salinity adaptations in halophytic organisms

308 (Harding et al. 2017). The annotation also revealed a wealth of actin-myosin and

309 carbohydrate-degradation proteins. A pairwise annotation, with orthologs of known

310 pathogenic protist Plasmodium falciparum and Phytophthora sojae, further

311 revealed glideosome and apicoplast ribosomal proteins (Data S1). The presence of a

312 glideosome protein in addition to the actin-myosin machinery suggests shared traits with the

313 apicomplexan family of pathogens, and opens an interesting avenue for investigating the

314 machinery underlying Labyrinthula parasitism and host invasion, as well as its evolutionary

315 history of endosymbiosis (Keeley and Soldati 2004, Oborník et al. 2009).

316

317 CAZymes and functional predictions facilitating plant cell wall degradation

15 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

318 A total of 112 genes associated with carbohydrate-active enzyme (CAZy) activities were

319 identified from the predicted Labyrinthula gene set (Figure 2, Table S3). This was distributed

320 across 56 glycosyltransferases (GT), 34 glycoside hydrolases (GH), 17 auxiliary activities

321 (AA), 3 polysaccharide lyases (PL) and 2 carbohydrate esterases (CE). We identified the

322 presence of genes categorised in several groups of putative aromatic and cellulose-degrading

323 enzymes with activities associated with the breakdown of plant cell walls. Some of these

324 enzyme families also show an enrichment of genes relative to other available protist data on

325 the CAZy database, including, the AA2 enzyme family (activities: manganese peroxidase,

326 lignin peroxidase), the PL1 family (activities: pectate lyase, pectin lyase), the GH1 family

327 (activities: β-glucosidases, β-galactosidases) and the GH5 family (activities:

328 endoglucanase/cellulase, etc) (Figure 2).

329

16 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

330

17 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

331 Figure 2: Annotation of Carbohydrate-active enzymes in Labyrinthula sp. isolate

332 SR_Ha_C in comparison with other Stramenopile profiles available in the CAZy

333 database. Comparative analysis was done for five enzyme classes: Auxiliary Activities

334 (AA), Carbohydrate Esterases (CE), Glycoside Hydrolases (GH), GlycosylTransferases (GT)

335 and Polysaccharide Lyases (PL). Enzyme family numbers correspond to those in the CAZy

336 database and are indicated on the x-axis. Asterisks (*) indicate the Labyrinthula isolate

337 SR_Ha_C reported in this study.

338

339 As these enzyme families can act on a diversity of substrates, we used the CUPP analysis to

340 more closely predict the function of the CAZymes. We found that the Labyrinthula

341 SR_Ha_C genome has a broad selection of cell wall-degrading and storage materials

342 modifying enzymes (starch, cellulose, trehalose, Beta-D-galactosides, α-mannan. α-1,2-

343 mannan), as well as glucosylceramidase that can be involved in pathways in plant hosts-

344 pathogen relationships (Table 2) (Warnecke and Heinz 2003). The three enzyme functions

345 with highest number of functionally annotated genes were 1,2-α-mannosidase (5), beta

346 glucosidase (4) and glucosylceramidase (3). Interestingly, functional annotation by CUPP did

347 not identify xylanolytic, chitinolytic or pectinolytic CAZymes. However, as the seagrass

348 fibres (particularly of the Zostera genus) are known to be rich in both xylan and pectin

349 (Davies et al. 2007) and the functional annotation by homology suggested xylanolytic and

350 pectinolytic proteins (Data S1), we suggest that Labyrinthula SR_Ha_C could have such

351 enzyme functionalities, but represented by enzymes too distantly related to the known

352 CAZymes to be identified.

353

354 The CUPP annotations of Labyrinthula SR_Ha_C were low (74) compared to the other three

355 Labyrinthulea species: A. kerguelens (129), S. aggregatum (215), and A. limacinum (266;

18 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

356 Tables 2, S4). Interestingly, the ratio of CAZymes, which could be predicted to function,

357 divides the four species in two groups. Notably, the two Thraustochytrid species A.

358 limacinum and S. aggregatum had 7% and 12 % of their CAZymes ascribed to function,

359 respectively, while the two Labyrinthulidae species had 29% (Labyrinthula sp.) and 24% (A.

360 kerguelense) ascribed. In short, among the four species studied, the richest CAZyme profile

361 has also the highest level of uniqueness found among its enzyme portfolio (Tables 2, S4).

362 Although we are working with a conservative predicted proteome, the rather low number of

363 CAZymes found in Labyrinthula SR_Ha_C may be interpreted to reflect its ecology, e.g.

364 being an opportunistic pathogen, it has a more specialised substrate/host affinity, as

365 compared to the more saprotrophic lifestyles of the other three species. Further, it is attractive

366 to interpret the more specialised enzyme profile of Labyrinthula SR_Ha_C. in the context of

367 its highly specialised ectoplasmic net, a unique structure not only attaching the Labyrinthula

368 colony to surfaces but also secreting digestive enzymes and nutritional uptake, similar to S.

369 aggregatum (Iwata and Honda 2018).

370

19

bioRxiv preprint (which wasnotcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade

371 Table 2 Comparative functional annotation of CAZymes in Labyrinthula sp. isolate SR_Ha_C as compared to three other doi:

372 Labyrinthulea. Each line of the table specifies one type of enzyme function, listing EC number, and estimated substrate. Note, the number of https://doi.org/10.1101/2020.09.14.297390

373 CAZymes functionally annotated for all these four species are as can be seen rather few. This is due to most of the secretome profile of the

374 Labyrinthulea are too distantly related to known and characterised CAZymes to provide a basis for functional annotation. Further, as appears in

375 the case of Labyrinthula SR_Ha_C, all enzyme functions are represented by only one type of enzyme family, while for the more diverse available undera

376 CAZyme profiles of the three other species, some enzyme functions are represented by two enzyme families. The Gene Number columns are

377 given as heat maps. All CAZymes found by the CUPP analysis (both with predicted function and the CAZymes with no predicted function) are CC-BY-NC-ND 4.0Internationallicense ; 378 listed in Table S4. this versionpostedSeptember15,2020.

Labyrinthula Aplanochytrium Aurantiochytrium Schizochytrium SR_Ha_C kerguelense limacinum aggregatum

Labyrinthulidae Aplanochytriidae Thraustochytriaceae Thraustochytriaceae

Enzyme Enzyme Enzyme Enzyme Nb. No. No. No. Estimated families families families families EC no. Function description of of of of overall substrate representing representing representing representing genes genes genes genes function function function function . 2 1 1 1 1 1 GH31 3 1 - 1 - The copyrightholderforthispreprint Starch 3.2.1.20 α-glucosidase GH13 GH31 GH31 GH31 Endo-1,4-β-D- Cellulose 2 2 GH5 2 2 GH5 - 0 - - 1 1 GH5 - 3.2.1.4 glucanase Cellulose 3.2.1.21 β-glucosidase 4 4 GH1 1 1 GH1 - 0 - - 1 1 GH1 - Xylan 3.2.1.37 1,4-β-xylosidase 0 - 1 1 GH3 - 1 1 GH3 - 2 2 GH3 - 1 0 - 1 - 0 - - 0 - - Fucose 3.2.1.51 α-L-fucosidase GH29

20

bioRxiv preprint (which wasnotcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade

Labyrinthula Aplanochytrium Aurantiochytrium Schizochytrium doi: SR_Ha_C kerguelense limacinum aggregatum https://doi.org/10.1101/2020.09.14.297390 Labyrinthulidae Aplanochytriidae Thraustochytriaceae Thraustochytriaceae

Enzyme Enzyme Enzyme Enzyme Nb. No. No. No. Estimated families families families families EC no. Function description of of of of overall substrate representing representing representing representing genes genes genes genes function function function function available undera 1 0 - 1 - 0 - - 0 - - Arabinan 3.2.1.88 β-L-arabinosidase GH27 1 β-D-galactosides 3.2.1.23 β-galactosidase 1 1 GH2 1 1 GH2 - 0 - - 3 1 GH1 GH35 1 1 1 α-D-galactosides 3.2.1.22 α-galactosidase 1 1 GH36 2 0 - - 1 - CC-BY-NC-ND 4.0Internationallicense GH27 GH36 GH36 ;

β-mannan 3.2.1.25 β-mannosidase 0 - 1 1 GH2 - 0 - - 0 - - this versionpostedSeptember15,2020. 1 1 1 α-1,3-glucan 3.2.1.84 1,3-α-glucosidase 1 1 GH31 1 - 1 - 1 - GH31 GH31 GH31 1 1 1 Chitin 3.2.1.14 chitinase 0 - 2 0 - - 1 - GH19 GH18 GH18 N-acetyl-β-D- β-N- 1 2 3.2.1.52 0 - 1 - 2 - 0 - - hexosaminides acetylhexosaminidase GH20 GH20 endo-1,3-β-D- 1 β-1,3-glucan 3.2.1.39 0 - 0 - - 0 - - 1 - glucosidase GH16 1 Trehalose 3.2.1.28 α,α-trehalase 1 1 GH37 1 - 0 - - 0 - - GH37

Galactosamino- 1 . 3.2.1.109 endogalactosaminidase 0 - 0 - - 0 - - 1 - glycan GH114 maltopentaose-forming α-1,6-mannan 3.2.1.x 0 - 1 - 0 - 0 - The copyrightholderforthispreprint α-amylase 1 1 α-mannan 3.2.1.24 α-mannosidase 1 1 GH38 1 - 0 - - 1 - GH38 GH38 4 2 2 1 3 α-1,2-mannan 3.2.1.113 1,2-α-mannosidase 5 5 GH47 4 - 4 4 GH47 GH92 GH47 GH92 GH47 glycoprotein endo- 2 α-1,2-mannan 3.2.1.130 0 - 2 - 0 - - 0 - - alpha-1,2-mannosidase GH99 Glycoprotein 3.2.1.96 endo-β-N- 0 - 1 1 - 0 - - 0 - -

21

bioRxiv preprint (which wasnotcertifiedbypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmade

Labyrinthula Aplanochytrium Aurantiochytrium Schizochytrium doi: SR_Ha_C kerguelense limacinum aggregatum https://doi.org/10.1101/2020.09.14.297390 Labyrinthulidae Aplanochytriidae Thraustochytriaceae Thraustochytriaceae

Enzyme Enzyme Enzyme Enzyme Nb. No. No. No. Estimated families families families families EC no. Function description of of of of overall substrate representing representing representing representing genes genes genes genes function function function function available undera acetylglucosaminidase GH85 N-acetyl-D- N-acetylglucosamine- glucosamine 6- 3.5.1.25 6-phosphate 0 - 0 - - 1 1 CE9 - 1 1 CE9 - phosphate deacetylase

Peptidophospho- CC-BY-NC-ND 4.0Internationallicense 3.2.1.146 β-galactofuranosidase 0 - 1 1 GH2 - 0 - - 0 - - galactomannan ; 1 1 1 this versionpostedSeptember15,2020. Cerebrosides 3.2.1.46 galactosylceramidase 0 - 1 - 1 - 1 - GH59 GH59 GH59 2 Cerebrosides 3.2.1.45 glucosylceramidase 3 2 GH5 2 2 GH5 - 5 3 GH5 3 3 GH5 - GH30 cytochrome-c Peroxide 1.11.1.5 1 1 AA2 1 1 AA2 - 1 1 AA2 - 2 2 AA2 - peroxidase p-benzoquinone Aromatics 1.6.5.6 1 1 AA6 0 - - 4 4 AA6 - 1 1 AA6 - reductase (NADPH) 379

380 . The copyrightholderforthispreprint

22 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

381

382 Labyrinthula SR_Ha_C shared several Function:Family constellations with the other three

383 Labyrinthulea species, that is, the four glycohydrolases, belonging to GH31 and GH38, are

384 representing the same four EC functions in all species (Table 2). However significant

385 discrepancies in Function:Family constellations are found as well: Enzyme function EC

386 3.2.1.113 predictive of α-mannan is represented by GH38 in Labyrinthula SR_Ha_C and A.

387 kerguelense; while represented by two different proteins, GH47 and GH92 in the two

388 Thraustochytriaceae species (A. limacinum and S. aggregatum). All in all, these observations

389 reflect relatedness among the Labyrinthulea species, as well as indicating that they are only

390 distantly related in digestive enzyme profiles, and likewise, substrate affinities.

391

392 Evolutionary relationships within Bigyra and Stramenopile

393 We recovered a complete circular Labyrinthula SR_Ha_C mitogenome (accession number:

394 MT267870) (Table 1, Figure S3). All genes were transcribed on the heavy strand, which has

395 a guanine-rich base composition of 28.18% A, 8.26% C, 39.16% G and 24.40% T. In

396 addition to the 13 protein-coding genes (PCG) typically found in animal mitogenomes, 10

397 other PCGs found in other heterotrophic protists were present in the Labyrinthula

398 mitogenome (Wideman et al. 2020), namely atp9, nad7, nad9, nad11, rpl16, rps11-13, tatA

399 and tatC. At the time of this study, molecular sequences for Labyrinthula species were scant

400 on the NCBI database, reporting only 572 nucleotide fragments, majority of which are

401 ribosomal RNA and ITS sequences. Our complete annotated mitogenome showed an

402 interesting strand asymmetry characteristic, containing a high guanine (G) to cytosine (C)

403 ratio of 4.74:1 on its heavy strand versus an average of 1.22:1 in other Stramenopiles.

404 Guanine-enriched sequences in other eukaryotes have been shown to adopt G-quadruplex

405 (G4) stable structures and may play potential roles in the regulation of DNA replication,

23 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

406 transcription and translation (Falabella et al. 2019). Using G4Hunter (Brázda et al. 2019),

407 Labyrinthula SR_Ha_C’s predicted G4 frequency (15/1kbp) was 50- and 300-fold higher

408 than predicted in Schizochytrium sp. TIO1101 and roenbergensis, respectively. It is

409 difficult to infer that the cause of this high frequency due to the newness of this metric for

410 non-human mitogenomes. However, as more mitogenomes are made available in the future, it

411 will be of interest to see if this high frequency is observed in the mitogenomes of other more

412 closely-related species and whether G4 richness is associated with protist evolutionary

413 history or as potential regulators of virulence in pathogens (Harris and Merrick 2015).

414

415

416 Figure 3: Evolutionary relationships of Labyrinthula sp. isolate SR_Ha_C and other

417 species within Bigyra and Stremenopiles more generally. Phylogenetic tree based on 13

418 mitochondrial protein-coding genes (atp, cox, nad, cytb). This tree contains 127

419 Stramenopiles species/isolates in three phyla (Bigyra:13, Ochrophyta:91, Oomycota:23) and

420 three Rhodophyta species as outgroup, rooted with Reclinomonas americana. Nodal support

24 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

421 is indicated in the form of SH-aLRT / UFBoot values, with reliable clades having ≥80% and

422 ≥95% values, respectively. Closed circles indicate maximum support (100/100) while open

423 circles report less than maximum values but above the reliable threshold.

424

425 In the reconstructed phylogeny of Stramenopiles, Labyrinthula and other protist species in

426 the phylum, Bigyra formed a paraphyletic group at the base of Stramenopile, in addition to

427 monophyletic clades for the other two phyla (Oomycota and Ochrophycota) (Figure 3;

428 superalignment and phylogenetic tree is available in Data S2). Within Bigyra itself,

429 Labyrinthula SR_Ha_C shared a strongly-supported sister relationship with two

430 Thraustochytrida species (Thraustochytrium aureum and Schizochytrium sp. TIO1101), all

431 three of which were taxonomically assigned to the class Labyrinthulea. This clade clustered

432 with species from the other two major phyla, consistent with current taxonomic

433 classifications and reports in other studies based on nuclear data (Tsui et al. 2009). However,

434 this inference should be treated with caution since inadequate and uneven representation of

435 branches in phylogenetic trees can introduce errors caused by long-branch attraction and false

436 monophyly, which can lead to inaccurate phylogenetic relationships. It is therefore crucial to

437 recognise the importance of strategic sampling to increase mitogenomic resources for under-

438 studied taxonomic groups, thereby filling in the gaps in taxonomic representation to enable

439 more comprehensive investigations into molecular diversity and evolutionary histories of

440 protists (Wideman et al. 2020).

441

442 Ecological and biotechnology implications of the Labyrinthula SR_Ha_C genome

443 This study reports the first assembled genome of a Labyrinthula species, and adds valuable

444 genomic and transcriptomic resources to the limited molecular sequences available for the

445 group Labyrinthulea (Song et al. 2018). The recovery of the complete mitochondrial genome

25 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

446 makes this the first of its kind for the order Labyrinthulida, strongly supporting a

447 monophyletic Labyrinthulida+Thraustochytrida clade, while providing insights into the

448 evolutionary history of Stramenopile species. Our cross-phyla comparison of >100

449 Stramenopile genome assemblies unveiled possible caveats in current existing quality

450 assessment tools that need to be carefully considered when assessing work for species from

451 divergent and under-sampled taxonomic groups. Genome annotation of Labyrinthula

452 SR_Ha_C opens comparisons to the recently published aquatic early lineage zoosporic fungi

453 (Lange et al. 2019a). Further, the presented genome could act as a reference to propel new

454 research directions for the Labyrinthula genus, such as population genetics (Lohan et al.

455 2020).

456

457 While our mining of the draft genome was not exhaustive, our initial investigations show

458 promise in highlighting novel pathways to understanding Labyrinthula ecology and the

459 SWD-pathosystem. The CAZymes and predicted CAZy function profiles provided insight

460 into cell-wall breakdown. These pathways have been largely assumed through the

461 classification of Labyrinthula’s saprobic lifestyle, but such evidence relating to this

462 ecological function has most often been limited to their thraustochytrid cousins (Raghukumar

463 and Damare 2011). Cell-wall and fibre breakdown not only has significant implications to

464 marine carbon sequestration, but in conjunction with adhesion to the host, highlights

465 mechanisms that may be key to the invasion of living seagrass tissue (Gibson et al. 2011,

466 Iwata and Honda 2018). These results will hopefully provide a springboard to future research

467 on Labyrinthula’s virulence or SWD assays using both host and pathogen gene responses.

468

469 Our study also highlights the industrial potential for Labyrinthula. Enzyme secretome

470 profiles and specialised functional proteins of fungal and non-fungal zoosporic unicellular

26 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

471 organisms, such as Labyrinthula, share many similarities with regard to enzyme functions,

472 but also reflect different types of plant biomass affinities and degrading capabilities (Lange et

473 al. 2019b). Secretome organisation, e.g. one enzyme function may be represented by several

474 types of protein families or one type of protein family may include several enzyme functions,

475 could reflect a strategy that is evolutionary optimised for robustness to varying conditions

476 (salinity, temperature, pH, substrate availability). Notably, the uniqueness of the genome of

477 Labyrinthula SR_Ha_C, combined with the very low number of sequenced genomes in this

478 group makes the genome/transcriptome and function predictions described here a hotspot for

479 future enzyme discovery activities. For example, the lignocellulose repertoire of plant

480 pathogens is inspiring research on their use for biomass conversion to biofuels or higher

481 value products (Gibson et al. 2011), while thraustochytrids are rich source bioproducts

482 (Marchan et al. 2018). We expect that much more could be found from this highly diverse,

483 un-exploited and under-studied group of microorganisms.

484

485

27 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

486 Acknowledgments

487 This research utilised computational resources and services provided by the National

488 Computational Infrastructure (NCI), which is supported by the Australian Government. We

489 thank Samuel Lysy for his assistance in the lab. We would also like to thank Robert Ruge of

490 Deakin University for assistance and use of the SIT HPC Cluster. Funding for this study was

491 provided by the Mary Collins Trust and the Deakin Genomics Centre, Deakin University

492 (Australia).

493

494 Availability of data and material

495 The whole genome assembly was deposited into the NCBI/GenBank Whole Genome

496 Sequence (WGS) database with the accession number JAALGZ000000000 and raw genomic

497 reads into NCBI/Short Read Archive (SRA) with accession numbers SRR12496087 and

498 SRR12496088, all of which are under the BioProject PRJNA607370. The mitochondrial

499 genome was deposited into the NCBI/GenBank database with the accession number

500 MT267870.

501

502 Authors’ contributions

503 STT, SL, MT, LC, LL and FHG designed the research and direction of analyses. STT

504 collected and isolated the sample. SL performed whole genome and transcriptome

505 sequencing. MT assembled and annotated the genome and transcriptome, and carried out

506 phylogenetic analysis. LC and MT performed comparative analyses of assemblies and

507 proteomes of other protist genomes. LL, BP and MT analysed the peptide-based functional

508 annotation of CAZymes. MT and STT led the writing of the manuscript. All authors edited

509 and developed the manuscript.

510

28 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

511 References

512 Barrett, K., and L. Lange. 2019. Peptide-based functional annotation of carbohydrate-active 513 enzymes by conserved unique peptide patterns (CUPP). Biotechnology for Biofuels 514 12:1-21. 515 Boore, J. L. 1999. Animal mitochondrial genomes. Nucleic Acids Research 27:1767-1780. 516 Brakel, J., T. B. Reusch, and A.-C. Bockelmann. 2017. Moderate virulence caused by the 517 protist Labyrinthula zosterae in ecosystem foundation species Zostera marina under 518 nutrient limitation. Marine Ecology Progress Series 571:97-108. 519 Brakel, J., F. J. Werner, V. Tams, T. B. Reusch, and A.-C. Bockelmann. 2014. Current 520 European Labyrinthula zosterae are not virulent and modulate seagrass (Zostera 521 marina) defense gene expression. PloS one 9:e92448. 522 Brázda, V., J. Kolomazník, J. Lýsek, M. Bartas, M. Fojta, J. Šťastný, and J.-L. Mergny. 2019. 523 G4Hunter web application: a web server for G-quadruplex prediction. Bioinformatics 524 35:3493-3495. 525 Buchfink, B., C. Xie, and D. H. Huson. 2015. Fast and sensitive protein alignment using 526 DIAMOND. Nature Methods 12:59-60. 527 Cantarel, B. L., P. M. Coutinho, C. Rancurel, T. Bernard, V. Lombard, and B. Henrissat. 528 2008. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for 529 Glycogenomics. Nucleic Acids Research 37:D233-D238. 530 Cipollone, R., P. Ascenzi, and P. Visca. 2007. Common themes and variations in the 531 rhodanese superfamily. IUBMB Life 59:51-59. 532 Collier, J. L., and J. S. Rest. 2019. Swimming, gliding, and rolling toward the mainstream: 533 cell biology of marine protists. Molecular Biology of the Cell 30:1245-1248. 534 Consortium, T. U. 2018. UniProt: a worldwide hub of protein knowledge. Nucleic Acids 535 Research 47:D506-D515. 536 Davies, P., C. Morvan, O. Sire, and C. Baley. 2007. Structure and properties of fibres from 537 sea-grass (Zostera marina). Journal of Materials Science 42:4850-4857. 538 Duffin, P., D. L. Martin, K. M. Pagenkopp Lohan, and C. Ross. 2020. Integrating host 539 immune status, Labyrinthula spp. load and environmental stress in a seagrass 540 pathosystem: Assessing immune markers and scope of a new qPCR primer set. PloS 541 one 15:e0230108. 542 ExpositoAlonso, M., H. G. Drost, H. A. Burbano, and D. Weigel. 2020. The Earth 543 BioGenome project: opportunities and challenges for plant genomics and 544 conservation. The Plant Journal 102:222-229. 545 Falabella, M., R. J. Fernandez, F. B. Johnson, and B. A. Kaufman. 2019. Potential roles for 546 G-quadruplexes in mitochondria. Current medicinal chemistry 26:2918-2932. 547 Fernandes, N., R. J. Case, S. R. Longford, M. R. Seyedsayamdost, P. D. Steinberg, S. 548 Kjelleberg, and T. Thomas. 2011. Genomes and virulence factors of novel bacterial 549 pathogens causing bleaching disease in the marine red alga Delisea pulchra. PloS one 550 6:e27387. 551 Gibson, D. M., B. C. King, M. L. Hayes, and G. C. Bergstrom. 2011. Plant pathogens as a 552 source of diverse enzymes for lignocellulose digestion. Current Opinion in 553 Microbiology 14:264-270. 554 Haas, B. J., A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood, J. Bowden, M. B. 555 Couger, D. Eccles, B. Li, M. Lieber, M. D. MacManes, M. Ott, J. Orvis, N. Pochet, F. 556 Strozzi, N. Weeks, R. Westerman, T. William, C. N. Dewey, R. Henschel, R. D. 557 LeDuc, N. Friedman, and A. Regev. 2013. De novo transcript sequence reconstruction 558 from RNA-seq using the Trinity platform for reference generation and analysis. 559 Nature Protocols 8:1494-1512.

29 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

560 Harding, T., A. J. Roger, and A. G. Simpson. 2017. Adaptations to high salt in a halophilic 561 protist: differential expression and gene acquisitions through duplications and gene 562 transfers. Frontiers in Microbiology 8:944. 563 Harris, L. M., and C. J. Merrick. 2015. G-quadruplexes in pathogens: a common route to 564 virulence control? PLoS Pathog 11:e1004562. 565 Harvell, C., K. Kim, J. Burkholder, R. Colwell, P. R. Epstein, D. Grimes, E. Hofmann, E. 566 Lipp, A. Osterhaus, and R. M. Overstreet. 1999. Emerging marine diseases--climate 567 links and anthropogenic factors. Science 285:1505-1510. 568 Holt, C., and M. Yandell. 2011. MAKER2: an annotation pipeline and genome-database 569 management tool for second-generation genome projects. BMC Bioinformatics 570 12:491. 571 Howe, K. L., B. Contreras-Moreira, N. De Silva, G. Maslen, W. Akanni, J. Allen, J. Alvarez- 572 Jarreta, M. Barba, D. M. Bolser, L. Cambell, M. Carbajo, M. Chakiachvili, M. 573 Christensen, C. Cummins, A. Cuzick, P. Davis, S. Fexova, A. Gall, N. George, L. Gil, 574 P. Gupta, K. E. Hammond-Kosack, E. Haskell, S. E. Hunt, P. Jaiswal, S. H. Janacek, 575 P. J. Kersey, N. Langridge, U. Maheswari, T. Maurel, M. D. McDowall, B. Moore, 576 M. Muffato, G. Naamati, S. Naithani, A. Olson, I. Papatheodorou, M. Patricio, M. 577 Paulini, H. Pedro, E. Perry, J. Preece, M. Rosello, M. Russell, V. Sitnik, D. M. 578 Staines, J. Stein, M. K. Tello-Ruiz, S. J. Trevanion, M. Urban, S. Wei, D. Ware, G. 579 Williams, A. D. Yates, and P. Flicek. 2019. Ensembl Genomes 2020—enabling non- 580 vertebrate genomic research. Nucleic Acids Research 48:D689-D695. 581 Iwata, I., and D. Honda. 2018. Nutritional intake by ectoplasmic nets of Schizochytrium 582 aggregatum (Labyrinthulomycetes, Stramenopiles). Protist 169:727-743. 583 Kazama, F. 1972. Ultrastructure and phototaxis of the zoospores of Phlyctochytrium sp., an 584 estuarine chytrid. Microbiology 71:555-566. 585 Keeley, A., and D. Soldati. 2004. The glideosome: a molecular machine powering motility 586 and host-cell invasion by . Trends in Cell Biology 14:528-532. 587 Kyrpides, N. C., P. Hugenholtz, J. A. Eisen, T. Woyke, M. Göker, C. T. Parker, R. Amann, 588 B. J. Beck, P. S. Chain, and J. Chun. 2014. Genomic encyclopedia of bacteria and 589 archaea: sequencing a myriad of type strains. PLoS biology 12:e1001920. 590 Lang, B. F., G. Burger, C. J. O'Kelly, R. Cedergren, G. B. Golding, C. Lemieux, D. Sankoff, 591 M. Turmel, and M. W. Gray. 1997. An ancestral mitochondrial DNA resembling a 592 eubacterial genome in miniature. Nature 387:493-497. 593 Lange, L., K. Barrett, B. Pilgaard, F. Gleason, and A. Tsang. 2019a. Enzymes of early- 594 diverging, zoosporic fungi. Applied Microbiology and Biotechnology 103:6885-6902. 595 Lange, L., B. Pilgaard, F.-A. Herbst, P. K. Busk, F. Gleason, and A. G. Pedersen. 2019b. 596 Origin of fungal biomass degrading enzymes: Evolution, diversity and function of 597 enzymes of early lineage fungi. Fungal Biology Reviews 33:82-97. 598 Leander, C. A., and D. Porter. 2001. The Labyrinthulomycota is comprised of three distinct 599 lineages. Mycologia:459-464. 600 Lewin, H. A., G. E. Robinson, W. J. Kress, W. J. Baker, J. Coddington, K. A. Crandall, R. 601 Durbin, S. V. Edwards, F. Forest, and M. T. P. Gilbert. 2018. Earth BioGenome 602 Project: Sequencing life for the future of life. Proceedings of the National Academy 603 of Sciences 115:4325-4333. 604 Lohan, K. M. P., R. DiMaria, D. L. Martin, C. Ross, and G. M. Ruiz. 2020. Diversity and 605 microhabitat associations of Labyrinthula spp. in the Indian River Lagoon System. 606 Diseases of Aquatic Organisms 137:145-157. 607 Marchan, L. F., K. J. L. Chang, P. D. Nichols, W. J. Mitchell, J. L. Polglase, and T. 608 Gutierrez. 2018. , ecology and biotechnological applications of 609 thraustochytrids: A review. Biotechnology Advances 36:26-46.

30 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

610 Martin, D. L., Y. Chiari, E. Boone, T. D. Sherman, C. Ross, S. Wyllie-Echeverria, J. K. 611 Gaydos, and A. A. Boettcher. 2016. Functional, phylogenetic and host-geographic 612 signatures of Labyrinthula spp. provide for putative species delimitation and a global- 613 scale view of seagrass wasting disease. Estuaries and Coasts 39:1403-1421. 614 Mikheenko, A., A. Prjibelski, V. Saveliev, D. Antipov, and A. Gurevich. 2018. Versatile 615 genome assembly evaluation with QUAST-LG. Bioinformatics 34:i142-i150. 616 Oborník, M., J. Janouškovec, T. Chrudimský, and J. Lukeš. 2009. Evolution of the apicoplast 617 and its hosts: from heterotrophy to autotrophy and back again. International Journal 618 for Parasitology 39:1-12. 619 Raghukumar, S., and V. S. Damare. 2011. Increasing evidence for the important role of 620 Labyrinthulomycetes in marine ecosystems. Botanica Marina 54:3-11. 621 Roach, M. J., S. A. Schmidt, and A. R. Borneman. 2018. Purge Haplotigs: allelic contig 622 reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19:460. 623 Seppey, M., M. Manni, and E. M. Zdobnov. 2019. BUSCO: Assessing Genome Assembly 624 and Annotation Completeness. Pages 227-245 in M. Kollmar, editor. Gene Prediction: 625 Methods and Protocols. Springer New York, New York, NY. 626 Sibbald, S. J., and J. M. Archibald. 2017. More protist genomes needed. Nature Ecology & 627 Evolution 1:1-3. 628 Song, Z., J. E. Stajich, Y. Xie, X. Liu, Y. He, J. Chen, G. R. Hicks, and G. Wang. 2018. 629 Comparative analysis reveals unexpected genome features of newly isolated 630 Thraustochytrids strains: on ecological function and PUFAs biosynthesis. BMC 631 genomics 19:541. 632 Sullivan, B. K., S. M. Trevathan-Tackett, S. Neuhauser, and L. L. Govers. 2018. Host- 633 pathogen dynamics of seagrass diseases under future global change. Marine Pollution 634 Bulletin 134:75-88. 635 Tian, Y., S. Gao, S. Yang, and G. Nagel. 2018. A novel rhodopsin phosphodiesterase from 636 Salpingoeca rosetta shows light-enhanced substrate affinity. Biochemical Journal 637 475:1121-1128. 638 Trevathan-Tackett, S. M., B. K. Sullivan, K. Robinson, O. Lilje, P. I. Macreadie, and F. H. 639 Gleason. 2018. Pathogenic Labyrinthula associated with Australian seagrasses: 640 Considerations for seagrass wasting disease in the southern hemisphere. 641 Microbiological Research 206:74-81. 642 Tsui, C. K., W. Marshall, R. Yokoyama, D. Honda, J. C. Lippmeier, K. D. Craven, P. D. 643 Peterson, and M. L. Berbee. 2009. Labyrinthulomycetes phylogeny and its 644 implications for the evolutionary loss of chloroplasts and gain of ectoplasmic gliding. 645 Molecular Phylogenetics and Evolution 50:129-140. 646 Vurture, G. W., F. J. Sedlazeck, M. Nattestad, C. J. Underwood, H. Fang, J. Gurtowski, and 647 M. C. Schatz. 2017. GenomeScope: fast reference-free genome profiling from short 648 reads. Bioinformatics 33:2202-2204. 649 Warnecke, D., and E. Heinz. 2003. Recently discovered functions of glucosylceramides in 650 plants and fungi. Cellular and Molecular Life Sciences CMLS 60:919-941. 651 Wideman, J. G., A. Monier, R. Rodríguez-Martínez, G. Leonard, E. Cook, C. Poirier, F. 652 Maguire, D. S. Milner, N. A. Irwin, and K. Moore. 2020. Unexpected mitochondrial 653 genome diversity revealed by targeted single-cell genomics of heterotrophic 654 flagellated protists. Nature Microbiology 5:154-165. 655 Zhang, H., T. Yohe, L. Huang, S. Entwistle, P. Wu, Z. Yang, P. K. Busk, Y. Xu, and Y. Yin. 656 2018. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. 657 Nucleic acids research 46:W95-W101. 658 Zimin, A. V., D. Puiu, M.-C. Luo, T. Zhu, S. Koren, G. Marçais, J. A. Yorke, J. Dvořák, and 659 S. L. Salzberg. 2017. Hybrid assembly of the large and highly repetitive genome of

31 bioRxiv preprint doi: https://doi.org/10.1101/2020.09.14.297390; this version posted September 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

660 Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads 661 algorithm. Genome Research 27:787-792. 662

32