bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Detection of Horizontal Gene Transfer in the Genome of the Choanoflagellate Salpingoeca

2 rosetta

3

4 Danielle M. Matriano1, Rosanna A. Alegado2, and Cecilia Conaco1

5

6 1 Marine Science Institute, University of the Philippines, Diliman

7 2 Department of Oceanography, Hawaiʻi Sea Grant, Daniel K. Inouye Center for Microbial

8 Oceanography: Research and Education, University of Hawai`i at Manoa

9

10 Corresponding author:

11 Cecilia Conaco, [email protected]

12

13 Author email addresses:

14 Danielle M. Matriano, [email protected]

15 Rosanna A. Alegado, [email protected]

16 Cecilia Conaco, [email protected]

17

18

19

20

21

22

1

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

23 Abstract

24

25 Horizontal gene transfer (HGT), the movement of heritable materials between distantly related

26 organisms, is crucial in eukaryotic evolution. However, the scale of HGT in choanoflagellates, the

27 closest unicellular relatives of metazoans, and its possible roles in the evolution of animal

28 multicellularity remains unexplored. We identified 703 potential HGTs in the S. rosetta genome

29 using sequence-based tests. The majority of which were orthologous to bacterial lineages, yet

30 displayed genomic features consistent with the rest of the S. rosetta genome − evidence of ancient

31 acquisition events. Putative functions include enzymes involved in amino acid and carbohydrate

32 metabolism, cell signaling, the synthesis of extracellular matrix components, and the detection of

33 bacterial compounds. Functions of candidate HGTs may have contributed to the ability of

34 choanoflagellates to assimilate novel metabolites, thereby supporting adaptation, survival in

35 diverse ecological niches, and response to external cues that are possibly critical in the evolution

36 of multicellularity in choanoflagellates.

37

38 Keywords: Lateral Gene Transfer, Evolution, Multicellularity

39

40 Background

41

42 In and Archaea, several mechanisms, such as transformation, conjugation, and

43 transduction (1–8), facilitate the introduction of novel genes between neighboring strains and

44 , known as horizontal (or lateral) gene transfer (HGT) (9,10). Horizontal transfers into

45 eukaryote genomes, however, are thought to be low frequency events, as genetic material must

2

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

46 enter the recipient cell’s nucleus to be incorporated into the genome and vertically transmitted

47 (11). While HGT has been difficult to demonstrate in eukaryotic lineages (12), conflicting

48 branching patterns between individual gene histories and species phylogenies (4) has garnered a

49 number of explanations: a gene transfer ratchet to fix prey-derived genes from prey species ("you

50 are what you eat") (13), movement of DNA from organelles to the nucleus through endosymbiosis

51 (14,15), the presence of “weak-links” or unprotected windows during unicellular or early

52 developmental stages that may enable integration of foreign DNA into eukaryotic genomes (11),

53 and horizontal transposon transfers (16–18). Indeed, HGT may have accelerated genome evolution

54 and innovation in microbial eukaryotes by contributing to species divergence (19–21), metabolic

55 diversity and versatility (22,23), and the establishment of interkingdom genetic exchange

56 (10,19,24,25).

57 Recently, Choanoflagellatea, a clade of aquatic, free-living, heterotrophic nanoflagellates

58 (26,27), has emerged as a model for understanding the unicellular origins of animals.

59 Choanoflagellates phagocytose detritus and a diverse array of microorganisms and are globally

60 distributed in marine, brackish, and freshwater environments (28). All choanoflagellates have a

61 life history stage characterized by an ovoid unicell capped by a single posterior flagellum that

62 functions to propel the cell in the water column and to generate currents that sweep food items

63 toward an actin microvilli collar (26,27,29). This cellular architecture bears striking similarity to

64 sponge choanocytes (or “collar cells”), leading to speculation on the evolutionary relationship

65 between choanoflagellates and animal (27). Abundant molecular phylogenetic evidence support

66 choanoflagellates as the closest extant unicellular relatives of animals (30–35).

67 While comparative genomics have revealed a host of animal genes found in common with

68 the Urchoanozoan (28), few studies have investigated the evolutionary history of genes involved

3

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

69 in the origins of choanoflagellates and animals, including the possible acquisition of genes from

70 other lineages. Choanoflagellates were historically grouped into two distinct orders based on

71 (28,32,33,36): Acanthoecida (families Stephanoecidae and Acanthoecidae), which

72 produce a siliceous cage-like basket exoskeleton or lorica; and non-loricate Craspedida (families

73 Codonosigidae and Salpingoecidae), which produce an organic extracellular sheath or theca

74 (32,33). Previous studies suggest that silicate transport genes and transposable elements in loricate

75 choanoflagellates may have been acquired from microalgal prey (16,37). At least 1,000 genes,

76 including entire enzymatic pathways present in extant choanoflagellates, may have resulted from

77 HGT events (38–46).

78 The colonial craspedid Salpingoeca rosetta has been established as a model organism for

79 origins of animal multicellularity (27,47). In this study, we used sequence-based methods to

80 identify putative HGT events in the S. rosetta genome. S. rosetta has a rich life history entwined

81 with affiliated environmental bacteria (26,27,48–50). Transient differentiation from the solitary

82 morphotype into chain or rosette colonies as well as sexual mating in choanoflagellates are

83 triggered by distinct prey bacteria (48). We hypothesized that sources of novel horizontal transfers

84 in S. rosetta are primarily from food sources - bacterial or microalgal donors consistent with the

85 “you are what you eat” gene ratchet theory (13). We compared the unique gene signatures (i.e. GC

86 content, codon usage bias, intron number) of potential HGTs against the rest of the S. rosetta

87 genome to determine whether the genes were ancient or recent transfers. We also assessed the

88 extent of taxonomic representation of the candidate HGTs in other choanoflagellates.

89 Characterizing the magnitude and functions of HGTs in choanoflagellate genomes may provide

90 insight into their role in the evolution and adaptation of the lineage to new ecological niches and

4

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

91 lifestyles. Potential HGT events in choanoflagellates can provide unique insights into the

92 evolutionary history of genes involved in the origins of choanoflagellates and animals.

93

94 Results

95

96 Identification and genic architecture of candidate HGTs in S. rosetta

97 Of the 11,731 reference S. rosetta protein coding genes, 4,527 (38.59%) and 4,391

98 (37.43%) were related to metazoan and S. rosetta sequences, respectively (Fig. 1A). Our taxon

99 filter flagged 2,804 (23.90%) genes with highest orthology to sequences from prokaryotes and

100 unicellular non-metazoan taxa. Of these, 1,837 (15.66%) genes had best hits to sequences from

101 prokaryotes, fungi, and unicellular eukaryotic taxa with marine representatives, such as

102 chlorophytes, rhodophytes, haptophytes, and the stramenopile, alveolate, and Rhizaria supergroup

103 (SAR). Best hits from prokaryotes (i.e. bacteria and archaea) and unicellular eukaryotic taxa (i.e.

104 unicellular algae and fungi) were scored as potential HGT candidates and subjected to further

105 analysis.

106

107 1.2. Identification of candidate HGTs by Alien Index analysis

108 Using metrics described by Gladyshev et al. (2008), Alien Index (AI) analysis of the 1,837

109 potential HGT candidate genes identified 703 candidate HGTs from potential bacteria, archaea,

110 and common unicellular marine fungi (i.e. Basidiomycota and Ascomycota) and eukaryotic (i.e.

111 Chlorophyta, Haptophyta, Pelagophyta, and Bacillariophyta) donors (Fig. 1B, Table S1), which

112 included only 38% of the genes flagged by sequence alignment to the NCBI nr database. A total

5

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

113 of 111 (16%) of the potential HGTs in S. rosetta have orthologs to HGT candidates that were

114 previously identified in the choanoflagellate, Monosiga brevicollis (40).

115

116 1.3. Genic architecture and gene expression of candidate HGTs

117 In S. rosetta, the difference in median protein coding sequence (CDS) length of candidate

118 HGTs (1,509 bp) from the overall median CDS length (1,401 bp) was small but significant

119 (Kruskal-Wallis test, p ≤ 0.001; Fig. 2A). Candidate HGTs also had significantly higher GC

120 content at the third codon position (GC3) relative to other S. rosetta genes (Fig. 2B; Kruskal-Wallis

121 test, p ≤ 0.001), indicating high degeneracy of these frequently recombining genes. However,

122 candidate HGTs were observed to have a similar median number of introns, specifically five, as

123 compared to other S. rosetta genes (Fig. 2C). Only 1,103 (8.79 %) of S. rosetta genes lacked

124 introns, of which 73 (0.62%) passed the AI ≥ 45 threshold. Codon bias index (CBI), which is a

125 measure of codon usage frequency, was also similar between the putative HGTs and other genes

126 (Fig. 2D). In addition, 662 of the candidate HGTs were expressed in the four life stages of S.

127 rosetta (i.e. thecate cells, swimming cells, chain colonies, and rosette colonies), 38 genes were

128 expressed in at least one stage, and only 3 were not detected in any of the stages (Fig. 2E, Table

129 S2), based on data from the study of Fairclough et al. (2013).

130

131 Orthologs of candidate HGTs in other taxa

132 To estimate when horizontally transferred genes in the S. rosetta genome were acquired,

133 we assessed the number of orthologs of candidate HGTs in 20 other choanoflagellates and

134 representative eukaryotes, opisthokonts, filasterean, and metazoans based on OrthoMCL groups

135 or ortholog families identified by the study of Richter et al. (2018) (Table S3). Majority of

6

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

136 candidate HGTs (517 genes) had orthologs in other eukaryotes (Excavata, Diaphoretickes, and

137 Amoebozoa), 9 in fungi, and 2 in filasterea, likely reflecting genes in donor taxa or gene transfers

138 into older lineages. Thirty six genes had orthologs in animals while 139 were only found in

139 choanoflagellates (Fig. 3A). Most candidate HGTs (443 genes) had orthologs in all

140 choanoflagellate families and 87 were found only in Craspedida (clade 1), which includes S.

141 rosetta (Fig. 3B).

142

143 Potential sources of S. rosetta HGTs

144 The majority of the candidate HGTs exhibited highest similarity to bacteria (57%),

145 Chlorophyta (15%), Haptophyta (10%), Stramenopiles (10%), and fungal (Ascomycota and

146 Basidiomycota) (8%) sequences (Fig. 4A). Phylogenetic analysis of selected putative HGTs

147 revealed conflicting phylogenetic signals relative to the consensus eukaryote reference phylogeny

148 of Richter et al. (2018) (Fig. 4B-D, Additional File 1), confirming their possible origin from donor

149 taxa. The most common potential bacterial donors were placed in one of the following phyla:

150 Proteobacteria, Terrabacteria, Fibrobacteres, Chlorobi, and (FCB) group, and the

151 Planctomycetes, Verrucomicrobia, and Chlamydiae (PVC) group (Fig. 4A). At least 28 sequences

152 may have been derived from marine microorganisms that S. rosetta potentially interacts with,

153 including known food sources, such as Algoriphagus machipongonensis (Bacteroidetes) (9 genes)

154 and Vibrio sp. (Proteobacteria) (13 genes) (Table S4). Other notable potential donors included the

155 Bacteroidetes Cytophaga sp. and Flectobacillus sp, both of which have been shown to induce

156 colony formation in S. rosetta (48).

157

158

7

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

159 Functions of candidate HGTs in S. rosetta

160 Of the 703 candidate HGTs in S. rosetta, 595 (85%) had identifiable PFAM domains and

161 494 (70%) were assigned gene ontology (GO) annotations. At least 539 (76.7%) candidate HGTs

162 contained more than two annotated PFAM domains. The most notable protein domains represented

163 in the set of putative HGTs included enzymes (i.e. protein kinase, short chain dehydrogenase,

164 aminotransferase, ubiquitin-activating enzyme active site, bacterial pre-peptidase C-terminal,

165 sulfatases, hydrolases, oxidoreductases, guanyl cyclase, serine carboxypeptidase, and

166 methyltransferase), transmembrane transporters (i.e. mitochondrial carrier protein, ATP-binding

167 cassette (ABC) transporter, major facilitator superfamily, sugar transporter), ECM-associated

168 domains (i.e. dermatopontin/calcium-binding EGF domain), and signaling domains (i.e. protein

169 kinase, immunoglobin-like, plexins, transcription factor (IPT/TIG)) (Fig. 5A, Table S5).

170 Gene ontology analysis further supported the finding that majority of HGT candidates were

171 enzymes with a variety of catalytic activities and were associated with cellular membranes where

172 most biosynthetic and energy transduction processes of the cell occur (51) (Fig. 5B, Table S6).

173 More specifically, potential HGTs in S. rosetta were associated with functions related to

174 carbohydrate, protein, and lipid metabolic processes. The most common enriched GO terms were

175 related to amino acid metabolism, including peptidyl-aspartic acid hydroxylation, glutamine

176 metabolic process, threonine biosynthetic process, and lysine biosynthetic process via

177 diaminopimelate (DAP). Other enriched functions included oxidation-reduction processes,

178 carbohydrate metabolism, tricarboxylic acid cycle, and methylation. Homophilic cell adhesion,

179 receptor-mediated endocytosis, nucleotide-excision repair, and regulation of phosphatidylinositol

180 3-kinase (PI3K) signal transduction pathway were also enriched.

181

8

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

182 Selected candidate HGTs

183 To further examine the potential functions of candidate HGTs, the genes were manually

184 curated and assigned to a general cellular function based on GO affiliations and PFAM domains.

185 The most common functions within the set of putative HGTs were general metabolism,

186 metabolism of carbohydrates, proteins, and lipids, oxidoreduction, extracellular matrix

187 components, signaling, and transport (Fig. 5C, Table S7).

188 Candidate HGTs included a diverse array of transporter families, including ABC

189 transporters, sugar transporters, major facilitator superfamily (MFS), and mitochondrial carrier

190 proteins, among others. Putative HGTs that function in the catabolism or modification of proteins

191 included various families of peptidases. Carbohydrate metabolism related HGTs included 22

192 glycosyl hydrolase domains, 6 glycosyltransferases, and 3 mannosyltransferases.

193 Candidate HGTs involved in lysine biosynthesis through the diaminopimelic acid (DAP)

194 pathway included aspartate-semialdehyde dehydrogenase (asd), diaminopimelate decarboxylase

195 and aspartate kinase (lysAC), and diaminopimelate epimerase (dapF). Other enzymes involved in

196 the biosynthesis of histidine, threonine, methionine, cysteine, tryptophan, and arginine were also

197 detected as potential HGTs (Additional file 2). Candidate HGTs related to the metabolism of amino

198 acid is common in other choanoflagellate representatives, sponges, and cnidarians.

199 Other interesting candidate HGTs encoded stress response-related enzymes, including

200 glutathione S-transferases, peroxidases, heat shock proteins, and thioredoxins. In addition, we

201 identified HGT candidates associated with DNA repair including helicase, ligases, and

202 ribonucleases. Beta-lactamase domain-containing genes that were potentially acquired from

203 bacteria were also retained in the S. rosetta genome.

9

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

204 Several ECM-associated genes were identified as potential HGTs, including

205 dermatopontin/calcium-binding EGF (DPT/Ca2+ EGF) domain-containing genes. These genes

206 have orthologs in craspedids but all have no orthologs in loricates. Genes containing DPT/Ca2+

207 EGF domains had no orthologs in representative filastereans but were regained in higher animals,

208 including cnidarians and bilaterians.

209 Two chondroitin sulfate AC lyase genes were detected as potential HGTs. One gene

210 (EGD79853) has orthologs in Craspedida, Acanthoecidae, and Stephanoecidae, yet is absent in

211 Hartaetosiga balthica and Hartaetosiga gracilis. This gene is conserved in filastereans and

212 sponges. The other chondroitin sulfate AC lyase (EGD79387) gene, however, has no orthologs in

213 other representative filastereans and metazoans.

214

215 Discussion

216

217 Horizontal gene transfer in S. rosetta

218 Our analysis provides evidence of a rich repertoire of S. rosetta genes that may have been

219 acquired through horizontal gene transfer. Recent gene acquisitions are usually distinguished by

220 divergent genetic characteristics, such as GC content, codon usage bias, and genetic architecture

221 (e.g. intron content, coding sequence length) (52,53). Over time, transferred genes undergo

222 sequence changes to adapt to host genome characteristics, enabling improved transcription and

223 translation (54). A majority of S. rosetta HGT candidates had gene features similar to the rest of

224 the genome and were expressed, based on the study by Fairclough et al. (2013), indicating that

225 these acquisitions were not recent (34). We noted a higher GC content at the third codon position

226 of the HGT candidate genes, which is a typical marker for genes derived from bacterial sources

10

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

227 (55,56), yet the genes did not exhibit a divergent intron count. It is possible that foreign genes with

228 translationally optimal codons and high GC3 content that resembles the GC-rich genome of S.

229 rosetta are more likely to be positively selected, as was observed for some horizontally acquired

230 transposable elements (TEs) in this species (16). The prevalence of introns in S. rosetta candidate

231 HGTs further indicate adaptation of acquired genes to the intron-rich genome of S. rosetta.

232 Moreover, most HGT candidates had orthologs in other choanoflagellate taxa, suggesting that the

233 gene transfer events occurred before divergence of the various choanoflagellate groups.

234 Nevertheless, phylogenetic analysis of the candidate HGTs revealed incongruent gene trees, with

235 candidate HGTs clustering with genes from potential donors, including bacteria, microalgae, and

236 fungi. This suggests that candidate horizontally transferred genes in S. rosetta were acquired from

237 multiple prokaryotic and unicellular eukaryotic donors.

238 Most of the potential donors of HGTs in S. rosetta clustered with potential bacterial donors,

239 in contrast to M. brevicollis where HGTs were identified as coming mostly from algal donors (40).

240 S. rosetta is an active phagotroph of bacteria, such as A. machipongonensis and Vibrio spp., which

241 have also been shown to influence its development and metabolic processes (26,27,48,50). Thus,

242 the major mechanism of HGT in S. rosetta may be through the engulfing of food or associated

243 microbes. The acquisition of genes through ingestion would have also allowed for the transfer of

244 more diverse genetic functions as opposed to a more selective one via endosymbiosis. The

245 predation of bacteria corroborate with the “you are what you eat” gene transfer ratchet theory,

246 which suggests that the evolution of the nuclear genome of most protists was driven by acquisition

247 of exogenous genes by phagotrophy or engulfment and acquisition of gene fragments from their

248 food sources (13). HGT events may also be facilitated by the activity of TEs, which are abundant

249 in the genome of S. rosetta (16). However, it should be noted that because some taxonomic

11

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

250 lineages are underrepresented in sequence databases and since some HGTs may have been

251 integrated into the host genome for a long time, the identification of the donor species for HGTs

252 is not straightforward.

253

254 Functions of HGTs in S. rosetta

255 Most candidate HGTs in S. rosetta were orthologous to operational genes, such as enzymes

256 that function in amino acid, carbohydrate, and lipid biosynthesis, as well as genes that function in

257 intercellular signaling and in the establishment and modification of ECM components. Based on

258 the complexity hypothesis proposed by Jain et al. (1999), gene transferability is dependent on two

259 factors: gene function and protein-protein interaction/network interaction (22,23,57–63).

260 Operational genes are more likely to be passed horizontally because they can function

261 independently of other genes (40,64–69). On the other hand, informational genes or genes involved

262 in transcription and translational processes, physically interact with more gene products, limiting

263 their functionality when transferred individually and reducing the possibility that they will be

264 successfully retained as HGTs (58).

265 The acquisition of genes from a vast gene pool enhances genetic variability in free-living

266 organisms. This variability may have conferred upon S. rosetta the ability to explore and establish

267 new niches and adapt to various environmental conditions by mediating interactions with other

268 organisms in the environment and facilitating life stage transitions. The contribution of novel

269 functions acquired via HGT may be higher in choanoflagellates like M. brevicollis and S. rosetta

270 that have retained fewer ancestral gene families compared to other choanoflagellates (70).

271 Novel combinations of protein domains could potentially enhance catalytic efficiency and

272 functional novelty of enzymes. The genome of S. rosetta contains a rich complement of

12

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

273 multidomain genes that may have contributed to its unique biology and morphology (31,34). It is

274 possible that some of these molecular innovations emerged through gene fusion or domain

275 shuffling events that incorporated pre-existing domains with domains acquired from horizontally

276 transferred genes (71).

277

278 Candidate HGTs potentially contribute to metabolism and defense

279 Many of the putative HGTs in S. rosetta contribute to nutrient acquisition and metabolic

280 processes by enabling efficient use of available organic substrates. Diverse transporters facilitate

281 uptake and transport of the building blocks for metabolic pathways (i.e. sugars, phosphates, and

282 amino acids) or the excretion of biomolecules and toxins (40,72,73). Proteases and glycosyl

283 hydrolases break down proteins and complex sugars (74–76). Glycosyl hydrolases, in particular,

284 break down complex sugars and proteoglycans (i.e. heparan sulfate proteoglycans, chondroitin

285 sulfate proteoglycans, and hyaluronans) (77), as well as plant materials and matrix polysaccharides

286 of biofilms (74,75). These enzymes are important for nutrient acquisition in carbohydrate-rich

287 environments, including mud core samples, possibly rich in plant materials, where S. rosetta was

288 isolated (27,78). Conservation of horizontally acquired glycosyl hydrolases in phagotrophic

289 choanoflagellates suggest their importance in digesting various food sources and adapting to an

290 environment high in plant biomass (40). Transporters, proteases, and glycosyl hydrolases have

291 also been flagged as candidate HGTs in M. brevicollis, bdelloid rotifers, sponge, rumen ciliates,

292 and fungi (40,65,66,78,79).

293 HGTs with functions related to amino acid biosynthesis contribute to the metabolic

294 flexibility of S. rosetta. Some of these enzymes, particularly those involved in the DAP pathway

295 of lysine biosynthesis, as well as those involved in the biosynthesis of arginine, threonine and

13

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

296 methionine, have previously been identified as potential HGTs in M. brevicollis (40,41).

297 Conservation of these potential HGTs in the choanoflagellate lineage and their absence in most

298 metazoans, filastereans, and fungi suggest that they were likely transferred into older lineages and

299 retained in the choanoflagellate lineage. Retention of genes involved in amino acid metabolism in

300 choanoflagellates may contribute to unique metabolic competencies. On the other hand, these

301 genes have been lost in the animal stem lineage (80,81) as multicellular animals rely on direct

302 acquisition of essential amino acids from their diet (82). Cnidarians, however, regained the ability

303 to synthesize aromatic amino acids tryptophan, phenylalanine, and other aromatic compounds

304 through HGT events (80).

305 S. rosetta, like other marine organisms, are exposed to multiple biotic and abiotic stressors,

306 including temperature, salinity, and pH fluctuations, osmotic and oxidative stress, and xenobiotics.

307 The retention of horizontally acquired genes such as oxidoreductases, protein-folding chaperones,

308 repair enzymes, and antibiotic degrading enzymes, may protect against cellular damage and

309 strengthen organismal defenses against environmental stressors and potential pathogens (43).

310

311 Candidate HGTs potentially regulate intercellular interactions

312 S. rosetta has several life stages regulated by extrinsic factors. Its genome harbors multiple

313 adhesion receptors and cell membrane enzymes for substrate attachment, cell-cell communication,

314 and colony formation that may regulate these life stage transitions (26,34,48–50,70). Several of

315 these genes were identified as potential HGTs in S. rosetta. HGTs with dermatopontin (DPT/Ca2+

316 EGF) domains may function similar to dermatopontin, which is an acidic multifunctional matrix

317 protein that promotes cell adhesion, ECM collagen fibrillogenesis, and cell assembly mediated by

318 cell surface integrin binding (83–87). It is also a major component of the organic matrix of

14

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

319 biomineralized tissues (e.g. mussels) (88). It is possible that expression of dermatopontin domain-

320 containing genes in thecate cells is involved in controlling specific cell behaviors, such as

321 phagocytosis, mating, or colony formation in S. rosetta. Absence of genes with DPT/Ca2+ EGF

322 domains in some loricate choanoflagellates further highlights the possible importance of these

323 genes in the production of the organic theca in Craspedida.

324 S. rosetta also has a rich repertoire of glycosyl hydrolases, glycosyltransferases, and

325 mannosyltransferases that were potentially acquired through horizontal transfer. These enzymes

326 are known to sculpt the ECM of animals by changing the structure of the extracellular matrix

327 components and by modifying functional groups on molecules to regulate their interactions

328 (89,90). Glycosyl hydrolases may degrade proteoglycans to control the mating process of S. rosetta

329 (50,91). Glycosyltransferases regulate key signaling and adhesion proteins like cadherins and

330 integrins (92–94). In particular, mannosyltransferases modify mannose groups on glycoproteins,

331 which mediate the binding of C-type lectin domain proteins, such as the rosetteless gene that is

332 crucial in rosette formation (95).

333 The S. rosetta genome contains two chondroitin sulfate lyase genes, both of which are

334 candidate HGTs from potential bacterial donors. It was suggested that these proteins may be

335 involved in endogenous processes for regulating mating in S. rosetta (50). It is also possible that

336 GAG lyases contribute to the ability of choanoflagellates to modify cell walls and extracellular

337 matrix proteins, which may facilitate transitions from one life stage to the next. Chondroitin sulfate

338 lyase cleaves chondroitin sulfate glycosaminoglycans (GAGs) via an elimination mechanism

339 resulting in disaccharides or oligosaccharides (96). GAGs are typically found as side chains on

340 proteoglycans on cell membranes and the ECM of animal tissues where they regulate processes

341 such as adhesion, differentiation, migration, proliferation, and cell-cell communication (96). The

15

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

342 proteoglycan, chondroitin sulfate, was found to partially suppress the adhesive properties of

343 dermatopontin (83–87), suggesting that breakdown of chondroitin sulfate through lyase activity

344 may promote stronger cell adhesion. The bacterium, Vibrio fischerii, produces a chondroitin lyase

345 called EroS (“extracellular regulator of sex”) that induces mating in the choanoflagellate, S. rosetta

346 (50). The conservation of chondroitin lyase genes in S. rosetta and most choanoflagellate

347 representatives suggest that these genes may be uncovered unique adaptive mechanisms in

348 choanoflagellates. Further studies on the expression of these genes in different environmental

349 conditions are needed in order to better understand their potential functions in the host.

350 As with other studies on the detection of HGTs in eukaryotes, it is important to note that

351 the current work is limited by the lack of broader taxon representation in publicly available

352 databases. In addition, parametric tests may not accurately estimate the number of HGT events,

353 particularly for ancient transfer events. Moreover, it can be difficult to distinguish HGT events

354 from gene gain or loss events in the last universal common ancestor as both result in patchy

355 distribution of genes in the species tree. These limitations may have resulted in overestimation of

356 the number of horizontally acquired genes in S. rosetta. Further development of methods to

357 effectively filter out false positives and fine tune the results of HGT analysis is needed to reveal

358 the true extent of horizontally acquired genes in the last common ancestor of choanoflagellates

359 and animals.

360

361 Conclusions

362

363 We detected a rich repertoire of genes potentially acquired through horizontal transfer in

364 the genome of S. rosetta. Most of these genes were possibly ancient horizontal transfers gained

16

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

365 prior to the divergence of choanoflagellates, as evidenced by similar genomic signatures and

366 expression motifs to the host, as well as conservation in other choanoflagellate taxa. Our results

367 show that the gain of genes from the microorganisms that S. rosetta interacts with may have played

368 a key role in the diversification of cellular metabolic processes and contributed novel functions

369 that enhanced the catalytic ability of enzymes thereby allowing S. rosetta to colonize diverse

370 ecological niches. In addition, the acquisition of genes in S. rosetta involved in extra-cellular

371 senses and sensory response to its environment, through the induction of reproduction and

372 multicellular development, may have helped shape choanoflagellate evolution and multicellular

373 life stages. We anticipate that future expression analysis of these promising candidate HGTs will

374 provide a more in-depth understanding on their potential roles in choanoflagellates and animal

375 multicellularity.

376

377 Materials and Methods

378

379 Identification of HGTs by Sequence Alignment

380 The 11,731 predicted S. rosetta protein sequences downloaded from the Ensembl protist database

381 (97) (http://www.ensembl.org/; last accessed October, 2018) were aligned to the complete NCBI

382 non-redundant (nr) protein database (http://www.ncbi.nlm.nih.gov/; last accessed October, 2018)

383 using Diamond(98) local sequence alignment with a threshold E-value of 1x10-5. The taxonomic

384 affiliation of hits were retrieved from NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy),

385 and a Python (99) script was used to collect hits with specific taxon IDs. Only sequences with the

386 lowest E-value corresponding to bacteria, archaea, (i.e. Chlorophyta, Stramenopiles,

387 Bacillariophyta, Pelagophyta, and Haptophta), unicellular fungi (i.e. Ascomycota and

17

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

388 Basidiomycota), or protozoa were considered as potential HGTs; peptides with no hits to archaea,

389 bacteria, unicellular algae, unicellular fungi and other protozoans were disregarded from further

390 analysis.

391

392 Determining the Alien Index Scores of Candidate HGTs

393 Alien Index (AI) analysis quantitatively measures how well the S. rosetta protein sequences align

394 to non-metazoan versus metazoan protein sequences (65). The E-value of the best sequence

395 alignment match of S. rosetta peptides against all metazoan or non-metazoan gene sequences from

396 the NCBI nr database were used to compute the AI score of each gene using the formula (65):

397

((푏푒푠푡 푚푒푡푎푧표푎푛 퐸푣푎푙푢푒) + 1퐸 − 200) ((푏푒푠푡 푛표푛 − 푚푒푡푎푧표푎푛 퐸푣푎푙푢푒) + 1퐸 − 200) 398 퐴퐼 = log − log

399

400 In cases where S. rosetta sequences had no hits to non-metazoans (excluding choanoflagellates) or

401 metazoans, the E-value was set to 1. Genes that scored ≥45 were classified as foreign, 0 ≤ AI ≤ 45

402 as indeterminate, and less than 0 as metazoan genes (65).

403

404 Genomic feature analysis of candidate HGTs and identification of gene orthologs

405 Genomic features of putative HGTs, specifically GC content at the third codon position (GC3) and

406 codon usage, as represented by the codon bias index (CBI), were determined using correspondence

407 analysis on CodonW (100). Coding sequence (CDS) lengths and intron numbers were determined

408 from published information on the S. rosetta genome on Ensembl (97). To determine if there were

409 mean differences between the genomic features of candidate HGTs compared to all S. rosetta

410 genes, we performed Kruskal-Wallis test in RStudio version 1.2.1335 (101) with R version 3.6.1

18

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

411 (101). To identify orthologs of candidate HGTs, we obtained OrthoMCL groups data from the

412 study of Richter et al. (2018).

413

414 Gene expression analysis

415 To examine expression of candidate HGTs, we obtained transcriptome data from the study of

416 Fairclough et al. (2013). Sequence libraries representing various life stages of S. rosetta were

417 downloaded from the NCBI Short Read Archive (rosette cells, SRX042054; chain cells,

418 SRX042047; thecate cells, SRX042052; swimming cells, SRX042053). Reads were mapped

419 against the predicted cDNA sequences of S. rosetta using kallisto (102) with default settings to

420 obtain gene expression values in transcripts per million reads (TPM).

421

422 Protein domain and gene ontology analysis

423 Identification of protein domains and gene ontology associations was conducted on Blast2GO

424 (103) to determine possible functions of candidate HGTs in S. rosetta. Gene ontology terms

425 associated with each predicted peptide were determined from its best Diamond hit at an E-value ≤

426 1x10-5 to the UniProt database (104). Enriched functions in the set of putative HGTs versus non-

427 HGTs were identified using the Fisher’s exact test in topGO (105) in R package version 3.6.1

428 (101). Only functions with an FDR-corrected p-value ≤ 0.05 were considered statistically

429 significant.

430

431 Phylogenetic analysis of candidate HGTs

432 Peptide sequences of candidate HGTs were aligned with representative sequences from selected

433 metazoa, eukaryotic, and prokaryotic taxa using MAFFT version 7 (106) then trimmed using

19

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

434 Gblocks (107) with default settings to remove ambiguous and divergent protein alignments. For

435 each protein sequence, Akaike information criterion (AIC) and Bayesian information criterion

436 (BIC) scores were calculated in MEGAX (108). The best evolutionary model for phylogeny

437 between different models tested for the protein sequences was determined by using the calculated

438 AIC and BIC scores that had the smallest score difference, in this case mtrev substitution model

439 was consistently used. Markov Chain Monte Carlo (MCMC) parameters of each analysis were set

440 to 100,000 generations sampled every 100 trees. By default, the first 25% of the trees were

441 discarded as burn-in. Bayesian phylogenetic trees were constructed using MrBayes 3.2.6 with

442 posterior probabilities of 0.70-1.00 indicated on selected branches (109). Trees were edited online

443 using Interactive Tree Of Life (iTOL) (110).

444

445 Figures and Tables

446

447 Fig. 1. HGT candidates in the S. rosetta genome. (A) Number of putative HGTs detected using

448 sequence alignment and Alien Index (AI45) analysis. (B) Taxonomic affiliation of the best

449 sequence match for all S. rosetta genes and genes that passed the AI>45 filter.

450

451 Fig. 2. Gene architecture of candidate HGTs. Histograms showing the distribution of (A) coding

452 sequence length (CDS), (B) GC content at the third codon position (GC3), (C) intron number, and

453 (D) codon bias index (CBI) in candidate horizontally transferred genes (red) in comparison to the

454 bulk of S. rosetta genes (gray). (E) Number of HGT candidates that are expressed in the indicated

455 life stages of S. rosetta.

456

20

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

457 Fig. 3. Conservation of candidate HGTs. (A) The number of HGT candidates with orthologs in

458 other lineages. (B) The number of HGT candidates with orthologs in choanoflagellate taxa within

459 orders Craspedida (clade 1 and 2) and Acanthoecida (family Acanthoecidae and Stephanoecidae).

460

461 Fig. 4. Potential donor phyla of candidate HGTs in S. rosetta. (A) Taxon affiliation of candidate

462 HGTs based on the best sequence match for each gene. Bayesian analysis of selected candidate

463 HGTs, including a (B) glycosyl hydrolase of prokaryotic origin, (C) ABC transporter of algal

464 origin, and (D) mannosyltransferase of fungal origin. S. rosetta genes are indicated in bold red

465 letters. Genes from other choanoflagellates are shown in red, algae in green, fungi or opisthokonts

466 in orange, prokaryotes in blue, and metazoans in black. Circles at the branches indicate posterior

467 probabilities of 0.70-1.00.

468

469 Fig. 5. Functional analysis of candidate HGTs. (A) Most common PFAM protein domains in

470 the set of candidate HGTs. (B) Gene ontology functions enriched in the set of putative HGTs

471 (cellular component, CC; molecular function, MF; biological process, BP). Enrichment p-values

472 (p ≤ 0.05) for selected functions are shown. Text colors in A and B indicate general functional

473 groupings. (C) Number of candidate HGTs with associated functions based on manual curation.

474

475 Additional Data

476

477 Table S1. Summary of HGT analyses for all S. rosetta genes.

478 Table S2. Expression values of candidate HGTs

479 Table S3. Orthologs of candidate HGTs

21

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

480 Table S4. Candidate HGTs from selected potential bacterial donors

481 Table S5. PFAM domain architectures in candidate HGTs

482 Table S6. Gene ontology enrichment in candidate HGTs

483 Table S7. Associated functions of candidate HGTs

484

485 Additional file 1. Phylogenetic analysis of selected candidate HGTs in S. rosetta. Peptide

486 sequences of candidate HGTs were aligned to representative sequences from selected metazoa,

487 eukaryotic, and prokaryotic taxa were using MAFFT version 7. Phylogenetic trees were generated

488 using MrBayes 3.2.6 with posterior probabilities of 0.70-1.00 indicated on selected branches.

489

490 Additional file 2. Candidate HGTs in amino acid biosynthetic pathways. Genes flagged as

491 HGTs in S. rosetta are shown in red lines, while those detected as HGTs in both S. rosetta and M.

492 brevicollis are shown in blue lines. Text colors indicate potential donors of candidate HGTs.

493

494 Abbreviations

495 ABC transporter: Adenosine triphosphate-binding cassette transporter

496 AI: Alien Index

497 Asd: aspartate-semialdehyde dehydrogenase

498 CBI: Codon bias index

499 CDS: coding sequence

500 DapF: diaminopimelate epimerase

501 DPT/Ca2+ EGF: Dermatopontin/calcium binding epidermal growth factor domain

502 ECM: Extracellular matrix

22

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

503 EroS: extracellular regulator of sex

504 FAD/NAD: Flavin adenine dinucleotide/ nucleotide adenine dinucleotide binding domain

505 FPKM: Fragments Per Kilobase of transcript per Million mapped reads

506 GAG lyase: glycosaminoglycan lyase

507 GO: Gene Ontology

508 HGT: Horizontal gene transfer

509 HSP: heat shock protein

510 HTT: Horizontal transposon transfers

511 LGT: Lateral gene transfer

512 LysC-lysA fusion: diaminopimelate decarboxylase and aspartate kinase

513 MCMC: Markov Chain Monte Carlo

514 ML: Maximum likelihood

515 NHL repeats: Ncl-1, HT2A and lin-41 (NHL) repeats

516 PFAM: Protein family

517 TE: transposable element

518

519 Declarations

520

521 Consent for publication

522 Not applicable.

523

524

525

23

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

526 Availability of data and materials

527 The datasets generated during and/or analysed during the current study are available in: Mateo-

528 Matriano D, Alegado RA, Conaco C. Detection of Horizontal Gene Transfer in the

529 Choanoflagellate Salpingoeca rosetta data sets. figshare. 2020. [DOI:

530 10.6084/m9.figshare.12286658.v1]

531

532 Competing interests

533 The other authors declare that they have no competing interests.

534

535 Funding

536 This research project was made possible by generous funding support from the Department of

537 Science and Technology Accelerated Science and Technology Human Resource Development

538 Program-National Science Consortium (DOST-ASTHRDP-NSC) and the Marine Science Institute

539 (MSI), University of the Philippines, Diliman.

540

541 Authors' contributions

542 DM and CC together conceived and designed the study. DM and CC conducted the analyses and

543 drafted the manuscript. CC and RA advised on data analyses and helped to draft and edit the

544 manuscript. All authors read and approved the final manuscript.

545

546 Acknowledgements

547 We acknowledge the Core Facility for Bioinformatics, Philippine Genome Center (CFB-PGC)

548 especially Joshua Dizon and Francis Tablizo for helping with the construction of the databases and

24

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

549 the scripts used in this project. We greatly appreciate the insightful comments and suggestions of

550 Dr. Cheryl Andam (College of Life Sciences and Agriculture UNH), Dr. Deo Onda (UP, MSI),

551 Dr. Ron Leonard Dy (UP, NIMBB) and Becca Lensing (UH, C-MORE) on the analysis and

552 interpretation of the data.

553

554 References

555 1. Lorenz MG, Wackernagel W. Bacterial gene transfer by natural genetic transformation in

556 the environment. Microbiol Rev. 1994;58(3):563–602.

557 2. Dubnau D. DNA Uptake in Bacteria. Annu Rev Microbiol. 1999;53(1):217–44.

558 3. Chen I, Dubnau D. DNA uptake during bacterial transformation. Vol. 2, Nature Reviews

559 Microbiology. 2004. p. 241–9.

560 4. Heinemann JA, Sprague GF. Bacterial conjugative plasmids mobilize DNA transfer

561 between bacteria and yeast. Nature. 1989;340(6230):205–9.

562 5. Llosa M, Gomis-Rüth FX, Coll M, De la Cruz F. Bacterial conjugation: A two-step

563 mechanism for DNA transport. Mol Microbiol. 2002;45(1):1–8.

564 6. Haas AL, Baboshina O, Williams B, Schwartz LM. Coordinated induction of the ubiquitin

565 conjugation pathway accompanies the developmentally programmed death of insect

566 skeletal muscle. J Biol Chem. 1995;270(16):9407–12.

567 7. Norman A, Hansen LH, Sørensen SJ. Conjugative plasmids: Vessels of the communal

568 gene pool. Vol. 364, Philosophical Transactions of the Royal Society B: Biological

569 Sciences. 2009. p. 2275–89.

570 8. Kyndt T, Quispe D, Zhai H, Jarret R, Ghislain M, Liu Q, et al. The genome of cultivated

571 sweet potato contains Agrobacterium T-DNAs with expressed genes: An example of a

25

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

572 naturally transgenic food crop. Proc Natl Acad Sci U S A. 2015;112(18):5844–9.

573 9. Keeling PJ, Palmer JD. Horizontal gene transfer in eukaryotic evolution. Vol. 9, Nature

574 Reviews Genetics. 2008. p. 605–18.

575 10. Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: Building the web of life. Vol.

576 16, Nature Reviews Genetics. 2015. p. 472–82.

577 11. Huang J. Horizontal gene transfer in eukaryotes: The weak-link model. BioEssays.

578 2013;35(10):868–75.

579 12. Takeuchi N, Kaneko K, Koonin E V. Horizontal gene transfer can rescue prokaryotes

580 from Muller’s ratchet: Benefit of DNA from dead cells and population subdivision. G3

581 Genes, Genomes, Genet. 2014;4(2):325–39.

582 13. Ford Doolittle W. You are what you eat: A gene transfer ratchet could account for

583 bacterial genes in eukaryotic nuclear genomes. Vol. 14, Trends in Genetics. 1998. p. 307–

584 11.

585 14. O’Malley MA. Endosymbiosis and its implications for evolutionary theory. Proc Natl

586 Acad Sci U S A. 2015;112(33):10270–7.

587 15. Margulis L, Chapman M, Guerrero R, Hall J. The last eukaryotic common ancestor

588 (LECA): Acquisition of cytoskeletal motility from aerotolerant spirochetes in the

589 Proterozoic Eon. Proc Natl Acad Sci U S A. 2006;103(35):13080–5.

590 16. Southworth J, Grace CA, Marron AO, Fatima N, Carr M. A genomic survey of

591 transposable elements in the choanoflagellate Salpingoeca rosetta reveals selection on

592 codon usage. Vol. 10, Mobile DNA. 2019.

593 17. Pace JK, Gilbert C, Clark MS, Feschotte C. Repeated horizontal transfer of a DNA

594 transposon in mammals and other tetrapods. Proc Natl Acad Sci U S A.

26

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

595 2008;105(44):17023–8.

596 18. Mizrokhi LJ, Mazo AM. Evidence for horizontal transmission of the mobile element

597 jockey between distant Drosophila species. Proc Natl Acad Sci U S A. 1990;87(23):9216–

598 20.

599 19. Williams D, Fournier GP, Lapierre P, Swithers KS, Green AG, Andam CP, et al. A

600 Rooted Net of Life. Vol. 6, Biology Direct. 2011.

601 20. Andam CP, Gogarten JP. Biased gene transfer and its implications for the concept of

602 lineage. Biol Direct. 2011;6.

603 21. Huang J, Gogarten JP. Ancient horizontal gene transfer can benefit phylogenetic

604 reconstruction. Trends Genet. 2006;22(7):361–6.

605 22. Jain R, Rivera MC, Lake JA. Horizontal gene transfer among genomes: The complexity

606 hypothesis. Proc Natl Acad Sci U S A. 1999;96(7):3801–6.

607 23. Jain R, Rivera MC, Moore JE, Lake JA. Horizontal gene transfer accelerates genome

608 innovation and evolution. Mol Biol Evol. 2003;20(10):1598–602.

609 24. Swithers KS, Gogarten JP, Fournier GP. Trees in the web of life. Vol. 8, Journal of

610 Biology. 2009.

611 25. Redrejo-Rodriǵuez M, Munõz-Espín D, Holguera I, Menciá M, Salas M. Functional

612 eukaryotic nuclear localization signals are widespread in terminal proteins of

613 bacteriophages. Proc Natl Acad Sci U S A. 2012;109(45):18482–7.

614 26. Dayel MJ, King N. Prey capture and phagocytosis in the choanoflagellate Salpingoeca

615 rosetta. PLoS One. 2014;9(5).

616 27. Dayel MJ, Alegado RA, Fairclough SR, Levin TC, Nichols SA, McDonald K, et al. Cell

617 differentiation and morphogenesis in the colony-forming choanoflagellate Salpingoeca

27

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

618 rosetta. Dev Biol. 2011;357(1):73–82.

619 28. Richter DJ, Nitsche F. Choanoflagellatea. In: Handbook of the Protists: Second Edition.

620 2017. p. 1479–96.

621 29. Hoffmeyer TT, Burkhardt P. Choanoflagellate models — Monosiga brevicollis and

622 Salpingoeca rosetta. Vol. 39, Current Opinion in Genetics and Development. 2016. p. 42–

623 7.

624 30. Lang BF, O’Kelly C, Nerad T, Gray MW, Burger G. The closest unicellular relatives of

625 animals. Curr Biol. 2002;12(20):1773–8.

626 31. Ruiz-Trillo I, Roger AJ, Burger G, Gray MW, Lang BF. A phylogenomic investigation

627 into the origin of Metazoa. Mol Biol Evol. 2008;25(4):664–72.

628 32. Carr M, Richter DJ, Fozouni P, Smith TJ, Jeuck A, Leadbeater BSC, et al. A six-gene

629 phylogeny provides new insights into choanoflagellate evolution. Mol Phylogenet Evol.

630 2017;107:166–78.

631 33. Nitsche F, Carr M, Arndt H, Leadbeater BSC. Higher level taxonomy and molecular

632 phylogenetics of the Choanoflagellatea. J Eukaryot Microbiol. 2011;58(5):452–62.

633 34. Fairclough SR, Chen Z, Kramer E, Zeng Q, Young S, Robertson HM, et al. Premetazoan

634 genome evolution and the regulation of cell differentiation in the choanoflagellate

635 Salpingoeca rosetta. Genome Biol. 2013;14(2):1–15.

636 35. King N, Westbrook MJ, Young SL, Kuo A, Abedin M, Chapman J, et al. The genome of

637 the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature.

638 2008;451(7180):783–8.

639 36. Jeuck A, Arndt H, Nitsche F. Extended phylogeny of the Craspedida (Choanomonada).

640 Eur J Protistol. 2014;50(4):430–43.

28

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

641 37. Marron AO, Alston MJ, Heavens D, Akam M, Caccamo M, Holland PWH, et al. A family

642 of diatom-like silicon transporters in the siliceous loricate choanoflagellates. Proc R Soc B

643 Biol Sci. 2013;280(1756).

644 38. Bapteste E, Moreira D, Philippe H. Rampant horizontal gene transfer and phospho-donor

645 change in the evolution of the phosphofructokinase. Gene. 2003;318(1–2):185–91.

646 39. Malik SB, Ramesh MA, Hulstrand AM, Logsdon JM. Protist homologs of the meiotic

647 Spo11 gene and topoisomerase VI reveal an evolutionary history of gene duplication and

648 lineage-specific loss. Mol Biol Evol. 2007;24(12):2827–41.

649 40. Yue J, Sun G, Hu X, Huang J. The scale and evolutionary significance of horizontal gene

650 transfer in the choanoflagellate Monosiga brevicollis. BMC Genomics. 2013;14(1).

651 41. Sun G, Huang J. Horizontally acquired DAP pathway as a unit of self-regulation. J Evol

652 Biol. 2011;24(3):587–95.

653 42. Maruyama S, Matsuzaki M, Misawa K, Nozaki H. Cyanobacterial contribution to the

654 genomes of the plastid-lacking protists. BMC Evol Biol. 2009;9(1).

655 43. Nedelcu AM, Miles IH, Fagir AM, Karol K. Adaptive eukaryote-to-eukaryote lateral gene

656 transfer: Stress-related genes of algal origin in the closest unicellular relatives of animals.

657 J Evol Biol. 2008;21(6):1852–60.

658 44. Nedelcu AM, Blakney AJC, Logue KD. Functional replacement of a primary metabolic

659 pathway via multiple independent eukaryote-to-eukaryote gene transfers and selective

660 retention. J Evol Biol. 2009;22(9):1882–94.

661 45. Tucker RP, Beckmann J, Leachman NT, Schöler J, Chiquet-Ehrismann R. Phylogenetic

662 analysis of the teneurins: Conserved features and premetazoan ancestry. Mol Biol Evol.

663 2012;29(3):1019–29.

29

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

664 46. Torruella G, Suga H, Riutort M, Peretó J, Ruiz-Trillo I. The evolutionary history of lysine

665 biosynthesis pathways within eukaryotes. J Mol Evol. 2009;69(3):240–8.

666 47. Fairclough SR, Dayel MJ, King N. Multicellular development in a choanoflagellate. Vol.

667 20, Current Biology. 2010.

668 48. Alegado RA, Brown LW, Cao S, Dermenjian RK, Zuzow R, Fairclough SR, et al. A

669 bacterial sulfonolipid triggers multicellular development in the closest living relatives of

670 animals. Elife. 2012;2012(1).

671 49. Woznica A, Cantley AM, Beemelmanns C, Freinkman E, Clardy J, King N. Bacterial

672 lipids activate, synergize, and inhibit a developmental switch in choanoflagellates. Proc

673 Natl Acad Sci U S A. 2016;113(28):7894–9.

674 50. Woznica A, Gerdt JP, Hulett RE, Clardy J, King N. Mating in the Closest Living Relatives

675 of Animals Is Induced by a Bacterial Chondroitinase. Cell. 2017;170(6):1175-1183.e11.

676 51. Ray S, Kassan A, Busija AR, Rangamani P, Patel HH. The plasma membrane as a

677 capacitor for energy and metabolism. Vol. 310, American Journal of Physiology - Cell

678 Physiology. 2016. p. C181–92.

679 52. Daubin V, Lerat E, Perrière G. The source of laterally transferred genes in bacterial

680 genomes. Genome Biol. 2003;4(9).

681 53. Lawrence JG, Ochman H. Molecular archaeology of the Escherichia coli genome. Proc

682 Natl Acad Sci U S A. 1998;95(16):9413–7.

683 54. Langille MGI, Hsiao WWL, Brinkman FSL. Detecting genomic islands using

684 bioinformatics approaches. Vol. 8, Nature Reviews Microbiology. 2010. p. 373–82.

685 55. Hildebrand F, Meyer A, Eyre-Walker A. Evidence of selection upon genomic GC-content

686 in bacteria. PLoS Genet. 2010;6(9).

30

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

687 56. Hershberg R, Petrov DA. Evidence that mutation is universally biased towards AT in

688 bacteria. PLoS Genet. 2010;6(9).

689 57. Ford Doolittle W. Lateral genomics. Vol. 24, Trends in Biochemical Sciences. 1999.

690 58. Rivera MC, Jain R, Moore JE, Lake JA. Genomic evidence for two functionally distinct

691 gene classes. Proc Natl Acad Sci U S A. 1998;95(11):6239–44.

692 59. Sicheritz-Ponten T. A phylogenomic approach to microbial evolution. Nucleic Acids Res.

693 2001;29(2):545–52.

694 60. Gogarten JP, Senejani AG, Zhaxybayeva O, Olendzenski L, Hilario E. Inteins: Structure,

695 Function, and Evolution. Annu Rev Microbiol. 2002;56(1):263–87.

696 61. Brown JR. Ancient horizontal gene transfer. Vol. 4, Nature Reviews Genetics. 2003. p.

697 121–32.

698 62. Wellner A, Lurie MN, Gophna U. Complexity, connectivity, and duplicability as barriers

699 to lateral gene transfer. Genome Biol. 2007;8(8).

700 63. Lercher MJ, Pál C. Integration of horizontally transferred genes into regulatory interaction

701 networks takes many million years. Mol Biol Evol. 2008;25(3):559–67.

702 64. Yue J, Hu X, Huang J. Origin of plant auxin biosynthesis. Vol. 19, Trends in Plant

703 Science. 2014. p. 764–70.

704 65. Gladyshev EA, Meselson M, Arkhipova IR. Massive horizontal gene transfer in bdelloid

705 rotifers. Science (80- ). 2008;320(5880):1210–3.

706 66. Conaco C, Tsoulfas P, Sakarya O, Dolan A, Werren J, Kosik KS. Detection of prokaryotic

707 genes in the amphimedon queenslandica genome. PLoS One. 2016;11(3).

708 67. Boschetti C, Carr A, Crisp A, Eyres I, Wang-Koh Y, Lubzens E, et al. Biochemical

709 Diversification through Foreign Gene Expression in Bdelloid Rotifers. PLoS Genet.

31

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

710 2012;8(11).

711 68. Lal D, Lal R. Evolution of mercuric reductase (merA) gene: A case of horizontal gene

712 transfer. Microbiology. 2010;79(4):500–8.

713 69. Eyres I, Boschetti C, Crisp A, Smith TP, Fontaneto D, Tunnacliffe A, et al. Horizontal

714 gene transfer in bdelloid rotifers is ancient, ongoing and more frequent in species from

715 desiccating habitats. BMC Biol. 2015;13(1).

716 70. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss

717 on the animal stem lineage. Elife. 2018;7.

718 71. Leadbeater BSC. The choanoflagellates: Evolution, biology and ecology. The

719 Choanoflagellates: Evolution, Biology and Ecology. 2014. 1–315 p.

720 72. Walmsley AR, Barrett MP, Bringaud F, Gould GW. Sugar transporters from bacteria,

721 parasites and mammals: Structure-activity relationships. Vol. 23, Trends in Biochemical

722 Sciences. 1998. p. 476–81.

723 73. Goffeau A, De Hertogh B. ABC Transporters. In: Encyclopedia of Biological Chemistry:

724 Second Edition. 2013. p. 7–11.

725 74. Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B. The

726 carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 2014;42(D1).

727 75. Michikawa M, Ichinose H, Momma M, Biely P, Jongkees S, Yoshida M, et al. Structural

728 and biochemical characterization of glycoside hydrolase family 79 β-glucuronidase from

729 Acidobacterium capsulatum. J Biol Chem. 2012;287(17):14069–77.

730 76. van der Hoorn RAL. Plant Proteases: From Phenotypes to Molecular Mechanisms. Annu

731 Rev Plant Biol. 2008;59(1):191–223.

732 77. Tan L, Eberhard S, Pattathil S, Warder C, Glushka J, Yuan C, et al. An Arabidopsis cell

32

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

733 wall proteoglycan consists of pectin and arabinoxylan covalently linked to an

734 arabinogalactan protein. Plant Cell. 2013;25(1):270–87.

735 78. Ricard G, McEwan NR, Dutilh BE, Jouany JP, Macheboeuf D, Mitsumori M, et al.

736 Horizontal gene transfer from bacteria to rumen ciliates indicates adaptation to their

737 anaerobic, carbohydrates-rich environment. BMC Genomics. 2006;7.

738 79. Garcia-Vallvé S, Romeu A, Palau J. Horizontal gene transfer of glycosyl hydrolases of the

739 rumen fungi. Mol Biol Evol. 2000;17(3):352–61.

740 80. Starcevic A, Akthar S, Dunlap WC, Shick JM, Hranueli D, Cullum J, et al. Enzymes of

741 the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella

742 vectensis, have microbial origins. Proc Natl Acad Sci U S A. 2008;105(7):2533–7.

743 81. López-Escardó D, Grau-Bové X, Guillaumet-Adkins A, Gut M, Sieracki ME, Ruiz-Trillo

744 I. Reconstruction of protein domain evolution using single-cell amplified genomes of

745 uncultured choanoflagellates sheds light on the origin of animals. Philos Trans R Soc B

746 Biol Sci. 2019;374(1786).

747 82. Payne SH, Loomis WF. Retention and loss of amino acid biosynthetic pathways based on

748 analysis of whole-genome sequences. Eukaryot Cell. 2006;5(2):272–6.

749 83. Kuroda K, Okamoto O, Shinkai H. Dermatopontin expression is decreased in hypertrophic

750 scar and systemic sclerosis skin fibroblasts and is regulated by transforming growth

751 factor- β1, interleukin-4, and matrix collagen. J Invest Dermatol. 1999;112(5):706–10.

752 84. Okamoto O, Fujiwara S. Dermatopontin, a novel player in the biology of the extracellular

753 matrix. Vol. 47, Connective Tissue Research. 2006. p. 177–89.

754 85. Lewandowska K, Choi HU, Rosenberg LC, Sasse J, Neame PJ, Culp LA. Extracellular

755 matrix adhesion-promoting activities of a dermatan sulfate proteoglycan-associated

33

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

756 protein (22K) from bovine fetal skin. J Cell Sci. 1991;99(3):657–68.

757 86. Lewandowska K, Choi HU, Rosenberg LC, Zardi L, Culp LA. Fibronectin-mediated

758 adhesion of fibroblasts: Inhibition by dermatan sulfate proteoglycan and evidence for a

759 cryptic glycosaminoglycan-binding domain. J Cell Biol. 1987;105(3):1443–54.

760 87. Winnemoller M, Schon P, Vischer P, Kresse H. Interactions between thrombospondin and

761 the small proteoglycan decorin: Interference with cell attachment. Eur J Cell Biol.

762 1992;59(1):47–55.

763 88. Marxen JC, Nimtz M, Becker W, Mann K. The major soluble 19.6 kDa protein of the

764 organic shell matrix of the freshwater snail Biomphalaria glabrata is an N-glycosylated

765 dermatopontin. Biochim Biophys Acta - Proteins Proteomics. 2003;1650(1–2):92–8.

766 89. Wetzel LA, Levin TC, Hulett RE, Chan D, King GA, Aldayafleh R, et al. Predicted

767 glycosyltransferases promote development and prevent spurious cell clumping in the

768 choanoflagellate S. Rosetta. Elife. 2018;7.

769 90. Larson BT, Ruiz-Herrero T, Lee S, Kumar S, Mahadevan L, King N. Biophysical

770 principles of choanoflagellate self-organization. Proc Natl Acad Sci U S A.

771 2020;117(3):1303–11.

772 91. Trincone A, Tramice A, Giordano A, Andreotti G. Glycoside hydrolases in aplysia

773 fasciata: Analysis and applications. Biotechnol Genet Eng Rev. 2008;25(1):129–48.

774 92. Sawaguchi S, Varshney S, Ogawa M, Sakaidani Y, Yagi H, Takeshita K, et al. O-GlcNAc

775 on NOTCH1 EGF repeats regulates ligand-induced Notch signaling and vascular

776 development in mammals. Elife. 2017;6.

777 93. Stratford M. Yeast flocculation: Receptor definition by mnn mutants and concanavalin A.

778 Yeast. 1992;8(8):635–45.

34

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

779 94. Naderer T, Vince J, McConville M. Surface Determinants of Leishmania Parasites and

780 their Role in Infectivity in the Mammalian Host. Curr Mol Med. 2005;4(6):649–65.

781 95. Levin TC, Greaney AJ, Wetzel L, King N. The Rosetteless gene controls development in

782 the choanoflagellate S. rosetta. Elife. 2014;3.

783 96. Zhang F, Zhang Z, Linhardt RJ. Glycosaminoglycans. In: Handbook of Glycomics. 2010.

784 p. 59–80.

785 97. Kersey PJ, Allen JE, Allot A, Barba M, Boddu S, Bolt BJ, et al. Ensembl Genomes 2018:

786 An integrated omics infrastructure for non-vertebrate species. Nucleic Acids Res.

787 2018;46(D1):D802–8.

788 98. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND.

789 Vol. 12, Nature Methods. 2014. p. 59–60.

790 99. Python Software Foundation. Python Language Reference, version 3.5. Python Software

791 Foundation. 2016.

792 100. Wang M, Zhang J, Zhou JH, Chen HT, Ma LN, Ding YZ, et al. Analysis of codon usage

793 in type 1 and the new genotypes of duck hepatitis virus. BioSystems. 2011;106(1):45–50.

794 101. 3.5.1. RDCT. A Language and Environment for Statistical Computing. R Found Stat

795 Comput [Internet]. 2018;2:https://www.R-project.org. Available from: http://www.r-

796 project.org

797 102. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq

798 quantification. Nat Biotechnol. 2016;34(5):525–7.

799 103. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: A

800 universal tool for annotation, visualization and analysis in functional genomics research.

801 Bioinformatics. 2005;21(18):3674–6.

35

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

802 104. Bateman A. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res.

803 2019;47(D1):D506–15.

804 105. Alexa A, Rahnenfuhrer J. topGO: topGO: Enrichment analysis for Gene On-tology. R

805 package version 2.26.0. 2009. p. R package version 2.22.0.

806 106. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7:

807 Improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.

808 107. Castresana J. Selection of conserved blocks from multiple alignments for their use in

809 phylogenetic analysis. Mol Biol Evol. 2000;17(4):540–52.

810 108. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular evolutionary

811 genetics analysis across computing platforms. Mol Biol Evol. 2018;35(6):1547–9.

812 109. Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, Höhna S, et al. Mrbayes

813 3.2: Efficient bayesian phylogenetic inference and model choice across a large model

814 space. Syst Biol. 2012;61(3):539–42.

815 110. Letunic I, Bork P. Interactive Tree of Life (iTOL) v4: Recent updates and new

816 developments. Nucleic Acids Res. 2019;47(W1).

817

36

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A 100 Metazoa 90 Choanoflagellata 80 4527 Other eukaryote 70 Prokaryote 480 Fungi 60 SAR 50 Chlorophyta 4391 40 Haptophyta

Percent of genes 30 42 Rhodophyta 52 20 967 Unclassified 588 –117 76 10 –76 523 –30 503 53 0 –9 All NR AI≥45 B Taxon filter AI45 1134 703 180 (1837) (883) bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A B C 0.10 Candidate HGTs Candidate HGTs Candidate HGTs 4e-4 (median: 1509) 5 (median: 0.723) (median: 5) Other genes Other genes 0.08 Other genes (median: 1401) (median: 0.699) (median: 5) 3e-4 p = 3.92x10-6 4 p = 3.06x10-14 p = 0.49 0.06 3 Density Density 2e-4 Density 0.04 2 1e-4 1 0.02

0e+0 0 0.00 0 5 10 0.2 0.4 0.6 0.8 1.0 -10 0 10 20 30 40 50 Coding sequence length (kbp) GC3 Intron number

Swimming Chain D 8 Candidate HGTs E (median: 0.002) Other genes Thecate 1 0 Rosette 6 (median: -0.003) 1 p = 0.54 2 2

4 4 7

Density 1 0

662 2 0 4 0 16 0 0 -0.3 -0.2 -0.1 0.0 0.10.2 0.3 0.4 Codon Bias Index (CBI) bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Craspedida A Other eukaryotes B Clade 2 Acanthoecidae 517 0 0 Fungi 0 74 0 9 Filasterea 87 22 0 0

2 139 443 Choanoflagellata 2 0 36 9 63 3 Metazoa Craspdedida Clade 1 Stephanoecidae bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A B Glycosyl hydrolase te Helgoeca o Kiritimatiellales ry Kiritimatiellales ka u Chlorophyta e r 76 (15%) e h t Stramenopiles S.dolichothecata O 52 (10%) Proteobacteria

185 (26%) Kutzneria

Microstomoeca Micromonospora Haptophyta 53 (10%)

O EGD81375

p

i s Fungi Bacteroidetes t h 42 (8%) Pedobacter o 62 (12%) Bacteroides k Arcticibacter o Other FCB n Other prokaryote Chitinophagaceae t 34 (7%) 10 (2%) e Duganella t MassiliaPelomonas Acidobacteria PVC group o y 20 (4%) 43 (8%) r a k Other o Terrabacteria 33 (6%) r Pseudozobellia P Prolixibacteraceae 1.0 93 (18%) Niabella Formosa Flavobacteriaceae

Duncaniella

Echinicola Sunxiuqinia Choanoflagellate Proteobacteria Actinobacteria Fungi Fungi/opisthokont Bacteroidetes Other Terrabacteria Haptophyta Algae Other FCB Acidobacteria Stramenopiles Prokaryote PVC group Other prokaryote Chlorophyta Metazoa

ABC transporter Mannosyltransferase Cyanidioschyzon

C D Chitinophaga Flavobacterium Chryseobacterium

MicrostomoecaMonosiga

EGD83142 Sphingobacterium Monosiga

EGD79576 Gracilariopsis Chondrus Porphyra Helgoeca

Galdieria Amphimedon Nothophytophthora Lasallia Blyttiomyces Nematostella Pyronema Phytophthora Batrachochytrium Chrysochromulina Saitoella Nannochloropsis Globisporangium

Plasmodiophora Klebsormidium Syncephalastrum Micractinium Albugo Chlorella Muco 1.0 Parasitella

Gonapodya

Absidia Rhizopus Spizellomyces Naegleria

Capsaspora 1.0 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.28.176636; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

PFAM domain counts Enrichment p-value A 0 4 8 12 B 0.08 0.04 0.00 Dermatopontin; Calcium-binding EGF domain Short chain dehydrogenase Oxidation-reduction process Aminotransferase class I and II Peptidyl-aspartic acid hydroxylation Bacterial pre-peptidase C-terminal domain FAD dependent oxidoreductase Glutamine metabolic process Mitochondrial carrier protein Protein kinase domain Threonine biosynthetic process Sulfatase Ubiquitin-activating enzyme active site Lysine biosynthetic process via DAP ABC-2 type transporter; ABC transporter Alcohol dehydrogenase GroES-like domain; Zinc-binding Homophilic cell adhesion Alpha/beta hydrolase family Aspartyl/Asparaginyl beta-hydroxylase Receptor-mediated endocytosis IPT/TIG domain Nucleotide-excision repair Major Facilitator Superfamily Methyltransferase domain Carbohydrate metabolic process Adenylate and Guanylate cyclase catalytic domain Aminotransferase class-V Riboflavin biosynthetic process Beta-lactamase DEAD/DEAH box helicase; Helicase C-terminal Regulation of PI3Ks Flavin containing amine oxidoreductase NAD dependent epimerase/dehydratase family Tricarboxylic acid cycle Serine carboxypeptidase S28 Sugar (and other) transporter Methylation

Enzymes Transporters Other membrane proteins Amino acid & nucleotide Absorption/adhesion metabolism Other metabolic process Stress response

1094 C 1412 22 16 Unknown Translation 23 185 Other metabolism Nucleic acid metabolism 27 Carbohydrate metabolism Stress response Protein metabolism 31 Methyltransferase Oxidoreductase Repair 39 Extracellular matrix Signaling Lipid metabolism 62 41 Transporter Sulfur modification Amino acid metabolism DNA integration 44 67 DNA binding

45 51