
bioRxiv preprint doi: https://doi.org/10.1101/674895; this version posted July 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Phylogenomics provides new insights into gains and losses of selenoproteins among Archaeplastida

Hongping Liang1,2,3#, Tong Wei2,3,4#, Yan Xu1,2,3#, Linzhou Li2,3,6, Sunil Kumar Sahu2,3,4, Hongli Wang1,2,3, Haoyuan Li1,2, Xian Fu2,3, Gengyun Zhang2,4, Michael Melkonian7, Xin Liu2,3,4, Sibo Wang2,4,5*, Huan Liu2,4,5*

1 BGI Education Center, University of Chinese Academy of Sciences, Beijing, China.
2 BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China.
3 China National Gene Bank, Institute of New Agricultural Resources, BGI-Shenzhen, Jinsha Road, Shenzhen 518120, China.
4 State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen 518083, China.
5 Department of , University of Copenhagen, Copenhagen, Denmark.
6 School of Biology and Biological Engineering, South China University of Technology, 510006, China.
7 Botanical Institute, Cologne Biocenter, University of Cologne, Cologne D-50674, Germany.
# These authors contributed equally to this work.
* Correspondence: [email protected]

Abstract: Selenoproteins that contain selenocysteine (Sec) are found in all kingdoms of life. Although they constitute a small proportion of the proteome, selenoproteins play essential roles in many organisms. In photosynthetic eukaryotes, selenoproteins have been found in algae but are missing in land plants (embryophytes). In this study, we explored the evolutionary dynamics of Sec incorporation by conducting a genomic search for the Sec machinery and selenoproteins across Archaeplastida. We identified a complete Sec machinery and selenoproteomes of variable size in the main algal lineages. However, the entire Sec machinery was missing in the BF clade (Bangiophyceae-Florideophyceae) of Rhodoplantae (red algae), and only a partial machinery was found in three species of Archaeplastida, indicating parallel loss of Sec incorporation in different groups of algae. Further analysis of genome and transcriptome data suggests that all major lineages of streptophyte algae display a complete Sec machinery, although the number of selenoproteins is low in this group, especially in subaerial taxa. We conclude that selenoproteins tend to be lost in Archaeplastida upon adaptation to a subaerial or acidic environment. The high number of redox-active selenoproteins found in some bloom-forming marine microalgae may be related to defense against viral infections. Some of the selenoproteins in these organisms may have been gained by horizontal gene transfer from bacteria.

Keywords: Evolution; Horizontal gene transfer; Phylogenomics; Selenoproteins; Selenocysteine; Sec machinery

1. Introduction

Selenium (Se) is an essential trace element for human health; its deficiency leads to various diseases, such as Keshan and Kashin-Beck diseases, affects the immune system, and promotes cancer development [1,2]. An essential Se metabolism is present in many organisms, including bacteria, archaea, and eukaryotes [1,3,4]. However, higher concentrations of Se are toxic because Se acts as a pro-oxidant that affects the intracellular glutathione (GSH) pool, leading to enhanced accumulation of reactive oxygen species (ROS) [5,6]. Se is essential for growth and development of numerous algal species but not for terrestrial plants (embryophytes), although it accumulates in certain species, which can serve as dietary sources of Se [3,7-9].

Se is incorporated into nascent polypeptides in the form of selenocysteine (Sec), the 21st amino acid [10]. Se incorporation requires a specialized machinery and Sec insertion sequence (SECIS) elements present in selenoprotein mRNAs [11,12]. In eukaryotes, the process consists of Sec synthesis and Sec incorporation. Sec synthesis starts with tRNASec aminoacylated with serine, which is phosphorylated by O-phosphoseryl-tRNASec kinase (PSTK) and then converted by Sec synthase (SecS) into selenocysteinyl-tRNASec using selenophosphate [10-13]. The selenium donor, selenophosphate, is generated from selenide by selenophosphate synthetase 2 (SPS2), which is often a selenoprotein itself [14,15]. During Sec incorporation, SECIS-binding protein 2 (SBP2) recognizes the SECIS element in the 3’-untranslated region (3’-UTR) and recruits the Sec-specific elongation factor (eEFSec), which delivers selenocysteinyl-tRNASec to the ribosome at the in-frame Sec-coding UGA stop codon (Figure 2a). Bacteria possess a similar machinery, including selB (Sec-specific elongation factor), selC (tRNASec), and selD (selenophosphate synthetase), except that Sec synthesis is catalyzed by a single bacterial Sec synthase, SelA [10].

Although selenoproteins constitute only a small fraction of the proteome in any living organism, they play important roles in redox regulation, antioxidation, and thyroid hormone activation, including in humans [16]. Sec incorporation has been well documented in animals, bacteria, and archaea, while the largest selenoproteome was reported in algae. In the pelagophyte alga Aureococcus anophagefferens, 59 selenoproteins were identified in the genome, compared with 25 selenoproteins in humans [17,18]. The alga Chlamydomonas reinhardtii has at least ten selenoproteins, whereas the picoplanktonic marine green alga Ostreococcus lucimarinus harbors 20 selenoprotein genes in its genome [3,19]. Considering that Se is essential for growth in at least 33 algal species that belong to six phyla, Sec incorporation is thought to be universal in diverse algal lineages [20]. In a previous study, no selenoproteins were found in any land plants [7], suggesting a complete loss of Sec incorporation after streptophyte terrestrialization. Exploring the Sec machinery across the Archaeplastida, especially in algae, would provide insight into its evolutionary dynamics in this important lineage of photosynthetic eukaryotes. In this study, we searched 38 plant genomes, including 33 algal species that represent the major algal lineages, for the Sec machinery and selenoproteins.

2. Results


2.1. Sec machinery in algae

To cover the plant tree of life, we selected genomes of 33 algal species and 5 embryophyte species, with a focus on Archaeplastida, the major group of photosynthetic eukaryotes with primary plastids. The 33 algal species include one glaucophyte, six rhodophytes, 16 chlorophytes, and seven streptophyte algae. The remaining three species, the pelagophyte A. anophagefferens, the diatom Thalassiosira pseudonana, and the coccolithophorid Emiliania huxleyi, were included to represent other distinct algal lineages (Supplementary Table S1A).

We searched the 38 genome assemblies for the Sec machinery using Selenoprofiles [21] (see Methods). As shown in Figure 1, embryophytes lack the entire Sec machinery, as previously reported [7]. Interestingly, the Sec machinery is not intact in every algal species tested. Among the 33 algae, three Rhodoplantae lack the entire Sec machinery, as do embryophytes. The chlorophyte Monoraphidium neglectum and the rhodophyte Cyanidioschyzon merolae lack PSTK, and the glaucophyte Cyanophora paradoxa lacks SBP2. According to the species tree, the Sec machinery appears to have been lost completely in one rhodophyte clade that includes Porphyra umbilicalis, Pyropia yezoensis, and Chondrus, and partially in a few other algal species (Figure 1).
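The distinction between a complete, partial, and absent Sec machinery can be written as a simple rule over presence/absence calls. The following Python sketch is purely illustrative: the function names and the `calls` dictionary are hypothetical, the calls merely restate what the text reports for three species, and tRNASec is left out for simplicity.

```python
# Illustrative sketch only; names are hypothetical and not part of Selenoprofiles.
SEC_COMPONENTS = ("eEFSec", "PSTK", "SBP2", "SecS", "SPS")


def classify_machinery(present):
    """Return 'complete', 'partial', or 'absent' for a collection of detected components."""
    found = set(present) & set(SEC_COMPONENTS)
    if len(found) == len(SEC_COMPONENTS):
        return "complete"
    if not found:
        return "absent"
    return "partial"


def missing_components(present):
    """List the components that were not detected, in canonical order."""
    detected = set(present)
    return [c for c in SEC_COMPONENTS if c not in detected]


# Calls restating the text: C. merolae lacks PSTK, C. paradoxa lacks SBP2,
# and P. umbilicalis lacks the entire machinery.
calls = {
    "Cyanidioschyzon merolae": {"eEFSec", "SBP2", "SecS", "SPS"},
    "Cyanophora paradoxa": {"eEFSec", "PSTK", "SecS", "SPS"},
    "Porphyra umbilicalis": set(),
}

for species, present in calls.items():
    print(species, classify_machinery(present), missing_components(present))
```

A rule of this kind makes explicit that "partial" covers any genome in which at least one, but not all five, components were detected.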

Figure 1. Number and distribution of selenoproteins and of enzymes involved in the Sec machinery. The phylogenetic tree was retrieved from the NCBI taxonomy database and the 1KP Project (http://www.onekp.com). The left panel shows the presence (green symbols) or absence (empty symbols) of the enzymes of the Sec machinery (circles) and of tRNASec (triangles) across sequenced genomes of embryophytes, streptophyte algae, chlorophytes, Rhodoplantae, Glaucoplantae, and other algae. The distribution and number of selenoproteins are plotted in the yellow column in the second panel, and the predicted SECIS elements are represented by blue bars. The distribution and number of selenoprotein homologues (Cys) are plotted in the orange column in the right panel. Prasinophyte algae are highlighted in red.

2.2. Sec incorporation in the major algal lineages

We also identified the complete Sec machinery in some Rhodoplantae (Figure 1). The Rhodoplantae are often divided into two subphyla, Cyanidiophytina and Rhodophytina [22], the latter consisting of six classes that can be grouped into two lineages: SCRP (Stylonematophyceae, Compsopogonophyceae, Rhodellophyceae, Porphyridiophyceae) and BF (Bangiophyceae, Florideophyceae) [23,24]. The entire Sec machinery was absent in Porphyra, Pyropia, and Chondrus, which belong to the BF clade (Figure 1).

The three stramenopile and haptophyte algal species encoded the complete Sec machinery and generally also displayed more selenoproteins than most other species [3,5,17,19,25]. In the Chlorophyta, the picoplanktonic Mamiellophyceae stand out because they not only encode the complete Sec machinery but also contain a large number of selenoproteins (Figure 1). In the remaining Chlorophyta, comprising the three classes Trebouxiophyceae, Ulvophyceae, and Chlorophyceae (the TUC clade according to [26]), all sequenced genomes except that of M. neglectum encode the full Sec machinery and contain selenoproteins, although their number is considerably lower than in the Mamiellophyceae (Figure 1), supporting a previous report [5]. The number of selenoproteins among Chlorophyta is variable; very low numbers were encountered in Chlamydomonas eustigma and Coccomyxa subellipsoidea. The former was isolated from acid mine drainage with very high sulfate content (in this respect resembling the cyanidiophyte Galdieria sulphuraria, which also has only a few selenoproteins; Figure 1), whereas the latter occurs exclusively in subaerial habitats (damp rocks and stones [27]).

2.3. Variable number of selenoproteins identified in algae

The 38 plant genome assemblies were scanned for selenoproteins using Selenoprofiles (Supplementary Figure S2), and their SECIS elements were identified within the 6 kb downstream of the putative stop codons using SECISearch3 [21,28]. For some predicted selenoproteins, no SECIS element was predicted in the downstream region, especially in Mamiellophyceae (e.g., Bathycoccus prasinos), which may reflect lineage-specific characteristics or incomplete assembly [28]. The presence of selenoproteins in each assembled genome agrees with the intactness of the Sec machinery. In the rhodophyte clade that lacks the machinery and in the algae that lack one of its components, none of the known selenoproteins or SECIS elements were found in the genomes (except for C. paradoxa, in which the undetected SBP2 gene may be incompletely assembled or other proteins may replace the function of SBP2).
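Restricting the SECIS search to the 6 kb downstream of the putative stop codon requires a strand-aware extraction of that window. The Python sketch below shows one way this could be done; it is not the code used by Selenoprofiles or SECISearch3. The function names are hypothetical, and the sketch assumes 0-based, half-open gene coordinates on the forward strand that include the stop codon.

```python
# Illustrative sketch; coordinates are assumed to be 0-based and half-open,
# and the gene interval [gene_start, gene_end) is assumed to include the stop codon.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def downstream_window(contig, gene_start, gene_end, strand, window=6000):
    """Return up to `window` bases located 3' of the gene, read in the gene's orientation."""
    if strand == "+":
        return contig[gene_end:gene_end + window]
    if strand == "-":
        # For a minus-strand gene, the 3' flank lies at lower forward coordinates.
        flank = contig[max(0, gene_start - window):gene_start]
        return reverse_complement(flank)
    raise ValueError("strand must be '+' or '-'")


# Toy contigs: a plus-strand gene ATG...TAA and a minus-strand gene stored as TTACAT.
plus_contig = "GGGG" + "ATGTAA" + "CCCCCC"
minus_contig = "AACC" + "TTACAT" + "GG"
print(downstream_window(plus_contig, 4, 10, "+", window=4))
print(downstream_window(minus_contig, 4, 10, "-", window=4))
```

Near contig ends the returned window is simply shorter than requested, which is one way an incomplete assembly could cause a SECIS element to be missed.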

Table 1. Number of enzymes involved in the Sec incorporation machinery and number of selenoproteins. Enzymes of the Sec incorporation machinery and selenoproteins were detected by Selenoprofiles across the sequenced genomes and transcriptomes (from the 1KP project) of algae, liverworts, and a subset of lower embryophytes.

Total species: 513. Columns eEFSec to SPS: Sec machinery. Column Sec: selenoproteins. Columns Cys and Other: homologues.

Group (species)    Clade/Order                    Species eEFSec   PSTK   SBP2   SecS    SPS    Sec    Cys  Other
Vascular (175)     Conifers                            76      0      0      0      0      0      0   5256   2574
Vascular (175)     Lycophytes                          21      0      0      0      0      0      0   1473    678
Vascular (175)     Eusporangiate Monilophytes          10      0      0      0      0      0      0    640    280
Vascular (175)     Leptosporangiate Monilophytes       68      0      0      0      0      0      0   4999   2184
Non-Vascular (70)  Hornworts                            6      0      0      0      0      0      0    245    120
Non-Vascular (70)  Mosses                              39      0      1      5     17      0      1   3297   1485
Non-Vascular (70)  Liverworts                          25      1      0      5      0      1      2   2184   1044
Algae (268)        Zygnematophyceae                    40     51      7     27     34     25     15   2426   1298
Algae (268)        Coleochaetophyceae                   4      3      2      1      1      1      1    217    120
Algae (268)        Charophyceae                         2      1      2      1      1      2      4    103     65
Algae (268)        Klebsormidiophyceae                  5     10      6      4      5      4      5    295    131
Algae (268)        Mesostigmatophyceae                  4      4      2      4      4      1     12    199    124
Algae (268)        Chlorophyta                        137    174     50     83    112     75    222   7518   4489
Algae (268)        Glaucoplantae                        6      6      4      5      2      3      4    281    177
Algae (268)        Rhodoplantae                        35      6      4      4      9      5      5   1296    736
Algae (268)        Chromista (algae)                   35     45      1     24     32     20      4   2100   1127
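As a quick consistency check on Table 1, the per-clade species counts can be summed and compared with the group sizes (175, 70, and 268) and the overall total (513). The short Python sketch below does this; the variable names are hypothetical, and the numbers are copied from the species column of the table.

```python
# Species counts copied from the "Species" column of Table 1.
species_per_clade = {
    "Vascular": {
        "Conifers": 76,
        "Lycophytes": 21,
        "Eusporangiate Monilophytes": 10,
        "Leptosporangiate Monilophytes": 68,
    },
    "Non-Vascular": {"Hornworts": 6, "Mosses": 39, "Liverworts": 25},
    "Algae": {
        "Zygnematophyceae": 40,
        "Coleochaetophyceae": 4,
        "Charophyceae": 2,
        "Klebsormidiophyceae": 5,
        "Mesostigmatophyceae": 4,
        "Chlorophyta": 137,
        "Glaucoplantae": 6,
        "Rhodoplantae": 35,
        "Chromista (algae)": 35,
    },
}

# Sum clades within each group, then sum the groups.
group_totals = {group: sum(clades.values()) for group, clades in species_per_clade.items()}
grand_total = sum(group_totals.values())

print(group_totals)
print(grand_total)
```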

The Sec machinery is absent in embryophytes, including the liverwort Marchantia polymorpha and the moss Physcomitrella patens (Figure 1). The availability of genomes (or transcriptomes) of all major lineages of streptophyte algae, the phylogeny of which can now be regarded as basically resolved [29], allowed us to identify the likely step in streptophyte evolution at which the Sec machinery and selenoproteins were lost. As a first attempt to address this question, we searched the transcriptomic data from the 1KP project (http://www.onekp.com) for the presence of the Sec machinery and selenoproteins (268 algal species; 70 species of non-vascular plants, i.e., liverworts, mosses, and hornworts; and 175 species of vascular plants, i.e., monilophytes, lycophytes, and conifers). The number of enzymes of the Sec machinery and the number of selenoproteins were computed for each group (Supplementary Table S2). In hornworts, neither enzymes of the Sec machinery nor selenoproteins were detected. In liverworts and mosses, only a few selenoproteins were detected (2 and 1, respectively), and the few enzymes of the Sec machinery that were found were scattered among species (in no species were more than two of the five components of the Sec machinery detected; PSTK and SecS were absent in liverworts, and eEFSec and SPS were absent in mosses) (Supplementary Table S2). In vascular plants, the Sec machinery was absent from all transcriptomes and no selenoproteins were detected (Supplementary Table S2). In the sister group of embryophytes, the Zygnematophyceae, enzymes of the Sec machinery were more widely distributed than in embryophytes (Supplementary Table S2). However, in none of the Zygnematophyceae were all five genes of the Sec machinery found in the transcriptome (four of the five components were present in about one third of the 40 taxa). This may be a consequence of the fragmentary nature of transcriptomes (e.g., we could not detect a complete Sec machinery in the transcriptomes of “Spirotaenia sp.” and Mesotaenium endlicherianum, although the complete Sec machinery had been identified in both genomes; Figure 1). Furthermore, selenoproteins were identified in only 15 of the 40 Zygnematophyceae, and their number per species was low. Again, we did not detect selenoproteins in the transcriptomes of “Spirotaenia sp.” and M. endlicherianum, although a few genes encoding selenoproteins were identified in their genomes (note that the number of selenoproteins, as well as of components of the Sec machinery, is higher in “Spirotaenia sp.” because of its recent genome triplication; Cheng et al., unpublished observation). In the other clades of streptophyte algae (Coleochaetophyceae, Charophyceae, Klebsormidiophyceae, and Mesostigmatophyceae), the situation is similar to that in Zygnematophyceae: a complete Sec machinery is present, but the number of selenoproteins identified is low, especially in the subaerial taxa (two in Klebsormidium nitens and three in Chlorokybus atmophyticus), the only exception being the scaly green alga Mesostigma viride, with 9 identified selenoproteins (Table 1 and Supplementary Table S2).

2.3. Phylogenetic analysis of the enzymes involved in the Sec machinery

To further analyze the evolution of the Sec machinery, we conducted phylogenetic analyses of the five genes encoding enzymes of the Sec machinery using the available Archaeplastida genome data set. The phylogenetic trees of PSTK, SBP2, and SPS showed either insufficient phylogenetic signal, resulting in low support values for internal branches (PSTK), or very long branches in several taxa (SBP2, SPS), which led to spurious topologies due to long-branch attraction or indicated discordant gene histories (Supplementary Figure S3a, b, and c).

The phylogenies of eEFSec and SecS were largely congruent, with some support for internal branches (especially in eEFSec), and roughly corresponded to the known phylogenetic relationships among higher-order taxa, although relationships within some groups (e.g., streptophyte algae) remained unresolved (Figure 2b and 2c). The eEFSec phylogeny revealed four reasonably well-supported clades of sequences: clade I comprised three sequences of Rhodoplantae, clade II six sequences of picoplanktonic Mamiellophyceae, clade III seven sequences of streptophyte algae, and clade IV nine sequences from the TUC clade (three sequences of Trebouxiophyceae and six sequences of Chlorophyceae).


Figure 2. Phylogenetic analysis of enzymes involved in the Sec machinery. (a) Schematic of the selenoprotein biosynthesis pathway. (b-c) Maximum-likelihood trees of eEFSec and SecS, respectively. Support for internal branches was assessed using 500 bootstrap replicates; bootstrap values >50% are shown. (d) Distribution of selected selenoproteins across the Archaeplastida. The presence of a selenoprotein is indicated by a green check mark.

2.3.1. Phylogenetic analysis of eukaryotic SPS proteins

We built an SPS gene set comprising both prokaryotes and eukaryotes to reconstruct a global SPS phylogenetic tree (Figure 3). SPS split into three well-separated clades: Clade I, including a diverse range of bacteria, most of the Archaeplastida, and lineages with secondary plastids (e.g., stramenopiles and cryptophytes); Clade II, containing bacteria and four species of green algae (Chara braunii, Gonium pectorale, C. reinhardtii, and Volvox carteri); and Clade III, including archaea, a diverse range of protists (photosynthetic and non-photosynthetic), fungi, and three rhodophytes but no other Archaeplastida (Supplementary Table S3). SPS sequences of Clade I contain three domains: Pyr_redox_2, AIRS, and AIRS_C. In contrast, sequences of Clade II and Clade III contain only AIRS and AIRS_C. The SPSs from Clade II and Clade III differ in their domain arrangements (Supplementary Figure S3); as the phylogenetic tree suggests, horizontal gene transfer might have occurred in Clades I and II. The SPS of the three Volvocales (C. reinhardtii, G. pectorale, and V. carteri) in Clade II might have been acquired by horizontal gene transfer (HGT) from cyanobacteria, because these sequences form a monophylum (92% bootstrap support) with two terrestrial, filamentous cyanobacteria (Tolypothrix bouteillei and Scytonema hofmannii), which are themselves nested within a larger radiation of bacteria (Supplementary Figure S3). For C. braunii, we suspect that this gene derived from either (cyano)bacterial or volvocalean contamination.


Figure 3. Phylogenetic analysis of selenophosphate synthetase (SPS). (a) Reconstructed protein phylogeny of the reference set of SPS proteins. Red points denote potential horizontal gene transfer events in SPS clades I and II. (b) Alignment of SPS domains of the three SPS clades.

2.4. Distribution of types of selenoproteins among Archaeplastida

A comprehensive analysis of the distribution of selenoproteins revealed that picoplanktonic Mamiellophyceae possess an expanded set of selenoproteins, whereas some selenoproteins had a scattered distribution among other Archaeplastida (Figure 2d and Supplementary Figure S2). This may be related to the distinct types of eEFSec and SecS present in the Mamiellophyceae (Figure 2b and Figure 2c). Functional annotation of the selenoproteins in the genomes of the Mamiellophyceae showed that they are mainly involved in oxidative stress response and adaptation. For example, the MsrA selenoprotein is a key Sec-containing enzyme for the repair of oxidatively damaged peptides. However, MsrA_b, a bacterium-like MsrA selenoprotein, was identified only in the picoplanktonic Mamiellophyceae and in M. viride (Figure 2d), suggesting that early-diverging lineages of aquatic Viridiplantae might be subjected to stronger oxidative stress and that MsrA_b, but not MsrA (Supplementary Figure S2), is essential for these species to repair peptides. Another Sec-containing oxidoreductase (FrnE) is present in the Mamiellophyceae and in M. viride but not in any other sequenced Archaeplastida genome (Figure 2d). FrnE is a cadmium-inducible protein characterized as a disulfide isomerase with a role in oxidative stress tolerance. Its distribution therefore also supports the hypothesis that Mamiellophyceae and M. viride (or perhaps scaly green algae in general) need these enzymes to cope with stronger oxidative stress. In this context, it is interesting to note that in the bloom-forming pelagophyte alga A. anophagefferens, which has the second largest number of selenoproteins reported (50), a large number of redox-active selenoproteins were overexpressed upon infection by a giant virus of the Mimiviridae clade [30]. This suggests that viral infections, which are also prominent in the picoplanktonic Mamiellophyceae (prasinoviruses [31]) and have also been described in M. viride [32], may elicit similar responses in their hosts. Viral infections are unknown in the three Volvocales studied (C. reinhardtii, V. carteri, and G. pectorale); however, Volvocales are often subject to invasion by parasitic protists or fungi [33-35], and this could perhaps explain the presence of selenoproteins in these taxa.

3. Discussion

3.1 The distribution of the Sec machinery and selenoproteins in algae

It has been hypothesized that the Sec machinery and selenoproteins were lost in Viridiplantae upon transition from an aquatic to a terrestrial environment, perhaps because of the paucity of a suitable chemical species of selenium (i.e., selenite) in most terrestrial environments [7,9,20,36-38,40]. The results presented here support this notion and further suggest that the Sec machinery was lost in the common ancestor of embryophytes, as all extant embryophytes lack this machinery in their genomes (Figure 1). The few enzymes of this machinery that were detected in the transcriptomes of some liverworts and mosses (Supplementary Table S2) likely represent contamination. Interestingly, although the complete Sec machinery is still present in all classes of streptophyte algae, the number of selenoproteins detected in the subaerial species (C. atmophyticus, K. nitens, “Spirotaenia sp.”, M. endlicherianum) was low (1-3 proteins), whereas more selenoproteins (4-9 proteins) were found in the aquatic species (M. viride, C. braunii, C. scutata) (Figure 1). Very low numbers of selenoproteins (i.e., one protein) were also encountered in subaerial/acidophilic species of Chlorophyceae (C. eustigma, C. subellipsoidea) and in the subaerial/acidophilic Rhodoplantae (G. sulphuraria). These results corroborate the hypothesis that adaptation to subaerial/terrestrial or acidophilic habitats promotes the gradual loss of selenoproteins in diverse groups of algae. We suspect that once selenoproteins have been lost, selection for maintaining the Sec machinery is abolished. Intermediate stages in this process may be seen in the subaerial chlorophyte M. neglectum (now M. braunii) and in the acidophilic red alga C. merolae [36], which each lost one enzyme (PSTK or SBP2, respectively) of the Sec machinery. We hypothesize that once the Sec machinery is lost, a return of algae to aquatic (marine) habitats (as in most species of Rhodoplantae) will not lead to the reappearance of selenoproteins (some red algae exposed to strong oxidative stress, such as P. umbilicalis, have developed intimate associations with bacteria that express selenoproteins [40]). Similarly, transcriptomes of later-diverging Zygnematophyceae, which are predominantly aquatic in mostly acidic environments (bogs), also either lack selenoproteins or have only 1-2 selenoproteins (Supplementary Table S2). It will be interesting to learn, once their genome sequences become available, whether they possess a Sec machinery. Palenik et al. [19] proposed a trade-off between increased Se requirements and decreased nitrogen requirements for peptide synthesis in Ostreococcus spp., and it is worth noting that this genus encodes a surprisingly high number of selenocysteine-containing proteins relative to its genome size [19]. The core Chlorophyta showed a similar number of genes involved in nitrogen metabolism as the picoplanktonic Mamiellophyceae (Supplementary Table S4). In Trebouxiophyceae and Ulvophyceae (represented by Ulva mutabilis), fewer selenoproteins were identified than in the Mamiellophyceae. Functional annotation of the selenoproteins in Trebouxiophyceae and Ulvophyceae showed that they mainly participate in redox activities such as redox signaling (thioredoxin reductase, TR) and oxidative stress response (glutathione peroxidase, GPx) (Supplementary Figure S2). However, it is still unclear why Trebouxiophyceae and Ulvophyceae possess fewer selenoproteins; the former occur in freshwater or are often subaerial, whereas the latter are mostly multicellular and may not require the diversity of highly reactive selenoenzymes characteristic of picoeukaryotes.

3.2 Probable Horizontal Gene Transfer of SPS and some selenoproteins

SPS was detected in both prokaryotes and eukaryotes, although their sequence similarity is quite low (~30%; [4]). Our phylogenetic analyses resolved three clades of SPS genes with a mixed composition of prokaryotic and eukaryotic species, suggesting HGT among these unrelated organisms. For SPS Clade II, we provide evidence that a single HGT event occurred from terrestrial cyanobacteria into the common ancestor of C. reinhardtii, V. carteri, and G. pectorale. Several selenoproteins of the picoplanktonic Mamiellophyceae may also have originated in bacteria and been acquired (perhaps via viruses) through HGT. Selenoproteins are relatively common in bacteria: about 34% of sequenced bacteria utilize Sec, mostly different groups of proteobacteria [38,39]. Phylogenetic analyses of selenoproteomes in bacteria have identified rampant losses of selenoproteins but also occasional HGT events, even between domains (bacteria and archaea) [42,43]. It is tempting to speculate that these HGTs helped bloom-forming marine microalgae, which often lack cell walls and whose cells are covered only by mineralized or non-mineralized scales, to cope with viral invasions using their highly redox-reactive selenoproteins.

4. Conclusion

A phylogenomic analysis of the selenocysteine (Sec) machinery and selenoproteins in genomes and transcriptomes of diverse Archaeplastida provided evidence for complete or partial loss of the Sec machinery in several unrelated lineages, accompanied by loss of selenoproteins. In streptophytes, the Sec machinery and selenoproteins were apparently lost in the common ancestor of embryophytes, as the Sec machinery was present in all lineages of streptophyte algae but absent in embryophytes. The number of selenoproteins identified in algae correlated with the type of habitat: low numbers of selenoproteins were encountered in algae thriving in subaerial/terrestrial or acidic environments. The large number of selenoproteins found in some bloom-forming marine microalgae may be related to their function in defense against viral infections. Some components of the Sec machinery and some selenoproteins may have been acquired by algae through horizontal gene transfer from bacteria.

5. Materials and Methods

5.1 Data information.
A total of 38 genome sequences were used in this study, comprising 5 embryophytes, 7 streptophyte algae, 16 chlorophytes, 6 Rhodoplantae, 1 member of the Glaucoplantae and 3 photosynthetic protists (two stramenopiles and a haptophyte). The transcriptomes contained 121 green algae, 25 liverworts, 6 hornworts, 38 mosses, and 170 terrestrial plants (Supplementary Table S1A). Thirty-three whole-genome assemblies were downloaded from the NCBI genome database. In addition, 5 newly assembled streptophyte algal genomes were used: Mesotaenium endlicherianum (strain CCAC 1140), “Spirotaenia sp.” (strain CCAC 0220), Coleochaete scutata (strain SAG 110.80), Mesostigma viride (strain CCAC 1140) and Chlorokybus atmophyticus (strain CCAC 0220). The CCAC strains were obtained from the Culture Collection of Algae at the University of Cologne (http://www.ccac.uni-koeln.de/). All cultures were axenic, and during all steps of culture scale-up until nucleic acid extraction, axenicity was monitored by sterility tests as well as light microscopy. Total RNA was extracted from M. viride using the Tri Reagent Method, and from C. atmophyticus using the CTAB-PVP Method as described in Johnson et al. [44]. Total DNA was extracted using a modified CTAB protocol [45,46]. The phylogenetic backbone of algae was retrieved from the NCBI taxonomy database (https://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi). The completeness of genome assemblies was assessed by BUSCO 3.0.2 with gene database [47]. The results are listed in Supplementary Table S1B. We also counted the usage of stop codons for the single-copy genes. The results are shown in Supplementary Figure S1 (Supplementary Table S1B).
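The stop codon count mentioned above can be illustrated with a minimal sketch. It assumes that the coding sequences of the single-copy genes are already available as plain nucleotide strings; the function name `stop_codon_usage` and the rule of skipping sequences whose length is not a multiple of three are illustrative assumptions, not the authors' script. The count is of interest here because UGA (TGA) is the codon that is recoded for Sec.

```python
from collections import Counter

# The three standard stop codons in DNA notation.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_usage(cds_sequences):
    """Count the terminal stop codon of each coding sequence.

    Hypothetical helper: sequences whose length is not a multiple of
    three, or that do not end in a stop codon, are skipped as partial.
    """
    counts = Counter({codon: 0 for codon in STOP_CODONS})
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")  # accept RNA notation as well
        if len(seq) < 3 or len(seq) % 3 != 0:
            continue
        last = seq[-3:]
        if last in counts:
            counts[last] += 1
    return dict(counts)
```

Dividing each count by the total then gives the per-species stop codon proportions of the kind plotted in a usage histogram.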

5.2 Sec incorporation machinery.
The genome sequences were searched for the Sec incorporation machinery with the Selenoprofiles pipeline (version 3.0, http://big.crg.cat/services/selenoprofiles) using the parameter “-p machinery” [21,48]. First, we ran the pipeline with the profile-based Sec machinery. Then, to reduce errors caused by incomplete gene sequences, blastp version 2.6.0+ (e-value < 1 × 10−5) was used to compare the predicted genes against a dedicated algal database of Sec machinery. Transcriptome data were searched with the same approach: the nucleic acid sequences were first searched by Selenoprofiles and then subjected to blastp (e-value < 1 × 10−5) against the predicted algae-specific Sec machinery database.
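The e-value cutoff used in the blastp step can be sketched as a small filter. The sketch assumes that the blastp results were saved in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the text does not state which output format was used, and the function name `filter_blast_hits` is hypothetical.

```python
import csv
import io

# Cutoff stated in the text: e-value < 1e-5.
EVALUE_CUTOFF = 1e-5


def filter_blast_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Keep the best hit per query among hits with e-value below the cutoff.

    Assumes 12-column tab-separated BLAST output; the e-value is read
    from the 11th column. Returns {query: (subject, evalue)}.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Keeping only the best hit per query is a design choice of this sketch; retaining all hits below the cutoff would be an equally reasonable reading of the text.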

5.3 Identification of the selenocysteine tRNA (tRNASec).
Secmarker version 0.4 (http://secmarker.crg.es/index.html) was used to identify the dedicated tRNASec in the genome sequences [49]. The predicted secondary structure was drawn with the parameter “-plot”.

5.4 Prediction of selenoproteins and SECIS elements
Selenoproteins were identified from the genome assemblies by Selenoprofiles with the parameter “-p metazoa, protist, prokarya”. The candidates were filtered with the following cutoffs: e-value < 0.01 and AWSIc Z-score > -3. SECIS elements were searched in the 6-kb DNA sequences downstream of the predicted selenoprotein genes at the SECISearch3 website (http://seblastian.crg.es/; with the parameters “-output_three_prime, -output_secis”) [28].
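The two steps described in this subsection, candidate filtering and extraction of the 6-kb downstream window, can be sketched as follows. This is an illustration under stated assumptions: gene coordinates are taken to be 1-based and inclusive on the forward strand, and the function names `passes_filter` and `downstream_region` are hypothetical. The sketch does not reproduce Selenoprofiles or SECISearch3 themselves.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def passes_filter(evalue, awsic_z, max_evalue=0.01, min_z=-3.0):
    """Apply the cutoffs from the text: e-value < 0.01 and AWSIc Z-score > -3."""
    return evalue < max_evalue and awsic_z > min_z


def downstream_region(contig, gene_start, gene_end, strand, flank=6000):
    """Return up to `flank` bases located 3' of a gene.

    Assumption: gene_start and gene_end are 1-based inclusive forward-strand
    coordinates. For minus-strand genes the region lies before gene_start
    on the forward strand and is returned reverse-complemented.
    """
    if strand == "+":
        return contig[gene_end:gene_end + flank]
    if strand == "-":
        begin = max(0, gene_start - 1 - flank)
        return reverse_complement(contig[begin:gene_start - 1])
    raise ValueError("strand must be '+' or '-'")
```

The window is truncated at the contig boundary, so genes near the end of a scaffold yield shorter sequences for the SECIS search.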

5.5 Phylogenetic tree construction
For the phylogenetic analysis, each candidate was searched by Selenoprofiles and blastp version 2.6.0+ [44] to detect more candidates (e-value < 1 × 10−5). Multiple sequence alignments were performed with MAFFT version 7.310 [50,51]. For eEFSec, SecS, PSTK, and SBP2, a maximum-likelihood tree was constructed for each protein family using IQ-TREE with 500 bootstrap replicates [52]. The SPS maximum-likelihood tree was constructed using RAxML version 8.2.4 with the GTR+I+G model [53]. For the phylogeny of SPS (SelD), the bacterial sequences were downloaded from the NR database by querying it with every algal SPS sequence. All target bacterial sequences were retrieved but only several randomly chosen sequences in each bacterial were used for the SPS phylogenetic analyses. Representative archaeal and protist sequences were used in the analysis of SPS. In addition, the 9 recently reported fungi that utilize Sec were also added (192 sequences) [4,50].
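The random selection of a few bacterial sequences per group can be made reproducible with a fixed seed. The sketch below is illustrative only: the grouping key, the number of sequences kept per group and the seed are assumptions, since the text specifies none of them, and the function name `subsample_per_group` is hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_group(records, per_group=3, seed=0):
    """Pick up to `per_group` sequence ids at random from each group.

    `records` is an iterable of (sequence_id, group) pairs. Groups and ids
    are sorted before sampling so that a given seed always yields the
    same selection regardless of input order.
    """
    by_group = defaultdict(list)
    for seq_id, group in records:
        by_group[group].append(seq_id)
    rng = random.Random(seed)
    chosen = []
    for group in sorted(by_group):
        ids = sorted(by_group[group])
        k = min(per_group, len(ids))
        chosen.extend(rng.sample(ids, k))
    return chosen
```

Recording the seed alongside the selected identifiers would allow the reference set of a tree to be regenerated exactly.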

5.6 Identification of conserved motifs and domains.
Pfam 32.0 (http://pfam.xfam.org/) was used to identify the domains in the Sec incorporation machinery [55]. Additional motifs were identified by Multiple Em for Motif Elicitation 5.0.5 (MEME, http://meme-suite.org/). The alignment of the SPS domain was visualized by ESPript 3.0.

Supplementary Materials: The sequences of the selenoproteins that we identified from the green algae (Mesostigma viride, Chlorokybus atmophyticus, Klebsormidium nitens, Chara braunii, Coleochaete scutata, “Spirotaenia sp.”, Mesotaenium endlicherianum) are available in the CNGB Nucleotide Sequence Archive (CNSA: http://db.cngb.org/cnsa; accession number CNP0000452). Details of the other genes used in this study are available in Supplementary File S5.

Author Contributions: Data curation, Yan Xu and Linzhou Li; Formal analysis, Hongping Liang, Hongli Wang and Haoyuan Li; Funding acquisition, Xin Liu and Huan Liu; Investigation, Hongping Liang; Methodology, Linzhou Li, Gengyun Zhang and Sibo Wang; Project administration, Tong Wei and Huan Liu; Resources, ; Software, Yan Xu and Linzhou Li; Supervision, Sunil Kumar Sahu, Xin Liu, Sibo Wang and Huan Liu; Visualization, Hongping Liang; Writing – original draft, Sunil Kumar Sahu and Sibo Wang; Writing – review & editing, Sunil Kumar Sahu, Xian Fu and Michael Melkonian.

Funding: Financial support was provided by the National Key Research and Development Program of China (No. 2017YFB0403904) and the Shenzhen Municipal Government of China (Grant Nos. JCYJ20151015162041454 and JCYJ20160331150739027).

Acknowledgments: We thank Dr. Shifeng Cheng for kindly providing the gene sequences of “Spirotaenia sp.” and Mesotaenium endlicherianum.

Conflicts of Interest: The authors declare no competing interests.

Abbreviations

Sec     Selenocysteine
Se      Selenium
SECIS   Selenocysteine Insertion Sequence
PSTK    O-phosphoseryl-transfer tRNASec kinase
SecS    Sec synthase
SPS     selenophosphate synthetase 2
SBP2    SECIS-binding protein 2
eEFSec  Sec-specific elongation factor
CTAB    Cetyl trimethylammonium bromide

References

1. Rayman, M.P. Selenium and human health. The Lancet 2012, 379, 1256-1268, doi:10.1016/s0140-6736(11)61452-9.
2. Avery, J.C.; Hoffmann, P.R. Selenium, Selenoproteins, and Immunity. Nutrients 2018, 10, 1203, doi:10.3390/nu10091203.
3. Novoselov, S.V.; Rao, M.; Onoshko, N.V.; Zhi, H.; Kryukov, G.V.; Xiang, Y.; Weeks, D.P.; Hatfield, D.L.; Gladyshev, V.N. Selenoproteins and selenocysteine insertion system in the model system, Chlamydomonas reinhardtii. EMBO J 2002, 21, 3681-3693, doi:10.1093/emboj/cdf372.
4. Mariotti, M.; Salinas, G.; Gabaldon, T.; Gladyshev, V.N. Utilization of selenocysteine in early-branching fungal phyla. Nat Microbiol 2019, 4, 759-765, doi:10.1038/s41564-018-0354-9.
5. Araie, H.; Suzuki, I.; Shiraiwa, Y. Identification and characterization of a selenoprotein, thioredoxin reductase, in a unicellular marine haptophyte alga, Emiliania huxleyi. J Biol Chem 2008, 283, 35329-35336, doi:10.1074/jbc.M805472200.
6. Papp, L.V.; Holmgren, A.; Khanna, K.K. Selenium and selenoproteins in health and disease. Antioxid Redox Signal 2010, 12, 793-795, doi:10.1089/ars.2009.2973.
7. Lobanov, A.V.; Fomenko, D.E.; Zhang, Y.; Sengupta, A.; Hatfield, D.L.; Gladyshev, V.N. Evolutionary dynamics of eukaryotic selenoproteomes: large selenoproteomes may associate with aquatic life and small with terrestrial life. Genome Biol 2007, 8, 1-16, doi:10.1186/gb-2007-8-9-r198.
8. Bulteau, A.L.; Chavatte, L. Update on selenoprotein biosynthesis. Antioxid Redox Signal 2015, 23, 775-794, doi:10.1089/ars.2015.6391.
9. Schiavon, M.; Pilon-Smits, E.A. The fascinating facets of plant selenium accumulation - biochemistry, physiology, evolution and ecology. New Phytol 2017, 213, 1582-1596, doi:10.1111/nph.14378.
10. Böck, A.; Forchhammer, K.; Heider, J.; Barion, C. Selenoprotein synthesis: an expansion of the genetic code. Trends Biochem Sci 1991, 16, 463-467, doi:10.1016/0968-0004(91)90180-4.
11. Berry, M.J.; Banu, L.; Chen, Y.; Mandel, S.J.; Kieffer, J.D.; Harney, J.W.; Larsen, P.R. Recognition of UGA as a selenocysteine codon in Type I deiodinase requires sequences in the 3′ untranslated region. Nature 1991, 353, 273-276, doi:10.1038/353273a0.
12. Low, S.C.; Berry, M.J. Knowing when not to stop: selenocysteine incorporation in eukaryotes. Trends Biochem Sci 1996, 21, 203-208, doi:10.1016/S0968-0004(96)80016-8.


13. Carlson, B.A.; Xu, X.M.; Kryukov, G.V.; Rao, M.; Berry, M.J.; Gladyshev, V.N.; Hatfield, D.L. Identification and characterization of phosphoseryl-tRNA[Ser]Sec kinase. Proc Natl Acad Sci USA 2004, 101, 12848-12853, doi:10.1073/pnas.0402636101.
14. Xu, X.M.; Carlson, B.A.; Irons, R.; Mix, H.; Zhong, N.; Gladyshev, V.N.; Hatfield, D.L. Selenophosphate synthetase 2 is essential for selenoprotein biosynthesis. Biochem J 2007, 404, 115-120, doi:10.1042/BJ20070165.
15. Fletcher, J.E.; Copeland, P.R.; Driscoll, D.M.; Krol, A. The selenocysteine incorporation machinery: Interactions between the SECIS RNA and the SECIS-binding protein SBP2. RNA 2001, 7, 1442-1453, doi:10.1021/ma800238c.
16. Labunskyy, V.M.; Hatfield, D.L.; Gladyshev, V.N. Selenoproteins: molecular pathways and physiological roles. Physiol Rev 2014, 94, 739-777, doi:10.1152/physrev.00039.2013.
17. Gobler, C.J.; Lobanov, A.V.; Tang, Y.Z.; Turanov, A.A.; Zhang, Y.; Doblin, M.; Taylor, G.T.; Sanudo-Wilhelmy, S.A.; Grigoriev, I.V.; Gladyshev, V.N. The central role of selenium in the biochemistry and ecology of the harmful pelagophyte, Aureococcus anophagefferens. ISME J 2013, 7, 1333-1343, doi:10.1038/ismej.2013.25.
18. Kryukov, G.V.; Castellano, S.; Novoselov, S.V.; Lobanov, A.V.; Zehtab, O.; Guigo, R.; Gladyshev, V.N. Characterization of mammalian selenoproteomes. Science 2003, 300, 1439-1443, doi:10.1126/science.1083516.
19. Palenik, B.; Grimwood, J.; Aerts, A.; Salamov, A.; Putnam, N.H.; Dupont, C.L.; Jorgensen, R.A.; Rombauts, S.; Zhou, K.; Otillar, R. et al. The tiny eukaryote Ostreococcus provides genomic insights into the paradox of plankton speciation. Proc Natl Acad Sci USA 2007, 104, 7705-7710, doi:10.1073/pnas.0611046104.
20. Araie, H.; Shiraiwa, Y. Selenium utilization strategy by microalgae. Molecules 2009, 14, 4880-4891, doi:10.3390/molecules14124880.
21. Mariotti, M.; Guigo, R. Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes. Bioinformatics 2010, 26, 2656-2663, doi:10.1093/bioinformatics/btq516.
22. Yoon, H.S.; Nelson, W.; Linstrom, S.C.; Boo, S.M.; Pueschel, C.; Qiu, H.; Bhattacharya, D. Rhodophyta. In Archibald, J.M. et al. (eds.), Handbook of the Protists. Springer International Publishing 2017, 1-45, 367-406, doi:10.1007/978-3-319-28149-0.
23. Parte, S.; Sirisha, V.L.; D'Souza, J.S. Biotechnological applications of marine enzymes from algae, bacteria, fungi, and sponges. Adv Food Nutr Res 2017, 80, 75-106, doi:10.1016/bs.afnr.2016.10.005.
24. Qiu, H.; Yoon, H.S.; Bhattacharya, D. Red algal phylogenomics provides a robust framework for inferring evolution of key metabolic pathways. PLoS Curr 2016, 8, doi:10.1371/currents.tol.7b037376e6d84a1be34af756a4d90846.
25. Price, N.M.; Harrison, P.J. Specific selenium-containing macromolecules in the marine diatom Thalassiosira pseudonana. Plant Physiol 1988, 86, 192-199, doi:10.1104/pp.86.1.192.
26. Marin, B. Nested in the Chlorellales or independent ? Phylogeny and classification of the (Viridiplantae) revealed by molecular phylogenetic analyses of complete nuclear and -encoded rRNA operons. Protist 2012, 163, 778-805, doi:10.1016/j.protis.2011.11.004.
27. Acton, E. Coccomyxa subellipsoidea, a new member of the palmellaceae. Annals Bot 1909, 23, 573-577.
28. Mariotti, M.; Lobanov, A.V.; Guigo, R.; Gladyshev, V.N. SECISearch3 and Seblastian: new tools for prediction of SECIS elements and selenoproteins. Nucleic Acids Res 2013, 41, e149, doi:10.1093/nar/gkt550.
29. Wickett, N.J.; Mirarab, S.; Nguyen, N.; Warnow, T.; Carpenter, E.; Matasci, N.; Ayyampalayam, S.; Barker, M.S.; Burleigh, J.G.; Gitzendanner, M.A. et al. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Natl Acad Sci USA 2014, 111, E4859-4868, doi:10.1073/pnas.1323926111.
30. Moniruzzaman, M.; Gann, E.R.; Wilhelm, S.W. Infection by a giant virus (AaV) induces widespread physiological reprogramming in Aureococcus anophagefferens CCMP1984 - a harmful bloom algae. Front Microbiol 2018, 9, 752, doi:10.3389/fmicb.2018.00752.
31. Weynberg, K.D.; Allen, M.J.; Wilson, W.H. Marine prasinoviruses and their tiny plankton hosts: Viruses 2017, 9, doi:10.3390/v9030043.
32. Melkonian, M. Virus-like particles in the scaly green flagellate Mesostigma viride. Br Phycol J 1982, 17, 63-68, doi:10.1080/00071618200650081.
33. Surek, B.; Melkonian, M. The filose Vampyrellidium perforans nov. sp. (Vampyrellidae, Aconchulinida): Axenic culture, feeding behaviour and host range specificity. Arch Protistenkd 1980, 123, 166-191, doi:10.1016/S0003-9365(80)80003-0.
34. Hess, S. Hunting for agile prey: trophic specialisation in leptophryid amoebae (Vampyrellida, ) revealed by two novel predators of planktonic algae. FEMS Microbiol Ecol 2017, 93, fix104, doi:10.1093/femsec/fix104.


35. Seto, K.; Degawa, Y. Collimyces mutans gen. et sp. nov. (Rhizophydiales, Collimycetaceae fam. nov.), a new chytrid parasite of Microglena (Volvocales, clade Monadinia). Protist 2018, 169, 507-520, doi:10.1016/j.protis.2018.02.006.
36. Matsuzaki, M.; Misumi, O.; Shin-I, T.; Maruyama, S.; Takahara, M.; Miyagishima, S.Y. et al. Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae. Nature 2004, 428, 653-657, doi:10.1038/nature02398.
37. Schiavon, M.; Ertani, A.; Parrasia, S.; Vecchia, F.D. Selenium accumulation and metabolism in algae. Aquat Toxicol 2017, 189, 1-8, doi:10.1016/j.aquatox.2017.05.011.
38. Gojkovic, Ž.; Garbayo, I.; Ariza, J.L.G.; Márová, I.; Vílchez, C. Selenium bioaccumulation and toxicity in cultures of green microalgae. Algal Res 2015, 7, 106-116, doi:10.1016/j.algal.2014.12.008.
39. Maruyama, S.; Misumi, O.; Ishii, Y.; Asakawa, S.; Shimizu, A.; Sasaki, T.; Matsuzaki, M.; Shin-i, T.; Nozaki, H.; Kohara, Y. et al. The minimal eukaryotic ribosomal DNA units in the primitive red alga Cyanidioschyzon merolae. DNA Res 2004, 11, 83-91, doi:10.1093/dnares/11.2.83.
40. Kim, J.W.; Brawley, S.H.; Prochnik, S.; Chovatia, M.; Grimwood, J.; Jenkins, J.; LaButti, K.; Mavromatis, K.; Nolan, M.; Zane, M. et al. Genome analysis of Planctomycetes inhabiting blades of the red alga Porphyra umbilicalis. PLoS ONE 2016, 11, e0151883, doi:10.1371/journal.pone.0151883.
41. Raven, J.A.; Giordano, M. Algae. Curr Biol 2014, 24, R590-595, doi:10.1016/j.cub.2014.05.039.
42. Zhang, Y.; Romero, H.; Salinas, G.; Gladyshev, V.N. Dynamic evolution of selenocysteine utilization in bacteria: a balance between selenoprotein loss and evolution of selenocysteine from redox active cysteine residues. Genome Biol 2006, 7, R94, doi:10.1186/gb-2006-7-10-r94.
43. Peng, T.; Lin, J.; Xu, Y.Z.; Zhang, Y. Comparative genomics reveals new evolutionary and ecological patterns of selenium utilization in bacteria. ISME J 2016, 10, 2048-2059, doi:10.1038/ismej.2015.246.
44. Johnson, M.T.; Carpenter, E.J.; Tian, Z.; Bruskiewich, R.; Burris, J.N.; Carrigan, C.T.; Chase, M.W.; Clarke, N.D.; Covshoff, S.; Depamphilis, C.W. et al. Evaluating methods for isolating total RNA and predicting the success of sequencing phylogenetically diverse plant transcriptomes. PLoS ONE 2012, 7, e50226, doi:10.1371/journal.pone.0050226.
45. Rogers SO, B.A. Extraction of DNA from milligram amounts of fresh, and mummified plant. Plant Mol Biol 1985, 5, 59-76, doi:10.1007/bf00020088.
46. Sahu, S.K.; Thangaraj, M.; Kathiresan, K. DNA Extraction protocol for plants with high levels of secondary metabolites and polysaccharides without using liquid nitrogen and phenol. ISRN Mol Biol 2012, 2012, 205049, doi:10.5402/2012/205049.
47. Simao, F.A.; Waterhouse, R.M.; Ioannidis, P.; Kriventseva, E.V.; Zdobnov, E.M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 2015, 31, 3210-3212, doi:10.1093/bioinformatics/btv351.
48. Santesmasses, D.; Mariotti, M.; Guigo, R. Selenoprofiles: A computational pipeline for annotation of selenoproteins. Methods Mol Biol 2018, 1661, 17-28, doi:10.1007/978-1-4939-7258-6_2.
49. Santesmasses, D.; Mariotti, M.; Guigó, R. Computational identification of the selenocysteine tRNA (tRNASec) in genomes. PLOS Comput Biol 2017, 13, e1005383, doi:10.1371/journal.pcbi.1005383.
50. Katoh, K.; Misawa, K.; Kuma, K.; Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30, 3059-3066, doi:10.1093/nar/gkf436.
51. Nakamura, T.; Yamada, K.D.; Tomii, K.; Katoh, K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 2018, 34, 2490-2492, doi:10.1093/bioinformatics/bty121.
52. Nguyen, L.T.; Schmidt, H.A.; von Haeseler, A.; Minh, B.Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 2015, 32, 268-274, doi:10.1093/molbev/msu300.
53. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 2014, 30, 1312-1313, doi:10.1093/bioinformatics/btu033.
54. Hoang, D.T.; Chernomor, O.; von Haeseler, A.; Minh, B.Q.; Vinh, L.S. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol 2018, 35, 518-522, doi:10.1093/molbev/msx281.
55. El-Gebali, S.; Mistry, J.; Bateman, A.; Eddy, S.R.; Luciani, A.; Potter, S.C.; Qureshi, M.; Richardson, L.J.; Salazar, G.A.; Smart, A. et al. The Pfam protein families database in 2019. Nucleic Acids Res 2019, 47, D427-D432, doi:10.1093/nar/gky995.


Figure legends

Figure 1. The number and distribution of selenoproteins and of enzymes involved in the Sec machinery. The phylogenetic tree was retrieved from the NCBI taxonomy database and the 1KP Project (http://www.onekp.com). Presence (green symbols) or absence (empty symbols) of the enzymes involved in the Sec machinery (circles) and of tRNASec (triangles) across sequenced embryophyte, streptophyte algae, chlorophyte, Rhodoplantae, Glaucoplantae and protist genomes is shown in the left panel. The distribution and number of selenoproteins are plotted in the yellow column in the second panel, and the predicted SECIS elements are represented by the blue bars. The distribution and number of selenoprotein homologues (Cys) are plotted in the orange column in the right panel. Prasinophyte algae (Mamiellophyceae) are highlighted in red.

Figure 2. Phylogenetic analysis of enzymes involved in the Sec machinery. (a) Schematic of the selenoprotein biosynthesis pathway. (b-c) Maximum-likelihood trees of eEFSec and SecS, respectively. Bootstrap values >50% are shown. Support for internal branches was assessed using 500 bootstrap replicates. (d) Distribution of selected selenoproteins across the Archaeplastida. Presence of selenoproteins is indicated by green symbols.

Figure 3. Phylogenetic analysis of selenophosphate synthetase. (a) Reconstructed protein phylogeny of the reference set of SPS proteins. The red point denotes potential horizontal gene transfer events in SPS clades I and II. (b) Alignment of SPS domains of the three SPS clades.

Legend of Table

Table 1. Number of enzymes involved in the Sec incorporation machinery and selenoproteins. The numbers of enzymes of the Sec incorporation machinery and of selenoproteins were detected by Selenoprofiles across the sequenced genomes and transcriptomes of algae, liverworts, mosses, hornworts and a subset of lower embryophytes (from the 1KP project).

Supplementary Figures


Supplementary Figure S1: Completeness of genome assemblies and stop codon statistics. The genome quality was assessed by the BUSCO program. The numbers of single, multiple, missing and fragmented genes are shown in the middle histogram. The stop codon usage of genes is shown in the right histograms. The species that lost the Sec machinery are marked with an asterisk in the tree. Each group is shown with a different background color in the left column.

Supplementary Figure S2: Species tree with families of selenoproteins and homologues (Cys) identified by Selenoprofiles. Each column corresponds to a selenoprotein family, plotted in different colors to distinguish Sec-containing genes (green), Cys homologues (red), and others (grey). The 33 of 62 selenoprotein families that have at least one Sec-containing gene were selected for the analysis. The number in the left box represents the number of selenoproteins.

Supplementary Figure S3: Phylogenetic trees of PSTK, SBP2, and SPS. All predicted candidates were used to reconstruct maximum-likelihood trees with 500 bootstrap replicates. Colors: orange, Mamiellophyceae; yellow, core Chlorophyta; green, streptophyte algae; red, Rhodoplantae; grey, Glaucoplantae.

Supplementary Figure S4: The phylogenetic tree of SPS. The maximum-likelihood tree of the reference set of SPS proteins was reconstructed using the GTR+I+G model.

Supplementary Figure S5: The alignment of the SPS domain. The complete alignment of SPS domains in the three clades. Residue conservation is indicated with red and yellow coloring. The three clades are separated by spaces. From top to bottom, they are clade III, clade II and clade I.

Supplementary Table


Supplementary Table S1: Table S1A: The columns give the species class, species name, assembly identifier and BUSCO assessment of genome quality for the genomes analysed in this study. Table S1B: The distribution of stop codons in each species.

Supplementary Table S2: The number of enzymes of the Sec machinery and the number of selenoproteins were computed for each group.

Supplementary Table S3: The details of the species in the SPS phylogenetic tree: SuperKingdom, Phylum/Class, GI Number, Species Name and SPS Clade.

Supplementary Table S4: Genes important for the nitrogen metabolism pathway in Chlorophyta.


[Figure 1: image not reproduced. Panels, left to right: Sec machinery (SPS, tRNASec, eEFSec, PSTK, SBP2, SecS); number of selenoproteins; number of SECIS elements; number of homologues (Cys). Species are grouped as Embryophyte, Streptophyte Algae, Chlorophyte (including Mamiellophyceae), Rhodoplantae, Glaucoplantae and Protist.]

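Table 1. Column groups: 1KP group (Group, Clade/Order, Species number); Sec machinery (eEFSec, PSTK, SBP2, SecS, SPS); selenoproteins and homologues (Sec, Cys, Other).

| Group (513) | Clade/Order | Species number | eEFSec | PSTK | SBP2 | SecS | SPS | Sec | Cys | Other |
|---|---|---|---|---|---|---|---|---|---|---|
| Vascular (175) | Conifers | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 5256 | 2574 |
| Vascular (175) | Lycophytes | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 1473 | 678 |
| Vascular (175) | Eusporangiate Monilophytes | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 640 | 280 |
| Vascular (175) | Leptosporangiate Monilophytes | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 4999 | 2184 |
| Non-Vascular (70) | Hornworts | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 245 | 120 |
| Non-Vascular (70) | Mosses | 39 | 0 | 1 | 5 | 17 | 0 | 1 | 3297 | 1485 |
| Non-Vascular (70) | Liverworts | 25 | 1 | 0 | 5 | 0 | 1 | 2 | 2184 | 1044 |
| Algae (268) | Zygnematophyceae | 40 | 51 | 7 | 27 | 34 | 25 | 15 | 2426 | 1298 |
| Algae (268) | Coleochaetophyceae | 4 | 3 | 2 | 1 | 1 | 1 | 1 | 217 | 120 |
| Algae (268) | Charophyceae | 2 | 1 | 2 | 1 | 1 | 2 | 4 | 103 | 65 |
| Algae (268) | Klebsormidiophyceae | 5 | 10 | 6 | 4 | 5 | 4 | 5 | 295 | 131 |
| Algae (268) | Mesostigmatophyceae | 4 | 4 | 2 | 4 | 4 | 1 | 12 | 199 | 124 |
| Algae (268) | Chlorophyta | 137 | 174 | 50 | 83 | 112 | 75 | 222 | 7518 | 4489 |
| Algae (268) | Glaucophyta (algae) | 6 | 6 | 4 | 5 | 2 | 3 | 4 | 281 | 177 |
| Algae (268) | Rhodoplantae | 35 | 6 | 4 | 4 | 9 | 5 | 5 | 1296 | 736 |
| Algae (268) | Chromista (algae) | 35 | 45 | 1 | 24 | 32 | 20 | 4 | 2100 | 1127 |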


[Figure 2: image not reproduced. Panel labels: (a) Sec machinery; (b) SecS; (c) eEFSec; (d) Selenoproteins (SelH, SelM, FrnE, TR, selenou, msra_b, ars_s, EhSEP2, Ostsp2, Ostsp3, Sel15, MSP, SelK). Tree clades are labeled Clade I to Clade IV; the color legend lists Rhodoplantae, Mamiellophyceae, core Chlorophyta and Glaucoplantae.]

[Figure 3: image not reproduced. (a) SPS phylogeny with SPS clades I, II and III and their domain architectures (clade I: Pyr_redox_2, AIRS, AIRS_C; clades II and III: AIRS, AIRS_C); color legend: Viridiplantae, Rhodophyta, Protist, Fungi, Archaea, Bacteria; tree scale: 1. (b) Alignment of SPS domains (alignment positions 230 to 280) for representative sequences of SPS clades I, II and III.]