bioRxiv preprint doi: https://doi.org/10.1101/667535; this version posted June 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

1 Reconstruction of the carbohydrate 6-O sulfotransferase family evolution

2 in vertebrates reveals novel member, CHST16, lost in amniotes

3

4

5 Daniel Ocampo Daza1,2 and Tatjana Haitina1*

6

7 1 Department of Organismal Biology, Uppsala University, Norbyvägen 18 A, 752 36

8 Uppsala, Sweden.

9 2 School of Natural Sciences, University of California Merced, 5200 North Lake Road,

10 Merced CA 95343, United States.

11 *Author for correspondence: Tatjana Haitina, Department of Organismal Biology, Uppsala

12 University, Norbyvägen 18 A, 752 36 Uppsala, Sweden, Tel.: +46 18 471 6120; E-mail:

13 [email protected]

14

15 Keywords: carbohydrate 6-O sulfotransferases, Gal/GalNAc/GlcNAc 6-O sulfotransferases,

16 whole genome duplication,

17

18 Running title: Evolution of carbohydrate 6-O sulfotransferases in vertebrates


19 Abstract

20 Glycosaminoglycans are sulfated polysaccharide molecules, essential for many biological

21 processes. The 6-O sulfation of glycosaminoglycans is carried out by carbohydrate 6-O

22 sulfotransferases (C6OST), previously named Gal/GalNAc/GlcNAc 6-O sulfotransferases.

23 Here, for the first time, we present a detailed phylogenetic reconstruction and an analysis of gene

24 synteny conservation, and propose an evolutionary scenario for the C6OST family in major

25 vertebrate groups, including mammals, birds, non-avian reptiles, amphibians, lobe-finned

26 fishes, ray-finned fishes, cartilaginous fishes and jawless vertebrates. The C6OST gene

27 expansion likely occurred in the chordate lineage, after the divergence of tunicates and before

28 the emergence of extant vertebrate lineages.

29 The two rounds of whole genome duplication in early vertebrate evolution (1R/2R)

30 only contributed two additional C6OST subtype genes, increasing the vertebrate repertoire

31 from four genes to six, divided into two branches. The first branch includes CHST1 and

32 CHST3 as well as a previously unrecognized subtype, CHST16, that was lost in amniotes.

33 The second branch includes CHST2, CHST7 and CHST5. Subsequently, local duplications of

34 CHST5 gave rise to CHST4 in the ancestor of tetrapods, and to CHST6 in the ancestor of

35 primates.

36 Teleost-specific gene duplicates were identified for CHST1, CHST2 and CHST3

37 and are the result of the whole genome duplication (3R) in the teleost lineage. We could also detect

38 multiple, more recent lineage-specific duplicates. Thus, the repertoire of C6OST genes in

39 vertebrates has been shaped by different events at several stages during vertebrate

40 evolution, with implications for the evolution of the skeleton, nervous system and cell-cell

41 interactions.

42


43 Introduction

44 Glycosaminoglycans (GAG) are sulfated linear polysaccharide molecules comprised of

45 repeating disaccharides, like chondroitin sulfate (CS), dermatan sulfate (DS), keratan sulfate

46 (KS) and heparan sulfate (HS). CS and DS are composed of N-acetylgalactosamine

47 (GalNAc) linked to glucuronic acid or iduronic acid, respectively. HS and KS are composed

48 of N-acetylglucosamine (GlcNAc) linked to glucuronic acid or galactose, respectively.

49 Sulfated GAGs are found in both vertebrates and invertebrates and are important for many

50 biological processes like cell adhesion, signal transduction and immune response (Yamada et

51 al. 2011; Soares da Costa et al. 2017). The polymerization of long, linear GAG chains onto

52 core proteins takes place in the Golgi apparatus and results in the formation of proteoglycans

53 that are essential components of the extracellular matrix (Kjellén & Lindahl 1991).

54 [Location of Table 1]

55 Sulfation is a complex posttranslational modification process that is common for

56 GAGs and is important for their activity (Soares da Costa et al. 2017). The 6-O sulfation of

57 CS and KS is carried out by enzymes of the carbohydrate 6-O sulfotransferase (C6OST)

58 family (Kusche-Gullberg & Kjellén 2003). The nomenclature of the reported members of this

59 family is summarized in table 1. Hereafter we will use the abbreviation C6OST to refer to the

60 family in general and the symbols CHST1 through CHST7 to refer to the individual

61 genes. The 6-O sulfation of HS is carried out by enzymes of another protein family, the

62 heparan sulfate 6-O sulfotransferases (HS6ST1-HS6ST3) (Nagai & Kimata 2014) that will

63 not be discussed here (Kusche-Gullberg & Kjellén 2003). The sulfation of GAGs is carried

64 out by using 3'-phosphoadenosine 5'-phosphosulfate (PAPS) as a sulfonate donor. PAPS

65 binds to C6OSTs at specific sequence motifs: the 5'-phosphosulfate binding (5'PSB) motif

66 RxGSSF (Habuchi et al. 2003) and the 3'-phosphate binding (3'PB) motif RDPRxxxxSR

67 (Tetsukawa et al. 2010). As an example, in the human chondroitin 6-sulfotransferase 1


68 protein sequence (encoded by CHST3), the 5’PSB motif corresponds to positions 142-147

69 (RTGSSF), and the 3’PB motif to positions 301-310 (RDPRAVLASR). Chondroitin 6-O

70 sulfotransferase 1 is one of the most widely studied members of the C6OST family, showing

71 activity towards both CS and KS (see references in Yusa et al. 2006). Another enzyme,

72 chondroitin 6-O sulfotransferase 2 (encoded by CHST7) also shows activity towards CS

73 (Kitagawa et al. 2000). The expression of CHST3 and CHST7 has been reported in a number

74 of tissues, including chondrocytes, immune system organs, as well as the central and

75 peripheral nervous system (see Habuchi (2014) and references therein). Mutations in the

76 CHST3 gene are associated with congenital spondyloepiphyseal dysplasia, congenital joint

77 dislocations and hearing loss in kindred families (Srivastava et al. 2017; Waryah et al. 2016),

78 whereas overexpression of CHST3 lowers the abundance of the proteoglycan aggrecan in the

79 extracellular matrix of the aged brain, which is associated with the loss of neural plasticity

80 (Miyata & Kitagawa 2016). The orthologues of CHST3 and CHST7 have been characterized

81 in zebrafish, where chst3a, chst3b and chst7 are expressed in pharyngeal cartilages, the

82 notochord and several brain regions during development (Habicher et al. 2015).

83 Carbohydrate Sulfotransferase 1 (encoded by CHST1), also called Keratan Sulfate

84 Gal-6 Sulfotransferase 1 (KSGal6ST), is responsible for the sulfation of KS. CHST1

85 expression has been described in the developing mouse brain (Hoshino et al. 2014) and in

86 human endothelial cells (Li & Tedder 1999). Another sulfotransferase showing activity

87 towards keratan sulfate is the corneal N-acetylglucosamine-6-O-sulfotransferase (C-

88 GlcNAc6ST, encoded by CHST6). In humans, the closely related CHST5 and CHST6 genes

89 are located approximately 40 kb apart on chromosome 16, whereas in the mouse genome

90 there is only one gene in the corresponding chromosomal region on chromosome 8, called

91 CHST5. While human CHST5 expression seems restricted to the small intestine and colon

92 (Lee et al. 1999), mutations of CHST6 in humans cause a rare autosomal recessive macular


93 corneal dystrophy (Akama et al. 2000; Carstens et al. 2016; Rubinstein et al. 2016). In mice,

94 a similar condition in the form of CS/DS aggregates in the cornea is caused by the disruption

95 of the CHST5 gene (Parfitt et al. 2011). In addition, the similarity in enzymatic activity

96 between the human CHST6 and mouse CHST5 gene products has been used to suggest that

97 these genes are orthologous (Akama et al. 2001, 2002).

98 N-acetylglucosamine-6-O-sulfotransferase 1 (GlcNAc6ST1, encoded by CHST2) is

99 expressed in brain tissues and several internal organs of adult mice (Fukuta et al. 1998). It has

100 also been reported in endothelial tissues of both humans and mice (Li & Tedder 1999). In the

101 mouse model, CHST2 is critical for neuronal plasticity in the developing visual cortex

102 (Takeda-Uchimura et al. 2015), and CHST2 deficiency leads to increased levels of amyloid-β

103 phagocytosis thus modulating Alzheimer's pathology (Zhang et al. 2017). N-

104 acetylglucosamine-6-O-sulfotransferase 2 (GlcNAc6ST2, encoded by CHST4) is expressed

105 exclusively in the high endothelial venules of the lymph nodes (Bistrup et al. 2004) and its

106 disruption in mice affects lymphocyte trafficking (Hemmerich, Bistrup et al. 2001). CHST4 is

107 also expressed in early-stage uterine cervical and corpus cancers (Seko et al. 2009).

108 Functional studies of GAGs have expanded greatly during the last twenty years. The

109 roles of GAG modifying enzymes, including C6OSTs, in health and disease have been

110 studied extensively in human and mouse, as well as to some extent in chicken (Fukuta et al.

111 1995; Yamamoto et al. 2001; Nogami et al. 2004; Kobayashi et al. 2010). However, little is

112 known about which C6OST genes can be found outside mammalian

113 vertebrates, or about the phylogenetic relationships between them. Outside mammals and

114 chicken, studies are limited to just a few species, including zebrafish

115 (Danio rerio) (Habicher et al. 2015), as well as vase tunicate (Ciona intestinalis) (Tetsukawa

116 et al. 2010), and pearl oyster (Pinctada fucata martensii) (Du et al. 2017).


117 Here for the first time we describe the evolution of the C6OST family of

118 sulfotransferases based on detailed phylogenetic and chromosomal location analyses in major

119 vertebrate groups, and propose a duplication scenario for the expansion of the C6OST family

120 during vertebrate evolution. We also report a previously unrecognized KSGal6ST-like

121 subtype of C6OST, encoded by a gene we have called CHST16, that is lost in amniotes.

122

123 Materials and Methods

124 Identification of C6OST gene sequences

125 C6OST amino acid sequences corresponding to the CHST1-7 genes were sought primarily in

126 genome assemblies hosted by the National Center for Biotechnology Information (NCBI)

127 Assembly database (www.ncbi.nlm.nih.gov/assembly) (Kitts et al. 2016; Canese et al. 2017)

128 and the Vertebrate Genomes Project (VGP, vertebrategenomesproject.org) (The G10K

129 consortium, manuscript in preparation). For some species, the Ensembl genome browser

130 (www.ensembl.org) (Perry et al. 2018) was used. C6OST sequences were also sought in

131 transcriptome assemblies hosted by the PhyloFish Portal (phylofish.sigenae.org) (Pasquier et

132 al. 2016). Independent sources for an Eastern Newt reference transcriptome (Abdullayev et

133 al. 2013) and the gulf pipefish genome assembly (Small et al. 2016) were also used. All the

134 investigated species, with their binomial nomenclature, and genome/transcriptome assemblies

135 are listed in supplementary table S1 together with links to their sources. In total, 158 species

136 were investigated, representing the major vertebrate groups. They include 18 mammalian

137 species; 33 avian species in 16 orders; 14 non-avian reptile species; six amphibian species;

138 the basal lobe-finned coelacanth; the holostean fishes spotted gar and bowfin; 67 teleost

139 fish species in 35 orders, including the important model species zebrafish; seven cartilaginous

140 fish species; as well as three jawless vertebrate species, the sea lamprey, Arctic lamprey (also

141 known as Japanese lamprey), and inshore hagfish. C6OST sequences from invertebrate


142 species were also sought. Notably, the vase tunicate was used to provide a relative dating

143 point with respect to the early vertebrate whole genome duplications (1R/2R), and the fruit

144 fly was used as an outgroup.

145 The C6OST sequences were identified by searching for NCBI Gene models

146 (Maglott et al. 2011; Brown et al. 2014) or Ensembl gene predictions (Curwen et al. 2004)

147 annotated as CHST1-7, followed by extensive tblastn searches (Altschul et al. 1990) to

148 identify sequences with no corresponding gene annotations or for species where gene

149 annotations were not available. For invertebrate species, sequences were also sought using

150 the profile-Hidden Markov Model search tool HMMER (hmmer.org) (Finn et al. 2011),

151 aimed at reference proteomes. In most cases (excluding transcriptome data), corresponding

152 gene models could be found and their database IDs were recorded. In cases where no gene

153 models could be identified, or where they included errors (Prosdocimi et al. 2012), C6OST

154 sequences were predicted/corrected by manual inspection of their corresponding genomic

155 regions, including flanking regions and introns. Exons were curated with respect to consensus

156 sequences for splice donor/acceptor sites (Rogozin et al. 2012; Abril et al. 2005) and start of

157 translation (Nakagawa et al. 2008), as well as to other C6OST family

158 members.

159 All identified amino acid sequences were collected and the corresponding genomic

160 locations (if available) were recorded. All collected sequences were verified against the Pfam

161 database of protein families (pfam.xfam.org) (El-Gebali et al. 2019) to ensure they contained

162 a sulfotransferase type 1 domain (Pfam ID: PF00685), and inspected manually to verify the

163 characteristic 5'-phosphosulfate binding (5'PSB) and 3'-phosphate binding (3'PB) motifs. All

164 genomic locations and sequence identifiers used in this study have been verified against the

165 latest genome assembly and database versions, including NCBI’s Reference Sequence

166 Database (RefSeq 92, 8 January 2019) and Ensembl (version 95, January 2019).
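The motif check described above was done by manual inspection. As a rough illustration only, the two consensus motifs given in the Introduction (RxGSSF and RDPRxxxxSR, where x is any residue) can be expressed as regular expressions. The function name and the toy sequence below are hypothetical and are not part of the authors' workflow; the toy sequence is filler around the two human CHST3 motif instances quoted in the Introduction, not a real protein.

```python
import re

# Consensus motifs as quoted in the text; "x" (any residue) becomes "." in a regex.
PSB_5 = re.compile(r"R.GSSF")      # 5'-phosphosulfate binding motif, RxGSSF
PB_3 = re.compile(r"RDPR.{4}SR")   # 3'-phosphate binding motif, RDPRxxxxSR


def find_paps_motifs(seq):
    """Return the first hit per motif as (start, end, match), 1-based inclusive,
    or None when the motif is absent. Hypothetical helper for illustration."""
    seq = seq.upper()
    hits = {}
    for name, pattern in (("5'PSB", PSB_5), ("3'PB", PB_3)):
        m = pattern.search(seq)
        hits[name] = (m.start() + 1, m.end(), m.group()) if m else None
    return hits


# Toy sequence (made up): filler residues around the two quoted motif instances.
toy = "MKA" + "RTGSSF" + "A" * 10 + "RDPRAVLASR" + "LLK"
toy_hits = find_paps_motifs(toy)
print(toy_hits)
```

A sequence that returns None for either motif would be a candidate for closer manual inspection rather than automatic exclusion, since the consensus patterns are strict and divergent family members may deviate from them.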


167

168 Sequence alignment and phylogenetic analyses

169 Sequence alignments were constructed with the MUSCLE alignment algorithm (Edgar 2004)

170 applied through AliView 1.25 (Larsson 2014). Alignments were curated manually in order to

171 adjust poorly aligned stretches with respect to conserved motifs and exon boundaries, as well

172 as to identify faulty or incomplete sequences. Phylogenies were constructed using IQ-TREE

173 v1.6.3 which applies a stochastic maximum likelihood algorithm (Nguyen et al. 2015). The

174 best-fit amino acid substitution model and substitution parameters were selected using IQ-

175 TREE’s model finder with the -m TEST option (Kalyaanamoorthy et al. 2017). The

176 proportion of invariant sites was optimized using the --opt-gamma-inv option. Branch

177 supports were calculated using IQ-TREE’s non-parametric ultrafast bootstrap (UFBoot)

178 method (Minh et al. 2013) with 1000 replicates, as well as the approximate likelihood-ratio

179 test (aLRT) with SH-like supports (Anisimova & Gascuel 2006; Anisimova et al. 2011) over

180 1000 iterations.
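The settings described above can be collected into a single IQ-TREE invocation. The sketch below only assembles and prints such a command; it does not run it. The `-m TEST` and `--opt-gamma-inv` options are taken from the text, while `-s`, `-bb` and `-alrt` are our reading of the IQ-TREE 1.6 command-line options for the input alignment, UFBoot replicates and SH-like aLRT replicates. The executable name and the alignment file name are assumptions, and this is not the authors' actual command line.

```python
import shlex


def iqtree_command(alignment, replicates=1000):
    """Build (but do not run) an IQ-TREE argument list for the described analysis.
    Hypothetical helper; flag spellings follow IQ-TREE 1.6 as we understand it."""
    return [
        "iqtree",                  # executable name is an assumption
        "-s", alignment,           # input alignment file
        "-m", "TEST",              # model selection, as stated in the text
        "--opt-gamma-inv",         # option named in the text
        "-bb", str(replicates),    # ultrafast bootstrap (UFBoot) replicates
        "-alrt", str(replicates),  # SH-like aLRT replicates
    ]


cmd = iqtree_command("c6ost_alignment.fasta")  # file name is a placeholder
print(shlex.join(cmd))
```

Building the argument list separately from execution makes it easy to log the exact settings next to each tree, which helps when several subtype-specific phylogenies are produced with the same parameters.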

181

182 Conserved synteny analyses

183 The two whole genome duplications that occurred at the base of vertebrate evolution (1R and

184 2R) resulted in a large number of quartets of related chromosome regions. Each such quartet

185 is called a paralogon or a paralogy group, and related chromosome regions are said to be

186 paralogous. To investigate whether any of the vertebrate C6OST genes arose in the 1R/2R

187 whole genome duplication events, and duplicated further in the teleost whole genome

188 duplication 3R, we attempted to detect known paralogous chromosome regions around

189 C6OST genes. This was done through the detection of conserved synteny, the conservation of

190 gene family co-localizations across different chromosome regions that reflects the

191 duplications of not only individual genes but larger chromosome segments and by extension


192 the whole genome. We carried out a conserved synteny analysis around the C6OST gene-

193 bearing chromosome regions in the human, Carolina anole lizard, spotted gar and zebrafish

194 genomes. The anole lizard was chosen because of the presence of a “CHST4/5-like” gene in

195 this species. The spotted gar was chosen because of the presence of CHST16, which is

196 missing from amniotes, because it shows a moderate degree of genome rearrangement

197 compared with teleost fish genomes, and because its taxonomic position allows it to bridge

198 the gap between ray-finned fishes and lobe-finned fishes (including tetrapods) (Amores et al.

199 2011; Braasch et al. 2016). The zebrafish genome was chosen to investigate the involvement

200 of 3R. Lists of gene predictions 2.5 Mb upstream and downstream of the C6OST genes in

201 these species were downloaded using the BioMart function in Ensembl version 83. These

202 lists were sorted according to Ensembl protein family predictions (Ensembl 83 is the last

203 version of the database to use these family predictions) in order to identify gene families with

204 members on at least two of the C6OST gene-bearing chromosome blocks within each

205 species. Sequence identifiers and genomic locations for each member gene in these gene

206 families were collected for the following species: human, chicken, Western clawed frog,

207 spotted gar, zebrafish, medaka, and elephant shark to identify regions that show ancestral

208 paralogy relationships. All locations and sequence identifiers were verified against the latest

209 versions of the genome assemblies in the NCBI Assembly database.
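The family-selection step described above (gene families with members within 2.5 Mb of at least two C6OST gene-bearing regions) can be sketched as a simple window filter. This is a minimal illustration under stated assumptions, not the authors' BioMart procedure: the function name, the input layout, and all family names and coordinates in the toy data are made up.

```python
from collections import defaultdict

WINDOW = 2_500_000  # 2.5 Mb upstream and downstream, as stated in the text


def shared_families(anchors, genes, window=WINDOW):
    """Return families with members near at least two different anchor genes.

    anchors: {anchor_gene: (chromosome, position)}  (the C6OST genes)
    genes:   iterable of (family, chromosome, position) for neighbouring genes
    Hypothetical helper for illustration only.
    """
    blocks_by_family = defaultdict(set)
    for family, chrom, pos in genes:
        for anchor, (a_chrom, a_pos) in anchors.items():
            if chrom == a_chrom and abs(pos - a_pos) <= window:
                blocks_by_family[family].add(anchor)
    return {
        family: sorted(blocks)
        for family, blocks in blocks_by_family.items()
        if len(blocks) >= 2
    }


# Toy data with made-up coordinates and family names.
toy_anchors = {"CHST1": ("chrA", 10_000_000), "CHST16": ("chrB", 5_000_000)}
toy_genes = [
    ("famA", "chrA", 9_000_000),   # near CHST1
    ("famA", "chrB", 6_500_000),   # near CHST16
    ("famB", "chrA", 11_000_000),  # near CHST1
    ("famB", "chrB", 9_000_000),   # 4 Mb from CHST16, outside the window
    ("famC", "chrA", 8_000_000),   # near CHST1 only
]
toy_shared = shared_families(toy_anchors, toy_genes)
print(toy_shared)
```

One simplification to keep in mind: if two anchor genes lie close together on the same chromosome, as with tandem duplicates, their windows overlap and a single neighbouring gene would count towards both blocks. A fuller implementation would merge such anchors into one block before counting.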

210 Only one CHST1 gene could be identified in the zebrafish (see results below).

211 Therefore, to investigate the conserved synteny between duplicated CHST1 genes in teleost

212 fishes, the regions of the channel catfish CHST1 genes on chromosomes 4 and 8 were also

213 analysed. Gene models 1 Mb to each side of the channel catfish CHST1 genes were identified

214 in the NCBI Genome Data Viewer and used to identify the homologous chromosome regions

215 in the zebrafish genome. These regions were then used to identify gene families with


216 members on both chromosome regions through Ensembl’s BioMart, the same way as

217 described above.

218

219 Results

220

221 [Location of figure 1]

222

223 Phylogeny of the C6OST family

224 C6OST amino acid sequences were collected from a large diversity of vertebrate genomes

225 and aligned in order to produce phylogenies of the C6OST gene family across vertebrates.

226 The first phylogeny of the C6OST family is summarized in figure 1 and includes the

227 vase tunicate along with a smaller subset of representative vertebrates. This phylogeny is

228 rooted with two putative family members from the fruit fly, CG31637 and CG9550. These

229 sequences share the sulfotransferase domain (Pfam ID: PF00685) and recognizable 5’PSB

230 and 3’PB motifs with the vertebrate C6OST sequences. In addition to this phylogeny, we

231 produced detailed phylogenies for each C6OST subtype branch including all investigated

232 vertebrate species. These have been included as supplementary figures S1-S6. This was done

233 in order to classify all identified sequences into the correct C6OST subtype, and to improve the

234 resolution of some phylogenetic relationships that were unclear in the smaller phylogeny.

235 Our phylogeny (figure 1) supports the subdivision of the C6OST family into two main

236 branches. The first branch contains three well-supported subtype clades, including CHST1

237 and CHST3 as well as a previously unrecognized CHST1-like subtype of genes we have

238 named CHST16. The second C6OST branch in vertebrates contains well-supported CHST2,

239 CHST4, CHST5 and CHST7 clades, as well as several smaller clades of CHST4/5-like

240 sequences from jawless vertebrates (Agnatha), amphibians and non-avian reptiles. We could


241 identify CHST16, CHST3, as well as CHST2/7-like and CHST4/5-like sequences in lampreys

242 (described in detail below), as well as CHST1, CHST16, CHST2/7-like and CHST4/5-like

243 sequences in the inshore hagfish. While the identity of some of these sequences remains

244 unclear, the jawless vertebrate branches place the divergences between the C6OST subtype

245 clades before the split between jawless and jawed vertebrates (Gnathostomata), early in

246 vertebrate evolution. To provide an earlier dating point, we used seven C6OST-like

247 sequences from the vase tunicate. These sequences were first identified by Tetsukawa et al.

248 (2010), and our phylogeny (figure 1) supports their conclusion that the tunicate C6OST genes

249 represent an independent lineage-specific gene expansion. Thus, the diversification of

250 vertebrate C6OST genes likely occurred in a chordate or vertebrate ancestor after the

251 divergence of tunicates, which took place 518-581 Mya (Hedges & Kumar 2009). This is

252 consistent with the time window of the two rounds of whole-genome duplication early in

253 vertebrate evolution, 1R/2R (Dehal & Boore 2005; Nakatani et al. 2007; Putnam et al. 2008;

254 Sacerdot et al. 2018). We could also identify teleost-specific duplicates of CHST1, CHST2,

255 and CHST3 (described below), which is consistent with the third round of whole-genome

256 duplication (3R) that occurred early in the teleost lineage (Meyer & van de Peer 2005).

257

258 CHST1 and CHST16

259 The CHST1 and CHST16 branches of the C6OST phylogeny are shown in figure 2A. The

260 full-species phylogeny of CHST1 sequences is shown in supplementary figure S1, and of

261 CHST16 sequences in supplementary figure S2. Both CHST1 and CHST16 are represented in

262 cartilaginous fishes (Chondrichthyes), lobe-finned fishes (Sarcopterygii), including tetrapods

263 and coelacanth, as well as ray-finned fishes (Actinopterygii), including the spotted gar and

264 teleost fishes. Overall, both CHST1 and CHST16 branches follow the accepted phylogeny of

265 vertebrate groups, and the overall topology is well-supported. However, there are some


266 notable inconsistencies: We could identify putative CHST1 and CHST16 orthologs in the

267 inshore hagfish, as well as putative CHST16 orthologs in the Arctic and sea lampreys.

268 However, the jawless vertebrate CHST16 sequences cluster basal to both jawed vertebrate

269 CHST1 and CHST16 clades. Nevertheless, our overall results indicate that both CHST1 and

270 CHST16 were present in a vertebrate ancestor prior to the split between jawless and jawed

271 vertebrates. Within teleost fishes, we could identify duplicate CHST1 genes located on

272 separate chromosomes, which we have named CHST1a and CHST1b. However, the spotted

273 gar CHST1 sequence clusters together with the teleost CHST1a clade rather than basal to both

274 CHST1a and CHST1b clades. This is true for the smaller phylogeny shown in figure 2A as

275 well as for the phylogeny with full species representation (supplementary figure S1), which

276 also includes a CHST1 sequence from another holostean fish, the bowfin. The duplicate

277 CHST1 genes in the investigated eel species also both cluster within the CHST1a branch.

278 Thus, we could not determine with any certainty whether these CHST1 duplicates in eels

279 represent CHST1a and CHST1b. These inconsistencies are likely, at least partially, caused by

280 uneven evolutionary rates between different lineages as well as a relatively high degree of

281 sequence conservation for CHST1 sequences. There are also duplicate CHST1, but not

282 CHST16, genes in the allotetraploid African clawed frog, whose locations on chromosomes

283 4L and 4S correspond to each of the two homeologous sub-genomes (Session et al. 2016).

284 There have been notable gene losses of both CHST1 and CHST16. CHST16 genes

285 could not be identified in non-avian reptiles, birds or mammals, indicating that this subtype

286 was lost in the amniote ancestor. Within cartilaginous fishes, CHST16 was missing in all

287 investigated skate species, indicating a loss from Rajiformes and possibly more generally

288 from Batoidea. CHST16 genes were also missing from both eel species, indicating a loss

289 within the genus Anguilla, and possibly more generally within Elopomorpha. Not all teleost

290 fish species preserve both CHST1 duplicates. Notably, CHST1a is missing from cypriniform


291 fishes, including the zebrafish, and we could not identify CHST1a in one salmoniform fish,

292 the Arctic char. The CHST1b gene seems to have been lost more widely: We could not find

293 CHST1b sequences in several basal spiny-rayed fishes (Acanthomorpha), such as opah (order

294 Lampriformes), Atlantic cod (order Gadiformes), and longspine squirrelfish (order

295 Holocentriformes), as well as several basal percomorph fishes, namely bearded brotula (order

296 Ophidiiformes), mudskippers (order Gobiiformes), which instead have duplicate CHST1a

297 genes, yellowfin tuna (order Scombriformes), tiger tail seahorse and gulf pipefish (order

298 Syngnathiformes). It was also missing from turbot (order Pleuronectiformes), turquoise

299 killifish, guppy and Southern platyfish (order Cyprinodontiformes), barred knifejaw (order

300 Centrarchiformes), European perch and three-spined stickleback (order Perciformes), and

301 Japanese pufferfish (order Tetraodontiformes). At least some of these losses could be due to

302 incomplete genome assemblies, as there are closely related species that do have both CHST1a

303 and CHST1b genes, such as the giant oarfish (order Lampriformes), blackbar soldierfish

304 (order Holocentriformes), tongue sole (order Pleuronectiformes), Murray cod (order

305 Centrarchiformes), tiger rockfish (order Perciformes), and green spotted pufferfish (order

306 Tetraodontiformes). We have followed the phylogeny and classification of teleost fishes by

307 Near et al. (2012) and Betancur-R et al. (2017). Aside from this diversity in terms of CHST1b

308 gene preservation or loss, CHST1b genes seem to have evolved more rapidly in neoteleosts,

309 as indicated by the branch lengths within this clade in our phylogenies (highlighted in figure

310 2A and supplementary figure S1). The neoteleost CHST1b genes also have a divergent exon

311 structure (see “Exon/intron structures of C6OST genes” below).

312

313 [Location of figure 2]

314


315 We could identify 15 gene families with members in the vicinity of both CHST1 and

316 CHST16 (subset 1, supplementary table S2): ANO1/2, ANO3/4/9, ANO5/6/7, ARFGAP,

317 C1QTNF4/17, CRY, DEPDC4/7, LDH, MYBPC, PACSIN, PPFIBP, RASSF9/10,

318 SLC5A5/6/8/12, SLC17A6/8, and TCP11. This was detected in the spotted gar genome on

319 linkage groups 27 and 8 (figure 2B). All identified chromosome segments are shown in

320 supplementary data 3. In the human genome, the identified blocks of conserved synteny

321 correspond mainly to segments of chromosomes 11 (where CHST1 is located), 12 and 22

322 (figure 2B), as well as 1, 6, 19 and X (not shown here). These chromosome regions have been

323 recognized as paralogous, the result of the 1R/2R whole genome duplications, in several

324 large-scale reconstructions of vertebrate ancestral genomes (Nakatani et al. 2007; Putnam et

325 al. 2008; Sacerdot et al. 2018). The CHST1 and CHST16-bearing chromosome regions

326 correspond to the “D” paralogon in Nakatani et al. (2007), more specifically the vertebrate

327 ancestral paralogous segments called “D1” (CHST1) and “D0” (CHST16). In the study by

328 Sacerdot et al. (2018), they correspond to the reconstructed pre-1R ancestral chromosome 6.

329 One of the authors has previously analyzed these chromosome regions extensively, and could

330 also conclude that they arose in 1R/2R (Lagman et al. 2013; Ocampo Daza & Larhammar

331 2018).

332 With respect to 3R, we could identify seven neighboring gene families in subset 1

333 which also had teleost-specific duplicate genes in the vicinity of CHST1a and CHST1b:

334 ANO1/2, ANO3/4/9, ANO5/6/7, DEPDC4/7, SLC17A6/8, PPFIBP, and RASSF9/10 (figure

335 2B). We could also identify a further six gene families with members in the vicinity of

336 CHST1a and CHST1b in teleosts (subset 2, supplementary table S2): DNAJA1/2/4, GLG1,

337 GPI, LSM14A, SLC27A, and TRPM1/3/6/7 (labeled with blue color on figure 2B). The

338 identified conserved synteny blocks correspond to segments of chromosomes 18 and 7 on one

339 side, and 25 on the other, in the zebrafish genome, as well as segments of chromosomes 6

340 and 3 in the medaka genome. These chromosome segments have previously been identified

341 to be the result of the teleost-specific whole-genome duplication, 3R (Kasahara et al. 2007;

342 Nakatani et al. 2007; Nakatani & McLysaght 2017). In the reconstruction of the pre-3R

343 genome by Nakatani & McLysaght (2017), these regions correspond to proto-chromosome

344 10. The chromosome segments we identified in the spotted gar, chicken and human genomes

345 also agree with the reconstruction by Nakatani & McLysaght (2017). In the analyses of

346 teleost fish chromosome evolution by Kasahara et al. (2007), as well as in Nakatani et al.

347 (2007), these regions correspond to proto-chromosome “j”. We could also detect extensive

348 translocations between paralogous chromosome segments in teleost fishes that originated in

349 1R/2R. See chromosome 18 in the zebrafish and chromosome 6 in the medaka, as an example

350 (figure 2B). This is also compatible with previous observations (Ocampo Daza & Larhammar

351 2018; Kasahara et al. 2007; Nakatani & McLysaght 2017). These homology-related

352 chromosome rearrangements likely underlie the co-localization of CHST16 with CHST1b in

353 many teleost genomes, as on chromosome 6 in the medaka (figure 2B), or with CHST1a

354 within Otocephala, as on chromosome 18 in the Mexican cave tetra (see chromosome

355 designations in the phylogeny in figure 2A). Cypriniform fishes, including the zebrafish, have

356 lost the CHST1a gene as described above. However, the CHST16 gene is located near the

357 chromosome segment where CHST1a would have been located in the zebrafish genome

358 (figure 2B). The co-localization of CHST16 with either CHST1a or CHST1b occurs only

359 within teleost fishes, and could be detected in all investigated teleost genomes that have been

360 mapped to chromosomes or assembled into linkage groups (supplementary data S1).

361 In addition to 3R, there have been more recent whole-genome duplication events

362 within cyprinid fishes (family Cyprinidae) and salmoniform fishes (order Salmoniformes)

363 (Glasauer & Neuhauss 2014). We identified additional duplicates of CHST1b in the goldfish

364 genome; however, only one of the three duplicates is mapped, to chromosome 50 (figure 2A),

365 and the other two are identical. There are also duplicates of the CHST16 gene on goldfish

366 chromosomes 18 and 43. These chromosome locations correspond to homeologous

367 chromosomes that arose in the cyprinid-specific whole-genome duplication (see figure 2 in

368 Chen et al. 2018). In salmoniform fishes, there are additional duplicates of CHST1a in the

369 Atlantic salmon and of CHST1b and CHST16 in all three investigated species, Atlantic

370 salmon, rainbow trout and Arctic char. While several of these salmoniform duplicates are

371 unmapped, including one of the CHST1a duplicates in Atlantic salmon, the chromosomal

372 locations of the syntenic CHST1b and CHST16 duplicates on chromosomes 10 and 16 in

373 Atlantic salmon (figure 2 in Lien et al. 2016), 2 and 1 in rainbow trout (figure 2 in Berthelot et

374 al. 2014), as well as 4 and 26 in Arctic char (figure 1 in Christensen et al. 2018), correspond

375 to duplicated chromosome segments from the salmoniform-specific whole-genome

376 duplication.

377

378 CHST3

379 The CHST3 branch of our C6OST phylogeny is shown in figure 3A. The full-species

380 phylogeny of CHST3 sequences is shown in supplementary figure S3. We could identify

381 CHST3 sequences across all major vertebrate lineages, including both lamprey species but

382 not the inshore hagfish. The allotetraploid African clawed frog has duplicated CHST3 genes

383 located on chromosomes 7L and 7S that correspond to each of the two homeologous sub-

384 genomes (Session et al. 2016). Overall, our phylogenies follow the accepted phylogeny of

385 vertebrate groups. However, there are some inconsistencies in both phylogenies. The

386 cartilaginous fish clade is not resolved, although all cartilaginous fish CHST3 sequences

387 cluster basal to the bony vertebrate clade. In addition, the coelacanth CHST3 sequence

388 clusters basal to both the tetrapod and ray-finned fish clades. Nonetheless, the overall

389 topologies of our phylogenetic analyses support the presence of a CHST3 gene before the

390 split between jawless and jawed vertebrates.

391 In teleost fishes we found duplicates of CHST3 located on different chromosomes

392 (figure 3B). These genes have been named chst3a and chst3b in the zebrafish (Habicher et al.

393 2015). The single CHST3 sequence from spotted gar clusters at the base of the well-supported

394 teleost chst3a and chst3b clades, which is consistent with duplication in the time window of

395 3R. We could identify nine gene families with teleost-specific duplicate members in the

396 vicinity of chst3a and chst3b: CDHR1, DNAJB12, LRIT, PALD1, PPA, PPIF, QRFPR,

397 RGR, and ZMIZ (subset 3, supplementary table S2). The identified paralogous segments on

398 chromosomes 13 and 12 in zebrafish and on chromosomes 15 and 19 in medaka (figure 3B),

399 correspond to chromosome segments that most likely arose in 3R (Kasahara et al. 2007;

400 Nakatani et al. 2007; Nakatani & McLysaght 2017). In Kasahara et al. (2007) and Nakatani et

401 al. (2007) they correspond to pre-3R proto-chromosome “d”, and in Nakatani & McLysaght

402 (2017) they correspond to proto-chromosome “4”. The chromosome segments we could

403 identify in the spotted gar, chicken and human genomes (figure 3B) are also compatible with

404 these previous studies, as well as with comparative genomic analyses of the spotted gar

405 genome vs. the human and zebrafish genomes (see figures 3 and 4, Table S2 in Amores et al.

406 2011). Other smaller synteny blocks not shown here are shown in supplementary data S3. We

407 could identify three copies of CHST3a in the goldfish genome; however, only one of them is

408 mapped, to chromosome 13. Thus, it is not possible to attribute at least one of the

409 duplications to the cyprinid-specific whole-genome duplication. There are also three copies

410 of the CHST3a gene in the tongue sole genome, located in tandem on . In

411 salmoniform fishes, no CHST3a gene could be found, indicating a loss of this gene, and no

412 additional gene duplicates of CHST3b seem to have been preserved.

413

414 [Location of figure 3]

415

416 Our overall phylogeny of C6OST sequences shows that CHST3 is closely related to

417 CHST1 and CHST16 (figure 1). However, we could not identify any conserved synteny, i.e.

418 no shared genomic neighbors, between CHST3 and CHST1 or CHST16, in any of the

419 analyzed species. Based on our conserved synteny analysis, the CHST1 and CHST16 genes

420 likely arose from an ancestral gene through 1R or 2R, as described above. This suggests not

421 only that the CHST3 gene was present together with the ancestral CHST1/16 gene before 1R,

422 but also that the CHST3 and CHST1/16 genes arose by a mechanism other than the

423 duplication of a larger chromosome segment, as in a whole-genome duplication.

424

425 CHST2 and CHST7

426 The CHST2 and CHST7 branches of our C6OST phylogeny are shown in figure 4A. The full-

427 species phylogeny of CHST2 sequences is included in supplementary figure S4, and of

428 CHST7 sequences in supplementary figure S5. The CHST2 and CHST7 subtype branches are

429 well-supported, and follow the accepted phylogeny of vertebrate groups, overall. Both clades

430 include orthologs from lobe-finned fishes (including coelacanth and tetrapods), ray-finned

431 fishes (including spotted gar and teleost fishes), as well as cartilaginous fishes. However,

432 both coelacanth CHST2 and CHST7 cluster with the respective ray-finned fish clades rather

433 than at the base of the lobe-finned fish clades. This is likely due to the low evolutionary rate,

434 i.e. high degree of sequence conservation, of these coelacanth sequences (Amemiya et al.

435 2013) relative to the tetrapod sequences. There are duplicate CHST2 and CHST7 genes in the

436 allotetraploid African clawed frog (Xenopus laevis), located on the homeologous

437 chromosomes (Session et al. 2016) 5L and 5S, 2L and 2S, respectively.

438

439 [Location of figure 4]

440

441 Notably, CHST7 could not be identified in any of the three investigated species within

442 Testudines (turtles, tortoises and terrapins), indicating an early deletion of this gene within

443 the lineage. CHST7 is also missing from the two-lined caecilian (an amphibian) and several

444 avian lineages, including penguins (order Sphenisciformes, three species were investigated),

445 falcons (order Falconiformes, two species were investigated), as well as the rock pigeon (order

446 Columbiformes), hoatzin (order Opisthocomiformes), downy woodpecker (order Piciformes),

447 and hooded crow (order Passeriformes). For at least some of these species, the lack of a

448 CHST7 sequence could be due to incomplete genome assemblies, as there are other closely

449 related species with CHST7, for example the band-tailed pigeon (order Columbiformes), and

450 great tit (order Passeriformes) (supplementary data S3).

451 In jawless vertebrates, we could identify CHST2 and CHST7-like sequences from the

452 inshore hagfish, sea lamprey and Arctic lamprey genomes, forming two clades. These

453 sequences cluster basal to both the jawed vertebrate CHST2 and CHST7 clades (figure 4A).

454 Thus, we cannot determine whether these sequences represent either CHST2 or CHST7, both

455 CHST2 and CHST7, or possibly an ancestral CHST2/7 subtype.

456 There are duplicates of CHST2 in teleost fishes located on different chromosomes

457 (figure 4A). These have been named chst2a and chst2b in the zebrafish. We could identify

458 both chst2a and chst2b only within basal teleost lineages (supplementary figure S4): eels

459 (order Anguilliformes), freshwater butterflyfishes, bony-tongues and mormyrids (superorder

460 Osteoglossomorpha), as well as Otocephala, including the Atlantic herring, European

461 pilchard and allis shad (order Clupeiformes), zebrafish, Dracula fish, Amur ide and goldfish

462 (order Cypriniformes), Mexican cave tetra and red-bellied piranha (order Characiformes),

463 electric eel (order Gymnotiformes), and channel catfish (order Siluriformes). This indicates

464 that the CHST2a gene was lost early in the euteleost lineage, which includes the majority of

465 teleost species diversity. Both CHST2a and CHST2b clades are well-supported in the

466 phylogeny (figure 4A), and the single spotted gar CHST2 sequence clusters at the base of

467 both branches, which is consistent with duplication within the time window of 3R. In the full-

468 species phylogeny (supplementary figure S4) this topology is disrupted by the

469 osteoglossomorph CHST2a and CHST2b branches. This is likely caused by uneven

470 evolutionary rates for teleost CHST2 sequences, as shown by the branch lengths within the

471 phylogeny. Teleost CHST2 sequences seem to have had a very low basal amino acid

472 substitution rate overall, however the CHST2a sequences, as well as the CHST2b sequences

473 in the African butterflyfish and mormyrids (superorder Osteoglossomorpha), seem to have

474 evolved at a relatively faster rate. There are additional duplicates of both CHST2a and

475 CHST2b, as well as of CHST7, in the goldfish genome, and their locations on chromosomes 2

476 and 27, 24 and 49, 6 and 31, respectively, support their emergence through the cyprinid-

477 specific whole-genome duplication (see figure 2 in Chen et al. 2018). Salmoniform fishes

478 lack CHST2a, as all euteleost fishes do, however there are duplicated CHST2b genes in the

479 three species that were investigated. The CHST2b duplicates are located on chromosomes 19

480 and 29 in the Atlantic salmon genome, and 11 and 15 in the rainbow trout genome. These

481 locations correspond to chromosomal segments known to have emerged in the salmoniform

482 whole-genome duplication (Lien et al. 2016; Berthelot et al. 2014). In the Arctic char, only

483 one of the CHST2b duplicates is mapped, to chromosome 14. No CHST7 duplicates could be

484 found in salmoniform fishes.

485 We could identify seven gene families with members in the vicinity of both CHST2

486 and CHST7: BRINP, NOS1AP, PFK, and RGS4/5/8/16 were detected in the zebrafish

487 genome, NCK and ST6GAL in the spotted gar genome, and SLC9A6/7/9 in both the spotted

488 gar and human genomes (subset 4, supplementary table S2). Additionally, we could identify

489 seven gene families with members in the vicinity of both chst2a and chst2b in the zebrafish

490 genome: ABI, ACBD4/5, ANXA13, DIPK2, ELCB, PCOLCE, and PLSCR (subset 5,

491 supplementary table S2). Out of both subsets, only four gene families, NCK, SLC9A6/7/9,

492 ST6GAL, and DIPK2, had members in the vicinity of CHST2 and CHST7 in the tetrapod

493 and/or spotted gar genomes. This apparent deficit of conserved syntenic genes possibly

494 reflects a low gene density in the region of CHST7. We examined the chromosome regions

495 around CHST2 and CHST7 genes in the synteny database Genomicus version 69.10

496 (http://www.genomicus.biologie.ens.fr/genomicus-69.10) (Muffato et al. 2010; Louis et al.

497 2015), and could only identify two gene pairs, in any species’ genome, in the vicinity of both

498 CHST2 and CHST7: SLC9A9 and SLC9A7, as well as C3orf58 and CXorf36. The latter gene

499 pair was not identified in our synteny analysis. Nevertheless, the conserved synteny blocks

500 that we could identify (figure 4B) correspond to chromosome regions recognized to have

501 resulted from the 1R/2R whole genome duplications. In the reconstruction by Nakatani et al.

502 (2007), these chromosome segments correspond to paralogon “F”, specifically the “F0”

503 (CHST2) and “F3” (CHST7) vertebrate ancestral paralogous segments. In the analysis based

504 around the Florida lancelet (Branchiostoma floridae) genome, they correspond to the

505 reconstructed ancestral linkage group 10 (Putnam et al. 2008). In the more recent study by

506 Sacerdot et al. (2018), they correspond to the reconstructed pre-1R ancestral chromosome 17.

507 We could also identify one gene family, ITPR, with members in the vicinity of chst7 and

508 chst16 in the zebrafish genome. However, this synteny pattern is not reproduced in any of the

509 other investigated genomes (supplementary data S3).

510 With respect to 3R, seven gene families from subset 4 and subset 5 have teleost-

511 specific duplicate members in the vicinity of both chst2a and chst2b in the zebrafish genome:

512 PFK from subset 4, as well as ABI, ACBD4/5, ANXA13, DIPK2, ELCB, and PCOLCE from

513 subset 5. One additional family, PLSCR, has members in the vicinity of chst2a (plscr1.1,

514 plscr1.2) and chst2b (plscr2). However, the locations of these genes outside teleost fishes

515 reveal that they likely arose through a local duplication in a bony fish ancestor, at the latest,

516 rather than in teleost fishes and 3R. The identified paralogous segments on chromosomes 2

517 and 24 in zebrafish, and 17 and 20 in medaka (figure 4B), correspond to chromosome

518 segments previously identified to be the result of 3R (Kasahara et al. 2007; Nakatani et al.

519 2007; Amores et al. 2011; Nakatani & McLysaght 2017). In Kasahara et al. (2007) and

520 Nakatani et al. (2007) they correspond to pre-3R proto-chromosome “m”, and in Nakatani &

521 McLysaght (2017) they correspond to proto-chromosome “13”. Our analysis also identified

522 the possible location of the CHST2a gene in medaka on chromosome 17, had it not been lost

523 early in the euteleost lineage. The chromosome segments for the ABI, ACBD4/5, ANXA13,

524 ELCB and PFK families in the spotted gar, chicken, Western clawed frog, and human

525 genomes (figure 4B) do not correspond to the CHST2 and CHST7-bearing chromosome

526 segments. However, several studies have identified those chromosome segments to also be

527 part of the CHST2 and CHST7-bearing paralogon (see figure 4 in Kasahara et al. 2007, figure

528 4 in Bian et al. 2016, and figure 3 in Nakatani & McLysaght 2017).

529 In the zebrafish, the ABI, NCK, and PFK gene families have members in the vicinity

530 of either chst2a or chst2b as well as chst7 (figure 4B), reflecting the 1R/2R-generated

531 paralogy as described above. There are three additional families that also fulfil this criterion,

532 BRINP, NOS1AP, and RGS4/5/8/16 (supplementary figure S6), but are likely not the result

533 of 1R/2R. The single homologous paralogy blocks on human chromosome 1, chicken

534 chromosome 8, Western clawed frog chromosome 4, and spotted gar linkage group 10

535 suggest that the gene duplicates BRINP2 and BRINP3, as well as RGS4, RGS5, RGS8 and

536 RGS16 arose through ancient local duplications rather than through 1R/2R.
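The reasoning in this paragraph, that paralogs confined to a single chromosome block in outgroup genomes point to local duplication rather than whole-genome duplication, can be expressed as a simple check. The sketch below is illustrative only and is not part of the analysis reported here; the function name and the second, two-chromosome family are hypothetical, while the chicken chromosome 8 placement of the RGS genes follows the text above.

```python
def duplication_pattern(members):
    """Summarize where the members of one gene family sit in one genome.

    members: list of (gene_name, chromosome) tuples.
    Returns "single copy", "same chromosome" (consistent with local
    duplication) or "multiple chromosomes" (consistent with duplication
    of larger chromosome segments, as in a whole-genome duplication).
    """
    if len(members) < 2:
        return "single copy"
    chromosomes = {chrom for _, chrom in members}
    return "same chromosome" if len(chromosomes) == 1 else "multiple chromosomes"


# RGS4/5/8/16 family in the chicken genome, as described in the text.
chicken_rgs = [("RGS4", "8"), ("RGS5", "8"), ("RGS8", "8"), ("RGS16", "8")]

# Hypothetical family with paralogs on two different chromosomes.
hypothetical_family = [("FAMY1", "chrA"), ("FAMY2", "chrB")]

print(duplication_pattern(chicken_rgs))
print(duplication_pattern(hypothetical_family))
```

Such a check is only a first filter: chromosome rearrangements can bring whole-genome duplicates together or separate tandem duplicates, so the pattern needs to be compared across several genomes, as was done here.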

537

538 CHST4, CHST5 and related CHST4/5-like sequences

539 The branch of the C6OST family that contains the known CHST4, CHST5, and CHST6

540 sequences is by far the most complex of the C6OST phylogeny. This branch of our C6OST

541 phylogeny is shown in figure 5, and the full-species phylogeny is included in supplementary

542 figure S7. Aside from the known C6OST sequences, we can report several CHST4 and

543 CHST5-like subtypes represented in amphibians, non-avian reptiles and jawless vertebrates.

544 The CHST4 and CHST5 genes (and CHST6) are located on the same chromosome in many

545 species, and some of the CHST4 and CHST5-like genes we identified are located downstream

546 of these, suggesting that there has been a high propensity for the duplication of C6OST genes

547 in these regions.

548 [Location of figure 5]

549 In both our phylogenies, the known CHST4 and CHST5 sequences cluster into two

550 well-defined and well-supported clades. This allowed us to classify a number of sequences

551 with hitherto unclear identities. All identified sequences from cartilaginous fishes,

552 coelacanth, and ray-finned fishes, including spotted gar and teleosts, cluster confidently

553 together with tetrapod sequences within the CHST5 clade, while CHST4 sequences could

554 only be identified from tetrapod species. We could only identify orthologs of the human

555 CHST6 gene in other primate species, and these cluster confidently within the CHST5 clade.

556 While there are genes in other mammalian species, chicken and Western clawed frog that

557 have previously been identified as CHST6 (Taniguchi et al. 2012), our analyses show that

558 they are better described as CHST5 for the non-primate mammal and chicken genes, and as

559 “CHST4/5-like” in the Western clawed frog. The zebrafish chst5 gene is also erroneously

560 annotated as chst6 in the zebrafish information network database at zfin.org (gene ID: ZDB-

561 GENE-060810-74). In teleost fishes, we could identify many lineage-specific duplicates of

562 CHST5. The largest number was found in the genome of the climbing perch (order

563 Anabantiformes) with 15 gene duplicates, some of which are likely pseudogenes. There are

564 also CHST5 duplicates in the tongue sole (order Pleuronectiformes), corkwing wrasse (order

565 Labriformes), brown dottyback (family Pseudochromidae), turquoise killifish, guppy, and

566 Southern platyfish (order Cyprinodontiformes), Eastern happy, zebra mbuna, Nile tilapia, and

567 Simochromis diagramma (order Cichliformes), European pilchard (order Clupeiformes), as

568 well as in the mormyrids, Asian arowana, silver arowana, and Arapaima (superorder

569 Osteoglossomorpha) (supplementary figure S7). At least some of the duplications are shared

570 between several species within Cyprinodontiformes, Cichliformes and Osteoglossomorpha.

571 In the genomes that have been mapped to chromosomes, assembled into linkage groups or

572 longer genomic scaffolds, these duplicated CHST5 genes are located in tandem rather than on

573 different chromosomes (supplementary data S1). We could also identify CHST5 duplicates in

574 the goldfish genome, located on chromosomes 25 and 50, that likely arose in the cyprinid fish

575 whole-genome duplication (see figure 2 in Chen et al. 2018). Conversely, no CHST5

576 duplicates from the salmoniform-specific whole-genome duplication seem to have been

577 preserved.

578 In addition to the CHST4 and CHST5 (including CHST6) sequences, we could

579 identify a multitude of previously unrecognized CHST4/5-like sequences in jawless

580 vertebrates, amphibians and non-avian reptiles. With some exceptions detailed below, we

581 have used the name CHST4/5-like for these sequences. In jawless vertebrates, we identified

582 four CHST4/5-like sequences each in the Arctic lamprey and sea lamprey genomes,

583 and two CHST4/5-like sequences in the inshore hagfish genome. These sequences form a

584 well-supported clade that clusters at the base of the CHST4, CHST5 and CHST4/5-like branch

585 (figure 5). As for jawed vertebrates, we could identify a well-supported clade of CHST4/5-

586 like sequences represented in amphibians as well as Lepidosauria, excluding snakes. We

587 could identify sequences of this subtype in all investigated amphibian lineages, including the

588 two-lined caecilian (clade Gymnophiona), salamanders (order Urodela), and frogs (order

589 Anura), as well as in the tuatara (order Rhynchocephalia), the ocelot gecko (infraorder

590 Gekkota), and several lizards, including the European green lizard (clade Laterata), the

591 Carolina anole lizard and the central bearded dragon (suborder Iguania). This clade clusters

592 basal to the main CHST4 and CHST5 branches in both phylogenies. This indicates that these

593 sequences either represent an ancestral jawed vertebrate C6OST subtype that has not been

594 preserved in any other lineages, or more parsimoniously, CHST4/5-like gene duplicates with

595 a derived mode of sequence evolution that arose in tetrapods before the split between

596 amphibians and amniotes. A conserved synteny analysis of the Carolina anole lizard

597 CHST4/5-like sequence on chromosome 2 showed no conservation of synteny with other

598 C6OST gene-bearing regions (supplementary figure S8). In amphibians, there have been

599 multiple rounds of local gene duplication within this clade. While the Western and African

600 clawed frogs only have one CHST4/5-like gene in this clade, Parker’s slow frog has four

601 copies located in tandem on the same genomic scaffold, and the two-lined caecilian has five

602 (supplementary figure S9). The phylogenetic relationships between different duplicates in

603 different species are not resolved in either phylogeny (figure 5, supplementary figure S7).

604 Apart from these CHST4/5-like sequences, we could identify CHST5-like sequences in the

605 three frog genomes. In the phylogeny with full-species representation (supplementary figure

606 S7), these sequences form a well-supported sister clade to the frog CHST5 sequences. Owing

607 also to their arrangement downstream of the CHST5 genes in the Western and African clawed

608 frog genomes (supplementary figure S9), we have called them CHST5.2 and the frog CHST5

609 genes CHST5.1. The relationship between the frog CHST5.1 and CHST5.2 genes is not

610 reproduced in the smaller phylogeny (figure 5). Indeed, the amphibian branch of CHST5 is

611 unresolved in both phylogenies, likely due to diverging evolutionary rates within the CHST5

612 clade. The amphibian CHST5 (CHST5.1 in frogs) sequences have had a lower rate of amino

613 acid substitution, while the frog CHST5.2 sequences seem to have had an accelerated

614 evolutionary rate, as shown by the branch lengths in both phylogenies. The allotetraploid

615 African clawed frog has duplicate CHST4, CHST5.1 and CHST5.2 genes, all located on the

616 homeologous chromosome pair 4L and 4S (supplementary figure S9), totaling seven genes

617 within this branch of the C6OST phylogeny.

618 We could identify only two gene families showing conserved synteny between either

619 CHST4 or CHST5 and another C6OST family member: KLF9/13/14/16 and RPGRIP1. These

620 were detected in the zebrafish genome, with members in the vicinity of chst5 and chst2a

621 (KLF9/13/14/16) or chst2b (RPGRIP1). However, this conserved synteny relationship is not

622 reproduced in any of the other genomes we investigated, including the medaka

623 (supplementary data S3). Thus, it is likely the result of chromosome rearrangements in the

624 lineage leading to zebrafish. No other patterns of conserved synteny could be detected with

625 respect to the CHST4, CHST5 and related genes. As with the CHST3 gene, this lack of

626 conserved synteny suggests that the ancestral CHST4/5 gene did not arise in the two early

627 vertebrate whole-genome duplications 1R and 2R, but rather that it was present together with

628 the ancestral CHST2/7 and CHST1/16 genes as well as the CHST3 gene before 1R.

629

630 Exon/intron structures of vertebrate C6OST genes

631 As has been reported previously, most mammalian C6OST genes consist of intron-less ORFs

632 (Lee et al. 1999; Bistrup et al. 1999; Kitagawa et al. 2000; Hemmerich, Lee, et al. 2001), with

633 the exception of CHST3, whose protein-coding domain is encoded by two exons (Tsutsumi et

634 al. 1998). Here we can report that the protein-coding domain of the CHST16 genes also

635 consists of two exons. The position of the intron is different from the CHST3 intron. These

636 exon structures are common to jawed vertebrates and likely represent the exon/intron

637 structures of the ancestral genes. In jawless vertebrates there seem to have been several

638 independent intron insertions that make it difficult to deduce the ancestral conditions. There

639 are several notable exceptions to the common protein-coding exon/intron structures, all found

640 within teleost fishes: CHST1a in Gobiiformes (gobies) has five exons (independent intron

641 insertions); CHST1b in Acanthomorpha (spiny-rayed fishes) has five exons (independent

642 intron insertions); CHST3a in Neoteleostei has three exons, one additional intron insertion

643 downstream of the ancestral jawed vertebrate CHST3 intron; CHST3a in gobies has four

644 exons, one additional intron insertion downstream of ancestral jawed vertebrate and

645 neoteleost introns; CHST3b in Protacanthopterygii (including Northern pike and

646 salmoniform fishes) has three exons, one additional intron insertion at a different position

647 than neoteleost CHST3a; CHST5 in Characiformes (including the Mexican cave tetra and

648 red-bellied piranha) has two exons (independent intron insertion); CHST7 in Syngnathiformes

649 (seahorses, pipefishes) and gobies has two exons, the intron insertion is shared between the

650 two lineages.

651

652 Invertebrate C6OST genes

653 In addition to the vase tunicate and fruit fly C6OST sequences included in our phylogeny

654 (figure 1), we could identify putative C6OST sequences from the Florida lancelet

655 (Branchiostoma floridae), the hemichordate acorn worm (Saccoglossus kowalevskii), the

656 purple sea urchin (Strongylocentrotus purpuratus) as well as the honey bee (Apis mellifera)

657 and the silk moth (Bombyx mori) (supplementary data S1). Ultimately, these sequences were

658 not used in our final phylogenies; however, it is worth mentioning that there seem to have

659 been extensive lineage-specific expansions of C6OST genes in all but the insect species. As

660 described above, seven unique C6OST sequences have been described from the vase tunicate

661 (Tetsukawa et al. 2010). We could identify 16 unique C6OST sequences in the Florida

662 lancelet, 31 in the acorn worm, and 30 in the purple sea urchin. All full-length sequences

663 contain the characteristic sulfotransferase domain (Pfam PF00685) and have recognizable

664 5'PSB and 3'PB motifs (supplementary data S2).

665

666 Discussion

667

668 [Location of figure 6]

669

670 The evolution of vertebrate C6OST genes

671 We present the first large-scale analysis and phylogenetic classification of vertebrate

672 carbohydrate 6-O sulfotransferase (C6OST) genes. Our analyses are based on genomic and

673 transcriptomic data from a total of 158 species representing all major vertebrate groups,

674 including 18 mammalian species, 33 avian species in 16 orders, 14 non-avian reptile species,

675 six amphibian species, the basal lobe-finned fish coelacanth, the holostean fishes spotted gar

676 and bowfin, 67 teleost fish species in 35 orders, seven cartilaginous fish species, and three

677 jawless vertebrate species. This allowed us to identify a previously unrecognized C6OST

678 subtype gene that we have named CHST16, which was lost in amniotes, as well as a

679 multitude of lineage-specific duplicates of the known C6OST genes. We could also identify

680 known C6OST genes in lineages where they were previously unrecognized, like CHST7 in

681 birds and non-avian reptiles, and CHST1a and CHST1b duplicates in teleost fishes. Our

682 phylogenetic analyses show that the jawed vertebrate C6OST gene family consists of six

683 ancestral clades forming two main branches. The first branch contains CHST1, CHST3 and

684 CHST16, and the second branch contains CHST2, CHST7 and CHST5. The CHST5 clade also

685 includes CHST4 and CHST6. We have identified genes within the six ancestral subtypes in

686 the genomes of several cartilaginous fishes, lobe-finned fishes, including the coelacanth and

687 tetrapods, as well as ray-finned fishes, including the spotted gar and teleost fishes (figure 6).

688 We conducted comparative analyses of conserved synteny - the conservation of gene

689 content across several chromosomal regions that result from block duplications of

690 large chromosomal segments or whole-genome duplications. We found that the basal

691 vertebrate whole-genome duplications (1R/2R) only contributed two additional genes to the

692 ancestral vertebrate repertoire, giving rise to CHST1 and CHST16, as well as CHST2 and

693 CHST7 (figure 7). Indeed, the CHST1 and CHST16 genes, and the CHST2 and CHST7 genes,

694 are located in genomic regions previously recognized to have originated through 1R/2R

695 (Nakatani et al. 2007; Kasahara et al. 2007; Putnam et al. 2008; Sacerdot et al. 2018). Thus,

696 four C6OST genes - CHST3, CHST5 as well as the ancestral CHST1/16 and CHST2/7 - were

697 present already in a vertebrate ancestor, before 1R. Our phylogenetic analysis places the

698 origin of the gene family between the divergence of tunicates from the main chordate branch

699 that took place 518-581 Mya (Hedges and Kumar, 2009) and the emergence of extant

700 vertebrate lineages. Thus, the gene family expansions that laid the ground for the vertebrate

701 C6OST gene family likely occurred very early in chordate/vertebrate evolution.
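The relationships among the six ancestral clades described above can be written down as a small nested structure. The following Python sketch is illustrative only: the topology is our reading of the text (CHST1 and CHST16 as 1R/2R duplicates of an ancestral CHST1/16, CHST2 and CHST7 as duplicates of an ancestral CHST2/7, with CHST3 and CHST5 completing each main branch), and the names `C6OST_TREE`, `leaves` and `smallest_clade` are hypothetical. CHST4 and CHST6, which fall within the CHST5 clade, are omitted for brevity.

```python
# Hypothetical encoding of the two main branches and six ancestral clades.
# Inner tuples are clades; strings are the ancestral jawed vertebrate subtypes.
C6OST_TREE = (
    (("CHST1", "CHST16"), "CHST3"),  # first main branch
    (("CHST2", "CHST7"), "CHST5"),   # second main branch (CHST5 includes CHST4/CHST6)
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def smallest_clade(node, genes):
    """Return the leaf set of the smallest clade containing all of `genes`."""
    children = [] if isinstance(node, str) else node
    for child in children:
        if genes <= leaves(child):
            return smallest_clade(child, genes)
    return leaves(node)


print(sorted(smallest_clade(C6OST_TREE, {"CHST1", "CHST16"})))
print(sorted(smallest_clade(C6OST_TREE, {"CHST2", "CHST7"})))
print(sorted(smallest_clade(C6OST_TREE, {"CHST3", "CHST7"})))
```

In this encoding the 1R/2R duplicate pairs are each other's closest relatives, whereas the smallest clade containing both CHST3 and CHST7 is the whole family, since these two genes sit on different main branches.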

702

703 [Location of figure 7]

704

705 In jawless vertebrates (Agnatha) we identified the CHST1 and CHST16 sequences

706 from the inshore hagfish, as well as the CHST16 and CHST3 sequences from the Arctic and

707 sea lampreys (figure 2A and figure 6). We could also identify several CHST2/7-like and

708 CHST4/5-like sequences in all three jawless vertebrate species; however, their duplication

709 events are not completely resolved in our phylogeny (figure 4, figure 5 and open boxes in

710 figure 6). There has been a long debate about whether jawless vertebrates diverged after 1R

711 or whether jawless and jawed vertebrates share both rounds of whole genome duplication

712 (Kuraku et al. 2009). Previous studies of both the sea lamprey and Arctic lamprey genomes

713 indicate that jawless and jawed vertebrates share both whole-genome duplications (Smith et

714 al. 2013; Mehta et al. 2013), but that the shared patterns of gene synteny are obscured by the

715 asymmetric retention and loss of gene duplicates (Kuraku 2013). Another recent

716 reconstruction of vertebrate genome evolution based on the retention of gene content, order

717 and orientation, as well as gene family phylogenies, also concluded that jawless and jawed

718 vertebrates diverged after 2R (Sacerdot et al. 2018; Holland & Ocampo Daza 2018). In

719 contrast, an analysis of the meiotic map of the sea lamprey genome interprets the patterns of

720 synteny conservation differently, suggesting instead one round of genome duplication at the

721 base of vertebrates with subsequent independent segmental duplications in jawless and jawed

722 vertebrates (Smith & Keinath 2015). Our results are compatible with at least one whole

723 genome duplication shared between jawless and jawed vertebrates. The CHST4/5-like genes

724 we identified in jawless vertebrates likely represent the ancestral vertebrate CHST5 and the

725 CHST2/7-like genes likely represent the ancestral CHST2 and CHST7 genes; however, we

726 could not assign them to one individual clade. As our schematic shows, both whole-genome

727 duplications were followed by several gene losses (figure 7). Thus, it is also possible that

728 jawless and jawed vertebrates have preserved and lost different gene copies generated by 1R

729 or 2R. Such differential gene losses/preservations or “hidden paralogies” are thought to

730 underlie many ambiguous orthology and paralogy assignments between lamprey sequences

731 and jawed vertebrate sequences (Kuraku 2010).

732 After the emergence of jawed vertebrates, and later of bony vertebrates, and the split

733 between lobe-finned fishes and ray-finned fishes, the evolution of C6OST genes has taken

734 different routes within different lineages, with several additional gene duplications and

735 losses. The evolution of the CHST5 branch is particularly marked by local gene duplications:

736 these duplications of CHST5 generated CHST4 in a tetrapod ancestor, CHST6 in a

737 primate ancestor, as well as a previously unrecognized gene in frogs that we called CHST5.2

738 (figure 7). Only tetrapods were found to have both CHST4 and CHST5, and all investigated

739 species with assembled genomes have the genes located on the same chromosome, linkage

740 group or genomic scaffold. The cartilaginous fish, coelacanth, spotted gar and teleost

741 sequences within this branch cluster confidently within the CHST5 clade, rather than basal to

742 the tetrapod CHST4 and CHST5 clades. This also indicates that CHST4 emerged later (figure

743 7). As detailed above, the phylogenetic identity of the jawless vertebrate CHST4/5-like

744 sequences is somewhat uncertain, which is why we have decided not to name them CHST5.

745 Different genes have previously been identified as CHST6 in non-primate mammals, chicken,

746 Western clawed frog and zebrafish. In most cases it is CHST5 that has been misidentified,

747 and we suggest new gene names and symbols in the results above.

748 The newly identified CHST16 is present within cartilaginous fishes, but was likely

749 lost from rays and skates. It was also identified in all investigated ray-finned fish species, the

750 coelacanth and amphibians, but not in any amniote species, indicating an early gene loss in

751 this lineage. Nothing is known about the functions of CHST16, so we cannot speculate

752 whether its deletion was concurrent with a loss of function, or whether another C6OST gene

753 could compensate for its loss. However, it is notable that the likely emergence of CHST4 as a

754 copy of CHST5, and the loss of CHST16, are associated with the time window for the

755 transition from water to land and the emergence of the amniotic egg.

756 The C6OST gene family has had a dynamic and varied evolution within the teleost

757 fish lineage. The teleost-specific whole-genome duplication, 3R, gave rise to duplicates of

758 CHST1, CHST2 and CHST3, increasing the number of genes to nine in the teleost ancestor

759 (figure 6, figure 7). This was followed by multiple differential losses; notably, the loss of

760 CHST1a from cypriniform fishes, including zebrafish, the loss of CHST3a from salmoniform

761 fishes, and of CHST2a from euteleost fishes. CHST1b seems to have been lost independently

762 within several lineages of spiny-rayed fishes (Acanthomorpha). This apparently relaxed

763 selection on the retention of CHST1b seems to have been preceded by a stage of rapid

764 evolution. We found that the neoteleost branch of CHST1b sequences shows an accelerated

765 rate of amino acid substitution coupled with the insertion of four introns into the otherwise

766 uninterrupted ORF. In addition to 3R, there have been more recent whole-genome

767 duplications in cyprinid fishes and in salmoniform fishes (Glasauer & Neuhauss 2014),

768 occasionally called 4R. We concluded that the salmoniform 4R contributed duplicates of

769 CHST1a, CHST1b, CHST16 and CHST2b (figure 6, figure 7), raising the number of C6OST

770 genes in this lineage to a total of eleven. We identified additional duplicates of CHST16,

771 CHST2a, CHST2b, CHST3b, CHST5 and CHST7 in the goldfish genome that likely arose in

772 the cyprinid whole-genome duplication, as well as copies of CHST1b and CHST3a of an

773 uncertain origin, raising the number of C6OST genes in this species to a total of 17. Other

774 notable gene expansions within the teleost fishes include the local expansion of CHST5 in

775 several lineages, such as killifishes and live-bearers, and Osteoglossomorpha.
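The gene counts given in this section can be checked with simple bookkeeping. The following Python sketch is illustrative only: the helper `replace_with_duplicates` and the variable names are hypothetical, and the sketch follows only the duplications and losses named in the text, from the four genes present before 1R to the eleven genes in salmoniform fishes. It does not attempt the goldfish count, for which the text leaves the origin of some copies open.

```python
from collections import Counter

# Genes present in a vertebrate ancestor before 1R, as stated in the text.
pre_1r = {"CHST1/16", "CHST2/7", "CHST3", "CHST5"}


def replace_with_duplicates(genes, parent, daughters):
    """Replace `parent` by the daughter genes retained after a duplication."""
    return (genes - {parent}) | set(daughters)


# 1R/2R: two additional genes, giving the six ancestral jawed vertebrate subtypes.
jawed = replace_with_duplicates(pre_1r, "CHST1/16", ["CHST1", "CHST16"])
jawed = replace_with_duplicates(jawed, "CHST2/7", ["CHST2", "CHST7"])

# 3R: retained duplicates of CHST1, CHST2 and CHST3 in the teleost ancestor.
teleost = jawed
for parent in ("CHST1", "CHST2", "CHST3"):
    teleost = replace_with_duplicates(teleost, parent, [parent + "a", parent + "b"])

# Losses relevant to salmoniform fishes: CHST3a (salmoniforms), CHST2a (euteleosts).
salmoniform_pre_4r = teleost - {"CHST3a", "CHST2a"}

# Salmoniform 4R: one extra copy each of CHST1a, CHST1b, CHST16 and CHST2b.
copies = Counter({gene: 1 for gene in salmoniform_pre_4r})
for gene in ("CHST1a", "CHST1b", "CHST16", "CHST2b"):
    copies[gene] += 1
salmoniform_total = sum(copies.values())

print(len(pre_1r), len(jawed), len(teleost), len(salmoniform_pre_4r), salmoniform_total)
```

Following these steps, the tally runs from four genes before 1R to six in the jawed vertebrate ancestor, nine in the teleost ancestor, and eleven in salmoniform fishes, in agreement with the numbers stated above.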

776 Finally, we also identified a number of CHST4/5-like genes in amphibians and several

777 lineages within Lepidosauria whose origins could not be elucidated by either phylogenetic or

778 conserved synteny analyses (figure 7). The species representation suggests that this gene

779 arose early in tetrapod evolution and was subsequently lost multiple times: from mammals,

780 turtles, tortoises and terrapins (Testudines), birds and crocodilians (Archosauria), as well as

781 snakes. Further investigation of the amphibian CHST4/5-like sequences and lepidosaurian

782 CHST4/5-like sequences in additional amphibian and non-avian reptile species will be needed

783 to resolve this phylogenetic uncertainty.

784

785 Functional considerations

786 Our interest in the evolution of C6OSTs mainly revolves around their key roles in the

787 formation of extracellular matrices (ECMs) during the formation of mineralized tissues, such

788 as cartilage and bone. Within this context, the modest expansion of the C6OST family in

789 1R/2R stands in contrast to several gene families with functions in skeleton formation,

790 notably the Hox gene family (Ravi et al. 2009; Kuraku & Meyer 2009; Sundström et al.

791 2008) and the fibrillar collagen gene family (Boot-Handford & Tuckwell 2003; Zhang &

792 Cohn 2008), whose expansions in 1R/2R have been considered essential for the evolution of

793 the vertebrate skeleton (Wada 2010). Several key genes involved in the regulation of bone

794 homeostasis also arose through 1R/2R, including the parathyroid hormone gene (PTH) and

795 the calcitonin genes (CALC), as well as their cognate receptor genes (Hwang et al. 2013).

796 The bone morphogenetic protein (BMP) genes BMP2 and BMP4 have also been shown to have

797 emerged in 1R/2R (Marques et al. 2016; Feiner et al. 2019), as have BMP5, BMP6, BMP7

798 and BMP8. Sulfotransferases have a pivotal role in generating the active GAGs that are

799 subsequently combined into proteoglycans, building the matrix that provides a scaffold for

800 tissue formation. Our results raise the question of whether enzymes like sulfotransferases

801 were important prerequisites before genes specifically dedicated towards skeleton formation

802 could evolve. Out of the C6OST genes we have studied, CHST3 and CHST7 sulfate

803 chondroitin sulfate (CS), and have been implicated in the development and homeostasis of

804 the skeletal system. CS molecules that are 6-O sulfated by CHST3 are important components

805 of proteoglycans like aggrecan in cartilage. Our results show that CHST3 and CHST7 are not

806 closely related, which suggests that chondroitin 6-O sulfation activity has evolved at least

807 twice, even before the advent of cartilaginous skeletons in vertebrates. Indeed, we found that

808 cartilaginous fishes have both CHST3 and CHST7 genes, and at least CHST3 is clearly

809 present in jawless vertebrates. It is also notable that CHST7 is most closely related to CHST2,

810 which suggests some functional overlap, at least in early vertebrate evolution. CHST2,

811 through the 6-O sulfation of keratan sulfate (KS), may also have a role in the skeletal system,

812 in the maintenance of cartilaginous tissue (Hayashi et al. 2011). The expression profiles of

813 the zebrafish chst3a, chst3b and chst7 genes during development mirror the known functions

814 of CHST3 and CHST7 in mammals, with expression mainly in skeletal tissues and the central

815 nervous system (Habicher et al. 2015), albeit with only a very weak signal for chst3b. Further

816 study in teleost species should be done in order to compare the expression profiles of the

817 teleost-specific CHST3 duplicates.

818 As for the novel C6OST subtype gene we describe here, CHST16, its close

819 phylogenetic relationship to the CHST1 paralogue could indicate some similarities in gene

820 expression profile and preferred substrate. A large-scale high-throughput expression analysis

821 identified expression of chst1b in the zebrafish retina and brain (Thisse & Thisse 2004),

822 which is shared with mammalian CHST1 orthologues. In the zebrafish retina, as in other

823 vertebrate retinas, KS is a common and abundant GAG molecule (Souza et al. 2007). These

824 facts point to a common role of CHST1 in 6-O sulfation of KS in the retina and central

825 nervous system across lobe-finned fishes, including tetrapods, and ray-finned fishes, including

826 teleost fishes.

827

828 Acknowledgments

829 This work was supported by the Swedish Research Council (Vetenskapsrådet) through an

830 International Postdoc Grant awarded to D.O.D. (2016-00552) and a Starting Grant awarded

831 to T.H. (621-2012-4673). We wish to thank the G10K Consortium and Professor Erich D.

832 Jarvis from Rockefeller University, New York, for access to genomic data from several key

833 species, including platypus, greater horseshoe bat, Anna’s hummingbird, kakapo, zebra finch,

834 two-lined caecilian, and thorny skate, through the Vertebrate Genomes Project

835 (vertebrategenomesproject.org).

836

837 References

838 Abdullayev I, Kirkham M, Björklund ÅK, Simon A, Sandberg R. 2013. A reference 839 transcriptome and inferred proteome for the salamander Notophthalmus viridescens. Exp. 840 Cell Res. 319:1187–1197. doi: 10.1016/j.yexcr.2013.02.013. 841 Abril JF, Castelo R, Guigó R. 2005. Comparison of splice sites in mammals and chicken. 842 Genome Res. 15:111–119. doi: 10.1101/gr.3108805. 843 Akama TO et al. 2000. Macular corneal dystrophy type I and type II are caused by distinct 844 mutations in a new sulphotransferase gene. Nat. Genet. 26:237–41. doi: 10.1038/79987. 845 Akama TO et al. 2001. Human Corneal GlcNAc 6-O-Sulfotransferase and Mouse Intestinal 846 GlcNAc 6-O-Sulfotransferase Both Produce Keratan Sulfate. J. Biol. Chem. 276:16271– 847 16278. doi: 10.1074/jbc.M009995200. 848 Akama TO, Misra AK, Hindsgaul O, Fukuda MN. 2002. Enzymatic Synthesis in Vitro of the 849 Disulfated Disaccharide Unit of Corneal Keratan Sulfate. J. Biol. Chem. 277:42505–42513. 850 doi: 10.1074/jbc.M207412200. 851 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search 852 tool. J. Mol. Biol. 215:403–10. doi: 10.1006/jmbi.1990.9999. 853 Amemiya CT et al. 2013. The African coelacanth genome provides insights into tetrapod 854 evolution. Nature. 496:311–316. doi: 10.1038/nature12027. 855 Amores A, Catchen JM, Ferrara A, Fontenot Q, Postlethwait JH. 2011. Genome evolution 856 and meiotic maps by massively parallel DNA sequencing: spotted gar, an outgroup for the 857 teleost genome duplication. Genetics. 188:799–808. doi: 10.1534/genetics.111.127324. 858 Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio test for branches: A fast, 859 accurate, and powerful alternative. Syst. Biol. 55:539–52. doi: 10.1080/10635150600755453. 860 Anisimova M, Gil M, Dufayard J-F, Dessimoz C, Gascuel O. 2011. Survey of branch support 861 methods demonstrates accuracy, power, and robustness of fast likelihood-based 862 approximation schemes. Syst. Biol. 60:685–99. 
doi: 10.1093/sysbio/syr041. 863 Berthelot C et al. 2014. The rainbow trout genome provides novel insights into evolution 864 after whole-genome duplication in vertebrates. Nat. Commun. 5:3657. doi: 865 10.1038/ncomms4657. 866 Bian C et al. 2016. The Asian arowana (Scleropages formosus) genome provides new 867 insights into the evolution of an early lineage of teleosts. Sci. Rep. 6:24501. doi: 868 10.1038/srep24501.Bistrup A et al. 1999. Sulfotransferases of two specificities function in 869 the reconstitution of high endothelial cell ligands for L-selectin. J. Cell Biol. 145:899–910. 870 doi: 10.1083/jcb.145.4.899. 871 Bistrup A et al. 2004. Detection of a Sulfotransferase (HEC-GlcNAc6ST) in High 872 Endothelial Venules of Lymph Nodes and in High Endothelial Venule-Like Vessels within 873 Ectopic Lymphoid Aggregates. Am. J. Pathol. 164:1635–1644. doi: 10.1016/S0002- 874 9440(10)63722-4. 875 Boot-Handford RP, Tuckwell DS. 2003. Fibrillar collagen: The key to vertebrate evolution? 876 A tale of molecular incest. BioEssays. 25:142–151. doi: 10.1002/bies.10230. 877 Braasch I et al. 2016. The spotted gar genome illuminates vertebrate evolution and facilitates 878 human-teleost comparisons. Nat. Genet. doi: 10.1038/ng.3526. 879 Brown GR et al. 2014. Gene: a gene-centered information resource at NCBI. Nucleic Acids 880 Res. 43:36–42. doi: 10.1093/nar/gku1055. 881 Canese K et al. 2017. Database resources of the National Center for Biotechnology

35 bioRxiv preprint doi: https://doi.org/10.1101/667535; this version posted June 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

882 Information. Nucleic Acids Res. 46:D8–D13. doi: 10.1093/nar/gkx1095. 883 Carstens N et al. 2016. Novel mutation in the CHST6 gene causes macular corneal dystrophy 884 in a black South African family. BMC Med. Genet. 17:1–9. doi: 10.1186/s12881-016-0308-0. 885 Chen Z et al. 2018. De Novo assembly of the goldfish (Carassius auratus) genome and the 886 evolution of genes after whole genome duplication. bioRxiv. 373431. doi: 10.1101/373431. 887 Christensen KA et al. 2018. The arctic charr (salvelinus alpinus) genome and transcriptome 888 assembly. PLoS One. 13:1–30. doi: 10.1371/journal.pone.0204076. 889 Curwen V et al. 2004. The Ensembl Automatic Gene Annotation System. Genome Res. 890 14:942–950. doi: 10.1101/gr.1858004. 891 Dehal P, Boore JL. 2005. Two rounds of whole genome duplication in the ancestral 892 vertebrate. PLOS Biol. 3:e314. doi: 10.1371/journal.pbio.0030314. 893 Du X et al. 2017. The pearl oyster Pinctada fucata martensii genome and multi-omic analyses 894 provide insights into biomineralization. Gigascience. 6:1–12. doi: 895 10.1093/gigascience/gix059. 896 Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high 897 throughput. Nucleic Acids Res. 32:1792–7. doi: 10.1093/nar/gkh340. 898 El-Gebali S et al. 2019. The Pfam protein families database in 2019. Nucleic Acids Res. 899 47:D427–D432. doi: 10.1093/nar/gky995. 900 Feiner N, Motone F, Meyer A, Kuraku S. 2019. Asymmetric paralog evolution between the 901 “cryptic” gene Bmp16 and its well-studied sister genes Bmp2 and Bmp4. Sci. Rep. 9:1–13. 902 doi: 10.1038/s41598-019-40055-1. 903 Finn RD, Clements J, Eddy SR. 2011. HMMER web server: interactive sequence similarity 904 searching. Nucleic Acids Res. 39 Suppl 2:W29-37. doi: 10.1093/nar/gkr367. 905 Fukuta M et al. 1995. Molecular cloning and expression of chick chondrocyte chondroitin 6- 906 sulfotransferase. J. Biol. Chem. 270:18575–18580. doi: 10.1074/jbc.270.31.18575. 
907 Fukuta M, Kobayashi Y, Uchimura K, Kimata K, Habuchi O. 1998. Molecular cloning and 908 expression of human chondroitin 6-sulfotransferase. Biochim. Biophys. Acta - Gene Struct. 909 Expr. 1399:57–61. doi: 10.1016/S0167-4781(98)00089-X. 910 Glasauer SMK, Neuhauss SCF. 2014. Whole-genome duplication in teleost fishes and its 911 evolutionary consequences. Mol. Genet. Genomics. 289:1045–1060. doi: 10.1007/s00438- 912 014-0889-2. 913 Habicher J et al. 2015. Chondroitin / Dermatan Sulfate Modification Enzymes in Zebrafish 914 Development. PLoS One. 10:e0121957. doi: 10.1371/journal.pone.0121957. 915 Habuchi H et al. 2003. Biosynthesis of heparan sulphate with diverse structures and 916 functions: two alternatively spliced forms of human heparan sulphate 6-O-sulphotransferase- 917 2 having different expression patterns and properties. Biochem. J. 371:131–42. doi: 918 10.1042/BJ20021259. 919 Habuchi O. 2014. Carbohydrate (Chondroitin 6) Sulfotransferase 3; Carbohydrate (N- 920 Acetylglucosamine 6-O) Sulfotransferase 7 (CHST3,7). In: Handbook of 921 Glycosyltransferases and Related Genes.Vol. 1–2 Springer Japan: Tokyo pp. 979–987. doi: 922 10.1007/978-4-431-54240-7_41. 923 Hayashi M, Kadomatsu K, Kojima T, Ishiguro N. 2011. Keratan sulfate and related murine 924 glycosylation can suppress murine cartilage damage in vitro and in vivo. Biochem. Biophys. 925 Res. Commun. 409:732–737. doi: 10.1016/j.bbrc.2011.05.077.

36 bioRxiv preprint doi: https://doi.org/10.1101/667535; this version posted June 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

926 Hedges SB, Kumar S. 2009. The Timetree of Life. Oxford University Press, New York. 927 Hemmerich S, Bistrup A, et al. 2001. Sulfation of L-selectin ligands by an HEV-restricted 928 sulfotransferase regulates lymphocyte homing to lymph nodes. Immunity. 15:237–247. doi: 929 10.1016/S1074-7613(01)00188-1. 930 Hemmerich S, Lee JK, et al. 2001. Chromosomal localization and genomic organization for 931 the galactose/ N-acetylgalactosamine/N-acetylglucosamine 6-O-sulfotransferase gene family. 932 Glycobiology. 11:75–87. http://www.ncbi.nlm.nih.gov/pubmed/11181564.Holland LZ, 933 Ocampo Daza D. 2018. A new look at an old question : when did the second whole genome 934 duplication occur in vertebrate evolution ? Genome Biol. 19. doi: 10.1186/s13059-018-1592- 935 0. 936 Hoshino H et al. 2014. KSGal6ST Is Essential for the 6-Sulfation of Galactose within 937 Keratan Sulfate in Early Postnatal Brain. J. Histochem. Cytochem. 62:145–156. doi: 938 10.1369/0022155413511619. 939 Hwang J-I et al. 2013. Expansion of Secretin-Like G Protein-Coupled Receptors and Their 940 Peptide Ligands via Local Duplications Before and After Two Rounds of Whole-Genome 941 Duplication. Mol. Biol. Evol. 30:1119–1130. doi: 10.1093/molbev/mst031. 942 Kalyaanamoorthy S, Minh BQ, Wong TKF, Von Haeseler A, Jermiin LS. 2017. 943 ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods. 944 14:587–589. doi: 10.1038/nmeth.4285. 945 Kasahara M et al. 2007. The medaka draft genome and insights into vertebrate genome 946 evolution. Nature. 447:714–9. doi: 10.1038/nature05846. 947 Kitagawa H, Fujita M, Ito N, Sugahara K. 2000. Molecular Cloning and Expression of a 948 Novel Chondroitin 6- O -Sulfotransferase. J. Biol. Chem. 275:21075–21080. doi: 949 10.1074/jbc.M002101200. 950 Kitts PA et al. 2016. Assembly: A resource for assembled genomes at NCBI. Nucleic Acids 951 Res. 44:D73–D80. doi: 10.1093/nar/gkv1226. 952 Kjellén L, Lindahl U. 1991. 
Proteoglycans: Structures and Interactions. Annu. Rev. Biochem. 953 60:443–475. doi: 10.1146/annurev.bi.60.070191.002303. 954 Kobayashi T et al. 2010. Functional analysis of chick heparan sulfate 6-O-sulfotransferases in 955 limb bud development. Dev. Growth Differ. 52:146–156. doi: 10.1111/j.1440- 956 169X.2009.01148.x. 957 Kuraku S. 2010. Palaeophylogenomics of the vertebrate ancestor--impact of hidden paralogy 958 on hagfish and lamprey gene phylogeny. Integr. Comp. Biol. 50:124–9. doi: 959 10.1093/icb/icq044. 960 Kuraku S. 2013. Impact of asymmetric gene repertoire between cyclostomes and 961 gnathostomes. Semin. Cell Dev. Biol. 24:119–27. doi: 10.1016/j.semcdb.2012.12.009. 962 Kuraku S, Meyer A. 2009. The evolution and maintenance of Hox gene clusters in 963 vertebrates and the teleost-specific genome duplication. Int. J. Dev. Biol. 53:765–73. doi: 964 10.1387/ijdb.072533km. 965 Kuraku S, Meyer A, Kuratani S. 2009. Timing of genome duplications relative to the origin 966 of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 26:47–59. doi: 967 10.1093/molbev/msn222. 968 Kusche-Gullberg M, Kjellén L. 2003. Sulfotransferases in glycosaminoglycan biosynthesis. 969 Curr. Opin. Struct. Biol. 13:605–611. doi: 10.1016/j.sbi.2003.08.002.

37 bioRxiv preprint doi: https://doi.org/10.1101/667535; this version posted June 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

970 Lagman D et al. 2013. The vertebrate ancestral repertoire of visual opsins, transducin alpha 971 subunits and oxytocin/vasopressin receptors was established by duplication of their shared 972 genomic region in the two rounds of early vertebrate genome duplications. BMC Evol. Biol. 973 13:238. doi: 10.1186/1471-2148-13-238. 974 Larsson A. 2014. AliView: a fast and lightweight alignment viewer and editor for large data 975 sets. Bioinformatics. 30:btu531-. doi: 10.1093/bioinformatics/btu531. 976 Lee JK, Bhakta S, Rosen SD, Hemmerich S. 1999. Cloning and characterization of a 977 mammalian N-acetylglucosamine-6-sulfotransferase that is highly restricted to intestinal 978 tissue. Biochem. Biophys. Res. Commun. 263:543–9. doi: 10.1006/bbrc.1999.1324. 979 Li X, Tedder TF. 1999. CHST1 and CHST2 sulfotransferases expressed by human vascular 980 endothelial cells: cDNA cloning, expression, and chromosomal localization. Genomics. 981 55:345–7. doi: 10.1006/geno.1998.5653. 982 Lien S et al. 2016. The Atlantic salmon genome provides insights into rediploidization. 983 Nature. 533:200–205. doi: 10.1038/nature17164. 984 Louis A, Nguyen NTT, Muffato M, Roest Crollius H. 2015. Genomicus update 2015: 985 KaryoView and MatrixView provide a genome-wide perspective to multispecies comparative 986 genomics. Nucleic Acids Res. 43:D682–D689. doi: 10.1093/nar/gku1112. 987 Maglott D, Ostell J, Pruitt KD, Tatusova T. 2011. Entrez gene: Gene-centered information at 988 NCBI. Nucleic Acids Res. 39:54–58. doi: 10.1093/nar/gkq1237. 989 Marques CL et al. 2016. Comparative analysis of zebrafish bone morphogenetic 2, 4 990 and 16: Molecular and evolutionary perspectives. Cell. Mol. Life Sci. 73:841–857. doi: 991 10.1007/s00018-015-2024-x. 992 Mehta TK et al. 2013. Evidence for at least six Hox clusters in the Japanese lamprey 993 (Lethenteron japonicum). Proc. Natl. Acad. Sci. U. S. A. 110:16044–9. doi: 994 10.1073/pnas.1315760110. 995 Meyer A, van de Peer Y. 2005. 
From 2R to 3R: evidence for a fish-specific genome 996 duplication (FSGD). BioEssays. 27:937–45. doi: 10.1002/bies.20293. 997 Minh BQ, Nguyen MAT, Von Haeseler A. 2013. Ultrafast approximation for phylogenetic 998 bootstrap. Mol. Biol. Evol. 30:1188–1195. doi: 10.1093/molbev/mst024. 999 Miyata S, Kitagawa H. 2016. Chondroitin 6-Sulfation Regulates Perineuronal Net Formation 1000 by Controlling the Stability of Aggrecan. Neural Plast. 2016:7–9. doi: 1001 10.1155/2016/1305801. 1002 Muffato M, Louis A, Poisnel C-E, Roest Crollius H. 2010. Genomicus: a database and a 1003 browser to study gene synteny in modern and ancestral genomes. Bioinformatics. 26:1119– 1004 21. doi: 10.1093/bioinformatics/btq079. 1005 Nagai N, Kimata K. 2014. Heparan-Sulfate 6-O-Sulfotransferase 1-3 (HS6ST1-3). In: 1006 Handbook of Glycosyltransferases and Related Genes.Vol. 1–2 Springer Japan: Tokyo pp. 1007 1067–1080. doi: 10.1007/978-4-431-54240-7_68. 1008 Nakagawa S, Niimura Y, Gojobori T, Tanaka H, Miura KI. 2008. Diversity of preferred 1009 nucleotide sequences around the translation initiation codon in eukaryote genomes. Nucleic 1010 Acids Res. 36:861–871. doi: 10.1093/nar/gkm1102. 1011 Nakatani Y, McLysaght A. 2017. Genomes as documents of evolutionary history: a 1012 probabilistic macrosynteny model for the reconstruction of ancestral genomes. 1013 Bioinformatics. 33:i369–i378. doi: 10.1093/bioinformatics/btx259.

Nakatani Y, Takeda H, Kohara Y, Morishita S. 2007. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res. 17:1254–65. doi: 10.1101/gr.6316407.
Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ. 2015. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32:268–274. doi: 10.1093/molbev/msu300.
Nogami K et al. 2004. Distinctive Expression Patterns of Heparan Sulfate O-Sulfotransferases and Regional Differences in Heparan Sulfate Structure in Chick Limb Buds. J. Biol. Chem. 279:8219–8229. doi: 10.1074/jbc.M307304200.
Ocampo Daza D, Larhammar D. 2018. Evolution of the growth hormone, prolactin, prolactin 2 and somatolactin family. Gen. Comp. Endocrinol. doi: 10.1016/j.ygcen.2018.01.007.
Parfitt GJ et al. 2011. Electron tomography reveals multiple self-association of chondroitin sulphate/dermatan sulphate proteoglycans in Chst5-null mouse corneas. J. Struct. Biol. 174:536–541. doi: 10.1016/j.jsb.2011.03.015.
Pasquier J et al. 2016. Gene evolution and gene expression after whole genome duplication in fish: the PhyloFish database. BMC Genomics. 17:368. doi: 10.1186/s12864-016-2709-z.
Perry E et al. 2018. Ensembl 2019. Nucleic Acids Res. 47:D745–D751. doi: 10.1093/nar/gky1113.
Prosdocimi F, Linard B, Pontarotti P, Poch O, Thompson JD. 2012. Controversies in modern evolutionary biology: the imperative for error detection and quality control. BMC Genomics. 13:5. doi: 10.1186/1471-2164-13-5.
Putnam NH et al. 2008. The amphioxus genome and the evolution of the chordate karyotype. Nature. 453:1064–71. doi: 10.1038/nature06967.
Ravi V et al. 2009. Elephant shark (Callorhinchus milii) provides insights into the evolution of Hox gene clusters in gnathostomes. Proc. Natl. Acad. Sci. U. S. A. 106:16327–32. doi: 10.1073/pnas.0907914106.
Rogozin IB, Carmel L, Csürös M, Koonin E V. 2012. Origin and evolution of spliceosomal introns. Biol. Direct. 7:11. doi: 10.1186/1745-6150-7-11.
Rubinstein Y et al. 2016. Macular corneal dystrophy and posterior corneal abnormalities. Cornea. 35:1605–1610. doi: 10.1097/ICO.0000000000001054.
Sacerdot C, Louis A, Bon C, Berthelot C, Roest Crollius H. 2018. Chromosome evolution at the origin of the ancestral vertebrate genome. Genome Biol. 19:166. doi: 10.1186/s13059-018-1559-1.
Seko A et al. 2009. N-Acetylglucosamine 6-O-sulfotransferase-2 as a tumor marker for uterine cervical and corpus cancer. Glycoconj. J. 26:1065–1073. doi: 10.1007/s10719-008-9227-4.
Session AM et al. 2016. Genome evolution in the allotetraploid frog Xenopus laevis. Nature. 538:336–343. doi: 10.1038/nature19840.
Small CM et al. 2016. The genome of the Gulf pipefish enables understanding of evolutionary innovations. Genome Biol. 17:258. doi: 10.1186/s13059-016-1126-6.
Smith JJ et al. 2013. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nat. Genet. 45:415–21. doi: 10.1038/ng.2568.
Smith JJ, Keinath MC. 2015. The sea lamprey meiotic map improves resolution of ancient vertebrate genome duplications. Genome Res. 25:1081–1090. doi: 10.1101/gr.184135.114.

Soares da Costa D, Reis RL, Pashkuleva I. 2017. Sulfation of Glycosaminoglycans and Its Implications in Human Health and Disorders. Annu. Rev. Biomed. Eng. 19:1–26. doi: 10.1146/annurev-bioeng-071516-044610.
Souza ARC et al. 2007. Chondroitin sulfate and keratan sulfate are the major glycosaminoglycans present in the adult zebrafish Danio rerio (Chordata-Cyprinidae). Glycoconj. J. 24:521–530. doi: 10.1007/s10719-007-9046-z.
Srivastava P, Pandey H, Agarwal D, Mandal K, Phadke SR. 2017. Spondyloepiphyseal dysplasia Omani type: CHST3 mutation spectrum and phenotypes in three Indian families. Am. J. Med. Genet. Part A. 173:163–168. doi: 10.1002/ajmg.a.37996.
Sundström G, Larsson TA, Larhammar D. 2008. Phylogenetic and chromosomal analyses of multiple gene families syntenic with vertebrate Hox clusters. BMC Evol. Biol. 8:254. doi: 10.1186/1471-2148-8-254.
Takeda-Uchimura Y et al. 2015. Requirement of keratan sulfate proteoglycan phosphacan with a specific sulfation pattern for critical period plasticity in the visual cortex. Exp. Neurol. 274:145–155. doi: 10.1016/j.expneurol.2015.08.005.
Taniguchi N et al. 2012. Carbohydrate (N-Acetylglucosamine 6-O) Sulfotransferase 5 and 6 (CHST5,6). In: Handbook of Glycosyltransferases and Related Genes. Vol. 6 pp. 1005–1014. doi: 10.1007/978-4-431-54240-7.
Tetsukawa A, Nakamura J, Fujiwara S. 2010. Identification of chondroitin/dermatan sulfotransferases in the protochordate, Ciona intestinalis. Comp. Biochem. Physiol. B. Biochem. Mol. Biol. 157:205–12. doi: 10.1016/j.cbpb.2010.06.009.
Thisse B, Thisse C. 2004. Fast release clones: a high throughput expression analysis. ZFIN direct data Submission. https://zfin.org/ZDB-PUB-040907-1.
Tsutsumi K, Shimakawa H, Kitagawa H, Sugahara K. 1998. Functional expression and genomic structure of human chondroitin 6-sulfotransferase. FEBS Lett. 441:235–241. doi: 10.1016/S0014-5793(98)01532-4.
Wada H. 2010. Origin and Genetic Evolution of the Vertebrate Skeleton. Zoolog. Sci. 27:119–123. doi: 10.2108/zsj.27.119.
Waryah AM et al. 2016. A novel CHST3 allele associated with spondyloepiphyseal dysplasia and hearing loss in Pakistani kindred. Clin. Genet. 90:90–95. doi: 10.1111/cge.12694.
Yamada S, Sugahara K, Özbek S. 2011. Evolution of glycosaminoglycans: Comparative biochemical study. Commun. Integr. Biol. 4:150–158. doi: 10.4161/cib.4.2.14547.
Yamamoto Y, Takahashi I, Ogata N, Nakazawa K. 2001. Purification and characterization of N-acetylglucosaminyl sulfotransferase from chick corneas. Arch. Biochem. Biophys. 392:87–92. doi: 10.1006/abbi.2001.2422.
Yusa A, Kitajima K, Habuchi O. 2006. N-linked oligosaccharides on chondroitin 6-sulfotransferase-1 are required for production of the active enzyme, golgi localization, and sulfotransferase activity toward keratan sulfate. J. Biol. Chem. 281:20393–20403. doi: 10.1074/jbc.M600140200.
Zhang G, Cohn MJ. 2008. Genome duplication and the origin of the vertebrate skeleton. Curr. Opin. Genet. Dev. 18:387–393. doi: 10.1016/j.gde.2008.07.009.
Zhang Z et al. 2017. Deficiency of a sulfotransferase for sialic acid-modified glycans mitigates Alzheimer’s pathology. Proc. Natl. Acad. Sci. 114:E2947–E2954. doi: 10.1073/pnas.1615036114.

Figure and table legends:

Table 1. Nomenclature of the carbohydrate 6-O sulfotransferases.

Fig. 1. Maximum likelihood phylogeny of chordate C6OST sequences. The phylogeny is supported by approximate Likelihood-Ratio Test (aLRT) and UltraFast Bootstrap (UFBoot) analyses. UFBoot support values for deep nodes are shown. Red filled arrowheads indicate unsupported nodes (≤ 75%) in both UFBoot and aLRT, yellow filled arrowheads indicate nodes with low aLRT support only, red open arrowheads indicate nodes with low UFBoot support only. The phylogeny is rooted with the Drosophila melanogaster C6OST sequences CG9550 and CG31637.
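The arrowhead scheme in the Fig. 1 legend can be read as a small decision rule on the two support values of a node. The following is a minimal sketch of that rule, not the authors' plotting code. It assumes that "low" support means ≤ 75% for both UFBoot and aLRT (the legend states the threshold explicitly only for nodes that are unsupported in both analyses), and the function name `node_marker` is hypothetical.

```python
from typing import Optional

# Threshold taken from the Fig. 1 legend ("≤ 75%"). Applying the same
# cutoff to the "low aLRT only" and "low UFBoot only" cases is an assumption.
LOW_SUPPORT_THRESHOLD = 75.0


def node_marker(ufboot: float, alrt: float,
                threshold: float = LOW_SUPPORT_THRESHOLD) -> Optional[str]:
    """Return the arrowhead described in the legend, or None if the node
    is supported in both analyses. Support values are percentages."""
    low_ufboot = ufboot <= threshold
    low_alrt = alrt <= threshold
    if low_ufboot and low_alrt:
        return "red filled arrowhead"     # unsupported in both analyses
    if low_alrt:
        return "yellow filled arrowhead"  # low aLRT support only
    if low_ufboot:
        return "red open arrowhead"       # low UFBoot support only
    return None                           # no marker drawn


if __name__ == "__main__":
    # Hypothetical support values, for illustration only.
    for ufboot, alrt in [(70, 60), (99, 75), (75, 98), (100, 100)]:
        print(ufboot, alrt, node_marker(ufboot, alrt))
```

Keeping the threshold as a parameter makes it easy to check how many nodes would change category under a stricter or more lenient cutoff.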

Fig. 2. A. Phylogeny of the CHST1 and CHST16 branch of C6OST sequences. The placement of the branch within the full C6OST phylogeny is indicated in the bottom left. Sequence names include species names followed by chromosome/linkage group designations (if available) and gene symbols. Asterisks indicate incomplete sequences. For node support details, see figure 1 caption. Some node support values for shallow nodes have been omitted for visual clarity. The fast-evolving branch of neoteleost CHST1b sequences is highlighted. B. Conserved synteny between CHST1- and CHST16-bearing chromosome regions in tetrapods and the spotted gar, including CHST1a-, CHST1b- and CHST16-bearing chromosome regions in teleost fishes. Stars indicate chromosome regions in common between two different conserved synteny analyses: CHST1- and CHST16-bearing regions (yellow labels), and CHST1a- and CHST1b-bearing regions (blue labels).
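The sequence naming convention described in the Fig. 2 legend (species name, optional chromosome or linkage group, gene symbol, trailing asterisk for incomplete sequences) can be illustrated with a short parser for tip labels such as Zebrafish.18.CHST16 or Cloudy.catshark.CHST16*. This is a heuristic sketch only: the dot-separated layout is inferred from the labels in the figures, and the `TipLabel` type, the `parse_tip_label` function and the chromosome pattern are hypothetical helpers, not part of the original analysis.

```python
import re
from typing import NamedTuple, Optional


class TipLabel(NamedTuple):
    species: str
    chromosome: Optional[str]  # None when no designation is available
    gene: str
    incomplete: bool           # True when the label ends with an asterisk


# Assumed shapes of chromosome/linkage group tokens, based on labels such as
# "18", "4L", "8-24", "LG19", "XIX" and "Un". This pattern is a guess.
_CHROM = re.compile(r"^(?:\d+[A-Za-z]?(?:-\d+)?|LG\d+|Un|[IVX]+)$")
# Gene symbols in the figures start with CHST/chst or C6OST.
_GENE_START = re.compile(r"^(?:chst|c6ost)", re.IGNORECASE)


def parse_tip_label(label: str) -> TipLabel:
    """Split a dot-separated tip label into species, chromosome and gene."""
    incomplete = label.endswith("*")
    tokens = label.rstrip("*").split(".")
    gene_idx = next(
        (i for i, tok in enumerate(tokens) if _GENE_START.match(tok)), None
    )
    if gene_idx is None or gene_idx == 0:
        raise ValueError(f"cannot locate species and gene symbol in {label!r}")
    head = tokens[:gene_idx]
    chromosome = None
    # The last token before the gene symbol is treated as a chromosome
    # only if it looks like one and a species name remains.
    if len(head) > 1 and _CHROM.match(head[-1]):
        chromosome = head.pop()
    gene = ".".join(tokens[gene_idx:])  # keeps suffixes such as ".1" or ".L"
    return TipLabel(" ".join(head), chromosome, gene, incomplete)


if __name__ == "__main__":
    for name in ["Zebrafish.18.CHST16", "Cloudy.catshark.CHST16*",
                 "X.laevis.4L.chst1.L", "Stickleback.XIX.CHST16"]:
        print(parse_tip_label(name))
```

Labels that do not follow this layout raise a `ValueError` rather than being guessed at, which seems the safer choice for a heuristic of this kind.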

Fig. 3. A. Phylogeny of the CHST3 branch of C6OST sequences. See figure 2 caption for details. B. Conserved synteny across CHST3-, CHST3a- and CHST3b-bearing chromosome regions.

Fig. 4. A. Phylogeny of the CHST2 and CHST7 branch of C6OST sequences. See figure 2 caption for details. B. Conserved synteny between CHST2- and CHST7-bearing chromosome regions in tetrapods and spotted gar, including CHST2a- and CHST2b-bearing regions in teleost fishes.

Fig. 5. Phylogeny of the CHST4, CHST5 and related C6OST sequences. See figure 2 caption for details.

Fig. 6. C6OST gene repertoires in selected vertebrate species. Open boxes indicate genes with unresolved phylogenetic positions. Ancestral teleost, Atlantic salmon and Xenopus laevis duplicate gene designations are shown within the boxes. Other species-specific or lineage-specific duplicates are indicated by “X2” etc. within the boxes. Additional CHST4/5-like and CHST5.2 genes in non-avian reptiles and amphibians are not shown here (see supplementary Fig. S7).

Fig. 7. Evolutionary scenario of the C6OST family in chordates. Seven-pointed stars indicate whole genome duplications. Crossed-out boxes indicate gene losses. Our analyses indicate that CHST1 and CHST16 genes, as well as CHST2 and CHST7 genes, arose from ancestral CHST1/16 and CHST2/7 genes through either 1R (“A”) or 2R (“B”). The uncertain position of jawless vertebrates relative to 1R and 2R is indicated by dashed lines.

Table 1. Nomenclature of the carbohydrate 6-O sulfotransferases.

Gene name   Synonyms                               Human chr.
CHST1       KS6ST, KSGal6ST, GST1                  11
CHST2       GlcNAc6ST1, Gn6ST1, GST2               3
CHST3       C6ST1, GST0                            10
CHST4       GlcNAc6ST2, HEC-GlcNAc6ST, GST3        16
CHST5       GlcNAc6ST3, I-GlcNAc6ST, GST4-alpha    16
CHST6       GlcNAc6ST5, C-GlcNAc6ST, GST4-beta     16
CHST7       C6ST2, GlcNAc6ST4, GST5                X

[Figure 1 image: phylogenetic tree with collapsed clades labeled Tunicate C6OSTs, CHST3, CHST16, CHST1, CHST7, CHST2, CHST4/5-like, CHST4 and CHST5. Node support values are not reproduced here.]

[Figure 2 image. Panel A: phylogenetic tree with clades labeled CHST16, CHST1, CHST1b and CHST1a. Panel B: chromosome maps of conserved synteny for human, chicken, western clawed frog, spotted gar, zebrafish and medaka, with 10 Mb scale bars. Tip labels, gene positions and node support values are not reproduced here.]

[Figure 3 image. Panel A: phylogenetic tree with clades labeled CHST3, CHST3a and CHST3b, shown together with vase tunicate C6OST sequences. Panel B: chromosome maps of conserved synteny for human, chicken, western clawed frog, spotted gar, zebrafish and medaka, with 10 Mb scale bars. Tip labels, gene positions and node support values are not reproduced here.]

[Figure 4 image. Panel A: phylogenetic tree with clades labeled CHST7, CHST2, CHST2a and CHST2b, shown together with lamprey and hagfish CHST2/7-like sequences. Panel B: chromosome maps of conserved synteny for human, chicken, western clawed frog, spotted gar, zebrafish and medaka, with 10 Mb scale bars. Tip labels, gene positions and node support values are not reproduced here.]

[Figure 5 image: phylogenetic tree with clades labeled CHST4/5-like, CHST5.2, CHST4 and CHST5. Tip labels and node support values are not reproduced here.]

[Figure: C6OST gene repertoire across the investigated species. Columns: CHST1, CHST16, CHST3, CHST2, CHST7, CHST4, CHST5 and CHST6, with additional CHST4/5-like and CHST5.2 labels. Rows: Human, Mouse, Chicken, Emperor penguin, Saltwater crocodile, Tuatara, Anole lizard, Eastern brown snake, Green sea turtle, Two-lined caecilian, Parker's slow frog, Western clawed frog, African clawed frog, Axolotl, Coelacanth, Spotted gar, Japanese eel, Atlantic herring, Zebrafish, Cave tetra, Channel catfish, Atlantic salmon, Gulf pipefish, Medaka, Southern platyfish, Nile tilapia, Orange clownfish, Stickleback, European sea bass, Elephant shark, Whale shark, Thorny skate, Inshore hagfish, Arctic lamprey and Sea lamprey. Cells carry paralog suffixes (-1a/-1b, -2a/-2b and -3a/-3b in teleost fishes; .1/.2 copies of -1a, -1b, -16 and -2b in Atlantic salmon; .L/.S copies of -1, -3, -2, -7, -4, -5.1 and -5.2 in African clawed frog) and copy-number markers (X2 to X5) for lineage-specific duplicates.]

[Figure: proposed evolutionary scenario for the C6OST family in chordates. Recoverable labels:
Ancestral vertebrate: CHST1/16, CHST2/7, CHST3, CHST5.
Ancestral jawed vertebrate, after 1R and 2R: CHST1, CHST16, CHST2, CHST7, CHST3, CHST5.
Jawless vertebrates: multiple duplications of C6OST genes.
Ancestral teleost fish, after 3R: duplicates including CHST1a, CHST1b, CHST2a, CHST2b, CHST3a and CHST3b.
Multiple duplications of CHST2/7-like and CHST4/5-like genes.
Lineage-specific annotations for -1a, -1b, -2a and -3a among teleost lineages (Otocephala, Cypriniforms, Euteleosts, Salmoniforms, Neoteleosts, some Acanthomorph lineages), and an annotation in Lepidosaurs.
Duplication of CHST1a, -1b, -16 and -2b in the salmoniform 4R.
In primates: CHST4, CHST5, CHST6. In frogs: CHST4, CHST5.1, CHST5.2.
Branch labels: Chordates, Tunicates, Vertebrates, Jawless vertebrates, Jawed vertebrates, Cartilaginous fishes, Bony vertebrates, Ray-finned fishes, Spotted gar, Teleost fishes, Lobe-finned fishes, Coelacanth, Tetrapods, Amphibians, Amniotes, Mammals, Birds and non-avian reptiles.]