bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 The laboratory domestication of : from

2 diverse populations to inbred substrains

3

4 Jaanus Suurväli1, Andrew R Whiteley2, Yichen Zheng1, Karim Gharbi3,4, Maria Leptin1, 5 Thomas Wiehe1

6

7 1 Institute for Genetics, University of Cologne, Cologne, Germany

8 2 Wildlife Biology Program, Department of Ecosystem and Conservation Sciences, College of 9 Forestry and Conservation, University of Montana, Missoula, MT, USA

10 3 Edinburgh Genomics, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK

11 4 Earlham Institute, Norwich Research Park, Norwich, UK

12

13 Corresponding author: Jaanus Suurväli, [email protected]

14

15 KEYWORDS: zebrafish, RAD-seq, , genetic differentiation, wild 16 populations, laboratory strains,

17

1

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

18 Abstract

19 The zebrafish (Danio rerio) is a model vertebrate widely used to study disease, development 20 and other aspects of vertebrate biology. Most of the research is performed on laboratory 21 strains, one of which has been fully sequenced in order to derive a reference genome. It is 22 known that the laboratory strains differ genetically from each other, but so far no genome- 23 scale survey of variation between the laboratory and wild zebrafish populations exists.

24 Here we use Restriction-Associated DNA sequencing (RAD-seq) to characterize three 25 different wild zebrafish lineages from a population genetic viewpoint, and to compare them to 26 four common laboratory strains. For this purpose we combine new genome-wide sequence 27 data obtained from natural samples in India, Nepal and Bangladesh with a previously 28 published dataset. We measured nucleotide diversity, heterozygosity, allele frequency spectra 29 and patterns of gene conversion, and find that wild fish are much more diverse than laboratory 30 strains. Further, in wild zebrafish there is a clear signal of GC-biased gene conversion that is 31 missing in laboratory strains. We also find that zebrafish populations in Nepal and 32 Bangladesh are distinct from all the other strains studied, making them an attractive subject 33 for future studies of zebrafish population genetics and molecular ecology. Finally, isolates of 34 the same strains kept in different laboratories show a clear pattern of ongoing differentiation 35 into genetically distinct substrains. Together, our findings broaden the basis for future genetic 36 and evolutionary studies in Danio rerio.

37

2

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

38 Introduction

39 Zebrafish (Danio rerio) is a small fish native to freshwater bodies of the subtropical India, 40 Nepal and Bangladesh (Whiteley, et al. 2011; Parichy 2015). In laboratory conditions the 41 zebrafish breed often, produce many offspring, are relatively easy to grow and have 42 translucent embryos, making them an excellent model species for drug screening, 43 developmental biology and genetics. A common feature of zebrafish laboratory strains is that 44 they have gone through independent genetic bottlenecks followed by different regimes of 45 inbreeding. Therefore, one expects to observe a substantial reduction of genetic diversity 46 when comparing laboratory strains with wild populations. Due to their independent origins, 47 laboratory strains are expected to genetically differ not only from the wild fish but also from 48 each other. Hence, conclusions drawn from a single laboratory strain do not necessarily apply 49 to the other strains, nor are they fully representative for wild populations (Brown, et al. 2012; 50 Butler, et al. 2015; Baker, et al. 2018; Balik-Meisner, et al. 2018; Holden, et al. 2019).

51 Given the ample resources and results from decades of laboratory zebrafish research, e.g. a 52 high quality reference genome based on the Tübingen (TU) strain (Howe, et al. 2013), studies 53 of natural populations are timely and offer an opportunity to complement the existing 54 knowledge by insights on evolutionary history, behaviour, genetic variation and adaptive 55 potential that could be obtained from studying wild populations in their natural habitats. 56 Genetic polymorphism, and differences among laboratory strains or wild populations, is not 57 usually a problem in the study of highly conserved processes or genes, for instance those 58 directing the development of an organism, or core growth or signalling pathways. The 59 laboratory fish also remain an excellent disease model for diseases based on such pathways. 60 Complex protocols have been established to model many diseases affecting humans, these 61 models can also be used for drug testing in early phases of pharmaceutical development 62 (Cornet, et al. 2018). In contrast, many of the biological functions and characteristics 63 associated with success in natural habitats would be best studied in wild populations, as the 64 sheltered living conditions of laboratory specimens are very distinct from those in the wild. In 65 zebrafish facilities all embryos are sanitized, feeding is regular and standardized and pathogen 66 contact is minimized with strict procedures and protocols (Murray, et al. 2016). The 67 laboratory strains face a substantially reduced variety of diseases and pathogens and their 68 immune response is therefore expected not to be under nearly as strong diversifying selection 69 as it is in the wild. Combined with the bottleneck during the establishment of a new strain it is

3

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

70 expected that within a given strain the immune genes would be much less polymorphic than 71 they are in the wild, hence affecting all results from studies seeking insights into the zebrafish 72 immune system. Furthermore, while in laboratory conditions it is possible to experimentally 73 identify genes involved in a particular metabolic pathway and even to model metabolic 74 diseases affecting humans (Seth, et al. 2013), association studies among wild populations 75 have the potential to reveal and dissect even minor contributors to polygenic traits that give a 76 selective advantage in a particular habitat (Gagnaire and Gaggiotti 2016). Another field that 77 greatly benefits from working with wild fish directly is behavioural research (Bhat, et al. 78 2015; Suriyampola, et al. 2016).

79 A study published in 2011 examined both genomic and mitochondrial variation in several 80 wild zebrafish populations from Southern Asia and revealed three major lineages to be 81 present. Common laboratory strains were found to be genetically closest to fish captured from 82 North-East India (Whiteley, et al. 2011; Wilson, et al. 2014). Sequencing the genome of one 83 wild zebrafish from North-East India revealed nearly seven million differences from the 84 reference genome (Patowary, et al. 2013).

85 Here we report the results of a whole genome survey of genetic variability in a large set of 86 laboratory and wild zebrafish. We performed Restriction-Associated DNA Sequencing (RAD- 87 seq) on wild-caught zebrafish from three major lineages, thus complementing an existing 88 large dataset of RAD-seq data from laboratory and wild-derived fish. Joining both datasets we 89 carried out a comparative analysis of wild and laboratory fish and observed major differences 90 in genetic variability, in levels of heterozygosity and in the allele frequency spectra among, 91 but also within, the two groups (wild and laboratory zebrafish). Mutation patterns in wild fish 92 show a clear bias in favour of G/C which appears to be nearly absent in laboratory zebrafish 93 strains. Among fish, this bias has been previously studied mainly in the three-spined 94 stickleback (Capra and Pollard 2011; Roesti, et al. 2013). Finally, we present evidence that 95 isolates of the same strains obtained from different laboratories show remarkable genetic 96 differentiation and can thus be considered distinct substrains.

97

4

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

98 Results

99 Description of the dataset 100 An overview of the samples is shown on Figure 1. We sequenced ~0.3% of the genome 101 (4,374,886 positions) of 26 wild and 29 laboratory zebrafish, and together with the resulting 102 data, re-analysed 75 wild-derived and 215 laboratory zebrafish from a previously published 103 dataset (Wilson, et al. 2014). 241,238 SNPs (Single Nucleotide Polymorphisms) were 104 identified (overlapping a total of 11,531 genes), of which 124,015 map to introns. 102,302 of 105 the SNPs are transitions (Ti) and 138,936 are transversions (Tv), resulting in a Ti/Tv ratio of 106 1.358, a value that is higher than what can be estimated for other zebrafish datasets (1.203 107 reported in (Stickney, et al. 2002), 1.132 based on biallelic sites in dbSNP). Only 8% of the 108 SNPs (19,202) were reported by Ensembl VEP as previously known. 222,695 (92%) of the 109 variant alleles were present in wild populations, 146,243 of these (60% of the total dataset) 110 were not found in any of the laboratory strains. 18,543 (8%) of the identified variant alleles 111 were present in one or more laboratory strains yet not present in any of the wild fish.

112

113 Figure 1. Zebrafish samples used in the study. Sample descriptions were obtained from publications 114 first describing the fish (Whiteley, et al. 2011; Wilson, et al. 2014), when applicable. n = number of 115 individuals sampled. The map of sampling locations was obtained from Google Earth v7.3.2 (April 04, 116 2019). SIO, NOAA, U.S. Navy, NGA, GEBCO. Image Landsat / Copernicus [December 14, 2015].

5

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

117 Population structure of laboratory and wild zebrafish 118 According to how the data were collected - five laboratory strains, three wild populations and 119 one wild-derived population (CB) - we expected that an admixture analysis would yield a 120 likelihood peak for nine populations. However, admixture analysis revealed that the samples 121 likely come from 13 subpopulations instead. CB appeared as a mixture of three distinct 122 subpopulations; the independently obtained TU and WIK (University of Oregon, 2014 and 123 Zebrafish International Resource Center, 2018) were distinct from one another as well 124 (Figure 2A).

125

126 Figure 2. Population structure of the zebrafish samples. (A) Admixture plot generated by the R 127 package LEA. Each row corresponds to one fish. Colours indicate the proportion of variation shared 128 between individuals. 13 distinct subpopulations are identified, three of them within CB (B) Different 129 zebrafish lineages have fixed differences in the amino acid sequence that are carried by every 130 individual of the strain/population (below the diagonal). This correlates strongly with genetic distances 131 (above the diagonal). (C) Unrooted Maximum Likelihood tree, generated with 1,000 bootstrap 132 replicates.

133 We plotted as heatmaps different measures for mean pairwise genetic distances between 134 populations (Figure 2B, Supplementary Data S1). Wild fish from India (UT) appeared 135 genetically very close to all the laboratory strains in the study. AB was found to be closely 136 related to both TU isolates, however, TU_2018 is much more differentiated from other strains 137 than AB or TU_2014 (Figure 2B). Among the wild fish, CHT and KHA have hundreds of

6

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

138 fixed non-synonymous differences from the other populations (Fig 2B, Supplementary Data 139 S2). Three protein-coding genes were observed to have fixed non-synonymous differences 140 between the two WIK isolates, resulting in the amino acid changes 79Glu->Lys in chitin 141 synthase (chs1; dbSNP id rs507105222), 559Glu->Gly in DNA polymerase eta (polh, dbSNP 142 ID rs502489776) and 2133Leu->Gln in dystonin (dst, no dbSNP id for the SNP).

143 A phylogenetic tree constructed with these population assignments revealed a clear separation 144 of KHA and CHT (Whiteley, et al. 2011) from the rest (Figure 2C). AB and the two TU 145 isolates appear closely related to each other; a similar pattern could be observed for the WIK 146 isolates and for all three CB subgroups. EKW and Nadia appear be monophyletic in the tree, 147 albeit with less bootstrap support (Figure 2C).

148 Four populations/strains had fixed nonsynonymous substitutions not present elsewhere in the 149 data (Supplementary Data S2). These included 48 positions in CHT and 34 in KHA, but also 150 one in the EKW strain (330Asp->Asn in the lipase maturation factor lmf2a) and another in the 151 2018 isolate of WIK (109Ala->Glu in tanc2, a gene for which the mouse knockout is 152 embryonic lethal (Han, et al. 2010)). No sequence data were available for the locus in the 153 2014 isolate of WIK, likely resulting from allelic dropout (Andrews, et al. 2016).

154 Within-strain genetic diversity is higher in the wild zebrafish than in the 155 laboratory strains 156 At the population level, clear differences could be observed between zebrafish from the wild

157 and the laboratory strains. We calculate three different estimators (θπ, θw and θ1) of the scaled 158 mutation rate, θ. All had much higher values in the wild fish and in CB than in the laboratory

159 strains, including the “newer” strain Nadia. In most cases we observe a θπ value of around 160 0.003-0.007 (0.3%-0.7%), with laboratory strains at the lower end of the scale (Figure 3A). 161 This is consistent with previous studies that estimated the genetic diversity in zebrafish to be 162 at around ~0.5%, with wild fish being more diverse than laboratory strains (Guryev, et al. 163 2006; Coe, et al. 2009; Whiteley, et al. 2011; Butler, et al. 2015; Balik-Meisner, et al. 2018).

164 Other estimators of θ are in approximately the same range. In KHA and CHT the value of θπ 165 is similar to what is observed for laboratory strains. Among the laboratory strains, the TU 166 isolate from 2018 (TU_2018) showed the least diversity – less than 0.2% for all estimators 167 (Figure 3A). In contrast, WIK_2018 is more diverse than WIK_2014.

7

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

168

169 Figure 3. Within-strain variability of zebrafish. (A) Three different estimators of the scaled 170 mutation rate, calculated independently for each population, show wild fish (CB, UT, KHA, CHT) to

171 be genetically much more diverse than any of the laboratory strains. θπ – average number of pairwise

172 differences, divided by total length of the sequence. θw – proportion of polymorphic sites, normalized

173 with sample size. θ1 – observed number of singleton mutations, divided by total length of the 174 sequence. Fis – coefficieny of inbreeding. (B) The genomes of laboratory strains contain long stretches 175 of reduced heterozygosity that can vary even between isolates of the supposedly same strain. In CB, 176 observed heterozygosity is usually slightly higher than expected. The long arm of chromosome 4 has

8

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

177 very few polymorphic sites identified in all populations, attributable to its repetitiveness. Individual 178 data points are not indicated; lines represent loess-smoothed averages calculated from the 179 heterozygosity of polymorphic sites. The positions of identified polymorphic sites themselves are 180 shown for each population as a separate track at the bottom.

181 A clear trend emerges when one compares values for the three estimators to each other. In the

182 wild populations, the value of θπ is usually the smallest, followed by θw and then θ1 (Figure 183 3A) - a situation which is compatible with the assumption of background, or purifying, 184 selection purging deleterious mutations from the population. In laboratory strains and in CB

185 the order is reversed: θ1 is smallest and θπ largest, indicating a strong demographic effect 186 overriding the genomic signature of purifying selection.

187 To better understand the possible reasons for this reversed relationship we studied the

188 inbreeding coefficient FIS = 1 - HI/HS (Wright 1951) which compares within-individual

189 heterozygosity (HI) to the within-population (HS) heterozygosity. FIS does not deviate much 190 from zero in most wild or laboratory populations. The corresponding value becomes positive

191 if multiple populations are combined. However, FIS is negative in all CB substrains, which

192 can happen only when HI is larger than HS. This is the case when individuals are more 193 heterozygous than expected based on the population allele frequencies, which is tantamount to 194 a deviation from Hardy-Weinberg proportions. A possible explanation, compatible with the 195 description of CB in the literature (Wilson, et al. 2014), is that the different CB “substrains” 196 each result from very few, perhaps only one, pair of breeding individuals. All homozygote 197 differences between the parents are then propagated as heterozygote differences in the 198 children of the next few generations.

199 The most plausible explanation for the observed differences between TU_2014 and TU_2018, 200 and between WIK_2014 and WIK_2018, would be a different degree of inbreeding and 201 . To test this, we plotted the observed and expected heterozygosity at the 202 polymorphic sites for all strains as loess-smoothed curves across the genome (Figure 3B, 203 Supplementary Data S3, Supplementary Data S4). WIK_2014 and TU_2018 were found to 204 contain large stretches of severely reduced heterozygosity (no polymorphic sites and/or very 205 low heterozygosity at the sites that are polymorphic), a signature of inbreeding (Kardos, et al. 206 2018). These patterns were less common in the TU fish from 2014 and WIK fish from 2018. 207 In contrast, the wild fish genomes and CB do not contain such long runs of homozygosity 208 (Supplementary Data S3, Supplementary Data S4). However, in the CB subgroups the 209 observed heterozygosity in consistently higher than expected, in accordance with these being

9

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

210 the F1 offspring of a few breeding pairs. As reported before, very few sequences can be 211 confidently aligned to the long arm of chromosome 4 (4q), attributable to its extreme 212 repetitiveness (Wilson, et al. 2014; Howe, et al. 2016). Therefore, very few polymorphic sites 213 were identified on chromosome 4q from the current data. Sequencing technologies that 214 produce longer reads than 100-300 bp would be required to tackle the variation on this part of 215 the zebrafish genome.

216 Distinct allele frequency spectra in different laboratory strains and wild 217 populations 218 Frequency spectra were calculated for each population independently. This was done (1) for 219 all SNPs, irrespective of their particular alleles (Supplementary Data S5), and (2) 220 distinguishing the two classes of A/T and G/C polymorphisms (Figure 4), to check for 221 potential traces of GC-biased gene conversion (gBGC). gBGC essentially is a bias in the 222 DNA repair machinery in favour of G or C alleles after DNA double strand breaks, for 223 example during recombination. Although not a selection force per se, it can mimic the effect 224 of selection and distort statistics that are based on the allele frequency spectrum, by increasing 225 the frequency of G/C alleles at the expense of A/T alleles. To our knowledge, gBGC in fish 226 has so far only been described for the three-spined stickleback (Capra and Pollard 2011)).

227

228 Figure 4. Derived allele frequency in zebrafish. (A) The frequency spectra in wild population 229 closely follows the expectations (black line). In contrast, laboratory strains have a lack of low 230 frequency alleles, with the most inbred strain (TU_2018) demonstrating a nearly flat spectrum. These

10

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

231 spectra look similar for G/C->A/T and A/T->G/C substitution types. In the laboratory fish and the 232 “newest” laboratory strain Nadia, biases can be seen for the substitutions that change GC-content, with 233 A/T to G/C substitutions being more common among high frequency variants and G/C to A/T among 234 low frequency variants. In UT and CHT the observed values are generally close to expected. The 235 irregular shape of spectra for these two populations is caused by the uneven distribution of genomes 236 among the bins, resulting from relatively small sample sizes. In the KHA population from Nepal, 237 numerous alleles that are rare in other populations are present at high frequencies. Populations with 238 significant differences between the A/T -> G/C and G/C -> A/T spectra are marked with an asterisk 239 (*). n=number of available genomes.

240 In all the laboratory strains the spectra are characterized by depletion of low frequency alleles 241 and an increase in the proportion of medium and high frequency alleles, compatible with a 242 high amount of inbreeding and a strong decline of effective population size upon establishing 243 laboratory strains. This effect appears to be the weakest in EKW and Nadia. Nadia is a newer 244 strain that, at the time the sequence data were obtained, had spent only 8 generations in the 245 laboratory (Wilson, et al. 2014) (Figure 4). In the wild-derived CB there is a depletion of very low 246 frequency alleles, which can be seen both with the three subgroups combined and separate. However, 247 the rest of its allele frequency spectrum behaves almost as in a neutral panmictic population, 248 i.e. proportional to the curve 1/x. In the true wild populations the patterns are different 249 (Figure 4). The frequency spectra of UT and CHT closely follow the expected power law 1/x 250 (times a constant). The spectrum of the KHA population from Nepal is clearly U-shaped, with 251 a strong excess of high frequency derived alleles (Figure 4). Besides positive selection and 252 local adaptation causing such a distortion of the frequency spectrum, another possible (and 253 perhaps more likely) reason is mis-assignment of ancestral and derived status, based on the 254 pooled consensus rule. In any case, the folded spectrum for KHA closely follows the power 255 law and the unfolded spectrum remains U-shaped even when the reference allele is redefined 256 based on what is the most common in the three wild zebrafish populations (Supplementary 257 Data S5).

258 As for gBGC, in most laboratory strains the distribution of G/C-> A/T and A/T -> G/C is 259 similar (Figure 4). Chi-square test reveals significant differences between the two spectra in 260 WIK_2014 and in TU_2018, however the spectra themselves are not consistent with gBGC. 261 In the wild fish, as well as in the Nadia strain the p-value obtained from a chi-square test is 262 highly significant (p < 2.2e-16) and a clear bias for G/C to A/T can be seen for low frequency 263 alleles. For high frequency alleles, the pattern A/T to G/C is reversed, still compatible with 264 the presence of gBGC (Glemin, et al. 2015).

11

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

265 In the population CB, more complicated and unusual results were observed. CB includes three 266 subpopulations and within each one of them, there are a large number of SNPs heterozygotic

267 in every individual (HI = 1); up to 5% of all polymorphic sites in the (sub)population 268 (~3,000/60,000 for CB1 and CB3). This is in contrast to the 1-200 such sites that are seen in 269 all the other strains (in CHT and KHA, 0.2% of all polymorphic sites). Under Hardy- 270 Weinberg equilibrium, this is highly unlikely; for a SNP with 50% frequency, the probability 271 of it being heterozygous in the entire sample is 1/(2n). Balancing selection can be also ruled 272 out because these heterozygotic sites are evenly distributed in the genome. In addition, the 273 number of singletons is much smaller than expected with θ, both in the entire CB and in 274 subpopulations.

275 We propose that both phenomena are caused by the fact that CB is a F1 generation after a 276 severe bottleneck, with each subpopulation deriving from one breeding pair (parents). If a 277 SNP is homozygous for one allele in the father and homozygous for another in the mother, all

278 F1 will be heterozygous. Under a standard equilibrium population with size N and genomic θ 279 we find that the expected number of such sites (assuming that the parents are unrelated 280 samples from the wild population) is ≈ θ /6, as can be calculated by the formula

281

282 Assuming the genetic diversity of the wild parents of CB are similar to CHT, the expected 283 number of all-heterozygote sites within the offspring of one breeding pair is ~2,000, which is 284 on the same scale as the actual number of 3,000. The rarity of singletons can also be explained 285 by the fact that any SNPs must have at least an allele frequency of 25% in the parent 286 generation (because there are only four genomes in the breeding pair) and it is unlikely that

287 only one F1 genome inherited the 25% allele.

288 In conclusion, these results are consistent with CB being the F1 generation of three separate 289 breeding pairs (three clutches of eggs). Given enough time, the descendants of CB will likely 290 have the genetic structure of the established laboratory strains, the genomes of which contain 291 sequence segments of high and low heterozygosity resulting from strong reduction in 292 population size and inbreeding. (Supplementary Data S3, Supplementary Data S4).

293

12

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

294 Discussion

295 Differences between wild and laboratory zebrafish 296 We confirmed that laboratory strains have lower genetic variation than wild populations and 297 that most (~60%) of the variants in the dataset are only found in the wild. Because of the 298 focus on zebrafish laboratory strains in the past, only a small fraction of the identified SNPs 299 have been previously described. As a strain becomes more homogenous during breeding in 300 the lab, variants become fixed or lost, which was seen from both reduced heterozygosity of 301 the laboratory strains and the depletion of singletons and rare variants on the allele frequency 302 spectra. The expansion of population size after the initial bottleneck causes an excess of

303 intermediate-frequency alleles and deficit of rare ones, resulting in θπ > θw > θ1. With the 304 present dataset we could observe just how quickly this happens. For example, the AB and 305 Nadia strains have similar proportions of polymorphic sites and nucleotide diversity. 306 However, at the time of sample collection, Nadia had spent only eight generations in captivity 307 and was considered a “wild-like” strain (Wilson, et al. 2014), while AB had already been 308 established in the 1970s.

309 The second major difference between laboratory and wild zebrafish is in the biases in 310 nucleotide composition that could be detected by studying the allele frequency spectra. In 311 many species there are two opposing factors driving the nucleotide composition. The G/C -> 312 A/T bias affects mainly alleles at low frequencies and is caused by cytosine deamination 313 being the most common type of mutation. The A/T -> G/C bias is thought to be based on a 314 different mechanism: during recombination, G/C rich alleles are preferred to the A/T rich 315 alleles; the fine balance of these two mechanisms has a large impact on the genome 316 composition. This effect can be difficult to distinguish from actual selection and has been 317 mostly studied in mammals (Duret and Galtier 2009). However, it is also known to operate in 318 many other taxa including land plants and bacteria (Mugal, et al. 2015). To our knowledge the 319 GC-biased gene conversion in fish has so far only been explored in the three-spined 320 stickleback (Capra and Pollard 2011; Roesti, et al. 2013) and never in Cypriniformes, the most 321 speciose order of freshwater fishes (Nelson, et al. 2016). Here we show that the mechanism 322 would be difficult to study in laboratory strains because GC-bias was not detectable. This can 323 possibly be explained by the reduction of variant sites seen in the inbred laboratory strains – 324 GC-biased gene conversion is a feature of recombination and can thus only operate on sites

13

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

325 that have previously already become polymorphic via other mechanisms (e.g. mutation). 326 Indeed, the bias can be observed in the Nadia strain, which was obtained more recently than 327 the other laboratory strains of this study and has hence more polymorphic sites.

328 What can we learn from the analysis of wild zebrafish? 329 As exemplified by the GC-biased gene conversion, it is apparent that some biological 330 mechanisms cannot be studied in most of the laboratory strains due to their inbred nature. 331 Among the wild zebrafish used for the current study, the Nepal KHA and Bangladesh CHT 332 populations are less diverse than those caught or recently derived from North-East India (UT 333 and CB). In addition, a surprisingly high number of derived alleles occurred at high frequency 334 or were fixed in KHA. It is possible that, just like the laboratory strains, these populations 335 have undergone a recent genetic bottleneck. Both KHA and CHT have many high frequency 336 and fixed differences from the other fish in the dataset on both nucleotide and amino acid 337 levels. At least some of those alleles have never been studied before, yet most likely provide 338 the animals a selective advantage in their current habitat.

339 The populations in North-East India had a substantially higher proportion of polymorphic 340 sites than any of the laboratory strains or the fish from Nepal and Bangladesh. This is likely 341 due to a combination of mutation, gene flow, and drift. These populations are the most 342 plausible source for those laboratory strains for which the origin is not documented, based on 343 the results from the analysis of both genetic distances and phylogenetic trees. Indeed, major 344 cities of the region (e.g. Kolkata) are easily accessible both by air and sea.

345 Differences between laboratory strains 346 Differences between laboratory strains are relevant for the generality and reproducibility of 347 zebrafish molecular studies. Fish from supposedly identical genetic backgrounds (referred to 348 among researchers as the same strain) have been observed to provide conflicting results across 349 different labs and different studies; supposedly minor, often undocumented details in the 350 rearing of the fish have been pointed out as potentially impactful on the outcome of the 351 experiments, leading to the results being difficult to reproduce (Varga, et al. 2018). Our 352 results point to the additional explanation that genetic differences have arisen between 353 common laboratory strains kept in different facilities. Indeed, in mice such an effect has been 354 documented and researchers even warrant caution when interpreting results from different 355 substrains of the widely/globally used C57BL/6 mice (Mekada, et al. 2009). We show here 356 that fish of the same strain, but obtained from different sources and at different time points,

14

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

357 are genetically distinct enough to be considered substrains. Although this can be seen with 358 both TU and WIK, there are some differences in the patterns observed. TU from 2018 is the 359 most homozygous strain in the study, whereas the TU from 2014 appears to be much more 360 heterozygous. Conversely, WIK from 2014 appears more homozygous than WIK from 2018 361 and the two have fixed different amino acids in at least some proteins, which is not the case 362 for TU.

363 The genetic distance between AB and the TU substrains was found to be nearly as small as 364 the distance between the two TU substrains themselves. Considering that AB was obtained 365 from a pet shop in the USA in 1970’s and TU from a different pet shop in Europe in the 366 1990’s, one possible explanation would be that they are both derived from the same 367 undescribed pet shop strain that sold across the world over this entire period. Previous 368 research has revealed that these two share at least one trait that distinguishes them from the 369 other strains: they do not have an easily detectable sex determination locus at the tip of 370 chromosome 4 (Wilson, et al. 2014). Even so, the two differ from each other in some major 371 aspects such as structural variation (Brown, et al. 2012; Holden, et al. 2019) and haplotypes of 372 essential immune genes, e.g. the Major Histocompatibility Complex (McConnell, et al. 2016).

373 While WIK and TU/AB are clearly distinct both from each other and from the wild fish, in the 374 phylogenetic sense the populations of EKW and Nadia appear much closer to the putative 375 source in North-East India. EKW is derived from a population maintained in Florida by 376 commercial fish breeders; Nadia was obtained from North-East India 8 generations prior to 377 sequencing (Wilson, et al. 2014). Compared to AB/TU/WIK, EKW and Nadia have more 378 variants and not as flat derived allele frequency spectra, thus they can be considered more 379 „wild-like“. However, the profile of G/C->A/T and A/T->G/C biases characteristic of wild 380 fish is only present in Nadia and not in EKW.

381 In summary, this study reveals that while laboratory strains of zebrafish are an excellent 382 model for vertebrate biology, it would be of great interest to include wild populations into 383 more studies. They have more genetic diversity, biologic mechanisms not apparent from 384 studying the laboratory strains (as exemplified by GC-biased gene conversion in the current 385 study) and have distinct populations with their own preferred habitats and adaptations. 386 Furthermore, one of the arguments for working exclusively with laboratory strains – 387 reproducibility – does not always hold true. What is thought to be the same strain in two

15

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

388 different laboratories may have already evolved into their own distinct substrains, as we 389 observed for both TU and WIK in the current study.

390 Materials and Methods

391 Samples and datasets 392 The collection of wild zebrafish from India (UT), Bangladesh (CHT) and Nepal (KHA) has 393 been described in detail by Whiteley et al (Whiteley, et al. 2011). The present study utilizes 394 RAD-seq data from the same fish; a single restriction enzyme (SbfI) was used for library 395 construction in order to maintain consistency with the previously published large dataset 396 (Wilson, et al. 2014). Sequencing itself was performed on an Illumina HiSeq 2500 (2x125 bp 397 paired end reads).

398 Fifteen additional fish from the TU strain and fourteen from the WIK strain were obtained in 399 May 2018 from the Zebrafish International Resource Center in Eugene, Oregon, USA. RAD- 400 seq libraries were produced using the protocol decribed by (Ali, et al. 2016). This protocol 401 involves ligating biotinylated, individually indexed, adapters to RAD loci and physically 402 separating them from the rest of the genome using streptavidin coated magnetic beads. DNA 403 was digested with SbfI. Following isolation of RAD loci, Illumina sequencing adapters were 404 added using the NEBNext Library Prep Kit for Illumina (New England Biolabs, Ipswitch, 405 MA) using 1:10 diluted adapters and the optional size-selection step. Sequencing was 406 preformed on an Illumina HiSeq X instrument using 2x150 bp paired end reads (Novogene 407 Corporation Inc., Sacremento, CA).

408 RAD-seq data for zebrafish laboratory strains (EKW, AB, WIK, TU, Nadia), as well as for 409 the F1 offspring of fish derived from a wild population in India (CB) were obtained from the 410 NCBI Sequence Read Archive (Bioproject PRJNA253959). Collection of these samples, as 411 well as generation of the sequencing data itself has been described in detail by Wilson et al 412 (Wilson, et al. 2014). For downstream analyses the samples were renamed to include both the 413 strain name and the last three characters of the SRA identifier in their labels (example: 414 SRR1519522 became AB_522). Two fish from the original CB dataset and one from Nadia 415 were excluded from these analyses as during test runs of the pipeline they appeared 416 genetically different from the rest of their respective populations. For WIK, 42 fish out of 61

16

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

417 in the original dataset were the offspring of a single female; 41 of these were excluded from 418 further analysis.

419 Data processing 420 All data were generated by sequencing zebrafish genomic DNA digested with the enzyme 421 SbfI. Wilson et al sequences (Wilson, et al. 2014) originate from single-end sequencing and 422 have a length of 95bp. For compatibility, all raw data from the newly sequenced populations 423 was first trimmed to the same length (this was done with GNU cut), then de-multiplexed and 424 cleaned using process_radtags from Stacks v2.4 (Catchen, et al. 2011; Rochette and Catchen 425 2017). Sequences were mapped to the zebrafish reference genome (GRCz11; downloaded 426 from Ensembl) using bwa mem (version 0.7.17-r1188) (Li and Durbin 2009) with the default 427 settings. Samtools (version 1.9) was used to filter out unmapped reads and non-primary 428 alignments. Variant calling was performed with Stacks v2.4 (Catchen, et al. 2011; Rochette 429 and Catchen 2017). In order to deal with possible mutations of the cutting site and subsequent 430 allele dropout, data were considered for analysis only if available from at least 70% of 431 populations, and in at least 80% of the individuals within each population. The log likelihood 432 threshold for Stacks to retain RAD loci was -10.

433 Analysis of population structure 434 Admixture analysis of the population structure was performed with the R package LEA 435 (Frichot and Francois 2015), using a single SNP per RAD locus. The optimal number of 436 populations was chosen based on minimal entropy analysis performed by the software. 437 Concatenated sequences of variant positions in the data (one sequence per population; 438 encoded according to the IUPAC nomenclature) were used as input for Maximum Likelihood 439 tree construction with RAxML v8.2.12 (Stamatakis 2014), using the GTRCAT model and 440 1,000 bootstrap replicates. Estimates for genetic distances were obtained from the populations 441 module of Stacks.

442 Data analysis 443 Estimates of heterozygosity, nucleotide diversity and population differentiation were obtained 444 from the populations module of Stacks (Catchen, et al. 2011). Variant sites were annotated 445 with the Ensembl Variant Effect Predictor (VEP) (McLaren, et al. 2016).

446 We determined three estimates of the scaled mutation rate θ=4Nμ, Tajima's estimator θπ

447 (Tajima 1989), Watterson's estimator θ w (Watterson 1975) and θ1, an estimator of θ based

17

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

448 only on singleton mutations (Fu and Li 1993). The first one is the average number of 449 nucleotide differences in pairwise comparisons, and the second is obtained by dividing the

450 total number of polymorphic sites within a sample by h2n-1, the harmonic number of the 451 (diploid) sample size minus one. All estimators are rescaled to values per bp.

452 Allele frequencies were extracted from the stacks-generated .vcf (Variant Call Format) file 453 with VCFtools (version 0.1.13) (Danecek, et al. 2011). Stacks assumes data to be biallelic and 454 by default considers the less frequent allele in the dataset to be derived. These data were used 455 to plot allele frequency spectra and to extract differentially fixed alleles. Expected allele 456 frequencies were calculated assuming a power law distribution of 1/x * constant. Both the 457 observed and expected values were summed to bins of 10% frequency before plotting. In 458 parallel, the same was performed when redefining the ‘ancestral’ and ‘derived’ alleles based 459 on what is the most common in a reduced dataset of seven fish each from the three major wild 460 lineages – India, Nepal and Bangladesh (Supplementary Data S5).

461 Data visualization and statistics 462 R core tools and the R packages gplots and ggplot2 were used to generate the initial plots 463 (Warnes, et al. 2016; Wickham 2016; R Core Team 2018). These were then imported to 464 Adobe Illustrator CS6 (version 16.03) for final editing. Chi-square tests for the allele 465 frequency distributions of each population were performed with R.

466

18

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

467 Data Availability

468 All generated sequence data is available for download from the NCBI Sequence Read Archive 469 (SRA); BioProject PRJNA555030. The scripts used in this study are available from GitHub at 470 https://github.com/jsuurvali/rad_analysis. Files containing the identified variant sites and their 471 functional annotations are available from the Dryad Digital Repository 472 .

473 Author Contributions

474 JS supervised sequencing, analysed the data and wrote the manuscript. AW contributed 475 sample material and RAD-sequencing of laboratory zebrafish. KG performed RAD- 476 sequencing of the wild zebrafish. TW and YZ contributed to data analysis. ML and TW 477 conceived the study and revised the manuscript.

478 Acknowledgements

479 This study was supported by the German Research Council (DFG) grant SPP1819 to TW and 480 ML. AW was supported by the National Science Foundation award DEB-1652278 and by a 481 subaward to the National Science Foundation award IOS-1257562. Emilia Martins, Anuradha 482 Bhat, Rick Mayden and Jiwan Shrestha helped obtain the wild zebrafish samples reported in 483 this paper. Seth Smith helped with the design and implementation of the RAD-Seq protocols used to 484 obtain sequence data for TU and WIK. Kamel Jabbari provided helpful feedback and advice for 485 the analysis of GC-biased gene conversion.

486

19

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

487 Literature Cited

488 Ali OA, O'Rourke SM, Amish SJ, Meek MH, Luikart G, Jeffres C, Miller MR. 2016. RAD 489 Capture (Rapture): Flexible and Efficient Sequence-Based Genotyping. Genetics 202:389-400. 490 Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. 2016. Harnessing the power of 491 RADseq for ecological and evolutionary genomics. Nat Rev Genet 17:81-92. 492 Baker MR, Goodman AC, Santo JB, Wong RY. 2018. Repeatability and reliability of 493 exploratory behavior in proactive and reactive zebrafish, Danio rerio. Sci Rep 8:12114. 494 Balik-Meisner M, Truong L, Scholl EH, Tanguay RL, Reif DM. 2018. Population genetic 495 diversity in zebrafish lines. Mamm Genome 29:90-100. 496 Bhat A, Greulich MM, Martins EP. 2015. Behavioral Plasticity in Response to Environmental 497 Manipulation among Zebrafish (Danio rerio) Populations. PLoS One 10:e0125097. 498 Brown KH, Dobrinski KP, Lee AS, Gokcumen O, Mills RE, Shi X, Chong WW, Chen JY, 499 Yoo P, David S, et al. 2012. Extensive genetic diversity and substructuring among zebrafish strains 500 revealed through copy number variant analysis. Proc Natl Acad Sci U S A 109:529-534. 501 Butler MG, Iben JR, Marsden KC, Epstein JA, Granato M, Weinstein BM. 2015. SNPfisher: 502 tools for probing genetic variation in laboratory-reared zebrafish. Dev Suppl 142:1542-1552. 503 Capra JA, Pollard KS. 2011. Substitution patterns are GC-biased in divergent sequences 504 across the metazoans. Genome Biol Evol 3:516-527. 505 Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH. 2011. Stacks: building and 506 genotyping Loci de novo from short-read sequences. G3 (Bethesda) 1:171-182. 507 Coe TS, Hamilton PB, Griffiths AM, Hodgson DJ, Wahab MA, Tyler CR. 2009. Genetic 508 variation in strains of zebrafish (Danio rerio) and the implications for ecotoxicology studies. 509 Ecotoxicology 18:144-150. 510 Cornet C, Di Donato V, Terriente J. 2018. Combining Zebrafish and CRISPR/Cas9: Toward a 511 More Efficient Drug Discovery Pipeline. Front Pharmacol 9:703. 512 Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter 513 G, Marth GT, Sherry ST, et al. 2011. The variant call format and VCFtools. Bioinformatics 27:2156- 514 2158. 515 Duret L, Galtier N. 2009. Biased gene conversion and the evolution of mammalian genomic 516 landscapes. Annu Rev Genomics Hum Genet 10:285-311. 517 Frichot E, Francois O. 2015. LEA: An R package for landscape and ecological association 518 studies. Methods in Ecology and Evolution 6:925-929. 519 Fu YX, Li WH. 1993. Statistical tests of neutrality of mutations. Genetics 133:693-709. 520 Gagnaire PA, Gaggiotti OE. 2016. Detecting polygenic selection in marine populations by 521 combining population genomics and quantitative genetics approaches. Curr Zool 62:603-616.

20

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

522 Glemin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. 2015. Quantification of GC- 523 biased gene conversion in the human genome. Genome Res 25:1215-1228. 524 Guryev V, Koudijs MJ, Berezikov E, Johnson SL, Plasterk RH, van Eeden FJ, Cuppen E. 525 2006. Genetic variation in the zebrafish. Genome Res 16:491-497. 526 Han S, Nam J, Li Y, Kim S, Cho SH, Cho YS, Choi SY, Choi J, Han K, Kim Y, et al. 2010. 527 Regulation of dendritic spines, spatial memory, and embryonic development by the TANC family of 528 PSD-95-interacting proteins. J Neurosci 30:15102-15112. 529 Holden LA, Wilson C, Heineman Z, Dobrinski KP, Brown KH. 2019. An Interrogation of 530 Shared and Unique Copy Number Variants Across Genetically Distinct Zebrafish Strains. Zebrafish 531 16:29-36. 532 Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray 533 S, McLaren K, Matthews L, et al. 2013. The zebrafish reference genome sequence and its relationship 534 to the human genome. Nature 496:498-503. 535 Howe K, Schiffer PH, Zielinski J, Wiehe T, Laird GK, Marioni JC, Soylemez O, Kondrashov 536 F, Leptin M. 2016. Structure and evolutionary history of a large family of NLR proteins in the 537 zebrafish. Open Biol 6:160009. 538 Kardos M, Akesson M, Fountain T, Flagstad O, Liberg O, Olason P, Sand H, Wabakken P, 539 Wikenros C, Ellegren H. 2018. Genomic consequences of intensive inbreeding in an isolated wolf 540 population. Nat Ecol Evol 2:124-131. 541 Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler 542 transform. Bioinformatics 25:1754-1760. 543 McConnell SC, Hernandez KM, Wcisel DJ, Kettleborough RN, Stemple DL, Yoder JA, 544 Andrade J, de Jong JL. 2016. Alternative haplotypes of antigen processing genes in zebrafish diverged 545 early in vertebrate evolution. Proc Natl Acad Sci U S A 113:E5014-5023. 546 McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. 547 2016. The Ensembl Variant Effect Predictor. Genome Biol 17:122. 548 Mekada K, Abe K, Murakami A, Nakamura S, Nakata H, Moriwaki K, Obata Y, Yoshiki A. 549 2009. Genetic Differences among C57BL/6 Substrains. Experimental Animals 58:141-149. 550 Mugal CF, Weber CC, Ellegren H. 2015. GC-biased gene conversion links the recombination 551 landscape and demography to genomic base composition: GC-biased gene conversion drives genomic 552 base composition across a wide range of species. Bioessays 37:1317-1326. 553 Murray KN, Varga ZM, Kent ML. 2016. Biosecurity and Health Monitoring at the Zebrafish 554 International Resource Center. Zebrafish 13 Suppl 1:S30-38. 555 Nelson JS, Grande T, Wilson MVH. 2016. Fishes of the world. Hoboken, New Jersey: John 556 Wiley & Sons. 557 Parichy DM. 2015. Advancing biology through a deeper understanding of zebrafish ecology 558 and evolution. Elife 4.

21

bioRxiv preprint doi: https://doi.org/10.1101/706382; this version posted July 18, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

559 Patowary A, Purkanti R, Singh M, Chauhan R, Singh AR, Swarnkar M, Singh N, Pandey V, 560 Torroja C, Clark MD, et al. 2013. A sequence-based variation map of zebrafish. Zebrafish 10:15-20. 561 R Core Team. 2018. R: A Language and Environment for Statistical Computing: R 562 Foundation for Statistical Computing, Vienna, Austria. 563 Rochette NC, Catchen JM. 2017. Deriving genotypes from RAD-seq short-read data using 564 Stacks. Nature Protocols 12:2640. 565 Roesti M, Moser D, Berner D. 2013. Recombination in the threespine stickleback genome-- 566 patterns and consequences. Mol Ecol 22:3014-3027. 567 Seth A, Stemple DL, Barroso I. 2013. The emerging use of zebrafish to model metabolic 568 disease. Dis Model Mech 6:1080-1088. 569 Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of 570 large phylogenies. Bioinformatics 30:1312-1313. 571 Stickney HL, Schmutz J, Woods IG, Holtzer CC, Dickson MC, Kelly PD, Myers RM, Talbot 572 WS. 2002. Rapid mapping of zebrafish mutations with SNPs and oligonucleotide microarrays. 573 Genome Res 12:1929-1934. 574 Suriyampola PS, Shelton DS, Shukla R, Roy T, Bhat A, Martins EP. 2016. Zebrafish Social 575 Behavior in the Wild. Zebrafish 13:1-8. 576 Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA 577 polymorphism. Genetics 123:585-595. 578 Varga ZM, Ekker SC, Lawrence C. 2018. Workshop Report: Zebrafish and Other Fish 579 Models-Description of Extrinsic Environmental Factors for Rigorous Experiments and Reproducible 580 Results. Zebrafish 15:533-535. 581 Warnes GR, Bolker B, Bonebakker L, Gentleman R, Liaw WHA, Lumley T, Maechler M, 582 Magnusson A, Moeller S, Schwartz M, et al. 2016. gplots: Various R Programming Tools for Plotting 583 Data. https://CRAN.R-project.org/package=gplots. 584 Watterson GA. 1975. On the number of segregating sites in genetical models without 585 recombination. Theor Popul Biol 7:256-276. 586 Whiteley AR, Bhat A, Martins EP, Mayden RL, Arunachalam M, Uusi-Heikkila S, Ahmed 587 AT, Shrestha J, Clark M, Stemple D, et al. 2011. Population genomics of wild and laboratory zebrafish 588 (Danio rerio). Mol Ecol 20:4259-4276. 589 Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis: Springer-Verlag New York. 590 Wilson CA, High SK, McCluskey BM, Amores A, Yan YL, Titus TA, Anderson JL, Batzel P, 591 Carvan MJ, 3rd, Schartl M, et al. 2014. Wild sex in zebrafish: loss of the natural sex determinant in 592 domesticated strains. Genetics 198:1291-1308. 593 Wright S. 1951. The genetical structure of populations. Ann Eugen 15:323-354.

594

22