
Copyright © 2010 by the Genetics Society of America
DOI: 10.1534/genetics.110.116616

Whole-Genome Profiling of Mutagenesis in Caenorhabditis elegans

Stephane Flibotte,*,† Mark L. Edgley,‡ Iasha Chaudhry,‡ Jon Taylor,‡ Sarah E. Neil,‡ Aleksandra Rogula,‡ Rick Zapf,‡ Martin Hirst,† Yaron Butterfield,† Steven J. Jones,† Marco A. Marra,† Robert J. Barstead§ and Donald G. Moerman*,‡,1

*Department of Zoology, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada, †Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4S6, Canada, ‡Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada and §Department of Molecular and Cell Biology, Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma 73104

Manuscript received March 12, 2010
Accepted for publication April 28, 2010

ABSTRACT

Deep sequencing offers an unprecedented view of an organism’s genome. We describe the spectrum of mutations induced by three commonly used mutagens: ethyl methanesulfonate (EMS), N-ethyl-N-nitrosourea (ENU), and trimethylpsoralen (UV/TMP) in the nematode Caenorhabditis elegans. Our analysis confirms the strong GC to AT bias of EMS. We found that ENU mainly produces A to T and T to A transversions, but also all possible transitions. We found no bias for any specific transition or transversion in the spectrum of UV/TMP-induced mutations. In 10 mutagenized strains we identified 2723 variants, of which 508 are expected to alter or disrupt gene function, including 21 nonsense mutations and 10 mutations predicted to affect mRNA splicing. This translates to an average of 50 informative mutations per strain. We also present evidence of genetic drift among laboratory wild-type strains derived from the Bristol N2 strain. We make several suggestions for best practice using massively parallel short read sequencing to ensure mutation detection.

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.116616/DC1.

1Corresponding author: Department of Zoology, University of British Columbia, 6270 University Blvd., Vancouver, BC V6T 1Z4, Canada. E-mail: [email protected]

Genetics 185: 431–441 (June 2010)

MUTAGENESIS and the screening for mutants have long been a key tool of the practicing geneticist. The early work of T. H. Morgan and his colleagues relied on recovery of spontaneous mutations, which was limiting for the study of inheritance due to their infrequent occurrence (Morgan et al. 1922; also see Sturtevant 1965). The discovery by H. J. Muller and others that X rays cause mutations ushered in the era of inducing mutations (Muller 1927). There is a long history of studies on mutagen specificity, both in prokaryotes and in eukaryotes, and today many mutagens are utilized in a variety of model organisms. In this article we use whole-genome deep sequencing in the model organism Caenorhabditis elegans to explore the types and frequencies of mutations induced by various mutagens and to document the feasibility of global identification of mutations.

The mutagenic properties of ethyl methanesulfonate (EMS) were first demonstrated using the T4 viral system (Loveless 1959). Soon after, Lewis and Bacher (1968) demonstrated how to administer EMS to Drosophila melanogaster to generate mutations, and later Sydney Brenner did the same for the nematode C. elegans (Brenner 1974). The now classic article by Coulondre and Miller (1977) demonstrated the types of nucleotide substitutions generated by EMS and confirmed earlier observations (Bautz and Freese 1960) concerning the strong bias for GC to AT transitions. Today, EMS is still the most powerful and popular mutagen used by researchers studying D. melanogaster and C. elegans. Purely on the basis of genetic inference, when used at a concentration of 50 mM, EMS is calculated to induce 20 function-affecting variant alleles in C. elegans strains derived using this mutagen (Greenwald and Horvitz 1982; Anderson 1995).

The chemical N-ethyl-N-nitrosourea (ENU) has been used as a mutagen since the 1970s but came to prominence when it was demonstrated to be the most effective chemical mutagen in mice (Russell et al. 1979). Today it is still the chemical mutagen of choice for this organism (Anderson 2000; Acevedo-Arozena et al. 2008). ENU has also been used for C. elegans mutagenesis (De Stasio et al. 1997). Although it appears to have different biases with regard to gene targets and base changes relative to EMS, the background mutational load after ENU mutagenesis has not been fully characterized (De Stasio and Dorman 2001).

The chemical 4,5′,8-trimethylpsoralen is a crosslinking agent that is activated by near ultraviolet light. Studies in Escherichia coli have shown that it causes both single-base changes and deletions (Piette et al. 1985; Sladek et al. 1989). C. elegans researchers became interested in the potential of ultraviolet trimethylpsoralen (UV/TMP) to generate deletions in worms after the first deletions in this organism were isolated using this mutagen (Yandell et al. 1994). UV/TMP is now a major reagent in the arsenal of the C. elegans knockout consortium laboratories (Barstead and Moerman 2006). As a tool for generating deletions in eukaryotes it is quite useful but, outside of studies on prokaryotes, little else is known about the spectrum of mutagenic effects caused by UV/TMP.

Massively parallel short read sequencing technologies offer unprecedented opportunities to study the complete genetic complement of an individual organism (Hillier et al. 2008). For genetic model systems the impact of this technology extends to the identification and correlation of induced mutations with selected phenotypes (Sarin et al. 2008). Several of the technological and bioinformatic issues that arise with next generation sequencing have already been addressed for the nematode C. elegans (Hillier et al. 2008; Sarin et al. 2008; Shen et al. 2008; Rose et al. 2010). Still, it is not clear how deeply one must sequence to confidently identify a relevant variant allele in a target mutant strain. Also of importance are questions concerning mutagen choice and dosage as they relate to the rate of induction of new mutations and background mutational load. We have undertaken the following study on mutagenesis and mutation detection to establish the parameters necessary to exploit next generation sequencing technologies for C. elegans genetics. For the first time we offer a whole-genome direct measure of mutation spectrum and background load for EMS, ENU, and UV/TMP. Readers interested in whole-genome sequencing of EMS mutagenized strains in C. elegans should also see the accompanying article in this issue by Sarin et al. (2010). In our study we also measured the single-nucleotide variation among currently used wild-type strains. In addition, we measured sequence read depth across all sequence and across coding sequence, and from this we make a recommendation of average genome coverage to ensure the correct identification of the causative mutation. We also examined the issue of false positive and false negative calls and make recommendations to eliminate most false positives without losing bona fide mutations.

MATERIALS AND METHODS

Strains used in this study: Homozygous unc-22 strains derived from mutagenesis were VC1923, VC1924, RB5000, and RB5002 (EMS); VC2362 and VC2366 (UV/TMP); and VC2451, VC2452, and RB5001 (ENU). DM1017 [unc-52(e3003e998) II; sup-38(ra5) IV] was generated by EMS mutagenesis of CB998 (Gilchrist and Moerman 1992). VC2010 is the Moerman Gene Knockout Lab subculture of N2 obtained from the Caenorhabditis Genetics Center in 2002 and is the parent strain for all unc-22 strains described here. A second N2 subculture, from the Schedl lab at Washington University (St. Louis, MO), designated WU N2, was also used in this work, but we used existing whole-genome sequence data (Hillier et al. 2008) instead of generating new data. The N2 reference DNA sequence used in this study was largely derived from cosmids constructed from a wild-type strain, although a portion of the reference was derived from yeast artificial chromosomes (YACs), which were produced from the strain CB1301, an endonuclease minus mutant derived from a wild-type strain by EMS mutagenesis (R. Waterston, personal communication).

Nematode culture and mutagenesis: Nematodes generally were grown as previously described (Brenner 1974). Populations for mutagenesis were prepared by washing VC2010 (N2) animals off starved plates with 10-ml aliquots of M9 buffer containing 0.01% Triton X-100, pelleting in a benchtop centrifuge in 15-ml centrifuge tubes, and replating on 150-mm NGM agar plates seeded with E. coli strain OP50 or χ1666. After 2 days at 20°, gravid adults were harvested by washing with sterile distilled water and the population was synchronized by alkaline hypochlorite treatment (Sulston and Hodgkin 1988), and the egg pellet was replated on fresh 150-mm seeded plates. When the populations were predominantly at the L4 stage, they were collected for mutagenesis by washing in M9/Triton X-100.

Mutagenesis with EMS was performed at 50 mM (for VC and DM strains) or 60 mM (for RB strains) according to standard protocols (Sulston and Hodgkin 1988). Mutagenesis by UV/TMP was performed at 2 µg/ml TMP according to our laboratory standard protocol. Briefly, 100 mg TMP (Sigma, St. Louis; T6137) was dissolved by vigorous shaking in 40 ml acetone (Gengyo-Ando and Mitani 2000) in a sterile 50-ml polypropylene centrifuge tube to make a solution of 2.5 mg/ml. A nematode population washed from a growth plate was transferred to a sterile 15-ml polypropylene centrifuge tube and the volume adjusted to 4 ml. TMP solution was added to the worm suspension to the desired concentration. The tubes were incubated in darkness on a benchtop shaker at 150 rpm for 1 hr. After incubation the worms were washed free of mutagen and bacteria in five changes of M9/Triton X-100 and diluted to 12.5 ml in M9/Triton. Twelve 1-ml aliquots were placed in individual wells of a sterile untreated 12-well tissue culture plate (Biopacific, E222804601F), and the plate was irradiated with 360 nm UV at 340 mW/cm² for 90 sec in a custom-built irradiation cabinet. Mutagenesis by ENU was performed at 0.5 mM according to De Stasio and Dorman (2001) for the VC strains and at 1.0 mM for the RB strains. It should be noted that batches of mutagen do vary in potency. In our experience TMP is particularly susceptible to batch variability, which makes it very difficult to arrive at a standard recommended concentration.

For each mutagen, 180 P0 animals were plated at three per 60-mm seeded agar plate. The F1 progeny were screened for heterozygous unc-22 twitchers in 1% nicotine (Moerman and Baillie 1979); from each such animal, a homozygous F2 unc-22 twitcher was selected for further processing. Each homozygous line was propagated clonally from F2 through F7 to drive other carried mutations to homozygosity, and a homozygous F7 animal was selected from each to establish a stock. Between two and four such F7 lines were selected for each mutagen. The VC and DM lines were subjected to comparative genomic hybridization (CGH) analysis using our whole-genome chip designs (Roche NimbleGen design name “2006-11-16_CE2_WG_CGH_E” for the EMS mutants and “081002_CE_UBC_RZ_CGH” for the TMP and ENU mutants) and to whole-genome sequencing on the Illumina platform at the Michael Smith Genome Sciences Centre (Vancouver, BC,

Canada). The RB lines were not examined with CGH but were sequenced using the Illumina platform at the Oklahoma Medical Research Foundation (OMRF) (Oklahoma City, OK).

Whole-genome shotgun sequencing: Worm strains for sequencing were grown on 150-mm petri plates containing rich NGM medium (standard recipe but 8× peptone) with agarose [Invitrogen (Carlsbad, CA) UltraPure, catalog no. 16500-500] substituted for agar, seeded with E. coli χ1666. Populations were grown to starvation, harvested by washing with sterile M9/Triton X-100, and pelleted by centrifugation in sterile 15-ml centrifuge tubes, and the supernatant was removed by aspiration. The buffer was removed using five changes of sterile distilled water, centrifugation, and aspiration; after the final wash, water was removed, leaving a concentrated worm pellet in a minimum volume, typically between 300 and 500 µl. Worm concentrates were frozen at −80°.

Two methods were used for DNA preparation. DNA for strains VC2010 and VC1924 was prepared by standard phenol/chloroform extraction with RNAse treatment, precipitation, and resuspension in TE (pH 8.0). DNA for all other strains was prepared using the PureGene Genomic DNA Tissue Kit [QIAGEN (Valencia, CA) no. 158622], following a supplementary QIAGEN protocol for nematodes. DNA concentrations were determined using a Thermo Fisher Nanodrop spectrophotometer, and quality was checked by electrophoresis on 1% agarose gels.

Prepared DNA samples were submitted to the Michael Smith Genome Sciences Centre (VC and DM strains) or the OMRF sequencing center (RB strains) in amounts ranging from 5 to 10 µg. The WU N2 strain, which was derived from the Bristol N2 strain, was sequenced by Hillier et al. (2008) with unpaired reads of 32 nucleotides in length while the VC2010 and VC1924 strains were sequenced with 36-mer unpaired reads. The other strains were sequenced with a more recent protocol using paired reads, with a read length of 37 for the RB strains (RB5000, RB5001, and RB5002) and 50 for all the remaining strains.

Sequence alignment, variant determination, and search for deletions: The sequence reads were aligned to the WormBase (www.wormbase.org) reference genome version WS190, using the Maq software suite version 0.7.1 (Li et al. 2008) with default parameters. The variant calling procedure also followed a standard Maq pipeline with default filters and parameters except that the maximum number of mismatches was set to three and the minimum mapping quality was set to 10 for a read to be considered in the consensus calling. Identical read pairs were removed before variant determination and only the unambiguous homozygous variants were kept. Variants with ambiguous consensus base (i.e., base other than A, C, T, or G) were eliminated. Furthermore, the minimum Phred-like consensus quality was set to 30 instead of the default 20, which provided a better compromise between sensitivity and specificity for the current application and data sets. The variant candidates were then labeled according to their type and location (within intron or exon, synonymous or nonsynonymous, etc.) with a custom Perl script, using gene information available in WS190.

Searches for homozygous deletions >10 bases followed a two-step approach. First the regions with no coverage in the mutant strain under consideration but with some coverage in the N2 strain VC2010 were flagged for further analysis with the second step. The second step compared the distribution of apparent distance between the aligned read pairs in the regions of interest to the average distribution for the whole chromosome, using a Kolmogorov–Smirnov (KS) test. Those two steps were implemented with Perl and R scripts and the most promising candidates were further evaluated by visual inspection. The most promising candidates were the regions with a small P-value in the KS test from the second step, larger size, and good agreement between the two size measurements, i.e., (1) the size of the region with lack of coverage from the first step and (2) the increase in apparent insert size from the second step.

Comparative genomic hybridization: The VC and DM mutant strains were processed with whole-genome comparative genomic hybridization following the procedure described in detail in Maydan et al. (2007), using the N2 wild-type VC2010 strain for reference DNA. Roche NimbleGen manufactured the microarrays but the experiments and subsequent data processing were performed in the laboratory of D. G. Moerman.

RESULTS

Variation in wild-type strains: We began our study by analyzing our laboratory wild-type strain, VC2010, and comparing it to the sequences of two other wild-type strains, the C. elegans reference sequence and the recently reported sequence of the WU N2 strain (Hillier et al. 2008). All three sequenced strains are subcultures of the original N2 strain established by Brenner (1974), but were maintained separately for an unknown number of generations before sequencing. Thus, an uncertain amount of drift may have occurred. For this project we used the massively parallel short read sequencing technology from Illumina (either a GA1 or a GA2 instrument). In addition, we also reanalyzed sequencing reads from the WU N2 strain from Washington University (Hillier et al. 2008). Sequencing reads were aligned using the Maq software suite (Li et al. 2008) to the wild-type Bristol N2 reference genome WS190 assembled from Sanger sequencing reads (C. elegans Sequencing Consortium 1998). Sequence processing details can be found in Materials and Methods. Variant candidates (single-nucleotide transitions and transversions) were called with the Maq suite primarily using default parameters and filters.

After obtaining an average sequence depth of coverage of 17-fold (Figure 1), we found 871 differences between the derived sequence of VC2010 and the reference sequence. Similar to the results reported by Hillier et al. (2008), in our reanalysis of their data (32-fold coverage), we detected 844 substitutions compared to the wild-type reference genome. Of these, 634 of the differences are shared between the two strains (Figure 2). These could represent accumulated mutations depending on the relationship of these two strains to the strains used for the reference sequence; more likely they represent errors in the reference sequence. Assuming that they all correspond to sequencing errors and that no others exist, we can estimate the sequencing error rate of the reference genome to a little less than 1 in 100,000 bases (634 differences across a genome of roughly 100 Mb), which is in good agreement with previous estimates (C. elegans Sequencing Consortium 1998). Of the variants unique to each strain, almost all are in noncoding sequences or in silent bases of codons (synonymous changes). Some of these variants may represent

Figure 1.—Average genome coverage for the various strains studied in this work. The average coverage is calculated by dividing the sum of the aligned bases by the genome size. The boxes are color coded according to the mutagens used to create the strains and the color key is provided in the inset.
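The coverage definition in the Figure 1 caption can be written out as a short sketch. The function names, the toy read list, and the 100-base "genome" below are illustrative assumptions and are not taken from the study's own Perl and R scripts. The second helper mirrors the "covered to a depth of at least three reads" measure used in the text.

```python
def average_coverage(aligned_read_lengths, genome_size):
    """Average coverage = sum of aligned bases / genome size (Figure 1 caption)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(aligned_read_lengths) / genome_size


def fraction_covered(per_base_depth, min_depth=3):
    """Fraction of positions whose read depth is at least min_depth."""
    if not per_base_depth:
        raise ValueError("per_base_depth is empty")
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


# Toy example (hypothetical numbers): 40 aligned 50-base reads on a 100-base genome.
toy_coverage = average_coverage([50] * 40, genome_size=100)
print(f"{toy_coverage:.1f}x")  # prints 20.0x

# Toy depth profile: two of the four positions reach a depth of three reads.
toy_fraction = fraction_covered([0, 2, 3, 7], min_depth=3)
print(toy_fraction)  # prints 0.5
```

For a real data set the read lengths would come from the alignment output and the genome size from the reference assembly; both inputs are left to the caller here.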

sequencing errors, whereas others may represent differences accumulated due to drift during maintenance of the two strains. We return to this question after consideration of the mutagen-treated strains.

Figure 2.—Number of variants called in two N2 wild-type strains (VC2010, Vancouver; WU N2, St. Louis) when the sequence of each is compared to the Cambridge wild-type reference genome WS190 and to each other. The blue bars correspond to variants common to VC2010 and WU N2 but different from the reference genome. The yellow plus red bars represent variants unique to either VC2010 or WU N2. The red bars are associated with nonsynonymous variants. The yellow bars are associated with synonymous variants.

Variant detection in EMS-, ENU-, and TMP/UV-treated strains: We next sequenced at varying depths of coverage (Figure 1) 10 strains derived from VC2010 after treatment with three commonly used mutagens using standard protocols. Five strains were produced with EMS, two with UV/TMP, and three with ENU (see Materials and Methods). As can be seen in Figure 1, in the current study the average genomic coverage for each strain ranged from 12× to 33× with a median value of 19×. Variant candidates (single-nucleotide transitions and transversions) were called with the Maq suite primarily using default parameters and filters. Since the default setting of this software requires a minimum of three reads to detect a variant, we evaluated the fraction of the genome covered to at least this depth (Figure 3). Even for the strains with the lowest overall coverage, >90% of the genomic bases were covered to a depth of at least three reads. Because coverage can be influenced by GC content and exons are GC rich compared to the rest of the genome, and also because we are primarily interested in variants in exons, we determined the fraction of exon bases with at least three reads and the fraction of exons with all bases covered by at least three reads (Figures 3 and 4). The exon coverage closely paralleled the overall genome coverage, but Figure 4 does illustrate that an overall coverage >20× is necessary to detect all the variants in more than 80% of the exons.

The genomes of each of these mutagenized strains have a large number of mutations compared to the reference. Many of these differences are shared with the VC2010 parent strain or with other mutagenized strains (Figure 5). Because mutations are rare in each of these strains, the variants shared between strains could correspond to additional errors in the reference genome (see above), errors produced by some sort of bias in the current sequencing protocol, or genetic drift between VC2010 and the strain(s) used to create the reference genome (see below). Before analyzing the variants unique to each strain in more detail, we wished to evaluate the sensitivity and specificity of the variant calls.

Sensitivity and specificity analysis: We used two approaches to evaluate sensitivity. If we assume that the variants present in both wild-type N2 strains (VC2010

Figure 3.—Percentage of bases covered at a depth of at least 3× for each strain. The red bars correspond to the coverage for all the bases in the genome and the yellow bars to the bases within exons. The blue bars represent the percentage of exons for which all the bases are covered to a depth of at least 3×.

and WU N2) represent true differences from the reference genome, we can use these as a “true” positive set to estimate the ability to detect these in each of the mutant strains. Sensitivity ranged from 79 to 96% for individual mutant strains, with a median of 90% and a strong correlation to the average genome coverage (correlation coefficient r = 0.865; see Table 1). When focusing such analysis only on variants present in exons, the median sensitivity increases to 95%.

The above sensitivity calculations are based on a biased data set; they derive only from regions of the genome where variants have already been detected in this study. We therefore estimated the sensitivity in a second way. We created 10,000 in silico single-nucleotide “mutations” at random locations in the reference genome and calculated our ability to detect these with real reads from our paired-end 50-mer data sets. At a depth of 20×, the overall sensitivity to detect the in silico mutations was 89% and reached 95% when focusing on exons.

Figure 4.—Percentage of exons totally covered to a depth of at least 3× as a function of average genome-wide coverage. Each data point corresponds to a different strain studied in this work.

Detection specificity is the proportion of the variants called that are true differences from the reference genome. This requires comprehensive knowledge of the true positives in a data set. Specificity here can be estimated by analyzing the variants called for VC2010 provided two assumptions are made: (1) variants called in VC2010 but none of the derived strains are all false positives and (2) none of the remaining VC2010 variants (those called in multiple strains) are false positives. Under these two assumptions, the specificity for calling variants in the current VC2010 data set is 90%. Changing the minimum Phred-like consensus quality from the default of 20 to 30 improved the specificity from 77 to 90%, without significant loss in sensitivity (Table 1). A threshold consensus quality of 40 increases the specificity to 99%, but at a cost in sensitivity (Table 1). To evaluate specificity directly, we tested 40 predicted variants in the strain VC1924 by PCR of the affected regions and subsequent Sanger sequencing and confirmed 39 of them. The remaining one was not a single-base substitution but a single-base insertion/deletion also present in VC2010. On the basis of this sensitivity/specificity

Figure 5.—Number of variants identified in the wild-type N2 strain VC2010 and the 10 mutated strains studied in this work. The boxes are color coded according to the mutagens used to create the strains and the color key is provided at the top left. The hatched bars correspond to the variants present in >1 of those 11 strains while the solid bars correspond to the unique variants.

analysis the variants labeled as unique in Figure 5 are likely to represent the bulk of induced variants present in each strain with only a few false positives. Therefore we focused our further analysis on those candidates.

The mutation spectrum of EMS, ENU, and TMP/UV: On the basis of the unique variants (Figure 5) EMS produced more variants (median of 323 per strain) than ENU (median of 226) or UV/TMP (median of 78) at the concentrations used. As expected, the relative frequency of the possible transitions and transversions differed for each mutagen. UV/TMP produced the least amount of bias between the various mutation types; EMS primarily produced G/C to A/T transitions; and ENU produced A/T to T/A transversions, G/C to A/T transitions, and A/T to G/C transitions (Figure 6). These mutations appear to be distributed differently across gene features for the different mutagens (Figure 7). The proportion of variants within exons was the smallest for the UV/TMP strains and the largest for the EMS strains. For those mutations falling in exons, we found 477 missense and 21 nonsense mutations in the 10 mutant strains derived from the VC2010 wild-type strain. In addition, we found a total of 10 variants affecting splice sites (see Table 2). On average we identified 50 informative mutations per strain. The complete list of variants called in the mutant strains is available for download (see supporting information, File S1).

Deletion mutations in the mutagen-treated strains: Because Maq as used here does not detect insertion/deletion differences, we searched independently for large homozygous deletions in the sequence data sets produced with the paired end sequencing protocol (see Materials and Methods) and complemented the searches with CGH experiments. With the microarray designs used in the current study, CGH has previously been capable of detecting deletions up to megabases in size and mutations as small as single-base changes (Maydan et al. 2007, 2009). Only two deletion candidates were found in our mutagenized strains and these were confirmed using Sanger sequencing. The first deletion removes 122 bp of an exon of the unc-22 gene in the UV/TMP strain VC2366, coincidentally very close to a T to G transversion in the same exon (Figure 8). The other deletion, 200 bases in size, affected an intron in the gene arx-1 (Y71F9AL.16) in the ENU strain VC2452.

TABLE 1
Variant detection sensitivity as a function of the threshold in Phred-like quality score

Strain    Coverage   Sensitivity (%)   Sensitivity (%)   Sensitivity (%)
                     quality >20       quality >30       quality >40
VC2366    11.9×      78.6              78.5              64.5
VC2362    12.7×      83.8              84.2              76.7
VC1923    12.8×      81.7              81.9              73.7
RB5000    14.0×      88.7              89.3              87.6
DM1017    16.0×      87.5              87.5              82.9
VC2451    20.3×      90.8              91.2              87.4
VC2452    23.0×      90.4              91.2              87.6
VC1924    23.4×      95.1              95.7              93.2
RB5002    29.1×      94.7              95.1              95.1
RB5001    30.1×      93.4              93.9              94.2

The variant detection sensitivity was evaluated for individual strains using three different thresholds in Phred-like quality score from Maq. Variants common to VC2010 and WU N2 were used for the evaluation, and the list comprised 655 variants at quality >20, 634 variants at quality >30, and 533 variants at quality >40.
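The relationship between coverage and sensitivity reported with Table 1 can be checked with a small calculation. This is a minimal sketch, assuming the coverage column reads 11.9× to 30.1× and that the quality >30 column is the one behind the quoted correlation; the text does not say which threshold was used. With those inputs the Pearson coefficient comes out close to the r = 0.865 given in the text.

```python
import math

# (strain, average coverage, sensitivity in % at quality > 30), copied from Table 1.
TABLE1_Q30 = [
    ("VC2366", 11.9, 78.5), ("VC2362", 12.7, 84.2), ("VC1923", 12.8, 81.9),
    ("RB5000", 14.0, 89.3), ("DM1017", 16.0, 87.5), ("VC2451", 20.3, 91.2),
    ("VC2452", 23.0, 91.2), ("VC1924", 23.4, 95.7), ("RB5002", 29.1, 95.1),
    ("RB5001", 30.1, 93.9),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


coverage = [row[1] for row in TABLE1_Q30]
sensitivity = [row[2] for row in TABLE1_Q30]
r = pearson_r(coverage, sensitivity)
print(f"r = {r:.3f}")
```

The same helper can be pointed at the quality >20 or quality >40 columns to see how the choice of threshold changes the strength of the relationship.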

Figure 6.—Relative frequency of the various transitions and transversions identified in the 10 mutated strains studied in this work. The data for individual strains were combined according to the mutagen used. Only the variants appearing in a single strain were used. The color key identifying the type of mutation is provided in the inset, and the first letter represents the reference nucleotide while the second letter represents the mutated nucleotide.
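A spectrum like the one in Figure 6 rests on classifying each single-base change as a transition or a transversion. The sketch below shows one way to do that, and to merge strand-equivalent changes into labels such as "G/C to A/T" as used in the text. The function names, the label format, and the four-variant example list are illustrative assumptions, not the study's custom Perl script.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_substitution(ref, alt):
    """Return 'transition' (purine<->purine or pyrimidine<->pyrimidine) or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} to {alt!r}")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def strand_neutral_label(ref, alt):
    """Merge strand-equivalent changes, e.g. G>A and C>T both give 'G/C to A/T'."""
    classify_substitution(ref, alt)  # validates the input
    ref, alt = ref.upper(), alt.upper()
    if ref in PYRIMIDINES:  # report every change from the purine strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}/{COMPLEMENT[ref]} to {alt}/{COMPLEMENT[alt]}"


def mutation_spectrum(substitutions):
    """Relative frequency of each strand-neutral class in a list of (ref, alt) pairs."""
    labels = [strand_neutral_label(ref, alt) for ref, alt in substitutions]
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


# Hypothetical example: (reference base, mutated base) for four unique variants.
example = [("G", "A"), ("C", "T"), ("A", "T"), ("T", "G")]
spectrum = mutation_spectrum(example)
for label, freq in sorted(spectrum.items()):
    print(f"{label}: {freq:.2f}")
```

Collapsing complementary changes leaves six classes in total, two transitions and four transversions, which keeps the comparison between mutagens independent of which strand a variant was called on.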

Genetic drift in wild-type strains: The availability of whole-genome sequences for three separate “N2-like” wild-type strains, one from Cambridge, England, one from St. Louis, and one from Vancouver, British Columbia, provides an opportunity to examine genetic drift. Since the sensitivity is not perfect, we cannot expect all variants to be called in all the data sets. To estimate the amount of genetic drift between VC2010 and WU N2 we concentrated on the 886 variants called in at least six (i.e., more than half) of the VC2010-derived strains (Figures 2 and 5). Assuming a sensitivity of 90% and no genetic drift, we calculate that we should detect 797 of those variants (0.90 × 886) in the WU N2 sequence. However, we called only 662 of those variants. The difference between these two numbers, 135, is the estimate of the number of variants present in VC2010 but absent in WU N2. Further, there were 85 variants in the WU N2 that were not found in VC2010 or its derivatives. Since both the sensitivity and the specificity are estimated to be 90%, we can use that number of 85 variants directly as an estimate of the variants really present in WU N2 but not in VC2010. Taken together, 135 + 85 = 220 variants gives us a reasonable estimate of the genetic drift between VC2010 and WU N2.

Figure 7.—Relative variant frequency affecting various gene features in the 10 mutated strains studied in this work. The data for individual strains were combined according to the mutagen used. Only the variants appearing in a single strain are represented. The proportions of nonsynonymous variants are represented in blue (dark for nonsense and light for missense mutations), the synonymous variants within exons are shown in green, the variants appearing in introns are yellow, and red bars are used for the remaining variants.

DISCUSSION

In this study we determined the spectrum of mutations across the entire C. elegans genome induced by three well-known mutagens: EMS, ENU, and UV/TMP. In total we detected 2723 variants and we placed these into two broad categories. In the first group, comprising 2215 variants, are those that do not affect any encoded protein. These variants map to intergenic regions, or introns, or lead to synonymous change. In the second group, comprising 508 variants, are those producing nonsynonymous changes (477 events), nonsense mutations (21 events), and mutations affecting splice sites (10 events). As we endeavored to make each animal homozygous across the entire genome prior to sequencing, our study probably detects only about half of the genetic events after mutagenesis. Even so, it is clear that each mutagenized animal carries on average a genetic load of hundreds of variants of which as many as 50 may have deleterious effects. This translates to about eight mutations on every chromosome in the worm. There are two implications of this observation. First, it emphasizes the need to outcross a mutant animal with a particular phenotype several times to ensure that all, or at least most, background mutations are eliminated. Second, for those wishing to identify the mutation responsible for a given phenotype, simply sequencing the mutant genome is not enough. Some genetic mapping should be done concurrently or one will be confounded by all the choices in candidate variants. A number of powerful new SNP, bulk segregant, or CGH mapping methods are

TABLE 2 Variants associated with nonsense mutations and splicing defects in the mutant strains sequenced in the current study

Coordinate Reference Mutant Strain Allele Chromosome WS190 base base Quality Mutation Gene DM1017 gk2895 III 10004481 G T 81 Splicing C05B5.11 RB5001 ok5304 III 7783172 C T 57 Splicing C08C3.3 RB5001 ok5478 X 15475897 A C 135 Nonsense C46E1.3 VC1923 gk1398 II 3721382 G A 48 Splicing C46E10.3 RB5002 ok5778 V 10149631 G A 83 Splicing C51E3.1 VC1924 gk963 I 8430347 C T 111 Splicing F02E9.9 RB5002 ok5863 X 1312996 C T 95 Nonsense H42K12.3 DM1017 gk2864 II 14421096 T A 54 Nonsense K04B12.1 VC2452 gk2668 V 16180639 G A 111 Nonsense K05D4.2 DM1017 gk2856 II 14115235 G A 57 Nonsense K09E4.4 RB5000 ok5083 III 2036666 C T 90 Splicing M01E10.2 VC2451 gk2294 I 4765803 G T 39 Nonsense M04F3.2 RB5002 ok5810 V 13368760 G A 126 Nonsense F32H5.6 RB5001 ok5385 V 1846700 A T 105 Nonsense T10B5.7 RB5002 ok5804 V 12945577 G A 105 Nonsense T16G1.8 VC1924 gk964 II 8930558 G A 74 Splicing T21B10.1 RB5002 ok5573 II 3779342 C T 108 Splicing T24E12.11 RB5002 ok5516 I 5111435 G A 84 Nonsense Y110A7A.15 VC1924 gk2021 V 18843877 C T 81 Nonsense Y17D7A.3 RB5000 ok5041 II 3107233 A T 57 Nonsense Y25C1A.3 VC1923 gk1442 II 12590593 G A 36 Nonsense Y38E10A.6 DM1017 gk2714 I 93554 C T 126 Nonsense Y48G1C.1 RB5002 ok5820 V 15195283 G A 153 Nonsense Y75B12B.6 DM1017 e998 II 14657282 C T 57 Nonsense ZC101.2 RB5001 ok5212 I 10356594 T C 126 Splicing ZC434.9 VC1924 gk952 III 9536643 G T 57 Nonsense ZK1098.3 VC2451 gk2406 IV 11978396 G A 102 Nonsense ZK617.1 VC1924 gk965 IV 11982077 G A 90 Nonsense ZK617.1 VC2452 gk2609 IV 11992038 G A 135 Nonsense ZK617.1 VC2452 gk2548 II 10497733 T G 108 Splicing F59B10.1 DM1017 gk2764 I 4371101 G A 36 Nonsense ZK973.9 available where traits can be mapped to a 5-map-unit EMS bias to produce G to A and C to T transitions would interval, or even a smaller interval, after a single cross translate into an increase in the proportion of variants (Michelmore et al. 1991; Jakubowski and Kornfeld affecting exons relative to UV/TMP as can be seen in 1999; Wicks et al. 
2001; Swan et al. 2002; Davis et al. Figure 7. Following mutagenesis procedures and doses 2005; Flibotte et al. 2009). In combination with deep commonly used by the C. elegans research community sequencing these methods will provide an unprece- (see materials and methods), we find that EMS is the dented resource for new forward genetics discoveries. most powerful mutagen causing 1.5- to 2-fold more base For EMS and ENU the spectrum of mutation events we substitutions than ENU and at least 3-fold more muta- observe is similar to what has been reported by others tions than UV/TMP. Of striking note, none of 31 (Coulondre and Miller 1977; De Stasio et al. 1997). nonsense or splice site variants identified were isolated EMS produces primarily G/C to A/T transitions and after UV/TMP mutagenesis (Table 2): all were identi- ENU, while exhibiting some bias for these transitions, fied after either EMS or ENU mutagenesis. As each also produced many other transitions and transversions mutagen is distinct in behavior, choice of mutagen will (Figure 6). UV/TMP demonstrated the least mutation be determined by what type of mutation a researcher bias, producing all manner of transitions and trans- desires. If the object is to obtain null alleles, EMS or versions (Figure 6). This was unexpected, as previous possibly ENU is an excellent choice. On the other hand if investigators reported a bias using this mutagen for TA one is seeking allelic variability, including a variety of to GC transversions (Piette et al. 1985). It should be missense alleles, ENU may be a wiser choice because it will noted, however, that this earlier study was performed on generate a broader spectrum of changes at high frequency. a single locus and sampled only a small number of From our study and those of others (Hillier et al. mutations. The GC content within exons (43%) is much 2008; Denver et al. 2009; Vergara et al. 2009) it is clear higher than the overall GC content (35%) of the C. 
that not all wild-type N2 subcultures are identical. elegans genome. It is therefore expected that the strong Denver et al. (2009) estimated that the spontaneous Comparing Mutagens in C. elegans 439

Figure 8.—Deletion found on chromosome IV of the strain VC2366. The top panel shows the coverage in the region of interest. The mid- dle panel shows the apparent distance between the alignments of read pairs to the reference ge- nome. The bottom panel shows the fluorescence ratio in log2 scale in a comparative genomic hy- bridization experiment (Maydan et al. 2007).

forward mutation rate for C. elegans is 3 × 10⁻⁹ per site per generation. This translates to about one mutation every three generations. We identified hundreds of variants unique to our N2 strain or the WU N2 strain. Using a very conservative metric, we find that 135 variants are unique to VC2010 and another 85 are unique to the WU wild-type strain. On the basis of the above calculation of the occurrence of spontaneous mutations and the time separating these wild-type strains, these variants are most likely the result of genetic drift that has occurred over the hundreds of generations these two strains have been separated, both from each other and from the reference Bristol N2 wild-type strain. This observation leads us to recommend that investigators sequence the parental strains used in any forward genetic screen to distinguish induced variation from variation due to drift. Granted, this could get expensive if all one’s mutations are in such different parental backgrounds that one has to sequence a different parent strain for every mutant. An alternative would be to sequence a second allele of the unknown gene and subtract the common background variants (for details on this approach see Sarin et al. 2010). Of course the limitation here is that a second allele will not always be available.

In this deep sequencing study of the 100-Mb C. elegans genome we have tried to determine the most appropriate parameters to maximize sensitivity and specificity of variant detection without incurring undue cost. We explored base and exon coverage after varying sequencing depths (Figures 1, 3, and 4 and Table 1). We incorporated the idea of not just measuring genome coverage, but also measuring total exon coverage as this proved to be a better and more sensitive indicator of how well all exons of genes are represented in the sequence. Here we define the total exon coverage as the percentage of all exons for which all the bases are covered to a depth of at least three reads. Coverage much below 15× generally resulted in sensitivity too low to ensure detecting a mutation. As a case in point, we were unable to find the unc-22 mutation in strains VC1923 and VC2362 even though all VC strains reported here carry mutations in the unc-22 gene (see Materials and Methods for selection procedure after mutagenesis). For strains VC1923 and VC2362 the genome coverage was 13-fold and at this level exons in genes were not adequately represented (Figures 3 and 4). In all strains with deeper coverage all unc-22 exons were covered and we could identify the mutation. On the basis of our experience with various sequencing depths we recommend aiming for 25× genome coverage to determine a homozygous mutation, as this would give close to 100% coverage of all exons to a depth of three reads. At this sequencing depth the variant detection sensitivity will be 95% (Table 1).

Both read depth and Phred-like quality cutoffs have an impact on sensitivity and specificity. The default setting for the Phred-like consensus score in the Maq program is 20. By shifting this to 30 one gains improved specificity with only a marginal loss of sensitivity, but if one shifts the setting any higher, then the loss in sensitivity is significant (see Table 1). We examined a limited set of 40 variant calls in the strain VC1924 by Sanger sequencing and only one “false positive” was identified, a single-base insertion. Calling this type of feature accurately is a known weakness of the Maq aligner (Krawitz et al. 2010) especially when working with shorter 35-base unpaired reads like those used to analyze VC1924. For most of our samples we used a sequencing protocol with paired-end reads. New analysis algorithms and software associated with massively parallel sequencing technologies become available on a regular basis. Doing an extensive review of such analysis techniques is clearly beyond the scope of the current work. However, the nature of the false positive variant called in VC1924 suggests that one could benefit from using an algorithm capable of producing gapped alignments. To test this hypothesis, we reanalyzed all the sequencing data sets used in the current work with the BWA/Samtools combination (Li and Durbin 2009; Li et al. 2009). The new analysis program correctly identified the single-base insertion relative to the reference genome in all 12 data sets, including in both wild-type strains (our unpublished results).

One of the goals of this study was to establish some guidelines for future genome analysis projects, as it is clear that next-generation sequencing is an important addition to the tools used to correlate phenotype and genotype. The parameters we suggest above will become easier to attain as sequencing costs continue to drop and as new programs for analysis become available.

Data quality is important for any deep sequencing project, even more so when the end product is a persistent genetic reagent like a C. elegans strain. One must ensure that the reported sequence truly represents the archived strain. Discrepancies that arise due to poor sequence or poor strain maintenance may be fatal to future genome-scaled projects that seek to correlate phenotypes with gene networks. Poor sequence can result from bad reactions, poor data analysis, or inconsistent/incomplete data annotation. We hope that the suggestions we make here will help others to avoid at least some of these possible pitfalls.

Even with sequence perfectly representing the genome, if the strains are not managed properly, the reported sequence will not represent the archived strain. For example, if the investigator does not freeze the strain for several months after the DNA is prepared for deep sequencing, then the archived strain will almost certainly accumulate variation via new mutation and drift as described above for our N2 subculture, VC2010. More significantly, if the investigator is not careful to drive the strain to homozygosity before the sequencing sample is prepared, the archived strain will lose variation through simple genetic segregation as it is propagated prior to freezing. In this study we self-crossed animals for 7 generations, but it would be even better to do 10 (a policy that we ourselves have adopted).

Massively parallel sequencing has the potential to completely change the genetic research landscape. For the first time an investigator can examine the full genotype of an organism. This will facilitate the study of phenotypes that are difficult to score, because they require specialized assays, are incompletely penetrant, or are variably expressed. Further, with a sufficiently large collection of mutated strains, where each gene is represented by multiple alleles, one could identify directly the genes responsible for a given screening phenotype simply by identifying the common gene targets among the relevant strains. Research on multicellular organisms has long lagged behind that on single-cell organisms like the yeast Saccharomyces cerevisiae in the study of genetic networks (reviewed in Dixon et al. 2009). Whole-genome sequencing should help close the gap and offers potentially new ways to study gene networks.

We thank Bob Waterston for advice and many helpful comments and Mark Johnston for editorial comments. We also thank Oliver Hobert for communication of data in advance of publication. We thank the British Columbia Cancer Association Genome Sciences Centre Functional Genomics Group for expert technical assistance in library construction and sequencing. This research was supported by grants from Genome Canada and Genome British Columbia, the Michael Smith Research Foundation, and the Canadian Institute for Health Research to D.G.M. and by National Institutes of Health National Human Genome Research Institute grant P41HG003652 to R.B. M.A.M. and S.J.J. are Scholars of the Michael Smith Foundation for Health Research. D.G.M. is a Fellow of the Canadian Institute for Advanced Research.

LITERATURE CITED

Acevedo-Arozena, A., S. Wells, P. Potter, M. Kelly, R. D. Cox et al., 2008 ENU mutagenesis, a way forward to understand gene function. Annu. Rev. Genomics Hum. Genet. 9: 49–69.
Anderson, K. V., 2000 Finding the genes that direct mammalian development: ENU mutagenesis in the mouse. Trends Genet. 16: 99–102.
Anderson, P., 1995 Mutagenesis. Methods Cell Biol. 48: 31–58.
Barstead, R. J., and D. G. Moerman, 2006 C. elegans deletion mutant screening. Methods Mol. Biol. 351: 51–58.
Bautz, E., and E. Freese, 1960 On the mutagenic effects of alkylating agents. Proc. Natl. Acad. Sci. USA 46: 1585–1594.
Brenner, S., 1974 The genetics of Caenorhabditis elegans. Genetics 77: 71–94.
C. elegans Sequencing Consortium, 1998 Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018.
Coulondre, C., and J. H. Miller, 1977 Genetic studies of the lac repressor. III. Additional correlation of mutational sites with specific amino acid residues. J. Mol. Biol. 117: 525–567.
Davis, M. W., M. Hammarlund, T. Harrach, P. Hullett, S. Olsen et al., 2005 Rapid single nucleotide polymorphism mapping in C. elegans. BMC Genomics 6: 118.
Denver, D. R., P. C. Dolan, L. J. Wilhelm, W. Sung, J. I. Lucas-Lledo et al., 2009 A genome-wide view of Caenorhabditis elegans base-substitution mutation processes. Proc. Natl. Acad. Sci. USA 106: 16310–16314.
De Stasio, E. A., and S. Dorman, 2001 Optimization of ENU mutagenesis of Caenorhabditis elegans. Mutat. Res. 495: 81–88.
De Stasio, E., C. Lephoto, L. Azuma, C. Holst, D. Stanislaus et al., 1997 Characterization of revertants of unc-93(e1500) in Caenorhabditis elegans induced by N-ethyl-N-nitrosourea. Genetics 147: 597–608.
Dixon, S. J., M. Costanzo, A. Baryshnikova, B. Andrews and C. Boone, 2009 Systematic mapping of genetic interaction networks. Annu. Rev. Genet. 43: 601–625.
Flibotte, S., M. L. Edgley, J. Maydan, J. Taylor, R. Zapf et al., 2009 Rapid high resolution single nucleotide polymorphism-comparative genome hybridization mapping in Caenorhabditis elegans. Genetics 181: 33–37.
Gengyo-Ando, K., and S. Mitani, 2000 Characterization of mutations induced by ethyl methanesulfonate, UV, and trimethylpsoralen in the nematode Caenorhabditis elegans. Biochem. Biophys. Res. Commun. 269: 64–69.
Gilchrist, E. J., and D. G. Moerman, 1992 Mutations in the sup-38 gene of Caenorhabditis elegans suppress muscle-attachment defects in unc-52 mutants. Genetics 132: 431–442.
Greenwald, I. S., and H. R. Horvitz, 1982 Dominant suppressors of a muscle mutant define an essential gene of Caenorhabditis elegans. Genetics 101: 211–225.
Hillier, L. W., G. T. Marth, A. R. Quinlan, D. Dooling, G. Fewell et al., 2008 Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5: 183–188.
Jakubowski, J., and K. Kornfeld, 1999 A local, high-density, single-nucleotide polymorphism map used to clone Caenorhabditis elegans cdf-1. Genetics 153: 743–752.
Krawitz, P., C. Rodelsperger, M. Jager, L. Jostins, S. Bauer et al., 2010 Microindel detection in short-read sequence data. Bioinformatics 26: 722–729.

Lewis, E. B., and F. Bacher, 1968 Methods of feeding ethyl methane sulfonate (EMS) to Drosophila. Dros. Inf. Serv. 43: 193.
Li, H., and R. Durbin, 2009 Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760.
Li, H., J. Ruan and R. Durbin, 2008 Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18: 1851–1858.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009 The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
Loveless, A., 1959 The influence of radiomimetic substances on deoxyribonucleic acid synthesis and function studied in Escherichia coli/phage systems. III. Proc. R. Soc. Lond. B Biol. Sci. 150: 497–508.
Maydan, J. S., S. Flibotte, M. L. Edgley, J. Lau, R. R. Selzer et al., 2007 Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 17: 337–347.
Maydan, J. S., H. M. Okada, S. Flibotte, M. L. Edgley and D. G. Moerman, 2009 De novo identification of single nucleotide mutations in Caenorhabditis elegans using array comparative genomic hybridization. Genetics 181: 1673–1677.
Michelmore, R. W., I. Paran and R. V. Kesseli, 1991 Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88: 9828–9832.
Moerman, D. G., and D. L. Baillie, 1979 Genetic organization in Caenorhabditis elegans: fine-structure analysis of the unc-22 gene. Genetics 91: 95–103.
Morgan, T. H., A. H. Sturtevant, H. J. Muller and C. B. Bridges, 1922 The Mechanism of Mendelian Heredity. H. Holt & Co., New York.
Muller, H. J., 1927 Artificial transmutation of the gene. Science 66: 84–87.
Piette, J., D. Decuyper-Debergh and H. Gamper, 1985 Mutagenesis of the lac promoter region in M13 mp10 phage DNA by 4′-hydroxymethyl-4,5′,8-trimethylpsoralen. Proc. Natl. Acad. Sci. USA 82: 7355–7359.
Rose, A. M., N. J. O’Neil, M. Bilenky, Y. S. Butterfield, N. Malhis et al., 2010 Genomic sequence of a mutant strain of Caenorhabditis elegans with an altered recombination pattern. BMC Genomics 11: 131.
Russell, W. L., E. M. Kelly, P. R. Hunsicker, J. W. Bangham, S. C. Maddux et al., 1979 Specific-locus test shows ethylnitrosourea to be the most potent mutagen in the mouse. Proc. Natl. Acad. Sci. USA 76: 5818–5819.
Sarin, S., S. Prabhu, M. M. O’Meara, I. Pe’er and O. Hobert, 2008 Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat. Methods 5: 865–867.
Sarin, S., V. Bertrand, H. Bigelow, A. Boyanov, M. Doitsidou et al., 2010 Analysis of multiple ethyl methanesulfonate-mutagenized Caenorhabditis elegans strains by whole-genome sequencing. Genetics 185: 417–430.
Shen, Y., S. Sarin, Y. Liu, O. Hobert and I. Pe’er, 2008 Comparing platforms for C. elegans mutant identification using high-throughput whole-genome sequencing. PLoS ONE 3: e4012.
Sladek, F. M., A. Melian and P. Howard-Flanders, 1989 Incision by UvrABC excinuclease is a step in the path to mutagenesis by psoralen crosslinks in Escherichia coli. Proc. Natl. Acad. Sci. USA 86: 3982–3986.
Sturtevant, A. H., 1965 A History of Genetics. Harper & Row, New York.
Sulston, J., and J. Hodgkin, 1988 Methods, pp. 587–606 in The Nematode Caenorhabditis elegans, edited by W. B. Wood. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Swan, K. A., D. E. Curtis, K. B. McKusick, A. V. Voinov, F. A. Mapa et al., 2002 High-throughput gene mapping in Caenorhabditis elegans. Genome Res. 12: 1100–1105.
Vergara, I. A., A. K. Mah, J. C. Huang, M. Tarailo-Graovac, R. C. Johnsen et al., 2009 Polymorphic segmental duplication in the nematode Caenorhabditis elegans. BMC Genomics 10: 329.
Wicks, S. R., R. T. Yeh, W. R. Gish, R. H. Waterston and R. H. Plasterk, 2001 Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat. Genet. 28: 160–164.
Yandell, M. D., L. G. Edgar and W. B. Wood, 1994 Trimethylpsoralen induces small deletion mutations in Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 91: 1381–1385.

Communicating editor: D. I. Greenstein

Supporting Information http://www.genetics.org/cgi/content/full/genetics.110.116616/DC1



FILE S1

LIST OF VARIANTS CALLED IN THE MUTAGENIZED STRAINS

File S1 is available for download as a text file at http://www.genetics.org/cgi/content/full/genetics.110.116616/DC1.

The downloadable text file is in a comma-separated format. The labels in the header are self-explanatory except that the Quality in the sixth column refers to the Phred-like quality score from the Maq software. A Gene Feature entry with a * after the word intron in the eighth column indicates that the mutation is associated with a splicing defect. Only the variants appearing in a single strain are listed and the Coordinate entries in the third column refer to the reference genome available at Wormbase version WS190.
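
As an illustration of how a file in this format might be read, the following minimal Python sketch filters variants by Phred-like quality and flags splicing variants. It relies only on the column positions stated above (Coordinate third, Quality sixth, Gene Feature eighth). The header labels, the order of the remaining columns, the "exon" label, and the exact spelling of the intron marker in the sample are assumptions made for illustration; the sample rows are adapted from Table 2, not copied from File S1.

```python
import csv
import io
import re

# Zero-based positions taken from the File S1 description:
# Coordinate is the third column, Quality the sixth, Gene Feature the eighth.
COORDINATE_COL = 2
QUALITY_COL = 5
FEATURE_COL = 7

# "intron" followed by "*" marks a splicing defect; allowing optional
# whitespace between the two is an assumption about the exact spelling.
SPLICING_MARK = re.compile(r"intron\s*\*")


def read_variants(handle, min_quality=30):
    """Yield (coordinate, quality, is_splicing, row) for rows at or above the cutoff."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line, if any
    for row in reader:
        if len(row) <= FEATURE_COL:
            continue  # skip blank or truncated lines
        quality = int(row[QUALITY_COL])
        if quality < min_quality:
            continue
        is_splicing = bool(SPLICING_MARK.search(row[FEATURE_COL]))
        yield int(row[COORDINATE_COL]), quality, is_splicing, row


# Hypothetical sample in the assumed layout (values adapted from Table 2).
SAMPLE = """Strain,Chromosome,Coordinate,Reference,Mutant,Quality,Gene,Gene Feature
DM1017,III,10004481,G,T,81,C05B5.11,intron*
RB5001,X,15475897,A,C,135,C46E1.3,exon
VC1923,II,12590593,G,A,36,Y38E10A.6,exon
"""

if __name__ == "__main__":
    for coordinate, quality, is_splicing, row in read_variants(io.StringIO(SAMPLE)):
        label = "splicing" if is_splicing else "other"
        print(row[0], row[1], coordinate, quality, label)
```

To apply the sketch to the real file, one would pass an open file handle instead of the in-memory sample and check the column positions against the actual header before trusting the output.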