1 Mapping Challenging Mutations by Whole-Genome Sequencing Harold

bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Mapping challenging mutations by whole-genome sequencing Harold E. Smith, Amy S. Fabritius, Aimee Jaramillo-Lambert, and Andy Golden National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892 1 bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Running title: Challenging mutations by WGS Keywords: Forward genetics, variant detection, complex alleles, SNP mapping Corresponding author: Harold E. Smith 8 Center Drive Bethesda, MD 20892 301-594-6554 [email protected] 2 bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. ABSTRACT Whole-genome sequencing provides a rapid and powerful method for identifying mutations on a global scale, and has spurred a renewed enthusiasm for classical genetic screens in model organisms. The most commonly characterized category of mutation consists of monogenic, recessive traits, due to their genetic tractability. Therefore, most of the mapping methods for mutation identification by whole-genome sequencing are directed toward alleles that fulfill those criteria (i.e., single-gene, homozygous variants). However, such approaches are not entirely suitable for the characterization of a variety of more challenging mutations, such as dominant and semi-dominant alleles or multigenic traits. Therefore, we have developed strategies for the identification of those classes of mutations, using polymorphism mapping in Caenorhabditis elegans as our model for validation. We also report an alternative approach for mutation identification from traditional recombinant crosses, and a solution to the technical challenge of sequencing sterile or terminally arrested strains where population size is limiting. The methods described herein extend the applicability of whole-genome sequencing to a broader spectrum of mutations, including classes that are difficult to map by traditional means. INTRODUCTION The advent of next-generation sequencing technology has transformed our ability to identify the genetic variation that underlies phenotypic diversity. It is now feasible to perform whole-genome sequencing (WGS) in a matter of days and at relatively low cost. By comparing the data to an available reference genome, one can determine the complete constellation of sequence variants that are present in the sample of interest. 3 bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. WGS has proven a boon to experimental geneticists who utilize classical, or forward, genetics (i.e., random mutagenesis and phenotypic screening) in model organisms. Although those techniques have been employed for more than a century (Morgan, 1911a; Morgan, 1911b), determination of the mutation responsible for the observed phenotype remains the rate-limiting step in most cases. Mutation identification by WGS compares favorably to traditional techniques, such as linkage mapping and positional cloning, in terms of speed, labor, and expense (Hobert, 2010; Bowerman, 2011; Hu, 2014). Consequently, the technology has been adopted in a wide variety of model species, including Caenorhabditis elegans (Sarin et al., 2008), Drosophila melanogaster (Blumenstiel et al., 2009), Escherichia coli (Barrick and Lenski, 2009), Schizosaccharomyces pombe (Irvine et al., 2009), Arabidopsis thaliana (Laitinen et al. 2010), Saccharomyces cerevisiae (Birkeland et al., 2010), Dictyostelium discoideum (Saxer et al., 2012), Chlamydomonas reinhardtii (Dutcher et al., 2012), and Danio rerio (Bowen et al., 2012). A great advantage of forward genetic screening is the freedom from constraint on the types of alleles recovered; the only requirement is that the mutation produces the phenotype of interest. That property stands in contrast to techniques for reverse genetic screening, such as RNA interference (Fire et al., 1998), in which the molecular target is known but the ability to modify its activity is limited to a reduction of gene function. Although that category of mutation (reduction or loss of function) represents the class most commonly recovered in forward genetic screens, it is also possible to generate alleles that increase the level or activity of the gene product, reverse normal gene function, or produce novel activity. However, the expanded repertoire of variants 4 bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. accessible by random mutagenesis is accompanied by an increasing complexity in analysis. Although forward genetic screens are typically designed to obtain recessive alleles, they can also recover dominant, semi-dominant, and even multigenic mutations. The latter categories of alleles are challenging to map by traditional methods, such as linkage to morphological or molecular markers, due to the complexities of zygosity (for dominant and semi-dominant alleles) or the dependence of the phenotype upon independently segregating loci (for multigenic traits). As a practical consequence, preference is given to recessive, monogenic alleles, and the tools for mutation identification by WGS are tailored to that end. We reasoned that the data produced by whole-genome sequencing, coupled with the appropriate mapping crosses, should greatly enhance our ability to identify causative variants among the less genetically tractable categories of mutations. To that end, we have utilized a polymorphism mapping method previously developed for mutation identification in Caenorhabditis elegans (Wicks et al., 2001; Swan et al., 2002; Doitsidou et al., 2010), and evaluated various strategies for mapping dominant, semi-dominant, and two-gene mutations. We have also performed analysis of WGS data from traditional recombinant mapping crosses, useful for mutation identification in legacy strains or in cases where polymorphism mapping may not be suitable. Finally, we have adopted a method to prepare sequencing libraries from small amounts of genomic DNA, and demonstrate its utility in strains where sample recovery may be limiting. Together, those approaches extend the application of WGS to the identification of various types of mutations. MATERIALS AND METHODS 5 bioRxiv preprint doi: https://doi.org/10.1101/036046; this version posted January 6, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. All C. elegans mutant strains were derived from the wild-type N2 Bristol strain and contained one or more of the following alleles: rol-6(su1006) II, lin-8(n111) II, lin- 9(n112) III, spe-48(hc85) I (formerly identified as spe-8), spe-10(hc104) V, dpy-11(e224) V, unc-76(e911) V. Hawaiian strain isolate CB4856 (Hodgkin & Doniach, 1997) was used for polymorphism mapping. CRISPR-mediated gene editing to create spe-48(gd11) was performed as described previously (Paix et al., 2015), using dpy-10 as a co-CRISPR marker (Arribere et al., 2014). CRISPR reagents are listed in Supplement Table S1. Worms were propagated using standard growth conditions (Brenner, 1974). Fertility assays were performed as described previously (Kulkarni et al., 2012). Sequencing libraries were constructed using the TruSeq DNA sample prep kit v2 for large input samples or TruSeq ChIP sample prep kit for low input samples (Illumina, San Diego, CA). Briefly, gDNA samples were sheared by sonication, end-repaired, A- tailed, adapter-ligated, and PCR-amplified. A detailed protocol for obtaining sheared gDNA from low input samples can be found in the Supplement (File S1). Libraries were sequenced on a HiSeq 2500 (Illumina, San Diego, CA) to generate 50bp reads, and yielded >20-fold genome coverage per sample. Genome version WS220 (www.wormbase.org) was used as the reference. The SNP data in Figure 1 were determined with a pipeline of BFAST (Homer et al., 2009) for alignment and SAMtools (Li et al., 2009) for variant calling. All other SNP data were obtained using BBMap (Bushnell, 2015) for alignment and FreeBayes (Garrison and Marth, 2012) for variant calling. Duplicate reads were removed after alignment and, unless otherwise indicated, at least three independent

1 Mapping Challenging Mutations by Whole-Genome Sequencing Harold

Whole Genome Sequencing Identifies Novel Structural Variant in a Large Indian Family Affected with X - Linked Agammaglobulinemia

Exome Sequencing and Functional Validation in Zebrafish Identify

Genomic Approaches for Understanding the Genetics of Complex Disease

QTL Analysis Reveals Genomic Variants Linked to High-Temperature

A Comprehensive Database of Putative Human Microrna Target Site

Interleukin 10 Mutant Zebrafish Have an Enhanced Interferon

Comparison of Structural and Short Variants Detectedby Linked-Read

Variant Annotation

Phase 3. Functional Variant Discovery

A Pipeline for Identifying Saccharomyces Cerevisiae Mutations

Gain-Of-Function Mutation of Microrna-140 in Human Skeletal Dysplasia

Improving Isobutanol Productivity Through Adaptive Laboratory Evolution in Saccharomyces Cerevisiae