A Field Guide to Wholegenome Sequencing, Assembly and Annotation

Evolutionary Applications Evolutionary Applications ISSN 1752-4571 REVIEWS AND SYNTHESIS A field guide to whole-genome sequencing, assembly and annotation Robert Ekblom and Jochen B. W. Wolf Department of Evolutionary Biology, Uppsala University, Uppsala, Sweden Keywords Abstract bioinformatics, conservation genomics, genome assembly, next generation Genome sequencing projects were long confined to biomedical model organisms sequencing, vertebrates, whole - genome and required the concerted effort of large consortia. Rapid progress in high- sequencing. throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual Correspondence research groups in the eco-evolutionary and conservation community to generate Robert Ekblom, Department of Evolutionary de novo draft genome sequences for any organism of choice. Because of the cost Biology, Uppsala University, Norbyvagen€ 18D, SE- 752 36 Uppsala, Sweden. and considerable effort involved in such an endeavour, the important first step is Tel.: +46 18 471 6468; to thoroughly consider whether a genome sequence is necessary for addressing fax: +46 18 471 6310; the biological question at hand. Once this decision is taken, a genome project e-mail: [email protected] requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this Received: 4 February 2014 field and provide a step-by-step introduction to the workflow involved in gen- Accepted: 20 May 2014 ome sequencing, assembly and annotation with particular reference to large and doi:10.1111/eva.12178 complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects. Rapid advances in sequencing technology and bioin- Introduction formatic tools during the last decade have initiated a The field of conservation genetics is concerned with study- transition from classical conservation genetics to conser- ing genetic and evolutionary processes in the context of vation genomics (Fig. 1; Primmer 2009; Allendorf et al. biodiversity conservation (Frankham et al. 2010). Tradi- 2010; Ouborg et al. 2010; Steiner et al. 2013). This tionally, a small number of neutral genetic markers were development has two major implications. First, by signi- employed to study patterns of genetic variation of individ- ficantly scaling-up the number of genetic markers, ge- uals and populations with the aim to explore underlying nomewide approaches enhance the power and resolution processes and their relevance to conservation. Marker- for the above-mentioned applications and improve the based measures provide insight into effective population reliability of conclusions (Steiner et al. 2013). Second, sizes (Ne), recent demographic events (e.g. bottlenecks and the application of genomic technologies opens novel axes expansions), genetic relatedness and the level of inbreeding. of investigation (Allendorf et al. 2010; Ouborg et al. Genetic markers are also routinely employed in monitoring 2010). Genome-scale data provide information beyond schemes (e.g. identification of individuals and capture– neutral genetic variation or candidate gene approaches recapture modelling) and in breeding programs (Romanov (e.g. major histocompatibility complex genes; Hedrick et al. 2009) and have been extensively used to characterize 1999) and thus enable screening for selectively important population substructuring and genetic connectivity, to variation and assessing the adaptive potential of popula- delineate conservation units and to infer interspecific tions (Primmer 2009). For example, approaches such as admixture events (Hoglund€ 2009; Allendorf et al. 2013). genomewide scans for selection, association mapping or Under the premise of a causal relationship between neutral quantitative trait loci (QTL) mapping can pinpoint loci genetic variation and population viability, these data can of relevance for local adaptation of the target population inform practical conservation decisions (Frankham 1995). (Steiner et al. 2013), with the potential to conserve 1026 © 2014 The Authors. Evolutionary Applications published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. Ekblom and Wolf Whole-genome sequencing, assembly and annotation High quality tissue sample (with attached metadata) including taxonomic delineation, characterization of demographic events, estimates of inbreeding or relatedness, DNA extraction DNA fragmentation Fragment size selection High throughput shotgun sequencing Fragment amplification can be successfully conducted in the absence of a genome Wet-lab Wet-lab procedures reference. Instead, large-scale maker data such as genotyp- Single end, paired end and mate pair sequencing data ing-by-sequencing (Elshire et al. 2011), RAD-Seq (Narum Read Contamination Quality K-mer Genome size et al. 2013), reduced representation sequencing (Van Tas- Reference trimming removal filtering counting estimation information sell et al. 2008), amplicon sequencing (Zavodna et al. from related De-novo Read error Error rate Scaffolding species contig correction estimation 2013) or transcriptome sequencing (Ekblom and Galindo assembly Contamination Genome 2011) can be effectively utilized without relying on a gen- assembly Gap closing Validation scanning RNA-Seq ome backbone. A complete and well-annotated genome data Annotation Repeat masking sequence, however, provides the ultimate resource for Draft genome assembly genomic approaches. Whole-genome resequencing data with positional information along a genome sequence con- Read mapping Local realignment Re-sequencing stitute the most complete account of individual genomic data from Targeted sequencing Variant/SNP calling population and gene capture variation [e.g. structural rearrangements, copy number var- sample genomics Population iation, insertion–deletion, single nucleotide polymor- Genotype calling Variant/SNP validation phisms (SNPs), sequence repeats] and will likely soon Genome wide population level data become the standard for genetic studies of natural popula- Identification of adaptive Detailed genetic Genetic changes tions (Ellegren et al. 2012; The Heliconius Genome Con- genetic variation population structure and due to local demographic history adaptation sortium 2012). It also provides the basis for haplotype Connectivity and gene • Evaluate potential to disperse and information and genomewide estimates of linkage disequi- flow between adapt to changing environments subpopulations • Define functionally relevant conservation units librium which have great power to reveal recent population • Facilitate science-based conservation actions Genetics of inbreeding histories (Li and Durbin 2011), timing of admixture events of conservation Some examples to reduce potential effects of inbreeding depression genomic applications (Hellenthal et al. 2014) and to screen for signatures of selection (Hohenlohe et al. 2010). The study of selectively Figure 1 Workflow of a typical de novo whole-genome sequencing important variation strongly relies on annotated genome project. Black boxes with white text indicate genomic resources becom- ing available during the course of the project. From the top: wet-lab data to identify the functional genomic regions of interest. procedures, de novo assembly bioinformatic pipeline, postassembly Reference sequences are further indispensable as a template analyses of additional population-wide sampling (population genomics), for RNA-seq in detailed studies of (isoform-specific, allele- conservation genomic questions to address and analyses to perform specific) gene expression (Vijay et al. 2013), epigenetic (conservation genomic applications). Bullet points within the white star modifications (such as methylation; Herrera and Bazaga in the bottom part of the figures represent ultimate goals in conserva- 2011) and DNA–protein interactions (Auerbach et al. tion biology that can be addressed using genomic information com- 2013). These approaches are only accessible to genome- bined with high-quality ecological data. enabled taxa (Kohn et al. 2006) that enjoy the added bene- fit of using the latest bioinformatics tools developed in the evolutionary processes – a long sought after goal in con- biomedical sciences. servation biology (Crandall et al. 2000; Fraser and Ber- Here, we introduce the workflow of a typical whole- natchez 2001). Genomewide analyses further allow genome sequencing project conducted by an individual addressing the poorly understood mechanistic basis of research group. This field guide aims at introducing inbreeding depression (epistasis, directional dominance principles and concepts to beginners in the field (Box 1) versus overdominance, many versus few loci), or assess- and offers practical guidance for the many steps involved ing the impact of genetic variation on patterns of gene (Fig. 1). It builds largely upon our own experience with expression, and plastic response to environmental vertebrate genome assembly. We limit the scope to geno- change. Genomic approaches can also be applied to mic data, focusing on large and complex genomes, for highly fragmented DNA from ancient material (e.g.

A Field Guide to Wholegenome Sequencing, Assembly and Annotation

A Clone-Array Pooled Shotgun Strategy for Sequencing Large Genomes

Probabilities and Statistics in Shotgun Sequencing Shotgun Sequencing

Metagenomics Analysis of Microbiota by Next Generation Shotgun Sequencing

Randomness Versus Order

Shotgun Metagenomics, from Sampling to Sequencing and Analysis Christopher Quince1,^, Alan W

Random Shotgun Fire

Whole Genome Shotgun Sequencing

Modeling of Shotgun Sequencing of DNA Plasmids Using Experimental

Human Genome Program in 1986.”

PDF File Contains Supplementary Figures, Figures S1-S5

Metagenomics-Based Proficiency Test of Smoked Salmon Spiked with A

Information Theory of DNA Shotgun Sequencing Abolfazl S