
Overview of DNA Sequencing Strategies UNIT 7.1

Jay A. Shendure,1 Gregory J. Porreca,2 and George M. Church2
1University of Washington, Seattle, Washington
2Harvard Medical School, Boston, Massachusetts

ABSTRACT

Efficient and cost-effective DNA sequencing technologies have been, and may continue to be, critical to the progress of molecular biology. This overview of DNA sequencing strategies provides a high-level review of six distinct approaches to DNA sequencing: (a) dideoxy sequencing; (b) cyclic array sequencing; (c) sequencing-by-hybridization; (d) microelectrophoresis; (e) mass spectrometry; and (f) nanopore sequencing. The primary focus is on dideoxy sequencing, which has been the dominant technology since 1977, and on cyclic array strategies, for which several competitive implementations have been developed since 2005. Because the field of DNA sequencing is changing rapidly, this unit represents a snapshot of this particular moment. Curr. Protoc. Mol. Biol. 81:7.1.1-7.1.11. © 2008 by John Wiley & Sons, Inc.

Keywords: DNA, sequencing, Sanger, dideoxy, polony

Current Protocols in Molecular Biology 7.1.1-7.1.11, January 2008. Published online January 2008 in Wiley Interscience (www.interscience.wiley.com). DOI: 10.1002/0471142727.mb0701s81. Supplement 81. Copyright © 2008 John Wiley & Sons, Inc.

INTRODUCTION

In the mid-1960s, the first attempts at DNA sequencing followed the precedent set for protein (Ryle et al., 1955) and RNA (Holley et al., 1965): sequencing by detailed analysis of degradation products. However, the length and consequent complexity of DNA proved to be significantly problematic (Sanger, 1988). A key moment came in February 1977, when groups led by Fred Sanger and Walter Gilbert independently published descriptions of methodologies for DNA sequencing, both of which relied on gel electrophoresis to separate DNA fragments with single-base-pair resolution (Maxam and Gilbert, 1977; Sanger et al., 1977). In the years that followed, the rapid dissemination of these technologies and their progression to robust protocols enabled a wide range of critical advances throughout the fields of genetics and molecular biology. The development of commercially available automated sequencing platforms in the mid-1980s represented a second key breakthrough that secured the dominance of the Sanger protocol (also known as “dideoxy sequencing”) over the Maxam-Gilbert protocol (also known as “chemical sequencing”) as the method of choice for the next several decades (Hunkapiller et al., 1991).

In addition to automation, a supporting cast of related technologies was developed to further reduce costs and improve sequencing throughput. These included a broad range of methods for efficient library construction and template preparation, dideoxynucleotides (ddNTPs) bearing fluorescent moieties (Prober et al., 1987), and thermostable polymerases engineered to accept them (Tabor and Richardson, 1995), as well as the implementation of efficient DNA sequence production workflows in core facilities and high-throughput sequencing centers. It is notable that much of this innovation was motivated by the Human Genome Project (HGP), which achieved completion of a draft of the canonical human genome sequence in 2001 (International Human Genome Sequencing Consortium, 2001). Consequent to the technological innovation that enabled the HGP, the per-base cost of dideoxy sequencing has followed an exponential decline (Collins et al., 2003; Shendure et al., 2004). Importantly, the read lengths and accuracy of sequencing traces have steadily improved as well. As community-wide capacity for high-throughput DNA sequence production has been maintained in the wake of the HGP, the amount of sequence deposited in GenBank has continued its exponential rise. As of October 2007, genome sequences for 997 bacterial species and 164 eukaryotic species are available in at least draft assembly form.

In recent years, there has been a collective sense in the technology development field that optimization of dideoxy sequencing protocols may be approaching exhaustion, and that the trend of declining sequencing costs is unlikely to continue much further without a radical change in the underlying technology. This has sparked significant academic and commercial investment in alternative technological paradigms (Shendure et al., 2004). Several of these alternatives have quickly progressed to substantial proof-of-concept, demonstrating costs competitive with conventional dideoxy sequencing for certain applications (Margulies et al., 2005; Shendure et al., 2005). Some of these platforms have recently become, or are anticipated to become, widely available in an “open-source” format or as commercial products. Although dideoxy sequencing still accounts for the vast majority of DNA sequencing production, this is unlikely to be the case several years from now.

This unit provides a high-level overview of six distinct approaches to DNA sequencing. These are: (1) dideoxy sequencing, (2) cyclic array sequencing, (3) sequencing by hybridization, (4) microelectrophoresis, (5) mass spectrometry, and (6) nanopore sequencing. Additionally, this unit presents key parameters that should be considered when choosing the DNA sequencing strategy most appropriate for a given application. It should be emphasized that the DNA sequencing field is changing rapidly, so the information in this unit represents a snapshot of this particular moment.

It is worthwhile to note that the research goals that motivate DNA sequencing may be undergoing a substantial shift as well, concurrent with the introduction of new technologies. Given that genome sequences for H. sapiens as well as all major model organisms are nearly complete, demand will likely shift away from de novo genome sequencing towards other areas of application, such as resequencing (identifying sequence variation in the genome of an individual for whose species a reference genome is already available) and tag counting (i.e., serial analysis of gene expression or chromatin occupancy by the sequencing of short but identifying DNA tags). The initial generation of new technologies will deliver sequence that is substantially shorter and less accurate than state-of-the-art Sanger sequencing. However, although the utility of such sequence may be limited for de novo sequencing, it will likely be compatible, and often preferable, for other areas of application.

DNA SEQUENCING STRATEGIES

Dideoxy Sequencing

Dideoxy sequencing, also known as Sanger sequencing, proceeds by primer-initiated, polymerase-driven synthesis of DNA strands complementary to the template whose sequence is to be determined (Fig. 7.1.1). Numerous identical copies of the sequencing template undergo the primer extension reaction within a single microliter-scale volume. Generating sufficient quantities of template for a sequencing reaction is typically achieved by either (1) miniprep of a plasmid vector into which the fragment of interest has been cloned, or (2) polymerase chain reaction (PCR) followed by a cleanup step. In the sequencing reaction itself, both the natural deoxynucleotides (dNTPs) and the chain-terminating dideoxynucleotides (ddNTPs) are present at a specific ratio that determines their relative probability of incorporation during the primer extension. Incorporation of a ddNTP instead of a dNTP results in termination of a given strand. Therefore, for any given template molecule, strand elongation will begin at the 3′ end of the primer and terminate upon incorporation of a ddNTP. In older protocols for dideoxy sequencing, four separate primer extension reactions are carried out, each containing only one of the four possible ddNTP species (ddATP, ddGTP, ddCTP, or ddTTP), along with template, polymerase, dNTPs, and a radioactively labeled primer. The result is a collection of many terminated strands of many different lengths within each reaction. As each reaction contains only one ddNTP species, fragments with only a subset of possible lengths will be generated, corresponding to the positions of that nucleotide in the template sequence. The four reactions are then electrophoresed in four lanes of a denaturing polyacrylamide gel to yield size separation with single-nucleotide resolution. The pattern of bands (with each band consisting of terminated fragments of a single length) across the four lanes allows one to directly interpret the primary sequence of the template under analysis.

Current implementations of dideoxy sequencing differ in several key ways from the protocol described above. Only a single primer extension reaction is performed that includes all four ddNTPs. The four species of ddNTP are labeled with fluorescent dyes that have the same excitation wavelength but different emission spectra, allowing for identification by fluorescence resonance energy transfer (FRET). To minimize the required amount of template DNA, a “cycle sequencing” reaction is performed, in which multiple cycles of denaturation, primer annealing, and primer extension are performed to linearly increase the number of terminated strands.
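The chain-termination logic of the older four-reaction protocol described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names (`run_reaction`, `read_lanes`), the strand count, and the 10% ddNTP incorporation probability are hypothetical choices for illustration, not values from this unit. The string `product` stands for the newly synthesized strand (the complement of the template), written 5′ to 3′ starting just after the primer.

```python
import random

def run_reaction(product, dd_base, n_strands=2000, dd_fraction=0.1, seed=0):
    """Simulate one reaction containing a single ddNTP species.

    Each strand extends base by base; whenever the next base equals
    dd_base, a ddNTP is incorporated with probability dd_fraction
    (an assumed value), which terminates that strand.
    Returns the set of terminated fragment lengths (bases added).
    """
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_strands):
        for length, base in enumerate(product, start=1):
            if base == dd_base and rng.random() < dd_fraction:
                lengths.add(length)
                break  # chain terminated; this strand grows no further
    return lengths

def read_lanes(lanes, max_length):
    """Read the four lanes from shortest to longest fragment.

    A band of a given length should appear in exactly one lane;
    positions with no band (or conflicting bands) are called 'N'.
    """
    called = []
    for length in range(1, max_length + 1):
        hits = [base for base, lengths in lanes.items() if length in lengths]
        called.append(hits[0] if len(hits) == 1 else "N")
    return "".join(called)

product = "GATTACAGGCT"  # hypothetical synthesized strand
lanes = {base: run_reaction(product, base, seed=i)
         for i, base in enumerate("ACGT")}
print(read_lanes(lanes, len(product)))
```

Because each lane only ever terminates at positions of its own base, reading band positions in order of increasing length recovers the synthesized strand, and the template is its complement.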

Cycle sequencing requires the use of engineered polymerases, such as ThermoSequenase, that are thermostable and that efficiently incorporate modified ddNTPs (Tabor and Richardson, 1995). The products of the cycle sequencing reaction are analyzed in an automated sequencing instrument via electrophoresis in a long capillary filled with a denaturing polymer that yields size separation with single-base-pair resolution. As fragments of each discrete length pass through a transparent component near the end of the capillary, a single wavelength of light excites the fluorophores linked to the ddNTPs. Labeled fragments fluoresce at one of four distinct wavelengths, revealing the identity of their terminal base via FRET. Simultaneous measurement of the emission spectra at these four wavelengths produces a four-color sequencing trace. Computer algorithms (“base callers”) interpret the peak heights in these traces to produce a DNA sequence. Importantly, sophisticated algorithms exist that also define the accuracy with which individual base-calls are made (Ewing and Green, 1998; Ewing et al., 1998). Although the per-base accuracy can vary substantially within a single sequencing read, the accuracy of the best base calls can be as high as 99.999%.

Figure 7.1.1 Schematic of the basic principle involved in dideoxy sequencing. The sequencing template consists of an unknown region whose sequence is to be determined, flanked by known sequence to which a sequencing primer can be hybridized. Cycle sequencing (multiple cycles of primer annealing, primer extension, and denaturation) is performed with polymerase, dNTPs, and fluorescently labeled ddNTPs (where a different label is present on each species of ddNTP). Products of the cycle sequencing reaction are run into a capillary containing a denaturing polymer. This yields size-based separation with single-base-pair resolution, with the shortest fragments running the fastest. Observation of the emission spectra in four channels (corresponding to the fluorescent labels for the four ddNTP species) over time, as fragments emerge from capillary electrophoresis, can be used to infer the primary sequence of the unknown template.

Nearly all dideoxy sequencing performed today makes use of automated capillary electrophoresis, which typically analyzes 96 to 384 sequencing reactions simultaneously via an array of capillaries. Major vendors include Applied Biosystems (e.g., the ABI 3730) and
GE Healthcare (e.g., the MegaBACE instrument series). There is a tradeoff between long read lengths and the overall throughput of an instrument. Depending on which parameter is being optimized, conventional instruments are capable of reads just over 1000 base pairs in length, or production throughputs of over 2.5 megabases per day. Because of variation in the levels of optimization and instrument uptime, the cost of dideoxy sequencing varies widely throughout the research community. The in-house costs of high-throughput sequencing centers may be as low as 50 cents per kilobase, while core facilities and commercial entities may charge anywhere from $1 to $20 per sequencing read.

Cyclic Array Sequencing

All of the recently released, or soon-to-be-released, non-Sanger commercial sequencing platforms, including systems from 454/Roche, Solexa/Illumina, Agencourt/Applied Biosystems, and Helicos BioSystems, fall under the rubric of a single paradigm, termed cyclic array sequencing (Fig. 7.1.2). Cyclic array platforms achieve low costs by simultaneously decoding a two-dimensional array bearing millions (potentially billions) of distinct sequencing features. The sequencing features are “clonal,” in that each resolvable unit contains only one species of DNA (as a single molecule or in multiple copies) physically immobilized on the array. The features may be arranged in an ordered fashion or may be randomly dispersed. Each DNA feature generally includes an unknown sequence of interest (distinct from the unknown sequence of other DNA features on the array) flanked by universal adaptor sequences. A key point in this approach is that the features are not necessarily separated into individual wells. Rather, because they are immobilized on a single surface, a single reagent volume is applied to simultaneously access and manipulate all features in parallel. The sequencing process is cyclic because in each cycle an enzymatic process is applied to interrogate the identity of a single base position for all features in parallel. The enzymatic process is coupled to either the production of light or the incorporation of a fluorescent group. At the conclusion of each cycle, data are acquired by CCD-based imaging of the array. Subsequent cycles are aimed at interrogating different base positions within the template. After multiple cycles of enzymatic manipulation, position-specific interrogation, and array imaging, a contiguous sequence for each feature can be derived from analysis of the full series of imaging data covering its position.

Although this basic paradigm serves to describe several different platforms for cyclic array sequencing, the platforms differ remarkably in the specifics of implementation. The primary areas of difference (summarized for several platforms in Table 7.1.1) are (1) the method used to generate the DNA sequencing features, and (2) the biochemistry used to perform cyclic sequencing.

Figure 7.1.2 The concept of cyclic array sequencing platforms involves an array of DNA features to be sequenced, immobilized to constant locations on a solid substrate. At each cycle, the identity of a single base position is interrogated at each feature. Data are collected at each cycle by imaging of the array. At the conclusion of the experiment, imaging data for each feature collected over the full set of cycles can be used to infer contiguous stretches of sequence. The power of cyclic array methods to achieve low costs derives from the possibility of simultaneously sequencing millions to potentially billions of sequencing features in parallel. Also, microliter-scale reagent volumes can be used to manipulate all features in a single reaction, such that the effective reagent volume per sequencing feature is on the order of picoliters or femtoliters. For the color version of this figure go to http://www.currentprotocols.com.
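The two ideas in this caption, per-cycle imaging of every feature and a shared reagent volume, can be sketched in a few lines. This is a simplified illustration, not any platform's actual pipeline: the channel-to-base mapping, the intensity values, and the function names `call_features` and `reagent_per_feature_pl` are assumptions made for the example, and real base callers also correct for crosstalk, phasing, and signal decay.

```python
CHANNELS = "ACGT"  # assumed mapping of the four imaging channels to bases

def call_features(images):
    """Turn per-cycle imaging data into one read per feature.

    images[cycle][feature] is a 4-tuple of channel intensities measured
    at a fixed array location. The brightest channel gives the base
    interrogated at that cycle.
    """
    n_features = len(images[0])
    reads = [[] for _ in range(n_features)]
    for cycle in images:
        for f, intensities in enumerate(cycle):
            brightest = max(range(4), key=lambda c: intensities[c])
            reads[f].append(CHANNELS[brightest])
    return ["".join(r) for r in reads]

def reagent_per_feature_pl(volume_ul, n_features):
    """Effective reagent volume per feature in picoliters (1 µL = 1e6 pL)."""
    return volume_ul * 1e6 / n_features

# Three cycles, two features (made-up intensities).
images = [
    [(0.9, 0.1, 0.0, 0.1), (0.1, 0.0, 0.1, 0.8)],
    [(0.0, 0.8, 0.1, 0.1), (0.7, 0.1, 0.1, 0.0)],
    [(0.1, 0.1, 0.9, 0.0), (0.0, 0.2, 0.9, 0.1)],
]
print(call_features(images))                   # ['ACG', 'TAG']
print(reagent_per_feature_pl(10, 10_000_000))  # 1.0
```

With these illustrative numbers, 10 µL shared across ten million features works out to 1 pL per feature, which is the scale the caption describes.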

Table 7.1.1 Cyclic Array Sequencing Platforms(a)

Platform                          Polony amplification          Cycle sequencing
Harvard/Danaher,                  Emulsion PCR (1-µm beads)     Ligase
  Agencourt/Applied Biosystems
Solexa/Illumina                   Bridge PCR                    Polymerase (single base extension with reversible terminators)
454 Corp/Roche                    Emulsion PCR (28-µm beads)    Polymerase (pyrosequencing)
Helicos                           None (single molecule)        Polymerase (single base extension)

(a) Open-source and commercial cyclic array sequencing platforms that have been recently released, or are soon to be released. The primary areas of difference are the methods used to generate DNA sequencing features, and the biochemistry used for cyclic sequencing itself.

The following discussion briefly describes three distinct approaches that are currently available as commercial or open-source platforms, with commentary on some of the advantages and disadvantages of each.

“Polymerase colony” or “polony” is a generic term to describe the polymerase-driven amplification of a complex library of sequencing templates, such that amplicons originating from any given template within the complex library remain locally clustered, analogous to a bacterial colony (Mitra and Church, 1999; Mitra et al., 2003). Both the polony sequencing system recently developed at Harvard (Shendure et al., 2005; UNIT 7.8) and the 454 system (Roche; Margulies et al., 2005) generate DNA sequencing features by performing an emulsion PCR amplification of a complex sequencing library, such that PCR products derived from individual templates are clonally captured onto the surface of micrometer-scale beads. Emulsion PCR is similar to standard PCR, but is performed in the context of a water-in-oil emulsion (Tawfik and Griffiths, 1998) such that the aqueous component is separated into millions of stable reaction chambers of uniform size. A complex library (consisting of unknown fragments to be sequenced, flanked by universal adaptors) can be simultaneously amplified with a single pair of primers, but PCR amplicons originating from a single molecule within the library remain compartmentalized to a single aqueous chamber. It has been demonstrated that the inclusion of 1-µm paramagnetic beads, where the beads bear one of the two PCR primers on their surface, enables the solid-phase capture of PCR amplicons generated within individual emulsion PCR compartments (Dressman et al., 2003). Individual templates are “clonally amplified” in that any single bead recovered from the emulsion PCR reaction may carry multiple copies of a single library species, while different beads (which were present in different compartments) carry multiple copies of other library species. The Harvard platform uses emulsion PCR protocols adapted directly from work done by the group of Bert Vogelstein and Ken Kinzler (Dressman et al., 2003; Diehl et al., 2005, 2006; Li et al., 2006), using paramagnetic beads that are 1 µm in diameter. The 454 system uses a different oil phase that yields much larger aqueous compartments, thus supporting emulsion PCR with capture to much larger beads (28 µm in diameter).

In the 454 system, amplified beads are arrayed on a prefabricated fiber optic bundle etched to contain over one million picoliter-scale wells. Sequencing is performed by the pyrosequencing method. Initially developed by Mostafa Ronaghi and colleagues (Valdar et al., 2006), pyrosequencing involves polymerase extension of a primed template by sequential addition of a single nucleotide species at each cycle. Incorporation events at each array feature are detected by real-time, luciferase-based monitoring of pyrophosphate release. Several hundred thousand wells contain template-bearing beads that yield useful sequence. A key advantage of the 454 system, relative to other cyclic array platforms, is the clear demonstration of read lengths in excess of 100 base pairs. As it was the first cyclic array platform to achieve commercialization, 454 sequencing has contributed to successful projects over a range of applications, including bacterial genome resequencing (Andries et al., 2005; Velicer et al., 2006), de novo bacterial genome sequencing (best done as a combination of dideoxy- and pyrosequencing-based reads; Goldberg et al., 2006; Smith et al., 2007), small RNA discovery (Berezikov et al.,
2006; Ruby et al., 2006), and metagenomic sampling (Edwards et al., 2006). There are several disadvantages to consider. (1) There is a high error rate at homopolymeric sequences (consecutive runs of the same base) because it is difficult to interpret the precise number of consecutive incorporations in a single cycle. As there is a systematic component to these errors, the extent to which they resolve with multiple reads covering the same positions is unclear. (2) The cost of sequencing with the 454 system is lower than that of conventional dideoxy sequencing, but not dramatically so. (3) Pyrosequencing requires real-time observation of sequencing features during the enzymatic step of each cycle. This may limit the extent to which the system can evolve beyond its current feature density in future models of the instrument. (4) Performing emulsion PCR and bead recovery is cumbersome relative to bridge PCR or single-molecule methods (see below).

In the most recent implementation of the Harvard polony sequencing system (Shendure et al., 2005; UNIT 7.8), DNA sequencing features (1-µm beads generated by emulsion PCR) are dispersed to the surface of a glass coverslip as a disordered array, and immobilized either by a thin acrylamide gel or by direct covalent attachment to the surface. A unique aspect of this platform is that the enzyme conferring specificity during each sequencing cycle is a ligase, rather than a polymerase. At each cycle, a population of fluorescently labeled nonamers is introduced. The population is composed such that the identity of a specific base position within each nonamer correlates with the identity of the fluorophore. When a position near the site of ligation is interrogated, each bead will predominantly incorporate nonamers bearing a single type of fluorophore, which reveals the identity of the base at that position. The method is successful for sequencing contiguous stretches of 6 to 7 bases from the site of ligation. By using libraries with mate-paired sequencing tags and sequencing into each tag in both the 5′ and 3′ directions, one can obtain at least 26 base pairs of sequence information per bead feature. Data are collected at each cycle by four-color imaging with a modified epifluorescence microscope. The system has demonstrated success in resequencing a bacterial genome with raw accuracies of up to 99.9% and with very high consensus accuracies (<1 error per million bases), at a cost at least one order of magnitude below conventional sequencing (Shendure et al., 2005). An open-source version of the platform can be implemented with off-the-shelf instrumentation and reagents, with the instrument itself costing ∼$150,000. The platform was licensed by Harvard for commercial development to Agencourt/Applied Biosystems, which notably has developed more sophisticated sequencing-by-ligation chemistry that is capable of contiguous 25 to 35 base-pair reads per mate-paired tag. Applied Biosystems expects to release its instrument (also known as the SOLiD sequencing system) in 2008. Advantages of these systems over other cyclic array platforms include: (1) very small feature sizes (1 µm), with a potential to fit more than one billion sequencing features on the surface of a standard microscope slide, and (2) high consensus accuracies, which are particularly important for resequencing applications. Notable disadvantages include: (1) significantly shorter read lengths than dideoxy sequencing or the 454 system, and (2) as with the 454 system, cumbersome emulsion PCR and bead recovery relative to bridge PCR or single-molecule sequencing (see below).

Illumina’s Solexa sequencing platform (which includes merged technology from Solexa, Lynx, and Manteia SA) generates polony sequencing features by a different method known as “bridge PCR” (Fedurco et al., 2006). In this approach, both forward and reverse PCR primers are immobilized to the two-dimensional surface of a glass slide. The primers are designed to target universal adaptors that flank a complex library of sequencing templates. PCR is performed by standard thermal cycling of the slide, with all reagents present in aqueous phase except for the primers, which are only present in surface-bound form. Because all primers are immobilized, copies remain local, and the result of amplification of each single template molecule is a tight cluster of ∼1000 copies. One species of primer is chemically released from the slide, such that only one orientation of each amplified template remains. Cyclic sequencing is also performed by a method distinct from those described above. A universal primer is hybridized to a position immediately adjacent to unknown sequence. At each cycle, polymerase extension is performed with modified dNTPs bearing unique fluorescent groups (identifying the dNTP species) and a reversibly terminating moiety in place of the 3′-hydroxyl position. Because of this terminating group, only a single base extension can occur at each cycle.
The array is imaged in four colors to acquire data on all features for a single base position. After cleavage of the terminating moiety (leaving a 3′-hydroxyl group), the next cycle can begin. Read lengths of 35 to 50 bases for over 40 million features have been demonstrated on the Solexa system. Key advantages of this platform are: (1) at least a gigabase of sequence can be generated in a single sequencing run, (2) the simplicity of bridge PCR relative to emulsion PCR for feature generation, and (3) per-base costs are estimated to be at least two orders of magnitude lower than conventional dideoxy sequencing. Key disadvantages may include: (1) significantly shorter read lengths than dideoxy sequencing or the 454 system, and (2) the system was only recently released, so performance for key parameters such as raw and consensus accuracy is still under evaluation.

Several groups are working on cyclic array platforms for direct sequencing of single molecules, i.e., without any amplification step. These include the Helicos system, based on technology developed in Steven Quake’s lab (Braslavsky et al., 2003), in which a library of single DNA molecules is dispersed to an array and sequenced by cyclic extensions with fluorescently labeled nucleotides. A different approach is being taken by Nanofluidics, Inc., based on technology developed in Watt Webb’s lab (Levene et al., 2003), which proposes to use an array of zero-mode waveguides for real-time, single-molecule observation of polymerase-driven incorporation of fluorescently labeled nucleotides to a primed template.

Sequencing by Hybridization

The principle of sequencing by hybridization (SBH) is that the differential hybridization of target DNA to an array of oligonucleotide probes can be used to decode its primary DNA sequence. The most successful implementations of this approach rely on probe sequences based on the reference genome sequence of a given species, such that genomic DNA derived from individuals of that species can be hybridized to the array to reveal differences relative to the reference genome (i.e., resequencing, rather than de novo sequencing). This same concept is used for many genotyping array platforms, except that SBH attempts to query all bases, rather than only bases at which common polymorphisms have been defined. In resequencing arrays developed by Affymetrix and Perlegen, each feature consists of a 25-bp oligonucleotide of defined sequence. For each base pair to be resequenced, there are four features on the chip that differ only at their central position (dA, dG, dC, or dT), while the flanking sequence is constant and is based on the reference genome. After hybridization of labeled target DNA to the chip, followed by imaging of the array, the relative intensities at each set of four features targeting a given position can be used to infer its identity. Perlegen developed and applied SBH arrays for resequencing the nonrepetitive portion of chromosome 21 in multiple individuals, yielding extensive discovery of novel SNPs (Patil et al., 2001). A key limitation was a high false positive rate (3%), a significant problem as there is no possibility of redundant coverage to obtain higher consensus accuracies, as is possible with other sequencing methods. A component of the problem lies in the difficulty that SBH approaches have with heterozygosity in diploid genomes. A recent study demonstrated that Affymetrix arrays could be used for resequencing haploid S. cerevisiae strains with an impressively low false-positive rate (detection of 87% of ∼30,000 SNPs with only eight false positives; Gresham et al., 2006). Nimblegen has developed a two-tiered SBH approach to genomic resequencing of microbial genomes: genome-wide discovery of approximate locations of mutations in the first array, followed by fine mapping in a second, custom array (Albert et al., 2005; Herring et al., 2006). In one recent study using Nimblegen resequencing arrays, 95 SNPs were predicted in the genomes of five evolved E. coli strains, 17 of which were true by Sanger-based confirmation, at a cost of ∼$7,500 per strain (Herring et al., 2006).

Microelectrophoresis

As mentioned above, conventional dideoxy sequencing is performed with microliter-scale reagent volumes, with most instruments running 96 or 384 reactions simultaneously in separate reaction vessels. The goal of microelectrophoretic methods is to make use of microfabrication techniques developed in the semiconductor industry to enable significant miniaturization of conventional dideoxy sequencing (Paegel et al., 2003), for example, by performing electrophoresis in nanoliter-scale channels cut into the surface of a silicon wafer (Emrich et al., 2002). A more ambitious goal, on which much progress has been made, is the integration of a series of sequencing-related steps (e.g., PCR amplification, product purification, and sequencing) in a “lab-on-a-chip” format (Blazej et al., 2006). A key advantage of this approach is the
retention of the dideoxy biochemistry, which has proven robustness for >10^11 bases of sequencing. Until alternative methods achieve significantly longer read lengths than they can today, there will continue to be an important role for dideoxy sequencing. Microelectrophoretic methods may prove critical to continuing the trend of reducing costs for this well-proven chemistry. There may also be a key role for “lab-on-a-chip” integrated sequencing devices for cost-effective, clinical “point-of-care” diagnostics.

Mass Spectrometry

Over the past decade, mass spectrometry (MS) has established itself as the key data-acquisition platform for the emerging field of proteomics. There are also applications for MS in genomics, including methods for genotyping, quantitative DNA analysis, gene expression analysis, analysis of indels and DNA methylation, and DNA/RNA sequencing (Ragoussis et al., 2006). Sequencing with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) relies on the precise measurement of the masses of DNA fragments present within a mixture of nucleic acids (Edwards et al., 2005). With MS sequencing, fragmentation can also be achieved by primer extension with dideoxy termination; the primary difference is the use of MALDI-TOF-MS rather than capillary electrophoresis to resolve fragment sizes. Alternatively, fragments are transcribed to RNA and subjected to base-specific cleavage prior to analysis. For de novo sequencing with MS, read lengths have generally been limited to <100 bp. Applications of MS sequencing include deciphering sequences that appear as compression zones by gel electrophoresis, direct sequencing of RNA (including for identification of post-transcriptional modifications of ribosomal RNA), the robust discovery of heterozygous frameshift and substitution mutations within PCR products in resequencing projects, and DNA methylation analysis (Ragoussis et al., 2006). MS sequencing will likely continue to have a role for specific problems not easily addressed with other methods, but is unlikely to displace conventional methods for most DNA sequencing applications.

Nanopore Sequencing

A creative approach to single-molecule sequencing, first proposed in the 1980s, involves passing single-stranded DNA through a nanopore (Deamer and Akeson, 2000). The nanopore itself is a biological membrane protein (e.g., α-hemolysin; Kasianowicz et al., 1996) or a synthetic solid-state device (Fologea et al., 2005). As individual nucleotides are expected to obstruct the pore to varying degrees in a base-specific manner (or can potentially be encouraged to do so via base-specific modifications), the resulting fluctuations in electrical conductance through the pore can, in principle, be measured and used to infer the primary DNA sequence. Published examples of nanopore-based characterization of single nucleic acid molecules include: (1) the measurement of duplex stem length, base-pair mismatches, and loop length within DNA hairpins (Vercoutere et al., 2001), (2) the classification of the terminal base pair of a DNA hairpin, with ∼60% to 90% accuracy with a single observation, and >99% accuracy with 15 observations of the same species (Winters-Hilt et al., 2003), and (3) reasonably accurate (93% to 98%) discrimination of deoxynucleotide monophosphates from one another with an engineered protein nanopore sensor (Astier et al., 2006). Additionally, a wide range of proposed approaches to nanopore sequencing were recently funded by the NIH and are under development. It is probable that significant pore engineering and further technology development will be necessary to achieve accurate decoding of a complex mixture of DNA molecules with single-base-pair resolution and useful read lengths. Provided these challenges can be met, nanopore sequencing has great potential to enable extraordinarily rapid and cost-effective sequencing of populations of DNA molecules with comparatively simple sample preparation.

CHOOSING A SEQUENCING STRATEGY

Given the extent of flux in the sequencing technology field and the uncertainties surrounding the precise costs and performance parameters for several of the new non-Sanger sequencing platforms, it is difficult to state with any certainty which system will be best for any given application. Some of the key parameters that should be considered in comparing technologies to one another include the following.

Cost per raw base. What is the all-inclusive cost (instrument amortization, reagents, labor, etc.) for producing each base pair of sequence?

Raw accuracy. What is the distribution of accuracies with which raw base calls are made? What is the dominant error modality?

New technologies are clearly quite a bit behind conventional sequencing with respect to raw accuracy.

Cost per consensus base. The cost per raw base must be weighed against accuracy; for example, one high-accuracy raw base call at a given position may be more valuable than several lower-accuracy raw base calls.

Consensus accuracy. If errors are systematic rather than random, then multiple reads covering a given position (raw base calls) may not lead to higher consensus accuracies in a straightforward manner.

Read lengths. Although less expensive per base, new platforms for DNA sequencing are currently at a significant disadvantage with respect to read length. Certain applications, such as de novo genome sequencing and assembly, may prove difficult with technologies limited to read lengths of 25 bp. On the other hand, short read lengths may be sufficient for resequencing (identifying variants in individuals for whom a canonical genome is already defined), as well as for tag counting.

Cost per read. For tag-counting applications such as serial analysis of gene expression (SAGE; UNIT 25B.6), read lengths add little information beyond a certain point. The key parameter here is the number of independently sequenced tags, each with sufficient information to identify the transcript from which it was derived.

Mate-paired reads. The capacity to produce mate-paired reads (pairs of reads separated by a known distance distribution on the genome of origin) can be critical to certain applications, such as de novo genome assembly and the detection of structural rearrangements.

Finally, it should be emphasized that the “pre-processing” protocols (e.g., library construction) and “post-processing” pipelines (data analysis) for new sequencing technology platforms are not nearly as mature as for conventional dideoxy sequencing. The development of robust, straightforward protocols for in vitro library construction for various applications and the creation of tools for interpreting massive amounts of short-read sequencing data are critical challenges that must be addressed if investigators are to make the most of these exciting new technologies.

LITERATURE CITED

Albert, T.J., Dailidiene, D., Dailide, G., Norton, J.E., Kalia, A., Richmond, T.A., Molla, M., Singh, J., Green, R.D., and Berg, D.E. 2005. Mutation discovery in bacterial genomes: Metronidazole resistance in Helicobacter pylori. Nat. Methods 2:951-953.

Andries, K., Verhasselt, P., Guillemont, J., Gohlmann, H.W., Neefs, J.M., Winkler, H., Van Gestel, J., Timmerman, P., Zhu, M., Lee, E., Williams, P., de Chaffoy, D., Huitric, E., Hoffner, S., Cambau, E., Truffot-Pernot, C., Lounis, N., and Jarlier, V. 2005. A diarylquinoline drug active on the ATP synthase of Mycobacterium tuberculosis. Science 307:223-227.

Astier, Y., Braha, O., and Bayley, H. 2006. Toward single molecule DNA sequencing: Direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J. Am. Chem. Soc. 128:1705-1710.

Berezikov, E., Thuemmler, F., van Laake, L.W., Kondova, I., Bontrop, R., Cuppen, E., and Plasterk, R.H. 2006. Diversity of microRNAs in human and chimpanzee brain. Nat. Genet. 38:1375-1377.

Blazej, R.G., Kumaresan, P., and Mathies, R.A. 2006. Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc. Natl. Acad. Sci. U.S.A. 103:7240-7245.

Braslavsky, I., Hebert, B., Kartalov, E., and Quake, S.R. 2003. Sequence information can be obtained from single DNA molecules. Proc. Natl. Acad. Sci. U.S.A. 100:3960-3964.

Collins, F.S., Morgan, M., and Patrinos, A. 2003. The Human Genome Project: Lessons from large-scale biology. Science 300:286-290.

Deamer, D.W. and Akeson, M. 2000. Nanopores and nucleic acids: Prospects for ultrarapid sequencing. Trends Biotechnol. 18:147-151.

Diehl, F., Li, M., Dressman, D., He, Y., Shen, D., Szabo, S., Diaz, L.A. Jr., Goodman, S.N., David, K.A., Juhl, H., Kinzler, K.W., and Vogelstein, B. 2005. Detection and quantification of mutations in the plasma of patients with colorectal tumors. Proc. Natl. Acad. Sci. U.S.A. 102:16368-16373.

Diehl, F., Li, M., He, Y., Kinzler, K.W., Vogelstein, B., and Dressman, D. 2006. BEAMing: Single-molecule PCR on microparticles in water-in-oil emulsions. Nat. Methods 3:551-559.

Dressman, D., Yan, H., Traverso, G., Kinzler, K.W., and Vogelstein, B. 2003. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc. Natl. Acad. Sci. U.S.A. 100:8817-8822.

Edwards, J.R., Ruparel, H., and Ju, J. 2005. Mass-spectrometry DNA sequencing. Mutat. Res. 573:3-12.

Edwards, R.A., Rodriguez-Brito, B., Wegley, L., Haynes, M., Breitbart, M., Peterson, D.M., Saar, M.O., Alexander, S., Alexander, E.C. Jr., and Rohwer, F. 2006. Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 7:57.

Emrich, C.A., Tian, H., Medintz, I.L., and Mathies, R.A. 2002. Microfabricated 384-lane capillary array electrophoresis bioanalyzer for ultrahigh-throughput genetic analysis. Anal. Chem. 74:5076-5083.

Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.

Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.

Fedurco, M., Romieu, A., Williams, S., Lawrence, I., and Turcatti, G. 2006. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res. 34:e22.

Fologea, D., Gershow, M., Ledden, B., McNabb, D.S., Golovchenko, J.A., and Li, J. 2005. Detecting single stranded DNA with a solid state nanopore. Nano Lett. 5:1905-1909.

Goldberg, S.M., Johnson, J., Busam, D., Feldblyum, T., Ferriera, S., Friedman, R., Halpern, A., Khouri, H., Kravitz, S.A., Lauro, F.M., Li, K., Rogers, Y.H., Strausberg, R., Sutton, G., Tallon, L., Thomas, T., Venter, E., Frazier, M., and Venter, J.C. 2006. A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc. Natl. Acad. Sci. U.S.A. 103:11240-11245.

Gresham, D., Ruderfer, D.M., Pratt, S.C., Schacherer, J., Dunham, M.J., Botstein, D., and Kruglyak, L. 2006. Genome-wide detection of polymorphisms at nucleotide resolution with a single DNA microarray. Science 311:1932-1936.

Herring, C.D., Raghunathan, A., Honisch, C., Patel, T., Applebee, M.K., Joyce, A.R., Albert, T.J., Blattner, F.R., van den Boom, D., Cantor, C.R., and Palsson, B.Ø. 2006. Comparative genome sequencing of Escherichia coli allows observation of bacterial evolution on a laboratory timescale. Nat. Genet. 38:1406-1412.

Holley, R.W., Apgar, J., Everett, G.A., Madison, J.T., Marquisee, M., Merrill, S.H., Penswick, J.R., and Zamir, A. 1965. Structure of a ribonucleic acid. Science 147:1462-1465.

Hunkapiller, T., Kaiser, R.J., Koop, B.F., and Hood, L. 1991. Large-scale and automated DNA sequence determination. Science 254:59-67.

International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

Kasianowicz, J.J., Brandin, E., Branton, D., and Deamer, D.W. 1996. Characterization of individual polynucleotide molecules using a membrane channel. Proc. Natl. Acad. Sci. U.S.A. 93:13770-13773.

Levene, M.J., Korlach, J., Turner, S.W., Foquet, M., Craighead, H.G., and Webb, W.W. 2003. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299:682-686.

Li, M., Diehl, F., Dressman, D., Vogelstein, B., and Kinzler, K.W. 2006. BEAMing up for detection and quantification of rare sequence variants. Nat. Methods 3:95-97.

Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Irzyk, G.P., Jando, S.C., Alenquer, M.L., Jarvie, T.P., Jirage, K.B., Kim, J.B., Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., and Rothberg, J.M. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376-380.

Maxam, A.M. and Gilbert, W. 1977. A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74:560-564.

Mitra, R.D. and Church, G.M. 1999. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res. 27:e34.

Mitra, R.D., Shendure, J., Olejnik, J., Krzymanska-Olejnik, E., and Church, G.M. 2003. Fluorescent in situ sequencing on polymerase colonies. Anal. Biochem. 320:55-65.

Paegel, B.M., Blazej, R.G., and Mathies, R.A. 2003. Microfluidic devices for DNA sequencing: Sample preparation and electrophoretic analysis. Curr. Opin. Biotechnol. 14:42-50.

Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., Nguyen, B.T., Norris, M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P., Thomas, D.J., Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P., and Cox, D.R. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723.

Prober, J.M., Trainor, G.L., Dam, R.J., Hobbs, F.W., Robertson, C.W., Zagursky, R.J., Cocuzza, A.J., Jensen, M.A., and Baumeister, K. 1987. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238:336-341.

Ragoussis, J., Elvidge, G.P., Kaur, K., and Colella, S. 2006. Matrix-assisted laser desorption/ionisation, time-of-flight mass spectrometry in genomics research. PLoS Genet. 2:e100.

Ruby, J.G., Jan, C., Player, C., Axtell, M.J., Lee, W., Nusbaum, C., Ge, H., and Bartel, D.P. 2006. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127:1193-1207.

Ryle, A.P., Sanger, F., Smith, L.F., and Kitai, R. 1955. The disulphide bonds of insulin. Biochem. J. 60:541-556.

Sanger, F. 1988. Sequences, sequences, and sequences. Annu. Rev. Biochem. 57:1-28.

Sanger, F., Air, G.M., Barrell, B.G., Brown, N.L., Coulson, A.R., Fiddes, C.A., Hutchison, C.A., Slocombe, P.M., and Smith, M. 1977. Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687-695.

Shendure, J., Mitra, R.D., Varma, C., and Church, G.M. 2004. Advanced sequencing technologies: Methods and goals. Nat. Rev. Genet. 5:335-344.

Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., Wang, M.D., Zhang, K., Mitra, R.D., and Church, G.M. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309:1728-1732.

Smith, M.G., Gianoulis, T.A., Pukatzki, S., Mekalanos, J.J., Ornston, L.N., Gerstein, M., and Snyder, M. 2007. New insights into Acinetobacter baumannii pathogenesis revealed by high-density pyrosequencing and transposon mutagenesis. Genes Dev. 21:601-614.

Tabor, S. and Richardson, C.C. 1995. A single residue in DNA polymerases of the Escherichia coli DNA polymerase I family is critical for distinguishing between deoxy- and dideoxyribonucleotides. Proc. Natl. Acad. Sci. U.S.A. 92:6339-6343.

Tawfik, D.S. and Griffiths, A.D. 1998. Man-made cell-like compartments for molecular evolution. Nat. Biotechnol. 16:652-656.

Valdar, W., Solberg, L.C., Gauguier, D., Burnett, S., Klenerman, P., Cookson, W.O., Taylor, M.S., Rawlins, J.N., Mott, R., and Flint, J. 2006. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat. Genet. 38:879-887.

Velicer, G.J., Raddatz, G., Keller, H., Deiss, S., Lanz, C., Dinkelacker, I., and Schuster, S.C. 2006. Comprehensive mutation identification in an evolved bacterial cooperator and its cheating ancestor. Proc. Natl. Acad. Sci. U.S.A. 103:8107-8112.

Vercoutere, W., Winters-Hilt, S., Olsen, H., Deamer, D., Haussler, D., and Akeson, M. 2001. Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nat. Biotechnol. 19:248-252.

Winters-Hilt, S., Vercoutere, W., DeGuzman, V.S., Deamer, D., Akeson, M., and Haussler, D. 2003. Highly accurate classification of Watson-Crick basepairs on termini of single DNA molecules. Biophys. J. 84:967-976.
