High-throughput DNA sequencing – concepts and limitations
Methods, Models & Techniques

Martin Kircher and Janet Kelso*

Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
*Corresponding author: Janet Kelso
E-mail: [email protected]

DOI 10.1002/bies.200900181
Bioessays 32: 524–536, © 2010 WILEY Periodicals, Inc.

Recent advances in DNA sequencing have revolutionized the field of genomics, making it possible for even single research groups to generate large amounts of sequence data very rapidly and at a substantially lower cost. These high-throughput sequencing technologies make deep transcriptome sequencing and transcript quantification, whole genome sequencing and resequencing available to many more researchers and projects. However, while the cost and time have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous sequencing technologies. The selection of an appropriate sequencing platform for particular types of experiments is an important consideration, and requires a detailed understanding of the technologies available; including sources of error, error rate, as well as the speed and cost of sequencing. We review the relevant concepts and compare the issues raised by the current high-throughput DNA sequencing technologies. We analyze how future developments may overcome these limitations and what challenges remain.

Keywords: ABI/Life Technologies SOLiD; Helicos HeliScope; Illumina Genome Analyzer; Roche/454 GS FLX Titanium; Sanger capillary sequencing

Abbreviations: A/C/G/T, Deoxyadenosine, Deoxycytosine, Deoxyguanosine, Deoxythymidine; ATP, Adenosine triphosphate; dATPαS, Deoxy-adenosine-5'-(alpha-thio)-triphosphate; CCD, Charge-coupled Device, i.e. semi-conductor device used in digital cameras; ChIP-Seq, Chromatin Immuno-Precipitation sequencing; CNV, Copy Number Variation; dNTPs/NTPs, deoxy-nucleotides; ddNTPs, dideoxy-nucleotides (modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar); GA, Short for Illumina Genome Analyzer; InDel, Insertion/Deletion; kb/Mb/Gb, kilo base (10^3 nt)/mega base (10^6 nt)/giga base (10^9 nt); MeDIP-Seq, Methylation-Dependent Immuno-Precipitation sequencing; nt, nucleotide(s); PCR, Polymerase Chain Reaction; RNA-Seq, Sequencing of mRNAs/transcripts; SAGE, Serial Analysis of Gene Expression; SNP, Single Nucleotide Polymorphism; mRNA, messenger RNA/transcripts.

Introduction

In 1977 the first genome, that of the 5,386 nucleotide (nt), single-stranded bacteriophage φX174, was completely sequenced [1] using a technology invented just a few years earlier [2–5]. Since then the sequencing of whole genomes as well as of individual regions and genes has become a major focus of modern biology and completely transformed the field of genetics.

At the time of the sequencing of φX174, and for almost another decade, DNA sequencing was a barely automated and very tedious process which involved determining only a few hundred nucleotides at a time. In the late 1980s, semi-automated sequencers with higher throughput became available [6, 7], still only able to determine a few sequences at a time. A breakthrough in the early 1990s was the development of capillary array electrophoresis and appropriate detection systems [8–12]. As recently as 1996, these developments converged in the production of a commercial single capillary sequencer (ABI Prism 310). In 1998, the GE Healthcare MegaBACE 1000 and the ABI Prism 3700 DNA Analyzer became the first commercial 96 capillary sequencers, a development which was termed high-throughput sequencing.

Over the last decade, alternative sequencing strategies have become available [13–18] which force us to completely redefine "high-throughput sequencing." These technologies outperform the older Sanger-sequencing technologies by a factor of 100–1,000 in daily throughput, and at the same time reduce the cost of sequencing one million nucleotides (1 Mb) to 4–0.1% of that associated with Sanger sequencing. To reflect these huge changes, several companies, researchers, and recent reviews [19–24] use the term "next-generation sequencing" instead of high-throughput sequencing, yet this term itself may soon be outdated considering the speed of ongoing developments.

Here we review the five sequencing technologies currently available on the market (capillary sequencing, pyrosequencing, reversible terminator chemistry, sequencing-by-ligation, and virtual terminator chemistry), discuss the intrinsic limitations of each, and provide an outlook on new technologies on the horizon. We explain how the vast increases in throughput are associated with both new and old types of problems in the resulting sequence data, and how these limit the potential applications and pose challenges for data analysis.

Sanger capillary sequencing

Current Sanger capillary sequencing systems, like the widely used Applied Biosystems 3xxx series or the GE Healthcare MegaBACE instrument, are still based on the same general scheme applied in 1977 for the φX174 genome [1, 3]. First, millions of copies of the sequence to be determined are purified or amplified, depending on the source of the sequence. Reverse strand synthesis is performed on these copies using a known priming sequence upstream of the sequence to be determined and a mixture of deoxy-nucleotides (dNTPs, the standard building blocks of DNA) and dideoxy-nucleotides (ddNTPs, modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar). The dNTP/ddNTP mixture causes random, non-reversible termination of the extension reaction, creating from the different copies molecules extended to different lengths. Following denaturation and clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted by their molecular weight (corresponding to the point of termination) and the label attached to the terminating ddNTPs is read out sequentially in the order created by the sorting step. A schematic representation of this process is available in Fig. 1.

Figure 1. Schematic representation of the Sanger sequencing process. Input DNA is fragmented and cloned into bacterial vectors for in vivo amplification. Reverse strand synthesis is performed on the obtained copies starting from a known priming sequence and using a mixture of deoxy-nucleotides (dNTPs) and dideoxy-nucleotides (ddNTPs). The dNTP/ddNTP mixture randomly causes the extension to be non-reversibly terminated, creating differently extended molecules. Subsequently, after denaturation, clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted using capillary electrophoresis by their molecular weight (corresponding to the point of termination) and the fluorescent label attached to the terminating ddNTPs is read out sequentially.

Sorting by molecular weight was originally performed using gel electrophoresis but is nowadays carried out by capillary electrophoresis [7, 25]. Originally, radioactive or optical labels were applied in four different terminator reactions (each sorted and read out separately), but today four different fluorophores, one per nucleotide (A, C, G, and T), are used in a single reaction [6]. Additionally, the advent of more sensitive detection systems and several rounds of primer extensions (equivalent to a linear amplification) permit smaller amounts of starting DNA to be used for modern sequencing reactions.

Unfortunately, there is still little automation for creation of the high copy input DNA with known priming sites. Typically this is done by cloning, i.e., introducing the target sequence into a known vector sequence using restriction and ligation procedures and using a bacterial strain to amplify the target sequence in vivo, thereby exploiting the low amplification error due to inherent proof-reading and repair mechanisms. However, this process is very tedious and is sometimes hampered by difficulties such as cloning specific sequences due to their base composition, length, and interactions with the bacterial host system. Although not yet widely used, integrated microfluidic devices have been developed which aim to automate the DNA extraction, in vitro amplification, and sequencing on the same chip [26–29].

Using current Sanger sequencing technology, it is technically possible for up to 384 sequences [29, 30] of between 600 and 1,000 nt in length [23, 31] to be sequenced in parallel. However, these 384-capillary systems are rare. The more standard 96-capillary instruments yield a maximum of approximately 6 Mb of DNA sequence per day, with costs for consumables amounting to about $500 per 1 Mb.
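The chain-termination scheme described in this section can be illustrated with a small simulation. The sketch below is not taken from the article: the function names, the 12-nt example template, the number of copies, and the ddNTP fraction are all hypothetical, and fragment length is used as a stand-in for molecular weight. It only aims to show why random termination followed by sorting and label read-out recovers the synthesized strand.

```python
import random

# Watson-Crick pairing used during reverse strand synthesis.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extend_copy(template, dd_fraction, rng):
    """Extend one template copy from the primer.

    `template` lists the bases in the order the polymerase meets them.
    At every position a labelled ddNTP is incorporated with probability
    `dd_fraction` (an assumed value), which terminates the extension.
    Returns the terminated fragment, or None if no ddNTP was incorporated
    (such molecules carry no label and are ignored here).
    """
    strand = []
    for base in template:
        strand.append(COMPLEMENT[base])
        if rng.random() < dd_fraction:
            return "".join(strand)  # last base carries the fluorescent label
    return None


def sanger_read(template, n_copies=5000, dd_fraction=0.2, seed=1):
    """Simulate many extensions, sort fragments by length, read the labels."""
    rng = random.Random(seed)
    label_by_length = {}
    for _ in range(n_copies):
        fragment = extend_copy(template, dd_fraction, rng)
        if fragment is not None:
            # All fragments of one length end in the same labelled base.
            label_by_length[len(fragment)] = fragment[-1]
    # Sorting by length mimics the electrophoresis step; reading the
    # terminal label of each length class yields the sequence.
    return "".join(label_by_length[n] for n in sorted(label_by_length))


template = "TACGGATCCAGT"  # hypothetical 12-nt template
read = sanger_read(template)
expected = "".join(COMPLEMENT[b] for b in template)
print(read == expected)
```

With many copies and a moderate ddNTP fraction, every termination position is represented, so the read matches the complementary strand. With too few copies or a high ddNTP fraction, long fragments become rare, which gives an intuition for why read length is limited in practice.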
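To put the throughput and cost figures quoted above into perspective, the following back-of-the-envelope sketch applies them to a project of a given size. The per-day and per-Mb values are the ones stated in the text for a single 96-capillary instrument; the 3,000 Mb project size and the function names are assumptions chosen purely for illustration, and the calculation covers consumables only.

```python
# Figures quoted in the text for a standard 96-capillary Sanger instrument.
SANGER_MB_PER_DAY = 6.0      # maximum daily yield in Mb
SANGER_USD_PER_MB = 500.0    # consumables cost per Mb

# Ranges quoted in the text for the newer technologies, relative to Sanger.
THROUGHPUT_FACTOR = (100.0, 1000.0)   # 100-1,000 times the daily throughput
COST_FRACTION = (0.001, 0.04)         # 0.1-4% of the Sanger cost per Mb


def sanger_estimate(megabases):
    """Return (instrument days, consumables cost in USD) on one instrument."""
    days = megabases / SANGER_MB_PER_DAY
    cost = megabases * SANGER_USD_PER_MB
    return days, cost


def high_throughput_range(megabases):
    """Return ((min_days, max_days), (min_cost, max_cost)) for the quoted ranges."""
    days, cost = sanger_estimate(megabases)
    days_range = (days / THROUGHPUT_FACTOR[1], days / THROUGHPUT_FACTOR[0])
    cost_range = (cost * COST_FRACTION[0], cost * COST_FRACTION[1])
    return days_range, cost_range


project_mb = 3000.0  # hypothetical project: 3,000 Mb of raw sequence
sanger_days, sanger_cost = sanger_estimate(project_mb)
ht_days, ht_cost = high_throughput_range(project_mb)

print(f"Sanger: {sanger_days:.0f} instrument days, ${sanger_cost:,.0f}")
print(f"High-throughput: {ht_days[0]:.1f} to {ht_days[1]:.1f} days, "
      f"${ht_cost[0]:,.0f} to ${ht_cost[1]:,.0f}")
```

The throughput and cost ranges are treated independently here, since the text does not say which throughput factor goes with which cost fraction. The estimate also ignores coverage requirements, instrument cost, and labour.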
The sequencing error observed for Sanger sequencing is mainly due to errors in the amplification step (a low rate when done in vivo), natural variance, and contamination in the sample [...]

[...] sequencing platforms on the market (released in October 2005). It is based on the pyrosequencing approach developed by Pål Nyrén and Mostafa Ronaghi at the Royal Institute of Technology, [...]

[...] synthesized by the polymerase. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP sulfurylase. The ATP drives the light reaction