Methods, Models & Techniques

High-throughput DNA sequencing – concepts and limitations

Martin Kircher and Janet Kelso

Recent advances in DNA sequencing have revolutionized the field of genomics, making it possible for even single research groups to generate large amounts of sequence data very rapidly and at a substantially lower cost. These high-throughput sequencing technologies make deep transcriptome sequencing and transcript quantification, whole genome sequencing and resequencing available to many more researchers and projects. However, while the cost and time have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous sequencing technologies. The selection of an appropriate sequencing platform for particular types of experiments is an important consideration, and requires a detailed understanding of the technologies available, including sources of error, error rate, as well as the speed and cost of sequencing. We review the relevant concepts and compare the issues raised by the current high-throughput DNA sequencing technologies. We analyze how future developments may overcome these limitations and what challenges remain.

Keywords: ABI/Life Technologies SOLiD; Helicos HeliScope; Illumina Genome Analyzer; Roche/454 GS FLX Titanium; Sanger capillary sequencing

DOI 10.1002/bies.200900181

Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany

*Corresponding author: Janet Kelso, E-mail: [email protected]

Abbreviations: A/C/G/T, Deoxyadenosine, Deoxycytosine, Deoxyguanosine, Deoxythymidine; ATP, Adenosine triphosphate; dATPαS, Deoxy-adenosine-5'-(alpha-thio)-triphosphate; CCD, Charge-coupled Device, i.e. semi-conductor device used in digital cameras; ChIP-Seq, Chromatin Immuno-Precipitation sequencing; CNV, Copy Number Variation; dNTPs/NTPs, deoxy-nucleotides; ddNTPs, dideoxy-nucleotides (modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar); GA, Short for Illumina Genome Analyzer; InDel, Insertion/Deletion; kb/Mb/Gb, kilo base (10³ nt)/mega base (10⁶ nt)/giga base (10⁹ nt); MeDIP-Seq, Methylation-Dependent Immuno-Precipitation sequencing; nt, nucleotide(s); PCR, Polymerase Chain Reaction; RNA-Seq, Sequencing of mRNAs/transcripts; SAGE, Serial Analysis of Gene Expression; SNP, Single Nucleotide Polymorphism; mRNA, messenger RNA/transcripts.

Introduction

In 1977 the first genome, that of the 5,386 nucleotide (nt), single-stranded bacteriophage φX174, was completely sequenced [1] using a technology invented just a few years earlier [2–5]. Since then the sequencing of whole genomes as well as of individual regions and genes has become a major focus of modern biology and completely transformed the field of genetics.

At the time of the sequencing of φX174, and for almost another decade, DNA sequencing was a barely automated and very tedious process which involved determining only a few hundred nucleotides at a time. In the late 1980s, semi-automated sequencers with higher throughput became available [6, 7], still only able to determine a few sequences at a time. A breakthrough in the early 1990s was the development of capillary array electrophoresis and appropriate detection systems [8–12]. As recently as 1996, these developments converged in the production of a commercial single capillary sequencer (ABI Prism 310). In 1998, the GE Healthcare MegaBACE 1000 and the ABI Prism 3700 DNA Analyzer became the first commercial 96 capillary sequencers, a development which was termed high-throughput sequencing.

Over the last decade, alternative sequencing strategies have become available [13–18] which force us to completely redefine "high-throughput sequencing." These technologies outperform the older Sanger-sequencing technologies by a factor of 100–1,000 in daily throughput, and at the same time reduce the cost of sequencing one million nucleotides (1 Mb) to 4–0.1% of that associated with Sanger sequencing. To reflect these huge changes, several companies, researchers, and recent reviews [19–24] use the term "next-generation sequencing" instead of high-throughput sequencing, yet this term itself may soon be outdated considering the speed of ongoing developments.

Bioessays 32: 524–536, © 2010 WILEY Periodicals, Inc. www.bioessays-journal.com

Here we review the five sequencing technologies currently available on the market (capillary sequencing, pyrosequencing, reversible terminator chemistry, sequencing-by-ligation, and virtual terminator chemistry), discuss the intrinsic limitations of each, and provide an outlook on new technologies on the horizon. We explain how the vast increases in throughput are associated with both new and old types of problems in the resulting sequence data, and how these limit the potential applications and pose challenges for data analysis.

Sanger capillary sequencing

Current Sanger capillary sequencing systems, like the widely used Applied Biosystems 3xxx series or the GE Healthcare MegaBACE instrument, are still based on the same general scheme applied in 1977 for the φX174 genome [1, 3]. First, millions of copies of the sequence to be determined are purified or amplified, depending on the source of the sequence. Reverse strand synthesis is performed on these copies using a known priming sequence upstream of the sequence to be determined and a mixture of deoxy-nucleotides (dNTPs, the standard building blocks of DNA) and dideoxy-nucleotides (ddNTPs, modified nucleotides missing a hydroxyl group at the third carbon atom of the sugar). The dNTP/ddNTP mixture causes random, non-reversible termination of the extension reaction, creating from the different copies molecules extended to different lengths. Following denaturation and clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted by their molecular weight (corresponding to the point of termination) and the label attached to the terminating ddNTPs is read out sequentially in the order created by the sorting step. A schematic representation of this process is available in Fig. 1.

Sorting by molecular weight was originally performed using gel electrophoresis but is nowadays carried out by capillary electrophoresis [7, 25]. Originally, radioactive or optical labels were applied in four different terminator reactions (each sorted and read out separately), but today four different fluorophores, one per nucleotide (A, C, G, and T), are used in a single reaction [6]. Additionally, the advent of more sensitive detection systems and several rounds of primer extensions (equivalent to a linear amplification) permit smaller amounts of starting DNA to be used for modern sequencing reactions. Unfortunately, there is still little automation for creation of the high copy input DNA with known priming sites. Typically this is done by cloning, i.e., introducing the target sequence into a known vector sequence using restriction and ligation procedures and using a bacterial strain to amplify the target sequence in vivo, thereby exploiting the low amplification error due to inherent proof-reading and repair mechanisms. However, this process is very tedious and is sometimes hampered by difficulties such as cloning specific sequences due to their base composition, length, and interactions with the bacterial host system. Although not yet widely used, integrated microfluidic devices have been developed which aim to automate the DNA extraction, in vitro amplification, and sequencing on the same chip [26–29].

Using current technology, it is technically possible for up to 384 sequences [29, 30] of between 600 and 1,000 nt in length [23, 31] to be sequenced in parallel. However, these 384-capillary systems are rare. The more standard 96-capillary instruments yield a maximum of approximately 6 Mb of DNA sequence per day, with costs for consumables amounting to about $500 per 1 Mb.
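To make the throughput and cost figures for a 96-capillary instrument concrete, the following minimal sketch turns them into instrument-days and consumable cost for a given amount of sequence. The function name, the `coverage` parameter, and the 3,000 Mb example target (roughly a human-sized genome) are illustrative assumptions and do not come from the text; only the 6 Mb/day and $500/Mb defaults are taken from it.

```python
def sanger_run_estimate(target_mb, coverage=1.0, mb_per_day=6.0, usd_per_mb=500.0):
    """Estimate (instrument_days, consumable_cost_usd) for one 96-capillary machine.

    mb_per_day and usd_per_mb default to the figures quoted in the text;
    target_mb and coverage are hypothetical inputs chosen by the caller.
    """
    total_mb = target_mb * coverage
    return total_mb / mb_per_day, total_mb * usd_per_mb


# Assumed example: 3,000 Mb of raw sequence (about 1x of a human-sized genome).
days, cost = sanger_run_estimate(target_mb=3000)
print(f"{days:.0f} instrument-days, ${cost:,.0f} in consumables")
```

The estimate ignores library preparation, failed reads, and labour, so it should be read as a lower bound under the stated assumptions.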

Figure 1. Schematic representation of the Sanger sequencing process. Input DNA is fragmented and cloned into bacterial vectors for in vivo amplification. Reverse strand synthesis is performed on the obtained copies starting from a known priming sequence and using a mixture of deoxy-nucleotides (dNTPs) and dideoxy-nucleotides (ddNTPs). The dNTP/ddNTP mixture randomly causes the extension to be non-reversibly terminated, creating differently extended molecules. Subsequently, after denaturation, clean up of free nucleotides, primers, and the enzyme, the resulting molecules are sorted using capillary electrophoresis by their molecular weight (corresponding to the point of termination) and the fluorescent label attached to the terminating ddNTPs is read out sequentially.
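The chain-termination principle can be illustrated with a small simulation. This is a simplified sketch, not a model of a real instrument: the template is given in the order in which the polymerase copies it, every position terminates with the same assumed probability `ddntp_fraction`, and electrophoresis is reduced to sorting fragments by length. All names and parameter values are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_read(template, copies=20000, ddntp_fraction=0.2, seed=1):
    """Simulate chain termination on many copies and read the labels by length."""
    rng = random.Random(seed)
    label_by_length = {}  # fragment length -> fluorescent label of the terminating ddNTP
    for _ in range(copies):
        for position, base in enumerate(template, start=1):
            if rng.random() < ddntp_fraction:      # a labelled ddNTP was incorporated
                label_by_length[position] = COMPLEMENT[base]
                break                              # termination is non-reversible
        # Copies that were never terminated carry no label and are ignored.
    # "Electrophoresis": shortest fragments arrive first, so read labels in length order.
    return "".join(label_by_length[n] for n in sorted(label_by_length))


print(sanger_read("TACGGATCCATG"))
```

With enough copies every fragment length is represented, and the labels read in order of length spell out the synthesized (complementary) strand.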


The sequencing error observed for Sanger sequencing is mainly due to errors in the amplification step (a low rate when done in vivo), natural variance, and contamination in the sample used, as well as polymerase slippage at low complexity sequences like simple repeats (short variable number tandem repeats) and homopolymers (stretches of the same nucleotide). Further, lower intensities and missing termination variants tend to lead to sequencing errors accumulating toward the end of long sequences. In combination with reduced separation by the electrophoresis, base miscalls [32] and deletions increase with read length. However, the average error rate (the average over all bases of a sequence) after sequence end trimming is typically very low, with an error every 10,000–100,000 nt [33].

Roche/454 GS FLX Titanium sequencer

The 454 sequencing platform was the first of the new high-throughput sequencing platforms on the market (released in October 2005). It is based on the pyrosequencing approach developed by Pål Nyrén and Mostafa Ronaghi at the Royal Institute of Technology, Stockholm in 1996 [34]. In contrast to the Sanger technology, pyrosequencing is based on iteratively complementing single strands and simultaneously reading out the signal emitted from the nucleotide being incorporated (also called sequencing by synthesis, sequencing during extension). Electrophoresis is therefore no longer required to generate an ordered read out of the nucleotides, as the read out is now done simultaneously with the sequence extension.

In the pyrosequencing process (Fig. 2), one nucleotide at a time is washed over several copies of the sequence to be determined, causing polymerases to incorporate the nucleotide if it is complementary to the template strand. The incorporation stops if the longest possible stretch of complementary nucleotides has been synthesized by the polymerase. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP sulfurylase. The ATP drives the light reaction of luciferases present and the emitted light signal is measured. To prevent the dATP provided for the sequencing reaction from being used directly in the light reaction, deoxy-adenosine-5'-(α-thio)-triphosphate (dATPαS), which is not a substrate of the luciferase, is used for the base incorporation reaction. Standard deoxyribose nucleotides are used for all other nucleotides. After capturing the light intensity, the remaining unincorporated nucleotides are washed away and the next nucleotide is provided.

In 2005, pyrosequencing technology was parallelized on a picotiter plate by 454 Life Sciences (later bought by Roche Diagnostics) to allow high-throughput sequencing [16]. The sequencing plate has about two million wells, each of them able to accommodate exactly one 28-μm diameter bead covered with

Figure 2. The pyrosequencing process. One of four nucleotides is washed sequentially over copies of the sequence to be determined, causing polymerases to incorporate complementary nucleotides. The incorporation stops if the longest possible stretch of the available nucleotide has been synthesized. In the process of incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP sulfurylase. The ATP drives the light reaction of luciferases present and a light signal proportional (within limits) to the number of nucleotide incorporations can be measured.
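The flow-by-flow logic of pyrosequencing can be sketched as follows. This is an idealized, noise-free illustration: the light signal is modelled as exactly the number of incorporated bases, the template is given in the order in which it is copied, and the flow order "TACG" is an assumption made for the example. The function names are hypothetical.

```python
from itertools import cycle

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flowgram(template, flow_order="TACG"):
    """Return [(nucleotide, signal), ...] for an ideal pyrosequencing run.

    The signal of a flow equals the length of the homopolymer stretch that
    could be synthesized with the nucleotide offered in that flow.
    """
    signals = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos == len(template):          # template fully copied, stop flowing
            break
        run = 0
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            run += 1                      # one pyrophosphate (one unit of light) per base
            pos += 1
        signals.append((nucleotide, run))
    return signals


def call_bases(signals):
    """Translate ideal flow signals back into the synthesized sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


flows = flowgram("AACGTTT")
print([count for _, count in flows])
print(call_bases(flows))
```

In real data the signal is only approximately proportional to the number of incorporations, which is why long homopolymers are hard to call; the sketch deliberately leaves that noise out.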


single-stranded copies of the sequence to be determined. The beads are incubated with a polymerase and single-strand binding proteins and, together with smaller beads carrying the ATP sulfurylases and luciferases, gravitationally deposited in the wells. Free nucleotides are then washed over the flow cell and the light emitted during the incorporation is captured for all wells in parallel using a high-resolution charge-coupled device (CCD) camera, exploiting the light-transporting features of the plate used.

One of the main prerequisites for applying this array-based pyrosequencing approach is covering individual beads with multiple copies of the same molecule. This is done by first creating sequencing libraries in which every individual molecule gets two different adapter sequences, one at the 5' end and one at the 3' end of the molecule. In the case of the 454/Roche sequencing library preparation [16], this is done by sequential ligation of two pre-synthesized oligos. One of the adapters added is complementary to oligonucleotides on the sequencing beads and thus allows molecules to be bound to the beads by hybridization. Low molecule-to-bead ratios and amplification from the hybridized double-stranded sequence on the beads (kept separate using emulsion PCR) make it possible to grow beads with thousands of copies of a single starting molecule. Using the second adapter, beads covered with molecules can be separated from empty beads (using special capture beads with oligonucleotides complementary to the second adapter) and are then used in the sequencing reaction as described above.

The average substitution (excluding insertion/deletion, InDel) error rate is in the range of 10⁻³–10⁻⁴ [16, 35], which is higher than the rates observed for Sanger sequencing, but is the lowest average substitution error rate of the new sequencing technologies discussed here. As mentioned earlier for Sanger sequencing, in vitro amplifications performed for the sequencing preparation cause a higher background error rate, i.e., the error introduced into the sample before it enters the sequencer. In addition, in bead preparation (i.e., emulsion PCR) a fraction of the beads end up carrying copies of multiple different sequences. These "mixed beads" will participate in a high number of incorporations per flow cycle, resulting in sequencing reads that do not reflect real molecules. Most of these reads are automatically filtered during the software post-processing of the data. The filtering of mixed beads may, however, cause a depletion of real sequences with a high fraction of incorporations per flow cycle.

A large fraction of the errors observed for this instrument are small InDels, mostly arising from inaccurate calling of homopolymer length, and single base-pair deletions or insertions caused by signal-to-noise thresholding issues [35]. Most of these problems can be resolved by higher coverage. For long (>10 nt) homopolymers, however, there is often a consistent length miscall that is not resolvable by coverage [35–37]. Strong light signals in one well of the picotiter plate may also result in insertions in sequences in neighboring wells. If the neighboring well is empty, this can generate so-called ghost wells, i.e., wells for which a signal is recorded even though they contain no sequence template; hence, the intensities measured are completely caused by bleed-over signal from the neighboring wells. Computational post-processing may correct for these artifacts [38]. As for Sanger sequencing, the error rate increases with the position in the sequence. In the case of 454 sequencing, this is caused by a reduction in enzyme efficiency or loss of enzymes (resulting in a reduction of the signal intensities), some molecules no longer being elongated, and by an increasing phasing effect. Phasing is observed when a population of DNA molecules amplified from the same starting molecule (ensemble) is sequenced, and describes the process whereby not all molecules in the ensemble are extended in every cycle. This causes the molecules in the ensemble to lose synchrony/phase, and results in an echo of the preceding cycles being added to the signal as noise.

The current 454/Roche GS FLX Titanium platform makes it possible to sequence about 1.5 million such beads in a single experiment and to determine sequences of length between 300 and 500 nt. The length of the reads is determined by the number of flow cycles (the number of times all four nucleotides are washed over the plate) as well as by the base composition and the order of the bases in the sequence to be determined. Currently, 454/Roche limits this number to 200 flow cycles, resulting in an expected average read length of about 400 nt. This is largely due to limitations imposed by the efficiency of polymerases and luciferases, which drops over the sequencing run, resulting in decreased base qualities. Currently the platform allows about 750 Mb of DNA sequence to be created per day with costs of about $20/Mb.

Illumina Genome Analyzer II/IIx

The reversible terminator technology used by the Illumina Genome Analyzer (GA) employs a sequencing-by-synthesis concept that is similar to that used in Sanger sequencing, i.e. the incorporation reaction is stopped after each base, the label of the base incorporated is read out with fluorescent dyes, and the sequencing reaction is then continued with the incorporation of the next base [13, 39] (Fig. 3).

Like 454/Roche, the Illumina sequencing protocol requires that the sequences to be determined are converted into a special sequencing library, which allows them to be amplified and immobilized for sequencing [13, 40]. For this purpose two different adapters are added to the 5' and 3' ends of all molecules using ligation of so-called forked adapters.1 The library is then amplified using longer primer sequences, which extend and further diversify the adapters to create the final sequence needed in subsequent steps.

1 Hybrids of partially complementary oligonucleotides creating one double-stranded end with a T overhang, with a single-stranded and a different sequence at the other end.

This double-stranded library is melted using sodium hydroxide to obtain single-stranded DNA molecules, which are then pumped at a very low concentration through the channels of a flow cell. This flow cell has on its surface two populations of immobilized oligonucleotides complementary to the two different single-stranded adapter ends of the sequencing library. These oligonucleotides hybridize to the single-


Figure 3. Reversible terminator chemistry applied by the Illumina GA. Sequencing primers are annealed to the adapters of the sequences to be determined. Polymerases are used to extend the sequencing primers by incorporation of fluorescently labeled and terminated nucleotides. The incorporation stops immediately after the first nucleotide due to the terminators. The polymerases and free nucleotides are washed away and the label of the bases incorporated for each sequence is read with four images taken through different filters (T nucleotide filter is indicated in the figure) and using two different lasers (red: A, C and green: G, T) to illuminate fluorophores. Subsequently, the fluorophores and terminators are removed and the sequencing continued with the incorporation of the next base.

stranded library molecules. By reverse strand synthesis starting from the hybridized (double-stranded) part, the new strand being created is covalently bound to the flow cell. If this new strand bends over and attaches to another oligonucleotide complementary to the second adapter sequence on the free end of the strand, it can be used to synthesize a second covalently bound reverse strand. This process of bending and reverse strand synthesis, called bridge amplification, is repeated several times and creates clusters of several 1,000 copies of the original sequence in very close proximity to each other on the flow cell [13, 40].

These randomly distributed clusters contain molecules that represent the forward as well as reverse strands of the original sequences. Before determining the sequence, one of the strands has to be removed to prevent it from hindering the extension reaction sterically or by complementary base pairing. Strands are selectively cleaved at base modifications of oligonucleotides on the flow cell. Following strand removal, each cluster on the flow cell consists of single-stranded, identically oriented copies of the same sequence, which can be sequenced by hybridizing the sequencing primer onto the adapter sequences and starting the reversible terminator chemistry.

"Solexa sequencing", as it was introduced in early 2007, initially allowed for the simultaneous sequencing of several million very short sequences (at most 26 nt) in a single experiment. In recent years there have been several technical, chemical, and software updates. The product, which is now called the Illumina Genome Analyzer, has increased flow cell cluster densities (more than 200 million clusters per run), a wider range of the flow cell is imaged, and sequence reads of up to 100 nt can be generated. A technical update also enabled the sequencing of the reverse strand of each molecule. This is achieved by chemical melting and washing away the synthesized sequence, repeating a few bridge amplification cycles for reverse strand synthesis, and then selectively removing the starting strand (again using base modifications of the flow cell oligonucleotide populations), before annealing another sequencing primer for the second read. Using this "paired-end sequencing" approach, approximately twice the amount of data can be generated.

The Illumina library and flow cell preparation includes several in vitro amplification steps, which cause a high background error rate and contribute to the average error rate of about 10⁻²–10⁻³ [41, 42]. Further, the flow cell preparation creates a fraction of ordinary-looking clusters that are initiated from more than one individual sequence. This results in mixed signals and mostly low quality sequences for these clusters. Similar to the 454 ghost wells, the Illumina image analysis may identify chemistry crystals, dust, and lint particles as clusters and call sequences from these. In such cases the resulting sequences typically appear to be of low sequence complexity.

As is the case for the other platforms, the error rate increases with increasing position in the determined sequence. This is mainly due to phasing, which increases the background noise as sequencing progresses. While the ensemble sequencing process for pyrosequencing creates uni-directional phasing, reversible terminator sequencing creates bi-directional phasing [41, 43] as some incorporated nucleotides may also fail to be correctly terminated, allowing the extension of the sequence by another nucleotide in the same cycle.
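The effect of phasing on an ensemble can be illustrated with a small deterministic model. This is a sketch under simplifying assumptions: every molecule in a cluster independently falls behind by one base with probability `p_lag` per cycle or runs ahead by one base with probability `p_lead`, and both rates are constant. The function name and the example rates are hypothetical and are not measured values for any platform.

```python
def in_phase_fraction(cycles, p_lag=0.01, p_lead=0.005):
    """Fraction of an ensemble that is still in phase after each cycle.

    offset = bases incorporated minus cycles elapsed; offset 0 means "in phase".
    p_lead=0 gives uni-directional phasing, p_lead>0 gives bi-directional phasing.
    """
    distribution = {0: 1.0}          # offset -> fraction of molecules
    history = []
    for _ in range(cycles):
        updated = {}
        for offset, fraction in distribution.items():
            for step, prob in ((-1, p_lag), (0, 1.0 - p_lag - p_lead), (1, p_lead)):
                updated[offset + step] = updated.get(offset + step, 0.0) + fraction * prob
        distribution = updated
        history.append(distribution.get(0, 0.0))
    return history


bidirectional = in_phase_fraction(100)
unidirectional = in_phase_fraction(100, p_lead=0.0)
print(f"in phase after 100 cycles: {bidirectional[-1]:.2f} (bi), {unidirectional[-1]:.2f} (uni)")
```

Molecules that are out of phase contribute the signal of a neighboring position, so the shrinking in-phase fraction corresponds to the growing background noise described above.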


With increasing cycle numbers, the intensities extracted from the clusters decline [41, 43, 44]. This is due to fewer molecules participating in the extension reaction as a result of non-reversible termination, or due to dimming effects of the sequencing fluorophores. In early versions of the chemistry, one of the fluorophores could become stuck to the clusters, creating another source of increased background noise [41]. The simultaneous identification of four different nucleotides is also an issue. The GA uses four fluorescent dyes to distinguish the four nucleotides A, C, G, and T. Of these, two pairs (A/C and G/T), each excited using the same laser, are similar in their emission spectra and show only limited separation using optical filters. Therefore, the highest substitution errors observed are between A/C and G/T [41, 42].

Even though the Illumina GA reads show a higher average error rate, a wider average error range, and are considerably shorter than 454/Roche reads, the GA instrument determines more than 5,000 Mb/day with a price of about $0.50/Mb. This is more than six times higher daily throughput and for a considerably lower price per megabase.

SOLiD

The prototype of what was further developed and later sold by Life Technologies/Applied Biosystems (ABI) as the SOLiD sequencing platform was developed by Harvard Medical School and the Howard Hughes Medical Institute and published in 2005 [17]. With its commercial release in late 2007, SOLiD was only the third new high-throughput system entering a highly competitive market with all three vendors selling their instruments for around half a million dollars. The Church lab at Harvard Medical School continued the development of the system and now offers a cheaper (<$200,000) open source version of the system (called Polonator) in collaboration with Dover System. In the third quarter of 2008, a biotechnology company from Mountain View, California, named Complete Genomics started offering a human genome sequencing service. Their technology is also based on the Church lab sequencing-by-ligation concept, but combines it with a new strategy of sequencing library construction and sequence immobilization using rolling circle amplification [45]. Here, we focus on the commercial SOLiD system as this is the most widespread application of this concept.

The principle behind sequencing-by-ligation is very different from the approaches discussed thus far. The sequence extension reaction is not carried out by polymerases but rather by ligases [17] (see Fig. 4 for a schematic representation of the SOLiD 2/3 platform). In the sequencing-by-ligation process, a sequencing primer is hybridized to single-stranded copies of the library molecules to be sequenced. A mixture of 8-mer probes carrying four distinct fluorescent labels compete for ligation to the sequencing primer. The fluorophore encoding, which is based on the two 3'-most nucleotides of the probe, is read. Three bases including the dye are cleaved from the 5' end of the probe, leaving a free 5' phosphate on the extended (by five nucleotides) primer, which is then available for further ligation. After multiple ligations (typically up to 10 cycles), the synthesized strands are melted and the ligation product is washed away before a new sequencing primer (shifted by one nucleotide) is annealed. Starting from the new sequencing primer the ligation reaction is repeated. The same process is followed for three other primers, facilitating the read out of the dinucleotide encoding for each start position in the sequence. Using specific fluorescent label encoding, the dye read outs (i.e. colors) can be converted to a sequence [46]. This conversion from color space to sequence requires a known first base, which is the last base of the used library adapter sequence. Given a reference sequence, this encoding system allows detection of machine errors and the application of an error correction to reduce the average error rate. In the absence of a reference sequence, however, color conversion fails with an error in the dye read out and causes the sequence downstream of the error to be incorrect.

For parallelization, the sequencing process uses beads covered with multiple copies of the sequence to be determined. These beads are created in a similar fashion to that described earlier for the 454/Roche platform. In contrast to the 454/Roche technology, the SOLiD system does not use a picotiter plate for fixation of the beads in the sequencing process; instead the 3' ends of the sequences on the beads are modified in a way that allows them to be covalently bound onto a glass slide. As for the Illumina GA system, this creates a random dispersion of the beads in the sequencing chamber and allows for higher loading densities. However, random dispersion complicates the identification of bead positions from images, and results in the possibility that chemical crystals, dust, and lint particles can be misidentified as clusters. Further, dispersal of the beads results in a wide range of inter-bead distances, which then have different susceptibility to be influenced by signals from neighboring beads.

Types and causes of sequence errors are diverse: first, the in vitro amplification steps cause a higher background error rate. Secondly, beads carrying a mixture of sequences and beads in close proximity to one another create false reads and low quality bases. Further, signal decline, a small regular phasing effect, and incomplete dye removal result in increasing error as the ligation cycles progress [47]. Phasing, as described earlier, is a minor issue on this platform as sequences not extended in the last cycle are non-reversibly terminated using phosphatases. Since hybridization is a stochastic process, this causes a considerable reduction in the number of molecules participating in subsequent ligation reactions, and therefore substantial signal decline. On the other hand, given the efficiency of phosphatases the remaining phasing effect can be considered very low. However, incomplete cleavage of the dyes may allow cleavage in the next ligation reaction, which then allows for the extension in the next but one cycle. This causes a different phasing effect and additional noise from the previous cycle's dyes in the dye identification process.

The SOLiD system currently allows sequencing of more than 300 million beads in parallel, with a typical read length of between 25 and 75 nt. At the time of writing, the ABI SOLiD system is therefore comparable to the Illumina GA


Figure 4. Applied Biosystem's SOLiD sequencing by ligation. A sequencing primer is annealed to single-stranded copies of sequences to be determined. Octamer probes are hybridized, ligated to the sequencing primer, and a fluorescent dye at the 5' end of the ligated 8-mer probes, encoding the two 3'-most nucleotides of the probe, is read out. Non-extended primers are dephosphorylated. Three nucleotides of the probe including the dye are cleaved, creating a free 5' phosphate for further ligations. After multiple ligations, the synthesized strands are melted and the ligation product is washed away before a new, by-one-nucleotide-shifted sequencing primer is annealed. Starting from the new sequencing primer the ligation reaction is repeated. The same is done for three other primers, allowing the read out of the dinucleotide label for every position in the sequence.
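The dinucleotide (color space) read out and its conversion to bases can be sketched in a few lines. The text does not spell out the color table; the sketch assumes the commonly described two-base encoding in which each color is the XOR of two-bit base codes (A=0, C=1, G=2, T=3). The function names and the example sequence are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed two-bit base codes
BASE = "ACGT"


def encode(sequence, first_base):
    """Convert bases to colors; first_base is the known last base of the adapter."""
    colors = []
    previous = first_base
    for base in sequence:
        colors.append(CODE[previous] ^ CODE[base])   # one color per adjacent base pair
        previous = base
    return colors


def decode(colors, first_base):
    """Convert colors back to bases, walking forward from the known first base."""
    bases = []
    previous = first_base
    for color in colors:
        previous = BASE[CODE[previous] ^ color]
        bases.append(previous)
    return "".join(bases)


colors = encode("ACGGT", first_base="T")
print(colors, decode(colors, first_base="T"))

# A single wrong color corrupts every base downstream of it when no reference is used.
corrupted = colors.copy()
corrupted[1] = 2
print(decode(corrupted, first_base="T"))
```

Because each base is decoded from the previous one, a single color error without a reference shifts all following bases, which is the failure mode described in the main text; a true single-base variant instead changes two adjacent colors.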

system in terms of throughput and price per million nucleotides (5,000 Mb/day, $0.50/Mb). Average error rates are, however, dependent on the availability of a reference genome for error correction (10⁻³–10⁻⁴ vs. 10⁻²–10⁻³). In the absence of a reference genome, assembly and consensus calling may be performed based on dye read outs (so-called color space sequences) to reduce the errors before conversion to the nucleotide sequence. If no reference genome is available for error correction, and no assembly and consensus calling is performed, then the average error rate is higher than for the Illumina GA.

Helicos HeliScope

Helicos is the first company to sell a sequencer able to sequence individual molecules instead of molecule ensembles created by an amplification process. Single molecule sequencing has the advantage that it is not affected by biases or errors introduced in a library preparation or amplification step, and may facilitate sequencing of minimal amounts of input DNA. Using methods able to detect non-standard nucleotides, it could also allow for the identification of DNA modifications, commonly lost in the in vitro amplification process.

The HeliScope, as the Helicos sequencer is called, was first sold in March 2008, and by the end of the first quarter of 2009 only four machines had been installed worldwide. This might be surprising given the advantages of single molecule sequencing, but probably reflects the specific limitations of this platform, the price (about one million dollars), and a relatively small market that has already invested extensively in new sequencing technologies.

The technology applied (Fig. 5) could be termed asynchronous virtual terminator chemistry [15]. Input DNA is fragmented and melted before a

Bioessays 32: 524–536, © 2010 WILEY Periodicals, Inc.

Figure 5. Asynchronous virtual terminator chemistry performed by the HeliScope. Input DNA is fragmented, melted, and polyadenylated. A fluorescently labeled adenine is added in the last step. This single-stranded DNA is washed over a flow cell with poly-T oligonucleotides allowing hybridization. The bound coordinates on the flow cell are determined using the fluorescently labeled adenines. Having the coordinates identified, the fluorescent label of the 3′ adenines is removed. Polymerases are washed through with one type of fluorescently labeled nucleotides (A, C, G, and T) at a time, and the polymerases extend the reverse strand of the sequences starting from the poly-T oligonucleotides. The nucleotide incorporation of the polymerases is slowed down by the fluorescent labeling and allows for at most one incorporation before the polymerase is washed away. The flow cell is then imaged, the fluorescent dyes removed, and the reaction continued with another nucleotide.
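Why this chemistry yields reads of different length can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the flow order "CTAG", the incorporation probability, and the function name `simulate_reads` are hypothetical, and templates are written in the order in which the polymerase reads them.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_reads(templates, flows, incorporation_prob=0.8,
                   flow_order="CTAG", seed=0):
    """Simulate one-nucleotide-per-flow extension of single molecules.

    Each flow offers one nucleotide type; a molecule is extended by at
    most one base per flow (virtual termination), and only with the
    given probability (hypothetical value), so molecules fall out of step.
    """
    rng = random.Random(seed)
    reads = ["" for _ in templates]
    for flow in range(flows):
        nucleotide = flow_order[flow % len(flow_order)]
        for i, template in enumerate(templates):
            position = len(reads[i])
            if position >= len(template):
                continue  # template fully copied
            matches = COMPLEMENT[template[position]] == nucleotide
            if matches and rng.random() < incorporation_prob:
                reads[i] += nucleotide
    return reads


# Perfect incorporation: the homopolymer "GG" still needs two C flows,
# so the second molecule lags behind the first after four flows.
ideal = simulate_reads(["GATC", "GGAT"], flows=4, incorporation_prob=1.0)

# Imperfect incorporation: identical templates give reads of varying length.
noisy = simulate_reads(["GATCGATC"] * 5, flows=12)
```

Every simulated read is a prefix of the complementary strand, but how far each molecule gets after a fixed number of flows depends on its sequence and on chance, which mirrors the asynchronous behavior described in the caption.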

poly-A-tail is synthesized onto each single-stranded molecule using a polyadenylate polymerase. In the last step of polyadenylation, a fluorescently labeled adenine is added. The library is washed over a flow cell where the poly-A tails bind to poly-T oligonucleotides. The bound coordinates on the flow cell are determined using a fluorescence-based read out of the flow cell. Having these coordinates identified, the fluorescent label of the 3′ adenine is removed and the sequencing reaction started. Polymerases are washed through the flow cell with one type of fluorescently labeled nucleotide (A, C, G, or T) at a time and the polymerases extend the reverse strand of the sequences starting from the poly-T oligonucleotides. The nucleotide incorporation of the polymerases is slowed down by the fluorescent labeling and allows for at most one incorporation before the polymerase is washed away together with the non-incorporated nucleotides (termed virtual termination [48, 49]). The flow cell is then imaged again, the fluorescent dyes are removed, and the reaction continued with another nucleotide. By this process not every molecule is extended in every cycle, which is why it is an asynchronous sequencing process resulting in sequences of different length (as is the case for the 454/Roche platform).

Since single molecules are sequenced, the signals being measured are weak, and there is no possibility that misincorporation errors can be corrected by an ensemble effect. Due to the fact that molecules are attached to the flow cell by hybridization only, there is a chance that template molecules can be lost in the wash steps. In addition, molecules may be irreversibly terminated by the incorporation of incorrectly synthesized nucleotides. Overall, reads are between 24 and 70 nt long (average 32 nt) [50] and thus shorter than for the other platforms. Due to the higher number of sequences determined in parallel, the total throughput per day (4,150 Mb/day with a cost of 0.33$/Mb [50]) is in the same range as for the GA and SOLiD systems. The average error rate, which is in the range of a few percent, is slightly higher than for all other instruments and biased toward InDels rather than substitutions.

Applications and general considerations

All current high-throughput technologies have an average error rate that


is considerably higher than the typical 1/10,000 to 1/100,000 observed for high-quality Sanger sequences. Further, the GS FLX Titanium, GA, SOLiD, and HeliScope platforms each have very specific biases and limitations, making it necessary to choose a platform appropriate for a specific project or application (for a summary see Table 1). A combination of technologies [51–54] and experimental protocols [55–57] may also be appropriate, and even complementary, for specific projects.

High-quality Sanger sequencing is now commonly used to generate low-coverage sequencing of individual positions and regions (e.g., for diagnostics) or the sequencing of virus- and phage-sized whole genomes. As the Sanger sequence length is longer than most abundant short repeat classes, it allows the unambiguous assembly of most genomic regions – something that is generally not possible using the shorter read platforms. However, the technology is expensive and too slow for sequencing large samples, extended genomic regions, or the many molecules required for quantitative applications [e.g., gene expression quantification; chromatin immunoprecipitation sequencing (ChIP-Seq); and methylation-dependent immunoprecipitation sequencing (MeDip-Seq)]. For quantitative applications the HeliScope provides the highest throughput in terms of sequence number and has the advantage of not requiring a multistep library preparation protocol. On the other hand, the HeliScope provides the lowest resolution in mapping accuracy for complex genomes due to its short read length and error profile. The GA or SOLiD platforms may thus provide equivalent results for quantitative applications, while providing fewer but longer reads and requiring a more elaborate library preparation.

While it has not yet been fully analyzed, it is possible (and even likely) that library preparation protocols could bias the sequence representation in a sample [42, 58, 59], making the replacement of this step an important goal. Further, multistep library preparation protocols require higher amounts of input material, limiting their general application. However, protocols for library construction from limited sample amounts are available or being developed for each of the platforms, and publications demonstrate that, while vendor protocols indicate the need for higher sample quantities (microgram range), many users are proceeding successfully with low input DNA amounts (nanogram to picogram range), as, for example, from ancient DNA specimens [60–62].

Like Sanger sequencing, the GS FLX Titanium provides a read length spanning many of the short repeat sequences – an important feature for accurate sequence mapping and assembly of genomes [63]. Despite the InDel errors, this technology has very low rates of misidentifying individual bases, making it perfectly suited for the identification of single nucleotide polymorphisms (SNPs). Also geared to the identification of SNPs, at least for samples with an existing reference genome, is the SOLiD instrument with its dinucleotide encoding scheme [46]. Considerably higher coverage is needed to perform SNP calling with similar accuracy using the Illumina GA [64]. Neither the Illumina GA nor the ABI SOLiD sequencing systems are prone to generate high rates of small InDels, making them well suited for studying InDel variation.

As mentioned earlier, the drawback of short reads (below about 75 nt) obtained from Helicos, SOLiD, or GA instruments is in genome assembly and mapping applications, where the placement of repeated or very similar sequences cannot be resolved unambiguously. The correct placement is further complicated by high error rates introducing a requirement for a minimum sequence distance of an unambiguous placement. Paired-end or mate-pair protocols help to overcome some of these limitations of short reads [65] by providing information about relative location and orientation of a pair of reads. Currently a paired-end protocol is only commonly applied on the GA, while mate-pair protocols are available for SOLiD, GS FLX Titanium, and GA. In paired-end sequencing the actual ends of rather short DNA molecules (<1 kb) are determined, while mate-pair sequencing requires the preparation of special libraries. In these protocols, the ends of longer, size-selected molecules (e.g., 8, 12, or 20 kb) are connected with an internal adapter sequence in a circularization reaction. The circular molecule is then processed using restriction enzymes or fragmentation before outer library adapters are added around the two combined molecule ends. The internal adapter can then be used as a second priming site for an additional sequencing reaction on the same immobilized molecules. Thus, mate-pair sequencing provides distance information useful for assembly, but does not allow the merging of the two overlapping end reads, since by design the molecules will not overlap in sequencing. However, merging of two overlapping forward and reverse paired end reads from short insert libraries allows the reconstruction of a complete consecutive molecule sequence, longer than the individual read length, and with reduced average error rates in the overlapping sequence part [60, 66].

Due to the large amounts of sequences created, there is interest in sequencing targeted regions (e.g. a genomic locus, from sequence capture experiments [67–69]) in multiple individuals/samples instead of sequencing one sample in excessive depth. All technologies therefore provide a separation of their sequencing plate into defined regions or channels. However, at most, 16 such regions/channels are available (GS FLX Titanium and HeliScope plates), which may not be sufficient for some applications. Using different library construction protocols, some platforms allow addition of sample specific barcode (sometimes called "index") sequences to the library molecules. These molecules can then be sequenced in the same region/channel, and later separated (computationally) based on their barcode sequence [70–73]. This facilitates highly parallel sequencing of a large number of samples beyond that possible using the physical lane/channel separation. Currently such protocols (mostly non-vendor protocols) are available for the GS FLX Titanium, GA, and SOLiD instruments.

Although sequencing prices per gigabase have fallen considerably in recent years, making projects like the 1000 Human Genome Variation Project, 1001 Arabidopsis thaliana Genomes Project, the Mammalian Genome Project, or the International Cancer Genome Consortium possible, high-throughput sequencing still has high acquisition, running and maintenance costs, which


Table 1. Comparison of high-throughput sequencing technologies available

Throughput Length Quality Costs Applications Main sources of errors

The table summarizes throughput, length, quality, and costs for the current versions of the mentioned technologies. These approximate numbers are constantly improving and based on figures available in January 2010. Costs do not include instrument acquisition and maintenance; further they may be affected by discounts and scale effects for multiple instruments. Where numbers are very similar, colors ranging from red (low performance) to green (good performance) indicate a general trend. In the last column, example applications fitting the throughput and error profiles of each of the platforms are given. Typically, this does not mean that the technology is limited to these applications, but that it is currently best suited to such applications. + High sequencing depth/number of runs required.
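The computational separation of barcoded ("indexed") library molecules sequenced together in one region or channel can be sketched as follows. This is a minimal illustration, not one of the published protocols: the barcode sequences, sample names, mismatch tolerance, and the assumption that the barcode sits at the start of each read are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Sort reads into per-sample bins by their leading barcode.

    A read is assigned only if exactly one barcode matches its prefix
    within the mismatch tolerance; the barcode is trimmed on assignment.
    """
    bins = {sample: [] for sample in barcodes}
    bins["unassigned"] = []
    for read in reads:
        hits = []
        for sample, barcode in barcodes.items():
            prefix = read[:len(barcode)]
            if (len(prefix) == len(barcode)
                    and hamming(prefix, barcode) <= max_mismatches):
                hits.append(sample)
        if len(hits) == 1:
            sample = hits[0]
            bins[sample].append(read[len(barcodes[sample]):])
        else:
            bins["unassigned"].append(read)
    return bins


# Hypothetical 4-nt barcodes that differ at every position.
barcodes = {"sample_1": "ACGT", "sample_2": "TGCA"}
reads = ["ACGTGGGTTT", "TGCTCCCAAA", "GGGGAAAA"]
bins = demultiplex(reads, barcodes)
```

Tolerating a mismatch is only safe when the barcodes differ from each other at more positions than the tolerance allows for; with the error rates discussed above, barcode sets are therefore usually chosen with this in mind.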

are not included in Table 1. Further, each of these platforms requires a substantial investment in data management and analysis, time, and personnel [74–77]. Smaller research groups may still find prohibitive the costs of the infrastructure needed for storing, handling, and analyzing several tens of gigabytes of pure sequence data and terabytes of several thousand intermediate files generated by these instruments each week. Even for larger, experienced genome centers this aspect remains an ever-increasing challenge for the ongoing use of these platforms.

Upcoming developments

Motivated by the goal of a $1,000-genome set by NIH/NHGRI to enable personalized medicine, the throughput of all systems described is constantly increasing and the numbers given here are rapidly outdated. However, in addition to the improvements of current technologies, including the January 2010 announcement of the Illumina HiSeq 2000 system, which determines sequences of clusters on bottom and top of the flow cell and processes two flow cells in parallel, a new generation of sequencers is already on the horizon. What started with the Helicos system – the sequencing of single molecules without prior library preparation or amplification – will likely become a popular paradigm. Specifically, three other systems have captured media and scientific attention well in advance of their actual availability: Pacific Biosciences’ Single Molecule Real Time (SMRT) sequencing technology [18], Oxford Nanopore’s BASE technology [14] and, recently, IBM’s proposal of silicon-based nanopores [78].

Pacific Biosciences’ SMRT technology performs the sequencing reaction on silicon dioxide chips with a 100 nm metal film containing thousands of tens-of-nanometer diameter holes, so-called zero-mode waveguides (ZMWs) [79]. Each ZMW is used as a nano-visualization chamber, providing a detection volume of 20 zeptoliters (10⁻²¹ l). At this volume, a single molecule can be illuminated while excluding other labeled nucleotides in the background – saving time and sequencing chemistry by omitting wash steps. A single DNA polymerase is fixed to the bottom of the surface within the detection volume, and nucleotides, with different dyes attached to the phosphate chain, are used in concentrations allowing normal enzyme processivity. As the polymerase incorporates complementary nucleotides, the nucleotide is held within the detection volume for tens


of milliseconds, orders of magnitude longer than for unspecific diffusion events. This way the fluorescent dye of the incorporated nucleotide can be identified during normal speed reverse strand synthesis [79]. In pilot experiments, Pacific Biosciences has shown that its technology allows for direct sequencing of a few thousand bases before the polymerase is denatured due to the laser read out of the dyes. The SMRT technology is intended for release in 2010. Even though further development is needed to create a more robust system, the omission of library preparation and amplification as well as the long sequences generated will undoubtedly provide an advantage over the current systems for many applications.

Oxford Nanopore’s BASE technology is unlikely to be released as soon as the SMRT technology. BASE offers the potential to identify individual nucleotide modifications (e.g. 5-methylcytosine vs. cytosine) during the sequencing process [14]. The idea behind this technology is the identification of individual nucleotides using a change in the membrane potential as they pass through a modified α-hemolysin membrane pore with a cyclodextrin sensor [14, 80]. However, to apply this technology for sequencing, the pore has to be fused to an exonuclease, which degrades single-stranded DNA sequences and releases individual nucleotides into the pore. In addition, the technology needs to be parallelized in array format, before its release as a high-throughput sequencing platform. While the sensitivity for individual nucleotide modifications seems to be a major advantage, the destructive fashion of the outlined sequencing process might be considered a hindrance for applications with precious samples, and it does not allow a second read cycle for error reduction.

In early October 2009, IBM issued a press release [78] describing a method to slow down the speed of an individual DNA strand passing through a nanopore. For this purpose they developed a multilayer metal/dielectric nanopore device that utilizes the interaction of the DNA backbone charges with a modulated electric field to trap and slowly release an individual DNA molecule. The technology described could theoretically be combined with, for example, the Nanopore technologies developed at Harvard University [81] or the previously described BASE technology where it may overcome the destructive approach followed so far.

Conclusion

Current high-throughput sequencing technologies provide a huge variety of sequencing applications to many researchers and projects. Given the immense diversity, we have not discussed these applications in depth here; other reviews with a stronger focus on specific applications and data analysis are available [24, 82–88]. The discussed technologies make it possible for even single research groups to generate large amounts of sequence data very rapidly and at substantially lower costs than traditional Sanger sequencing. While costs have been reduced to less than 4–0.1% and time has been shortened by a factor of 100–1,000 based on daily throughput, the error profiles and limitations observed for the new platforms differ significantly from Sanger sequencing and between approaches. Further, each of these new sequencing platforms requires substantial additional investments – factors that have often not been sufficiently stressed in research publications describing a specific application. Some vendors have recently started to offer budget versions of their instruments (e.g. Illumina GA IIe or 454/Roche GS Junior) with lower sequencing capacity. However, while the instrument price is lower, the financial investment remains high. Costs per base are generally higher than for the standard instrument, and very similar overall infrastructure is still required. Often the choice of an appropriate sequencing platform is project specific and sometimes combinations can be advantageous. This may open the market further to companies providing sequencing-on-demand services, but will not replace the need for laboratories to invest considerable time and expertise in both the production of libraries and analysis of the vast quantities of data that will be generated.

New technologies on the horizon, SMRT by Pacific Biosciences, BASE by Oxford Nanopore, and other technologies such as that suggested by IBM, demonstrate the major future directions in the field of DNA sequencing: the ability to use individual molecules without any library preparation or amplification, the identification of specific nucleotide modifications, and the ability to generate longer sequence reads. These developments will facilitate future research in many fields, make data analysis easier, and further reduce sequencing costs, hopefully achieving the aim of a $1,000 human genome suggested by NIH/NHGRI to be required for personalized medicine.

Acknowledgments

We thank the members of the Department of Evolutionary Genetics, and particularly members of the sequencing group, for providing sequencing data from multiple platforms, as well as interesting discussions and useful insights. We are also indebted to A. Wilkins and the three anonymous reviewers for critical reading of the manuscript and thoughtful comments. This work was supported by the Max Planck Society.

References

1. Sanger F, Air GM, Barrell BG, et al. 1977. Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265: 687–95.
2. Gilbert W, Maxam A. 1973. The nucleotide sequence of the lac operator. Proc Natl Acad Sci USA 70: 3581–4.
3. Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 74: 5463–7.
4. Sanger F, Coulson AR. 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94: 441–8.
5. Wu R, Kaiser AD. 1968. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. J Mol Biol 35: 523–37.
6. Smith LM, Sanders JZ, Kaiser RJ, et al. 1986. Fluorescence detection in automated DNA sequence analysis. Nature 321: 674–9.
7. Swerdlow H, Gesteland R. 1990. Capillary gel electrophoresis for rapid, high resolution DNA sequencing. Nucleic Acids Res 18: 1415–9.
8. Zagursky RJ, McCormick RM. 1990. DNA sequencing separations in capillary gels on a modified commercial DNA sequencing instrument. Biotechniques 9: 74–9.
9. Huang XC, Quesada MA, Mathies RA. 1992. DNA sequencing using capillary array electrophoresis. Anal Chem 64: 2149–54.
10. Kambara H, Takahashi S. 1993. Multiple-sheathflow capillary array DNA analyser. Nature 361: 565–6.


11. Ueno K, Yeung ES. 1994. Simultaneous monitoring of DNA fragments separated by electrophoresis in a multiplexed array of 100 capillaries. Anal Chem 66: 1424–31.
12. Kim S, Yoo HJ, Hahn JH. 1996. Postelectrophoresis capillary scanning method for DNA sequencing. Anal Chem 68: 936–9.
13. Bentley DR, Balasubramanian S, Swerdlow HP, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–9.
14. Clarke J, Wu HC, Jayasinghe L, et al. 2009. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4: 265–70.
15. Harris TD, Buzby PR, Babcock H, et al. 2008. Single-molecule DNA sequencing of a viral genome. Science 320: 106–9.
16. Margulies M, Egholm M, Altman WE, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–80.
17. Shendure J, Porreca GJ, Reppas NB, et al. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–32.
18. Korlach J, Marks PJ, Cicero RL, et al. 2008. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci USA 105: 1176–81.
19. Ansorge WJ. 2009. Next-generation DNA sequencing techniques. Nat Biotechnol 25: 195–203.
20. Mardis ER. 2008. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387–402.
21. Schuster SC. 2008. Next-generation sequencing transforms today’s biology. Nat Methods 5: 16–8.
22. Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol 26: 1135–45.
23. Shendure JA, Porreca GJ, Church GM. 2008. Overview of DNA sequencing strategies. Curr Protoc Mol Biol Chapter 7: Unit 7.1.
24. Metzker ML. 2010. Sequencing technologies – the next generation. Nat Rev Genet 11: 31–46.
25. George KS, Zhao X, Gallahan D, et al. 1997. Capillary electrophoresis methodology for identification of cancer related gene expression patterns of fluorescent differential display polymerase chain reaction. J Chromatogr B Biomed Sci Appl 695: 93–102.
26. Blazej RG, Kumaresan P, Mathies RA. 2006. Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc Natl Acad Sci USA 103: 7240–5.
27. Mariella R Jr. 2008. Sample preparation: the weak link in microfluidics-based biodetection. Biomed Microdevices 10: 777–84.
28. Roper MG, Easley CJ, Legendre LA, et al. 2007. Infrared temperature control system for a completely noncontact polymerase chain reaction in microfluidic chips. Anal Chem 79: 1294–1300.
29. Emrich CA, Tian H, Medintz IL, et al. 2002. Microfabricated 384-lane capillary array electrophoresis bioanalyzer for ultrahigh-throughput genetic analysis. Anal Chem 74: 5076–83.
30. Shibata K, Itoh M, Aizawa K, et al. 2000. RIKEN integrated sequence analysis (RISA) system – 384-format sequencing pipeline with 384 multicapillary sequencer. Genome Res 10: 1757–71.
31. Hert DG, Fredlake CP, Barron AE. 2008. Advantages and limitations of next-generation sequencing technologies: a comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29: 4618–26.
32. Ewing B, Hillier L, Wendl MC, et al. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175–85.
33. Ewing B, Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8: 186–94.
34. Ronaghi M, Karamohamed S, Pettersson B, et al. 1996. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242: 84–9.
35. Quinlan AR, Stewart DA, Stromberg MP, et al. 2008. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5: 179–81.
36. Wicker T, Schlagenhauf E, Graner A, et al. 2006. 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7: 275.
37. Green RE, Malaspinas AS, Krause J, et al. 2008. A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134: 416–26.
38. Green RE, Krause J, Ptak SE, et al. 2006. Analysis of one million base pairs of Neanderthal DNA. Nature 444: 330–6.
39. Turcatti G, Romieu A, Fedurco M, et al. 2008. A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Res 36: e25.
40. Fedurco M, Romieu A, Williams S, et al. 2006. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res 34: e22.
41. Kircher M, Stenzel U, Kelso J. 2009. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10: R83.
42. Dohm JC, Lottaz C, Borodina T, et al. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36: e105.
43. Erlich Y, Mitra PP, delaBastide M, et al. 2008. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods 5: 679–82.
44. Rougemont J, Amzallag A, Iseli C, et al. 2008. Probabilistic base calling of Solexa sequencing data. BMC Bioinf. 9: 431.
45. Drmanac R, Sparks AB, Callow MJ, et al. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327: 78–81.
46. Applied Biosystems. A Theoretical Understanding of 2 Base Color Codes and Its Application to Annotation, Error Detection, and Error Correction. White Paper SOLiD™ System; 2008.
47. Dimalanta ET, Zhang L, Hendrickson CL, et al. 2009. Increased Read Length on the SOLiD™ Sequencing Platform. Poster SOLiD™ System.
48. Zhu Z, Waggoner AS. 1997. Molecular mechanism controlling the incorporation of fluorescent nucleotides into DNA by PCR. Cytometry 28: 206–11.
49. Bowers J, Mitchell J, Beer E, et al. 2009. Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods 6: 593–5.
50. Pushkarev D, Neff NF, Quake SR. 2009. Single-molecule sequencing of an individual human genome. Nat Biotechnol 27: 847–52.
51. Reinhardt JA, Baltrus DA, Nishimura MT, et al. 2009. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 19: 294–305.
52. Diguistini S, Liao NY, Platt D, et al. 2009. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol 10: R94.
53. Miller JR, Delcher AL, Koren S, et al. 2008. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24: 2818–24.
54. Chen W, Ullmann R, Langnick C, et al. 2009. Breakpoint analysis of balanced chromosome rearrangements by next-generation paired-end sequencing. Eur J Hum Genet DOI: 10.1038/ejhg.2009.21118 [Epub ahead of print].
55. Zimin AV, Delcher AL, Florea L, et al. 2009. A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol 10: R42.
56. Zhou X, Su Z, Sammons RD, et al. 2009. Novel software package for cross-platform transcriptome analysis (CPTRA). BMC Bioinf. 11: S16.
57. Kim JI, Ju YS, Park H, et al. 2009. A highly annotated whole-genome sequence of a Korean individual. Nature 460: 1011–5.
58. Linsen SE, de Wit E, Janssens G, et al. 2009. Limitations and possibilities of small RNA digital gene expression profiling. Nat Methods 6: 474–6.
59. Quail MA, Swerdlow H, Turner DJ. 2009. Improved protocols for the Illumina Genome Analyzer sequencing system. Curr Protoc Hum Genet Chapter 18: Unit 18.2.
60. Briggs AW, Stenzel U, Meyer M, et al. 2009. Removal of deaminated cytosines and detection of in vivo methylation in ancient DNA. Nucleic Acids Res 38(6): e87 [Epub ahead of print].
61. Maricic T, Pääbo S. 2009. Optimization of 454 sequencing library preparation from small amounts of DNA permits sequence determination of both DNA strands. Biotechniques 46: 51–2, 54–7.
62. Rohland N, Hofreiter M. 2007. Comparison and optimization of ancient DNA extraction. Biotechniques 42: 343–52.
63. Wheeler DA, Srinivasan M, Egholm M, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–6.
64. Harismendy O, Ng PC, Strausberg RL, et al. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10: R32.
65. Chaisson MJ, Brinza D, Pevzner PA. 2009. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res 19: 336–46.
66. Krause J, Briggs AW, Kircher M, et al. 2009. A complete mtDNA genome of an early modern human from Kostenki, Russia. Curr Biol 20: 231–6.
67. Gnirke A, Melnikov A, Maguire J, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27: 182–9.


68. Hodges E, Rooks M, Xuan Z, et al. 2009. Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc 4: 960–74.
69. Briggs AW, Good JM, Green RE, et al. 2009. Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325: 318–21.
70. Meyer M, Stenzel U, Hofreiter M. 2008. Parallel tagged sequencing on the 454 platform. Nat Protoc 3: 267–78.
71. Meyer M, Stenzel U, Myles S, et al. 2007. Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 35: e97.
72. Erlich Y, Chang K, Gordon A, et al. 2009. DNA Sudoku – harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res 19: 1243–53.
73. Meyer M, Kircher M. 2010. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc DOI: 10.1101/pdb.prot5448.
74. Pop M, Salzberg SL. 2008. Bioinformatics challenges of new sequencing technology. Trends Genet 24: 142–9.
75. Richter BG, Sexton DP. 2009. Managing and analyzing next-generation sequence data. PLoS Comput Biol 5: e1000369.
76. Quail MA, Kozarewa I, Smith F, et al. 2008. A large genome center’s improvements to the Illumina sequencing system. Nat Methods 5: 1005–10.
77. Batley J, Edwards D. 2009. Genome sequence data: management, storage, and visualization. Biotechniques 46: 333–4, 336.
78. IBM Research. 2009. IBM research aims to build nanoscale DNA sequencer to help drive down cost of personalized genetic analysis. In Loughran M, ed.; Press Releases, Vol. 2009. New York: IBM.
79. Eid J, Fehr A, Gray J, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323: 133–8.
80. Astier Y, Braha O, Bayley H. 2006. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 128: 1705–10.
81. Albertorio F, Hughes ME, Golovchenko JA, et al. 2009. Base dependent DNA-carbon nanotube interactions: activation enthalpies and assembly-disassembly control. Nanotechnology 20: 395101.
82. Medvedev P, Stanciu M, Brudno M. 2009. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6: S13–20.
83. Pepke S, Wold B, Mortazavi A. 2009. Computation for ChIP-seq and RNA-seq studies. Nat Methods 6: S22–32.
84. Flicek P, Birney E. 2009. Sense from sequence reads: methods for alignment and assembly. Nat Methods 6: S6–12.
85. Park PJ. 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10: 669–80.
86. Wall PK, Leebens-Mack J, Chanderbali AS, et al. 2009. Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics 10: 347.
87. Holt RA, Jones SJ. 2008. The new paradigm of flow cell sequencing. Genome Res 18: 839–46.
88. Dalca AV, Brudno M. 2010. Genome variation discovery with high-throughput sequencing data. Brief Bioinf. 11: 3–14.
