This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book.

Next-Generation Sequencing Technologies

W. Richard McCombie,1 John D. McPherson,2 and Elaine R. Mardis3

1Genome Center, Cold Spring Harbor Laboratory, Woodbury, New York 11797 2Department of Biochemistry and Molecular Medicine, University of California Davis Comprehensive Cancer Center, Sacramento, California 95817 3Institute for Genomic Medicine at Nationwide Children’s Hospital, The Ohio State University College of Medicine, The Institute for Genomic Medicine, Columbus, Ohio 43205 Correspondence: [email protected]

Although DNA and RNA sequencing has a history spanning five decades, large-scale mas- sively parallel sequencing, or next-generation sequencing (NGS), has only been commer- cially available for about 10 years. Nonetheless, the meteoric increase in sequencing throughput with NGS has dramatically changed our understanding of our genome and our- selves. Sequencing the first human genome as a haploid reference took nearly 10 years but now a full diploid human genome sequence can be accomplished in just a few days. NGS has also reduced the cost of generating sequence data and a plethora of sequence-based methods for probing a genome have emerged using NGS as the readout and have been applied to many species. NGS methods have also entered the medical realm and will see an increasing use in diagnosis and treatment. NGS has largely been driven by short-read generation (150 bp) but new platforms have emerged and are now capable of generating long multikilobase reads. These latter platforms enable reference-independent genome assemblies and long- range haplotype generation. Rapid DNA and RNA sequencing is now mainstream and will continue to have an increasing impact on biology and medicine.

HISTORY OF DNA SEQUENCING ratory to devise an enzymatic approach to DNA sequencing, and the studies of Escherichia coli Sanger Enzymatic Sequencing DNA polymerase I and its interactions with var- lthough the Sanger dideoxynucleotide se- ious nucleotides and nucleotide analogs as sub- Aquencing method was introduced in 1977 strates or inhibitors in Kornberg’s laboratory. At (Sanger et al. 1977), other enzymatic sequencing its essence, the enzymatic DNA sequencing re- methods were devised and published in the same action mimics many aspects of DNA replication time frame, including partial ribosubstitution in the cell. This is exemplified by the compo- sequencing of Barnes (1978), the plus and minus nents of the Sanger reaction, combining the method of Sanger (Sanger and Coulson 1975) DNA polymerase with its template, primed and the chemical cleavage method (Maxam with a synthetic oligonucleotide primer to pro- and Gilbert 1977). Dideoxynucleotide sequenc- vide a free 30 OH for the polymerase-catalyzed ing emerged from the interest in Sanger’s labo- addition of native nucleotides and dideoxynu-

Copyright © 2019 Cold Spring Harbor Laboratory Press; all rights reserved Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798

1

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. W.R. McCombie et al. cleotide analogs, the latter causing termination templates, and myriad other small changes of the elongating nucleotide chain by preventing that were achieved during the human genome the addition of any further nucleotides once in- project to make the first human reference ge- corporated. By providing a carefully adjusted ra- nome possible. tio of nucleotides (originally, one of which was radiolabeled) and specific dideoxynucleotides in Sanger Sequencing Data Analysis each of four enzymatic primer extension reac- tions (one with each of the four dideoxynucleo- Early Sanger sequencing projects focused on se- tides present), Sanger sequencing produces a quence determination for single genes or very pool of molecules in each reaction mix that in- small genomes (∼5000 bases at most). The cludes some molecules that are terminated at development of a computational package by each residue within the growing chain in which Rodger Staden, developed at the Medical Re- the dideoxynucleotide in that specific reaction is search Council (MRC) in concert with Sanger’s incorporated. When subjected to a denaturing laboratory allowed the size of DNA regions or polyacrylamide gel in a subsequent step, this genomes sequenced to scale significantly (Sta- produces a ladder of fragments across four lanes, den 1979, 1984; Staden et al. 2000). The Staden each differing by one nucleotide in length. This Package allowed randomly sheared, small frag- ladder is detected when exposed to X-ray film to ments of DNA from a larger original DNA reveal the ladder of fragments, and read from source to be sequenced randomly and computa- bottom to top to derive the nucleotide sequence tionally overlapped to reform the whole se- of each primed template (shortest to longest quence of the original, larger input source. The fragment). In summary, Sanger reactions are Staden Package was compiled for use on early characterized by two discrete steps, the first is operating systems and was widely used. It the enzymatic production of template-directed provided a sequence assembly capability for fragment ladders, each terminated by a specific input data and a viewer that permitted visual dideoxynucleotide as specified by the template, evaluation of the overlaps between fragments. and the second is an electrophoretic separation Six-frame translation of assembled sequences and sequence detection step. was also output, after which the potential for Over time, radiolabeled (32Por35S) nucleo- open reading frames could be determined. The tides and film-based detection were supplanted combination of Sanger sequencing and the Sta- by Hood and colleagues with the incorporation den Package for sequence assembly created the of four different fluorescent dyes enabling detec- ability to sequence genomes roughly 10 times as tion during electrophoretic separation by laser- large as originally possible, such as phage lamb- induced fluorescent emission of each fragment da (Sanger et al. 1982) and enabled the era of on a dedicated sequencing instrument (Smith genomics. et al. 1986). Because this allowed each of the As sequencing projects became focused on four reactions to be run in a single lane of a gel longer DNA inserts (e.g., cosmid subclones) and (owing to the four different emission spectra of on larger genomes, the Staden Package was sup- the dyes), it eliminated most gel artifacts and planted by the phred/phrap/consed suite from allowed automated reading of the sequence Phil Green’s laboratory (Ewing and Green 1998; ladders. Similarly, improvements in detection Ewing et al. 1998; Gordon et al. 1998). Here, chemistry were accompanied by advances such phred provided basecalling accuracy statistics as incorporating the fluorescent dye on the di- for Sanger reads, phrap was a read assembly pro- deoxynucleotides rather than primers to in- gram, and consed was an assembly viewer and crease flexibility, the use of polymerase enzymes editing program. It had the ability to work with with favorable characteristics that gave a more fluorescent sequence data as well. This suite pro- even incorporation of dideoxynucleotides and vided needed scalability and was indeed the pri- were stable at higher temperatures to alleviate mary toolkit for sequencing and finishing of all secondary structure effects in high G + content large genomes sequenced by mapped clone-

2 Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. Next-Generation Sequencing Technologies based methods, including the international hu- a variety of enzymes and detection schemes that man genome project (Lander et al. 2001). In a permit the corresponding instrument platform time span of ∼25 years, from the introduction of to collect data in lockstep with enzymatic syn- Sanger sequencing until the completion of the thesis on a template. Other shared concepts for human genome reference sequence (1977– these platforms include (1) fragmentation by 2004), the scalability and widespread use of physical shearing before sequencing of the Sanger sequencing had experienced significant genome or other DNA/RNA sample to be technological advances that permitted large- sequenced, (2) the generation of a sequencing scale projects to be completed. However, the “library” that results from the attachment of uni- effort to sequence, assemble, and annotate ge- versal, platform-specific adaptors (synthesized nomes with Sanger approaches was still a signif- oligonucleotides of known sequence) at each icant and expensive undertaking that required end of the template fragments to be sequenced, specialized equipment, expertise, and infra- and (3) on-surface template amplification by structure. This scenario began to change shortly virtue of library fragment hybridization to cova- thereafter, with the introduction of the first mas- lently attached oligonucleotides with sequence sively parallel DNA sequencing technology in complementarity to the synthetic adaptors. Fol- 2005 (Margulies et al. 2005). lowing fragment amplification, the templates are primed by virtue of unique sequences present in the synthetic adaptor, providing a free 30 OH for NEXT-GENERATION SEQUENCING (NGS) enzymatic extension on the templates coupled TECHNOLOGIES with on-instrument data detection. The on-sur- fi fi – Fundamental Aspects of NGS face ampli cation establishes a xed X Y coor- dinate for each template that, in turn, permits “ ” “ Most so-called massively parallel or next- data from nucleotide incorporation steps to be ” generation sequencing methods and instru- assigned to a specific template. Because SBS ments to date have close intellectual connections methods are detecting nucleotide incorporation to Sanger sequencing from the standpoint of from a population of amplified template mole- their fundamental enzymological underpin- cules at each X–Y coordinate, and because of nings, as will be reviewed here. The primary dif- various types of background noise that contrib- ferentiation is that, unlike Sanger sequencing, ute in a cumulative manner at each step in the massively parallel sequencing approaches do incorporation reactions, SBS is ultimately limit- not decouple enzymatic nucleotide incorpora- ed in its length of sequence read (“read length”) tion from sequence ladder separation and data because of the increasing noise over sequential acquisition. Rather, NGS instruments perform incorporation and imaging cycles. Modifica- both the enzymology and data acquisition in an tions to the enzymology and nucleotide chem- orchestrated and stepwise fashion, enabling se- istry or synthesis, as well as more sensitive de- quence data to be generated from tens of thou- tectors, have yielded improved signal-to-noise sands to billions of templates simultaneously. over time, permitting increased read lengths. “ ” Hence, the term massively parallel refers to Still, SBS read lengths remain shorter than Sang- this enhanced data-generation capacity that has er read lengths. This has impacted how the se- fi resulted in signi cant changes to DNA sequenc- quence data are analyzed, as will be described in ing and its applications since its introduction. the section on analytical aspects of SBS data.

Sequencing by Synthesis SBS Detection Schema Most commercial platforms using massively One key difference in SBS-based massively par- parallel sequencing are based on the concept of allel sequencing platforms is the mechanism sequencing by synthesis (SBS). In essence, these by which nucleotide incorporation is detected. methods permit nucleotide incorporation using There are essentially two themes that have been

Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798 3

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. W.R. McCombie et al. used in detection: direct fluorescence detection nate). These analytical pipelines are provided by or indirect sensing by nucleotide incorporation the instrument manufacturer, and perform reaction products. The first theme copies origi- postanalytical quality evaluation of the data, nal Sanger sequencing most closely, by labeling culling low-quality reads from the final output the nucleotides, which are modified to only al- and generating metrics that permit user evalua- low one base extension with fluorescent moieties tion of the overall read data quality. Finally, the that distinguish each nucleotide base by virtue of read data files are written in a specific format a specific fluorescent emission wavelength (e.g., suitable for their use in downstream analytical Illumina). After each cycle of nucleotide addi- pipelines. tion, the dye and the extension blocking moiety are cleaved off the growing chain and the cycle is Analytical Aspects of SBS Data repeated. The cyclical fluorescent additions at each X–Y coordinate represent the sequence of Following sequence data generation on massive- added bases. ly parallel platforms, a wide variety of analytical The second theme uses a combination of a steps may be pursued to provide information single type of unlabeled nucleotide per incorpo- from the data generated. In Sanger sequencing, ration cycle and a postincorporation detection reads were typically derived from subcloned step to identify which of the fragments incorpo- fragments originating from a large-insert DNA rated one (or more) nucleotides. The postreac- clone (cosmid, fosmid, or BAC), permitting read tion detection is fueled by a chemical byproduct assembly based on sequence similarity to reca- of the nucleotide incorporation reaction. As pitulate the initial cloned fragment in one or such, when multiple nucleotides of one type more “contigs” (subassemblies of the reads). In are adjacent in the template (i.e., sequential NGS approaches, the shorter read lengths often G nucleotides), the amount of chemical byprod- are not suitable for read assembly, especially if uct will scale accordingly, compared with the the NGS library was generated from a whole amount for a single-nucleotide incorporation. genome, not a cloned fragment of that genome. This variability in detected byproduct is, howev- This is particularly true when the genome is er, not infinitely linear because of the limits on larger than 1–2 Mb (million bases) and is com- the dynamic range of the detector. Hence, all plex in nature (i.e., contains repetitive se- instruments using this detection scheme tend quences). In general, short-read data are more to lower accuracy in mononucleotide repeat se- readily interpreted by their alignment to a refer- quences. Two types of postincorporation detec- ence genome than by de novo read assembly. As tion schema include: (1) detection of pyrophos- such, the early use of NGS sparked the develop- phate released during nucleotide incorporation(s) ment of numerous read alignment algorithms, that reacts with firefly luciferase to produce an each using one of several different principles to amount of emitted light corresponding to the match each short read to an existing genome number of incorporated nucleotides (e.g., Roche assembly. Following read alignment to a refer- 454), as detected by a highly sensitive charge- ence genome, there was an ensuing need to iden- coupled device (CCD) camera, or (2) detection tify variants—nucleotide sequence differences of the pH change resulting from released hydro- between the sequenced genome and the refer- gen ions following nucleotide incorporation(s) ence—and to subsequently interpret these as detected by an X–Y-specific miniaturized pH changes in terms of their impact on protein cod- meter (e.g., Ion Torrent). ing genes, regulatory regions, and other se- In general, regardless of the type of detec- quence data features. Here, the most straightfor- tion, each platform will process the X–Y coor- ward variant calling algorithms to develop were dinate–specific data following the sequencing those capable of detecting single-nucleotide var- instrument run completion, essentially turning iants (SNVs), also known as substitutions or the acquired signals into “reads” (i.e., the nucle- point mutations. Less straightforward was the otide sequence of the fragment at each coordi- detection of insertion or deletion events (“in-

4 Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. Next-Generation Sequencing Technologies dels”), in which one or more nucleotides were with data analysis in intimate and inextricable either added or deleted in comparison to the ways that has helped to shape the broad-based reference genome. These variants remain quite use of this fundamental tool. difficult to detect because of difficulties in align- Another type of analysis of NGS data uses ing the sequence read with indels to the reference the sequencer as a counting machine. In RNA- genome, although the accuracyof their detection Seq, sequences from different transcripts can be has been greatly enhanced by the increased read counted and their relative number compared. lengths of NGS platforms over time. Beyond read This allows a relatively accurate description of length, the development of paired-end read ap- the transcript abundance profile in a given tissue proaches also has enhanced our ability to detect or cell source. With the large number of reads structural variants, especially for complex repet- that can be obtained from next-gen sequencers itive genomes. In paired-end sequencing, a sec- the relative frequency of transcripts can be ob- ond set of SBS reads is generated by priming the tained over a very high dynamic range. Any bi- opposite end of each library fragment with an ological assay that can be converted into se- adaptor-specific oligonucleotide, thereby gener- quence fragment counting can be detected in ating concordant X–Y data from the opposite this way. end of each library fragment. These read pairs are then mapped to the genome, using informa- tion about their predicted separation (based on SINGLE-MOLECULE SEQUENCING (SMS) insert size for the library) and read orientation. TECHNOLOGIES Large-scale genomic alterations that represent Basics of SMS sequence duplications or amplifications, large deletions, or structural alterations (transloca- Massively parallel sequencing instruments using tions, inversions) continue to represent compu- SBS approaches use an enzymatic on-surface tational challenges to NGS data interpretation amplification of each library fragment before although, as for indels, their accuracy and false- the initiation of the sequencing reaction, to pro- positive rates have improved somewhat with lon- duce sufficient signal for detection by the instru- ger read lengths and read pairs that provide more ment. This intermediate amplification step, accurate read mapping and read directionality. while necessary, introduces a variety of artifacts. Here, reads that group outside of the predicted One type of artifact is caused by the inherent fragment length separation or with different ori- error rate of polymerases, some of which can entations than predicted, or both, are subject to masquerade as true variants if they occur suffi- secondary analysis that may predict large-scale ciently early in the fragment amplification pro- structural changes in comparison to the refer- cess. The other artifact is the result of nucleotide ence genome. Typical analytical pipelines that composition of the fragment, wherein fragments have the capability to detect these more challeng- with high or low G + C composition are less ef- ing alterations combine multiple algorithms to ficiently amplified compared with more equal examine the aligned read data and produce a ratio composition fragments. This step also consensus report—those alterations that are somewhat limits the length of SBS library inserts identified by multiple different algorithms and because of a desire to make the amplification supported by multiple read pairs—which are rapid, efficient, and consistent across all library then subject to manual scrutiny and often sec- fragments. In addition, as mentioned above, as ondary validation to support or refute them. As the clusters of molecules are sequenced, some described in the final section of this article, read molecules in each cluster misincorporate the alignment is just the starting point for many added base on each cycle of nucleotide addition. types of NGS analyses, dictated entirely by the This has the effect of increasing the noise at each upfront preparatory assays that are used to gen- nucleotide addition, which ultimately results in erate the sequencing library. As such, NGS ex- the noise being too high to allow further se- periments link together experimental design quencing. This too limits read lengths in SBS.

Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798 5

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. W.R. McCombie et al.

As a result, SMS methods and instrumentation measured as a nucleotide incorporation. Sources have been pursued to obviate library fragment of error for this sequencing method arise when a amplification steps and associated artifacts, correct nucleotide is incorporated too quickly which in turn permits increased library frag- for sufficient detection time by the optics, or ment sizes and read lengths. when multiple nucleotides are incorporated Although the advantages of SMS are obvi- in quick succession without resolution by the ous, the difficulty in overcoming the signal-to- optics of each incorporation (e.g., mononucleo- noise aspects of collecting data from a single tide stretches), both of which result in deletions molecule are considerable, regardless of the ap- in the sequence read. A deletion can also result proach. As a result, SMS reads typically have a from incorporation of an unlabeled (“dark”) nu- higher error rate than SBS reads and have re- cleotide, although these are rare. Another source quired both higher sequence data coverage and of error occurs when the incorrect nucleotide novel computational approaches to analyze the dwells too long in the active site and is detected resulting data. Despite these challenges, two by the optics yet is not actually incorporated by SMS devices have achieved commercial status, the polymerase, resulting in an insertion error. each using highly unique approaches to se- The Pacific Biosciences platform has in- quencing detection. creased average read length and accuracy over time by virtue of a variety of enzymatic, nucleic acid chemistry, and optical detector sensitivity Single-Molecule Fluorescent Sequencing improvements. Current per-read error rate is The initial SMS device to achieve commercial ∼10%, although these are random errors (e.g., release was the Pacific Biosciences instrument, not systematic or predictable), which means that which emerged from a proof-of-principle in- the consensus error rate improves as coverage strumentation concept published by Watt depth increases. Improvements to the method Webb’s group at Cornell University (Levene et and instrument have cumulatively allowed for al. 2003). This instrument used patterned nano- increasing library fragment lengths, commensu- fabricated chambers on a silicon surface, called rate with increased optical imaging duration for zero-mode waveguides (ZMWs), to isolate indi- data collection. This, in turn, has placed renewed vidual polymerase enzymes coupled to primed emphasis on techniques to produce fragment DNA template molecules. On supplying these lengths in excess of 10,000 bp for library con- polymerases with physiological concentrations struction. Throughput per instrument run also of fluorescent labeled nucleotides, the copying has been impacted by increasing the number of of each template can progress. Here, the ZMW ZMWs from which data can be collected, in provides a mechanism to focus the detection combination with the aforementioned increases apparatus onto the polymerase, effectively per- in read length. It is now possible to have average mitting the instrument optics to collect data in read lengths in excess of 10,000 bases with this real time as the polymerase copies the template. instrument compared with 150–250 base maxi- The resulting “movies” represent optical data mum read lengths with Illumina sequencers. collected at the active site of each polymerase, in which an incoming nucleotide is detected suc- Single-Molecule Nanopore-Based cessfully by virtue of its increased dwell time in Sequencing the active site during incorporation. On its ad- dition to the growing synthesized strand, each A second approach to SMS, commercialized nucleotide loses its fluorescent label, which is by Oxford Nanopore Technologies, uses trans- covalently attached onto the phosphate portion location of single DNA strands through a sur- of the molecule and diffuses out of focus. Incor- face-positioned nanopore grid to generate rect template-matching nucleotides can enter sequencing data by sensing the changes in elec- the active site buttypically have insufficient dwell trical conductance during nucleotide transloca- time of detection of their attached fluor to be tion of the DNA strand through the pore. Each

6 Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. Next-Generation Sequencing Technologies nanopore has a detector positioned adjacent to it SBS data could be aligned, or the SBS data were that records these conductance changes over aligned to the longer reads to improve overall time of fragment translocation through the accuracy and the resulting highly accurate long pore. The nanopore sequencing library is pro- reads were assembled. Despite these innovative duced much like other NGS libraries, and uses approaches, the largest genomes routinely as- adaptor sequences that have a platform-specific sembled with these approaches were bacterial composition, basecalling model that is built on genomes (Ribeiro et al. 2012). The benefitof known sequence context-conductance correla- these early innovations to long-read SMS data tion data, and the resulting sequence read is in- analysis was that they fueled the development terpreted computationally from the best fit to the of sequence assembly approaches suitable for model. Like the Pacific Biosciences data, the sin- read data with a random error model, and as gle-molecule nanopore data is subject to signal- the sequence data quality and read lengths of to-noise constraints and has a correspondingly SMS reads improved, these algorithms became high error rate when compared with SBS data. suitable for long-read assembly without the need This error rate also has improved over time and for SBS error correction. development of improved basecalling models, improved polymerase properties, and modifica- Analysis of Long-Read Single-Molecule Data tionsto the nanopore that slow the traverse of the molecules through the nanopores, permit more The combination of increased read lengths from data richness that correspondingly provides a SMS platforms, improved error rates, and opti- better fit to the model. Unlike other platforms, mized assembly algorithms has now resulted in many errors are not random, being sequence the singular use of SMS data for genome as- context dependent, resulting in error not as read- sembly. There are distinct advantages to this ily resolved with increasing coverage in these re- approach that may be more or less important, gions. Similarly, read lengths have improved over depending on the specific application. In partic- time as well, shifting focus to the development of ular, long reads offer long-range contiguity and, library construction approaches that produce hence, for diploid organism haplotypes are gen- longer fragments. Unlike the Pacific Biosciences erated that span long distances along a chromo- technology and all of the previously discussed some. Similarly, complex repetitive sequences types of SBS platforms, the Oxford Nanopore may be elucidated from long-read data, enabling sequencing method does not require a polymer- complicated genomic regions to be accurately ase or labeled nucleotides to conduct sequence and completely sequenced even where highly data generation. At its essence, the nanopore is similar tandem duplications exist (also known the only reagent needed for the platform to se- as segmental duplications) (Huddleston et al. quence the library fragments. 2014). Indeed, as the capability for long-read assembly emerged, sequencing and assembly of a human genome using Pacific Biosciences SMS Analysis of Combined Short- (SBS) data was compared with the same genome se- and Long-Read (SMS) Data quenced and aligned using Illumina SBS data Long-read lengths offer significant advantages (Chaisson et al. 2015). This comparison high- in genome assembly, yet, initially, the high error lighted the advantages of long-read SMS assem- rates obtained from SMS platforms made direct bly to provide contiguity, haplotype resolution, assembly of long-read SMS data difficult to and to deconvolute the sequence data in com- achieve. To address these difficulties, several plex areas of the human genome that are simply groups have explored different approaches that not resolvable by short-read data. One might combined SBS and SMS data to achieve high then raise the obvious question of why more quality and contiguity in genome assemblies. Ei- human resequencing efforts are not being pur- ther the long-read datafrom SMS platforms were sued with SMS data generation. The primary assembled as a scaffold against which short-read reason at this writing for continued human

Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798 7

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved. This is a free sample of content from Next-Generation Sequencing in Medicine. Click here for more information on how to buy the book. W.R. McCombie et al.

(and other complex organisms) genome se- et al. 2014. Reconstructing complex regions of genomes 24: quencing using SBS over SMS is simply one of using long-read sequencing technology. Genome Res 688–696. doi:10.1101/gr.168450.113 cost and throughput, wherein SBS offers clear Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, advantages for both. Similarly, the short-read Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, libraries needed for SBS are more straightfor- et al. 2001. Initial sequencing and analysis of the human ward to generate, are more readily automated, genome. Nature 409: 860–921. doi:10.1038/35057062 and require significantly less input DNA, mak- Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. 2003. Zero-mode waveguides for single-mol- ing them suitable for clinical samples that often ecule analysis at high concentrations. Science 299: 682– provide limited amounts of DNA. 686. doi:10.1126/science.1079700 The use of SMS sequencing, either alone or Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, in combination with SBS data, continues to be Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. 2005. Genome sequencing in microfabricated an active research area. Human genomes can re- high-density picolitre reactors. Nature 437: 376–380. liably assemble into considerably more con- doi:10.1038/nature03959 tiguous genomes than is possible with SBS alone. Maxam AM, Gilbert W. 1977. A new method for sequencing Recently, even larger genomes, such asthe 18-Gb DNA. Proc Natl Acad Sci 74: 560–564. doi:10.1073/ wheat genome have been assembled with com- pnas.74.2.560 Ribeiro FJ, Przybylski D, Yin S, Sharpe T, Gnerre S, Abouel- bined SMS and SBS data (Zimin et al. 2017). leil A, Berlin AM, Montmayeur A, Shea TP, Walker BJ, Applications of NGS in biomedical research et al. 2012. Finished bacterial genomes from shotgun se- and clinical diagnostics include (1) DNA-based quence data. Genome Res 22: 2270–2277. doi:10.1101/ applications in research, (2) RNA-based appli- gr.141515.112 cations in research, (3) combinatorial DNA– Sanger F, Coulson AR. 1975. A rapid method for determin- ing sequences in DNA by primed synthesis with DNA RNA interaction applications, and (4) clinical polymerase. J Mol Biol 94: 441–448. doi:10.1016/0022- applications of NGS. 2836(75)90213-2 Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74: REFERENCES 5463–5467. doi:10.1073/pnas.74.12.5463 Sanger F, Coulson AR, Hong GF, Hill DF, Petersen GB. 1982. Barnes WM. 1978. DNA sequencing by partial ribosubstitu- Nucleotide sequence of bacteriophage λDNA. J Mol Biol 119: – tion. J Mol Biol 83 99. doi:10.1016/0022-2836(78) 162: 729–773. doi:10.1016/0022-2836(82)90546-0 90271-1 Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Con- Chaisson MJ, Huddleston J, Dennis MY, Sudmant PH, Malig nell CR, Heiner C, Kent SB, Hood LE. 1986. Fluorescence M, Hormozdiari F, Antonacci F, Surti U, Sandstrom R, detection in automated DNA sequence analysis. Nature Boitano M, et al. 2015. Resolving the complexity of the 321: 674–679. doi:10.1038/321674a0 human genome using single-molecule sequencing. Na- ture 517: 608–611. doi:10.1038/nature13907 Staden R. 1979. A strategy of DNA sequencing employing 6: – Ewing B, Green P. 1998. Base-calling of automated se- computer programs. Nucleic Acids Res 2601 2610. quencer traces using phred. II: Error probabilities. Ge- doi:10.1093/nar/6.7.2601 nome Res 8: 186–194. doi:10.1101/gr.8.3.186 Staden R. 1984. Computer methods to aid the determination 12: Ewing B, Hiller L, Wendl MC, Green P. 1998. Base-calling of and analysis of DNA sequences. Biochem Soc Trans – automated sequencer traces using phred. I: Accuracy 1005 1008. doi:10.1042/bst0121005 assessment. Genome Res 8: 175–185. doi:10.1101/gr.8.3.175 Staden R, Beal KF, Bonfield JK. 2000. The Staden Package, Gordon D, Abajian C, Green P. 1998. Consed: A graphical 1998. Methods Mol Biol 132: 115–130. tool for sequence finishing. Genome Res 8: 195–202. Zimin AV, Pulu D, Hall R, Kingan S, Clavijo BJ, Salzberg SL. doi:10.1101/gr.8.3.195 2017. The first near-complete assembly of the hexaploid Huddleston J, Ranade S, Malig M, Antonacci F, Chaisson M, bread wheat genome, Triticum aestivum. Gigascience 6: Hon L, Sudmant PH, Graves TA, Alkan C, Dennis MY, 1–7.

8 Cite this article as Cold Spring Harb Perspect Med doi: 10.1101/cshperspect.a036798

© 2019 by Cold Spring Harbor Laboratory Press. All rights reserved.