Whole-Genome DNA Sequencing

COMPUTATIONAL BIOLOGY

Computation is integrally and powerfully involved with the DNA sequencing technology that promises to reveal the complete human DNA sequence in the next several years. After introducing the latest DNA sequencing methods, this article describes three current approaches for completing the sequencing.

The prevailing method of determining the sequence of a long DNA segment is the shotgun sequencing approach, in which a random sampling of short fragment sequences is acquired and then assembled by a computer program to infer the sampled segment's sequence. In the early 1980s, such segments were typically on the order of 5,000 to 10,000 base pairs (5 to 10 kbp). By 1990, this method was sequencing segments on the order of 40 kbp, and by 1995, the entire 1,800-kbp H. influenzae bacterium had been sequenced [1]. The source segment is clearly becoming extremely large without any concomitant increase in the length of the sampled fragment sequences. Soon, shotgun data sets could well consist of millions of sampled fragment sequences and will require significant computational resources to assemble.

The whole-genome shotgun sequencing of H. influenzae in 1995 showed that direct shotgun sequencing could handle a much larger source sequence segment than biologists had commonly thought. Before that, cosmid-sized clones of 30 to 50 kbp were considered this approach's upper limit. (A cosmid is a type of vector for manipulating and replicating inserted pieces of DNA.) Now, the shotgun sequencing of 200-kbp bacterial artificial chromosomes (BACs) is a given. This achievement inspired Jim Weber and me to propose the use of a shotgun approach to sequence the human genome [2], after which we requested funding for a pilot project from the US National Institutes of Health. The established community rejected our controversial proposal [3], but in May of 1998, Craig Venter and the Perkin-Elmer Corporation announced a new private venture, Celera Genomics, aimed at using a whole-genome shotgun approach to sequence the fruit fly Drosophila (≈120 Mbp) in 1999 and the human genome (≈3.5 Gbp) by 2001 [4].

After introducing the basic technology of shotgun DNA sequencing and briefly summarizing the computational and algorithmic results to date on the problem of assembling shotgun data sets, this article characterizes the nature of DNA sequences and improvements in sequencing technology that affect the computational problem. It then analyzes three current proposals for sequencing the human genome, including the one we're pursuing at Celera Genomics.

COMPUTING IN SCIENCE & ENGINEERING, MAY–JUNE 1999

[...] along a given DNA strand [5,6]. In the version used most commonly today, a biochemical sequencing reaction produces a collection of geometrically distributed copies of every prefix of the given DNA strand such that the last nucleotide or base (A, C, G, or T) of each prefix is known. In a process called gel electrophoresis, this material passes through a permeable gel under an applied voltage, which separates the prefixes in order of length, letting either a technician or a combination of a laser, charge-coupled device detector, and software determine the sequence of nucleotides along one end of the source strand. How much of the source we can determine is limited by the fact that the size ratio between consecutive prefixes approaches 1 and the number of long prefixes is diminishing geometrically. With today's technology, biologists can resolve on average the source's first 500 nucleotides and upwards of 800 to 900 bases for a particularly clean reaction. The result of such a sequencing experiment is called a read. Over the last 20 years, special machines and robots have been developed that automate much of this process.

To determine the sequence of much longer stretches of DNA, Frederick Sanger and his colleagues devised the shotgun DNA sequencing strategy [7]. This approach entails sampling DNA fragments as randomly as possible from the source sequence and then producing a sequencing read of the first 300 to 900 bases of one end of each fragment. To maximize the sequence produced from each fragment, such experiments involve sampling fragments whose length is longer than a read's maximum expected length. If enough fragments are sequenced and their sampling is sufficiently random across the source, the process should let us determine the source by finding sequence overlaps among the reads of fragments that were sampled from overlapping stretches. This basic shotgun approach is at the heart of all current approaches to genome sequencing.

As currently practiced in many DNA sequencing centers, the basic shotgun protocol starts with a pure sample of a large number of copies of the source DNA whose sequence is to be determined, typically a segment of 100 kbp or longer.

1. Technicians randomly fracture the sample either using sound (sonication) or passing it through a nozzle under pressure (nebulation), which produces a uniformly random partitioning of each copy of the source strand into a collection of DNA fragments.
2. To remove fragments that are too large or too small, this pool of fragments is size-selected, typically using size separation under gel electrophoresis and then simply excising a band of the gel containing the desired size. With care, this procedure produces a normally distributed collection of fragment sizes with a 10% variance.
3. The technicians then insert the size-selected fragments into the DNA of a genetically engineered bacterial virus (phage), called a vector. Usually, at most one fragment is inserted at a predetermined point, called the cloning site, in the vector. Typically, the number of vectors where more than one fragment gets inserted is less than 1%, but can be as low as 0.01% for some meticulously executed protocols. The fragments at this point are often called inserts and the collection of inserts is a library.
4. A bacterium is then infected with a single vector, which reproduces to produce a bacterial colony containing millions of copies of the vector and its associated insert. The procedure thus has effectively cloned a pure sample of the given insert. This procedure repeats simultaneously for as many inserts as desired for sequencing in the final step.
5. By design, the vector then permits a sequencing reaction to be performed, starting just to the left or right of a source fragment's insertion point. The sequencing reaction produces a read of the first 300 to 900 bases of one end of the insert.

A key failure in this process occurs if the sampled reads are not randomly sampled but biased to come from particular regions of the source. This can happen for three reasons: the fracturing of the fragments might be biased, the insertion of fragments into vectors might be biased, or some insert/vector combinations might not clone properly because the insert has reacted toxically with the host/vector environment. Anecdotal evidence suggests that the first two biases are minimal in well-performed experiments, but the third bias definitely exists. Picking host/vector combinations for which the insert DNA will be relatively inert will reduce this toxicity bias.

Sequencing reactions tend to fail for a variety of reasons. In a production context, investigators consider a 70 to 80% success rate to be a very good yield. In initially processing the sequencing information, technicians must screen these failed reactions and also screen reads from vectors where no insert occurred. Moreover, because the sequencing reaction begins in the vector at one end of the insert location or the other, the initial part of a read can consist of the vector DNA sequence leading up to the beginning of the insert. This bit of vector sequence must be carefully identified and removed. Similarly, if an insert is particularly short, the technicians might need to trim vector sequence from the end of a read. After taking these steps, the process will have produced a set of sequence reads randomly sampled from the source sequence.

See the sidebar, "The fragment-assembly problem," for a discussion of the computational problem associated with shotgun sequencing.

The fragment-assembly problem

Given the reads obtained from a shotgun protocol, the computational problem, called fragment assembly, is to infer the source sequence given the collection of reads. For the purposes of illustration, we might parameterize a typical problem occurring in practice today as follows (see the "Definitions" box for help with the terminology). For a source strand of length G = 100 kbp, we would then typically sequence R = 1,500 reads of average length $\bar{L} = 500$. Thus, we would collect altogether $N = R\bar{L} = 750$ kbp of data, so that we have sequenced on average every base pair in the source $\bar{c} = N/G = 7.5$ times. The quantity $\bar{c}$ is the average sequencing coverage, and practitioners say that the source has been sequenced to 7.5X coverage. In practice, an investigator will decide on a given level of coverage and then sequence inserts until a total of $N = G\bar{c}$ base pairs of data have been collected. Software for fragment assembly must account for the following essential characteristics of the data:

• Incomplete coverage: Not every source base pair is sequenced exactly $\bar{c}$ times due to both the stochastic nature of the sampling and cloning bias I've mentioned. Some portions of the source might be covered by more than $\bar{c}$ reads, and others might not be covered at all. In general, there can be several such gaps or maximal contiguous regions where the source sequence has not been sampled. Gaps necessarily dictate a fragmented, incomplete solution to the problem.
• Sequencing errors: The gel-electrophoretic experiment yielding a read, like most physical experiments, is prone to error, especially near the end of a read where the signal strength and separation of consecutive prefix fragments become small. In a very stringently controlled production environment, the error rate is less than 1% in the first 500 or so bases. Thereafter, the error rate increases rapidly, reaching more than 15% from 650 to 900 bases into the read, after which the resulting sequence is effectively unusable. (However, I have seen data sets where an error rate of 5% occurs in the "sweet" part of the read, consisting of the first 500 bases.)
• Unknown orientation: DNA is a double-stranded helix. Which of the source sequence's two strands is actually read depends on the arbitrary way the given insert orients itself in the vector. Thus we do not know whether to use a read or its Watson-Crick complement in the reconstruction. The Watson-Crick complement $(a_1 a_2 \ldots a_n)^c$ of a sequence $a_1 a_2 \ldots a_n$ is $wc(a_n) \ldots wc(a_2)\, wc(a_1)$, where wc(A) = T, wc(T) = A, wc(C) = G, and wc(G) = C.

Definitions
G: length of target sequence
$\bar{L}$: average length of sequence read
R: number of sequencing reads in shotgun data set
$N = R\bar{L}$: total number of base pairs sequenced
$\bar{I}$: average length of a clone insert
$\bar{c} = N/G$: average sequence coverage
$\bar{m} = R\bar{I}/2G$: average clone or map coverage

I will now develop a mathematical formulation of the fragment-assembly problem. For input, we have a collection of reads $F = \{f_i\}_{i=1}^{R}$ that are sequences over the four-letter alphabet Σ = {A, C, G, T}. An ε-layout is a string S over Σ and a collection of R pairs of integers, $(s_i, e_i)_{i \in [1,R]}$, such that

• if $s_i < e_i$ then $f_i$ can be aligned to the substring $S[s_i, e_i]$ with fewer than $\varepsilon |f_i|$ differences,
• if $s_i > e_i$ then $f_i^c$ can be aligned to the substring $S[e_i, s_i]$ with fewer than $\varepsilon |f_i|$ differences, and
• $\bigcup_i [\min(s_i, e_i), \max(s_i, e_i)] = [1, |S|]$.

The string S represents the reconstruction of the source strand, and the integer pairs indicate the substrings of S that gave rise to each read. The order of $s_i$ and $e_i$ encodes the orientation of the fragment read in the layout, that is, whether $f_i$ was sampled from S or its complement strand. The parameter ε ∈ [0,1] models the maximum error rate of the sequencing process.

The set of ε-layouts models the set of all possible solutions to the fragment-assembly problem. Of course, there are many such solutions, so the computational problem is to find one that is in some sense best. Traditionally, the fragment-assembly problem has been phrased as one of finding a shortest common superstring (SCS) of the fragment reads within error rate ε; that is, find an ε-layout for which S is as short as possible. Unfortunately, as Figure A illustrates, this appeal to parsimony often produces overcompressed results when the source sequence contains repeated subsegments. This tendency has prompted the proposal of maximum-likelihood criteria based on the distribution of fragment start points in the layout [1]. While such a criterion provides a better objective function, algorithm designs for computing it have proven elusive.

[Figure A: a source A X B X C with repeat X = Xl.Xc.Xr; panels show fragment sampling, the shortest reconstruction, and the correct reconstruction.]
Figure A. The shortest answer isn't always the correct one. A DNA source at the upper left consists of unique stretches A, B, and C separated by a repeated sequence X. Below it, the source has been sampled perfectly uniformly across the target, as evidenced by the correct reconstruction of the pieces shown at lower right. But note the result in the upper right of a program that produces the minimum-length reconstruction. The interior portion Xc of the repeat sequence, which is covered only by reads completely interior to X, is overcompressed.

A common computational architecture for fragment assembly, advocated by several authors [2–4], divides the problem into three phases: overlap, layout, and consensus. The overlap phase compares every fragment read against every other read (in both orientations) to determine if they overlap. Given the presence of sequencing errors, an overlap is necessarily approximate in that not all characters in the overlapping region coincide. This problem is a variation on traditional sequence comparison where the degree of difference permitted is bounded by ε. The best deterministic designs for finding all ε-overlaps let us solve problems on the order of N = 1 to 5 Mbp in a matter of minutes on a typical workstation [5]. For contexts requiring even greater speed, most investigators resort to heuristics that detect overlapping reads by finding exact common substrings of some length k using a hashing scheme. Typically, they choose k to provide the best compromise between sensitivity and speed for a given N and ε. Conceptually, we can think of the result of the overlap phase as producing an overlap graph in which every vertex models a read and every edge an ε-overlap between two reads.

The layout phase determines the pairs $(s_i, e_i)$ that position every fragment in the assembly. In graph theoretic terms, we accomplish this by selecting a spanning forest of the overlap graph; such a subset positions every fragment with respect to every other, transitively, through the overlaps on the path between them. Finding a spanning forest that optimizes a criterion such as shortest or most likely is known to be NP-hard [6]. Investigators have proposed greedy algorithms that come within a given factor of optimal [7,8], simulated annealing [9] and genetic algorithms [10], relaxation methods based on generating either spanning forests or weighted matchings in order of score [4], problem simplification by chordal graph collapsing [1], and a reduction to greedy Eulerian tour [11]. Ultimately, the complicating factor is the presence of repeated strings within the source, which has led to the use of quality values assessing the accuracy of each base in a read, in an attempt to distinguish ε-overlaps that are true from those induced by repeats. Currently, such an edge discriminator coupled with the basic greedy algorithm is employed in the most widely used phrap program [12].

Finally, the consensus phase forms a consensus-measure multiple alignment of the reads in all regions where the coverage is two or greater. The resulting consensus character for each position of the multiple alignment gives the ultimate reconstruction S. Like pairwise sequence comparison, sequence multiple alignment has been extensively studied. In most formulations, investigators start with the initial multiple alignment obtained by pairwise merging the alignments between reads using the overlaps selected for the spanning forest by the layout phase [13]. They then refine this initial multiple alignment using either a window-sweep optimization, a Hidden-Markov model gradient-descent algorithm [14], or round-robin realignment [15].

Before we return to the discussion of sequencing, it behooves us to appreciate some statistics of shotgun sampling. In an analysis that is essentially the dual of that for packet collision on an Ethernet (as here we want packets to collide), Michael Waterman and Eric Lander determined that if sampling were perfectly uniform, we should expect to see [16]

• $1 - e^{-\bar{c}}$ of the source strand covered by some read,
• $R e^{-\bar{c}}$ gaps in the coverage of the source,
• gap-free segments or contigs of average length $(\bar{L}/\bar{c})\, e^{\bar{c}}$, and
• gaps of average length $\bar{L}/\bar{c}$.

There are several interesting things to note about these results. First, the percentage of the genome covered depends only on $\bar{c}$ and not on the size of the reads or length of the source. Second, the number of gaps rises to a maximum at $\bar{c} = 1$ and declines with an exponentially vanishing tail thereafter. Contig lengths rise exponentially in $\bar{c}$, and gaps quickly become very small.

References
1. E. Myers, "Toward Simplifying and Accurately Formulating Fragment Assembly," J. Computational Biology, Vol. 2, No. 2, 1995, pp. 275–290.
2. H. Peltola, H. Soderlund, and E. Ukkonen, "SEQAID: A DNA Sequence Assembly Program Based on a Mathematical Model," Nucleic Acids Research, Vol. 12, No. 1, pp. 307–321.
3. X. Huang, "A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps," Genomics, Vol. 14, 1992, pp. 18–25.
4. J. Kececioglu and E. Myers, "Exact and Approximate Algorithms for the Sequence Reconstruction Problem," Algorithmica, Vol. 13, Nos. 1–2, 1995, pp. 7–51.
5. E. Myers, "A Sublinear Algorithm for Approximate Keyword Matching," Algorithmica, Vol. 12, Nos. 4–5, 1994, pp. 345–374.
6. J. Turner, "Approximation Algorithms for the Shortest Common Superstring Problem," Information and Computation, Vol. 83, 1989, pp. 1–20.
7. J. Tarhio and E. Ukkonen, "A Greedy Approximation Algorithm for Constructing Shortest Common Superstrings," Theoretical Computer Science, Vol. 57, 1988, pp. 131–145.
8. A. Blum et al., "Linear Approximation of Shortest Superstrings," J. ACM, Vol. 41, No. 4, 1994, pp. 630–647.
9. C. Burks et al., "Stochastic Optimization Tools for Genomic Sequence Assembly," Automated DNA Sequencing and Analysis, M.D. Adams, C. Fields, and J.C. Venter, eds., Academic Press, New York, 1994, pp. 249–259.
10. R. Parsons, S. Forrest, and C. Burks, "Genetic Algorithms for DNA Sequence Assembly," Proc. First Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, Calif., 1993, pp. 310–318.
11. R. Idury and M.S. Waterman, "A New Algorithm for Shotgun Sequencing," J. Computational Biology, Vol. 2, No. 2, 1995, pp. 291–306.
12. B. Ewing et al., "Base-Calling of Automated Sequencer Traces Using phred; Accuracy Assessment," Genome Research, Vol. 8, No. 3, 1998, pp. 175–185.
13. D. Feng and R. Doolittle, "Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees," J. Molecular Evolution, Vol. 25, No. 4, 1987, pp. 351–360.
14. A. Krogh et al., "Hidden Markov Models in Computational Biology," J. Molecular Biology, Vol. 235, No. 5, 1994, pp. 1501–1531.
15. E. Anson and E. Myers, "ReAligner: A Program for Refining DNA Sequence Multialignments," J. Computational Biology, Vol. 4, No. 3, 1997, pp. 369–383.
16. E.S. Lander and M.S. Waterman, "Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis," Genomics, Vol. 2, No. 3, 1988, pp. 231–239.

DNA sequence characteristics

As we've seen, when researchers first began employing shotgun sequencing in the early 1980s, a typical source sequence size was 5 to 10 kbp. By 1990, they were shotgun-sequencing cosmid-sized sources for which G ≈ 40 kbp, and in 1995 the bacterium H. influenzae of length 1.8 Mbp was successfully shotgun sequenced. In the past three years, 20 bacterial genomes in this size range have been shotgun-sequenced. In August 1998, Celera Genomics was formed to shotgun-sequence the entirety of the fruit fly Drosophila in 1999 (G ≈ 120 Mbp) and the human genome by 2001 (G ≈ 3.0 Gbp).

With the trend toward sequencing higher organisms (which have an extensive repeat structure not found in lower-order organisms) and toward larger and larger source sizes, investigators commonly see several repetitive substrings in a source sequence of even moderate size. Before 1990, this was rarely considered an impediment to sequencing as it was practiced then, but it is now clearly a major computational difficulty. Repeats occur at several scales. For example, in the human T-cell receptor locus, there is a five-fold repeat of a trypsinogen gene that is 4 kbp long and that varies 3 to 5% between copies. Three of these were close enough together that they appeared in a single shotgun-sequenced cosmid source [8]. Such large-scale repeats are problematic for shotgun approaches because reads with unique portions outside the repeat cannot span it. Smaller repeated elements such as Alus, which are small retrotransposons of length approximately 300 bp, do not share this feature but are still problematic because they cluster and can constitute up to 50 or 60% of the source sequence, with copies varying from 5 to 15% between each other [9,10]. Finally, in telomeric and centromeric regions, microsatellite repeats of the form $x^n$ are common [9]. The repeated "satellite" x is three to six bases long, n is very large, and the motif has 1 to 2% variation within it.

Repeats have three characterizing dimensions: length, copy number, and fidelity between copies. As the examples above demonstrate, repeats found in DNA cover a wide range along each of these dimensions. From a computational perspective, it is the long, high-fidelity repeats of low copy numbers that cause the greatest difficulty. On a whole-genome scale, the problem
initially looks quite daunting. For example, consider human DNA. It contains a number of ubiquitous repeats such as the Alu above and the longer LINE (long interspersed nucleotide element) elements that have an average length of 1,000 base pairs. The human genome contains an estimated one million Alus and 200,000 LINE elements, making it roughly 10% Alu and 5% LINE in terms of total content. We further estimate that there are roughly 80,000 distinct genes in the human genome, and probably 25% of these have two to five copies within the genome. There are also large 43-kbp-long RNA pseudogene arrays that occur in tandem clusters and that vary by only 2 to 3% between copies. Finally, there have been large 50- to 150-kbp-long genome duplications where a section of one chromosome has been duplicated near the centromere of another. Any attempt to directly shotgun a large portion or the entirety of a genome as a single source thus must carefully contemplate the impact of repeats on its underlying algorithms.

While practitioners have ambitiously increased the size of the source sequences, the technology for obtaining a read has not improved the length of a read $\bar{L}$ at a corresponding rate, leading to greater and greater ratios of $\omega = G/\bar{L}$. Thus, the expected number of gaps grows as $\omega \bar{c} e^{-\bar{c}}$, ignoring the exacerbating effect of clone bias. Fragmentation of the solution into a collection of gap-separated contigs therefore increases at least linearly with source size for a fixed level of sequencing coverage. This, combined with the increasing difficulty of correctly resolving repetitive elements in the source, has led investigators to develop enhancements to the shotgun sequencing protocol.

"Double-barreled" shotgun sequencing

In the predominant variation on shotgun sequencing, inserts are size-selected so that their average length $\bar{I}$ is at least $2\bar{L}$ or longer and both ends of the insert are sequenced [11]. This procedure gives rise to a pair of reads, called mates, that are in opposite orientations and at a distance from each other approximately equal to the insert length. While these mate pairings could operate in an integral way within the fragment-assembly software, this information typically serves instead to confirm the assembly and most importantly to order contigs with respect to each other. (A contig is a maximal overlapping arrangement of fragments covering a contiguous region of the reconstructed fragment.) That is, if a read in one contig has a mate in another contig, we know the orientation of the contigs to each other and have an idea of the distance between them. At 7.5X coverage, for example, contigs tend to be quite large, at an average of 66 kbp, and gaps quite small, at an average of 66 bp. Because there are typically many mated pairs between a pair of adjacent contigs, we can quite reliably order the contigs. Such a maximally linked and ordered set of contigs is called a scaffold (see Figure 1). The next step is to sequence the small gaps between adjacent contigs by amplifying a sample of the sequence between the contigs with a process called PCR (for polymerase chain reaction) that only requires knowing 18 to 25 unique bases on either side of the gap to be amplified.

With the one exception of the TIGR (from The Institute for Genomic Research) assembler [12], investigators have used mate information only for confirmations, primarily because it can be quite unreliable, with on average 10% of reported pairs proving unrelated. There are three sources of such false positives.

• Two small fragments from distant parts of the source might get inserted into the vector. For such a chimeric clone, the reads at both ends thus come from uncorrelated parts of the genome. Appropriate care, such as size-selecting clones or using asymmetric linkers in the insertion step, can keep this source of false pairings to as low as 0.01%.
• A sample can simply be mistracked as it flows through the sequencing factory. For example, a technician might place a microtiter plate in the wrong orientation within a stack of plates or transfer materials to an incorrect destination. Simple precautions such as using asymmetric plates and dual-bar scanning any transfer can also keep this source of false positives under 0.1%.
• In slab gel-sequencing machines, the material often does not migrate along a straight line but gently undulates, causing the optical-scanning software to misnumber the 32 to 96 lanes of sequencing reactions that run simultaneously on a given slab. This predominant source accounts for 10% of the false-positive rate.

How then should we choose the average size $\bar{I}$ of the inserts in such a strategy? We can define
38 COMPUTING IN SCIENCE & ENGINEERING Read1 Read2 Figure 1. Mates, contigs, Insert gaps, and scaffolds. The top of the figure shows a Vector blue vector with a green insert for which read reac- Mates tions are primed at both ends. A light green dashed arc depicts the relation- ship between the reads and is used within the as- PCR PCR sembly shown below it. The relative order of the Contig 1 Contig 2 Contig 3 Gap 1 Gap 2 differently colored three contigs is fixed by the Scaffold = {Contig 1, Contig 2, Contig 3} mate pairings. We then prime PCR reactions across the two gaps (primers in the map or clone coverage of such a project as m– = mid 1980s, the US National Institutes red, polymerase chain re- – CI /G, where the number of clones C is R/2 in of Health and Department of Energy actions sequence in gold). the current context. From the definitions it fol- announced the start of the Human The three contigs in ag- – – lows that m– = –c (I / 2L) is larger than –c , so there Genome Program (HGP) in 1990, gregate constitute a single −IL/2 are a factor of e fewer gaps in the coverage with an objective to do so by 2005 in scaffold. of the source by inserts than there are gaps in the concert with the UK’s Sanger Centre coverage of the source by reads. For example, if and other laboratories in Europe and − inserts are 5 kbp long, there are a factor of e 5 or Japan.14 A single approach, described next, was 148 fewer clone gaps than sequence gaps. From adopted and continues to be followed. In the last another viewpoint, scaffolds are on average 148 few years, several interesting alternative strategies times larger than contigs, so that for 7.5X se- have emerged, and I describe two of these as well, quencing project of a 200-kbp source, we would the last of which Celera Genomics is actually pur- expect all the contigs to be ordered by the mate suing. This latter plan has a potential to produce information. 
the entire sequence in two years time—by 2001— Recent simulation studies have indicated that at one tenth the cost of the HGP. from a purely informatic perspective, there is an advantage in using long inserts and no advantage The clone-by-clone approach in having some percentage of the reads be un- The HGP proposal involves a hierarchical two- paired.13 However, this finding must be tem- tiered approach. This approach first randomly pered against the experimental fact that because fractures the whole human DNA sequence into of the different cloning vehicles required to serve 50- to 300-kbp pieces and inserts them into as the vector as the insert becomes larger, it is BACs, which are a vector mechanism designed to more difficult to sequence the ends of long in- accommodate such large DNA segments. The re- serts, and greater care must be taken to avoid sulting collection of BAC inserts is maintained in chimeric clones. Counterbalancing economic a library from which investigators can select a par- pressure thus encourages the use of single reads ticular BAC insert to amplify for further experi- and shorter inserts. Fortunately, we lose little of mentation. The first step consists of determining the benefits of having long end-sequenced in- an assembly, or physical map, of these large inserts serts in hybrid schemas where a sizable fraction that covers the human genome. Given a physical of a project is single reads and where the paired map, the investigators then pick a minimal tiling reads are from inserts over a distribution of inset of the inserts that covers the genome. At the sert lengths skewed to the shorter lengths. second level, they shotgun-sequence each of the inserts in the tiling set. 
This has been coined a clone-by-clone approach because once we have the Sequencing the human and other tiling set of BAC clones, we conceptually imag- whole genomes ine sequencing each tiling clone in a march across After the idea that the human genome could be the genome (see Figure 2). sequenced began to be discussed in the early to The term physical map stems from the obser-

MAY–JUNE 1999 39 Figure 2. The Human Genome Project’s two- Human genome tiered approach. After ﬁrst fragmenting the genome into large bacterial-artiﬁcial-chromosome- Physical mapping Minimum tiling BACs sized segments, the investigators build a physi- set cal map of them. They then select a minimum tiling set of the BACs in the map (shown in green) and shotgun sequence each of these. BAC shotgun sequencing (x 25,000) Reads

vation that such an assembly gives the physical location of each segment in the genome. Unlike the fragment-assembly problem, where the complete sequence of the inserts is used to determine overlaps between them, overlaps between BAC inserts are determined on the basis of fingerprint data about each insert, which is necessarily less informative than knowing the entire sequence of the insert. Here are several types of fingerprints that various research groups have used and the conceptual nature of the information they convey.

• Restriction length digests: The approximate lengths of the pieces that result when an insert is split at each occurrence of a particular substring of length 4, 6, or 8. The agents that perform the cutting are called restriction enzymes.15
• Restriction maps: The approximate locations along the insert of a selectable set of substrings of length 4, 6, or 8, cut by restriction enzymes.16
• Oligo probe hybridization: The presence or absence of each of a set of 12- to 24-length substrings.17
• STS probes: The presence or absence of a pair of 18-length substrings between 200 and 1,000 bases apart in the insert.18 (A region characterized by such a pair is a sequence tagged site, or STS.)

The STS probe is currently the most widely used because of the ease, cost, reliability, and automatability of determining the information. Even for these experiments, investigators must deal with fairly high error rates: roughly 2% false positives (a probe is reported for an insert when it does not contain it) and 10 to 20% false negatives (a probe is not reported when it should be). Most false negatives are due to experimental failures, while the false positives are expected to be induced by repetitions in the genome. Such sparse information of such moderate reliability leaves us with a problem that is computationally very difficult to solve optimally and for which there is considerable ambiguity in the answers delivered.19,20

The HGP approach has the advantage that the outcome is understood and promises to deliver most of the genome. Shotgun sequencing of BACs is now fairly routine. Reliable software is available, and centers capable of rapidly sequencing BACs continue to gear up. Physical maps, while hard to build, have been prepared for a number of chromosomes. While not complete, they do cover a significant percentage of the chromosomes involved. Thus we are certain to see a reasonable return on continued investment in the HGP.

The HGP’s shortcomings are in terms of cost, efficiency, and, to a lesser extent, the completeness of what will be determined. Sequencing at this scale is basically an issue of designing a medium-sized factory. The issues are the simplicity, automatability, and cost of each step, and the scalability of the overall process. The HGP design has the drawback of involving two separate processes: sequencing and physical mapping. While sequencing is heavily automatable once an insert library of fragments has been prepared, investigators must prepare a minimum of 30,000 clone libraries of BACs by hand and must continue to laboriously build and try to complete physical maps of each of the chromosomes. Originally, all the physical maps were to be completed, at a modest cost, in the project’s first five years. The cost has been much heavier than anticipated, and eight years into the project, maps are available for only a few chromosomes, and most of these maps have on the order of hundreds of gaps, some of considerable size. Also, it is difficult to construct BAC clones that are not chimeric. By some estimates, 1 to 5% of the BAC sequences being sequenced are actually two or more unrelated segments of the human genome that have been inserted together into the BAC.
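To make the STS error rates quoted above concrete, the following sketch computes how often a genuine overlap between two inserts would go undetected, and how often an unrelated insert would appear to share a probe. It is an illustration only: the function names are hypothetical, and it assumes that each probe test errs independently and that the quoted percentages are per-test rates, neither of which is stated in the text.

```python
def p_shared_probe_seen(fn_rate: float) -> float:
    """Chance that a probe truly present in both inserts is reported for both.

    Assumes (hypothetically) that the two tests fail independently.
    """
    return (1.0 - fn_rate) ** 2


def p_overlap_missed(fn_rate: float, shared_probes: int) -> float:
    """Chance that none of the truly shared probes is reported for both inserts."""
    return (1.0 - p_shared_probe_seen(fn_rate)) ** shared_probes


def p_spurious_link(fp_rate: float, probes_in_a: int) -> float:
    """Chance that an unrelated insert B falsely reports at least one of A's probes.

    Ignores false negatives on A; treats fp_rate as a per-test rate (an assumption).
    """
    return 1.0 - (1.0 - fp_rate) ** probes_in_a


if __name__ == "__main__":
    for fn in (0.10, 0.20):
        print(
            f"false-negative rate {fn:.0%}: "
            f"shared probe seen in both = {p_shared_probe_seen(fn):.2f}, "
            f"overlap with 1 shared probe missed = {p_overlap_missed(fn, 1):.2f}"
        )
    print(f"spurious link, 5 probes in A, 2% false positives = {p_spurious_link(0.02, 5):.3f}")
```

Under these assumptions, at a 20% false-negative rate a shared probe is reported for both inserts only 64% of the time, which is one way to see why sparse fingerprints of moderate reliability leave considerable ambiguity in the resulting map.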

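The selection of a minimum tiling set from a finished physical map, as depicted in Figure 2, can be sketched as a classic interval-covering problem. The code below is a simplified illustration rather than the HGP's actual procedure: it assumes, hypothetically, that every BAC already has exact start and end coordinates on the map, whereas real maps are built from noisy fingerprints. The name `greedy_tiling_set` and the example coordinates are invented for this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) map coordinates in kbp; end is exclusive


def greedy_tiling_set(clones: List[Interval], region: Interval) -> List[Interval]:
    """Pick a smallest set of clones whose union covers `region`.

    Uses the standard greedy rule for interval covering.
    Raises ValueError if the clones leave a gap.
    """
    start, end = region
    ordered = sorted(clones)
    chosen: List[Interval] = []
    covered = start
    i = 0
    while covered < end:
        best = None
        # Among clones that begin at or before the covered point,
        # keep the one that reaches furthest to the right.
        while i < len(ordered) and ordered[i][0] <= covered:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        chosen.append(best)
        covered = best[1]
    return chosen


if __name__ == "__main__":
    # Five hypothetical 150-kbp BACs (one 120 kbp) covering a 400-kbp region.
    bacs = [(0, 150), (20, 170), (140, 290), (100, 250), (280, 400)]
    print(greedy_tiling_set(bacs, (0, 400)))
```

The greedy choice is provably minimal when coordinates are exact; with real fingerprint data the hard part is establishing the overlaps in the first place, which is the mapping problem discussed above.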
COMPUTING IN SCIENCE & ENGINEERING

Figure 3. Ordered shotgun sequencing. Starting at the top, we shotgun-sequence a selected seed BAC whose sequenced ends are shown in green. Once the entire sequence of this BAC (shown as a solid green line) is revealed, we observe overlaps with a number of end sequences of other BACs in the library. We then shotgun-sequence the left- and rightmost of these (shown with a purple interior). The process continues iteratively, giving a BAC-by-BAC walk across the genome. [Figure labels: Seed BAC; Shotgun sequence; Overlapping BACs; Shotgun sequence, etc.]

The sequence-tagged connector approach

An interesting proposal that circumvents the physical-mapping step involves initially sequencing both ends of approximately 600,000 BACs.21 BAC clones have an average size of $\bar{I} = 150$ kbp, implying a clone coverage of the human genome of $\bar{m} = 30\mathrm{X}$. Sampling theory tells us that there will thus be roughly $600{,}000 \times e^{-30} \approx 10^{-7}$ gaps in the genome’s coverage by BAC clones; that is, with good probability there will be no gaps. Unfortunately, the BAC inserts are produced by partial digestion with restriction enzymes, implying that BAC endpoints are not particularly random. Estimating the effect of this is difficult, but the implication is that there might be a few gaps despite the high clone coverage. On the other hand, without any further effort, few of the BACs can be assembled, because their end sequences constitute a coverage in sequence of only $\bar{c} = 0.1$. On average, there is one BAC-end sequence in every 5 kbp segment of the genome, and few of them overlap.

The next step involves randomly selecting a few of the BAC clones as seeds of an ordered clone-by-clone walk. Each of the selected BAC clones is shotgun-sequenced. Most notably, once a BAC is sequenced, on average 30 end-sequences of other BACs will be discovered to overlap the BAC’s interior. Half will extend into the genome in each direction, with one having an overlap, on average, of only 7.5 kbp with the sequenced BAC. The next step is to shotgun-sequence the two minimally overlapping BACs in each direction and then in turn determine minimally overlapping BACs in each direction to sequence next. The investigator is therefore effectively discovering how to continue a seeded set of bidirectional, clone-by-clone walks across the genome as each clone in each walk is sequenced. Figure 3 illustrates the process.

The sequence-tagged connector approach eliminates the physical mapping step. Moreover, the organization of the sequencing factory is simplified because sequencing BAC ends and the smaller shotgun inserts are similar sequencing processes. But, the approach still suffers from needing to make at least 25,000 BAC libraries and because the BAC clones must be maintained for the length of the entire project. Getting a consistently successful sequencing reaction for the end of a BAC is also more difficult, so greater effort and expense must go to end-sequence BACs. The quality of the end reads is poorer as well, on the order of 2 to 5% error.

The whole-genome shotgun approach

The plan we have developed at Celera Genomics involves collecting 60 to 70 million high-quality sequencing reads for a 10X coverage of the genome. We will use only those portions of a read that have an error rate of 1% or less, in contrast to current practice with BAC shotgunning where as much of a read as possible is used to get better coverage with only 6 to 7X and thus reduce cost. For a problem on the scale of the human genome, the ends of such reads, at a 10 to 15% error rate, are too noisy to detect overlaps. We must use only the high-quality parts of a read and collect 10X to compensate for the shorter length. Even so, without any additional information, assembling this large set of reads is effectively impossible given the genome’s repetitive nature.

Recall, however, that we can end-sequence inserts to produce mate pairs. Typically, the insert lengths are on average 2 kbp. With care, we can use inserts as long as 10 kbp, although the success rate of reactions on such longer clones is lower, so they are somewhat more expensive to collect. The plan is for 80% of the reads to be in
2-kbp mate pairs and the remaining 20% to be in 10-kbp mate pairs. I noted earlier that on current slab gel-sequencing machines, false pairings of mates occur at about a 10% rate because of lane-tracking errors. The current plan also uses next-generation capillary gel-sequencing machines in which the material of each sequencing reaction migrates down its own physically separate microcapillary tube. Thus for these machines, the lane-tracking problem disappears and we can now expect mate-pairing errors to be less than 1% and possibly as good as 0.01%. With information of this quality, investigators can now use mate-pairing information as a key component of the assembly algorithm.

Figure 4. Whole-genome shotgun assembly. Mated pairs of fragments are black segments with an intervening green segment connecting them. Given two BAC end sequences shown in red, where for the purposes of illustration we assume there is a gold and purple repeat in the BAC, the problem is to determine the set of mated reads that cover the BAC. Mate pairs that span repeats have their connecting line colored blue, and the reads completely interior to a repeat are given the repeat’s color. Such reads are often anchored in the sense that their mate is not in a repeat. [Figure labels: Anchored reads; Spanning mates; BAC end; Repeats.]

Intuitively, we understand that mate pairs can resolve any repeat whose length is shorter than the distance between mates as follows. Imagine building an assembly by progressively adding fragments at a given end (see Figure 4). As long as you are in a unique stretch of sequence, the placement of the next maximally overlapping fragment is obvious and correct. However, when you enter a repeat of sufficiently high fidelity with other copies, you begin to place fragments from many of the copies together. Notice, however, that while fragments are being incorrectly incorporated, you are still effectively putting together a facsimile of the repeat’s sequence. The real problem develops when you exit the repeat at the other end: if there are 100 copies of the repeat that are intertwined at this point, there are 100 unique flanks into which you could walk and you don’t know which to take. However, there is very likely a mate pair that spans the repeat in that it has a read in the unique flanking sequence on each side of the repeat. Such a spanning mate indicates which of the 100 options to take. Moreover, you can resolve the tangle of reads from different copies within the repeat stretch by observing that most of these reads have a mate that is anchored in the sense of being in a unique part of the genome. From the anchored mates on the flanks of the repeat, you can generally determine enough of the reads actually sampled from that copy of the repeat to infer its exact sequence.

Of course, there will be many repeats in a genome that are longer than 10 kbp. To resolve these, we need mated pairs of reads at longer lengths. Fortunately, there will be 600,000 BAC end sequences produced in anticipation of the ordered shotgun approach described above. These BAC end pairs essentially serve as very long-range mates, albeit of less reliability. Moreover, in separate radiation-hybrid mapping efforts,22 STS marker maps that place and linearly order read-sized sequences roughly every 200 kbp along the genome have already been constructed. While these maps are not very accurate, they do give additional long-range mate pairings of up to any length required to resolve a repeat. Originally, we conceived of solving the computational problem by solving a series of intermarker assembly problems that require assembling the sequence between a pair of STS markers or BAC-end sequences given the 60 to 70 million reads in the whole-genome shotgun data set. Simulation work has shown that with 99.8% probability, we can unambiguously assemble 99.7% of the sequence between the markers.

I’ve spent a fair bit of time discussing how to assemble a whole-genome data set because this is the component of the proposal that most critics think is impossible. In terms of a sequencing factory, this approach provides the greatest simplicity because we only have to set up a sequencing pipeline. Moreover, we have to build only two sequencing libraries from whole human DNA. We can therefore expend great effort to ensure that these libraries do not contain undesirable artifacts and can completely automate all the remaining steps, thus making the manpower required to run the factory very small. Finally, there is no need to store BAC or other clones for
any length of time, because once the BACs have been end-sequenced they are no longer needed except as PCR templates for gap filling. Coupled with the new-generation capillary gel-sequencing machines that give us greater speed and capacity, the plan will be very efficient in terms of both time and cost.

A final issue centers on understanding diversity in the human genome. Each human cell, with the exception of sperm and egg cells, has two copies of each chromosome, one version from each parent. The complement of DNA inherited from one parent is called a haplotype. Each human haplotype varies from another by 0.1%, and the total number of sites of variation over the human population covers an estimated 0.3% of the genome.23 In the clone-by-clone approach, each assembly of a BAC clone is of a given haplotype, so the HGP effort will produce a series of overlapping clones, each representing some haplotype. This is to be contrasted with the whole-genome shotgun approach, where fragments from different haplotypes come together to give the overall assembly. In this case, we can detect many of the sites of genetic variation between haplotypes. Even if we sequence only one pair of haplotypes, we will detect an estimated three million sites of single nucleotide variation or polymorphism during the course of the project.

References

1. R.D. Fleischmann et al., “Whole-Genome Random Sequencing and Assembly of H. Influenzae,” Science, Vol. 269, No. 5,223, 1995, pp. 496–512.
2. J. Weber and W. Myers, “Human Whole Genome Shotgun Sequencing,” Genome Research, Vol. 7, No. 5, 1997, pp. 401–409.
3. P. Green, “Against a Whole-Genome Shotgun,” Genome Research, Vol. 7, No. 5, 1997, pp. 410–417.
4. J.C. Venter et al., “Shotgun Sequencing of the Human Genome,” Science, Vol. 280, No. 5,369, 1998, pp. 1540–1542.
5. F. Sanger, S. Nicklen, and A.R. Coulson, “DNA Sequencing with Chain-Terminating Inhibitors,” Proc. Nat’l Academy of Science, Vol. 74, No. 12, 1977, pp. 5463–5467.
6. A.M. Maxam and W. Gilbert, “A New Method for Sequencing DNA,” Proc. Nat’l Academy of Science, Vol. 74, No. 2, 1977, pp. 560–564.
7. F. Sanger et al., “Nucleotide Sequence of Bacteriophage λ DNA,” J. Molecular Biology, Vol. 162, No. 4, 1982, pp. 729–773.
8. L. Rowen, B.F. Koop, and L. Hood, “The Complete 685-Kilobase DNA Sequence of the Human Beta T cell Receptor Locus,” Science, Vol. 272, No. 5,269, 1996, pp. 1755–1762.
9. G.I. Bell, “Roles of Repetitive Sequences,” Computers & Chemistry, Vol. 16, 1992, pp. 135–143.
10. F.J.M. Iris, “Optimized Methods for Large-Scale Sequencing in Alu-Rich Genomic Regions,” Automated DNA Sequencing and Analysis, M.D. Adams, C. Fields, and J.C. Venter, eds., Academic Press, London, 1994, pp. 199–210.
11. A. Edwards and C.T. Caskey, “Closure Strategies for Random DNA Sequencing,” Methods: A Companion to Methods Enzymology 3, Academic Press, New York, 1991, pp. 41–47.
12. G.G. Sutton et al., “TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects,” Genome Science & Technology, Vol. 1, No. 1, 1995, pp. 9–19.
13. J.C. Roach et al., “Pairwise End Sequencing: A Unified Approach to Genomic Mapping and Sequencing,” Genomics, Vol. 26, No. 26, 1995, p. 345.
14. F. Collins and D. Galas, “A New Five-Year Plan for the U.S. Human Genome Project,” Science, Vol. 262, No. 5,130, 1993, pp. 43–46.
15. M.R. Olson et al., “Random Clone Strategy for Genomic Restriction Mapping in Yeast,” Proc. Nat’l Academy of Science, Vol. 83, No. 20, 1986, pp. 7826–7830.
16. Y. Kohara, A. Akiyama, and K. Isono, “The Physical Map of the E. Coli Chromosome: Application of a New Strategy for Rapid Analysis and Sorting of a Large Genomic Library,” Cell, Vol. 50, No. 3, 1987, pp. 495–508.
17. A. Coulson et al., “Toward a Physical Map of the Genome of the Nematode, C. Elegans,” Proc. Nat’l Academy of Science, Vol. 83, 1986, pp. 7821–7825.
18. M.R. Olson et al., “A Common Language for Physical Mapping of the Human Genome,” Science, Vol. 245, No. 4,925, 1989, pp. 1434–1435.
19. F. Alizadeh et al., “Physical Mapping of Chromosomes: A Combinatorial Problem in Molecular Biology,” Algorithmica, Vol. 13, Nos. 1 and 2, 1995, pp. 52–76.
20. M. Jain and E. Myers, “Algorithms for Computing and Integrating Physical Maps Using Unique Probes,” J. Computational Biology, Vol. 4, No. 4, 1997, pp. 449–466.
21. T.J. Hudson et al., “An STS-Based Map of the Human Genome,” Science, Vol. 270, No. 5,244, 1995, pp. 1945–1954.
22. J.C. Venter, H.O. Smith, and L. Hood, “A New Strategy for Genome Sequencing,” Nature, Vol. 381, No. 6,581, 1996, pp. 364–366.
23. A.G. Clark et al., “Haplotype Structure and Population Genetic Inferences from Nucleotide Sequence Variation in Human Lipoprotein Lipase,” American J. Human Genetics, Vol. 63, No. 2, 1998, pp. 595–612.

Gene Myers is the director of Informatics Research at Celera Genomics and a professor currently on leave from the Department of Computer Science at the University of Arizona. His research interests include algorithm design, pattern matching, computer graphics, and computational molecular biology. He received his PhD in computer science from the University of Colorado. He is an associate editor of the Journal of Computational Biology. Contact him at Celera Genomics, 45 W. Gude Dr., Rockville, MD 20850; myersgw@celera.com.
