4/6/2020 Sequencing Genomes - Genomes - NCBI Bookshelf - https://www.ncbi.nlm.nih.gov/books/NBK21117/#A6452

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Brown TA. Genomes. 2nd edition. Oxford: Wiley-Liss; 2002.

Chapter 6 Sequencing Genomes

Learning outcomes

When you have read Chapter 6, you should be able to:

Distinguish between the two methods used to sequence DNA

Give a detailed description of chain termination sequencing and an outline description of the chemical degradation method

Describe the key features of automated DNA sequencing and evaluate the importance of automated sequencing in research

State the strengths and limitations of the shotgun, whole-genome shotgun and clone contig methods of genome sequencing

Describe how a small bacterial genome can be sequenced by the shotgun method, using the Haemophilus influenzae project as an example

Outline the various ways in which a clone contig can be built up

Explain the basis to the whole-genome shotgun approach to genome sequencing, with emphasis on the steps taken to ensure that the resulting sequence is accurate

Give an account of the development of the Human Genome Project up to the publication of the draft sequence in February 2001

Debate the ethical, legal and social issues raised by the human genome projects

The ultimate objective of a genome project is the complete DNA sequence for the organism being studied, ideally integrated with the genetic and/or physical maps of the genome so that genes and other interesting features can be located within the DNA sequence. This chapter describes the techniques and research strategies that are used during the sequencing phase of a genome project, when this ultimate objective is being directly addressed. Techniques for sequencing DNA are clearly of central importance in this context and we will begin the chapter with a detailed examination of sequencing methodology. This methodology is of little value, however, unless the short sequences that result from individual sequencing experiments can be linked together in the correct order to give the master sequences of the chromosomes that make up the genome. The middle part of this chapter describes the strategies used to ensure that the master sequences are assembled correctly. Finally, we will review the way in which mapping and sequencing were used to produce the two draft human genome sequences that were published in February 2001.

6.1. The Methodology for DNA Sequencing

Rapid and efficient methods for DNA sequencing were first devised in the mid-1970s. Two different procedures were published at almost the same time:

The chain termination method (Sanger et al., 1977), in which the sequence of a single-stranded DNA molecule is determined by enzymatic synthesis of complementary polynucleotide chains, these chains terminating at specific positions;

The chemical degradation method (Maxam and Gilbert, 1977), in which the sequence of a double-stranded DNA molecule is determined by treatment with chemicals that cut the molecule at specific nucleotide positions.

Both methods were equally popular to begin with but the chain termination procedure has gained ascendancy in recent years, particularly for genome sequencing. This is partly because the chemicals used in the chemical degradation method are toxic and therefore hazardous to the health of the researchers doing the sequencing experiments, but mainly because it has been easier to automate chain termination sequencing. As we will see later in this chapter, a genome project involves a huge number of individual sequencing experiments and it would take many years to perform all these by hand. Automated sequencing techniques are therefore essential if the project is to be completed in a reasonable time-span.

6.1.1. Chain termination DNA sequencing

Chain termination DNA sequencing is based on the principle that single-stranded DNA molecules that differ in length by just a single nucleotide can be separated from one another by polyacrylamide gel electrophoresis (Technical Note 6.1). This means that it is possible to resolve a family of molecules, representing all lengths from 10 to 1500 nucleotides, into a series of bands (Figure 6.1).

Technical Note 6.1

Polyacrylamide gel electrophoresis. Separation of DNA molecules differing in length by just one nucleotide. Polyacrylamide gel electrophoresis is used to examine the families of chain-terminated DNA molecules resulting from a sequencing experiment. Agarose (more...)

Figure 6.1

Polyacrylamide gel electrophoresis can resolve single-stranded DNA molecules that differ in length by just one nucleotide. The banding pattern is produced after separation of single-stranded DNA molecules by denaturing polyacrylamide gel electrophoresis. (more...)

Chain termination sequencing in outline

The starting material for a chain termination sequencing experiment is a preparation of identical single-stranded DNA molecules. The first step is to anneal a short oligonucleotide to the same position on each molecule, this oligonucleotide subsequently acting as the primer for synthesis of a new DNA strand that is complementary to the template (Figure 6.2A). The strand synthesis reaction, which is catalyzed by a DNA polymerase enzyme (Section 4.1.1 and Box 6.1) and requires the four deoxyribonucleotide triphosphates (dNTPs - dATP, dCTP, dGTP and dTTP) as substrates, would normally continue until several thousand nucleotides had been polymerized. This does not occur in a chain termination sequencing experiment because, as well as the four dNTPs, a small amount of a dideoxynucleotide (e.g. ddATP) is added to the reaction. The polymerase enzyme does not discriminate between dNTPs and ddNTPs, so the dideoxynucleotide can be incorporated into the growing chain, but it then blocks further elongation because it lacks the 3′-hydroxyl group needed to form a connection with the next nucleotide (Figure 6.2B).

Box 6.1

DNA polymerases for chain termination sequencing. Any template-dependent DNA polymerase is capable of extending a primer that has been annealed to a single-stranded DNA molecule, but not all polymerases do this in a way that is useful for DNA sequencing. (more...)

If ddATP is present, chain termination occurs at positions opposite thymidines in the template DNA (Figure 6.2C). Because dATP is also present the strand synthesis does not always terminate at the first T in the template; in fact it may continue until several hundred nucleotides have been polymerized before a ddATP is eventually incorporated. The result is therefore a set of new chains, all of different lengths, but each ending in ddATP. Now the polyacrylamide gel comes into play. The family of molecules generated in the presence of ddATP is loaded into one lane of the gel, and the families generated with ddCTP, ddGTP and ddTTP loaded into the three adjacent lanes. After electrophoresis, the DNA sequence can be read directly from the positions of the bands in the gel (Figure 6.2D). The band that has moved the furthest represents the smallest piece of DNA, this being the strand that terminated by incorporation of a ddNTP at the first position in the template. In the example shown in Figure 6.2 this band lies in the ‘G’ lane (i.e. the lane containing the molecules terminated with ddGTP), so the first nucleotide in the sequence is ‘G’. The next band, corresponding to the molecule that is one nucleotide longer than the first, is in the ‘A’ lane, so the second nucleotide is ‘A’ and the sequence so far is ‘GA’. Continuing up through the gel we see that the next band also lies in the ‘A’ lane (sequence GAA), then we move to the ‘T’ lane (GAAT), and so on. The sequence reading can be continued up to the region of the gel where individual bands are not separated.
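The reading of a sequencing gel described above can be mimicked with a small simulation. The sketch below is illustrative only: the template sequence, the function names, the number of molecules and the probability `p_dd` of incorporating a dideoxynucleotide are all assumptions chosen for the example, not values from a real experiment.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def new_strand(template):
    """Sequence of the strand synthesized on `template`, in order of synthesis.

    `template` is written in the order in which the polymerase copies it,
    starting just beyond the primer.
    """
    return "".join(COMPLEMENT[base] for base in template)

def terminated_lengths(strand, dd_base, rng, molecules=2000, p_dd=0.05):
    """Chain lengths produced in the reaction containing one ddNTP.

    Wherever `dd_base` is required, the dideoxy version is incorporated with
    probability p_dd (an assumed value) and the chain stops; otherwise the
    normal dNTP is used and synthesis continues.
    """
    lengths = set()
    for _ in range(molecules):
        for position, base in enumerate(strand, start=1):
            if base == dd_base and rng.random() < p_dd:
                lengths.add(position)
                break
    return lengths

def read_gel(lanes):
    """Read the bands from the shortest chain (furthest migration) upwards."""
    lane_of_band = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(lane_of_band[n] for n in sorted(lane_of_band))

template = "CTTAGCCATG"            # hypothetical template
strand = new_strand(template)      # "GAATCGGTAC"
rng = random.Random(1)             # fixed seed so the run is repeatable
lanes = {base: terminated_lengths(strand, base, rng) for base in "ACGT"}
print(read_gel(lanes))
```

Each lane holds only chain lengths that end in its own nucleotide, so sorting all bands by length and noting the lane of each one reconstructs the new strand, just as the gel is read from the bottom upwards.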

Chain termination sequencing requires a single-stranded DNA template

The template for a chain termination experiment is a single-stranded version of the DNA molecule to be sequenced. There are several ways in which this can be obtained:

The DNA can be cloned in a plasmid vector (Section 4.2.1). The resulting DNA will be double stranded so cannot be used directly in sequencing. Instead, it must be converted into single-stranded DNA by denaturation with alkali or by boiling. This is a common method for obtaining template DNA for DNA sequencing, largely because cloning in a plasmid vector is such a routine technique. A shortcoming is that it can be difficult to prepare plasmid DNA that is not contaminated with small quantities of bacterial DNA and RNA, which can act as spurious templates or primers in the DNA sequencing experiment.

The DNA can be cloned in a bacteriophage M13 vector. Vectors based on M13 bacteriophage are designed specifically for the production of single-stranded templates for DNA sequencing. M13 bacteriophage has a single-stranded DNA genome which, after infection of Escherichia coli bacteria, is converted into a double-stranded replicative form. The replicative form is copied until over 100 molecules are present in the cell, and when the cell divides the copy number in the new cells is maintained by further replication. At the same time, the infected cells continually secrete new M13 phage particles, approximately 1000 per generation, these phages containing the single-stranded version of the genome (Figure 6.3). Cloning vectors based on M13 are double-stranded DNA molecules equivalent to the replicative form of the M13 genome. They can be manipulated in exactly the same way as a plasmid cloning vector. The difference is that cells that have been transfected with a recombinant M13 vector secrete phage particles containing single-stranded DNA, this DNA comprising the vector molecule plus any additional DNA that has been ligated into it. The phages therefore provide the template DNA for chain termination sequencing. The one disadvantage is that DNA fragments longer than about 3 kb suffer deletions and rearrangements when cloned in an M13 vector, so the system can only be used with short pieces of DNA.

The DNA can be cloned in a phagemid. This is a plasmid cloning vector that contains, in addition to its plasmid origin of replication, the origin from M13 or another phage with a single-stranded DNA genome. If an E. coli cell contains both a phagemid and the replicative form of a helper phage, the latter carrying genes for the phage replication enzymes and coat proteins, then the phage origin of the phagemid becomes activated, resulting in synthesis of phage particles containing the single-stranded version of the phagemid. The double-stranded plasmid DNA is therefore converted into single-stranded template DNA for DNA sequencing. This system avoids the instabilities of M13 cloning and can be used with fragments up to 10 kb or more.

PCR can be used to generate single-stranded DNA. There are various ways of generating single-stranded DNA by PCR, the most effective being to modify one of the two primers so that DNA strands synthesized from this primer are easily purified. One possibility is to attach small metallic beads to the primer and then use a magnetic device to purify the resulting strands (Figure 6.4).

Figure 6.3

Obtaining single-stranded DNA by cloning in a bacteriophage M13 vector. M13 vectors can be obtained in two forms: the double-stranded replicative molecule and the single-stranded version found in bacteriophage particles. The replicative form can be manipulated (more...)

Figure 6.4

One way of using PCR to prepare template DNA for chain termination sequencing. The PCR is carried out with one normal primer (shown in red), and one primer that is labeled with a metallic bead (shown in brown). After PCR, the labeled strands are purified (more...)

The primer determines the region of the template DNA that will be sequenced

To begin a chain termination sequencing experiment, an oligonucleotide primer is annealed onto the template DNA. The primer is needed because template-dependent DNA polymerases cannot initiate DNA synthesis on a molecule that is entirely single-stranded: there must be a short double-stranded region to provide a 3′ end onto which the enzyme can add new nucleotides (Section 4.1.1).

The primer also plays the critical role of determining the region of the template molecule that will be sequenced. For most sequencing experiments a ‘universal’ primer is used, this being one that is complementary to the part of the vector DNA immediately adjacent to the point into which new DNA is ligated (Figure 6.5A). The same universal primer can therefore give the sequence of any piece of DNA that has been ligated into the vector. Of course, if this inserted DNA is longer than 750 bp or so then only a part of its sequence will be obtained, but usually this is not a problem because the project as a whole simply requires that a large number of short sequences are generated and subsequently assembled into the contiguous master sequence. It is immaterial whether or not the short sequences are the complete or only partial sequences of the DNA fragments used as templates. If double-stranded plasmid DNA is being used to provide the template then, if desired, more sequence can be obtained from the other end of the insert. Alternatively, it is possible to extend the sequence in one direction by synthesizing a non-universal primer, designed to anneal at a position within the insert DNA (Figure 6.5B). An experiment with this primer will provide a second short sequence that overlaps the previous one.

Figure 6.5

Different types of primer for chain termination sequencing. (A) A universal primer anneals to the vector DNA, adjacent to the position at which new DNA is inserted. A single universal primer can therefore be used to sequence any DNA insert, but only provides (more...)

Thermal cycle sequencing offers an alternative to the traditional methodology

The discovery of thermostable DNA polymerases, which led to the development of PCR (Sections 4.1.1 and 4.3), has also resulted in new methodologies for chain termination sequencing. In particular, the innovation called thermal cycle sequencing (Sears et al., 1992) has two advantages over traditional chain termination sequencing. First, it uses double-stranded rather than single-stranded DNA as the starting material. Second, very little template DNA is needed, so the DNA does not have to be cloned before being sequenced.

Thermal cycle sequencing is carried out in a similar way to PCR but just one primer is used and each reaction mixture includes one of the ddNTPs (Figure 6.6). Because there is only one primer, only one of the strands of the starting molecule is copied, and the product accumulates in a linear fashion, not exponentially as is the case in a real PCR. The presence of the ddNTP in the reaction mixture causes chain termination, as in the standard methodology, and the family of resulting strands can be analyzed and the sequence read in the normal manner by polyacrylamide gel electrophoresis.
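The contrast between linear and exponential accumulation can be put into numbers. The following is a minimal sketch that assumes every cycle is perfectly efficient, which real reactions are not; the function names are hypothetical.

```python
def thermal_cycle_products(templates, cycles):
    """One primer: each cycle copies each original template strand once."""
    return templates * cycles

def pcr_products(templates, cycles):
    """Two primers: new strands act as templates, so the number doubles."""
    return templates * 2 ** cycles

for cycles in (1, 10, 30):
    print(cycles, thermal_cycle_products(1, cycles), pcr_products(1, cycles))
```

After 30 idealized cycles a single template gives 30 chain-terminated strands in thermal cycle sequencing, compared with over a billion molecules in a true PCR.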

Figure 6.6

Thermal cycle sequencing. PCR is carried out with just one primer and with a dideoxynucleotide present in the reaction mixture. The result is a family of chain-terminated strands - the ‘A’ family in the reaction shown. These strands, along (more...)

Box 6.2

The chemical degradation sequencing method. The difference between the two sequencing techniques lies in the way in which the A, C, G and T families of molecules are generated. In the chemical degradation procedure these families are produced by treatment (more...)

Fluorescent primers are the basis of automated sequence reading

The standard chain termination sequencing methodology employs radioactive labels, and the banding pattern in the polyacrylamide gel is visualized by autoradiography. Usually one of the nucleotides in the sequencing reaction is labeled so that the newly synthesized strands contain radiolabels along their lengths, giving high detection sensitivity. To ensure good band resolution, 33P or 35S is generally used, as the emission energies of these isotopes are relatively low, in contrast to 32P, which has a higher emission energy and gives poorer resolution because of signal scattering.

In Section 5.3.2 we saw how the replacement of radioactive labels by fluorescent ones has given a new dimension to in situ hybridization techniques. Fluorolabeling has been equally important in the development of sequencing methodology, in particular because the detection system for fluorolabels has opened the way to automated sequence reading (Prober et al., 1987). The label is attached to the ddNTPs, with a different fluorolabel used for each one (Figure 6.7A). Chains terminated with A are therefore labeled with one fluorophore, chains terminated with C are labeled with a second fluorophore, and so on. Now it is possible to carry out the four sequencing reactions - for A, C, G and T - in a single tube and to load all four families of molecules into just one lane of the polyacrylamide gel, because the fluorescent detector can discriminate between the different labels and hence determine if each band represents an A, C, G or T. The sequence can be read directly as the bands pass in front of the detector and either printed out in a form readable by eye (Figure 6.7B) or sent straight to a computer for storage. When combined with robotic devices that prepare the sequencing reactions and load the gel, the fluorescent detection system provides a major increase in throughput and avoids errors that might arise when a sequence is read by eye and then entered manually into a computer. It is only by use of these automated techniques that we can hope to generate sequence data rapidly enough to complete a genome project in a reasonable length of time.
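The way a detector turns four fluorescent signals into a sequence can be sketched as follows. This is an illustration, not the algorithm used by any particular sequencer: the signal values, the function name `call_bases` and the `min_ratio` threshold are all assumptions.

```python
def call_bases(bands, min_ratio=2.0):
    """Call one base per band from the four fluorescence channels.

    `bands` is a list of dicts mapping "A", "C", "G" and "T" to the signal
    recorded as each band passes the detector. A band is called "N" when the
    strongest channel is less than `min_ratio` times the runner-up
    (an assumed threshold).
    """
    calls = []
    for band in bands:
        ranked = sorted(band, key=band.get, reverse=True)
        best, second = ranked[0], ranked[1]
        confident = band[best] >= min_ratio * band[second]
        calls.append(best if confident else "N")
    return "".join(calls)

# Hypothetical detector readings, shortest chain first
bands = [
    {"A": 3, "C": 5, "G": 90, "T": 4},
    {"A": 85, "C": 6, "G": 2, "T": 7},
    {"A": 80, "C": 4, "G": 5, "T": 3},
    {"A": 6, "C": 3, "G": 4, "T": 88},
    {"A": 40, "C": 45, "G": 5, "T": 4},   # two channels of similar strength
]
print(call_bases(bands))   # GAATN
```

Because each band carries its own label, a single lane is enough: the order of the bands gives the position and the color gives the nucleotide.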

Figure 6.7

Automated DNA sequencing with fluorescently labeled dideoxynucleotides. (A) The chain termination reactions are carried out in a single tube, with each dideoxynucleotide labeled with a different fluorophore. In the automated sequencer, the bands in the (more...)

6.1.2. Departures from conventional DNA sequencing

In spite of the development of automated techniques, conventional DNA sequencing suffers from the limitation that only a few hundred bp of sequence can be determined in a single experiment. In the context of the Human Genome Project, this means that each experiment provides only one five-millionth of the total genome sequence. Attempts are continually being made to modify the technology so that sequence acquisition is more rapid, a recent example being the introduction of new automated sequencers that use capillary separation rather than a polyacrylamide gel. These have 96 channels so 96 sequences can be determined in parallel, and each run takes less than 2 hours to complete, enabling up to 1000 sequences to be obtained in a single day (Mullikan and McMurray, 1999). Other systems that are being developed will increase data generation even further by enabling 384 or 1024 sequences to be run at the same time (Rogers, 1999).
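The scale of the problem can be checked with some simple arithmetic. The sketch below assumes a human genome of about 3200 Mb, a figure that is not given in this section, and treats the 2-hour run time as an upper bound.

```python
GENOME_BP = 3_200_000_000      # assumption: human genome of about 3200 Mb
READS_PER_GENOME = 5_000_000   # each experiment gives "one five-millionth"
CHANNELS = 96                  # capillaries per sequencer
RUN_HOURS = 2                  # upper bound; each run takes less than this

read_bp = GENOME_BP // READS_PER_GENOME        # sequence per experiment
runs_per_day = 24 // RUN_HOURS
reads_per_day = CHANNELS * runs_per_day
days_single_pass = READS_PER_GENOME / 1000     # at 1000 sequences per day

print(read_bp, reads_per_day, days_single_pass)
```

On these assumptions each experiment yields 640 bp, one machine manages just over 1000 sequences per day, and a single sequencer would need about 5000 days merely to read the genome once, before allowing for the redundancy that assembly requires.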

There have also been attempts to make sequence acquisition more rapid by devising new sequencing methodologies. One possibility is pyrosequencing, which does not require electrophoresis or any other fragment separation procedure and so is more rapid than chain termination sequencing (Ronaghi et al., 1998). In pyrosequencing, the template is copied in a straightforward manner without added ddNTPs. As the new strand is being made, the order in which the dNTPs are incorporated is detected, so the sequence can be ‘read’ as the reaction proceeds. The addition of a nucleotide to the end of the growing strand is detectable because it is accompanied by release of a molecule of pyrophosphate, which can be converted by the enzyme sulfurylase into a flash of chemiluminescence. Of course, if all four dNTPs were added at once then flashes of light would be seen all the time and no useful sequence information would be obtained. Each dNTP is therefore added separately, one after the other, with a nucleotidase enzyme also present in the reaction mixture so that if a dNTP is not incorporated into the polynucleotide then it is rapidly degraded before the next dNTP is added (Figure 6.8). This procedure makes it possible to follow the order in which the dNTPs are incorporated into the growing strand. The technique sounds complicated, but it simply requires that a repetitive series of additions be made to the reaction mixture, precisely the type of procedure that is easily automated, with the possibility of many experiments being carried out in parallel.
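The cycle of additions used in pyrosequencing can be followed in a short simulation. This is a sketch under stated assumptions: the template is hypothetical, the dNTPs are added in the fixed order A, C, G, T, unincorporated dNTP is assumed to be completely degraded before the next addition, and a run of identical nucleotides is assumed to give a proportionally stronger signal.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def pyrosequence(template, order="ACGT"):
    """Return the signal for each dNTP addition and the sequence read.

    `template` is written in the order in which the polymerase copies it.
    Each entry in the first result is (dNTP added, nucleotides incorporated);
    a value of 0 means no light was produced.
    """
    signals, read = [], []
    position, addition = 0, 0
    while position < len(template):
        dntp = order[addition % len(order)]
        addition += 1
        incorporated = 0
        # a run of identical template bases uses several dNTPs in one addition
        while position < len(template) and COMPLEMENT[template[position]] == dntp:
            incorporated += 1
            position += 1
        signals.append((dntp, incorporated))
        read.append(dntp * incorporated)
    return signals, "".join(read)

signals, read = pyrosequence("CTTAGC")   # hypothetical template
print(signals)
print(read)   # GAATCG
```

Additions that give no light carry information too: they show that the next template position does not pair with the dNTP that was offered.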

Figure 6.8

Pyrosequencing. The strand synthesis reaction is carried out in the absence of dideoxynucleotides. Each dNTP is added individually, along with a nucleotidase enzyme that degrades the dNTP if it is not incorporated into the strand being synthesized. Incorporation (more...)

A very different approach to DNA sequencing through the use of DNA chips (see Technical Note 5.1) might one day be possible. A chip carrying an array of different oligonucleotides could be used in DNA sequencing by applying the test molecule - the one whose sequence is to be determined - to the array and detecting the positions at which it hybridizes. Hybridization to an individual oligonucleotide would indicate the presence of that particular oligonucleotide sequence in the test molecule, and comparison of all the oligonucleotides to which hybridization occurs would enable the sequence of the test molecule to be deduced (Figure 6.9). The problem with this approach is that the maximum length of the molecule that can be sequenced is given by the square root of the number of oligonucleotides in the array, so if every possible 8-mer oligonucleotide (ones containing eight nucleotides) were attached to the chip - all 65 536 of them - then the maximum length of readable sequence would be only 256 bp (Southern, 1996). Even if the chip carried all the 1 048 576 different 10-mer sequences, it could still only be used to sequence a 1 kb molecule. To sequence a 1 Mb molecule (this being the sort of advance in sequence capability that is really needed) the chip would have to carry all of the 1 × 10¹² possible 20-mers. This may sound an outlandish proposition but advances in miniaturization, together with the possibility of electronic rather than visual detection of hybridization, could bring such an array within reach in the future.
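The square-root relationship quoted above is easy to verify. The sketch below simply counts the possible oligonucleotides of length k (four choices at each position) and takes the square root; the function names are hypothetical.

```python
import math

def array_size(k):
    """Number of different oligonucleotides of length k."""
    return 4 ** k

def max_readable_length(k):
    """Square root of the array size, which equals 2**k."""
    return math.isqrt(array_size(k))

for k in (8, 10, 20):
    print(k, array_size(k), max_readable_length(k))
```

The output reproduces the figures in the text: 65 536 8-mers give 256 bp, 1 048 576 10-mers give 1024 bp (about 1 kb), and roughly 1.1 × 10¹² 20-mers give 1 048 576 bp (about 1 Mb).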

Figure 6.9

A possible way of using chip technology in DNA sequencing. The chip carries an array of every possible 8-mer oligonucleotide. The DNA to be sequenced is labeled with a fluorescent marker and applied to the chip, and the positions of hybridizing oligonucleotides (more...)

6.2. Assembly of a Contiguous DNA Sequence

The next question to address is how the master sequence of a chromosome, possibly several tens of Mb in length, can be assembled from the multitude of short sequences generated by chain termination sequencing. We addressed this issue at the start of Chapter 5 and established that the relatively short genomes of prokaryotes can be assembled by shotgun sequencing, but that this approach might lead to errors if applied to larger eukaryotic genomes. The whole-genome shotgun method, which uses a map to aid assembly of the master sequence, has been used with the fruit-fly and human genomes, but it is generally accepted that a greater degree of accuracy is achieved with the clone contig approach, in which the genome is broken down into segments, each with a known position on the genome map, before sequencing is carried out (see Figure 5.3). We will start by examining how shotgun sequencing has been applied to prokaryotic genomes.

6.2.1. Sequence assembly by the shotgun approach

The straightforward approach to sequence assembly is to build up the master sequence directly from the short sequences obtained from individual sequencing experiments, simply by examining the sequences for overlaps (see Figure 5.1). This is called the shotgun approach. It does not require any prior knowledge of the genome and so can be carried out in the absence of a genetic or physical map.
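Assembly by overlap can be illustrated with a toy example. The sketch below uses a simple greedy strategy (repeatedly joining the two sequences with the longest overlap), which is one of several possible approaches and is not claimed to be the method used in any genome project; the reads and the minimum overlap of 3 nucleotides are made up, and real reads would also contain errors and come from both strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the pair with the longest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicates
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Hypothetical short reads from one region
reads = ["GAATCGGT", "CGGTACCA", "ACCATTGC", "TTGCAGGA"]
print(greedy_assemble(reads))   # ['GAATCGGTACCATTGCAGGA']
```

Every pair of sequences has to be compared, so the work grows roughly with the square of the number of reads, which is why the data handling was once thought to be beyond the available computers.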

The potential of the shotgun approach was proven by the Haemophilus influenzae sequence

During the early 1990s there was extensive debate about whether the shotgun approach would work in practice, many molecular biologists being of the opinion that the amount of data handling needed to compare all the mini-sequences and identify overlaps, even with the smallest genomes, would be beyond the capabilities of existing computer systems. These doubts were laid to rest in 1995 when the sequence of the 1830 kb genome of the bacterium Haemophilus influenzae was published (Fleischmann et al., 1995).

The H. influenzae genome was sequenced entirely by the shotgun approach and without recourse to any genetic or physical map information. The strategy used to obtain the sequence is shown in Figure 6.10. The first step was to break the genomic DNA into fragments by sonication, a technique which uses high-frequency sound waves to make random cuts in DNA molecules. The fragments were then electrophoresed and those in the range 1.6–2.0 kb purified from the agarose gel and ligated into a plasmid vector. From the resulting library, 19 687 clones were taken at random and 28 643 sequencing experiments carried out, the number of sequencing experiments being greater than the number of plasmids because both ends of some inserts were sequenced. Of these sequencing experiments, 16% were considered to be failures because they resulted in less than 400 bp of sequence. The remaining 24 304 sequences gave a total of 11 631 485 bp, corresponding to six times the length of the H. influenzae genome, this amount of redundancy being deemed necessary to ensure complete coverage. Sequence assembly required 30 hours on a computer with 512 Mb of RAM, and resulted in 140 lengthy contiguous sequences, each of these sequence contigs representing a different, non-overlapping portion of the genome.
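The redundancy figure can be worked through from the numbers given above. The final step in the sketch below uses a Poisson approximation for randomly placed reads, which is an idealization added here for illustration and not a calculation from the original project.

```python
import math

GENOME_BP = 1_830_000          # 1830 kb genome
TOTAL_READ_BP = 11_631_485     # sequence from the successful experiments
READS = 24_304                 # successful sequencing experiments

coverage = TOTAL_READ_BP / GENOME_BP       # fold redundancy
mean_read_bp = TOTAL_READ_BP / READS       # average read length

# Assumption: reads fall at random, so the chance that a given
# nucleotide is missed by every read is about exp(-coverage).
missed_fraction = math.exp(-coverage)
missed_bp = GENOME_BP * missed_fraction

print(round(coverage, 2), round(mean_read_bp), round(missed_bp))
```

On this idealized model about 6.4-fold redundancy still leaves a few thousand nucleotides unsequenced, which is consistent with the need for a separate gap-closure phase.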

Figure 6.10

The way in which the shotgun approach was used to obtain the DNA sequence of the Haemophilus influenzae genome. H. influenzae DNA was sonicated and fragments with sizes between 1.6 and 2.0 kb purified from an agarose gel and ligated into a plasmid vector (more...)

The next step was to join up pairs of contigs by obtaining sequences from the gaps between them (Figure 6.11). First, the library was checked to see if there were any clones whose two end sequences were located in different contigs. If such a clone could be identified, then additional sequencing of its insert would close the ‘sequence gap’ between the two contigs (Figure 6.11A). In fact, there were 99 clones in this category, so 99 of the gaps could be closed without too much difficulty.

Figure 6.11

Assembly of the complete Haemophilus influenzae genome sequence by spanning the gaps between individual sequence contigs. (A) ‘Sequence gaps’ are ones which can be closed by further sequencing of clones already present in the library. (more...)

This left 42 gaps, which probably consisted of DNA sequences that were unstable in the cloning vector and therefore not present in the library. To close these ‘physical gaps’ a second clone library was prepared, this one with a different type of vector. Rather than using another plasmid, in which the uncloned sequences would probably still be unstable, the second library was prepared in a bacteriophage λ vector (Section 4.2.1). This new library was probed with 84 oligonucleotides, one at a time, these 84 oligonucleotides having sequences identical to the sequences at the ends of the unlinked contigs (Figure 6.11B). The rationale was that if two oligonucleotides hybridized to the same λ clone then the ends of the contigs from which they were derived must lie within that clone, and sequencing the DNA in the λ clone would therefore close the gap. Twenty-three of the 42 physical gaps were dealt with in this way.
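The logic of the probing experiment, in which two contig-end oligonucleotides that hybridize to the same λ clone identify a gap that the clone spans, can be written as a small lookup. The probe names, clone names and hybridization results below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def probe_pairs_sharing_a_clone(hybridizations):
    """Map each pair of probes to the clones that hybridize to both.

    `hybridizations` maps a probe name to the set of clones it detects.
    """
    probes_by_clone = defaultdict(set)
    for probe, clones in hybridizations.items():
        for clone in clones:
            probes_by_clone[clone].add(probe)
    pairs = defaultdict(set)
    for clone, probes in probes_by_clone.items():
        for a, b in combinations(sorted(probes), 2):
            pairs[(a, b)].add(clone)
    return dict(pairs)

# Hypothetical results for four contig-end probes
hybridizations = {
    "contig1_right": {"lambda07", "lambda12"},
    "contig2_left": {"lambda12"},
    "contig2_right": {"lambda31"},
    "contig3_left": set(),
}
print(probe_pairs_sharing_a_clone(hybridizations))
```

Here clone lambda12 carries the ends of contigs 1 and 2, so sequencing its insert would close that gap, whereas the gap between contigs 2 and 3 is not represented and would need another strategy.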

A second strategy for gap closure was to use pairs of oligonucleotides, from the set of 84 described above, as primers for PCRs of H. influenzae genomic DNA. Some oligonucleotide pairs were selected at random and those spanning a gap identified simply from whether or not they gave a PCR product (see Figure 6.11B). Sequencing the resulting PCR products closed the relevant gaps. Other primer pairs were chosen on a more rational basis. For example, oligonucleotides were tested as probes with a Southern blot of H. influenzae DNA cut with a variety of restriction endonucleases, and pairs that hybridized to similar sets of restriction fragments identified. The two members of an oligonucleotide pair identified in this way must be contained within the same restriction fragments and so are likely to lie close together on the genome. This means that the pair of contigs that the oligonucleotides are derived from are adjacent, and the gap between them can be spanned by a PCR of genomic DNA using the two oligonucleotides as primers, which will provide the template DNA for gap closure.

The demonstration that a small genome can be sequenced relatively rapidly by the shotgun approach led to a sudden plethora of completed microbial genomes. These projects demonstrated that shotgun sequencing can be set up on a production-line basis, with each team member having his or her individual task in DNA preparation, carrying out the sequencing reactions, or analyzing the data. This strategy resulted in the 580 kb genome of Mycoplasma genitalium being sequenced by five people in just eight weeks (Fraser et al., 1995), and it is now accepted that a few months should be ample time to generate the complete sequence of any genome less than about 5 Mb, even if nothing is known about the genome before the project begins. The strengths of the shotgun approach are therefore its speed and its ability to work in the absence of a genetic or physical map.

6.2.2. Sequence assembly by the clone contig approach

The clone contig approach is the conventional method for obtaining the sequence of a eukaryotic genome and has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. In the clone contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial restriction (Section 5.3.1), and these cloned in a high-capacity vector such as a BAC or a YAC (Section 4.2.1). A clone contig is built up by identifying clones containing overlapping fragments, which are then individually sequenced by the shotgun method. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, genes) known to be present in a particular region.

Clone contigs can be built up by chromosome walking, but the method is laborious

The simplest way to build up an overlapping series of cloned DNA fragments is to begin with one clone from a library, identify a second clone whose insert overlaps with the insert in the first clone, then identify a third clone whose insert overlaps with the second clone, and so on. This is the basis of chromosome walking, which was the first method devised for assembly of clone contigs.

Chromosome walking was originally used to move relatively short distances along DNA molecules, using clone libraries prepared with λ or cosmid vectors. The most straightforward approach is to use the insert DNA from the starting clone as a hybridization probe to screen all the other clones in the library. Clones whose inserts overlap with the probe give positive hybridization signals, and their inserts can be used as new probes to continue the walk (Figure 6.12).

Figure 6.12

Chromosome walking. The library comprises 96 clones, each containing a different insert. To begin the walk, the insert from one of the clones is used as a hybridization probe against all the other clones in the library. In the example shown, clone A1 ...

The main problem that arises is that if the probe contains a genome-wide repeat sequence then it will hybridize not only to overlapping clones but also to non-overlapping clones whose inserts also contain copies of the repeat. The extent of this non-specific hybridization can be reduced by blocking the repeat sequences by prehybridization with unlabeled genomic DNA (see Figure 5.30). But this does not completely solve the problem, especially if the walk is being carried out with long inserts from high-capacity vectors such as BACs or YACs. For this reason, intact inserts are rarely used for chromosome walks with human DNA and similar DNAs that have a high frequency of genome-wide repeats. Instead, a fragment from the end of an insert is used as the probe, there being less chance of a genome-wide repeat occurring in a short end-fragment compared with the insert as a whole. If complete confidence is required then the end-fragment can be sequenced before use to ensure that no repetitive DNA is present.

If the end-fragment has been sequenced then the walk can be speeded up by using PCR rather than hybridization to identify clones with overlapping inserts. Primers are designed from the sequence of the end-fragment and used in attempted PCRs with all the other clones in the library. A clone that gives a PCR product of the correct size must contain an overlapping insert (Figure 6.13). To speed the process up even more, rather than performing a PCR with each individual clone, groups of clones are mixed together in such a way that unambiguous identification of overlapping ones can still be made. The method is illustrated in Figure 6.14, in which a library of 960 clones has been prepared in ten microtiter trays, each tray comprising 96 wells in an 8 × 12 array, with one clone per well. PCRs are carried out as follows:

Figure 6.13

Chromosome walking by PCR. The two oligonucleotides anneal within the end region of insert number 1. They are used in PCRs with all the other clones in the library. Only clone 15 gives a PCR product, showing that the inserts in clones 1 and 15 overlap. ...

Figure 6.14

Combinatorial screening of clones in microtiter trays. In this example, a library of 960 clones has to be screened by PCR. Rather than carrying out 960 individual PCRs, the clones are grouped as shown and just 296 PCRs are performed. In most cases, the ...

1. Samples of each clone in row A of the first microtiter tray are mixed together and a single PCR carried out. This is repeated for every row of every tray - 80 PCRs in all.

2. Samples of each clone in column 1 of the first microtiter tray are mixed together and a single PCR carried out. This is repeated for every column of every tray - 120 PCRs in all.

3. Clones from well A1 of each of the ten microtiter trays are mixed together and a single PCR carried out. This is repeated for every well - 96 PCRs in all.

As explained in the legend to Figure 6.14, these 296 PCRs provide enough information to identify which of the 960 clones give products and which do not. Ambiguities arise only if a substantial number of clones turn out to be positive.
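The pooling scheme in steps 1-3 can be sketched in a few lines of Python. This is only an illustration of the counting and decoding logic, not a laboratory protocol: the function names (`build_pools`, `run_pcrs`, `decode`) and the `(tray, row, column)` labelling of clones are assumptions made for the example.

```python
from itertools import product

TRAYS = range(1, 11)   # ten microtiter trays
ROWS = "ABCDEFGH"      # 8 rows per tray
COLS = range(1, 13)    # 12 columns per tray


def build_pools():
    """Return the three families of pools from steps 1-3."""
    row_pools = [(t, r) for t in TRAYS for r in ROWS]    # step 1: every row of every tray
    col_pools = [(t, c) for t in TRAYS for c in COLS]    # step 2: every column of every tray
    well_pools = [(r, c) for r in ROWS for c in COLS]    # step 3: same well across all trays
    return row_pools, col_pools, well_pools


def run_pcrs(positive_clones):
    """Simulate which pools give a product for a set of positive (tray, row, col) clones."""
    pos_rows = {(t, r) for t, r, c in positive_clones}
    pos_cols = {(t, c) for t, r, c in positive_clones}
    pos_wells = {(r, c) for t, r, c in positive_clones}
    return pos_rows, pos_cols, pos_wells


def decode(pos_rows, pos_cols, pos_wells):
    """Return every clone that is consistent with all three sets of positive pools."""
    return {
        (t, r, c)
        for t, r, c in product(TRAYS, ROWS, COLS)
        if (t, r) in pos_rows and (t, c) in pos_cols and (r, c) in pos_wells
    }


row_pools, col_pools, well_pools = build_pools()
total_pcrs = len(row_pools) + len(col_pools) + len(well_pools)   # 80 + 120 + 96
```

A single positive clone is always decoded exactly. When several positives share trays, rows or columns, `decode` can return extra candidates alongside the true positives, which is the kind of ambiguity the text refers to; those candidates would then need individual PCRs.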

Newer, more rapid methods for clone contig assembly

Even when the screening step is carried out by the combinatorial PCR approach shown in Figure 6.14, chromosome walking is a slow process and it is rarely possible to assemble contigs of more than 15–20 clones by this method. The procedure has been extremely valuable in positional cloning, where the objective is to walk from a mapped site to an interesting gene that is known to be no more than a few Mb distant. It has been less valuable for assembling clone contigs across entire genomes, especially with the complex genomes of higher eukaryotes. So what alternative methods are there?

The main alternative is to use a clone fingerprinting technique. Clone fingerprinting provides information on the physical structure of a cloned DNA fragment, this physical information or ‘fingerprint’ being compared with equivalent data from other clones, enabling those with similarities - possibly indicating overlaps - to be identified. One or a combination of the following techniques is used (Figure 6.15):

Figure 6.15

Four clone fingerprinting techniques.

Restriction patterns can be generated by digesting clones with a variety of restriction enzymes and separating the products in an agarose gel. If two clones contain overlapping inserts then their restriction fingerprints will have bands in common, as both will contain fragments derived from the overlap region.

Repetitive DNA fingerprints can be prepared by blotting a set of restriction fragments and carrying out Southern hybridization (Section 4.1.2) with probes specific for one or more types of genome-wide repeat. As for the restriction fingerprints, overlaps are identified by looking for two clones that have some hybridizing bands in common.

Repetitive DNA PCR, or interspersed repeat element PCR (IRE-PCR), uses primers that anneal within genome-wide repeats and so amplify the single-copy DNA between two neighboring repeats. Because genome-wide repeat sequences are not evenly spaced in a genome, the sizes of the products obtained after repetitive DNA PCR can be used as a fingerprint in comparisons with other clones, in order to identify potential overlaps. With human DNA, the genome-wide repeats called Alu elements (Section 2.4.2) are often used because these occur on average once every 4 kb. An Alu-PCR of a human BAC insert of 150 kb would therefore be expected to give approximately 38 PCR products of various sizes, resulting in a detailed fingerprint.

STS content mapping is particularly useful because it can result in a clone contig that is anchored onto a physical map of STS locations. PCRs directed at individual STSs (Section 5.3.3) are carried out with each member of a clone library. Presuming the STS is single copy in the genome, then all clones that give PCR products must contain overlapping inserts.

As with chromosome walking, efficient application of these fingerprinting techniques requires combinatorial screening of gridded clones, ideally with computerized methodology for analyzing the resulting data.
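As a rough illustration of the computerized comparison mentioned above, the sketch below scores pairs of clones by the number of fingerprint bands they share. It is a minimal example under stated assumptions: band sizes are given in bp, two bands are treated as the same if they differ by no more than 2% (an arbitrary tolerance standing in for gel measurement error), and the clone names, sizes and the `min_shared` threshold are all hypothetical.

```python
def shared_bands(fp_a, fp_b, tolerance=0.02):
    """Count bands in fp_a that match a not-yet-used band in fp_b within a relative tolerance."""
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]   # each band in fp_b can be matched only once
                break
    return shared


def candidate_overlaps(fingerprints, min_shared=3):
    """Return (clone_a, clone_b, shared) for every pair sharing at least min_shared bands."""
    names = sorted(fingerprints)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            n = shared_bands(fingerprints[a], fingerprints[b])
            if n >= min_shared:
                pairs.append((a, b, n))
    return pairs


# Hypothetical band sizes (bp) for three clones.
fingerprints = {
    "clone1": [800, 1200, 2300, 3100, 4500],
    "clone2": [650, 2310, 3090, 4520, 5000],
    "clone3": [700, 1500, 6000],
}
overlaps = candidate_overlaps(fingerprints)
```

Shared bands only suggest an overlap. Two unrelated fragments can give bands of similar size by chance, so candidate pairs would normally be confirmed by a second fingerprinting method or by reference to the map.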

6.2.3. Whole-genome shotgun sequencing

The whole-genome shotgun approach was first proposed by Craig Venter and colleagues as a means of speeding up the acquisition of contiguous sequence data for large genomes such as the human genome and those of other eukaryotes (Venter et al., 1998; Marshall, 1999). Experience with conventional shotgun sequencing (Section 6.2.1) had shown that if the total length of sequence that is generated is between 6.5 and 8 times the length of the genome being studied, then the resulting sequence contigs will span over 99.8% of the genome sequence (Fraser, 1997), with a few gaps that can be closed by methods such as those developed during the H. influenzae project (see Figure 6.11). This implies that 70 million individual sequences, each 500 bp or so in length, corresponding to a total of 35 000 Mb, would be sufficient if the random approach were taken with the human genome. Seventy million sequences is not an impossibility: in fact, with 75 automatic sequencers, each performing 1000 sequences per day, the task could be achieved in 3 years.
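The arithmetic behind these figures can be checked with a short calculation. The read count, read length and sequencer throughput are taken from the paragraph above. The coverage formula is an added assumption: it is the standard idealized model in which reads land at random, so that the chance of a given base being missed at coverage c is e^(-c). It ignores cloning bias and the need for a minimum overlap between reads.

```python
import math

reads = 70_000_000                  # individual sequences
read_length_bp = 500                # approximate length of each sequence
sequencers = 75
reads_per_sequencer_per_day = 1000

total_mb = reads * read_length_bp / 1e6                      # total sequence generated, in Mb
days = reads / (sequencers * reads_per_sequencer_per_day)    # days of uninterrupted operation
years = days / 365


def fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read (idealized random model)."""
    return 1 - math.exp(-coverage)


covered_low = fraction_covered(6.5)
covered_high = fraction_covered(8)
```

Under these assumptions the 70 million reads take a little over two and a half years of continuous running, in line with the 3-year estimate, and the random model gives more than 99.8% of bases covered at 6.5-fold coverage, consistent with the empirical figure quoted from Fraser (1997).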

The big question was whether the 70 million sequences could be assembled correctly. If the conventional shotgun approach is used with such a large number of fragments, and no reference is made to a genome map, then the answer is certainly no. The huge amount of computer time needed to identify overlaps between the sequences, and the errors, or at best uncertainties, caused by the extensive repetitive DNA content of most eukaryotic genomes (see Figure 5.2), would make the task impossible. But with reference to a map, Venter argued, it should be possible to assemble the mini-sequences in the correct way.

Key features of whole-genome shotgun sequencing

The most time-consuming part of a shotgun sequencing project is the ‘finishing’ phase when individual sequence contigs are joined by closure of sequence gaps and physical gaps (see Figure 6.11). To minimize the amount of finishing that is needed, the whole-genome shotgun approach makes use of at least two clone libraries, prepared with different types of vector. Two libraries are used because with any cloning vector it is anticipated that some fragments will not be cloned because of incompatibility problems that prevent vectors containing these fragments from being propagated. Different types of vector suffer from different problems, so fragments that cannot be cloned in one vector can often be cloned if a second vector is used. Generating sequence from fragments cloned in two different vectors should therefore improve the overall coverage of the genome.

What about the problems that repeat elements pose for sequence assembly? We highlighted this issue in Chapter 5 as the main argument against the use of shotgun sequencing with eukaryotic genomes, because of the possibility that jumps between repeat units will lead to parts of a repetitive region being left out, or an incorrect connection being made between two separate pieces of the same or different chromosomes (see Figure 5.2). Several possible solutions to this problem have been proposed (Weber and Myers, 1997), but the most successful strategy is to ensure that one of the clone libraries contains fragments that are longer than the longest repeat sequences in the genome being studied. For example, one of the plasmid libraries used when the shotgun approach was applied to the Drosophila genome contained inserts with an average size of 10 kb, because most Drosophila repeat sequences are 8 kb or shorter. Sequence jumps, from one repeat sequence to another, are avoided by ensuring that the two end-sequences of each 10-kb insert are at their appropriate positions in the master sequence (Figure 6.16).

Figure 6.16

Avoiding errors when the whole-genome shotgun approach is used. In Figure 5.2B, we saw how easy it would be to ‘jump’ between repeat sequences when assembling the master sequence by the standard shotgun approach. The result of such an ...
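The end-sequence check can be pictured as a simple consistency test on a candidate assembly. The sketch below is illustrative only: it assumes that each clone's two end-sequences have already been placed at coordinates in the assembled sequence, uses the 10 kb insert size from the Drosophila example, and applies an arbitrary 25% tolerance on insert size. Read orientation, which real assemblers also check, is ignored, and the clone names and positions are hypothetical.

```python
def pair_is_consistent(left_pos, right_pos, insert_size=10_000, tolerance=0.25):
    """True if the two end-sequences lie about one insert length apart in the assembly."""
    distance = abs(right_pos - left_pos)
    return abs(distance - insert_size) <= tolerance * insert_size


def inconsistent_clones(placements, insert_size=10_000, tolerance=0.25):
    """Return the clones whose end-sequences are not at their expected spacing."""
    return sorted(
        clone
        for clone, (left, right) in placements.items()
        if not pair_is_consistent(left, right, insert_size, tolerance)
    )


# Hypothetical placements (bp) of end-sequence pairs in a candidate assembly.
placements = {
    "p1": (1_000, 11_000),    # 10 kb apart, as expected
    "p2": (5_000, 9_200),     # only 4.2 kb apart: sequence between the ends may have been lost
    "p3": (20_000, 29_000),   # 9 kb apart, within tolerance
}
flagged = inconsistent_clones(placements)
```

A clone whose ends sit much closer together than the insert size, such as `p2`, points to a region where the assembly may have jumped from one repeat copy to another and left out the intervening sequence.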

The initial result of sequence assembly is a series of scaffolds, each one comprising a set of sequence contigs separated by sequence gaps - ones which lie between the mini-sequences from the two ends of a single cloned fragment and so can be closed by further sequencing of that fragment (Figure 6.17). The scaffolds themselves are separated by physical gaps, which are more difficult to close because they represent sequences that are not in the clone libraries. The marker content of each scaffold is used to determine its position on the genome map. For example, if the locations of STSs in the genome map are known then a scaffold can be positioned by determining which STSs it contains. If a scaffold contains STSs from two non-contiguous parts of the genome then an error has occurred during sequence assembly. The accuracy of sequence assembly can be further checked by obtaining end-sequences from fragments of 100 kb or more that have been cloned in a high-capacity vector. If a pair of end-sequences do not fall within a single scaffold at their anticipated positions relative to each other, then again an error in assembly has occurred.

Figure 6.17

Scaffolds are intermediates in sequence assembly by the whole-genome shotgun approach. Two scaffolds are shown. Each comprises a series of sequence contigs separated by sequence gaps, with the scaffolds themselves separated by physical gaps.
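Positioning a scaffold from its marker content, and flagging one whose markers come from non-contiguous parts of the genome, might be sketched as follows. Everything specific here is an assumption for illustration: the STS names, the map positions in Mb, and the 1 Mb limit on the distance between neighboring markers within one scaffold.

```python
def place_scaffold(scaffold_sts, sts_map, max_gap_mb=1.0):
    """Position a scaffold from the STSs it contains.

    sts_map maps an STS name to (chromosome, position in Mb).
    Returns ("ok", (chromosome, start, end)), ("suspect", None) if the STSs
    come from non-contiguous regions, or ("unplaced", None) if none are mapped.
    """
    hits = sorted(sts_map[s] for s in scaffold_sts if s in sts_map)
    if not hits:
        return ("unplaced", None)
    chromosomes = {chrom for chrom, _ in hits}
    if len(chromosomes) > 1:
        return ("suspect", None)            # markers from different chromosomes
    positions = [pos for _, pos in hits]
    if any(b - a > max_gap_mb for a, b in zip(positions, positions[1:])):
        return ("suspect", None)            # markers too far apart on one chromosome
    return ("ok", (hits[0][0], positions[0], positions[-1]))


# Hypothetical STS map.
sts_map = {
    "sts1": ("chr1", 10.0),
    "sts2": ("chr1", 10.2),
    "sts3": ("chr1", 10.5),
    "sts4": ("chr7", 55.0),
    "sts5": ("chr1", 80.0),
}
good = place_scaffold(["sts1", "sts2", "sts3"], sts_map)
mixed = place_scaffold(["sts1", "sts4"], sts_map)
distant = place_scaffold(["sts1", "sts5"], sts_map)
unknown = place_scaffold(["stsX"], sts_map)
```

The appropriate gap limit depends on the size of the scaffolds and the density of the map, so the 1 Mb value should be read as a placeholder rather than a recommendation.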

The feasibility of the whole-genome shotgun approach has been demonstrated by its application to the fruit-fly and human genomes (Adams et al., 2000; Venter et al., 2001). The question that remains, and which has been hotly debated (Patterson, 1998), is whether the sequences obtained by the whole-genome shotgun approach have the desired degree of accuracy. Part of the problem is that the random nature of sequence generation means that some parts of the genome are covered by several of the mini-sequences that are obtained, whereas other parts are represented just once or twice (Figure 6.18). It is generally accepted that every part of a genome should be sequenced at least four times to ensure an acceptable level of accuracy, and that this coverage should be increased to 8–10 times before the sequence can be looked upon as being complete. A sequence obtained by the whole-genome shotgun approach is likely to exceed this requirement in many regions, but may fall short in other areas. If those areas include genes, then the lack of accuracy could cause major problems when attempts are made to locate the genes and understand their functions (see Chapter 7). These problems have been highlighted by studies of the Drosophila genome sequence, which have suggested that as many as 6500 of the 13 600 genes might contain significant sequence errors (Karlin et al., 2001).

Figure 6.18

The random nature of sequence generation by the whole-genome shotgun approach means that some parts of the genome are covered by more mini-sequences than other parts.
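The unevenness shown in Figure 6.18 can be quantified with a simple model. The sketch below assumes that the number of mini-sequences covering any one position follows a Poisson distribution whose mean is the average coverage. This is an idealization that is not taken from the text, and real libraries are usually less even than it predicts. The function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a position is covered exactly k times."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_below(depth, mean_coverage):
    """Expected fraction of positions covered fewer than `depth` times."""
    return sum(poisson_pmf(k, mean_coverage) for k in range(depth))


under_four_at_8x = fraction_below(4, 8)   # positions sequenced fewer than four times
under_four_at_5x = fraction_below(4, 5)
```

With an average coverage of 8, about 4% of positions would still be expected to fall below the four-fold level regarded as acceptable, and at an average of 5 the figure rises to roughly a quarter.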

6.3. The Human Genome Projects

To conclude our examination of mapping and sequencing we will look at how these techniques were applied to the human genome. Although every genome project is different, with its own challenges and its own solutions to those challenges, the human projects illustrate the general issues that have had to be addressed in order to sequence a large eukaryotic genome, and in many ways illustrate the procedures that are currently regarded as state of the art in this area of molecular biology.

6.3.1. The mapping phase of the Human Genome Project

Until the beginning of the 1980s a detailed map of the human genome was considered to be an unattainable objective. Although comprehensive genetic maps had been constructed for fruit flies and a few other organisms, the problems inherent in analysis of human pedigrees (Section 5.2.4) and the relative paucity of polymorphic genetic markers meant that most geneticists doubted that a human genetic map could ever be achieved. The initial impetus for human genetic mapping came from the discovery of RFLPs, which were the first highly polymorphic DNA markers to be recognized in animal genomes. In 1987 the first human RFLP map was published, comprising 393 RFLPs and ten additional polymorphic markers (Donis-Keller et al., 1987). This map, developed from analysis of 21 families, had an average marker density of one per 10 Mb.

In the late 1980s the Human Genome Project became established as a loose but organized collaboration between geneticists in all parts of the world. One of the goals that the Project set itself was a genetic map with a density of one marker per 1 Mb, although it was thought that a density of one per 2–5 Mb might be the realistic limit. In fact by 1994 an international consortium had met and indeed exceeded the objective, thanks to their use of SSLPs and the large CEPH collection of reference families (Section 5.2.4). The 1994 map contained 5800 markers, of which over 4000 were SSLPs, and had a density of one marker per 0.7 Mb (Murray et al., 1994). A subsequent version of the genetic map (Dib et al., 1996) took the 1994 map slightly further by inclusion of an additional 1250 SSLPs.

Physical mapping did not lag far behind. In the early 1990s considerable effort was put into the generation of clone contig maps, using STS screening (Section 5.3.3) as well as other clone fingerprinting methods (Section 6.2.2). The major achievement of this phase of the physical mapping project was publication of a clone contig map of the entire genome, consisting of 33 000 YACs containing fragments with an average size of 0.9 Mb (Cohen et al., 1993). However, doubts were raised about the value of YAC contig maps when it was realized that YAC clones can contain two or more pieces of non-contiguous DNA (Figure 6.19; Anderson, 1993). The use of these chimeric clones in the construction of contig maps could result in DNA segments that are widely separated in the genome being mistakenly mapped to adjacent positions. These problems led to the adoption of radiation hybrid mapping of STS markers (Section 5.3.3), largely by the Whitehead Institute/MIT Genome Center in Massachusetts, culminating in 1995 with publication of a human STS map containing 15 086 markers, with an average density of one per 199 kb (Hudson et al., 1995). This map was later supplemented with an additional 20 104 STSs, most of these being ESTs and hence positioning protein-coding genes on the physical map (Schuler et al., 1996). The resulting map density approached the target of one marker per 100 kb set as the objective for physical mapping at the outset of the Human Genome Project.

Figure 6.19

Some YAC clones contain segments of DNA from different parts of the human genome.

The combined STS maps included positions for almost 7000 polymorphic SSLPs that had also been mapped onto the genome by genetic means. As a result, the physical and genetic maps could be directly compared, and clone contig maps that included STS data could be anchored onto both maps. The net result was a comprehensive, integrated map (Bentley et al., 1998; Deloukas et al., 1998) that could be used as the framework for the DNA sequencing phase of the Human Genome Project.

6.3.2. Sequencing the human genome

The original plan was that the sequencing phase of the Human Genome Project would be based on YAC libraries, because this type of vector can be used with DNA fragments longer than can be handled by any other type of cloning system. This strategy had to be abandoned when it was discovered that some YAC clones contain non-contiguous fragments of DNA. The Project therefore turned its attention to BACs (Section 4.2.1). A library of 300 000 BAC clones was generated and these clones mapped onto the genome, forming a ‘sequence-ready’ map which could be used as the primary foundation for the sequencing phase of the project, during which the insert from each BAC would be completely sequenced by the shotgun method.

At about the time when the Human Genome Project was gearing itself up to move into the sequence-acquisition phase, the whole-genome shotgun approach was first proposed as an alternative to the more laborious clone contig method that had so far been adopted (Venter et al., 1998). The possibility that the Human Genome Project would not in fact provide the first human genome sequence stimulated the organizers of the Project to bring forward their planned dates for completion of a working draft (Collins et al., 1998). The first sequence of an entire human chromosome (number 22) was published in December 1999 (Dunham et al., 1999) and the sequence of chromosome 21 appeared a few months later (Hattori et al., 2000). Finally, on 26 June 2000, accompanied by the President of the United States, Francis Collins and Craig Venter, the leaders of the two projects, jointly announced completion of their working drafts (Marshall, 2000), which appeared in print eight months later (IHGSC, 2001; Venter et al., 2001).

It is important to understand that the two genome sequences published in 2001 are drafts, not complete final sequences. For example, the version obtained by the clone contig approach covers just 90% of the genome, the missing 320 million bp lying predominantly in constitutive heterochromatin (Section 8.1.2) - regions of chromosomes in which the DNA is very tightly packaged and which are thought to contain few, if any genes. Within the 90% of the genome that is covered, each part has been sequenced at least four times, providing an ‘acceptable’ level of accuracy, but only 25% has been sequenced the 8–10 times that is necessary before the work is considered to be ‘finished’ (Bork and Copley, 2001). This draft sequence comprises approximately 50 000 scaffolds (see Figure 6.17) with an average size of 54.2 kb. Similar statistics apply to the whole-genome shotgun sequence. A substantial number of gaps therefore have to be closed and much additional sequencing must be done in order to bring the two draft sequences to the stage where either is considered to be complete.

6.3.3. The future of the human genome projects

Completion of a finished sequence is not the only goal of the consortia working on the human genome. Understanding the genome sequence is a massive task that will engage many groups around the world, making use of various techniques and approaches which will be described in the next chapter. Important among these is the use of comparative genomics, in which two complete genome sequences are compared in order to identify common features that, being conserved, are likely to be important (Section 7.4). With the human genome, comparative genomics has the added value that it may allow the animal versions of human disease genes to be located, paving the way for studies of the genetic basis of these diseases using the animal genes as models for the human condition. Genome projects for the mouse and rat are both underway, with draft sequences expected in 2003 (Pennisi, 1999; Denny and Justice, 2000; Marshall, 2001), and plans are being made for a chimpanzee genome project (Normile, 2001). There will also be additional human genome projects aimed at building up a catalog of sequence variability in different populations, the results possibly enabling the ancient origins of these populations to be inferred (Section 16.3.2).

These human diversity projects lead us to the controversial aspects of genome sequencing. Most scientists anticipate that sequence data from different populations will emphasize the unity of the human race by showing that patterns of genetic variability do not reflect the geographic and political groupings that humans have adopted during the last few centuries. But the outcomes of these projects are still certain to stimulate debate in non-scientific circles. Additional controversies center on the question of who, if anyone, will own human DNA sequences. To many, the idea of ownership of a DNA sequence is a peculiar concept, but large sums of money can be made from the information contained in the human genome, for example by using gene sequences to direct development of new drugs and therapies against cancer and other diseases. Pharmaceutical companies involved in genome sequencing naturally want to protect their investments, as they would for any other research enterprise, and currently the only way of doing this is by patenting the DNA sequences that they discover. Unfortunately, in the past, errors have been made in dealing with the financial issues relating to research with human biological material, the individual from whom the material is obtained not always being a party in the profit sharing. These issues have still to be resolved.

The problems relating to the public usage of human genome sequences are even more contentious. A major concern is the possibility that, once the sequence is understood, individuals whose sequences are considered ‘sub-standard’, for whatever reason, might be discriminated against. The dangers range from increased insurance premiums for individuals whose sequence includes mutations predisposing them to a genetic disease, to the possibility that racists might attempt to define ‘good’ and ‘bad’ sequence features, with depressingly predictable implications for the individuals unlucky enough to fall into the ‘bad’ category.

The two human genome projects, especially in the USA, continue to support research and debate into the ethical, legal and social issues raised by genome sequencing. In particular, great care is being taken to ensure that the genome sequences that result from the projects cannot be identified with any single individual. The DNA that is being cloned and sequenced is taken only from individuals who have given consent for their material to be used in this way and for whom anonymity can be guaranteed. When this policy was first adopted it required a certain amount of realignment of the research effort because older clone libraries had to be destroyed and the existing physical maps checked with the new material. It is accepted, however, that the extra work was necessary and that care must be taken to maintain and enhance public confidence in the projects.

Study Aids For Chapter 6

Key terms

Give short definitions of the following terms:

Alu-PCR

Chain termination method

Chemical degradation method

Chromosome walking

Clone contig approach

Clone fingerprinting

Comparative genomics

Dideoxynucleotide

Helper phage

Interspersed repeat element PCR (IRE-PCR)

Phagemid

Polyacrylamide gel electrophoresis

Positional cloning

Processivity

Pyrosequencing

Repetitive DNA fingerprint

Repetitive DNA PCR

Replicative form

Scaffold

Sequence contig

Sonication

STS content mapping

Thermal cycle sequencing

Whole-genome shotgun approach

Self study questions

1. Draw a diagram illustrating the steps involved in a chain termination sequencing experiment. Your diagram should make clear the process by which the sequence is read from the polyacrylamide gel.

2. Outline how DNA is sequenced by the chemical degradation procedure.

3. What are the desirable features of a DNA polymerase that is to be used for chain termination sequencing?

4. Describe the advantages and disadvantages of M13 vectors as a source of DNA for chain termination sequencing.

5. Describe the methods, other than the use of M13 vectors, by which single-stranded DNA can be obtained for chain termination sequencing.

6. Describe how automated DNA sequencing differs from the standard chain termination method, and evaluate the importance of automated sequencing in genomics research.

7. Outline the non-conventional methods for DNA sequencing that are currently being explored.

8. Using Haemophilus influenzae as an example, describe how the shotgun method is used to sequence a small bacterial genome.

9. Distinguish between the terms ‘sequence gap’ and ‘physical gap’. Describe the various methods that can be used to close physical gaps in a DNA sequence obtained by the shotgun method.

10. Describe the methods used to obtain an overlapping set of clones when the clone contig approach is used to sequence a genome.

11. Explain why the clone contig approach is generally considered to be more accurate than whole-genome shotgun sequencing.

12. Evaluate the strengths and weaknesses of the whole-genome shotgun approach to DNA sequencing. What steps are taken to ensure that a sequence resulting from this approach is accurate?

13. Write a short history of the Human Genome Project.

14. Describe the future directions of research into the human genome sequence.

Problem-based learning

1. In the late 1970s, the chain termination and chemical degradation methods for DNA sequencing appeared to be equally efficacious. But today virtually all sequencing is done by the chain termination method. Write a report that explains why chain termination sequencing has become predominant.

2. You have isolated a new species of bacterium whose genome is a single DNA molecule of approximately 2.6 Mb. Write a detailed project plan to show how you would obtain the genome sequence for this bacterium.

3. Critically evaluate the clone contig approach as a means of sequencing a large eukaryotic genome.

4. From a comparison of the research papers describing the two draft human genome sequences (IHGSC, 2001; Venter et al., 2001) evaluate the success of the whole-genome shotgun approach as applied to the human genome.

5. With the benefit of hindsight, evaluate the decision making that occurred during the course of the Human Genome Project, and assess if any alternative strategies would have resulted in the draft sequence being obtained more quickly.

6. Human genome sequence: friend or foe?

References

1. Adams MD, Celniker SE, Holt RA. et al. The genome sequence of Drosophila melanogaster. Science. (2000);287:2185–2195. [PubMed: 10731132]

2. Anderson C. Genome shortcut leads to problems. Science. (1993);259:1684–1687. [PubMed: 8456291]

3. Bentley DR, Pruitt KD, Deloukas P, Schuler GD, Ostell J. Coordination of human genome sequencing via a consensus framework map. Trends Genet. (1998);14:381–384. [PubMed: 9820023]

4. Bork P, Copley R. Filling in the gaps. Nature. (2001);409:818–820. [PubMed: 11236994]

5. Cohen D, Chumakov I, Weissenbach J. A first-generation map of the human genome. Nature. (1993);366:698–701. [PubMed: 8259213]

6. Collins FS, Patrinos A, Jordan E. et al. New goals for the U.S. Human Genome Project: 1998-2003. Science. (1998);282:682–689. [PubMed: 9784121]

7. Deloukas P, Schuler GD, Gyapay G. et al. A physical map of 30,000 genes. Science. (1998);282:744–746. [PubMed: 9784132]

8. Denny P, Justice MJ. Mouse as the measure of man? Trends Genet. (2000);16:283–287. [PubMed: 10858655]

9. Dib C, Fauré S, Fizames C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. (1996);380:152–154. [PubMed: 8600387]

10. Donis-Keller H, Green P, Helms C. et al. A genetic map of the human genome. Cell. (1987);51:319–337. [PubMed: 3664638]

11. Dunham I, Shimizu N, Roe BA. et al. The DNA sequence of human chromosome 22. Nature. (1999);402:489–495. [PubMed: 10591208]

12. Fleischmann RD, Adams MD, White O. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. (1995);269:496–512. [PubMed: 7542800]

13. Fraser CM. How to sequence a small genome. Trends Genet. (1997);13:poster insert.

14. Fraser CM, Gocayne JD, White O. et al. The minimal gene complement of Mycoplasma genitalium. Science. (1995);270:397–403. [PubMed: 7569993]

15. Hattori M, Fujiyama A, Taylor TD. et al. The DNA sequence of human chromosome 21. Nature. (2000);405:311–319. [PubMed: 10830953]

16. Hudson TJ, Stein LD, Gerety SS. et al. An STS-based map of the human genome. Science. (1995);270:1945–1954. [PubMed: 8533086]

17. IHGSC (International Human Genome Sequencing Consortium). Initial sequencing and analysis of the human genome. Nature. (2001);409:860–921. [PubMed: 11237011]

18. Karlin S, Bergman A, Gentles AJ. Annotation of the Drosophila genome. Nature. (2001);411:259–260. [PubMed: 11357119]

19. Marshall E. A high-stakes gamble on genome sequencing. Science. (1999);284:1906–1909. [PubMed: 10400531]

20. Marshall E. Rival genome sequencers celebrate a milestone together. Science. (2000);288:2294–2295. [PubMed: 10917817]

21. Marshall E. Rat genome spurs an unusual partnership. Science. (2001);291:1872. [PubMed: 11245170]

22. Maxam AM, Gilbert W. A new method for sequencing DNA. Proc. Natl Acad. Sci. USA. (1977);74:560–564. [PMC free article: PMC392330] [PubMed: 265521]

23. Mullikin JC, McMurray AA. Sequencing the genome, fast. Science. (1999);283:1867–1868. [PubMed: 10206892]

24. Murray JC, Buetow KH, Weber JL. et al. A comprehensive human linkage map with centimorgan density. Science. (1994);265:2049–2054. [PubMed: 8091227]

25. Normile D. Chimp sequencing crawls forward. Science. (2001);291:2297. [PubMed: 11269291]

26. Patterson M. Politicogenomics takes centre stage. Trends Genet. (1998);14:259–260. [PubMed: 9676524]

27. Pennisi E. Mouse genome added to sequencing effort. Science. (1999);286:211. [PubMed: 10577185]

28. Prober JM, Trainor GL, Dam RJ. et al. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science. (1987);238:336–341. [PubMed: 2443975]

29. Rogers J. Gels and genomes. Science. (1999);286:429. [PubMed: 10577205]

30. Ronaghi M, Uhlén M, Nyrén P. A sequencing method based on real-time pyrophosphate. Science. (1998);281:363–365. [PubMed: 9705713]

31. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain terminating inhibitors. Proc. Natl Acad. Sci. USA. (1977);74:5463–5467. [PMC free article: PMC431765] [PubMed: 271968]

32. Schuler GD, Boguski MS, Stewart EA. et al. A gene map of the human genome. Science. (1996);274:540–546. [PubMed: 8849440]

33. Sears LE, Moran LS, Kissinger C. et al. CircumVent thermal cycle sequencing and alternative manual and automated DNA sequencing protocols using the highly thermostable Vent (exo-) DNA polymerase. Biotechniques. (1992);13:626–633. [PubMed: 1476733]

34. Southern EM. DNA chips: analysing sequence by hybridization to oligonucleotides on a large scale. Trends Genet. (1996);12:110–115. [PubMed: 8868349]

35. Venter JC, Adams MD, Sutton GG. et al. Shotgun sequencing of the human genome. Science. (1998);280:1540–1542. [PubMed: 9644018]

36. Venter JC, Adams MD, Myers EW. et al. The sequence of the human genome. Science. (2001);291:1304–1351. [PubMed: 11181995]

37. Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res. (1997);7:401–409. [PubMed: 9149936]

Further Reading

1. Brown TA (1994) DNA Sequencing: The Basics. Oxford University Press, Oxford. - Details of DNA sequencing methodology.

2. Davies K (2001) Cracking the Genome: Inside the Race to Unlock Human DNA. Free Press, New York. (Published in the UK as The Sequence: Inside the Race for the Human Genome. Weidenfeld and Nicolson, London.) - A history of the human genome projects.

3. Strachan T and Read AP (1999) Human Molecular Genetics, 2nd edition. BIOS Scientific Publishers, Oxford. - Chapter 13 describes the Human Genome Project.

4. Wilkie T (1993) Perilous Knowledge: The Human Genome Project and its Implications. Faber and Faber, New York. - A view of the social impact of the Human Genome Project.

Copyright © 2002, Garland Science.

Bookshelf ID: NBK21117
