
Chapter 4. The Genomic Biologist’s Toolkit

Contents
4. Genomic Biologists tool kit
4.1. Restriction Endonucleases – making “sticky ends”
4.2. Vectors
4.2.1. Simple Cloning Vectors
4.2.2. Expression Vectors
4.2.3. Shuttle Vectors
4.2.4. Phage Vectors
4.2.5. Artificial Vectors
4.3. Methods for Sequence Amplification
4.3.1. Polymerase Chain Reaction
4.3.2. Cloning Recombinant DNA
4.3.3. Cloning DNA in Expression Vectors
4.3.4. Making Complementary DNA (cDNA)
4.3.5. Cloning a cDNA
4.4. Genomic Libraries
4.4.1. Cloning in YAC Vectors
4.4.2. Cloning in BAC Vectors
4.5. DNA sequencing
4.5.1. Electrophoresis
4.5.2. Sanger Dideoxy Sequencing
4.5.3. Capillary Sequencers
4.5.4. Next Generation Sequencing
4.5.5. 3rd Generation Sequencing
4.6. DNA Sequencing Strategies
4.6.1. Map-based Strategies
4.6.2. Whole Genome
4.7. Annotation
4.7.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes
4.7.2. Comparison of predicted sequences with known sequences (at NCBI)
4.7.3. Published

CONCEPTS OF GENOMIC Page 4-1

CHAPTER 4. THE GENOMIC BIOLOGIST’S TOOLKIT

Genomic biology has three important branches: structural genomics, comparative genomics, and functional genomics. The ultimate goals of these branches are, respectively, the sequencing of genes and genomes; the comparison of these sequenced genes and genomes; and an understanding of how genes and genomes work to produce the complex phenotypes of all organisms.

A set of molecular genetic technologies is critical to our ability to pursue the goals described above. The Genomic Biologist’s Toolkit provides a brief introduction to these critical tools and to how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, what they can do and how they work can be readily studied using bioinformatic resources.

4.1. RESTRICTION ENDONUCLEASES

Restriction endonucleases (restriction enzymes) each recognize a specific DNA sequence (restriction site) and break a phosphodiester linkage between a 3’ carbon and phosphate within that sequence. Restriction enzymes are used to create DNA fragments for cloning and to analyze the positions of restriction sites in cloned or genomic DNA. A specific restriction enzyme cuts DNA at the same sites in every molecule if the digest is allowed to go to completion. Thus, this is a method whereby all copies of a genome, or of any other long sequence, can be reproducibly cut into identical fragments.

The first three letters of the name of a restriction enzyme are derived from the genus and species of the organism from which it was isolated. Additional letters often denote the bacterial strain from which the restriction enzyme was isolated, and if multiple enzymes are isolated from the same strain, they are given Roman numerals. For example, the restriction enzyme EcoRI is the first enzyme isolated from the RY13 strain of Escherichia coli.

Bacteria produce restriction endonucleases to defend against bacteriophages (viruses), and each restriction enzyme recognizes a specific DNA sequence where it cuts the DNA strands (see Table 4.1 & Figure 4.1). The recognition sites for a given restriction enzyme are often limited in the genome of the organism from which the enzyme comes, but they are abundant in the genome of the bacteriophage. In addition, the DNA of the host cell can be modified by methylation, which prevents the restriction enzymes of the host cell from degrading host DNA, while invading bacteriophage DNA is unmethylated and readily degraded.

Table 4.1. Characteristics of Some Restriction Enzymes

Many restriction sites are sequences of 4, 6, or 8 base pairs in length and have identical sequences from 5’ to 3’ on each strand. These sequences are referred to as palindromic DNA sequences. Other restriction sites are not completely symmetrical and/or differ in length from 4, 6, or 8 nucleotide pairs (Table 4.1 & Figure 4.1).

Figure 4.1. Restriction site sequences and cut locations of: a) SmaI; b) BamHI, and c) PstI.

As shown in Figure 4.1, the nature of the fragment ends produced when a restriction enzyme cuts DNA can vary. Some enzymes produce fragments where the two strands are equal in length. These are referred to as blunt ends. Other enzymes produce fragments where the two strands are unequal in length. These are referred to as either 5’ sticky ends or 3’ sticky ends. Overhanging sticky ends provide a basis for combining DNA fragments produced by the same restriction enzyme from different DNA sources. This process was the original method used to produce recombinant DNA molecules.

The application of restriction endonucleases to the cloning of DNA is further discussed in the DNA Cloning video. Note that part of this video will be discussed in detail in the next section of the Genomic Biologist’s Toolkit, but the first part of the video is a good demonstration of how restriction enzymes work and how they can be used to create recombinant DNA molecules for cloning DNA.

Note that we have previously discussed SNPs as a type of Sequence Tagged Site (STS). Because SNPs are single nucleotide changes in the genome sequence, consider the effect of an SNP that happens to occur in a restriction endonuclease recognition site. The result would be the loss of a restriction site at that SNP. This site would no longer be cut by the enzyme, and thus new fragments having different sizes would be produced. This is called a Restriction Fragment Length Polymorphism (RFLP). Thus, an RFLP is an SNP that happens to occur in a restriction site in the DNA. A famous RFLP is associated with Sickle Cell Disease, and is further described in the accompanying video.
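The ideas in this section (a palindromic recognition site, a complete digest, and the loss of a site through an SNP) can be illustrated with a short Python sketch. The two 40-bp sequences below are invented for illustration only, and the `ENZYMES` table and function names are assumptions of this example, not part of any established library.

```python
import re

# Recognition site and the offset within the site at which the top strand
# is cut: EcoRI cuts G^AATTC (sticky ends), SmaI cuts CCC^GGG (blunt ends).
ENZYMES = {"EcoRI": ("GAATTC", 1), "SmaI": ("CCCGGG", 3)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if both strands read the same 5' to 3'."""
    return site == reverse_complement(site)


def digest(seq, enzyme):
    """Fragment lengths from a complete digest of a linear molecule."""
    site, offset = ENZYMES[enzyme]
    cuts = [m.start() + offset for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical 40-bp allele with two EcoRI sites.
allele_a = "ATGCGAATTCTTAGGCTAGCTAGGAATTCCGATCGATCGA"
# The same allele with one base changed (an SNP) inside the second site.
allele_b = allele_a.replace("GGAATTCC", "GGAATACC")

frags_a = digest(allele_a, "EcoRI")
frags_b = digest(allele_b, "EcoRI")
print("EcoRI site palindromic:", is_palindromic("GAATTC"))
print("allele A fragments:", frags_a)
print("allele B fragments:", frags_b)
```

Because the SNP removes one site, two of the fragments from allele A appear as a single longer fragment in allele B, which is the fragment length difference that an RFLP analysis detects.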

Figure 4.2. Using the restriction enzyme EcoRI to make recombinant DNA. The procedure relies on the 5’-overhanging “sticky ends”.

An additional application of restriction enzymes involves the production of a restriction map. A restriction map shows the relative positions of restriction sites for multiple restriction enzymes in a piece of linear or circular DNA. Prior to the availability of genomic sequences, restriction mapping was an important tool used to characterize cloned DNA fragments. The production of a restriction map for a circular DNA is shown in the Restriction Mapping video.

4.2. CLONING VECTORS

The process of “DNA cloning” involves a set of experimental methods in molecular biology that are used to assemble recombinant DNA molecules and to direct their replication within host organisms. The use of the word cloning refers to the fact that the method involves the replication of one molecule to produce a population of cells with identical DNA molecules. Molecular cloning generally uses DNA sequences from two different organisms: 1) the organism that is the source of the DNA to be cloned, and 2) the organism that will serve as the living host for replication of the recombinant DNA. Molecular cloning methods are central to many areas of biology, biotechnology, and medicine, including DNA sequencing.

The DNA from the host organism in a cloning experiment, often called a vector, typically has 3 things:

1) Sequences necessary to produce recombinant DNA and facilitate entry into the host organism. Typically, this can be one or more “unique” restriction sites. “Unique” in this context means that these restriction sites permit cutting the vector at only one location. Most vectors contain unique restriction sites for a number of different restriction enzymes. This is called a polylinker or multiple cloning site (MCS), and can make the use of the vector much easier.

2) An origin of replication for the host organism to facilitate replication of the recombinant DNA in the host cell. Typically this sequence controls the number of copies of the vector that can be made in one cell.

3) In order to facilitate identification of cells that contain the vector containing recombinant DNA, a gene that can be expressed in the host and that provides a “selectable” marker for the presence of recombinant DNA is provided. Often the selectable marker will be a gene that makes cells resistant to a specific antibiotic or that permits cells to make an amino acid required for growth.

These are the basic requirements that all modern cloning vectors meet, but beyond these basic requirements, there can be a number of additional features that make specific vectors useful for various purposes. Thus, several types of cloning vectors have been constructed, each with different molecular properties and cloning capacities.

4.2.1. Simple Cloning Vectors

The most common vectors are used to clone recombinant DNA in bacterial cells, typically E. coli. Simple cloning vectors are constructed from plasmids, which are common in many bacterial cells. Plasmids are circles of double-stranded DNA (dsDNA), much smaller than the bacterial chromosome, that include the replication origin (ori sequence) needed for replication in bacterial cells and that naturally carry DNA between different bacteria. An example of a typical E. coli plasmid is pUC19 (2,686 bp). The more modern version of pUC19 is pBluescript II. The features of this plasmid are shown in Figure 4.3.

Figure 4.3. The features of pUC19 and pBluescript II include: 1) High copy number in E. coli, with nearly 100 copies per cell, provides a good yield of cloned DNA. 2) Its selectable marker is ampR. 3) It has a cluster of unique restriction sites, called the polylinker (multiple cloning site). 4) The polylinker is part of the lacZ (β-galactosidase) gene. The plasmid will complement a lacZ- host, allowing it to become lacZ+. When DNA is cloned into the polylinker, lacZ is disrupted, preventing complementation of the lacZ- host from occurring. 5) X-gal is a chromogenic analog of lactose that turns blue when β-galactosidase is present and remains white in the absence of β-galactosidase, so blue-white screening can indicate which colonies contain recombinant plasmids.

More information about cloning DNA in plasmid vectors can be found in Molecular Cell Biology, 4th edition, Section 7.1, which can be downloaded from NCBI. The use of simple cloning vectors to clone recombinant DNA made via DNA restriction and overhanging sticky ends can be seen in the attached Steps in DNA Cloning video. The use of simple cloning to obtain a collection of clones representing all sequences that can be cut from a longer piece of DNA is called creating a clone library of sequences (see video). Libraries can be useful in several ways. One of these might be to create an expression library that makes specific proteins from each clone. This requires the use of an expression vector.

4.2.2. Expression Vectors

Expression vectors contain all of the same elements that simple cloning vectors contain, i.e. an ori, a selectable marker, and a multiple cloning site; but the MCS is flanked by a promoter sequence and a terminator sequence that work in the host organism. This permits the cloned sequence to be transcribed and, if the vector contains a Shine-Dalgarno sequence (not shown in Figure 4.4), to be translated into a protein, provided there is a start codon and a stop codon in the sequence.

Note that Figure 4.4 illustrates how the cloned sequence can insert randomly in two orientations. However, only one of the orientations will produce a translatable mRNA. The other orientation will produce an RNA that is the complementary strand of the mRNA (called an antisense RNA). A way of dealing with this issue is considered in section 4.3.3.

Figure 4.4. An example of a simple expression vector.

4.2.3. Shuttle Vectors

A cloning vector capable of replicating in two or more types of organism (e.g., E. coli and yeast) is called a shuttle vector. Shuttle vectors may replicate autonomously in both hosts, or integrate into the host genome.

Figure 4.5. Shuttle vectors like pRS426 can be used to move cloned DNA into 2 different organisms. In this case, the plasmid moves into E. coli and yeast. Note that the vector contains an origin of replication for yeast (yeast 2µ ARS) and for E. coli (ori), a selectable marker gene for E. coli (ampR) and for yeast (URA3, which allows growth without uracil, unlike the yeast strain used), and a multiple cloning site with a yeast promoter and terminator on either side. Thus, this vector can work in both E. coli and yeast.

4.2.4. Phage Vectors

Besides plasmid-based simple cloning vectors, there are a number of other vectors that are not based on plasmids. These often have specific uses that take advantage of their unique properties. Among the types of non-plasmid vectors, bacteriophage λ vectors (shown in Figure 4.6) are among the most frequently used. Phage λ vectors can be used to make expression libraries and are convenient for selection of clones, as the bacteriophage lyses cells, releasing the contents of the cell to the medium. Thus, proteins derived from the inserted fragment can be investigated using these vectors.

Figure 4.6. Phage λ Vector.

4.2.5. Artificial Chromosome Vectors

The typical simple cloning vector will accommodate DNA fragments up to about 3,000 bp in length. However, there is often a need to clone significantly longer fragments of DNA for study. Typically, genomic DNA sequencing is easiest with the longest fragments possible. Two vector systems, BAC vectors (bacterial artificial chromosomes) and YAC vectors (yeast artificial chromosomes), are useful choices for cloning long DNA fragments. In BACs, fragments up to about 350 kbp (350,000 bp) can be cloned, while in YACs, fragments up to 1,000,000 bp have been reported. Both of these vector systems were used in the original human genome sequencing project. However, it was found that YACs are relatively unstable, meaning that they frequently modified themselves, losing DNA in the process, and thus they did not have the stability shown by BACs. Consequently, BACs have emerged as the large cloning vector of choice.

4.3. METHODS OF SEQUENCE AMPLIFICATION

With our discussion of restriction endonucleases and cloning vectors completed, we are now ready to put these concepts together and show how specific DNA sequences can be amplified for genetic and genomic studies.

4.3.1. Polymerase Chain Reaction (PCR)

Polymerase Chain Reaction, or PCR, is a method by which DNA polymerase can be used to make many copies of a DNA sequence in a test tube. The technique is a valuable supplement to DNA cloning for generating specific DNA sequences for use as reagents. A description of the PCR process is given in the Polymerase Chain Reaction video.

Some additional things to note are that the reaction temperature is changed using a device called a thermal cycler, which can rapidly change temperatures during each cycle. The reaction mixture must have all necessary components for a PCR reaction, including a thermostable DNA polymerase like the Taq DNA polymerase mentioned in the video. Such DNA polymerases are obtained from organisms called extremophiles that grow in very hot water like that found in geysers (e.g. Old Faithful in Yellowstone National Park) or thermal vents on the floor of the ocean. The reaction also contains the deoxynucleotide triphosphates (dNTPs, i.e. dATP, dGTP, dCTP, & dTTP) and the primers, which define each end of the sequence to be amplified.

DNA sequences amplified via PCR typically contain an extra A on the 3’ end of the molecule, i.e. a single overhanging 3’-A that makes ligation of the PCR-amplified fragment into a PCR cloning vector much easier (see Figure 4.8).

Figure 4.7. Artificial Chromosome vectors. a) Shows a bacterial artificial chromosome (BAC) that has a selectable marker (chloramphenicol resistance) and an MCS. However, the ori sequence is replaced by a single-copy F factor origin of replication. b) Shows a yeast artificial chromosome, including selectable markers (TRP1 and URA3), a yeast origin of replication (ARS), and centromere and telomere chromosome parts. This vector will replicate in yeast cells.

Figure 4.8. PCR cloning vectors. [Diagram: pGEM-T Easy PCR vector (3015 bp) + PCR-amplified DNA (1191 bp) + DNA ligase gives pGEM-T Easy + PCR-amplified DNA (4206 bp).] Note that the vector comes linearized with overhanging 3’-T’s. PCR products typically have single overhanging A’s at their 3’ ends. This provides a convenient way of making a circular plasmid with the inserted PCR product.

4.3.2. Cloning in a Simple Cloning Vector

DNA cloning underlies a number of genomic biology experiments. Large amounts of DNA are needed for analysis, sequencing, and numerous experimental approaches. As we saw above, multiple copies of a known DNA sequence can be made and cloned using PCR and a PCR vector. However, an alternative is necessary when the sequence to be cloned is unknown (i.e. PCR primers cannot be designed). To introduce this principle we will outline the steps to clone a DNA fragment of unknown sequence in a simple cloning vector.

To get multiple copies of a gene or other piece of DNA you must isolate, or ‘cut’, the DNA from its source using restriction enzymes, and then ‘paste’ it into a simple cloning vector that can be amplified in a host cell, typically E. coli. The main steps in DNA cloning are:

Step 1. DNA is purified from the donor cells using a standard DNA purification technique.

Step 2. A chosen fragment of DNA is ‘cut’ from the purified genomic DNA of the source organism using a restriction enzyme.

Step 3. The piece of DNA is ‘pasted’ into a vector, and the ends of the DNA are joined with the vector DNA by DNA ligase (the enzyme that joins Okazaki fragments, described in the DNA replication section).

Step 4. The vector is introduced into a host cell, often a bacterium, by a process called bacterial transformation. The transformed host cells copy the vector DNA + recombinant DNA along with their own DNA, creating multiple copies of the inserted DNA. DNA that has been ‘cut’ and ‘pasted’ from an organism into a vector is called recombinant DNA. Because of this, DNA cloning is also called recombinant DNA technology.

Step 5. The vector DNA is isolated (or separated) from the host cells’ DNA and purified.

Figure 4.9. Insertion of restricted DNA into a simple cloning vector.

4.3.3. Cloning DNA in Expression Vectors

In section 4.2, we discussed expression vectors, and showed that when a restricted DNA sequence is cloned in an expression vector, it can be ligated into the vector in either a “forward” or a “reverse” configuration (Figure 4.4). In the forward configuration the fragment is positioned so that it makes an mRNA that codes for a protein, while in the reverse configuration, the DNA fragment does not make an mRNA, but makes an RNA from the opposite strand called an antisense RNA.

It is possible, using a PCR strategy, to insert a DNA fragment into an expression vector such that it can only insert in the “forward” orientation. This strategy is shown in Figure 4.10.

Figure 4.10. Using PCR to obtain only the forward orientation of a sequence in an expression vector. Primers are designed with a restriction site added such that they anneal at each end of the fragment of interest. Following PCR, an amplified fragment will be produced with a KpnI site at the 5’ end of the intended coding sequence and a SalI site at the 3’ end. The expression vector is then opened by cutting with both KpnI and SalI. Since the KpnI site is closer to the promoter in the expression vector’s MCS, while the SalI site is closer to the terminator, this construct will go into the vector in the sense orientation, so that a message is produced that makes the protein of interest rather than its antisense equivalent.
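The primer design behind Figure 4.10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 30-nucleotide `coding` sequence is invented, the 12-base annealing length is arbitrary, and the extra bases that are usually added 5’ of a restriction site so the enzyme cuts efficiently near a fragment end are omitted. The KpnI (GGTACC) and SalI (GTCGAC) recognition sequences are the standard ones.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

KPNI = "GGTACC"  # placed at the 5' end of the coding sequence
SALI = "GTCGAC"  # placed at the 3' end of the coding sequence


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def design_primers(coding, anneal=12):
    """Return (forward, reverse) primers carrying KpnI and SalI sites.

    The forward primer matches the start of the coding strand; the reverse
    primer is the reverse complement of its end, so it anneals to the
    coding strand and is extended back toward the start.
    """
    for site in (KPNI, SALI):
        if site in coding:
            raise ValueError("insert contains an internal " + site + " site")
    forward = KPNI + coding[:anneal]
    reverse = SALI + reverse_complement(coding[-anneal:])
    return forward, reverse


def pcr_product(coding):
    """Top strand of the amplified fragment, with a site added at each end."""
    return KPNI + coding + reverse_complement(SALI)


# Hypothetical short coding sequence (start codon ... stop codon).
coding = "ATGGCTAGCAAGGAGGAACTGTTCACCTAA"
forward, reverse = design_primers(coding)
product = pcr_product(coding)
print("forward primer:", forward)
print("reverse primer:", reverse)
print("PCR product   :", product)
```

Because the two ends of the product carry different sticky ends after digestion with KpnI and SalI, the fragment can only ligate into a vector cut with the same two enzymes in one orientation.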

4.3.4. Making complementary DNA (cDNA)

A double-stranded DNA copy of an mRNA is called a cDNA. Making cDNA is a way to convert a relatively labile single-stranded RNA into a relatively stable double-stranded DNA. It is possible to make a DNA copy of an RNA by employing reverse transcriptase, an enzyme involved in the replication of certain viruses. The other aspect of eukaryotic mRNAs that makes producing cDNAs relatively easy is the polyA tail, as we will see below. cDNAs can be made in several ways, but the method described here is a traditional method.

Step 1. Total RNA is extracted from cells using a standard technique for the organism in question.

Step 2. An oligo-dT primer is hybridized with the polyA tail of a eukaryotic mRNA. Then an enzyme called reverse transcriptase (which makes a DNA strand from an RNA strand) is used to make a first-strand DNA copy of the mRNA strand.

Step 3. The RNA is then partially degraded with RNase H, and RNA fragments are randomly annealed to the newly made DNA strand. These RNA fragments act as primers for DNA polymerase I.

Step 4. DNA polymerase I is then used to make a complementary DNA strand and to replace the RNA primers with DNA nucleotides.

Step 5. All pieces are then ligated together using DNA ligase, completing the synthesis of a double-stranded DNA copy of the mRNA.

Figure 4.11. The process for making cDNA in a simple cloning vector.

At completion of the procedure above you will have prepared a cDNA copy of each mRNA that was present in the cells from which you extracted the RNA. If there were 10,000,000 polyA tails on 10,000,000 mRNAs you should make 10,000,000 cDNAs. In other words, if there were 10,000 mRNAs in the preparation that coded for a given protein like myosin, but only 500 mRNAs coding for hexokinase and 10 mRNAs for tyrosyl-tRNA synthetase, you might expect that your cDNA library of sequences obtained from the cells you used would have 10,000, 500, and 10 cDNAs for the 3 proteins respectively. The frequency of occurrence of each mRNA is represented by the frequency of cDNAs in the cDNA library obtained from a given set of cells. Thus, information about the frequency of occurrence of mRNAs in cells can be obtained from analysis of such a cDNA library.

A similar cDNA library from different cells (e.g. different tissues, cells treated with a drug, or cells grown in a different environment, etc.) will show different levels of each cDNA based on the mRNAs found in those cells. The frequency of mRNAs found in a tissue is considered information about the expression of a gene. Expression information relates directly to the function of the transcription machinery in cells, and is critical functional genomic information, as we will see in a subsequent section of the book.

In order to store and subsequently utilize a cDNA library it is useful to produce a clone of each sequence in the library. Typically this involves putting the cDNAs into vectors, and putting the vectors into host cells, typically E. coli, such that each cell gets a single cDNA, which is amplified in that cell and all its clones.

4.3.5. Cloning a cDNA Library

A cDNA clone library is a useful tool to identify specific mRNAs found in a tissue and to obtain the sequences of identified genes. To do this, a cDNA clone library can be created by cloning all cDNAs into a vector and putting one vector containing an individual cDNA in each cell. These cells can be screened to determine which clones express genes of interest.

Various types of vectors can be used to create a cDNA clone library. These include phage expression vectors, plasmid expression vectors, or shuttle vectors, depending on the intended use of the clone library. We will look at a protocol for incorporation of cDNA into a plasmid expression vector, using a simple strategy. Note that kits are now available that provide everything you require and outline specific strategies for most types of vectors should you ever need to accomplish this task.

Step 1. Prepare a cDNA library as outlined in section 4.3.4.

Step 2. Manipulate the cDNAs so that each one has a unique restriction site (one not cut within any cDNA) at both ends. To do this, the cDNAs are frequently methylated with a specific methyl transferase that incorporates a methyl group into particular restriction sites to protect them from the restriction enzyme that will be used later.

Step 3. A synthetic double-stranded linker is then ligated to the ends of each cDNA. The linker should correspond to a restriction site in the MCS of the vector to be used. Blunt-end ligation is generally a low-efficiency process, but by using a high concentration of these synthetic linkers, it is possible to drive the reaction to near completion.

Step 4. Digest the cDNAs (with internal sites protected and linkers attached) with the restriction enzyme to generate the appropriate overhanging sticky ends.

Step 5. Mix the digested cDNAs with the predigested vector, and add DNA ligase to make recombinant cDNA vectors.

Step 6. Transform the recombinant vectors into host cells, and grow up clones.

Figure 4.12. Procedure of inserting a cDNA into a cloning vector involving ligation of linkers on the ends of the cDNA (Steps 3, 4, and 5 are labeled in the figure).

Once the cDNA clone library has been constructed, a number of strategies can be used to select a specific clone that contains a gene of interest. Figure 4.13 demonstrates how this could be done if antibodies against the protein of interest are available. Figure 4.14 shows a strategy for identifying a specific clone by complementation of a yeast mutant. Note that for this technique the cDNA library was constructed in a yeast shuttle vector.

Figure 4.13. Finding a specific cDNA clone using an expression library. Following transformation of cells with the cDNA expression library, transformants with inserts (white colonies) are selected, replated, and screened with antibodies against the protein of interest. Colonies producing antigenic proteins are then tested for the presence of the protein of interest, and the cDNA insert in that clone is characterized.

Figure 4.14. Strategy for identifying cDNA clones for a gene of interest (ARG1) using cDNAs (high MW DNA from an ARG1 yeast strain). Note the cDNAs need to be inserted into a yeast shuttle vector such that the ARG1 gene will be properly expressed and complement the arg1 mutant in the yeast strain used.

Because cDNAs are the exons of the gene (the parts that code for proteins), a cDNA clone library can be expressed in either prokaryotic or eukaryotic cells. However, there are sometimes (but relatively infrequently) complex issues that keep eukaryotic cDNAs from expressing functional proteins in prokaryotic cells. When this occurs the shuttle vector approach is necessary to get a functional protein produced in the library.

cDNA libraries have many uses, but comparison of cDNA sequences with the sequences of the corresponding genes is one way of demonstrating the positions of introns and exons in the genomic sequence (see Figure 4.15). By sequencing clones from a cDNA library, so-called expressed sequence tags (ESTs) are determined. The sequences of ESTs were critical to understanding functional components of genomes as they were being sequenced.

Figure 4.15. [Panel labels: DNA (Gene); Primary RNA Transcript; mRNA (cDNA).]

4.4. GENOMIC LIBRARIES

A genomic clone library, or genomic library, is a set of cloned sequences made by cloning the entire genome of an organism or organelle. One of several ways this can be done is by cutting the genomic DNA with one or more restriction enzymes, and ligating the pieces into a simple cloning vector as shown in Figure 4.9. A limitation of simple cloning vectors is the size of DNA that can be introduced into the cell by transformation. This presents problems when you are trying to create a genomic library of a large genome such as that of most eukaryotes.

Remember that a genomic library contains all of the DNA found in the cells of the organism. If you digest organismal DNA to completion with a restriction enzyme, ligate those fragments into a plasmid vector, and transform bacterial cells, only a portion of those fragments will be represented in the final transformation products. If a gene of interest is larger than the clonable fragment length, then you will not be able to isolate that gene intact from a plasmid library.

What can be done to increase the probability of obtaining a clone that contains the entire gene? First, you need to use a vector that can accept large fragments of DNA. Examples of these are bacteriophage and cosmid vectors, the relatively popular yeast artificial chromosome (YAC) vectors (see Figure 4.7b), and the bacterial artificial chromosome (BAC) vectors (see Figure 4.7a). While longer fragments of genomic DNA can be cloned in YAC vectors, these are less stable than the BAC vectors, making BACs the vectors most frequently used for genomic cloning.

4.4.1. Cloning in YAC Vectors

A goal of genomic sequencing is to obtain physical data about the genomic organization of DNA in a genome. Traditionally, this data has been obtained by a technique called chromosome walking. Walking can be performed by subcloning the ends of DNA inserted in a phage λ vector or cosmid vector and screening a library for new clones that contain the end sequences previously obtained. If this new clone overlaps a portion of the original clone, then the length of the DNA of interest is extended by the length of DNA in the second clone that is not found in the original clone. By performing these steps successive times, a long-distance map can be obtained. To clarify this concept, please view the Chromosome Walking short video.

This technique has difficulties, though. First, each step is technically slow. Second, if you use phage λ or cosmid clones, you might only extend the region of interest by 5-10 kb in each step of the walk. Finally, if any of the clones that are obtained contain repeated sequences, the subclone could lead you to another region of the genome that is not contiguous with the region of interest. This is because eukaryotic genomes have so-called repeated sequence DNA interspersed throughout their genomes.

Yeast artificial chromosomes can alleviate some of these problems because of the large amount of DNA (100-1000 kb) that can be cloned. However, YACs cannot speed up each step of the walk because the subcloning and screening steps cannot be accelerated. But YACs can easily extend the region of interest by 50-100 kb, and up to as much as 500 kb, per walking cycle. Thus a long-distance map of the region can be obtained in several steps. Secondly, although repetitive regions may be 10-20 kb in length, they are rarely longer than 50 kb. Thus a YAC with 100 kb will contain some region that is single copy, which can be used for further steps in the walk.

While YACs allow the cloning of the largest fragments possible, their relative instability has allowed the more stable BACs, which bear shorter recombinant fragments, to become the vector of choice for chromosome walking and subsequent sequencing.

4.4.2. Cloning in BAC Vectors

During the Human Genome Project, researchers had to find a way to reduce the entire human genome into chunks, as it was too large to be sequenced in one go. To do this they created a store of DNA fragments called a BAC library, specifically a human genome BAC library.

BAC stands for Bacterial Artificial Chromosome. These are small pieces of bacterial DNA that can be identified and copied within a bacterial cell and act as a vector, to artificially carry recombinant DNA into the cell of a bacterium, such as Escherichia coli.

In general, BAC clones carry inserts of DNA up to 300,000 bp in length. The bacteria are then grown to produce colonies that contain the same fragment of DNA in each cell of the colony. This is a BAC clone library. Individual BAC clone colonies can be stored until needed.

Figure 4.16. BAC Vector. Contains blue/white screening capability. Genomic DNA fragments up to 300,000 bp can be ligated into the MCS of the vector, which also contains a selectable marker and an F’ single-copy origin of replication.

Making a BAC library

To make a genomic Bacterial Artificial Chromosome (BAC) library:

Step 1. Isolate the cells containing the DNA you want to store. For animals, BAC libraries come from white blood cells.

Step 2. These isolated cells are then mixed with warm agarose, a jelly-like substance. The whole mixture is then poured into a mold and allowed to cool to produce a set of small blocks, each containing thousands of the isolated cells.

Step 3. The cells are then treated with enzymes to dissolve their cell membranes and release the DNA into the agarose gel. A restriction endonuclease is used to chop the DNA into pieces around 200,000 base pairs in length (a partial digestion, as opposed to a complete digestion, which would produce smaller fragments).

Step 4. These blocks of gel containing chopped-up DNA are then inserted into holes in a slab of agarose gel. The DNA fragments are then separated according to size by electrophoresis.

Step 5. Fragments of a particular size class (200,000 to 300,000 bp) are selected, removed from the agarose gel, and inserted into a BAC vector using DNA ligase to join the two bits of DNA together. This produces a set of BAC clones.

Step 6. The BAC clones are added to bacterial cells, usually E. coli, and the bacteria are then spread on nutrient-rich plates that allow only the bacteria that carry BAC clones to grow. The bacteria grow rapidly, resulting in lots of bacterial cells, each containing a copy of a separate BAC clone.

Step 7. After they have grown, the bacteria are then ‘picked’ into plates of 96 or 384 wells so that each well contains a single BAC clone. The bacteria can also be copied or frozen and kept until researchers are ready to use the DNA for sequencing. A BAC library has been created.

4.5. DNA SEQUENCING (RETURN)

The original techniques for sequencing DNA molecules were developed by Fred Sanger in the 1970’s. Figure 4.16. BAC Vector. Contains blue/white screening capability. Genomic DNA fragments up to 300,000 bp can be ligated into the MCS Sanger’s method, which we will look at in section 4.5.2, of the vector which also contains a selectable marker and an F’ single relies on determining the last nucleotide added as DNA copy origin of replication. polymerase is copying a DNA molecule, and then separating these nucleotides that are but one nucleotide different in length from each other using a technique Step 5. Fragments of a particular size class (200,000 known as electrophoresis. to 300,000 bp) selected, removed from the agarose gel CONCEPTS OF GENOMIC BIOLOGY Page 4-20 From Sanger’s original work, the process was automated, and such robotic sequencers were used to generate the first human genome sequence obtained by the original . Subsequently, sequencing technology has been dramatically changed to both lower the cost of sequencing and increase the speed of sequencing using so called “Next Generation Sequencing”. We will look at these techniques in today’s lab.

4.5.1. Electrophoresis

Gel electrophoresis is an analytical technique used to separate DNA or RNA fragments by size and reactivity. Nucleic acid molecules to be analyzed are separated in a viscous medium, typically a gel of some type. An electric field is applied across the gel, causing the nucleic acids to migrate toward the anode due to the net negative charge of the sugar-phosphate backbone of the nucleic acid chain. The separation of nucleic acid fragments is accomplished by exploiting the different mobility of different sized molecules as they pass through the gel. Longer molecules migrate more slowly because they experience more resistance within the gel. Smaller fragments migrate further in the same time and end up nearer to the anode than longer ones (see Figure 4.17).

Figure 4.17. Electrophoretogram showing the migration of smaller molecules to the anode (+) at the bottom of the gel. The molecules to be separated are loaded at the top of the gel near the cathode (-). Larger molecules remain near the cathode. On the right side of the gel, a set of molecules of known molecular size (length) are run. By comparing the mobility of an unknown molecule with the molecules of known length, the size of the unknown fragments can be estimated.

For the highest resolution of similar sized fragments, as required for DNA sequencing, either the voltage or the run time can be varied. Extended runs at low voltage yield the most accurate resolution, and sequencing gels can be 1 m in length.
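The size comparison described in the caption of Figure 4.17 can be sketched in a few lines of Python. This is only an illustration: it assumes that log10 of fragment length falls roughly linearly with migration distance (a common working approximation that is not stated in this chapter), and the ladder values and the names `LADDER`, `fit_ladder` and `estimate_length` are invented for the example.

```python
import math

# Hypothetical size ladder: (migration distance in mm, fragment length in bp).
# These values are invented for illustration; they are not measured data.
LADDER = [(10.0, 10000), (20.0, 1000), (30.0, 100)]


def fit_ladder(ladder):
    """Least-squares fit of log10(length) = a + b * distance.

    Assumes log10(length) is approximately linear in distance migrated.
    """
    xs = [distance for distance, _ in ladder]
    ys = [math.log10(length) for _, length in ladder]
    n = len(ladder)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope: negative, longer fragments travel less far
    a = mean_y - b * mean_x  # intercept
    return a, b


def estimate_length(distance, a, b):
    """Convert a migration distance back into an estimated length in bp."""
    return 10 ** (a + b * distance)


a, b = fit_ladder(LADDER)
unknown_bp = estimate_length(25.0, a, b)
print(f"Estimated length: {unknown_bp:.0f} bp")
```

With this invented ladder, a band that ran 25 mm falls between the 1,000 bp and 100 bp standards, so the estimate lands a little above 300 bp. Real gels deviate from the straight-line assumption for very large and very small fragments, which is one reason the ladder should bracket the fragment of interest.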


4.5.2. Sanger Dideoxy Sequencing

The method of DNA sequencing invented by Fred Sanger is a truly revolutionary technique. He was rewarded for his ingenuity with the Nobel Prize in 1980.

The specific steps of Sanger’s method are given below. Note that a video describing this process is also available.

Step 1. The DNA double helix is ‘denatured’ (broken down) with heat or chemicals to separate the two strands. These will then act as templates for DNA synthesis using DNA polymerase and a primer similar to what is used in PCR.

Step 2. To the mixture of template, primer, and DNA polymerase, the four dNTPs (dA, dC, dG and dT) are added. One or more of these nucleotides is radioactively labelled so that any DNA that is synthesised can be detected.

Step 3. Once the sequencing reaction has begun, versions of the dNTPs containing a hydrogen atom on both the 2’ and 3’ carbons of deoxyribose (see Figure 4.18), known as dideoxynucleotides (ddNTPs) or chain terminators, are also added in small amounts. Four identical reactions are run at the same time, but ddA is added to one, ddG to the second, ddC to the third, and ddT to the last reaction. Terminators stop DNA synthesis since they lack a 3’-OH group for the next nucleotide to fasten to. So, the 'A' terminator will stop DNA synthesis when an 'A' base is added (the 'C' terminator will stop DNA synthesis when a 'C' base is added, and so on).

Step 4. This results in a mixture of pieces of radioactive DNA of various lengths, but within each reaction all of the pieces end in the same base, i.e. the ddBase added to that reaction.

Step 5. The four different reactions are then loaded onto separate lanes of an acrylamide gel and the DNA pieces are separated according to size by a process called electrophoresis (see section 4.5.1).

Step 6. Upon completion of the electrophoresis, the radioactively labelled DNA is visualized by exposing the gel to X-ray film. The radioactively labelled DNA will make the film turn black at a position corresponding to its position in the gel. This exposed film is called an autoradiogram. Each band on the film corresponds to where a specific ddBase was added in each of the reactions (ddA, ddC, ddG or ddT). You can therefore read off the sequence of the DNA from the bottom of the film, since you know the nucleotide that must be at the end of each fragment.

Figure 4.18. a) A regular deoxynucleotide triphosphate (dNTP) with a 3’-OH group. b) A dideoxynucleotide triphosphate (ddNTP). Since ddNTPs have no 3’-OH group, it is not possible for DNA polymerase to add more nucleotides to the growing nucleotide chain, and DNA synthesis is terminated at that base.

Figure 4.19. Four sequencing reactions terminated with ddA, ddC, ddG, and ddT are loaded onto a gel, and after fragments are separated, an autoradiogram demonstrates the positions of the fragments with known end nucleotides.

Figure 4.20. A Sanger Dideoxy sequencing gel showing results for 10 sequences (x4 reactions).

Note that this technique was very popular in its day, but it has several major drawbacks, including: 1) the necessity of using radioactivity; 2) eye strain from reading the gel manually, leading to frequent errors; 3) fragments near the top of the gel cannot be read accurately, and in general discontinuities in the gel can create errors; 4) the method was not easily automated because it was tedious and time consuming. In general, with great effort it was possible to obtain about 500-700 nt of sequence from most gels, but this often took months.

Imagine that this was the “state of the art” at the time the Human Genome Sequencing Project began. Obtaining 3.2 billion bp of human sequence at 3 man-months per 700 bp would require about 1 million man-years of labor. Thus, improved technology was required to make the project successful. Though not really appreciated by the general population, this project was the biological equivalent of putting a man on the moon.

4.5.3. Capillary Sequencing

Two significant innovations made it possible to automate DNA sequencing, reduce costs, and increase efficiency, making sequencing of virtually any genome a reality.

The first of these innovations was the addition of fluorescent chromophores to the dideoxy NTPs. A different chromophore is attached to each base, and each chromophore fluoresces at a different color. This means that only one reaction is needed instead of four: as the differently colored ddNTPs terminate the reactions, the molecules will have different fluorescent colors depending on the terminating nucleotide (see Figure 4.21, left panel).

The second innovation was the replacement of gel electrophoresis with electrophoresis through long thin acrylic-fiber capillaries (tubes with very narrow pores through which liquids can pass). These capillaries are far more uniform and consistent as an electrophoresis medium, and because they are less temperature sensitive, higher voltages can be employed, making separation faster and more reproducible. Additionally, a laser can be used to generate the fluorescence, and this can be done while the nucleotides remain in the capillary.

In capillary sequencing machines, DNA fragments are separated by size through a long, thin, acrylic-fibre capillary. A sample containing fragments of DNA labeled with the different chromophores described above is injected into the capillary. Once the sample has been injected, an electric field is applied to drive the DNA fragments through the capillary toward the anode, as in gel electrophoresis.

A fluorescence-detecting laser, built into the automated sequencing machine, then shoots through the capillary fiber at the end, causing the colored tags on the DNA fragments to fluoresce. Each fluorescent terminator base produces a different color: A = Green, C = Blue, G = Yellow and T = Red. The color of the fluorescent bases is detected by a camera as they migrate through the capillary, and the bases are recorded by the sequencing machine as the electrophoresis proceeds. The colors of the bases are then displayed on a computer as a graph of different colored peaks (see Figure 4.21, right panel).

Figure 4.21. On the right is a capillary sequencer trace showing the nucleotides seen by the laser scan. On the left is a classical gel made using fluorescent nucleotides rather than radioactivity to demonstrate the principle of the capillary sequencer.

This technology is readily amenable to mechanization, and modern capillary sequencers can dependably run dozens of samples in parallel through multiple capillaries simultaneously. The process is also much faster, and thus multiple runs can be made daily through each capillary. The robots automating these sequencers work 24 hours a day, 7 days a week, and data is collected and stored directly with no tedious manual gel reading involved. The human genome took about 10 years to sequence 3.2 billion bases at a cost of approximately $3 billion.

Today we have even faster sequencers that do not use electrophoresis, and generate sequences even faster and more inexpensively. This is ….

4.5.4. Next Generation Sequencing

Next-generation sequencing (NGS) is a fundamentally different approach to DNA sequencing, cutting the time and cost needed to sequence a genome. Using capillary sequencing it costs about $1 million to sequence 1 million bp, and it took about 10 years to sequence the first human genome. NGS costs about $0.60 per million bp, and can do the job in about 1 day.

The principles of NGS are in some ways similar to capillary sequencing, where the bases of a small section of DNA are identified and recorded. However, rather than being limited to just a few DNA fragments, next-generation sequencing extends this process so that millions of samples can be sequenced, all at the same time. For this reason it is sometimes called massively parallel sequencing (MPS). As a result, large amounts of DNA can be sequenced at rapid speed. With some next-generation sequencing machines researchers can sequence more than five human genomes per machine in just under a week.

Next-generation sequencing gives scientists the ability to compare the genomes of many different individuals. With the latest technologies, we can study the genomes from all sorts of people to provide us with the data needed to compare them and uncover the genetic causes of cancer, diabetes, schizophrenia and other diseases. We can also explore the genomes of things that cause human disease such as viruses, bacteria and other pathogens.

There are at least 4 different NGS sequencing technologies. Each has its advantages and disadvantages, but 2 technologies have emerged as the most useful, namely Illumina sequencing-by-synthesis and the Roche 454 sequencing technology. All of the NGS sequencing technologies share several features, as illustrated in the accompanying video; these are:

1. Sample preparation. Fragments of uniform length are generated and adapter sequences are ligated onto the ends of the fragments.

2. Attachment of sequences to a matrix using a technique called “bridge PCR” that amplifies a sequence in a specific region of the support matrix in a cluster. This produces millions to billions of sequence locations where specific clusters of sequences are attached to a solid support matrix.

3. Raw sequence data collection is accomplished by various techniques depending on the particular technology that is employed. In general the data collection process records the sequence being generated from each cluster at each of the millions of locations on the matrix simultaneously, and saves these sequences for subsequent analysis.

Each sequencing technology involves different chemistry leading to the generation of sequences. The specific chemistries that can be used include: pyrosequencing chemistry used by Roche 454 sequencers, sequencing-by-synthesis chemistry used by Illumina sequencers, ion semiconductor sequencing used by Ion Torrent sequencers, and sequencing-by-ligation used by ABI SOLiD sequencers (this technology is no longer available although it is described in the video above).

Note that each of these sequencing technologies delivers millions to billions of base pairs of reads in a relatively short period of time (days), and does so at varying, but relatively low, costs per base sequenced. Read length varies according to the technology used, but typically 100 to 400 bases are obtained per read. The data generated are very large data files that must be used to generate the longer genomic or cDNA sequences that are biologically meaningful.

NGS technology, regardless of type, has revolutionized DNA sequencing, but simultaneously places a burden on available computational technology in order to assemble billions of short reads into whole genomic sequences. Nevertheless, the ability to generate such massive amounts of sequence has made this a very successful technology.

4.5.5. Third Generation Sequencing

Although this technology is still emerging, it could soon be a reality, further advancing the role of DNA sequencing in all branches of the life sciences.

With third generation sequencing, sequencing a genome will become a cheaper, faster and more sophisticated process. No sooner had next-generation sequencing reached the market than a third generation of sequencing was being developed.

One of these new technologies was developed by Pacific Biosciences and is called Single-Molecule Sequencing in Real Time (SMRT). This system involves a single-stranded molecule of DNA that attaches to a DNA polymerase enzyme. The DNA is sequenced as the DNA polymerase adds complementary fluorescently-labelled bases to the DNA strand. As each labelled base is added, the fluorescent color of the base is recorded before the fluorescent label is cut off. The next base in the DNA chain can then be added and recorded.

SMRT is very efficient, which means that fewer expensive chemicals have to be used. It is also incredibly sensitive, enabling scientists to effectively ‘eavesdrop’ on DNA polymerase and observe it making a strand of DNA.

SMRT can generate very long reads of sequence (10-15 kilobases) from single molecules of DNA, very quickly. Producing long reads is very important because it is easier to assemble genomes from longer fragments of DNA.

With the introduction of such sensitive and cheap sequencing methods scientists can now begin to re-sequence genomes that have already been sequenced to achieve a higher level of accuracy. For example, using SMRT, Escherichia coli has now been sequenced to an accuracy of 99.9999 per cent!

Sequencing the human genome in this way won’t be possible for a while, but when it is, scientists predict that it will be possible to sequence an entire human genome in about an hour. Imagine the clinical applications of this technology. A doctor or pharmacist may be able to identify a critical gene that leads to an accurate drug prescription by sequencing your genome in the office while you wait.

Figure 4.20. A graph showing how the speed of DNA sequencing technologies has increased since the early techniques in the 1980s. Image credit: Genome Research Limited.

4.6. DNA SEQUENCING STRATEGIES

Beyond the method for generating DNA sequences, it is necessary to have a strategy for how to employ DNA sequencing technology. Strategies for DNA sequencing depend on the features and size of the genome that is being sequenced and the available technology for doing the sequencing. As part of the Human Genome Project two general approaches emerged as most useful and valuable. One of these strategies, the map-based approach, was employed by the publicly funded sequencing effort that involved scientists from around the world. The other strategy, developed by a privately funded group at Celera Genomics and called whole genome shotgun sequencing, was perhaps faster and cheaper than the map-based approach, but does not work efficiently with large genomes, though it is very useful for smaller genomes. In fact, today these approaches are “hybridized” or combined to obtain the advantages of both strategies.

4.6.1. Map-based Sequencing

The map-based or clone-contig mapping sequencing approach was the method originally developed by the publicly funded Human Genome Project sequencing effort. The rationale for this method is that it is the “best” method for obtaining the sequence of most eukaryotic genomes, and it has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. Though it is relatively slow and expensive, this method provides dependable high-quality sequence information with a high level of confidence.

In the clone-contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial digestion with a restriction endonuclease (section 4.1), and these are cloned in a high-capacity vector such as a BAC or a YAC vector (section 4.2.5). A clone contig map is made by identifying clones containing overlapping fragments bearing mapped sequence markers. These markers were originally identified using a combination of conventional genetic mapping, FISH cytogenetic mapping, and radiation hybrid mapping. Subsequently, common practice has been to use chromosome walking to make the clone-contig library. In this approach, sequence markers are generated from BAC ends, and a map of BAC-end sequences is subsequently made. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, RFLPs, and genes) known to be present in a particular region.

Once the clone library and contig map have been developed, relevant clones are sequenced using the shotgun method described below. These sequenced contigs are then aligned using the markers and overlapping sequences on the clones to position each clone.

Figure 4.22. Clone contig mapping of a series of YAC clones containing human DNA.
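The idea of positioning clones by the mapped markers they share can be illustrated with a small Python sketch. Everything in it is hypothetical: the marker names, the clone names and the helper functions `order_clones` and `find_gaps` are invented for the example, and real contig-building software must also cope with mis-scored markers, chimeric clones and clones nested entirely inside others.

```python
# Hypothetical order of mapped markers (e.g. STSs) along one chromosome region.
MARKER_ORDER = ["M1", "M2", "M3", "M4", "M5", "M6"]

# Hypothetical clones and the mapped markers each clone was found to carry.
CLONES = {
    "BAC_C": {"M4", "M5"},
    "BAC_A": {"M1", "M2"},
    "BAC_D": {"M6"},
    "BAC_B": {"M2", "M3", "M4"},
}


def order_clones(clones, marker_order):
    """Sort clone names by the map positions of the markers they carry."""
    position = {marker: i for i, marker in enumerate(marker_order)}

    def span(name):
        hits = [position[m] for m in clones[name]]
        return (min(hits), max(hits))  # leftmost and rightmost marker

    return sorted(clones, key=span)


def find_gaps(ordered, clones):
    """List neighbouring clone pairs that share no marker (no proven overlap)."""
    return [
        (left, right)
        for left, right in zip(ordered, ordered[1:])
        if not (clones[left] & clones[right])
    ]


ordered = order_clones(CLONES, MARKER_ORDER)
gaps = find_gaps(ordered, CLONES)
print("Clone order:", ordered)
print("Neighbours without a shared marker:", gaps)
```

In this toy data set BAC_A, BAC_B and BAC_C overlap through markers M2 and M4 and so form one contig, while BAC_D shares no marker with its neighbour, which flags a gap that would need another clone (for example, one found by chromosome walking) to close.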

4.6.2. Whole Genome Shotgun Sequencing

In the whole genome shotgun approach, smaller randomly produced fragments (1,500-2,000 bp) were produced, cloned, and sequenced. These sequences were then assembled based on random overlap into a genome sequence. Typically, some regions are not well sequenced, and specific sequencing is done to fill in the gaps that cannot be assembled from the randomly made pieces.

Figure 4.21. Schematic diagram of sequencing strategy used by the publicly funded Human Genome Project. The DNA was cut into 150 kb fragments and arranged into overlapping contiguous fragments. These contigs were cut into smaller pieces and sequenced completely.

Figure 4.23. Schematic diagram of sequencing strategy used by Celera Genomics. The DNA was cut into small pieces and sequenced completely. These fragments were organized into contigs based on overlapping sequences.

The shotgun method is faster and less expensive than the map-based approach, but the shotgun method is more prone to errors due to incorrect assembly of the random fragments, especially in larger genomes. For example, if a 500 kb portion of a chromosome is duplicated and each duplication is cut into 2 kb fragments, then it would be difficult to determine where a particular 2 kb piece should be located in the finished sequence. This might seem trivial, but duplications seldom retain their original sequences. They tend to develop SNPs over time, and this can generate difficulties in the proper assembly of these duplicated sequences.

Which method is better? It depends on the size and complexity of the genome. With the human genome, each group involved believed its approach was superior to the other, but a hybrid approach is now being used routinely. The advent of next generation sequencing allows the use of fragment-end short read sequencing with much more powerful computer-based assemblers generating finished sequences. However, the method still requires at least some second round sequencing to obtain a completely sequenced genome.

4.7. GENOME ANNOTATION

Once a genome sequence is obtained using one or more of the strategies outlined in the preceding sections, the hard work of deciding what the sequence means begins. Typically, to make such tasks easier, some type of database is created that ultimately shows the entire sequence, the location of specific genes in that sequence, and some functional annotation as to the role that each gene has in an organism. The databases at NCBI are a critical repository for these types of information, but there are many other specific and perhaps more detailed repositories of this type of information.

The process routinely begins with the implementation of what is termed a Gene Finding bioinformatic pipeline. The separate parts of such a pipeline are described below.

4.7.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes

A first approximation of gene locations in the genomic sequence is usually made using a gene prediction program to predict gene beginning and ending points, transcriptional and translational start and stop sites, intron and exon locations, and polyA addition sites. Often such programs also produce the sequence of the putative transcript, and/or the mature mRNA and the amino acid sequence of the protein coded for by the gene.

Many gene prediction programs are so-called neural network programs that are capable of “learning” what algorithms to use to decide the sequence of a gene. Such programs are trained on known sequences, and once trained are used to predict gene regions. After predicting, input is given back concerning errors that were made. As the programs are used they refine and improve their predictive power.

4.7.2. Comparison of predicted sequences with known sequences (at NCBI)

Once putative coding genes are predicted, the next step is to compare the predicted mRNA (cDNA) sequences with known coding sequences in publicly available libraries.

This can be done with a number of possible tools, but one of the best is the Basic Local Alignment Search Tool (BLAST) utility at NCBI. By taking your predicted peptide and/or nucleotide sequence and submitting it to a BLAST search of the nr (protein) or nt (nucleotide) sequence database, you can learn what sequences available at NCBI are most similar to your sequence. When you do a BLASTP (protein) comparison, you are also shown conserved domains found in your protein.

Recall that conserved domains are amino acid sequences that are conserved in various types of proteins. Thus, BLAST searches can inform you of a number of interesting and useful sequence features that are found in your submitted sequence. Also note that if a cDNA sequence library or libraries is/are available from the organism you are working with, and if a related sequence from a previously cloned gene is available at NCBI, you can also learn about previously known cDNA or other sequences found in all of the databases at NCBI from this BLAST search. This becomes a critical method for learning what your gene does.

Also note that if you are working with a rare organism where little sequence information is available, you can construct and sequence your own cDNA library to provide information about protein coding genes in your organism.

The other thing you can learn from comparing the predicted cDNA sequence with the actual sequences found in databases is how accurate the prediction made by the prediction program was. This can lead to editing the predicted gene to show the actual sequence that is found by BLAST searching, when this is appropriate based on the available data.

As we learn more information about each gene, more literature related to your gene is published, and appears in the PubMed database at NCBI or in other NCBI databases. Since NCBI maintains an interlocking series of databases, the BLAST search itself gives you access to a large body of information about sequences related to your predicted sequence and to the actual gene that you discovered in the genome that was sequenced.

4.7.3. Published Genomes

Once such preliminary analyses have been performed, the data needs to be shared with the applicable communities (scientific, medical, clinical, students, the interested public, etc.) to whom the information is useful. The Genomes database at NCBI is a resource where this is done.

Note that genomic databases at NCBI and elsewhere are continually evolving, and new information is added as it becomes available. This can make it difficult to understand what you find, but with care you can follow the process and wind up with the best information available.
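As a concrete, and deliberately simplified, illustration of the gene-finding step described in section 4.7.1, the Python sketch below scans a DNA sequence for open reading frames (an ATG start codon followed in the same frame by a stop codon). This is not how the neural network gene predictors mentioned above work: it ignores introns, the reverse strand, promoters and polyA sites, and the function name `find_orfs`, the `min_codons` threshold and the test sequence are all invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for simple forward-strand ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons is the minimum number of codons before the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical sequence: ATG AAA TTT GGG TAA embedded in flanking bases.
example = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(example):
    print(f"frame {frame}: {start}-{end} {example[start:end]}")
```

A predicted ORF such as the one found here would then be translated and compared against known sequences, which is the BLAST comparison step described in section 4.7.2.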