
Chapter 5. Structural Genomics

Contents

5. Structural Genomics
5.1. DNA Sequencing Strategies
5.1.1. Map-based Strategies
5.1.2. Whole Genome Shotgun Sequencing
5.2. Genome Annotation
5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes
5.2.2. Comparison of predicted sequences with known sequences (at NCBI)
5.2.3. Published Genomes
5.3. DNA Sequence Polymorphisms
5.3.1. Simple Sequence Repeats (SSRs)
5.3.2. RFLPs are a Special Type of SNP
5.3.3. Detecting SNPs
5.3.4. Uses of DNA Polymorphisms
5.4. Mutations
5.4.1. Point Mutations – Base Substitutions
5.4.2. Point Mutations in Protein Coding Sequences
5.4.3. Point Mutations – Base Insertions or Deletions

CONCEPTS OF GENOMIC BIOLOGY Page 5-1

CHAPTER 5. STRUCTURAL GENOMICS

Genomic Biology has three important branches: Structural Genomics, Comparative Genomics, and Functional Genomics. The ultimate goals of these branches of genomics are, respectively, the sequencing of genes and genomes; the comparison of these sequenced genes and genomes across all organisms, with the aim of understanding evolutionary relationships; and understanding how genes and genomes work to produce complex phenotypes, including regulation and environmental signaling.

A set of molecular genetic technologies is critical to our ability to pursue the goals described above. The Genomic Tool Kit provides a brief understanding of these critical tools and how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, an understanding of what they can do and how they work can be readily studied using bioinformatic resources.

5.1. DNA SEQUENCING STRATEGIES

Beyond the method for generating DNA sequences, it is necessary to have a strategy for how to employ DNA sequencing technology. Strategies for DNA sequencing depend on the features and size of the genome being sequenced and on the technology available for doing the sequencing. As part of the Human Genome Project, two general approaches emerged as most useful and valuable. One of these strategies, the map-based approach, was employed by the publicly funded sequencing effort that involved scientists from around the world. The other strategy, called whole genome shotgun sequencing, was developed by a privately funded group at Celera Genomics. It was perhaps faster and cheaper than the map-based approach, but it does not work efficiently with large genomes, though it is very useful for smaller genomes. In fact, today these approaches are “hybridized” or combined to obtain the advantages of both strategies.

5.1.1. Map-based Sequencing

The map-based, or clone-contig mapping, sequencing approach was the method originally developed by the publicly funded Human Genome sequencing effort.
The rationale for this method is that it is the “best” method for obtaining the sequence of most eukaryotic genomes, and it has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. Though it is relatively slow and expensive, this method provides dependable, high-quality sequence information with a high level of confidence.

In the clone-contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial digestion with a restriction endonuclease (section 4.1), and these are cloned in a high-capacity vector such as a BAC or a YAC vector (section 4.2.5). A clone contig map is made by identifying clones containing overlapping fragments bearing mapped sequence markers. These markers were originally identified using a combination of conventional genetic mapping, FISH cytogenetic mapping, and radiation hybrid mapping. Subsequently, common practice became to use chromosome walking as an approach to making a clone-contig library. Using this approach, sequence markers are generated from BAC ends, and a map of BAC-end sequences is subsequently made. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, RFLPs, and genes) known to be present in a particular region.

Figure 5.1. Clone contig mapping of a series of YAC clones containing human DNA.

Once the clone library and contig map have been developed, relevant clones are sequenced using the shotgun method described below (Figure 5.2). These sequenced contigs are then aligned, using the markers and overlapping sequences on the clones, to position each clone.

5.1.2. Whole Genome Shotgun Sequencing

In the whole genome shotgun approach, smaller randomly produced fragments (1,500-2,000 bp) were produced, cloned, and sequenced.
These sequences were then assembled, based on random overlap, into a genome sequence. Typically, some regions are not well sequenced, and specific sequencing is done to fill in the gaps that cannot be assembled from the randomly made pieces.

The shotgun method is faster and less expensive than the map-based approach, but it is more prone to errors due to incorrect assembly of the random fragments, especially in larger genomes. For example, if a 500 kb portion of a chromosome is duplicated and each duplication is cut into 2 kb fragments, it is difficult to determine where a particular 2 kb piece should be located in the finished sequence. This might seem trivial, but duplications seldom retain their original sequences. They tend to develop SNPs over time, so assigning a fragment to the wrong copy generates errors in the assembly of these duplicated sequences.

Which method is better? It depends on the size and complexity of the genome. With the human genome, each group involved believed its approach was superior to the other, but a hybrid approach is now used routinely. The advent of next generation sequencing allows the use of fragment-end short read sequencing, with much more powerful computer-based assemblers generating finished sequences. However, the method still requires at least some second-round sequencing to obtain a completely sequenced genome.

Figure 5.2. Schematic diagram of the sequencing strategy used by the publicly funded Human Genome Project. The DNA was cut into 150 kb fragments and arranged into overlapping contiguous fragments. These contigs were cut into smaller pieces and sequenced completely.

5.2. GENOME ANNOTATION

Once a genome sequence is obtained using one or more of the strategies outlined in the preceding sections, the hard work of deciding what the sequence means begins. Typically, to make such tasks easier, some type of database is created that ultimately shows the entire sequence, the location of specific genes in that sequence, and some functional annotation as to the role that each gene has in the organism. The databases at NCBI are a critical repository for these types of information, but there are many other specific and perhaps more detailed repositories of this type of information.

The process routinely begins with the implementation of what is termed a gene finding bioinformatic pipeline. The separate parts of such a pipeline are described below.

5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes

A first approximation of gene locations in the genomic sequence is usually made using a gene prediction program to predict gene beginning and ending points, transcriptional and translational start and stop sites, intron and exon locations, and polyA addition sites. Often such programs also produce the sequence of the putative transcript and/or the mature mRNA and protein sequence coded for by the gene.

Many such programs are so-called neural network programs that are capable of “learning” what algorithms to use to decide the sequence of a gene. Such programs are trained on known sequences and, once trained, are used to predict gene regions. After predicting, input is given back concerning the errors that were made. As the programs are used, they refine and improve their predictive power.

5.2.2. Comparison of predicted sequences with known sequences (at NCBI)

Once putative coding genes are predicted, the next step is to compare the predicted mRNA (cDNA) sequences with known coding sequences in publicly available libraries.

This can be done with a number of possible tools, but one of the best is the Basic Local Alignment Search Tool (BLAST) utility at NCBI. By taking your predicted peptide and/or nucleotide sequence and submitting it to a BLAST search of the nr (protein) or nt (nucleotide) sequence database, you can learn which sequences available at NCBI are most similar to your sequence. When you do a BLASTP (protein) comparison, you are also shown conserved domains found in your protein.

Recall that conserved domains are amino acid sequences that are conserved in various types of proteins. Thus, BLAST searches can inform you of a number of interesting and useful sequence features that are found in your submitted sequence. Also note that if a cDNA sequence library is available from the organism you are working with, and if a related sequence from a previously cloned gene is available at NCBI, you can also learn about previously known cDNA or other sequences found in all of the databases at NCBI from this BLAST search. This becomes a critical method for learning what your gene does.

Also note that if you are working with a rare organism for which little sequence information is available, you can construct and sequence your own cDNA library to provide information about protein coding genes in your organism.

The other thing you can learn from inspection of the predicted cDNA sequence and the actual sequence found in databases is how accurate the prediction made by the prediction program was. This can lead to editing the predicted gene to show the actual sequence found by BLAST searching, when this is appropriate based on the available data.

As we learn more about each gene, more literature related to your gene is published and appears in the PubMed database at NCBI or in other NCBI databases. Since NCBI provides an interlocking series of databases, the BLAST search itself gives you access to a large body of information about sequences related to your predicted sequence and to the actual gene that you discovered in the genome that was sequenced.

5.2.3. Published Genomes

Once such preliminary analyses have been performed, the data need to be shared with the applicable communities (scientific, medical, clinical, students, the interested public, etc.) to whom the information is useful. The Genomes database at NCBI is a place where this is done.

Note that genomic databases at NCBI and elsewhere are continually evolving, and new information is added as it becomes available. This can make it difficult to understand what you find, but with care you can follow the process and wind up with the best information available.

5.3. DNA SEQUENCE POLYMORPHISMS

When the contig-cloning approach to genome sequencing was developed (see Section 5.1.1. Map-based Strategies), it quickly became clear that many types of sequence-tagged-site (STS) markers already existed in the genome that were based on some type of DNA sequence polymorphism. That is to say, known sequence variations could be found in almost every genome. Often such sequence polymorphisms were identified prior to obtaining genomic sequences for many organisms.

However, with the sequencing of multiple individual genomes of a number of organisms, including humans, it became clear that such polymorphic DNA sequences scattered throughout the genome were important sequence markers, useful for relating important genes to given chromosomal locations and subsequently for mapping important phenotypes. Recall that we discussed the use of single nucleotide polymorphisms (SNPs) in quantitative and population genetics in sections 2.6 and 2.7. In this chapter we will examine the major types of DNA polymorphisms in greater detail.

5.3.1. Simple Sequence Repeats (SSRs)

Simple Sequence Repeats (SSRs), Sequence-Tagged Microsatellite Sites (STMS, or simply microsatellites), Simple Sequence Repeat Polymorphisms (SSRPs), Variable Number Tandem Repeats (VNTRs), and Short Tandem Repeats (STRs) are all names that have been used to describe polymorphic loci present in nuclear DNA and some organellar DNA. SSRs consist of repeated sequence units of 1-10 base pairs, most often 2-3 bp in length.

SSRs are highly variable in the number of repeated units found, and they are typically evenly distributed throughout genomes.

SSR-type polymorphisms are most frequently revealed using PCR (Figure 5.3). PCR primers are made for the DNA sequences flanking the repeated region, since the repeats themselves may occur at multiple locations in the genome. The flanking regions tend to be conserved within a species, but they may also be conserved across higher taxonomic levels. By determining the size of the PCR fragments, the number of repeated units at a given locus in a given individual can be determined.

SSR polymorphisms are believed to arise in the genome as a result of a mutational process. Also, because of the repeat nature of the SSR, it is possible that a DNA replication error leads to extra duplication of the repeated unit. For more detail on STS and related tool usage, visit the STS Tools page at the Probe Database at NCBI.
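The relationship between PCR fragment size and repeat number can be illustrated with a minimal Python sketch. It assumes that the combined length of the flanking sequence amplified with the repeat is known; the function names, flank sequences, and allele sizes are hypothetical and chosen only for illustration.

```python
import re


def repeats_from_fragment_size(fragment_bp, flanking_bp, unit_bp):
    """Infer the number of repeat units from a PCR product size.

    flanking_bp is the combined length of the non-repeat sequence
    (primers plus flanks) included in the PCR product.
    """
    repeat_region_bp = fragment_bp - flanking_bp
    if repeat_region_bp < 0 or repeat_region_bp % unit_bp:
        raise ValueError("fragment size is inconsistent with the repeat unit")
    return repeat_region_bp // unit_bp


def longest_tandem_run(sequence, unit):
    """Count the units in the longest uninterrupted tandem run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Two hypothetical alleles of a (CA)n microsatellite with identical flanks.
LEFT_FLANK, RIGHT_FLANK = "GGATCCTG", "TTGACCGA"
allele_a = LEFT_FLANK + "CA" * 12 + RIGHT_FLANK
allele_b = LEFT_FLANK + "CA" * 15 + RIGHT_FLANK
flank_total = len(LEFT_FLANK) + len(RIGHT_FLANK)

for name, allele in (("allele_a", allele_a), ("allele_b", allele_b)):
    by_size = repeats_from_fragment_size(len(allele), flank_total, 2)
    by_sequence = longest_tandem_run(allele, "CA")
    print(name, len(allele), "bp:", by_size, "repeats by size,",
          by_sequence, "repeats by sequence")
```

In practice fragment sizes are read from a gel or capillary trace and carry measurement error, so real genotyping software bins sizes rather than requiring an exact multiple of the repeat unit.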

Figure 5.3. Schematic diagram showing the use of PCR to detect SSR polymorphisms (STRs). Specific PCR primers land outside the repeat region and copy the entire repeat region. The PCR products can then be separated on the basis of size, and the number of repeats for each homologous locus determined.

5.3.2. RFLPs/AFLPs are special types of SNP

Restriction Fragment Length Polymorphisms (RFLPs) are a special type of SNP in which a single base inside a restriction site is altered (mutated), producing a SNP. This causes the disappearance of the restriction site, and thus the DNA is not cut as it should be by a particular restriction enzyme. This generates a polymorphism that can be detected by electrophoresis of the DNA restriction fragments, followed by blotting the fragments onto a membrane filter and probing the filter with a complementary DNA probe. This process is called Southern blotting (see Figure 5.4) and can be used in this application to detect the presence of an RFLP.
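The effect of losing a restriction site can be sketched with a small in-silico digest. This is only an illustration under stated assumptions: the DNA is treated as a single-stranded linear string, the 7 kb test sequences are artificial, and the function and variable names are hypothetical. The one biological fact relied on is that BamHI recognizes GGATCC and cuts between the two G residues.

```python
BAMHI_SITE = "GGATCC"   # BamHI recognition sequence
BAMHI_CUT_OFFSET = 1    # BamHI cuts after the first G: G^GATCC


def digest(sequence, site=BAMHI_SITE, cut_offset=BAMHI_CUT_OFFSET):
    """Return fragment lengths (bp) after cutting a linear sequence."""
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cut_positions + [len(sequence)]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]


# Artificial 7 kb region with one internal BamHI site (illustrative only).
wild_type = "A" * 1999 + "GGATCC" + "T" * 4995
# A single base change inside the site (GGATCC -> GGACCC) destroys it.
snp_allele = "A" * 1999 + "GGACCC" + "T" * 4995

print("wild type fragments:", digest(wild_type))
print("SNP allele fragments:", digest(snp_allele))
```

A heterozygous individual would show all three fragment sizes, which is how the banding pattern on a blot or gel reveals the genotype.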

Figure 5.4. Restriction Fragment Length Polymorphisms (RFLPs) are caused by a single base change in a restriction site (BamH1 in the example given). This loss of a restriction site results in a single 7 kb restriction fragment compared to a 2 kb and a 5 kb fragment.

Figure 5.5. Amplified Fragment Length Polymorphisms (AFLPs) are caused by a single base change in a restriction site (BamH1 in the example given). This loss of a restriction site results in a single 2 kb PCR product compared to a 500 bp and a 1500 bp fragment.

In addition to Southern blotting, it is also possible to PCR amplify a DNA fragment around a restriction site containing the SNP. Once the amplification is complete, the amplified fragment is cut with the restriction endonuclease in question, and the cut fragments are separated by electrophoresis. The fragments are then visualized using a double-stranded-DNA-specific stain (see Figure 5.5), and the genotype of the organism at the SNP can be determined from the fragment pattern.

5.3.3. Detecting SNPs

Most SNPs do not alter restriction sites, so other methods are used for analysis. Certainly, SNPs can be detected by direct DNA sequencing, and now that sequencing and analysis techniques have grown much more sophisticated, these approaches are increasingly popular and are typically the method of choice.

An early method of SNP detection involves the use of allele-specific oligonucleotide (ASO) hybridization. An oligonucleotide complementary to one SNP allele is attached to a membrane filter in a specific location on the filter. Oligonucleotides for all other SNP alleles at a given locus can also be attached to the filter at separate locations. The filter is then allowed to hybridize at high stringency, meaning under conditions where only a complete match will form a stable hybrid (top panel of Figure 5.6). If hybridization occurs, the target DNA has the allele corresponding to the oligo. If hybridization does not occur, a different allele for that SNP is present (see Figure 5.6), and hybridization would occur at the oligo corresponding to that allele.

Figure 5.6. Allele-specific oligonucleotide hybridization. This technique allows the investigation of single nucleotide polymorphisms in the genome. In this case the probes are bound to a membrane and hybridized with labeled DNA from the sample subject. If probes for all possible nucleotide combinations are separately bound to the filter, hybridization occurs only to the probes where there is a complete match. This is called hybridization at high stringency, and it can be used to determine the SNP present in a genome.

Such a technique can be extended to make membrane filters with oligonucleotides for multiple SNP loci, all incorporated into one array placed on the filter. Additionally, the technique can be extended to place the oligonucleotide probes on other substrates, where thousands of oligos corresponding to thousands of different SNPs can be examined simultaneously. These are referred to as SNP arrays, a type of microarray that we will discuss in Chapter 6.

SNPs are an abundant source of STS markers that have become increasingly important genomic tools, both experimentally and clinically. It is clear that such polymorphic markers arise from the mutational process, and thus we turn our attention to a more detailed investigation of the process of mutation and its effect on genome structure.

5.3.4. Uses of DNA Polymorphisms

Genes have historically been used as markers for genetic mapping experiments. DNA polymorphisms, defined as two or more alleles at a locus that vary in nucleotide sequence or in the number of repeated nucleotide units (indels), behave like genes for mapping purposes. DNA markers are polymorphisms suitable for mapping, used in association with gene markers for genetic and physical mapping.

DNA testing is increasingly available for many genetic diseases. Genetic tests are now included in the OMIM database for a vast number of genes, and Genomic Medicine is one of the newest branches of medicine. A few examples of diseases for which there are now reliable DNA marker-based genetic tests include Huntington disease, hemophilia, cystic fibrosis, Tay–Sachs disease, breast cancer, and sickle-cell anemia.

Human genetic testing serves three main purposes:
1) Prenatal diagnosis, often using amniocentesis or chorionic villus sampling to assess the risk to the fetus of a genetic disorder.
2) Newborn screening, using blood tests to screen for phenylketonuria (PKU), sickle-cell anemia, Tay–Sachs disease, and others.
3) Carrier (heterozygote) detection, which is now available for many of the genetic diseases listed above.

Additional examples of the application of DNA-based markers include: crime scene investigation; population studies to determine variability in groups of people; proving horse pedigrees for registration purposes; determining genetic variation in endangered species; forensic analysis in wildlife crimes, allowing body parts of poached animals to be used as evidence; detection of pathogenic E. coli strains in foods; detection of genetically modified organisms (GMOs) in bulk or processed foods; and many others.

5.4. MUTATIONS

Mutations are low-frequency changes in the nucleotide sequence that can occur either naturally (spontaneously) or be induced by a host of chemical and physical agents.

Mutations are quantified in two different ways. The mutation rate is the probability of a particular kind of mutation as a function of time (e.g., number of observed mutations per gene per generation). Mutation frequency is the number of times a particular mutation occurs in proportion to the number of cells or individuals in a population (e.g., number of mutations per 100,000 organisms).

Mutations occur in a number of ways, ranging from major chromosomal rearrangements to the so-called point mutations (one or a few base changes) or single nucleotide mutations.

5.4.1. Point Mutations – Base-pair Substitutions

A base-pair substitution replaces one base pair with another. There are two general types of substitutions. Transition mutations replace a purine with a purine or a pyrimidine with a pyrimidine, for example an A-T to G-C change, while transversion mutations replace a purine with a pyrimidine or vice versa, for example a C-G to G-C or an A-T to T-A change.

5.4.2. Point Mutations in Protein Coding Sequences

Another way of classifying substitutions is based on the effect of the mutation on the protein coded for by the ORF in which it occurs.

A missense mutation occurs when one amino acid is changed to another as a result of the base substitution.

When a base substitution produces a stop codon rather than a codon specifying another amino acid, it is referred to as a nonsense mutation.

When the amino acid change produced by a missense mutation involves substituting one amino acid for a functionally similar amino acid (see the table of amino acid R-groups for examples of similarity), the result is a neutral mutation.

A silent mutation results when a base substitution, typically in the third “wobble” position of a codon, does not change the amino acid that is coded for.

5.4.3. Point Mutations – Base Insertions or Deletions

Frameshift mutations result from single base insertions or deletions (more generally, from insertions or deletions of any number of bases that is not a multiple of three).

Frameshift mutations cause all amino acids from the point of insertion or deletion onward to change, and they often lead to truncation of the protein as a result of a premature stop codon now being in frame.
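The substitution classes and the frameshift effect described above can be sketched in a few lines of Python. This is a minimal illustration, not a variant-annotation tool: it uses the standard genetic code, a made-up 18-base coding sequence, and hypothetical function names. It does not attempt to identify neutral mutations, which would require a table of amino acid similarity.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def substitution_type(ref, alt):
    """Classify a base change as a transition or a transversion."""
    if ref == alt:
        raise ValueError("ref and alt are the same base")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def translate(cds):
    """Translate complete codons; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def coding_effect(cds, position, alt):
    """Classify a substitution at a 0-based position in a coding sequence."""
    mutated = cds[:position] + alt + cds[position + 1:]
    codon_start = position - position % 3
    before = CODON_TABLE[cds[codon_start:codon_start + 3]]
    after = CODON_TABLE[mutated[codon_start:codon_start + 3]]
    if after == before:
        return "silent"
    return "nonsense" if after == "*" else "missense"


cds = "ATGGAAGGCTGGCTGTAA"  # hypothetical ORF: Met-Glu-Gly-Trp-Leu-stop
print(translate(cds))
for position, alt in ((8, "T"), (4, "T"), (11, "A")):
    print(position, cds[position], "->", alt,
          substitution_type(cds[position], alt),
          coding_effect(cds, position, alt))

# Deleting a single base shifts the reading frame for every later codon.
frameshift = cds[:3] + cds[4:]
print(translate(frameshift))
```

In this example the single-base deletion changes every amino acid after the start codon and also removes the original stop codon from the reading frame; in other sequences the shifted frame can instead expose a premature stop codon and truncate the protein.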