Chapter 5. Structural Genomics Contents

5. Structural Genomics
5.1. DNA Sequencing Strategies
5.1.1. Map-based Strategies
5.1.2. Whole Genome Shotgun Sequencing
5.2. Genome Annotation
5.2.1. Using Bioinformatic Tools to Identify Putative Coding Genes
5.2.2. Comparison of Predicted Sequences with Known Sequences (at NCBI)
5.2.3. Published Genomes
5.3. DNA Sequence Polymorphisms
5.3.1. Simple Sequence Repeats (SSRs)
5.3.2. RFLPs are a Special Type of SNP
5.3.3. Detecting SNPs
5.3.4. Uses of DNA Polymorphisms
5.4. Mutations
5.4.1. Point Mutations – Base Substitutions
5.4.2. Point Mutations in Protein Coding Sequences
5.4.3. Point Mutations – Base Insertions or Deletions

CONCEPTS OF GENOMIC BIOLOGY

CHAPTER 5. STRUCTURAL GENOMICS

Genomic Biology has three important branches: Structural Genomics, Comparative Genomics, and Functional Genomics. The ultimate goals of these branches are, respectively: the sequencing of genes and genomes; the comparison of these sequenced genes and genomes across all organisms with the aim of understanding evolutionary relationships; and understanding how genes and genomes work to produce complex phenotypes, including gene regulation and environmental signaling.

A set of molecular genetic technologies is critical to our ability to pursue the goals described above. The Genomic Biologist's Tool Kit provides a brief introduction to these critical tools and how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, what they can do and how they work can be readily studied using bioinformatic resources.

5.1. DNA SEQUENCING STRATEGIES

Beyond the method for generating DNA sequences, it is necessary to have a strategy for how to employ DNA sequencing technology. Strategies for DNA sequencing depend on the features and size of the genome being sequenced and on the technology available for doing the sequencing. As part of the Human Genome Project, two general approaches emerged as most useful and valuable. One of these strategies, the map-based approach, was employed by the publicly funded sequencing effort that involved scientists from around the world. The other strategy, called whole genome shotgun sequencing, was developed by a privately funded group at Celera Genomics. It was perhaps faster and cheaper than the map-based approach, but it does not work efficiently with large genomes, though it is very useful for smaller genomes. In fact, today these approaches are “hybridized”, or combined, to obtain the advantages of both strategies.

5.1.1. Map-based Sequencing

The map-based, or clone-contig mapping, sequencing approach was the method originally developed by the publicly funded Human Genome Project sequencing effort. The rationale for this method is that it is the “best” method for obtaining the sequence of most eukaryotic genomes, and it has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. Though it is relatively slow and expensive, this method provides dependable, high-quality sequence information with a high level of confidence.

In the clone-contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial digestion with a restriction endonuclease (section 4.1), and these fragments are cloned in a high-capacity vector such as a BAC or a YAC vector (section 4.2.5). A clone contig map is made by identifying clones containing overlapping fragments bearing mapped sequence markers. These markers were originally identified using a combination of conventional genetic mapping, FISH cytogenetic mapping, and radiation hybrid mapping. More recently, common practice has been to use chromosome walking to build the clone-contig library. In this approach, sequence markers are generated from BAC ends, and a map of BAC-end sequences is subsequently made. Ideally, the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, RFLPs, and genes) known to be present in a particular region.

Figure 5.1. Clone contig mapping of a series of YAC clones containing human DNA.

Once the clone library and contig map have been developed, the relevant clones are sequenced using the shotgun method described below (Figure 5.2). The sequenced contigs are then aligned, using the markers and the overlapping sequences on the clones to position each clone.

5.1.2. Whole Genome Shotgun Sequencing

In the whole genome shotgun approach, smaller random fragments (1,500-2,000 bp) are produced, cloned, and sequenced. These sequences are then assembled into a genome sequence based on their random overlaps. Typically, some regions are not well sequenced, and targeted sequencing is done to fill in the gaps that cannot be assembled from the randomly made pieces.

The shotgun method is faster and less expensive than the map-based approach, but it is more prone to errors due to incorrect assembly of the random fragments, especially in larger genomes. For example, if a 500 kb portion of a chromosome is duplicated and each copy is cut into 2 kb fragments, then it would be difficult to determine where a particular 2 kb piece should be located in the finished sequence. This might seem trivial, but duplications seldom retain their original sequences. They tend to accumulate SNPs over time, and this can make proper assembly of the duplicated sequences difficult.

Which method is better? It depends on the size and complexity of the genome.
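The assembly step described in Section 5.1.2 can be illustrated with a toy example. The Python sketch below is a deliberately simplified greedy overlap merger, not the algorithm used by Celera or by the public project. The function names (`overlap`, `greedy_assemble`, `placements`), the `min_len` threshold, and the short example sequences are all assumptions made for illustration, and the sketch assumes error-free reads taken from a single strand.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the list of contigs that remain when no overlaps are left.
    More than one contig means the reads leave gaps.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing overlaps any more
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


def placements(read: str, genome: str):
    """All start positions at which `read` matches `genome` exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


# Hypothetical 20-base "genome" cut into 8-base reads that overlap by 4.
genome = "ATGCGTACCTGAAGTCTAGG"
reads = [genome[i:i + 8] for i in range(0, 13, 4)]
contigs = greedy_assemble(reads)

# Hypothetical sequence containing the segment GATTACAT twice.
duplicated = "CCCGATTACATTTTGATTACATGGG"
hits = placements("TTACA", duplicated)
```

In this example the four reads merge back into the original 20-base sequence. The `placements` helper shows the repeat problem in miniature: a read drawn from a duplicated segment matches more than one position, so overlap information alone cannot say where it belongs.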
With the human genome, each group involved believed its approach was superior to the other, but a hybrid approach is now used routinely. The advent of next-generation sequencing allows the use of fragment-end short-read sequencing, with much more powerful computer-based assemblers generating the finished sequences. However, the method still requires at least some second-round sequencing to obtain a completely sequenced genome.

Figure 5.2. Schematic diagram of the sequencing strategy used by the publicly funded Human Genome Project. The DNA was cut into 150 kb fragments and arranged into overlapping contiguous fragments. These contigs were cut into smaller pieces and sequenced completely.

5.2. GENOME ANNOTATION

Once a genome sequence has been obtained using one or more of the strategies outlined in the preceding sections, the hard work of deciding what the sequence means begins. Typically, to make such tasks easier, some type of database is created that ultimately shows the entire sequence, the location of specific genes in that sequence, and some functional annotation as to the role that each gene has in an organism. The databases at NCBI are a critical repository for these types of information, but there are many other specific, and perhaps more detailed, repositories of this type of information.

The process routinely begins with the implementation of what is termed a Gene Finding bioinformatic pipeline. The separate parts of such a pipeline are described below.

5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes

A first approximation of gene locations in the genomic sequence is usually made using a gene prediction program to predict gene beginning and ending points, transcriptional and translational start and stop sites, intron and exon locations, and polyA addition sites. Often such programs also produce the sequence of the putative transcript, and/or the mature mRNA and the protein amino acid sequence coded for by the gene.

Many gene prediction programs are so-called neural network programs that are capable of “learning” how to recognize the sequence of a gene. Such programs are trained on known sequences and, once trained, are used to predict gene regions. After prediction, feedback is given concerning the errors that were made. As the programs are used, they refine and improve their predictive power.

5.2.2. Comparison of predicted sequences with known sequences (at NCBI)

Once putative coding genes are predicted, the next step is to compare the predicted mRNA (cDNA) sequences with known coding sequences in publicly available libraries.

This can be done with a number of possible tools, but one of the best is the Basic Local Alignment Search Tool (BLAST) utility at NCBI. By taking your predicted peptide and/or nucleotide sequence and submitting it to a BLAST search of the nr (protein) or nt (nucleotide) sequence database, you can learn which sequences available at NCBI are most similar to your sequence. When you do a BLASTP (protein) comparison, you are also shown conserved domains found in your protein.

Recall that conserved domains are amino acid sequences that are conserved in various types of proteins. Thus, BLAST searches can inform you of a number of interesting and useful sequence features that are found in your submitted sequence. Also note that if one or more cDNA sequence libraries are available from the organism you are working with, and if a related sequence from a previously cloned gene is available at NCBI, this BLAST search will also tell you about previously known cDNA or other sequences found in any of the databases at NCBI.

As we learn more about each gene, more literature related to your gene is published and appears in the PubMed database at NCBI or in other NCBI databases. Since the databases at NCBI form an interlocking series, the BLAST search itself gives you access to a large body of information about sequences related to your predicted sequence and to the actual gene that you discovered in the genome that was sequenced.

5.2.3. Published Genomes

Once such preliminary analyses have been performed the data needs to be shared with the applicable
