Chapter 5. Structural Genomics Contents


5. Structural Genomics
  5.1. DNA Sequencing Strategies
    5.1.1. Map-based Sequencing
    5.1.2. Whole Genome Shotgun Sequencing
  5.2. Genome Annotation
    5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes
    5.2.2. Comparison of predicted sequences with known sequences (at NCBI)
    5.2.3. Published Genomes
  5.3. DNA Sequence Polymorphisms
    5.3.1. Simple Sequence Repeats (SSRs)
    5.3.2. RFLPs are a Special Type of SNP
    5.3.3. Detecting SNPs
    5.3.4. Uses of DNA Polymorphisms
  5.4. Mutations
    5.4.1. Point Mutations – Base Substitutions
    5.4.2. Point Mutations in Protein Coding Sequences
    5.4.3. Point Mutations – Base Insertions or Deletions

CONCEPTS OF GENOMIC BIOLOGY

CHAPTER 5. STRUCTURAL GENOMICS

Genomic biology has three important branches: structural genomics, comparative genomics, and functional genomics. The ultimate goals of these branches are, respectively: the sequencing of genes and genomes; the comparison of these sequenced genes and genomes across all organisms, with the aim of understanding evolutionary relationships; and understanding how genes and genomes work to produce complex phenotypes, including gene regulation and environmental signaling.

A set of molecular genetic technologies was, and is, critical to our ability to pursue the goals described above. The Genomic Biologist's Tool Kit provides a brief understanding of these critical tools and of how they are used in the investigation of genomes. While the techniques are intrinsically laboratory tools, the nature of what they can do and how they work can be readily studied using bioinformatic resources.

5.1. DNA SEQUENCING STRATEGIES

Beyond the method for generating DNA sequences, it is necessary to have a strategy for how to employ DNA sequencing technology. Strategies for DNA sequencing depend on the features and size of the genome being sequenced and on the technology available for doing the sequencing. As part of the Human Genome Project, two general approaches emerged as most useful and valuable. One of these strategies, the map-based approach, was employed by the publicly funded sequencing effort that involved scientists from around the world. The other strategy, called whole genome shotgun sequencing, was developed by a privately funded group at Celera Genomics. It was perhaps faster and cheaper than the map-based approach, but it does not work efficiently with large genomes, though it is very useful for smaller genomes. In fact, today these approaches are "hybridized", or combined, to obtain the advantages of both strategies.

5.1.1. Map-based Sequencing

The map-based, or clone-contig mapping, sequencing approach was the method originally developed by the publicly funded Human Genome Project sequencing effort. The rationale for this method is that it is the "best" method for obtaining the sequence of most eukaryotic genomes, and it has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. Though it is relatively slow and expensive, this method provides dependable, high-quality sequence information with a high level of confidence.

In the clone-contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial digestion with a restriction endonuclease (section 4.1), and these fragments are cloned in a high-capacity vector such as a BAC or a YAC vector (section 4.2.5). A clone contig map is made by identifying clones containing overlapping fragments bearing mapped sequence markers. These markers were originally identified using a combination of conventional genetic mapping, FISH cytogenetic mapping, and radiation hybrid mapping. Subsequently, common practice has been to use chromosome walking as an approach to making a clone-contig library. Using this approach, sequence markers are generated from BAC ends, and a map of BAC-end sequences is subsequently made. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, RFLPs, and genes) known to be present in a particular region.

Figure 5.1. Clone contig mapping of a series of YAC clones containing human DNA.

Once the clone library and contig map have been developed, relevant clones are sequenced using the shotgun method described below (Figure 5.2). These sequenced contigs are then aligned, using the markers and the overlapping sequences on the clones, to position each clone.

5.1.2. Whole Genome Shotgun Sequencing

In the whole genome shotgun approach, smaller, randomly generated fragments (1,500-2,000 bp) were cloned and sequenced. These sequences were then assembled, based on random overlap, into a genome sequence. Typically, some regions are not well sequenced, and targeted sequencing is done to fill in the gaps that cannot be assembled from the randomly made pieces.

The shotgun method is faster and less expensive than the map-based approach, but it is more prone to errors caused by incorrect assembly of the random fragments, especially in larger genomes. For example, if a 500 kb portion of a chromosome is duplicated and each copy is cut into 2 kb fragments, then it would be difficult to determine where a particular 2 kb piece should be located in the finished sequence. This might seem trivial, but duplications seldom retain their original sequences. They tend to accumulate SNPs over time, and this can generate difficulties in the proper assembly of these duplicated sequences.

Which method is better? It depends on the size and complexity of the genome.
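The placement problem caused by duplicated regions can be illustrated with a toy example. The sketch below is not an assembler; it only asks where a single short read could fit in a made-up sequence. The function name `find_placements`, the tiny sequences, and the `max_mismatches` tolerance are all assumptions introduced here for illustration (the tolerance stands in for the allowance that overlap-based methods generally make for sequencing errors).

```python
def find_placements(genome: str, read: str, max_mismatches: int = 0) -> list[int]:
    """Return every 0-based start position where `read` fits `genome`
    with at most `max_mismatches` differing bases (hypothetical helper)."""
    placements = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            placements.append(start)
    return placements


# Toy sequences: `segment` stands in for the duplicated 500 kb region,
# and `read` for one 2 kb fragment taken from inside it.
unique_left = "ACGTTGCA"
segment = "GGATCCTTAGGC"
spacer = "TTTTCAAA"
unique_right = "CATGCATG"
read = "ATCCTTAG"

# Case 1: the two copies are identical, so the read fits in two places.
genome_identical = unique_left + segment + spacer + segment + unique_right
print(find_placements(genome_identical, read))                     # [10, 30]

# Case 2: the second copy has acquired one single-base change (a SNP).
diverged_copy = segment[:6] + "A" + segment[7:]
genome_diverged = unique_left + segment + spacer + diverged_copy + unique_right

# If one mismatch is tolerated (to allow for sequencing errors), the read
# still fits both copies, so its true origin remains ambiguous.
print(find_placements(genome_diverged, read, max_mismatches=1))    # [10, 30]

# Only a strict, error-free comparison separates the two copies.
print(find_placements(genome_diverged, read))                      # [10]
```

In this simplified picture, a few accumulated SNPs are not enough to tell the copies apart once some tolerance for read errors is allowed, which is one way to see why duplicated regions are hard to assemble from random fragments alone.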
With the human genome, each group involved believed its approach was superior to the other, but a hybrid approach is now used routinely. The advent of next generation sequencing allows the use of fragment-end short-read sequencing, with much more powerful computer-based assemblers generating the finished sequences. However, the method still requires at least some second-round sequencing to obtain a completely sequenced genome.

Figure 5.2. Schematic diagram of the sequencing strategy used by the publicly funded Human Genome Project. The DNA was cut into 150 kb fragments and arranged into overlapping contiguous fragments. These contigs were cut into smaller pieces and sequenced completely.

5.2. GENOME ANNOTATION

Once a genome sequence is obtained using one or more of the strategies outlined in the preceding sections, the hard work of deciding what the sequence means begins. Typically, to make such tasks easier, some type of database is created that ultimately shows the entire sequence, the location of specific genes in that sequence, and some functional annotation as to the role that each gene has in the organism. The databases at NCBI are a critical repository for these types of information, but there are many other specific, and perhaps more detailed, repositories of this type of information.

The process routinely begins with the implementation of what is termed a gene-finding bioinformatic pipeline. The separate parts of such a pipeline are described below.

5.2.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes

A first approximation of gene locations in the genomic sequence is usually made using a gene prediction program to predict gene beginning and ending points, transcriptional and translational start and stop sites, intron and exon locations, and polyA addition sites. Often such programs also produce the sequence of the putative transcript, and/or the mature mRNA and the protein amino acid sequence coded for by the gene.

Many gene prediction programs are so-called neural network programs that are capable of "learning" how to decide what constitutes the sequence of a gene. Such programs are trained on known sequences and, once trained, are used to predict gene regions; after each round of prediction, feedback is given concerning the errors that were made. As the programs are used, they refine and improve their predictive power.

5.2.2. Comparison of predicted sequences with known sequences (at NCBI)

Once putative coding genes are predicted, the next step is to compare the predicted mRNA (cDNA) sequences with known coding sequences in publicly available libraries.

This can be done with a number of possible tools, but one of the best is the Basic Local Alignment Search Tool (BLAST) utility at NCBI. By taking your predicted peptide and/or nucleotide sequence and submitting it to a BLAST search of the nr (protein) or nt (nucleotide) sequence database, you can learn which sequences available at NCBI are most similar to your sequence. When you do a BLASTP (protein) comparison, you are also shown the conserved domains found in your protein.

Recall that conserved domains are amino acid sequences that are conserved in various types of proteins. Thus, BLAST searches can inform you of a number of interesting and useful sequence features that are found in your submitted sequence. Also note that if one or more cDNA sequence libraries are available from the organism you are working with, and if a related sequence from a previously cloned gene is available at NCBI, the same BLAST search can also tell you about previously known cDNA or other sequences found in any of the databases at NCBI.

As we learn more about each gene, more literature related to your gene is published and appears in the PubMed database at NCBI or in other NCBI databases. Since NCBI maintains an interlocking series of databases, the BLAST search itself gives you access to a large body of information about sequences related to your predicted sequence and to the actual gene that you discovered in the sequenced genome.

5.2.3. Published Genomes

Once such preliminary analyses have been performed the data needs to be shared with the applicable