Integration of Genome Data and Protein Structures: Prediction Of

Total Page:16

File Type:pdf, Size:1020Kb

Load more

125 Integration of genome data and protein structures: prediction of protein folds, protein interactions and ‘molecular phenotypes’ of single nucleotide polymorphisms Shamil Sunyaev*†‡§, Warren Lathe III*†# and Peer Bork*†¶ With the massive amount of sequence and structural data the issue of integrating structural analysis with research on being produced, new avenues emerge for exploiting the human genetic variation. information therein for applications in several fields. Fold distributions can be mapped onto entire genomes to learn Predicting the number of protein folds in about the nature of the protein universe and many of the genomes using homology searches interactions between proteins can now be predicted solely on Soon after the first complete genome sequences became the basis of the genomic context of their genes. Furthermore, publicly available, it was realised that they represent by utilising the new incoming data on single nucleotide datasets for the analysis of protein folds that are statisti- polymorphisms by mapping them onto three-dimensional cally far more natural than compared to the PDB [1]. structures of proteins, problems concerning population, Furthermore, iterative database search methods such as medical and evolutionary genetics can be addressed. PSI-BLAST [2] had been developed and a number of reports revealed that the folds of more than 30% of the Addresses proteins in prokaryotic genomes can be reliably predicted *European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, using homology-based techniques [3–6]. These iterative 69117 Heidelberg, Germany methods are at least twice as sensitive as pairwise protein † Max-Delbrueck Centre for Molecular Medicine (MDC), comparisons [7]. This increased sensitivity can some- Robert-Roessle-Strasse 10, 13122 Berlin, Germany ‡Engelhardt Institute of Molecular Biology (IMB), Vavilova 32, times identify the common evolutionary origin of 117984 Moscow, Russia proteins that were previously thought to belong to differ- §e-mail: [email protected] ent superfamilies and whose structural similarities were # e-mail: [email protected] assumed to be a result of parallel evolution (analogous ¶e-mail: [email protected] rather than homologous structures). An example of a Current Opinion in Structural Biology 2001, 11:125–130 newly identified nontrivial homology relationship 0959-440X/01/$ — see front matter involves enzymes of the TIM-barrel fold from the central Published by Elsevier Science Ltd. metabolism that have been shown to share a common evolutionary origin despite being grouped into 12 distinct Abbreviations SCOP superfamilies [8,9]. cSNP coding SNP PDB Protein Data Bank SNP single nucleotide polymorphism Fold predictions for complete genomes allowed a new esti- mate of the total number of protein folds and of the Introduction number of protein folds in individual genomes [10]. The The past several years have been marked by the success- estimate qualitatively agrees with that of earlier studies ful completion of numerous genome projects, ranging [11,12]. The statistical analysis [13] shows that the nature from the short genomes of prokaryotes to our own. As a of the fold distribution is universal for all genomes and result of this extraordinary growth, different types of agrees with earlier theoretical considerations [14]. genomic data (Figure 1), including sequences of complete Although these estimates give a much clearer picture of genomes, complete sets of proteins (proteomes) and data the nonuniform fold distribution and the low number of on genetic variation, have become available for analysis. folds, it remains to be shown whether the physical con- The simultaneous growth of structural data also provides straints of the polypeptide chain or the specific features of the possibility of performing this analysis from a structural protein evolution led to the current fold repertoire. perspective. In this review, we focus on the impact of genomic data on studies of protein structures and interac- Predicting protein interactions using genomic tions and, perhaps more importantly, on the new context applications of the structural research and the challenges The availability of numerous complete genome raised by these applications. Currently, three trends sequences also enables comparative analysis to predict bridge the structural and genomics fields: analysis of pro- various functional features at the protein level. The tein folds in complete genomes; use of genome genomic context of genes reveals the physical, functional information for predicting protein–protein interactions; and genetic interactions of the respective gene and structural analyses of disease mutations and single products. Several strategies have been used to explore nucleotide polymorphisms (SNPs). We only briefly intro- genomic context. First, gene fusion in one species is duce the first two topics, because they have been indicative of a protein–protein interaction between the extensively covered in many reviews, and mostly focus on gene products of the two fused genes and can be 126 Folding and binding Figure 1 The growth of structural data (counted in PDB Growth of known 3D structures entries) in recent years is compared with the 15,000 growth of genomic data, represented by the number of completely sequenced genomes 10,000 and by the number of known human SNPs. Although all types of presented data 5,000 entries accumulate fast, the growth of genomic data, especially SNP data, outperforms structural Number of PDB 0 data accumulation. Data kindly provided by 1995 1996 1997 1998 1999 2000 A Brookes (Karolinska Institute), S Sherry Year (NCBI) and M Huynen (EMBL). Growth of the number of completely sequenced genomes 40 30 20 10 genomes Number of 0 1995 1996 1997 1998 1999 2000 Year Growth of SNP data 2.50E+06 2.00E+06 1.50E+06 1.00E+06 5.00E+05 submissions 0.00E+00 Number of SNP 1998(3) 1998(4) 1999(1) 1999(2) 1999(3) 1999(4) 2000(1) 2000(2) 2000(3) 2000(4) Quarters of a year Current Opinion in Structural Biology assumed for all the orthologues of the two genes in other Structural analysis of allelic variants species [15,16]. Second, conservation of gene neighbour- A very important part of genomic research, which was hood in some divergent species also strongly indicates underestimated at the beginning of the genomics era, is the occurrence of an interaction between the two encoded the analysis of genetic variation in populations. The stud- gene products, even if they are not neighbours in many ies of genetic variation are now appreciated in the context other genomes [17,18]. In fact, entire subcellular systems of the human genome project and several consortia can be identified if neighbourhood information for dif- around the world have already identified more than two ferent genes is systematically merged [19]. Third, the million DNA variants in the human population (Figure 1). co-occurrence (this has also been coined phylogenetic Therefore, it is time to integrate these data with informa- profile or COG pattern) of two genes (and their ortho- tion from other resources, such as three-dimensional logues) in the same subset of species indicates that the structures of proteins. The application of structural data two genes interact [20–22]. Fourth, the presence of to research on genetic variation can propel studies on the shared regulatory elements hints at the co-regulation of identification of the genetic roots of phenotypic variation the respective downstream genes and, hence, at the and bring new insights to the fields of population and genetic or functional interaction of the respective gene evolutionary genetics. In turn, structural research may products [23]. also benefit from using mutation data. Catalogues of nat- urally occurring mutations with known phenotype All these strategies and the respective methods can association might, in some cases, substitute for experi- be combined to increase their sensitivity [24]. Although ments on site-directed mutagenesis in studies of protein context-based methods are not yet as powerful as classical folding and binding. homology-based function prediction methods [25] and dif- ferences between prokaryotic and eukaryotic evolution In the following, we give a brief introduction to the have to be considered, the power of genomic context nature of mutation and polymorphism data, and then analysis will increase with each new genome published. review the first approaches to combine them with structural Integration of genome data and protein structures Sunyaev, Lathe and Bork 127 information in both specific case studies and general human disease phenotypes are now believed to be of a analyses of amino-acid-replacing SNPs mapped to 3D complex nature, involving common DNA variants structures of proteins. together with environmental factors. The identification of variants that increase susceptibility to human diseases is Mutations one of the key problems in medical genetics. Second, For a long time, data on genetic variation were available analysis of genetic variation in a population can help in the only for particular loci mainly corresponding to genes pursuit of solutions to many problems in evolutionary associated with simple monogenic diseases. Numerous genetics. Third, knowledge of genetic variation in the locus-specific databases contain mutation data, often from modern human population is a clue to our understanding patients with genetic disorders. These databases report not of human origins and features of the human population
Recommended publications
  • Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus Angustifolius L.)

    Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus Angustifolius L.)

    International Journal of Molecular Sciences Article A Tale of Two Families: Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus angustifolius L.) Katarzyna B. Czy˙z 1,* , Michał Ksi ˛a˙zkiewicz 2 , Grzegorz Koczyk 1 , Anna Szczepaniak 2, Jan Podkowi ´nski 3 and Barbara Naganowska 2 1 Department of Biometry and Bioinformatics, Institute of Plant Genetics, Polish Academy of Sciences, 60-479 Poznan, Poland; [email protected] 2 Department of Genomics, Institute of Plant Genetics, Polish Academy of Sciences, 60-479 Poznan, Poland; [email protected] (M.K.); [email protected] (B.N.) 3 Department of Genomics, Institute of Bioorganic Chemistry, Polish Academy of Sciences, 61-704 Poznan, Poland * Correspondence: [email protected] Received: 17 February 2020; Accepted: 6 April 2020; Published: 8 April 2020 Abstract: Narrow-leafed lupin (Lupinus angustifolius L.) has recently been supplied with advanced genomic resources and, as such, has become a well-known model for molecular evolutionary studies within the legume family—a group of plants able to fix nitrogen from the atmosphere. The phylogenetic position of lupins in Papilionoideae and their evolutionary distance to other higher plants facilitates the use of this model species to improve our knowledge on genes involved in nitrogen assimilation and primary metabolism, providing novel contributions to our understanding of the evolutionary history of legumes. In this study, we present a complex characterization of two narrow-leafed lupin gene families—glutamine synthetase (GS) and phosphoenolpyruvate carboxylase (PEPC). We combine a comparative analysis of gene structures and a synteny-based approach with phylogenetic reconstruction and reconciliation of the gene family and species history in order to examine events underlying the extant diversity of both families.
  • Genetic and Environmental Exposures Constrain Epigenetic Drift Over the Human Life Course

    Genetic and Environmental Exposures Constrain Epigenetic Drift Over the Human Life Course

    Downloaded from genome.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press Research Genetic and environmental exposures constrain epigenetic drift over the human life course Sonia Shah,1,9 Allan F. McRae,1,9 Riccardo E. Marioni,1,2,3,9 Sarah E. Harris,2,3 Jude Gibson,4 Anjali K. Henders,5 Paul Redmond,6 Simon R. Cox,3,6 Alison Pattie,6 Janie Corley,6 Lee Murphy,4 Nicholas G. Martin,5 Grant W. Montgomery,5 John M. Starr,3,7 Naomi R. Wray,1,10 Ian J. Deary,3,6,10 and Peter M. Visscher1,3,8,10 1Queensland Brain Institute, The University of Queensland, Brisbane, 4072, Queensland, Australia; 2Medical Genetics Section, Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XU, United Kingdom; 3Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, EH8 9JZ, United Kingdom; 4Wellcome Trust Clinical Research Facility, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, United Kingdom; 5QIMR Berghofer Medical Research Institute, Brisbane, 4029, Queensland, Australia; 6Department of Psychology, University of Edinburgh, Edinburgh, EH8 9JZ, United Kingdom; 7Alzheimer Scotland Dementia Research Centre, University of Edinburgh, Edinburgh, EH8 9JZ, United Kingdom; 8University of Queensland Diamantina Institute, Translational Research Institute, The University of Queensland, Brisbane, 4072, Queensland, Australia Epigenetic mechanisms such as DNA methylation (DNAm) are essential for regulation of gene expression. DNAm is dy- namic, influenced by both environmental and genetic factors. Epigenetic drift is the divergence of the epigenome as a function of age due to stochastic changes in methylation.
  • Small Variants Frequently Asked Questions (FAQ) Updated September 2011

    Small Variants Frequently Asked Questions (FAQ) Updated September 2011

    Small Variants Frequently Asked Questions (FAQ) Updated September 2011 Summary Information for each Genome .......................................................................................................... 3 How does Complete Genomics map reads and call variations? ........................................................................... 3 How do I assess the quality of a genome produced by Complete Genomics?................................................ 4 What is the difference between “Gross mapping yield” and “Both arms mapped yield” in the summary file? ............................................................................................................................................................................. 5 What are the definitions for Fully Called, Partially Called, Half-Called and No-Called?............................ 5 In the summary-[ASM-ID].tsv file, how is the number of homozygous SNPs calculated? ......................... 5 In the summary-[ASM-ID].tsv file, how is the number of heterozygous SNPs calculated? ....................... 5 In the summary-[ASM-ID].tsv file, how is the total number of SNPs calculated? .......................................... 5 In the summary-[ASM-ID].tsv file, what regions of the genome are included in the “exome”? .............. 6 In the summary-[ASM-ID].tsv file, how is the number of SNPs in the exome calculated? ......................... 6 In the summary-[ASM-ID].tsv file, how are variations in potentially redundant regions of the genome counted? .....................................................................................................................................................................
  • A Pilot Analysis of DNA Methylation in Candidate Genes in Brain

    A Pilot Analysis of DNA Methylation in Candidate Genes in Brain

    cells Article Epigenetic Study in Parkinson’s Disease: A Pilot Analysis of DNA Methylation in Candidate Genes in Brain Luis Navarro-Sánchez 1 , Beatriz Águeda-Gómez 1, Silvia Aparicio 1 and Jordi Pérez-Tur 1,2,3,* 1 Unitat de Genètica Molecular, Instituto de Biomedicina de Valencia, CSIC, 46010 València, Spain; [email protected] (L.N.-S.); [email protected] (B.Á.-G.); [email protected] (S.A.) 2 Centro de Investigación Biomédica en Red en Enfermedades Neurodegenerativas (CIBERNED), 46010 València, Spain 3 Unidad Mixta de Genética y Neurología, Instituto de Investigación Sanitaria La Fe, 46026 València, Spain * Correspondence: [email protected]; Tel.: +34-96-339-1755 Received: 25 July 2018; Accepted: 21 September 2018; Published: 26 September 2018 Abstract: Efforts have been made to understand the pathophysiology of Parkinson’s disease (PD). A significant number of studies have focused on genetics, despite the fact that the described pathogenic mutations have been observed only in around 10% of patients; this observation supports the fact that PD is a multifactorial disorder. Lately, differences in miRNA expression, histone modification, and DNA methylation levels have been described, highlighting the importance of epigenetic factors in PD etiology. Taking all this into consideration, we hypothesized that an alteration in the level of methylation in PD-related genes could be related to disease pathogenesis, possibly due to alterations in gene expression. After analysing promoter regions of five PD-related genes in three brain regions by pyrosequencing, we observed some differences in DNA methylation levels (hypo and hypermethylation) in substantia nigra in some CpG dinucleotides that, possibly through an alteration in Sp1 binding, could alter their expression.
  • Dbsnp Resources in the UCSC Database Today We Will Discuss

    Dbsnp Resources in the UCSC Database Today We Will Discuss

    dbSNP resources in the UCSC database Today we will discuss some of the variation data from dbSNP as displayed on the UCSC Genome Browser. We will start at genome.ucsc.edu at the main Browser page and we will ‘reset all user settings’, which brings us to the most recent genome assembly, hg38. [0:28 Set up GB to hg18 defaults] So to begin, let’s go down to hg18, one of the earlier version of the genome, and we will hit the go button. At the Browser graphic page, we will hit the ‘hide all’ button to turn all the data tracks off. [0:45 Turn on SNPs (130) track] Scrolling to the bottom of the screen, to the Variation and Repeats bluebar group, we’ll draw your attention to the dbSNP track, SNPs version 130. In addition, we have earlier versions: 129, 128 and 126. At the time of version 130, before moving to hg19, dbSNP was mapping all new SNPs to the hg18 genome assembly. If we click into the link above the pulldown menu, we can see that this table was last updated in 2009. We’re going to turn the track on to ‘dense’ and then ‘Submit’. This shows us all the SNPs in version 130, and let’s turn on the gene track, UCSC Genes to ‘pack’so we will see the scale. We have 310,000 bases and we have the default gene GABRA3. The region is annotated by many SNPs. Now let’s zoom into one of the exons here in the middle of the screen, and you can that at this resolution, around 5,000 basepairs, there are a dozen or so SNPs in view.
  • A Roadmap for Metagenomic Enzyme Discovery

    A Roadmap for Metagenomic Enzyme Discovery

    Natural Product Reports View Article Online REVIEW View Journal A roadmap for metagenomic enzyme discovery Cite this: DOI: 10.1039/d1np00006c Serina L. Robinson, * Jorn¨ Piel and Shinichi Sunagawa Covering: up to 2021 Metagenomics has yielded massive amounts of sequencing data offering a glimpse into the biosynthetic potential of the uncultivated microbial majority. While genome-resolved information about microbial communities from nearly every environment on earth is now available, the ability to accurately predict biocatalytic functions directly from sequencing data remains challenging. Compared to primary metabolic pathways, enzymes involved in secondary metabolism often catalyze specialized reactions with diverse substrates, making these pathways rich resources for the discovery of new enzymology. To date, functional insights gained from studies on environmental DNA (eDNA) have largely relied on PCR- or activity-based screening of eDNA fragments cloned in fosmid or cosmid libraries. As an alternative, Creative Commons Attribution-NonCommercial 3.0 Unported Licence. shotgun metagenomics holds underexplored potential for the discovery of new enzymes directly from eDNA by avoiding common biases introduced through PCR- or activity-guided functional metagenomics workflows. However, inferring new enzyme functions directly from eDNA is similar to searching for a ‘needle in a haystack’ without direct links between genotype and phenotype. The goal of this review is to provide a roadmap to navigate shotgun metagenomic sequencing data and identify new candidate biosynthetic enzymes. We cover both computational and experimental strategies to mine metagenomes and explore protein sequence space with a spotlight on natural product biosynthesis. Specifically, we compare in silico methods for enzyme discovery including phylogenetics, sequence similarity networks, This article is licensed under a genomic context, 3D structure-based approaches, and machine learning techniques.
  • Structural Genomics: an Approach to the Protein Folding Problem

    Structural Genomics: an Approach to the Protein Folding Problem

    Commentary Structural genomics: An approach to the protein folding problem Gaetano T. Montelione* Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers University, and Department of Biochemistry, Robert Wood Johnson Medical School, University of Medicine and Dentistry of New Jersey, Piscataway, NJ 08854-5638 he large-scale genome sequencing of information and analyses that will be efficiently determine the phases of the Tprojects present tremendous new op- available as recently funded structural diffraction data required to determine the portunities for structural biology and mo- genomics centers and consortia around protein structure. In this study, MAD was lecular biophysics. This explosion of bio- the world (12–15) come up to speed. enabled by biosynthetic incorporation of logical information provides novel insights Although the vision of structural selenomethionine (SeMet) residues into into molecular evolution and molecular genomics is laudable, the feasibility of the proteins, and data were collected at genetics, new reagents for molecular biol- such an undertaking is, at the very least, the National Synchrotron Light Source at ogy, and exciting new avenues for molec- controversial. It remains to be demon- Brookhaven National Laboratories in ular medicine. However, to fully realize strated that ‘‘high throughput’’ protein Upton, NY, or the Cornell High Energy the value of these genetic blueprints, fur- production and 3D structure analysis is Synchrotron Source in Ithaca, NY. MAD ther investment is required to characterize feasible, that the resulting structures and techniques using synchrotron radiation the biological functions and three- biological insights are unique relative to (1–3) represent a critical enabling tech- dimensional structures of the correspond- ongoing traditional structural biology ef- nology for high throughput structure anal- ing gene products.
  • 1 a Primer on Molecular Biology

    1 a Primer on Molecular Biology

    1APrimer on Molecular Biology Alexander Zien Modern molecular biology provides a rich source of challenging machine learning problems. This tutorial chapter aims to provide the necessary biological background knowledge required to communicate with biologists and to understand and properly formalize a number of most interesting problems in this application domain. The largest part of the chapter (its first section) is devoted to the cell as the basic unit of life. Four aspects of cells are reviewed in sequence: (1) the molecules that cells make use of (above all, proteins, RNA, and DNA); (2) the spatial organization of cells (“compartmentalization”); (3) the way cells produce proteins (“protein expression”); and (4) cellular communication and evolution (of cells and organisms). In the second section, an overview is provided of the most frequent measurement technologies, data types, and data sources. Finally, important open problems in the analysis of these data (bioinformatics challenges) are briefly outlined. 1.1 The Cell The basic unit of all (biological) life is the cell. A cell is basically a watery solution of certain molecules, surrounded by a lipid (fat) membrane. Typical sizes of cells range from 1 µm (bacteria) to 100 µm (plant cells). The most important properties Life of a living cell (and, in fact, of life itself) are the following: It consists of a set of molecules that is separated from the exterior (as a human being is separated from his or her surroundings). It has a metabolism, that is, it can take up nutrients and convert them into other molecules and usable energy. The cell uses nutrients to renew its constituents, to grow, and to drive its actions (just like a human does).
  • Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants

    Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants

    Genomics 4.0 - Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants - Advanced Pattern Analytics in Duplicate Genomes - Johannes A. Hofberger Thesis committee Promotor Prof. Dr M. Eric Schranz Professor of Experimental Biosystematics Wageningen University Other members Prof. Dr Bart P.H.J. Thomma, Wageningen University Prof. Dr Berend Snel, Utrecht University Dr Klaas Vrieling, Leiden University Dr Gabino F. Sanchez, Wageningen University This research was conducted under the auspices of the Graduate School of Experimental Plant Sciences. Genomics 4.0 - Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants - Advanced Pattern Analytics in Duplicate Genomes - Johannes A. Hofberger Thesis submitted in fulfilment of the requirements for the degree of doctor at Wageningen University by the authority of the Rector Magnificus Prof. Dr M.J. Kropff, in the presence of the Thesis Committee appointed by the Academic Board to be defended in public on Monday 18 May 2015 at 4 p.m. in the Aula. Johannes A. Hofberger Genomics 4.0 - Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants 83 pages. PhD thesis, Wageningen University, Wageningen, NL (2015) With references, with summaries in Dutch and English ISBN: 978-94-6257-314-7 PROPOSITIONS 1. Ohnolog over-retention following ancient polyploidy facilitated diversification of the glucosinolate biosynthetic inventory in the mustard family. (this thesis) 2. Resistance protein conserved in structurally stable parts of plant genomes confer pleiotropic effects and expanded functions in plant innate immunity. (this thesis) 3.
  • Dbsnp Columns

    Dbsnp Columns

    dbSNP Columns RS dbSNP ID (i.e. rs number) RSPOS Chr position reported in dbSNP RV RS orientation is reversed Variation Property. Documentation is at VP ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a GENEINFO colon (:) and each pair is delimited by a vertical bar (|) dbSNPBuildID First dbSNP Build for RS SAO Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - SSR 1kg_failed, 1024 - other WGT Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more VC Variation Class PM Variant is Precious(Clinical,Pubmed Cited) Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will TPA give phenotype data) PMC Links exist to PubMed Central article S3D Has 3D structure - SNP3D table SLO Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out Has non-synonymous frameshift A coding region variation where one allele in NSF the set changes all downstream amino acids. FxnClass = 44 Has non-synonymous missense A coding region variation where one allele in NSM the set changes protein peptide. FxnClass = 42 Has non-synonymous nonsense A coding region variation where one allele in NSN the set changes to STOP codon (TER). FxnClass = 41 Has reference A coding region variation where one allele in the set is identical REF to the reference sequence. FxnCode = 8 Has synonymous A coding region variation where one allele in the set does not SYN change the encoded amino acid.
  • Interactive Tutorial 1: SNP Database Resources

    Interactive Tutorial 1: SNP Database Resources

    NIEHS SNPs Workshop January 30-31, 2005 Interactive Tutorial 1: SNP Database Resources The tutorial is designed to take you through the steps necessary to access SNP data from the primary database resources: 1. Entrez SNP/dbSNP 2. HapMap Genome Browser 3. NIEHS - GeneSNPs and the NIEHS SNPs Websites 4. Other Tools –PolyPhen, ECR, PolyDoms, Transfac Note: Answers to questions from this tutorial are included at the end of this document As a launching point, we will begin our searching at the Entrez cross-database browser. This can be accessed on the NCBI home page (http://www.ncbi.nlm.nih.gov/). For these exercises we will be accessing data for the gene: chemokine-like factor (HUGO name: NOS2A). For a cross-database search: 1. Enter the gene symbol (NOS2A) into the empty box next to the ‘Search All Databases’, type NOS2A into the empty box and click on the GO button, or simply hit the return key on your keyboard. Which NCBI database gives the most number of results What is the database? Hint: Mouse over on the ‘?’ next to the icon and click for a popup explanation of this database. 2. On the left column note the results returned for the ‘SNP’ and ‘Gene’ database. 3. How many results were returned for the ‘SNP’ and ‘Gene’ database? 4. Why did the ‘Gene’ database return more than one result? Entrez Gene 5. From the cross database search, click on the ‘Gene’ database icon. 6. Click on the result that corresponds to the ‘homo sapiens’ NOS2A gene. 7. NOS2A maps to which chromosome? 8.
  • The Role of Protein Structure in Genomics

    The Role of Protein Structure in Genomics

    FEBS Letters 476 (2000) 98^102 FEBS 23802 View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Minireview The role of protein structure in genomics Francisco S. Dominguesa, Walter A. Koppensteinerb, Manfred J. Sippla;b;* aCenter for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob Haringer Strasse 3, A-5020 Salzburg, Austria bProCeryon Biosciences GmbH, Jakob Haringer Strasse 3, A-5020 Salzburg, Austria Received 5 May 2000 Edited by Gunnar von Heijne the sequences genomes is a daunting task. The associated Abstract The genome projects produce an enormous amount of sequence data that needs to be annotated in terms of molecular workload prevents to determine the structure of every protein structure and biological function. These tasks have triggered by experimental techniques. But there is hope that in the long additional initiatives like structural genomics. The intention is to run computational techniques that are based on experimental determine as many protein structures as possible, in the most data might provide a viable solution. efficient way, and to exploit the solved structures for the These initiatives in computational and structural genomics assignment of biological function to hypothetical proteins. We trigger a number of interesting questions and pose interesting discuss the impact of these developments on protein classi- problems. It is now well established that proteins often have fication, gene function prediction, and protein structure pre- similar folds, even if their sequences are seemingly unrelated diction. ß 2000 Federation of European Biochemical Socie- [6]. Hence the number of sequences is much larger than the ties.