What Is Bioinformatics?


What is bioinformatics?

An interdisciplinary field at the interface of the computational and life sciences.

“The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.”
National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/

• the analysis and interpretation of nucleotide and protein sequences and structures
• development of algorithms and software to support the acquisition … of biomolecular data
• the development of software that enables efficient access and management of biomolecular information

The Origins of Computational Biology

Bioinformatics stems from parallel revolutions in biology and computing.

At the beginning of World War II (1939-1944):
• The stored-program computer had not yet been invented, and there were no programming languages, databases, or computer networks.
• The relationship between genes and proteins, the molecular basis of genes, the structure of DNA and the genetic code were all unknown.

Timeline, marked 1950:
Biology:
• Amino acid sequencing
• Sanger sequences insulin.
• Stein, Moore, Spackman: automatic amino acid analyzer
• Discovery of DNA structure: Watson, Crick, Franklin
Computing:
• Turing designs a stored-program computer
• EDSAC: 1st stored-program computer
• Grace Murray Hopper co-invents COBOL and finds the first bug

[Figure: Automatic recording apparatus used in the chromatographic analysis of mixtures of amino acids. Stein, Moore, Spackman, 1958]
[Figure: EDSAC: The first stored-program computer.]
[Figure: Rear Admiral Grace Hopper overseeing her team of programmers, Philadelphia Inquirer, 1957]
[Figure: The helical structure of DNA. James Watson and Francis Crick; Rosalind Franklin]
Source cited on slide: Trends in Biochem. Sci, 99

Timeline, 1960 to 1970:
Biology:
• Determination of the genetic code
• Ongoing protein sequencing; Dayhoff publishes the Protein Atlas
• The Protein Data Bank is established.
Computing:
• Fortran, Basic, LISP
• Arpanet

The genetic code
The genetic code is a degenerate, non-overlapping, triplet code. (Crick, Barnett, Brenner, and Watts-Tobin, 1961)

Atlas of Protein Sequence & Structure, 1965 - 1978
Margaret Dayhoff: PhD in Chemistry, 47; Watson Computing Lab Fellow, 47 - 48

Arpanet
• The ARPAnet is established, precursor to the Internet.
• Packet switching is invented, which supports flexible, multi-node networks.
• Interface Message Processor (IMP) network is constructed, linking 4, and then 15 nodes.
Protein Data Bank (PDB)
• Growing collection of X-ray diffraction protein structure data
• Development of molecular graphics display for 3D visualization
• SEARCH program supports remote access

Timeline, 1970 to 1980:
Biology:
• Sanger-Coulson sequencing
• Maxam-Gilbert sequencing
• Gilbert, Sanger win Nobel Prize
Computing:
• TCP/IP
• Internet
• First royal email
• USENET newsgroups

Beginnings of molecular evolution
[Figure: tree relating an early mammal sequence (…atgccaggtcccagtga…) to progressively diverged human, rat and mouse sequences]

Sequence similarity → structural similarity
Structure?
…vkltpegtr_wgghpldekflske…
…vhltpettrgwgghmldekeiske…
Estimate protein structure from a related protein with known structure and similar sequence.

Sequence similarity → functional similarity
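The "sequence similarity" slides above rest on comparing two aligned sequences position by position. As a rough illustration only, the sketch below counts identical columns in the two protein fragments shown on the structural-similarity slide. The function name `percent_identity` is hypothetical, the underscore is assumed to mark an alignment gap, and dividing by ungapped columns is just one of several conventions for reporting identity.

```python
def percent_identity(seq_a, seq_b, gap="_"):
    """Compare two pre-aligned sequences column by column.

    Returns (matches, compared, identity), where `compared` counts only
    the columns in which neither sequence has a gap character.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.lower(), seq_b.lower()):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    identity = matches / compared if compared else 0.0
    return matches, compared, identity


# Fragments copied from the slide; "_" is assumed to be a gap.
unknown = "vkltpegtr_wgghpldekflske"
known = "vhltpettrgwgghmldekeiske"

matches, compared, identity = percent_identity(unknown, known)
print(f"{matches}/{compared} identical ungapped columns ({identity:.0%})")
```

Real tools such as BLAST first search for the alignment itself and score substitutions with a matrix; this sketch assumes the alignment is already given and treats every mismatch equally.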
[Figure: two aligned nucleotide sequences, ...atgcgaggattctcgaagacagcga… and ...atgcaaggagtc_cgttgacagagc…]

Timeline, marked 1990:
Biology:
• Congress establishes GenBank
• Human Genome Project begins (DOE)
• Basic local alignment search tool (BLAST)
• GenBank goes online.
Computing:
• Information superhighway: Al Gore creates bill to fund the "Information Superhighway"
• World Wide Web: Tim Berners-Lee proposes a design for information sharing that becomes the World Wide Web
• NCSA Mosaic: the first widely used web browser
• Pizza Hut goes online

The internet we know and love
[Figures: GenBank and BLAST pages, National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/]

Timeline, 1995 to 2000:
Biology:
• H. influenzae: 1st whole genome sequence
• Yeast: 1st eukaryote
• C. elegans: 1st animal
• Fly genome
• A. thaliana: 1st plant
Computing:
• Google
• IBM’s Blue Gene architecture for protein modeling

Sequence Assembly
Shotgun sequencing
• Limits of gel electrophoresis: ~500 bp in one “read”
• To sequence more than 500 bp: sequence 500 bp fragments separately, then combine computationally using sequence comparison
[Figure: DNA molecule → shotgun reads → sequence assembly → contigs]
Steps in sequence assembly:
• All-vs-all sequence similarity comparison
• Alignment of sequences with high sequence identity
• Determination of consensus sequence

Timeline, 2001 to 2003:
Biology:
• Draft human genome
• ABI capillary sequencer
• Chimp genome, aligned with human genome
Computing:
• Wikipedia, Captcha
• Mozilla Foundation: open source, W3C patent policy

How to interpret whole genome sequence
• Where are the genes?
• When and where are those genes expressed?
• What proteins do they encode?
• What do they do?
  – Molecular function?
  – Biological pathway or process?

DNA patterns in the E. coli lexA gene
Annotated patterns:
• Promoter sequences: TTCCAA (-35), TATACT (-10)
• Repressor binding site
• mRNA start (+), GGGGG (+10)
• Ribosomal binding site
• Open reading frame: ATG…TAA

  1 gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt
 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca
121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga
181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc
241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc
301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac
361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc
421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg
481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg
541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac
601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca
661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg
721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct
781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt
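The patterns annotated on the lexA slide (the -35 and -10 promoter boxes and an ATG…TAA open reading frame) can be located with plain string scanning. The sketch below is a simplified illustration, not the method used in the lecture: `find_motif` and `find_orfs` are hypothetical helper names, only the forward strand is scanned, and exact matches are required, whereas real promoter and gene finders tolerate mismatches and score candidates statistically. `UPSTREAM` is the first 120 nucleotides of the lexA sequence from the slide.

```python
STOP_CODONS = {"taa", "tag", "tga"}

# First 120 nt of the lexA sequence as given on the slide.
UPSTREAM = (
    "gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt"
    "atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca"
)


def find_motif(seq, motif):
    """Return the 0-based start of every exact occurrence of motif in seq."""
    seq, motif = seq.lower(), motif.lower()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def find_orfs(seq, min_codons=2):
    """Return (start, end) slices for ATG...stop stretches, forward strand only.

    Each ATG is extended codon by codon until the first in-frame stop codon.
    `min_codons` counts the codons before the stop. ATGs that share a frame
    and a stop codon are all reported, so nested ORFs can appear.
    """
    seq = seq.lower()
    orfs = []
    for start in find_motif(seq, "atg"):
        for pos in range(start, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break  # stop at the first in-frame stop codon
    return orfs


print("-35 box (TTCCAA) at:", find_motif(UPSTREAM, "ttccaa"))
print("-10 box (TATACT) at:", find_motif(UPSTREAM, "tatact"))

# Toy sequence (made up for illustration) with a single short ORF.
toy = "ccatgaaagcgtaacc"
for start, end in find_orfs(toy):
    print("ORF:", toy[start:end])
```

Positions are 0-based offsets into `UPSTREAM`, so they are one less than the coordinates in the numbered listing on the slide.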