Protein Sequence


Inside protein databases: bridging sequences and knowledge

WELCOME!
Geneva, 2017
SIB Swiss Institute of Bioinformatics
[email protected] & [email protected], Swiss-Prot, SIB
[email protected], CALIPHO, SIB

SIB Swiss Institute of Bioinformatics
• 70 groups
• 800 collaborators
• biologists, biochemists, computer scientists, physicists, physicians, chemists, mathematicians, pharmacists, …
Common point: bioinformatics

All the material is available here: http://education.expasy.org/cours/InsideProteinDatabases2017/

Content
• a description of the major protein sequence databases and their sequence annotation pipeline, focusing on UniProtKB/Swiss-Prot
• an introduction to Gene Ontology (GO)
• practical sessions on how to query protein sequence databases, how to perform enrichment analysis on datasets and how to interpret the results of such analyses

Objectives
• know the differences between the major protein sequence databases
• understand the major sequence annotation pipelines and the GO annotation pipelines
• estimate the protein sequence accuracy and the annotation quality

08h30 Protein sequence databases: theory
10h30 COFFEE BREAK
11h00 Controlled vocabularies and standardization resources: theory
12h15 PAUSE
13h30 Protein sequence databases and Gene Ontology: practicals
15h00 COFFEE BREAK
15h30 Analysis tools using ontologies: theory
16h00 Protein sequence databases and Gene Ontology: practicals
17h00 Evaluation / Exam
18h00 END

Menu
• Introduction
• Nucleic acid sequence databases: EMBL, GenBank, DDBJ
• Protein sequence databases: UniProt databases (UniProtKB); NCBI protein databases (NCBInr, RefSeq)
• Practicals…

[Figure: a UniProtKB/Swiss-Prot entry (protein sequence + annotation = biological knowledge), surrounded by the uses of protein databases: text search, training datasets, statistics, genome annotation (features), systems biology, proteomics, BLAST, phylogeny, domains. The entry shown is reproduced below.]

ID   HIDH_SOYBN   Reviewed;   319 AA.
AC   Q5NUF3;
...
DE   RecName: Full=2-hydroxyisoflavanone dehydratase;
DE            EC=3.1.1.1;
DE            EC=4.2.1.105;
DE   AltName: Full=Carboxylesterase HIDH;
GN   Name=HIDH; OrderedLocusNames=GLYMA01G45020;
OS   Glycine max (Soybean) (Glycine hispida).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; fabids; Fabales; Fabaceae; Papilionoideae; Phaseoleae;
OC   Glycine.
OX   NCBI_TaxID=3847;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA], FUNCTION, CATALYTIC ACTIVITY, MUTAGENESIS OF GLY-78;
RP   GLY-79; THR-164; ASP-263 AND HIS-295, AND BIOPHYSICOCHEMICAL PROPERTIES.
RC   TISSUE=Seedling;
RX   PubMed=15734910; DOI=10.1104/pp.104.056747;
RA   Akashi T., Aoki T., Ayabe S.;
RT   "Molecular and biochemical characterization of 2-hydroxyisoflavanone dehydratase.
RT   Involvement of carboxylesterase-like proteins in leguminous isoflavone biosynthesis.";
RL   Plant Physiol. 137:882-891(2005).
...
CC   -!- FUNCTION: Dehydratase that mediates the biosynthesis of
CC       isoflavonoids. Can use both 4'-hydroxylated and 4'-methoxylated 2-
CC       hydroxyisoflavanones as substrates. Has also a slight
CC       carboxylesterase activity toward p-nitrophenyl butyrate.
CC   -!- CATALYTIC ACTIVITY: 2,7,4'-trihydroxyisoflavanone = daidzein +
CC       H(2)O.
...
CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Kinetic parameters:
CC         KM=29 uM for 2,7-dihydroxy-4'-methoxyisoflavanone (at pH 7.5 and
CC         30 degrees Celsius);
...
CC   -!- PATHWAY: Secondary metabolite biosynthesis; flavonoid
CC       biosynthesis.
CC   -!- SIMILARITY: Belongs to the 'GDXG' lipolytic enzyme family.
DR   EMBL; AB154415; BAD80840.1; -; mRNA.
DR   EMBL; BT097440; ACU22699.1; -; mRNA.
DR   EMBL; CM000834; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   RefSeq; NP_001237228.1; NM_001250299.2.
DR   UniGene; Gma.19376; -.
DR   ProteinModelPortal; Q5NUF3; -.
...
DR   Pfam; PF07859; Abhydrolase_3; 1.
DR   PROSITE; PS01173; LIPASE_GDXG_HIS; 1.
PE   1: Evidence at protein level;
KW   Complete proteome; Flavonoid biosynthesis; Hydrolase; Lyase; Reference proteome.
FT   CHAIN         1    319       2-hydroxyisoflavanone dehydratase.
FT                                /FTId=PRO_0000424101.
...
FT   ACT_SITE     77     77       Potential.
FT   ACT_SITE    164    164
FT   ACT_SITE    263    263
FT   ACT_SITE    295    295
...
SQ   SEQUENCE   319 AA;  35138 MW;  E8333CF425FBA4A3 CRC64;
     MAKEIVKELL PLIRVYKDGS VERLLSSENV AASPEDPQTG VSSKDIVIAD NPYVSARIFL
     PKSHHTNNKL PIFLYFHGGA FCVESAFSFF VHRYLNILAS EANIIAISVD FRLLPHHPIP
     AAYEDGWTTL KWIASHANNT NTTNPEPWLL NHADFTKVYV GGETSGANIA HNLLLRAGNE
     SLPGDLKILG GLLCCPFFWG SKPIGSEAVE GHEQSLAMKV WNFACPDAPG GIDNPWINPC
     VPGAPSLATL ACSKLLVTIT GKDEFRDRDI LYHHTVEQSG WQGELQLFDA GDEEHAFQLF
     KPETHLAKAM IKRLASFLV
//

Annotation: where does it come from?

Annotation is the process of assigning biological information to DNA or protein sequences.
Information (e.g. protein function and subcellular location) may come from:
• publications (experimental data)
• sequence similarity (… quest for orthologs)
• protein domains
• computational analysis (prediction)
Computational analysis can be manually checked (by the ‘biocurators’) or not.
Examples: UniProtKB and Gene Ontology annotation

Protein sequence: where does it come from?
Protein sequence origins
• > 180 billion ‘different’ proteins on earth (∑ N species × M genes)
• ~ 74 million ‘known and public’ protein sequences in 2017
• About 98% of the protein sequences are derived from the translation of nucleotide sequences (mRNA or DNA/genome)
• About 1% come from direct protein sequencing (Edman, MS/MS…)

The ideal life of a sequence …
[Figure labels: RNA, genes, genomes, …; nucleic acid sequence databases (EMBL/GenBank/DDBJ, http://www.insdc.org/); protein sequence databases. Submission page: http://www.ncbi.nlm.nih.gov/genbank/submit]

RefSeq
• The Reference Sequence (RefSeq) collection provides a non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. Example: RefSeq NP_000790.2; NM_000799.2.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

Ensembl
• Creates, integrates and distributes reference datasets and analysis tools that enable genomics.
• Joint project between EMBL-EBI and the Sanger Centre. Example: Ensembl ENST00000252723; ENSP00000252723; ENSG00000130427.
• Contains protein sequences derived from gene prediction, not submitted to EMBL-Bank/GenBank/DDBJ

Nucleic acid sequence databases
• EMBL (ENA, European Nucleotide Archive): EBI (UK)
• GenBank: NCBI (US)
• DDBJ: Japan
Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

DNA sequence of the human EPO gene (server: EBI; database: EMBL/ENA; text format)
[Figure: annotated entry showing the accession number, taxonomy, references (the submitters), cross-references, CDS (CoDing Sequence, proposed by submitters; 5 exons) and DNA sequence.]

EMBL/GenBank/DDBJ
• Archive: nothing goes out -> highly redundant!
• Most annotations are done by the submitters: heterogeneity of the quality and of the completion
• Archive: all submitted information remains there; not updated (exception: Third Party Annotation (TPA))
• Many errors: in sequences, in annotations, in CDS attribution; no consistency of annotations

EMBL/GenBank/DDBJ and annotation
“Beyond limited editorial control and some internal integrity checks (for example, proper use of INSD formats and translation of coding regions specified in CDS entries are verified), the quality and accuracy of the record are the responsibility of the submitting author, not of the database. The databases will work with submitters and users of the database to achieve the best quality resource possible.”
http://www.insdc.org/policy

EMBL/GenBank/DDBJ and annotation
• Many scientists assume that GenBank annotation is kept up to date, and they are surprised to hear that it is not.
• The annotation has remained static: a gene labeled 'hypothetical protein' a few years ago might now have a known function.
• Erroneous and inconsistent naming of genes.
• A name is transferred from one gene to another on the basis of sequence similarity (usually from a BLAST search). As more genomes are annotated, and more BLAST searches are run, the original source of the name quickly becomes lost.
• Scientists should fix errors that they find. But this would quickly destroy the archival function of GenBank, as original entries would be erased over time.
(PMID: 17274839)

Information provided by the submitter of a nucleotide entry…
DR   EMBL; DQ339047; ABC68418.1; -; mRNA.

FT   source          1..1397
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT                   /mol_type="mRNA"
FT                   /sex="female"
FT                   /tissue_type="ovary"
FT                   /db_xref="taxon:10116"
FT   CDS             70..1329
FT                   /codon_start=1
FT                   /product="testis derived transcript"
FT                   /note="TES"
FT                   /db_xref="GOA:Q2LAP6"

EMBL/GenBank/DDBJ: DNA sequence & CoDing Sequences (CDS)
Coding sequence (CDS) annotation
• CDS: CoDing Sequence (provided by submitters). Slide: J. McDowall
• CDS annotation provided by the submitters. The first Met!
• CDS translation provided by EMBL/GenBank
This
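
The rat entry above annotates a CDS at positions 70..1329 with /codon_start=1. As a rough illustration of how a submitter-provided location of this kind can be turned into a protein translation, the sketch below slices a 1-based, inclusive location out of an mRNA and translates it with the standard genetic code. The function names (parse_location, translate_cds) and the toy sequence are hypothetical and chosen only for this example; this is not the EMBL/GenBank translation pipeline, and it ignores joins, complements and partial locations that real entries can contain.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_location(location):
    """Parse a simple 'start..end' feature location (1-based, inclusive)."""
    start, end = location.split("..")
    return int(start), int(end)


def translate_cds(mrna, location, codon_start=1):
    """Translate the CDS at `location` in `mrna`, stopping at the first stop codon."""
    start, end = parse_location(location)
    # Convert 1-based inclusive coordinates to a Python slice; accept RNA or DNA letters.
    cds = mrna.upper().replace("U", "T")[start - 1:end]
    # /codon_start gives the 1-based offset of the first complete codon.
    cds = cds[codon_start - 1:]
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy mRNA (hypothetical): 5' UTR 'GGC', then ATG GCC AAA TAA, then 'GG'.
toy_mrna = "GGCATGGCCAAATAAGG"
print(translate_cds(toy_mrna, "4..15"))  # prints MAK

# Length check for the CDS location quoted above (70..1329).
start, end = parse_location("70..1329")
n_codons = (end - start + 1) // 3
print(n_codons)  # prints 420
```

The location 70..1329 spans 1260 nucleotides, i.e. 420 codons; if the last codon is a stop codon (an assumption, since the sequence itself is not shown here), the translated protein would be 419 residues long. Choosing a different start position, for example a later Met, changes the predicted protein, which is one reason submitter-provided CDS annotations deserve a critical look.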