Pfam: The protein families database in 2021

Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A. Salazar 1, Erik L.L. Sonnhammer 2, Silvio C.E. Tosatto 3, Lisanna Paladin 3, Shriya Raj 1, Lorna J. Richardson 1, Robert D. Finn 1 and Alex Bateman 1

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
2 Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden
3 Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy

Nucleic Acids Research, 2021, Vol. 49, Database issue, D412–D419. Published online 30 October 2020. doi: 10.1093/nar/gkaa913

Received September 11, 2020; Revised October 01, 2020; Editorial Decision October 02, 2020; Accepted October 06, 2020

*To whom correspondence should be addressed. Tel: +44 1223 494100; Fax: +44 1223 494468; Email: [email protected]

© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B, which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.

INTRODUCTION

Pfam is a database of protein families and domains that is widely used to analyse novel genomes and metagenomes, and to guide experimental work on particular proteins and systems (1,2). Each Pfam family has a seed alignment that contains a representative set of sequences for the entry. A profile hidden Markov model (HMM) is automatically built from the seed alignment and searched against a sequence database called pfamseq using the HMMER software (http://hmmer.org/). All sequence regions that satisfy a family-specific curated threshold, also known as the gathering threshold, are aligned to the profile HMM to create the full alignment. It is worth noting that a common misuse of Pfam is to use a single E-value threshold across all Pfam HMMs, which results in lower sensitivity and an increase in false positive matches when compared to using the per-family gathering thresholds. Pfam entries are manually annotated with functional information from the literature where available.

Since Pfam release 29.0, pfamseq has been based on UniProtKB reference proteomes, whilst prior to that it was based on the whole of UniProtKB (3,4). Although the underlying sequence database is based on reference proteomes, all of the profile HMMs are also searched against UniProtKB and the resulting matches are made available on the Pfam website and in a flatfile format. Similarly, the match data for representative proteomes (5) are provided in the same way. Pfam can be accessed via the website at https://pfam.xfam.org. The flatfiles and exports of the Pfam MySQL database are provided under a CC0 license for each release of the database, and can be found on the FTP site ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases.

When a Pfam entry is built, we typically search it iteratively against pfamseq in order to find more distant homologues. Pfam entries are built such that there are no overlaps between them; this means that the same region of a sequence should not match more than one family. This non-overlap rule proves to be an excellent quality control criterion, which helps to avoid including false positive matches in a family. From Pfam 28.0, we relaxed this rule to allow small overlaps between families, as it had become increasingly time-consuming to resolve all such overlaps each time we updated pfamseq (see (3) for more details).

Sets of Pfam entries that we believe to be evolutionarily related are grouped together into clans (6). Relationships between entries are identified through sequence similarity, structural similarity, functional similarity and/or profile-profile comparisons using software such as HHsearch (7). Where possible, we build a single comprehensive profile HMM to detect all members of a family. For some of the larger superfamilies where this is not possible, we build multiple profile HMMs and put them in the same clan. As families in a clan are evolutionarily related, we allow them to overlap with other members of the same clan. The clans are competed such that if there are multiple overlapping profile HMM matches to families from a single clan, only the match with the lowest E-value is displayed on the website, or included in the full alignment of the entry.

Here we describe Pfam 33.1 and some of the family updates we have made since the last release. We also detail the Pfam coverage of the SARS-CoV-2 proteome, and the new method to create Pfam-B, which we re-introduced in Pfam 33.1. Finally, we describe the results of the analysis we carried out comparing repeats in the Pfam database to RepeatsDB.

PFAM 33.1 RELEASE

Pfam 33.0 was due to be released in March 2020; however, owing to the global COVID-19 pandemic, we delayed its release so that we could focus on improving our SARS-CoV-2 models (see 'COVID-19 updates' section for more detail). The updated SARS-CoV-2 models and a handful of other new Pfam entries were added to Pfam 33.0 to create Pfam 33.1, which was released in May 2020. Release 33.0 was never officially released on the Pfam website, although the database files and flat files are made available on the ftp site.

Pfam 33.1 contains 18 259 families and 635 clans. Since Pfam 32.0, we have built 355 new families, deleted 25 families and built 8 new clans. Just over 39% of all Pfam families are placed within a clan. Of the sequences in UniProtKB, 77.0% have at least one match to a Pfam entry, and 53.2% of residues in UniProtKB fall within a Pfam entry. These figures, termed the sequence and residue coverage respectively, have remained fairly constant over the past five Pfam releases despite a 240% increase in UniProtKB size during that time (see Figure 1). Since 2015, highly redundant bacterial proteomes have been identified and removed from UniProtKB (8). This process ensures a level of diversity in the sequences added to UniProtKB, and prevents, for example, multiple strains of a particular species being added. Pfam is able to maintain a sequence coverage of ∼77%, and

FAMILY IMPROVEMENTS

Despite almost 25 years of active curation, there remain many protein domains and families that are as yet unclassified by Pfam. New families added to Pfam are created from a range of sources, including Pfam-B families and protein structure. Pfam-B alignments in particular have been a very fruitful substrate for families, with historically nearly a third of all Pfam entries being built from them. The new Pfam-B (described below) was used to build 18 entries in Pfam 33.1. We expect that Pfam-B will again become a very useful source of additional families in the coming years.

Protein structures from databases such as the Protein Data Bank (PDB) (9,10) are particularly amenable as a source of new Pfam entries because the Pfam domain boundaries can be defined precisely from the structure. There is often a paper associated with the structure, which we use to help annotate the Pfam entry. Building families using protein structures is an ongoing activity, and some of the new SARS-CoV-2 families were built in this way. Additionally, we have made 37 new families based on clusters of sequences from the MGnify metagenomic protein database (11). The MGnify clusters may be another large source from which we could build families in the future. Using metagenomics clusters can help to cover regions of sequence space that are not well covered by existing genomic sequencing projects.

Pfam has created numerous entries for domains of unknown function (DUF) and uncharacterized protein families (UPF). Over time the functions of some of these are discovered. We continue to scan the literature to identify newly identified functions, and we also receive updates from the community via our helpdesk. Many functions are also identified by the InterPro database (12) update process, which checks whether the UniProtKB/Swiss-Prot descriptions of proteins in each InterPro entry have changed between releases. When these changes identify a function, Pfam is notified. To date we have changed the identifiers of 1132 DUF or UPF families, indicating that a function has been identified for these families.
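The two match-selection rules described in the Introduction (per-family gathering thresholds rather than a single E-value cutoff, and competition between overlapping matches from the same clan) can be illustrated with a minimal Python sketch. This is not Pfam's own pipeline code. The Match record, the function names, the family and clan identifiers and all numbers are hypothetical, and the sketch assumes that each gathering threshold is expressed as a bit score cutoff and that competition can be approximated greedily by visiting matches in order of increasing E-value.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Match:
    """One profile HMM match on a sequence (hypothetical record layout)."""
    family: str       # family identifier
    start: int        # first matched residue, 1-based inclusive
    end: int          # last matched residue, inclusive
    bit_score: float
    evalue: float


def passes_gathering(match: Match, thresholds: Dict[str, float]) -> bool:
    """Apply the family-specific threshold instead of one global E-value cutoff.

    Assumption: thresholds are bit scores, one per family.
    """
    return match.bit_score >= thresholds[match.family]


def overlaps(a: Match, b: Match) -> bool:
    """True when the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def compete_clans(matches: List[Match], clan_of: Dict[str, str]) -> List[Match]:
    """Keep only the lowest E-value match among overlapping same-clan matches.

    Simplification: matches are visited from best to worst E-value, and a
    match is dropped if it overlaps an already kept match from its clan.
    Families that belong to no clan are never removed by this step.
    """
    kept: List[Match] = []
    for m in sorted(matches, key=lambda x: x.evalue):
        clan = clan_of.get(m.family)
        clash = clan is not None and any(
            clan_of.get(k.family) == clan and overlaps(m, k) for k in kept
        )
        if not clash:
            kept.append(m)
    return sorted(kept, key=lambda x: x.start)


def select_matches(
    matches: List[Match],
    thresholds: Dict[str, float],
    clan_of: Dict[str, str],
) -> List[Match]:
    """Threshold filter first, then clan competition."""
    significant = [m for m in matches if passes_gathering(m, thresholds)]
    return compete_clans(significant, clan_of)
```

In this sketch a match with a seemingly good E-value can still be rejected when its bit score falls below its own family's threshold, which is the behaviour a single global E-value cutoff cannot reproduce.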