Fibronectin Type III Domains in Yeast Detected by a Hidden Markov Model


Alex Bateman and Cyrus Chothia

Current Biology 1996, Vol 6 No 12, p. 1544

Proteins containing fibronectin type III (FnIII) domains play a central role in many intercellular processes: they are part of many cell-surface receptors, adhesive matrix proteins and cell adhesion molecules. FnIII domains are also found in the giant muscle proteins, titin and twitchin. The occurrence of FnIII domains in prokaryotes led to speculation that these domains existed in the last common ancestor of prokaryotes and animals [1]. However, it has been argued that the currently known prokaryotic examples were obtained from a horizontal transfer of a single domain from animals; that is, that this protein arose late in evolution, and is unlikely to occur in plants or fungi [2,3].

Here, we report evidence that three fungal proteins, L8543.18 and YEF3_YEAST of Saccharomyces cerevisiae and the L8543.18 homologue from Schizosaccharomyces pombe, contain FnIII domains. The evidence for this comes from a hidden Markov model (HMM) [4,5] of the amino-acid residues that determine the FnIII protein fold, and is supported by other calculations.

From alignments of the sequences of a protein family, an HMM can be built to encode the probabilities of different residues occurring at particular sites. The model can then be used to detect other sequences that are likely to be very distant members of the protein-fold family [6]. We built an HMM from a multiple alignment of 434 FnIII domains, and used it to search for FnIII domains in the yeast protein database release 4.1 [7]. Residues 76–166 of the sequence L8543.18, and residues 35–125 of YEF3_YEAST, matched the HMM with scores of 39.5 and 21.5 bits, respectively. We found a homologue of L8543.18 in cosmid c6G9 of the genomic data for S. pombe, using the program tblastn [8] (see Fig. 1a). Residues 77–167 of this sequence match the FnIII HMM with a score of 32.4 bits. The HMM score is the logarithm to base 2 of the probability of the sequence matching the HMM, divided by the probability of a randomly generated sequence matching the HMM. The next highest match, SVS1, scored 9.4 bits. We would expect a score of 12 bits to be significant against a database of this size. HMM scores of 39 and 21 bits are highly reliable indicators of sequence homology, in our experience.

We therefore expect that the yeast domains have an FnIII-like fold. However, we did try other methods to verify our results: database searches with BLASTP [8], key residues analysis [9], and PHD [10] secondary structure prediction.

BLASTP [8] found matches between the 'FnIII' sections of the yeast proteins and known animal FnIII domains in the SWISS-PROT database [11]. The top match against the L8543.18 protein was the FnIII-containing receptor tyrosine kinase KEK4_CHICK, with a p value of 7.5 × 10⁻⁴. The best match against YEF3_YEAST was FINC_CHICK, the fibronectin protein in chicken, with a p value of 2.0 × 10⁻³. Note that these matches were found using only the 'FnIII' portion of the two S. cerevisiae proteins. If the whole sequence of L8543.18 is used, the first protein with an FnIII domain to be matched is NCA1_MOUSE at rank position 284, with a p value of 0.56; for the whole sequence of YEF3_YEAST, the first such match is NCA2_XENLA at rank position 380 and with a p value of 0.96. Routine BLASTP analysis would not, therefore, find the yeast FnIII-like sequences.

Key residues are those that, through their packing, hydrogen bonds or unusual torsion angles, play the major role in determining the three-dimensional structure of a protein. These residues tend to be strongly conserved, in type if not in identity, over long evolutionary periods, and they can be used to detect distant evolutionary relationships. We have defined the key residues of several FnIII domains of known structure [9]. Figure 1b shows a comparison of the residues at key sites in the yeast FnIII domains with those in three domains of known structures [12,13]. The yeast sequences aligned to the known structures share identical residues at only 8–16 sites. Inspection of the alignment, however, shows that, to a very large extent, the key residues in the core of the known structures are the same or conservatively substituted in the two new sequences. The key residues in the core of each of the predicted S. cerevisiae FnIII domains are shown schematically in Figure 2. The key residues of the S. pombe homologue of L8543.18 are the same as, or very similar to, those found in the S. cerevisiae form. All the residues found at core sites of the yeast domains can be found in the sequences of animal FnIII domains.

The known FnIII structures contain two β sheets, one with three strands (ABE) and the other with four (GFCC′). The sequences of the yeast domains were submitted to the PHD secondary structure prediction server [10]. For each of

Figure 1

(a) The sequence of the putative S. pombe homologue of S. cerevisiae L8543.18, translated from genomic cosmid c6G9. (b) Alignment of yeast FnIII domains to the FnIII domains of neuroglian (NGd1 and NGd2) and tenascin (Ten) [12,13]. The S. pombe homologue of L8543.18 is denoted Pombe. Regions in the yeast sequences that are expected to have the same structure as one of the known structures are shown in upper case; regions expected to differ in conformation are shown in lower case. The secondary structure (marked 'Strand') of the known FnIII domain structures is shown, as is the PHD secondary structure prediction (marked 'PHD') [10] of the single yeast domains: residues predicted to be in an extended conformation are marked by '-' and those predicted to be in a helical conformation by '+'. Key residues important for the structure of the FnIII fold are shown in red [9]. The chemical nature of these sidechains is largely conserved, implying structural similarity of the yeast domains to the known structures. (c) The location of FnIII domains within the S. cerevisiae proteins.

(a)
  1 MDDTNQFMVS VAKIDAGMAI LLTPSFHIIE FPSVLLPNDA TAGSIIDISV HHNKEEEIAR 60
 61 ETAFDDVQKE IFETYGQKLP SPPVLKLKNA TQTSIVLEWD PLQLSTARLK SLCLYRNNVR 120
121 VLNISNPMTT HNAKLSGLSL DTEYDFSLVL DTTAGTFPSK HITIKTLRMI DLTGIQVCVG 180
181 NMVPNEMEAL QKCIERIHAR PIQTSVRIDT THFICSSTGG PEYEKAKAAN IPILGLDYLL 240
241 KCESEGRLVN VSGFYIENRA SYNANASINS VEAAQNAAPN LNATTEQPKN TAEVAQGAAS 300
301 AKAPQQTTQQ GTQNSANAEP SSSASVPAEA PETEAEQSID VSSDIGLRSD SSKPNEAPTS 360
361 SENIKADQPE NSTKQENPEE DMQIKDAEEH SNLESTPAAQ QTSEVEANNH QEKPSSLPAV 420
421 EQINVNEENN TPETEGLEDE KEENNTAAES LINQEETTSG EAVTKSTVES SANEEEAEPN 480
481 EIIEENAVKS LLNQEGPATN EEVEKNNANS ENANGLTDEK IIEAPLDTKE NSDDDKPSPA 540
541 AAEDIGTNGA IEEIPQVSEV LEPEKAHTTN LQLNALDKEE DLNITTVKQS SEPTADDNLI 600
601 PNKEAEIIQS SDEFESVNID

(b)
Strand    A-------------A B-----B C--------C
NGd1      DVP.NAP..KLTGITC.QA.DKAEIHWE...QQGDNRSPI.....LHYTIQFNTS.
NGd2      .PDVPFKNP.DNVVG.QGTQP.NNLVISWT..PMPEIEHNAPN....FHYYVSWKRD.
Ten       .RL.DAP.SQIEV.KDVTD.TTALITWF.....KPLAEI......DGIELTYGIK.
L8543.18  THKP.ESP..VLKI.VNVTQ.TSCVLAWD..plkl...gsak....LKSLILYRKG.
PHD       ---- ---- ------ - --------
Pombe     QKLP.SPP..VLKL.KNATQ.TSIVLEWD..plql...star....LKSLCLYRNN.
PHD       -- ------- -----
YEF3      KIKTP.PAT..KVSI.DKIAT.DSVTIHWEnepvkaedngsadrnfiSHYLLYLNNT.
PHD       - -- ------ -- ++++++++-------

Strand    ----C' E-----E F---------F G-------------G
NGd1      FTPASWDAAYEK..V..PNTDSSFVVQ..MSPW.ANYTFRVIAFNK..IGASPPSASSD.SCTTQ
NGd2      .IPAAAWENNN...I.FDWRQNNIVIA.DQPTF.VKYLIKVVAIND..RGESN.VAAEEVVGYSG
Ten       .KVPGDRTTID...L..TEDENQYSIG.NLKPD.TEYEVSLISR....RGDMS.SNPAKETFTT.
L8543.18  ....irsmvipnp.F....KVTTTKIS.GLSVD.TPYEFQLKLITT..SGTLW.SEKV..ILRTH
PHD       ----- -------- ---- --------- -- ---- ---
Pombe     ....vrvlnisnp.M....TTHNAKLS.GLSLD.TEYDFSLVLDTT..AGTFP.SKHI..TIKTL
PHD       ------- --- --------- - ----
YEF3      ..qlaifpnnpns.L.....YTCCSIT.GLEAE.TQYQLDFITINN..KGFIN.KPSI..YCPTK
PHD       -- - ------ ++ +++ ----- - ---

(c) [Schematic of L8543.18, with its FnIII domain at residues 76–166, and of YEF3_YEAST, with its FnIII domain near the amino terminus. Legend: low complexity region, high complexity region, FnIII domain. Scale bar: 100 amino acids.]

Figure 2

The conservation of core residues between neuroglian domain 2 and YEF3_YEAST and S. cerevisiae L8543.18. Strands A, B and E are shown in pink. [Panels: neuroglian domain 2, L8543.18, YEF3_YEAST.]
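The bit score that the letter defines, the base-2 logarithm of the probability of a sequence under the model divided by its probability under a random model, can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a gapless, position-specific profile in place of a full profile HMM (no insert or delete states, no transition probabilities), a uniform background, and add-one pseudocounts. The alignment and the names `build_profile`, `bit_score` and `odds_ratio` are hypothetical.

```python
import math

# Hypothetical toy alignment of four gapless sequences (not real FnIII data).
ALIGNMENT = ["WEPL", "WDPV", "WTPI", "FEPL"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a uniform null model; real searches use observed residue frequencies.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column residue probabilities from an alignment."""
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def bit_score(sequence, profile, background=BACKGROUND):
    """Return log2 of P(sequence | profile) / P(sequence | background)."""
    if len(sequence) != len(profile):
        raise ValueError("this gapless sketch needs sequence and profile of equal length")
    return sum(
        math.log2(column[aa] / background[aa])
        for aa, column in zip(sequence, profile)
    )


def odds_ratio(bits):
    """Convert a bit score back into a likelihood ratio."""
    return 2.0 ** bits


if __name__ == "__main__":
    profile = build_profile(ALIGNMENT)
    for seq in ("WEPL", "GGGG"):
        print(seq, round(bit_score(seq, profile), 2))
```

Under this definition a score of 12 bits corresponds to odds of 2^12 = 4096 to 1 in favour of the model, and a score of 39.5 bits to roughly 7.8 × 10^11 to 1, which is why the reported matches stand well clear of the next highest score of 9.4 bits.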