Proteome-Scale Prediction of Structure and Function

Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press

Resource

The Proteome Folding Project: Proteome-scale prediction of structure and function

Kevin Drew,1 Patrick Winters,1 Glenn L. Butterfoss,1 Viktors Berstis,2 Keith Uplinger,2 Jonathan Armstrong,2 Michael Riffle,3 Erik Schweighofer,4 Bill Bovermann,2 David R. Goodlett,5 Trisha N. Davis,3 Dennis Shasha,6 Lars Malmström,7 and Richard Bonneau1,4,6,8

1Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York 10003, USA; 2IBM, Austin, Texas 78758, USA; 3Department of Biochemistry, Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; 4Institute for Systems Biology, Seattle, Washington 98103, USA; 5Medicinal Chemistry Department, University of Washington, Seattle, Washington 98195, USA; 6Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, New York 10003, USA; 7Institute of Molecular Systems Biology, ETH Zurich, Zurich CH 8093, Switzerland

The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid).
We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.

[Supplemental material is available for this article.]

8Corresponding author. E-mail [email protected]. Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.121475.111. Freely available online through the Genome Research Open Access option.

Genome Research 21:1981–1994 © 2011 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/11; www.genome.org

Annotation of protein structure and function is a fundamental challenge in biology. Accurate structure annotations give researchers a three-dimensional (3D) view of protein function at the molecular level that enables specific point-mutation analysis or the design of custom inhibitors to disrupt function. Accurate function annotations give researchers specific testable hypotheses about the role a protein plays in the cell and also allow biologists to better interpret the role of uncharacterized genes from high-throughput experiments (e.g., mass spectrometry co-IP, yeast two-hybrid screens, microarray, RNAi screens, or forward-genetic screens). Unfortunately, experimental annotation efforts fail to cover large portions of all proteomes and are often focused on model organisms. Current computational function prediction methods can extend coverage to any sequenced genome (i.e., non-model organisms) but can only annotate proteins that have high sequence similarity to well-characterized proteins. Motivated by the observation that structure is more conserved than sequence (Chothia and Lesk 1986), our method extends function annotation coverage to many unannotated protein domains by comparing computationally predicted structures to well-characterized proteins with known structures.

Current computational protein annotation methods can be broadly grouped into four categories: (1) Primary sequence-feature annotation methods predict general features of a protein such as disorder content (DISOPRED) (Jones and Ward 2003), secondary structure (PSIPRED) (Jones 1999), transmembrane helices (TMHMM) (Krogh et al. 2001), coiled-coils (COILS) (Lupas et al. 1991), or signal peptides (SignalP) (Bendtsen et al. 2004) and can be efficiently applied to full proteomes but have limited ability to describe the specific function of individual proteins in a cellular context. (2) Sequence comparison methods (e.g., BLAST) (Altschul et al. 1997), including methods and databases that organize proteins into families (CATH [Orengo et al. 1997], SCOP [Murzin et al. 1995], Pfam [Finn et al. 2008]) and fold recognition methods (FFAS) (Jaroszewski et al. 2005), can generate putative function annotations (for a complete review, see Lee et al. 2007) but are limited in their application to the set of proteins with sequence-detectable homologs or significant sequence matches. (3) Several groups have used machine learning techniques to integrate high-throughput experimental data, such as gene expression and protein–protein interactions, to predict protein function (Marcotte et al. 1999; Bader and Hogue 2002; Hazbun et al. 2003; Troyanskaya et al. 2003; Lee et al. 2004; Pena-Castillo et al. 2008), but these data sets are not generally available for all organisms and therefore limit these methods' coverage. (4) Several studies have shown that protein structures derived from either large-scale protein experimental determination (Matthews 2007; Dessailly et al. 2009) or large-scale protein prediction (homology modeling, fold recognition, and de novo) significantly increased the annotation coverage of proteomes and often provided site-specific information such as probable functional sites and surfaces (Bonneau et al. 2004; Ginalski et al. 2004; Malmström et al. 2007; Pieper et al. 2009; Zhang et al. 2009). Homology modeling is increasingly productive as more structures are solved; however, many proteins and protein domains lack detectable homologous proteins in the Protein Data Bank (PDB) (Marsden et al. 2007). De novo structure prediction methods do not require sequence homology with known structures and can, in principle, provide annotation coverage to proteins unreachable by homology modeling. Unfortunately, de novo prediction methods require vast computational resources, and because of this, published pipelines that include de novo structure prediction are incapable of keeping pace with incoming genomic data.

This study focuses on this fourth type of annotation method via structure prediction. Here we describe the proteome folding pipeline (PFP), a protein-domain-level fold and function annotation method that combines de novo (Rosetta) (Rohl et al. 2004), fold recognition, and sequence-based structure assignment methods into a single pipeline that significantly extends functional and structural proteome annotation coverage for 94 complete genomes (as well as several new protein families from recent metagenomics studies). The computational cost of performing de novo structure prediction was distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid; http://www.wcgrid.org). De novo structures and sequence-based comparisons were used to assign predicted protein domains into the Structural Classification of Proteins (SCOP), a hierarchical classification of protein 3D structure (Murzin et al. 1995). We demonstrate our ability to integrate these predicted SCOP classifications with information from the Gene Ontology (GO) database (Ashburner […]

Next, we use the fold recognition algorithm FFAS03 (Jaroszewski et al. 2005) to match more evolutionarily distant sequences in the PDB, producing an additional 58,000 domains with significant matches to domains in the PDB. We then use additional methods, including identifying Pfam domains (Finn et al. 2008), an algorithm for predicting domains from multiple sequence alignments (MSA), and a heuristic-based algorithm for delineating domain boundaries to identify additional putative domains (Chivian et al. 2003, 2005). These methods combined to identify an additional 325,000 domains. In total, our domain prediction produced nearly 700,000 domains for the 389,000 query proteins, which serve as the basis for our domain-centric annotation of these proteins. We then used the Rosetta de novo protocol (Rohl et al. 2004) to predict the 3D structure of those domains lacking structure annotation (i.e., Pfam, MSA, and heuristic domains) and that are less than 150 residues in length. We predicted de novo folds for 57,000 domains on IBM's World Community Grid (requiring more than 100,000 yr of CPU time, resulting in one of the largest repositories of protein structure predictions publicly available). The final step in the pipeline classifies predicted protein domains into structural superfamilies (SCOP). PDB-BLAST and FFAS03 domains are assigned the superfamily of the PDB structure to which they matched, while a logistic regression model is used to classify de novo structures into SCOP superfamilies based on the Rosetta predictions and estimated error associated with these structure predictions (Malmström et al. 2007). In all, our pipeline classified more than 250,000 domains into SCOP superfamilies, of which nearly 43,000 are considered both confident (based on our benchmarks) and novel (FFAS03 and de novo).
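
The routing rule described above (structure-matched domains inherit a superfamily, while sequence-only domains shorter than 150 residues go to de novo prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Domain` class, the source labels, the route names, and the `annotation_route` function are all hypothetical names chosen for illustration, and only the 150-residue cutoff and the grouping of detection methods come from the text.

```python
from dataclasses import dataclass

# Hypothetical labels for how a domain was identified (our names, not the paper's).
STRUCTURE_SOURCES = {"pdb_blast", "ffas03"}            # matched to a PDB structure
SEQUENCE_ONLY_SOURCES = {"pfam", "msa", "heuristic"}   # no structure annotation
DE_NOVO_MAX_LENGTH = 150  # residues; the text says "less than 150"


@dataclass
class Domain:
    name: str
    length: int   # domain length in residues
    source: str   # one of the labels above


def annotation_route(domain: Domain) -> str:
    """Return the (hypothetical) next pipeline step for a predicted domain."""
    if domain.source in STRUCTURE_SOURCES:
        # Superfamily is taken from the matched PDB structure.
        return "inherit_superfamily"
    if domain.source in SEQUENCE_ONLY_SOURCES:
        if domain.length < DE_NOVO_MAX_LENGTH:
            # Short domain without structure annotation: candidate for de novo folding.
            return "de_novo"
        # Too long for de novo prediction under the stated cutoff.
        return "unassigned"
    raise ValueError(f"unknown domain source: {domain.source!r}")


if __name__ == "__main__":
    examples = [
        Domain("d1", 300, "ffas03"),
        Domain("d2", 120, "pfam"),
        Domain("d3", 150, "heuristic"),
    ]
    for d in examples:
        print(d.name, annotation_route(d))
```

The cutoff is written as a strict inequality because the text says "less than 150 residues"; a domain of exactly 150 residues is therefore left unassigned in this sketch. The later logistic regression step that assigns de novo structures to SCOP superfamilies is not modeled here.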