An Improved Database of Intrinsically Disordered and Mobile Proteins Emilio Potenza, Tomas´ Di Domenico, Ian Walsh and Silvio C.E
Total Page:16
File Type:pdf, Size:1020Kb
Published online 31 October 2014 Nucleic Acids Research, 2015, Vol. 43, Database issue D315–D320 doi: 10.1093/nar/gku982 MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins Emilio Potenza, Tomas´ Di Domenico, Ian Walsh and Silvio C.E. Tosatto* Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy Received August 15, 2014; Revised October 2, 2014; Accepted October 03, 2014 Downloaded from https://academic.oup.com/nar/article-abstract/43/D1/D315/2439671 by guest on 03 May 2019 ABSTRACT protein achieve its function. This paradigm has been chal- lenged by the collection of hundreds of proteins where func- MobiDB (http://mobidb.bio.unipd.it/) is a database of tion is determined by non-folding regions which play vital intrinsically disordered and mobile proteins. Intrin- biological roles (1,2). Flexible segments lacking a unique sically disordered regions are key for the function native structure, known as intrinsic disordered regions, are of numerous proteins. Here we provide a new ver- widespread in nature, especially in eukaryotic organisms sion of MobiDB, a centralized source aimed at pro- (3,4). The size of disordered regions can be short, long or viding the most complete picture on different fla- even encompass entire proteins and their non-enzymatic vors of disorder in protein structures covering all functions include regulation, protein–DNA/RNA interac- UniProt sequences (currently over 80 million). The tions and molecular recognition to name a few, for a recent database features three levels of annotation: man- review see e.g. (5). ually curated, indirect and predicted. Manually cu- One of the first repositories for experimentally deter- mined disorder was the DisProt database (6), containing rated data is extracted from the DisProt database. manually curated information on currently 694 proteins. Indirect data is inferred from PDB structures that are More recently, the IDEAL database (7) was developed, considered an indication of intrinsic disorder. The 10 which annotates 446 proteins with disorder and other inter- predictors currently included (three ESpritz flavors, esting properties by scanning the literature. Although Dis- two IUPred flavors, two DisEMBL flavors, GlobPlot, Prot and IDEAL are invaluable as an experimental gold VSL2b and JRONN) enable MobiDB to provide dis- standard, they both represent only a fraction of the se- order annotations for every protein in absence of quences in nature, posing a bottleneck for large-scale under- more reliable data. The new version also features standing of the disorder phenomenon. Experimental in vitro a consensus annotation and classification for long techniques such as nuclear magnetic resonance (NMR) and disordered regions. In order to complement the dis- x-ray crystallography detect disorder with difficulty in par- order annotations, MobiDB features additional anno- ticular for long regions and entire proteins. With currently around 100 000 NMR and x-ray structures, the Protein tations from external sources. Annotations from the Data Bank (PDB) (8) nevertheless provides a rich source UniProt database include post-translational modifi- of indirect experimental disorder. Missing residues in x-ray cations and linear motifs. Pfam annotations are dis- crystallographic structures in particular have become the de played in graphical form and are link-enabled, allow- facto standard proxy to infer disorder (1,6,9–10). Only more ing the user to visit the corresponding Pfam page recently have mobile regions in NMR structures started to for further information. Experimental protein–protein be used to infer disorder (11), although it is not entirely interactions from STRING are also classified for dis- clear how this relates to either missing x-ray regions or flex- order content. ible loops. Due to the difficulty in determining disorder ex- perimentally, a plethora of predictors were created over the INTRODUCTION last 15 years. Many are quite accurate, as shown at the re- Proteins have been known to exist in an equilibrium be- cent Critical Assessment of techniques for protein Structure tween an unfolded and folded state at least since Anfinsen’s Prediction (CASP-10) (12) and a large-scale assessment of experiments on denaturation. The existence of an unfolded, disorder predictors (10). Biophysical methods (13,14)de- or disordered, state has long been considered temporary, rive pseudo-energy functions from residue pairings in rigid due to the protein still having to adopt its final conforma- structures (i.e. non-disorder). Machine learning, especially tion. In this view, mobility of the protein structure was seen neural networks, has been widely used to predict protein as a localized phenomenon, where protein structure deter- disorder (3,15–18). Many predictors try to capture quite di- mines function and local flexibility is limited to helping the verse disorder flavors, e.g. ESpritz (15) can predict mobile *To whom correspondence should be addressed. Tel: +39 049 8276269; Fax: +39 049 8276260; Email: [email protected] C The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] D316 Nucleic Acids Research, 2015, Vol. 43, Database issue Table 1. Disorder sources consensus definition matrix DisProt PDB Predictors Consensus Disorder Disorder Any Disorder Disorder Structure Any Ambiguous Disorder Ambiguous Any Ambiguous Structure Disorder Any Ambiguous Structure Structure Any Structure Structure Ambiguous Any Ambiguous Ambiguous Any Any Ambiguous None Disorder Any Disorder None Structure Any Structure None Ambiguous Any Ambiguous None None Disorder Disorder (LC) Downloaded from https://academic.oup.com/nar/article-abstract/43/D1/D315/2439671 by guest on 03 May 2019 None None Structure Structure (LC) Each possible annotation scenario is listed for for the three data sources (DisProt, PDB, predictors) together with its consensus annotation. Ambiguous is used for residues with conflicting annotations warranting further investigation, which may be due to folding upon binding events. LC means lowconfi- dence. Combinations yielding structure as consensus are underlined and those for disorder are shown in bold. Sources which are not contributing to the consensus are shown in italics. NMR regions and DisEMBL (17) loop regions with high to UniProt (22), PDB (8) and Pfam (23) through SIFTS B-factor (high flexibility). Predictions can increase the num- (24). MobiDB also includes secondary structure derived ber of annotated sequences to millions but they must be from PDB files using DSSP25 ( ). Pfam annotations are fast to process many gigabytes of data and keep pace with displayed in graphical form and are link-enabled, allow- data expansion. Despite earlier interest in proteome-scale ing the user to visit the corresponding Pfam page for fur- disorder predictions (3), DICHOT (19) is probably the first ther information. Low-complexity regions predicted with public database to provide predictions for the human pro- SEG (26) and Pfilt (27) are included, as it is thought that teome (ca. 20 000 proteins). MobiDB (20), initially limited low sequence complexity correlates with intrinsic disorder to ca. 450 000 SwissProt sequences, was the first published (28,29). Protein–protein interactions are incorporated from database to contain a mixture of experimental data and STRING (30) by considering only interactions of high ac- a consensus prediction approach to annotate as many se- curacy with database or experimental evidence. Functional quences as possible with intrinsic disorder. A similar large- information from UniProt, e.g. post-translational modifica- scale database, D2P2 (21), was published somewhat later to tions and binding sites (among others), are also assigned to provide consensus predictions for ca. 10 million sequences residues. from fully-sequenced genomes. The new version of Mo- biDB 2.0 improves over its predecessor in terms of coverage Disorder predictors and molecular annotations. It is cross-linked from UniProt, covering all of its protein sequences, presently annotating MobiDB uses three biophysical predictors (IUPred-short over 80 million sequences from thousands of organisms. (14), IUPred-long (14), Globplot (13)) and seven ma- chine learning predictors (DisEMBL-465 , DisEMBL-HL , Espritz-DisProt (15), Espritz-NMR (15), Espritz-xray (15), DATABASE DESCRIPTION JRONN (16)andVSL2b(18)). All predictors are chosen < Data sources for their speed ( 10 s per protein). A consensus prediction is formed by applying a majority vote on the 10 predictors MobiDB is designed in three layers (in order of qual- when there is no high quality information from NMR, x-ray ity): manual curation, indirect experimental PDB informa- or DisProt. tion and predictions. Its data sources are essentially four: DisProt, PDB-NMR, PDB-xray and predictors. The high- Combining experimental data est quality data is currently extracted from the DisProt database (6), a central repository manually curated for The core of MobiDB is shown in the section ‘Sequence an- structure-function annotations associated with protein in- notations’ where all the data are collected to form a global trinsic disorder.