The Complexity, Challenges and Benefits of Comparing Two Transporter Classification Systems in TCDB and Pfam Zachary Chiang*, Ake Vastermark*, Marco Punta, Penelope C
Total Page:16
File Type:pdf, Size:1020Kb
Briefings in Bioinformatics, 16(5), 2015, 865–872 doi: 10.1093/bib/bbu053 Advance Access Publication Date: 21 January 2015 Paper The complexity, challenges and benefits of comparing two transporter classification systems in TCDB and Pfam Zachary Chiang*, Ake Vastermark*, Marco Punta, Penelope C. Coggill, Jaina Mistry, Robert D. Finn and Milton H. Saier, Jr Corresponding author. Milton H. Saier, Jr., Department of Molecular Biology, UC San Diego, La Jolla, CA 92093-0116. Tel.: þ1-858-534-4084; Fax: þ1-858-534- 7108. E-mail: [email protected] *These authors contributed equally to this work. Abstract Transport systems comprise roughly 10% of all proteins in a cell, playing critical roles in many processes. Improving and expanding their classification is an important goal that can affect studies ranging from comparative genomics to potential drug target searches. It is not surprising that different classification systems for transport proteins have arisen, be it within a specialized database, focused on this functional class of proteins, or as part of a broader classification system for all proteins. Two such databases are the Transporter Classification Database (TCDB) and the Protein family (Pfam) database. As part of a long-term endeavor to improve consistency between the two classification systems, we have compared transporter annota- tions in the two databases to understand the rationale for differences and to improve both systems. Differences sometimes reflect the fact that one database has a particular transporter family while the other does not. Differing family definitions and hierarchical organizations were reconciled, resulting in recognition of 69 Pfam ‘Domains of Unknown Function’, which proved to be transport protein families to be renamed using TCDB annotations. Of over 400 potential new Pfam families identified from TCDB, 10% have already been added to Pfam, and TCDB has created 60 new entries based on Pfam data. This work, for the first time, reveals the benefits of comprehensive database comparisons and explains the differences between Pfam and TCDB. Key words: TCDB; Pfam; transport protein classification; data integration Introduction relationships between different protein entries. The informa- Protein classification databases organize entries into coherent tion stored in these databases is widely used for predicting schemas that typically highlight evolutionary or functional function of experimentally uncharacterized proteins and for Zachary Chiang is a programming analyst in the Saier Bioinformatics Lab, University of California at San Diego. Ake Vastermark is a postdoctoral employee in the Saier Bioinformatics Lab and a database administrator of the Transporter Classification Database. Marco Punta was Project Leader at the Pfam database at the Wellcome Trust Sanger Institute and European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI) until September 2014 and starting November 2014 is a visiting scientist at UPMC Paris, France. Penelope C. Coggill is a curator at the Pfam database at the Wellcome Trust Sanger Institute and EMBL-EBI involved in integrating suggestions for adding families to clans. Jaina Mistry is a Pfam curator who worked on the analysis of TCDB and Pfam families. Robert D. Finn is the Protein Families Team Leader at the EMBL-EBI. Milton Saier is Professor of Molecular Biology and founder of the TCDB at UCSD. Submitted: 17 November 2014; Received (in revised form): 9 December 2014 VC The Author 2015. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 865 866 | Chiang et al. automatic annotation of whole proteomes by transferring func- library, the pfam_scan searches can be repeated to identify any tional annotations as a result of homology [1]. Numerous TCDB proteins still not found in Pfam. In other words, if a databases exist that focus on different but overlapping themes protein with TC numbers 1.A.2.1.X (where X stands for any TC (e.g. sequence similarity, structural similarity, molecular func- transport system) has been ignored in the first iteration because tion, evolutionary pathway) or have different resolution (e.g. of the presence of a nonhomologous subunit with the same first protein domains and functional subfamilies versus large multi- four TC numbers, it will be picked up by the second iteration. component families) [2–10]. Thus, pfam_scan will find any nonmatching protein with TC In this contribution, we describe some examples of differ- number 1.A.2.1.X, and this can be repeated until no additional ences in how proteins and protein families have been classified proteins can be found. within the Protein family (Pfam) database (pfam.xfam.org) [11] This procedure resulted in a final set of 407 protein se- and the Transporter Classification Database (TCDB) (www.tcdb. quences, after a single round of the analysis method described org) [12–14]. We present examples of how Pfam and TCDB differ above, which could be used to initiate a new protein family in and how some of the differences can be explained. The process Pfam. To check the potential of these protein sequences being of resolving differences is time-consuming, and some are im- built into Pfam families, we focused on the proteins that possible to resolve owing to philosophical differences between produced at least 100 hits in our reference database after two the two databases. Both databases rely on manual curation for iterations of jackhmmer (see Methods), that is, the ones most functional annotation, but while Pfam aims to comprehensively likely to produce the largest families. These amounted to 196 classify all protein or protein domain homologous relationships, proteins overall; see Supplementary Materials File TCDB-for- TCDB focuses on the functional/phylogenetic classification of Pfam.xlsx). To date, Pfam curators have processed 53 entries whole transport systems. Pfam contains 14 488 literature refer- from this list. Of these, 40 new families have been built. Of the ences over the whole protein family spectrum, while TCDB in- remaining possibilities, 13 sequences did not lead to the cludes 9814 literature references for transport systems alone. construction of new Pfam families for a variety of reasons The fundamental difference between the two resources is including (1) they matched Pfam families that were independ- that Pfam groups protein regions (or domains) into homologous ently built after release 27.0 but before this analysis, (2) they families and homologous families into clans, while TCDB classi- matched one of the families just built from the TCDB list (i.e. fies transport systems initially based on function and mechan- they were homologous to some other protein in the list even ism with homology playing a secondary role. The basic units of though they had a different level 4 classification in the TC sys- Transporter Classification (TC) are complete transport systems tem) (3) they corresponded to obsolete UniProtKB sequences or that often include nonhomologous domains and/or proteins. (4) they turned out to be outliers of existing families. Examples These units are then organized into a five-level hierarchy (class, of new Pfam families created from TCDB include, among others, subclass, family, subfamily and transport system, with a super- the Fungal Potassium Channel (F-Kch) Family (TC# 1.A.88; pre- family hyperlink that is independent of the hierarchical sys- liminary Pfam accession ahead of Pfam release 28.0: PF16944) tem). These conceptual differences make mapping between and the Chloroplast Envelope Anion Channel-forming (Tic110) Pfam and TCDB complex with many-to-many relationships be- Family (TC# 1.A.18; 3.A.9; preliminary Pfam accession ahead of tween entries. Pfam release 28.0: PF16940). Results Using TCDB annotations to improve Pfam Our first goal towards understanding the differences between Approximately one-third of the protein families in Pfam lack Pfam and TCDB was to identify entries in one classification any associated functional or experimental characterization. system that were missing in the other. Thus, we started by Many of these families fall into two groups that are referred to asking the simple question: how many of the transport proteins as Domains of Unknown Function (DUFs) and Uncharacterized classified by TCDB were not found in a Pfam family? Protein Families (UPFs, derived from Swiss-Prot). Often, experi- mental data are published subsequent to the generation of a Pfam entry. Unfortunately, Pfam lacks a sophisticated mechan- Transporter families absent in Pfam ism necessary for routinely matching published experimental TCDB proteins lacking a match to Pfam (version 27.0) would data to database entries and relies on biocurators to annotate constitute prime candidates for generation of new Pfam trans- the entries; this requires significant effort and resources. port-related families. Of 11 382 protein sequences in TCDB, 1122 Consequently, some DUFs and UPFs may remain functionally proteins were not annotated by Pfam (see Methods). To minim- unannotated after the publication of relevant experimental ize the complexity of the many-to-one relationships posed by data. An adjunct to identifying TCDB proteins that lack a Pfam this difference between TCDB and Pfam, we considered only match is the identification