Psortdb—An Expanded, Auto-Updated, User-Friendly Protein Subcellular Localization Database for Bacteria and Archaea Nancy Y

Published online 10 November 2010 Nucleic Acids Research, 2011, Vol. 39, Database issue D241–D244 doi:10.1093/nar/gkq1093 PSORTdb—an expanded, auto-updated, user-friendly protein subcellular localization database for Bacteria and Archaea Nancy Y. Yu, Matthew R. Laird, Cory Spencer and Fiona S.L. Brinkman* Department of Molecular Biology & Biochemistry, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada Received September 15, 2010; Revised October 15, 2010; Accepted October 17, 2010 ABSTRACT or vaccine targets, reveals suitable targets for microbial diagnostics and provides directions for experimental The subcellular localization (SCL) of a microbial design. For biomedical applications, identification of cell protein provides clues about its function, its suit- surface and secreted proteins from pathogenic bacteria ability as a drug, vaccine or diagnostic target and may lead to the discovery of novel therapeutic targets. aids experimental design. The first version of Characterizing cell surface and extracellular proteins PSORTdb provided a valuable resource comprising associated with non-pathogenic Bacteria and Archaea a data set of proteins of known SCL (ePSORTdb) as can have industrial uses, or play a role in environmental well as pre-computed SCL predictions for prote- detection. omes derived from complete bacterial genomes The first SCL prediction software, called PSORT, was (cPSORTdb). PSORTdb 2.0 (http://db.psort.org) developed in 1991 by Kenta Nakai for bacteria, animals extends user-friendly functionalities, significantly and plants (1, 2). PSORT II, iPSORT and WoLF PSORT were subsequently developed for eukaryotic species (3–5). expands ePSORTdb and now contains pre- PSORTb and PSORTb 2.0 were later developed in 2003 computed SCL predictions for all prokaryotes— and 2005 specifically for Gram-negative and -positive including Archaea and Bacteria with atypical cell bacterial protein SCL prediction, with a focus on wall/membrane structures. cPSORTdb uses the high-precision/specificity predictions (6,7). They have latest version of the SCL predictor PSORTb been the most precise SCL prediction methods developed (version 3.0), with higher genome prediction (8). However, recently PSORTb version 3.0 was de- coverage and functional improvements over veloped, with 98% precision for Gram-positive bacteria PSORTb 2.0, which has been the most precise bac- and 97% precision for Gram-negative bacteria, surpassing terial SCL predictor available. PSORTdb 2.0 is the PSORTb 2.0 and other available prokaryotic SCL predict- first microbial protein SCL database reported to ors (9). PSORTb 3.0 also provides improved genome pre- have an automatic updating mechanism to regularly diction coverage (higher recall at high precision), as well as the ability to predict a broader range of prokaryotes generate SCL predictions for deduced proteomes of including Archaea as well as bacteria with atypical newly sequenced prokaryotic organisms. This membrane/cell wall structures. In addition, PSORTb 3.0 updating approach uses a novel sequence analysis now identifies subcategory localizations for proteins we developed that detects whether the microbe destined to specialized bacterial organelles (such as the being analyzed has an outer membrane. This iden- flagellum and pilus) as well as host cell destinations. tification of membrane structure permits appropri- The speed in which prokaryotic genomes are sequenced ate SCL prediction in an auto-updated fashion and has been increasing at a dramatic rate thanks to the avail- allows PSORTdb to serve as a practical resource for ability of sequencing technologies which can decode DNA genome annotation and prokaryotic research. sequences at a dramatically increased throughput with lower cost. This creates a challenge for maintaining up-to-date functional annotation of these newly sequenced genomes (10). Given the high accuracy of computational INTRODUCTION SCL prediction for prokaryotes (8), some genome anno- Protein subcellular localization (SCL) prediction aids in- tation groups have incorporated SCL prediction into their ference of protein function, identifies candidates for drug bioinformatics annotation pipeline (11). Instead of having *To whom correspondence should be addressed. Tel: 778 782 5646; Fax: 778 782 5583; Email: [email protected] The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. ß The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. D242 Nucleic Acids Research, 2011, Vol. 39, Database issue many different researchers compute the same prokaryotic and the results are available for access on the web server as protein SCL prediction repeatedly when needed, it would well as for download. At the time of writing, cPSORTdb be more efficient to create a centralized database of contains 1286 Gram-negative bacterial replicons, 508 pre-computed SCL prediction results that is continually Gram-positive bacterial replicons, 126 archaeal replicons, updated to incorporate SCL predictions for newly 30 replicons belonging to Gram-negative bacteria without sequenced organisms. outer membrane and 11 replicons that belong to Several databases containing prokaryotic SCL informa- Gram-positive bacteria with an outer membrane. All tion have been developed over the years (see http://www together, SCL data for more than 3 700 000 proteins are .psort.org for a list), such as DBSubLoc, PA-GOSUB and currently stored in cPSORTdb. This database is now set to UniProt (12–14). Some are developed specifically for be continually updated, with whole-proteome SCL predic- certain types of bacteria; for example, LocateP Database tions added as newly available sequenced prokaryotic and Augur contain localization predictions specific to genomes become available through NCBI’s microbial Gram-positive bacteria (15,16); others like DBMLoc are genome database each month. specific for multiple SCLs (17). Some incorporate predictions from multiple SCL-prediction tools like CoBaltDB Automatic database update using a computational ‘outer (18). However, none of them are reported, or observed, to membrane detection’ procedure be continually updated in a frequent, regular fashion to accommodate newly sequenced genomes, nor do they A major new feature of PSORTdb version 2 is the ability contain high-precision predictions suitable for handling to automatically determine what ‘Gram-stain’ or cellular diverse prokaryotic cellular structures. structure a given bacterial proteome should be analyzed PSORTdb (19) is a database initially developed in 2005 under, by identifying through a sequence analysis whether to contain experimentally determined (ePSORTdb) and the proteome is consistent with an outer membrane- computationally predicted (cPSORTdb) protein SCLs containing bacterium or not. Previously, when microbial for Bacteria. The computational predictions in genomes have been released by NCBI, we manually cPSORTdb were originally generated by PSORTb 2.0, determined the Gram-stain and cell structure based on the most precise bacterial SCL predictor of its time (7). bacterial phylum recorded and literature review. We It is widely used by researchers wishing to identify the noticed that neither the Gram-stain in the NCBI microbial SCL of specific proteins, verify high-throughput experi- database nor the bacterial phylum were 100% accurate in mental results, as well as those who need a training data indicating the cell structure, due in part to the increasing set to develop novel SCL prediction software. diversity of bacterial genomes being sequenced. Gram- To keep up with the increasing rate of prokaryotic positive organisms traditionally have a cytoplasmic genomes sequenced, and a new version of PSORTb membrane surrounding the cell, and a thick cell wall (version 3.0), we have now developed a new version of composed of peptidoglycan that encircles the cytoplasmic PSORTdb (version 2.0; http://db.psort.org) that automat- membrane. Gram-negative organisms typically have a ically computes PSORTb 3.0 SCL prediction results as much thinner cell wall within the periplasm and an asym- new prokaryotic genomes become available through metrical outer membrane surrounding the entire cell that NCBI each month. A largely expanded training data set is in addition to the cytoplasmic membrane. However, the of proteins with known SCLs have also been added to traditional Gram-staining procedure does not always ac- ePSORTdb. We have also improved the user interface to curately denote the structure of all bacteria. For example, facilitate easier browsing and searching of the database, as Deinococcus spp. stain Gram-positive because they have a well as convenient download options for filtered search thick cell wall, yet they also have an outer membrane (20). results. Old PSORTb 2.0 prediction results are still main- Mycoplasma spp., on the other hand, stain Gram-negative tained in this database for archival purposes. The entire because they have no cell wall, but they also only have database is freely available through the web or for one cell membrane (i.e. no classical outer membrane) (21). download, and highlighted features are briefly described The former should really be analyzed like a Gram- below. negative,

Psortdb—An Expanded, Auto-Updated, User-Friendly Protein Subcellular Localization Database for Bacteria and Archaea Nancy Y

The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D

Biological Capacities Clearly Define a Major Subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J

Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial

Psortm: a Bacterial and Archaeal Protein Subcellular Localization Prediction Tool for Metagenomics Data Michael A

Pathway Tools Version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology

A Sample Article Title

Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria

Ontoloki: an Automatic, Instance-Based Method for the Evaluation of Biological Ontologies on the Semantic Web1

View Course Material

Dbmloc: a Database of Proteins with Multiple Subcellular Localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun*

Ortholugedb: a Bacterial and Archaeal Orthology Resource for Improved Comparative Genomic Analysis Matthew D

Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance