BIOINFORMATICS Pages 1–6

Vol. 00 no. 00 2005 BIOINFORMATICS Pages 1–6 Mapping PDB Chains to UniProtKB Entries Andrew C.R. Martin Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT ABSTRACT have added PDB chain information and the range of residues in Motivation: UniProtKB/SwissProt is the main resource for detailed the UniProtKB/SwissProt entry which corresponds to a given PDB annotations of protein sequences. This database provides a jumping- chain. off point to many other resources through the links it provides. Among Previously, we developed a mapping between PDB chains others, these include other primary databases, secondary databases, and UniProtKB/SwissProt codes as part of an analysis of pro- the Gene Ontology and OMIM. While a large number of links are protein fold distributions for different enzyme classes (Martin et al., vided to PDB files, obtaining a regularly updated mapping between 1998). In order to account for supplied links from PDB chain UniProtKB entries and PDB entries at the chain or residue level is not to UniProtKB/SwissProt, UniProtKB/SwissProt entry to PDB file straightforward. In particular, there is no regularly updated resource and unlinked files, the process was surprisingly complex. More which allows a UniProtKB/SwissProt entry to be identified for a given recently, we developed a system for mapping between PDB residue of a PDB file. protein chains and enzyme classification (EC) numbers, availa- Results: We have created a completely automatically maintai- ble online at http://www.bioinf.org.uk/pdbsprotec/ ned database which maps PDB residues to residues in Uni- (Martin, 2004) and a system for mapping SNP data onto protein ProtKB/SwissProt and UniProtKB/trEMBL entries. The protocol uses structures (Cavallo and Martin, 2005). In these cases we made use links from PDB to UniProtKB, from UniProtKB to PDB and a brute- of a mapping between PDB chains and UniProtKB/SwissProt codes force sequence scan to resolve PDB chains for which no annotated made available to us by the EBI. However, the downloadable ver- link is available. Finally the sequences from PDB and UniProtKB are sion of this resource is not regularly updated leading to a need for aligned to obtain a residue-level mapping. us to develop an automatically updated version. Availability: The resource may be queried interactively or downloa- Links in the PDB to a UniProtKB/SwissProt entry may contain ded from http://www.bioinf.org.uk/pdbsws/ either the UniProtKB/SwissProt identifier (ID) or the accession code Contact: [email protected] –or– [email protected] (AC), and are presented at the chain level. Links in the other direc- tion (from UniProtKB/SwissProt to the PDB) are updated more 1 INTRODUCTION frequently, but until Release 45 of UniProtKB/SwissProt (October, The Protein Databank (Berman et al., 2000, PDB) is the primary 2004) were at the whole PDB file level. In our previous implemen- resource in which protein structure data are deposited while Swiss- tation, fasta33 (Pearson and Lipman, 1988) was used to resolve Prot and trEMBL (Boeckmann et al., 2003) are the major resources which chain was involved and a brute-force FASTA search against containing protein sequence data. trEMBL is an automatic trans- UniProtKB/SwissProt was used for any unassigned chains. In 1998, lation from the EMBL DNA databank while SwissProt contains a the whole process took some 3–4 days to run and unfortunately was huge number of carefully maintained manual annotations. A recent not designed in a manner that could allow fast easy updates as new effort has seen the integration of the SwissProt and trEMBL data PDB or UniProtKB/SwissProt entries become available. with the Protein Information Resource (PIR) to create the UniProt With the exponential growth in the size of the PDB and the Knowledgebase (UniProtKB) (Bairoch et al., 2005), designed as sequence databanks, a complete run now takes of the order of the central access point for extensive curated protein information, 10 days on a single Athlon XP 2800+ processor. Since the including function, classification, and cross-references. data resources (UniProtKB/SwissProt, UniProtKB/trEMBL and the One might expect that mapping between the PDB and UniProtKB PDB) are typically updated within this period, it becomes essential would be a trivial exercise. However, this is not the case. In some either to parallelize the procedure or, more efficiently, to move to a cases, PDB entries provide cross-links to UniProtKB/SwissProt; system which is automatically updateable and does not need to be in other cases, UniProtKB/SwissProt provides cross-links to PDB; re-generated from scratch. in other cases, no link between the resources is supplied. In the The problem of PDB chain to UniProtKB/SwissProt map- case of cross-links provided from the PDB, this may be to the ping has been partially addressed by the Research Collabora- UniProtKB/SwissProt accession code or the identifier. These may tory for Structural Bioinformatics (RCSB). Their beta site provi- become out-dated and are rarely corrected in PDB entries. Clearly, des a mapping of PDB files to associated UniProtKB/SwissProt a mapping between the PDB and UniProtKB/SwissProt is extremely entries (ftp://beta.rcsb.org/pub/pdb/uniformity/ valuable: the detailed annotations and cross-links to other resources derived data/pdb2sp.txt). At the time of writing, this file available in UniProtKB can then be applied to PDB chains. Recent had been updated approximately a month earlier providing Uni- changes to UniProtKB/SwissProt have improved the mapping and ProtKB/SwissProt accession codes for most PDB chains (some entries, for example 1a7m, are missing). Chimeric chains such as ¡ To whom correspondence should be addressed ¢ c Oxford University Press 2005. 1 Martin idac chain A of PDB entry 1gk5 (Chamberlin et al., 2001) are also cor- ac pdbsws alignment *id *pdb *pdb rectly handled, though there is no indication of which residues map length pdbaa date *chain to which UniProtKB/SwissProt entry. However, a number of the source swscount *chain *pdbcount start resnam cross-links appear to be to other databases even when entries appear pdbac sprot aligned ac *pdb *ac *ac swsaa in UniProtKB/SwissProt. For example, chain N of PDB entry 1a02 *ac date identity resid done sequence stop is given a cross-link to an entry simply termed ‘1353774’ (presu- fracoverlap overlap mably an EMBL or Genbank identifier), but this entry should map valid to UniProtKB/SwissProt entry Q13469. Prior to recent updates, the acac file had not been updated for almost two years when it had appeared ac in a different format listing the entries without chain information *altac and providing both UniProtKB/SwissProt identifier and accession for most entries (although the accession was missing from approxi- Fig. 1. Entity-relationship diagram for the database used for processing mately 20% of entries). Clearly the file format is in a state of flux the mapping data. A simple line represents a one-to-one relationship while and contains a number of inconsistencies and missing entries where a ‘crow’s foot’ represents a one-to-many relationship. For example, one mappings should be possible. UniProtKB accession in the ‘sprot’ table can link to several secondary acces- The new XML format files containing PDB data (available at sions in the ‘acac’ table. Primary keys are indicated with asterisks. The ftp://beta.rcsb.org/pub/pdb/uniformity/data/ ‘pdbsws’ table is the main table linking PDB chains to UniProtKB entries XML/) now provide a mapping between ‘entities’ and while links at the residue level are in the ‘alignment’ table. UniProtKB/SwissProt entries where an entity corresponds to a unique gene segment. However, this is a beta release; over the course of this work, the format of these files has been in a state of The derivation of the mapping occurs in several stages each of flux and there have been relevant changes. Like the original PDB which is described below. The mapping is created within a Post- files, the XML files still suffer from inconsistencies in the use of greSQL relational database which may be queried via our web site UniProtKB/SwissProt identifiers and accession codes. and is dumped to a flat file which may be downloaded. Each stage In addition to including cross-links provided in the standard of the mapping procedure is designed to be ‘updateable’. In other PDB files, the RCSB mapping (in both the flat file and the XML words, re-running the procedure will only perform the necessary files) includes per-entity mappings derived from cross-links in Uni- updates. All processing is performed using a set of Perl scripts ProtKB/SwissProt entries. Thus, while a large amount of infor- which interface to PostgreSQL via Perl/DBI. mation is available, extracting the UniProtKB/SwissProt accession code together with the associated PDB chain and residue range 2.1 Mirroring the data on a regular automated basis remains a complex task. One must The UniProtKB data (‘full’ data files incorporating updates on either parse all the XML files, or one must use the flat file (assu- top of official releases) are mirrored from the Expasy FTP site ming it is regularly updated) and resolve cases which are not correct (ftp://ftp.expasy.org/databases/uniprot/ UniProtKB/SwissProt cross-links. knowledgebase/) while the PDB The problem has also been addressed by the Macromo- files are mirrored from the EBI lecular Structure Database (MSD) group at the European (ftp://ftp.ebi.ac.uk/pub/databases/rcsb/pdb/ Bioinformatics Institute (EBI). They have created a map- data/structures/all/pdb). A modified version of the ping between PDB chain and UniProtKB/SwissProt acces- well-known Perl mirror script is used which is capable of storing a sion code at the individual residue level which is used in local de-compressed version of a remote compressed archive the MSDlite search system (http://www.ebi.ac.uk/msd- (http://www.bioinf.org.uk/software/mirror/).

BIOINFORMATICS Pages 1–6

To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari

Orthofiller: Utilising Data from Multiple Species to Improve the Completeness of Genome Annotations Michael P

Infravec2 Open Research Data Management Plan

Lab 5: Bioinformatics I Sanger Sequence Analysis

Introduction to Bioinformatics (Elective) – SBB1609

A Tool to Sanity Check and If Needed Reformat FASTA Files

Genbank Dennis A

The FASTA Program Package Introduction

The Candida Genome Database: the New Homology Information Page Highlights Protein Similarity and Phylogeny Jonathan Binkley, Martha B

The Dynamic Landscape of Fission Yeast Meiosis Alternative-Splice Isoforms

BLAST Practice

FASTA Format 1.0.1 EXAMPLE FASTA FILE 2 Graphical Sequence Manipulation