Iditis: Protein Structure Database

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Crossref 1071 Acta Cryst. (1998). D54, 1071±1077 Iditis: Protein Structure Database Stephen Gardnera* and Janet Thorntonb aSynomics Ltd, 1 Cambridge Business Park, Cambridge CB4 0WZ, England, and bDepartment of Biomolecular Structure, University College London, Gower Street, London WC1E 7HX, England. E-mail: [email protected] (Received 27 February 1998; accepted 18 May 1998) Abstract and to describe the functional anatomy of new proteins and their mutations in a semi-automated fashion, for The validation, enrichment and organization of the data which thorough knowledge of all existing data is critical. stored in PDB ®les is essential for those data to be used The number of known protein structures deposited at accurately and ef®ciently for modelling, experimental the Protein Data Bank (PDB) (Bernstein et al., 1977; design and the determination of molecular interactions. Abola et al., 1987) has increased rapidly over the last few The Iditis protein structure database has been designed years to its current level of well over 6500 structures. to allow the widest possible range of queries to be These structures are currently distributed as formatted performed across all available protein structures. The data ®les, one for each protein structure. The protein Iditis database is the most comprehensive protein structure data deposited in these ®les are reduced to a structure resource currently available, and contains over very simple representation (PDB, 1992), which can 500 ®elds of information describing all publicly depos- obscure the biologically signi®cant information ited protein structures. A custom-written database contained in the data set. Whilst these structure ®les can engine and graphical user interface provide a natural be used to examine individual structures in detail, for and simple environment for the construction of searches example using molecular graphics, this organization of for complex sequence- and structure-based motifs. data does not allow the whole body of known structures Extensions and specialized interfaces allow the data to be searched in detail to identify speci®c features of generated by the database to used in conjunction with a interest from all available proteins. More recently the wide range of applications. PDB has provided a limited relational catalogue of the protein structures available. The information available for searching is, however, limited to protein name, 1. Introduction function, source and simple experimental information, and does not include detailed derived structural infor- Protein structure data plays a pivotal role in the mation. understanding of the mechanisms of molecular interac- In order to be able to identify or characterize struc- tion. When available, it often becomes the key to tures, folds, or motifs from all proteins at once, it is unlocking the mechanisms of a disease or other mole- necessary to extend the range of the structure data cular process and provides a springboard for the inves- substantially, and to organize the fullest possible range tigation of candidate compounds that may bind to and of structural information into a fully functional data- modify the action of target receptors or enzymes. Many base. This has been achieved in the Iditis protein new technologies that aim to accelerate our under- structure database, where 500 ®elds of information are standing of novel molecular systems are extending the stored for all deposited PDB structures. utility of, and our dependence on, accurate protein structure data. Large-scale sequencing of microbial and other pathogenic genomes to identify sequences related 2. Background to known disease-causing agents is now routinely performed. High-throughput screening relies largely on The ®rst generation of databases of protein structure the identi®cation of enzymes, receptors or larger cellular was based around commercially available relational complexes which can be used to model a speci®c disease database management systems (RDBMSs) (Isogai et al., process and identify potentially active compounds. 1987; Akrigg et al., 1988; Islam & Sternberg, 1989) Rapid genotyping of individuals for speci®c poly- although some researchers attempted to use logic morphic alleles will become an increasingly common programming methods (Clark et al., 1990; Paton & Gray, method of screening patients for therapies. All these 1988; Gray et al., 1988). These predominantly relational techniques are required to assign putative homologies systems attempted to organize the protein structure data # 1998 International Union of Crystallography Acta Crystallographica Section D Printed in Great Britain ± all rights reserved ISSN 0907-4449 # 1998 1072 Iditis: PROTEIN STRUCTURE DATABASE Table 1. Major data-derivation programs Program Authors Purpose References & methods ACCESS S. J. Hubbard Calculate the solvent-accessible Chothia, 1976; surface area of proteins Lee & Richards, 1971; Satav et al., 1980 ALTERNATES E. G. Hutchinson Builds ALTERNATE table. ATMNAMES D. K. Smith Determines atom- and residue-based properties BRKALN D. K. Smith; Checks ATOM records against SEQRES E. G. Hutchinson records BRKCLN D. K. Smith Cleans raw PDB ®les IUPAC-IUB, 1970 BVCALC D. Naylor Calculates average B or U2 values per residue. CALPHA D. K. Smith Builds the CALPHA table Nishikawa & Ooi, 1980, 1986 DISULF D. K. Smith Builds the DISULPH table HBOND D. T. Jones Identi®es all potential hydrogen bonds Baker & Hubbard, 1984 in a protein HLXTBL T. P. Flores Builds the HELIX and HELIXINT tables Barlow & Thornton, 1988; Chothia et al., 1981 Richmond & Richards, 1978; Kabsch & Sander, 1983 LIGAND E. G. Hutchinson Generates the LIGAND and WATER tables NEIGHBOUR D. T. Jones Identi®es all non-bonded interactions Narayana & Argos, 1984 in a protein NEWAMINOA D. K. Smith, Correlates records to build up the Kabsch & Sander, 1983; E. G. Hutchinson AMINO table Nishikawa & Ooi, 1980, 1986; E®mov, 1986 NEWATOM E. G. Hutchinson Builds the ATOM table NEWCHAIN D. K. Smith Builds the CHAIN table NEWDBSHEET E. G. Hutchinson Builds the SHEET and STRAND tables Kabsch & Sander, 1983; Richardson, 1976, 1977 NEWTAB2DS D. K. Smith Builds the GAMMATURN and Lewis et al., 1973; BETATURN tables E®mov, 1986; Milner-White et al., 1988 NMRCLUST L. A. Kelley Clusters models within NMR ensembles Kelley et al., 1996 PROCHECK R. A. Laskowski Analyses stereochemical quality of Laskowski et al., 1993 protein structures PROTIN D. K. Smith Builds the PROTEIN table SALTBR D. K. Smith Builds the SALT table Barlow & Thornton, 1983 SITE Builds the SITE table SSTRUC D. K. Smith Determines information pertaining to Kabsch & Sander, 1983; the secondary structure of the protein Nishikawa & Ooi, 1980, 1986 SUMMARY A. L. Morris Builds the SSSUM table in order to allow them to be used by researchers in a RDBMS. At roughly the same time as these relational rational manner. Most groups made attempts to provide systems were being developed, computer scientists at a richer set of data than was available in the PDB ®le by Aberdeen University were creating an object-oriented calculating derived data ®elds, such as secondary-struc- version of the database (P/FDM) using the BIPED raw ture assignments, C distance matrices and torsion data ®les (Gray et al., 1990). angles. The RDBMSs provided search tools to identify At the time of development of Iditis (Thornton & chosen subsets of the data bank corresponding to Gardner, 1993; Oxford Molecular, 1997), using rela- particular search criteria. tional systems to store protein structure data was The most successful of the ®rst-generation relational attractive, although they had some major limitations, systems were undoubtedly BIPED (Islam & Sternberg, notably lack of record order, inef®cient storage and non- 1989) and SESAM (Huysmans et al., 1991), although extensible query languages. At that time also, object- both were limited by the underlying relational tech- oriented systems were in their infancy, and were nology. BIPED was designed around the ORACLE complex, inef®cient and slow. Much progress with RDBMS, and stored structural (and later homologous object-oriented systems has since been made (Gray et sequence) data for all available proteins, whilst SESAM al., 1996), although fully operational systems are still to contained a set of protein structures and their homo- come to fruition. Given the state of currently available logous sequences, and was based around the SYBASE technology, Iditis was designed as an extended pseudo- STEPHEN GARDNER AND JANET THORNTON 1073 relational database management system with a structure, have been run on a ®le, it is ready to be passed comprehensive data schema. through the data-derivation programs themselves. The programs that are used are mainly implementations of standard, published techniques, and generate information in ASCII output ®les capable of being read into 3. The Iditis family Iditis. As the ®nal step, the intermediate ®les are refor- The Iditis protein structure database has a number of matted into data `stripes' that can then be loaded into components. Iditis using the data-de®nition language tools. (i) PROCHECK, a protein-structure data-validation In total, there are over 45 programs which interact to toolkit. produce the ®nal set of data stripes for the Iditis data- (ii) NMRCLUST and NMRCORE, automatic clus- base. The important derivation programs that are used tering tools for atoms and models in an NMR ensemble. are shown in Table 1. (iii) Iditis Architect, a structural data derivation The use of NMR methods for solving small to medium toolkit. size polypeptides has become much more feasible in (iv) Iditis Data, the comprehensive database of vali- recent years. One of the advantages of NMR methods is dated, derived protein structure data. the presentation of multiple solution structures. (v) Iditis RDBMS, a database management and Frequently, ensembles of over 30 model solutions of the search engine. structure are presented in a single PDB ®le, and occa- sionally over 50. This allows more detailed analysis of the movements of a structure in solution, and may avoid some of the time averaging and crystal contact problems 4.

Iditis: Protein Structure Database

ANNUAL REVIEW 1 October 2005–30 September

SD Gross BFI0403

The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe

Functional Effects Detailed Research Plan

Features China Introduction

Meeting Review: Bioinformatics and Medicine – from Molecules To

Cambridge University Reporter Special Number 3

The London Diplomatic List

Contents and Society

Crystal Structures of Native and Inhibitedforms of Human Cathepsin

Part I Officers in Institutions Placed Under the Supervision of the General Board

Iisc-Alumni Association- M.Vijayan Lecture. Cross Fertilisation: How