Research Papers the Cambridge Structural Database
Total Page:16
File Type:pdf, Size:1020Kb
research papers Acta Crystallographica Section B Structural The Cambridge Structural Database: a quarter of a Science million crystal structures and rising ISSN 0108-7681 Frank H. Allen The Cambridge Structural Database (CSD) now contains data Received 11 February 2002 for more than a quarter of a million small-molecule crystal Accepted 28 February 2002 structures. The information content of the CSD, together with Cambridge Crystallographic Data Centre (CCDC), 12 Union Road, Cambridge CB2 1EZ, methods for data acquisition, processing and validation, are England summarized, with particular emphasis on the chemical information added by CSD editors. Nearly 80% of new structural data arrives electronically, mostly in CIF format, Correspondence e-mail: [email protected] and the CCDC acts as the of®cial crystal structure data depository for 51 major journals. The CCDC now maintains both a CIF archive (more than 73000 CIFs dating from 1996), as well as the distributed binary CSD archive; the availability of data in both archives is discussed. A statistical survey of the CSD is also presented and projections concerning future accession rates indicate that the CSD will contain at least 500000 crystal structures by the year 2010. 1. Introduction In October 2001, the CCDC passed a major milestone by archiving the 250000th small-molecule crystal structure to the Cambridge Structural Database (CSD; Allen et al., 1979, 1991; Allen & Kennard, 1993). The ongoing creation and main- tenance of the CSD has been the core activity of the CCDC since its inception in 1965, and the CSD system ± the database and its associated access software (Bruno et al., 2002) ± is now used in 109 industrial companies and by 826 academic insti- tutions in 58 countries worldwide. The CCDC was established at the Department of Chem- istry, Cambridge University, to compile a database containing comprehensive information on small-molecule crystal struc- tures, i.e. organics and metallo-organic compounds containing up to 500 non-H atoms, the structures of which had been determined by X-ray or neutron diffraction. A speci®c aim was to store the key numerical results of each analysis, namely the cell parameters, space group and atomic coordinates, making the CSD one of the ®rst numerical scienti®c databases in the world, and the ®rst to store experimental three- dimensional chemical structure information. The CCDC was established by Dr Olga Kennard as part of the organic crys- tallography group, and with just two group members and some part-time scienti®c and clerical assistance assigned to the project, under a grant from the (then) UK Of®ce for Scienti®c and Technical Information. In the late 1960s just a few hundred structures were published each year, and about 2000 # 2002 International Union of Crystallography structures published before 1965 were gradually incorporated Printed in Great Britain ± all rights reserved into the developing database, often using printed compendia, 380 Frank H. Allen Cambridge Structural Database Acta Cryst. (2002). B58, 380±388 research papers Table 1 the 1980s. In 1989, the CCDC, then with about 20 staff, became Summary of information content of the Cambridge Structural Database. an independent self-®nancing non-pro®t institution and was Bibliographic and chemical text granted UK charitable status. Compound name(s), systematic and trivial The modern CCDC now has 45 full-time staff. In addition to Amino acid sequence for peptides the Executive Director, Dr David Hartley, and the Scienti®c, Chemical formula Development and Business Directors, a total of 15 editorial Authors' names Journal name and literature citation staff are responsible for the CSD itself, nine work on the Text indicating special experimental conditions or results (e.g. development of new software products, ®ve are responsible neutron study, powder study, non-ambient temperature or for the computing infrastructure, release preparation and pressure, absolute con®guration determined etc.) Chemical class (e.g. alkaloid, steroid etc.) software for database creation, four work on research projects, Text comment concerning disorder, errors located during four are responsible for customer support and marketing validation and special structural features operations, and there are four business, administrative and secretarial staff. The Executive and Scienti®c Directors are Chemical connection table (see text and Fig. 2) responsible to an International Board of Governors Formal two-dimensional chemical structure diagram in terms comprising seven distinguished scientists and a ®nancial of atom and bond properties expert. The CCDC retains close links with Cambridge Bond types used in CSD connection tables are: single, double, triple, quadruple (metal±metal), aromatic, delocalized University, and is recognized by the University as an institu- double and -bonds tion quali®ed to train postgraduate students. The Centre hosts visiting scientists and also collaborates widely with universities Crystal structure data and industrial organizations, both within the UK and inter- nationally. Cell dimensions and s.u.'s Space group and symmetry operators Structural crystallography has, of course, changed out of all Atomic coordinates and s.u.'s for the crystal chemical unit recognition since the mid-1960s. Improvements in data (see text) collection, structure solution and re®nement techniques have gone hand in hand with dramatic increases in computing Derived information power. As a result, more than 24500 structures were archived Bit-encoded screen records (see text) to the CSD in 2001, representing a near 40-fold increase in Matching of two-dimensional and three-dimensional connec- worldwide crystallographic productivity compared with 1965. tivity representations (see text) Reduced cell parameters This paper summarizes the current status of the CSD and uses Z0, the number of chemical entities per asymmetric unit statistics of database growth, together with an analysis of Calculated density current trends in the subject, to make some observations about future trends. Other papers in this special issue of Acta Crystallographica review the more recent scienti®c applica- tions of the CSD in organic chemistry and crystal chemistry (Allen & Motherwell, 2002), molecular inorganic chemistry such as the IUCr's Structure Reports (1939±1985) volumes, to (Orpen, 2002), and the life sciences (Taylor, 2002). locate original literature references. Early software development centred on systems for vali- dating and storing the accumulated information (see e.g. Allen 2. Information content of the CSD et al., 1979). However, systems for search, retrieval, analysis Each individual crystal structure determination forms an entry and visualization of CSD information began to be developed in the CSD, which is identi®ed by a reference code: six letters in the late 1970s, and were considerably enhanced during the identify the chemical compound and two supplementary digits 1980s (Allen et al., 1991) to include full two-dimensional and identify additional determinations of the same structure, e.g. three-dimensional substructure search capability and the an improved re®nement, studies by different scientists, studies ability to locate intermolecular nonbonded contacts (Allen & under different experimental conditions etc. The information Kennard, 1993). The CSD system continues to be enhanced; content of each entry is illustrated in Fig. 1 and is summarized the latest software developments are described by Bruno et al. in Table 1. The most important information item added by (2002). CCDC staff is the two-dimensional chemical structure repre- During this period also, the CSD began to be used exten- sentation (Fig. 2). The atom and bond properties are sively as a basis for fundamental research (BuÈrgi & Dunitz, converted into a compact connectivity table for CSD storage, 1983; Allen et al., 1983; BuÈrgi & Dunitz, 1994), variously and form the basis for substructure searching (Bruno et al., denoted as `structure correlation' or `knowledge acquisition', 2002) at the molecular and supramolecular levels. forerunners of the modern, and semantically questionable, Each connection table is analysed to assign cyclic/acyclic term `data mining'. The uptake of the CSD as a research tool ¯ags to chemical bonds and to generate a bitmap or `screen' in academia, and the advent of computational chemistry record. This contains codi®ed `yes/no' information concerning methods in many major pharmaceutical and agrochemicals the presence/absence of speci®c substructural features in each companies, led to a rapid increase in CSD subscriptions during chemical diagram, e.g. atoms with speci®c connectivity Acta Cryst. (2002). B58, 380±388 Frank H. Allen Cambridge Structural Database 381 research papers patterns, common functional groups, rings of speci®c sizes etc. 2002) can also display chemical bond types in their three- The screens are used as heuristics to speed substructure dimensional structure representations (Fig. 4), to aid the searching: screens generated from a query substructure must chemical interpretation of CSD structures. This is possible all be matched in a candidate CSD entry before that entry is because the two-dimensional and three-dimensional structure further analysed using CPU-intensive atom-by-atom, bond- representations are linked by a graph-theoretical atom-by- by-bond matching. Bit-screens are also employed to encode atom and bond-by-bond matching, which maps the chemical information about (a) elemental constitution,