Databases & Structural Classes of Proteins

Marjolein Thunnissen, Lund, September 2009

Classification of structures: Why needed?

Classification plays a central role in understanding the principles of protein structure, function and evolution.

Class assignment can suggest functional details that are otherwise unknown. This is of special importance in structural genomics projects.

Databases: PDB

The Protein Data Bank (PDB) was established at Brookhaven National Laboratory in 1971 as an archive for biological macromolecular crystal structures. In the beginning it held seven structures. Currently it holds 60,046 structures. (http://www.rcsb.org/)

PDB Content Growth

PDB Growth in New Folds: proportion of "new folds" (light blue) and "old folds" (orange) for a given year as a number of protein

PDB search results:

The PDBsum database (at www.ebi.ac.uk) contains summary information and derived data on entries in the Protein Data Bank.

PDBsum additional features: secondary structure, and analysis of ligand binding to proteins.

Databases: MMDB

Uses a file format different from that used by the PDB. Integrated into the Entrez system, which is very useful when you need to find a 3D structure of a homologous protein. Provides an alignment to related structures. Contains about 18,000 domains against which a structure search can be performed using the VAST algorithm.

MMDB:

A result of a search for similar structures

Databases: NDB

The Nucleic Acid Database Project (NDB) assembles and distributes structural information about nucleic acids. http://ndbserver.rutgers.edu/

The logic of classification:

Assignment of independent folding units (domains). Assignment of secondary structure. Assignment of structural class (all α, all β, α/β and α+β).
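The last step, going from per-residue secondary structure to a structural class, can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration and not how SCOP or CATH actually assign classes: the input is a DSSP-style string (`H` = helix, `E` = strand, anything else = coil), the 15% content threshold is arbitrary, and α/β is told apart from α+β only by counting alternating strand-helix-strand segments.

```python
from itertools import groupby


def segments(ss):
    """Collapse a per-residue string such as 'HHHHCCEEEE' into ['H', 'E']:
    the ordered list of helix (H) and strand (E) segments, ignoring coil."""
    return [kind for kind, _ in groupby(ss) if kind in ("H", "E")]


def structural_class(ss, min_frac=0.15):
    """Very crude class assignment for ONE domain.

    min_frac is an arbitrary illustrative threshold (assumption):
    a secondary structure type counts as present only if it covers
    at least this fraction of the residues.
    """
    if not ss:
        raise ValueError("empty secondary structure string")
    has_helix = ss.count("H") / len(ss) >= min_frac
    has_strand = ss.count("E") / len(ss) >= min_frac

    if has_helix and not has_strand:
        return "all alpha"
    if has_strand and not has_helix:
        return "all beta"
    if not has_helix and not has_strand:
        return "unassigned"

    # Both types present: look at the organisation along the chain.
    # Repeated strand-helix-strand units are taken as a hint of alpha/beta;
    # segregated helices and strands are called alpha+beta (heuristic).
    segs = segments(ss)
    bab_units = sum(
        1 for i in range(len(segs) - 2) if segs[i:i + 3] == ["E", "H", "E"]
    )
    return "alpha/beta" if bab_units >= 2 else "alpha+beta"


print(structural_class("HHHHHHHHCCHHHHHHHH"))
print(structural_class("EEEECCHHHHHCCEEEECCHHHHCCEEEE"))
print(structural_class("HHHHHHCCHHHHHHCCEEEECCEEEECCEEEE"))
```

The function works on a single domain, which is why domain assignment comes first in the list above: a multi-domain chain would otherwise mix the content of unrelated folding units.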

Assignment of domains

Spatially separated units of protein structure. May have sequence and/or structural resemblance to another protein structure or domain.

Example:

The globin domain. One protein - one domain.

Fold: Globin-like, Superfamily: globin-like

Example:

The structure of one domain of a bacterial muramidase. 450 amino acid residues build up 27 helices arranged in a two-layered ring. The ring has a large central hole with a diameter of about 30 Å.

SCOP fold: alpha-alpha superhelix, superfamily: bacterial muramidases

Example:

Pyruvate kinase folds into several domains, one of which is an α/β barrel (red). One of the loop regions in this barrel domain is extended and comprises about 100 amino acid residues that fold into a separate domain (blue) built up from antiparallel β strands. The C-terminal region of about 140 residues forms a third domain (green), which is an open twisted α/β structure.

Assignment of fold

The number, type, connectivity and arrangement of secondary structure elements define the fold of the protein. Databases may disagree on the definition of a certain fold, so different databases should be checked. A fold is often named after the first protein, or protein function, in which it was observed.

Oligonucleotide/oligosaccharide binding folds

The coenzyme-binding domain of some dehydrogenases: the Rossmann fold

Examples of folds:

Class: alpha & beta, fold: TIM barrel

Subform of Rossmann fold: flavodoxin fold

Superfolds:

Folds found within many different superfamilies: Rossmann fold, TIM barrels, ferredoxin-like, four-helical bundles, Ig-like β-sandwich, OB fold, etc. These folds are adopted by groups of seemingly non-homologous proteins with different functions. They tend to bind ligands at similar structural sites (supersites).

Assignment of superfamily

For many proteins with similar folds, sequence, structure or function suggests divergence from a common ancestor. A superfamily is a group of proteins that appear to be homologous, even in the absence of significant sequence similarity. Unrelated proteins with the same fold are called analogues.

Some members of the IGSF (immunoglobulin superfamily)

How can we distinguish analogues from homologues?

A high level of structural similarity most probably means homology. Conservation of unusual structural features outside the core structure (turn conformations, left-handed β-α-β units, π-helices, etc.). Low but convincing sequence identity (>12%) calculated after structural superposition. The presence of key active site residues even in the absence of global sequence similarity. Sequence similarity bridges: if A is similar to B and B is similar to C, then A is similar to C.

Databases: CATH

Clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). http://cathwww.biochem.ucl.ac.uk/latest/index.html
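Returning to the criteria for telling homologues from analogues: the "sequence similarity bridges" argument amounts to grouping proteins by chains of pairwise similarity. The Python sketch below shows one way to do this with a union-find structure. It is only an illustration: the protein names are placeholders, the list of similar pairs is assumed to come from some earlier comparison step, and in practice a bridge is evidence for homology and not proof, since similarity is not strictly transitive.

```python
def bridge_groups(similar_pairs):
    """Group items linked directly or indirectly by pairwise similarity.

    similar_pairs: iterable of (a, b) tuples judged 'similar' by some
    upstream method (assumption: the caller has already applied a
    significance threshold).
    Returns a sorted list of sorted groups.
    """
    parent = {}

    def find(x):
        # Find the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# A-B and B-C are similar, so A and C end up in the same group via B.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(bridge_groups(pairs))
```

Because any single spurious pair merges two groups, a bridge found this way is best checked against the other criteria in the list (conserved unusual features, key active site residues).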

Assignment of structural class

After assignment of secondary structure and domains, structural class can be assigned to each domain. Structural classes divide proteins according to secondary structure content and organisation.

Databases: SCOP

Structural Classification of Proteins. Created by manual inspection and automated methods, it aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. http://scop.mrc-lmb.cam.ac.

Structure classification at SCOP:

Proteins are divided into functional domains, which are grouped into a hierarchy: Class - Fold - Superfamily - Family - Protein. Proteins in the same superfamily have a probable common evolutionary origin. Proteins in the same family have a clear evolutionary relationship. Proteins in the same fold but different superfamilies do not have a clear common ancestor.

SCOP classification statistics

Class                             Number of folds   Number of superfamilies   Number of families
All alpha proteins                            284                       507                  871
All beta proteins                             174                       354                  742
Alpha and beta proteins (a/b)                 147                       244                  803
Alpha and beta proteins (a+b)                 376                       552                 1055
Multi-domain proteins                          66                        66                   89
Membrane/cell surface proteins                 58                       110                  123
Small proteins                                 90                       129                  219
Total                                        1195                      1962                 3902
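The way the SCOP levels translate into statements about evolutionary relationship can be written down as a short Python sketch. The `ScopEntry` record and the `relationship` function are hypothetical names introduced here for illustration, not part of any SCOP software. The fold and superfamily labels follow the globin example earlier in these notes; the family and protein labels, and the second superfamily, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopEntry:
    """One domain placed in the Class - Fold - Superfamily - Family - Protein hierarchy."""
    cls: str
    fold: str
    superfamily: str
    family: str
    protein: str


def relationship(a, b):
    """Compare two domains level by level, from the top of the hierarchy down."""
    if (a.cls, a.fold) != (b.cls, b.fold):
        return "different fold"
    if a.superfamily != b.superfamily:
        return "same fold, no clear common ancestor"
    if a.family != b.family:
        return "same superfamily, probable common evolutionary origin"
    return "same family, clear evolutionary relationship"


# Illustrative entries; family/protein labels are assumptions for this sketch.
myoglobin = ScopEntry("All alpha", "Globin-like", "Globin-like", "Globins", "Myoglobin")
hemoglobin = ScopEntry("All alpha", "Globin-like", "Globin-like", "Globins", "Hemoglobin")
phycocyanin = ScopEntry("All alpha", "Globin-like", "Globin-like",
                        "Phycocyanin-like", "Phycocyanin")
other = ScopEntry("All alpha", "Globin-like", "Hypothetical superfamily X",
                  "Hypothetical family", "Hypothetical protein")

print(relationship(myoglobin, hemoglobin))
print(relationship(myoglobin, phycocyanin))
print(relationship(myoglobin, other))
```

The comparison runs top-down because a lower level is only meaningful inside the same higher level: two families with the same name under different folds would not be related.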

Classification at CATH:

Class - Architecture - Topology - Homology (SCOP: Class - Fold - Superfamily - Family - Protein)

[Figure: example of the CATH hierarchy. Classes: α, β, αβ. Architectures within the β class: barrel, ribbon, sandwich, etc. Other labels in the figure: Tiamin-like, T cell antigen, endotoxin, Ig-like, transthyretin, plastocyanin, etc.]

Class assignment may be funny!

Which class does this protein belong to? Staphylococcal nuclease

Search results from CATH

Search results from SCOP

Class assignment in this case follows from the fact that the core of the protein comprises an oligonucleotide/oligosaccharide-binding (OB) fold β barrel. Thus, structural similarity or function may affect fold assignment.

Databases: FSSP

Based on exhaustive all-against-all comparison of protein 3D structures currently in the PDB. Multiple alignment views: structure neighbours, sequence neighbours, structures superimposed in 3D. http://ekhidna.biocenter.helsinki.fi/dali/start

FSSP search result: