Progress in Biophysics & Molecular Biology 73 (2000) 321–337

Review

Towards a covering set of family profiles

Andreas Heger, Liisa Holm*

Structural Group, EMBL-EBI, Cambridge CB10 1SD, UK

Abstract

Evolutionary classification leads to an economical description of the protein sequence universe because attributes of function and structure are inherited in protein families. Efficient strategies of functional and structural genomics therefore target one representative from each family. Enumerating all families and establishing family membership consistently based on sequence similarities are nontrivial computational problems. Emerging concepts and caveats of global sequence clustering are reviewed. Explicit multiple alignments coupled with neighbourhood analysis lead to domain segmentation, and hierarchical unification helps to resolve conflicts and validate clusters. Eventually, every part of every sequence will be assigned to a domain family which is uniquely associated with a fold and a molecular function. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Domains; Homology; Structural genomics

Contents

1. Evolutionary classification

2. Discovering families

3. Topographical maps of sequence space

4. Domain decomposition

5. Quality control

6. Coverage in structural genomics

7. Conclusions

References

*Corresponding author. Fax: +44-1223-494470.

0079-6107/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0079-6107(00)00013-4

1. Evolutionary classification

The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution to the computational assignment of protein structure and function to uncharacterized sequences: functional and structural information can be transferred between homologous proteins. Homologues carry the memory of common ancestry in their amino acid sequences as a result of functional constraints that have persisted through successive generations. Sequence similarity searching is today the most powerful tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects. Annotation by similarity is typically based on a nearest neighbour approach. Neighbours of a query sequence are detected using fast sequence database search programs such as Blast (Altschul et al., 1997) or Fasta (Pearson, 2000). Many implementations worldwide follow the model of GeneQuiz (Scharf et al., 1994) and pick the highest-scoring hit with informative annotation attached to it to generate a plausible function for the query protein.

There are numerous sources of error in this approach (Andrade et al., 1999; Kyrpides and Ouzounis, 1998; Bork and Koonin, 1998). For example, sequence databases generally contain poor positional pointers of functional domains, and an erroneous inference will result if the annotation of the database hit refers to the presence of a particular functional domain but the query protein matches in a different region. More difficult to check and correct are the quality and integrity of second-hand, similarity-derived functional annotation in the search databases. In particular, it is not possible to trace back a chain of annotations by similarity, nor to propagate changes in annotation to dependent sequence entries in current sequence databases. Evolutionary classification opens a way out of this muddle.
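The positional pitfall described above can be guarded against with a simple overlap check. The following is a minimal sketch under stated assumptions, not the procedure of GeneQuiz or of any other annotation pipeline: the function name, the 1-based inclusive coordinates, the 80% coverage cutoff and the example features are all hypothetical.

```python
def transferable_features(aligned_start, aligned_end, features, min_covered=0.8):
    """Return the names of annotated features of a database hit that the
    query alignment actually covers.

    aligned_start, aligned_end: region of the hit matched by the query
        (1-based, inclusive; an assumed convention).
    features: list of (name, start, end) annotations on the hit.
    min_covered: fraction of a feature that must lie inside the aligned
        region before its annotation is transferred (assumed cutoff).
    """
    kept = []
    for name, start, end in features:
        overlap = min(aligned_end, end) - max(aligned_start, start) + 1
        length = end - start + 1
        if overlap > 0 and overlap / length >= min_covered:
            kept.append(name)
    return kept


# Hypothetical hit: the query aligns to residues 10-120 of the hit only.
hit_features = [
    ("kinase domain", 20, 100),
    ("linker", 110, 130),
    ("SH2 domain", 200, 300),
]
print(transferable_features(10, 120, hit_features))
```

Only features that fall inside the aligned region are returned, so an annotation referring to a domain elsewhere in the hit is not propagated to the query.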
Grouping proteins into families is useful in two ways. First, it leads to more sensitive detection of new members and improved discrimination against spurious hits based on the essential conserved features in a family as expressed by profiles (position-specific scoring matrices or hidden Markov models) or patterns (regular expressions). Second, having established family membership, the query sequence can be placed in the context of the evolutionary tree of the family for accurate functional inference. It is also easier to spot inconsistent second-hand annotations in the tree context. The physiological role, cellular function and substrate specificity of homologous proteins can diverge remarkably during evolution (Holm and Sander, 1997). A query sequence rooted inside a homogeneous branch (in terms of direct experimental annotation) is likely to have the same function, while one between different functional subclasses may represent a novel subfamily. It may be convenient to collapse the tree to hierarchic discrete subclasses that reflect major evolutionary changes in the functional specialization of subfamilies.

Many large families have dedicated databases, where new sequences are inserted into existing or newly created subfamilies manually (e.g. AAA family, http://yeamob.pci.chemie.uni-tuebingen.de/AAA; Patel and Latterich, 1998) or automatically (GPCRDBsup, http://jura.ebi.ac.uk:8901/html/; Heger, unpublished). Libraries of profile models have been generated around sets of particular interest, such as all known structures (Dodge et al., 1998; Teichmann et al., 2000; Schaeffer et al., 1999) and large families. For example, Pfam 5.2 (Bateman et al., 2000) contains 2128 HMMs which cover about 64% of all sequences in SwissProt release 38. The coverage of complete genomes is lower. A complete covering (a profile library representing every family) would not only reveal the evolutionary origins of genomes but also reduce the risk of wrongly assigning marginal hits, because the full tree context is available.

In recent years, interest in protein sequence classification has been fuelled by the challenge of interpreting genome sequences, and an increasing number of groups are active in the field (Table 1). Here, we review emerging concepts (Krause and Vingron, 1998; Yona et al., 1998; Holm, 2000) and caveats of computational approaches to globally organize protein sequence databases into protein families.

2. Discovering families

Historically, families have been identified one by one, based on similarities to individual proteins under study by individual scientists. The process starts from the compilation of a multiple alignment of similar sequences. Methods for finding similar sequences, and the thresholds deemed safe to infer homology from similarity, differ between sequence classifications. The usage of the terms family and superfamily (unified family) is also not uniform, and the terms can represent different levels of the functional hierarchy. The PIR definition of superfamilies (Dayhoff et al., 1983) is conservative in terms of sequence identity, while structure-based classifications unify remote homologues whose structural and functional features suggest a common evolutionary origin despite very low sequence identities (Holm and Sander, 1999; Orengo et al., 1999; Lo Conte et al., 2000).

From a multiple alignment, one can identify conserved patterns which are diagnostic of family membership or a functional site (Attwood et al., 2000; Hofmann et al., 1999). For example, the histidine triad family was named after a highly conserved H.H.H motif (Seraphin, 1992). Later, this family was unified with UDP-glucosyl transferases and Ap4A-phosphorylases, was shown to be a hydrolase, and the invariant part of the pattern was reduced to a single histidine required for catalysis (Lima et al., 1997). In current sequence databases, the extended family can be detected by iterative profile searching (Altschul et al., 1997).

Profiles (Eddy, 1998) and patterns have different advantages and limitations. In principle, a profile yields a probability of family membership while a pattern is either present or absent. Profiles are coupled to explicit multiple alignments, and there are unsolved problems related to the statistics of gapped alignment scores and algorithmic convergence of the search for an optimal profile. For example, thresholding the neighbourhood radius in terms of e-values in PSI-Blast leads to nonidentical profiles at the end of iteration depending on the seed sequence and the distribution of neighbours in sequence space (Fig. 1). Furthermore, many biologically genuine homologues get statistically insignificant scores, as has been shown in benchmarks against structurally defined families (Brenner et al., 1998; Park et al., 1998; Karplus et al., 1998).

Dynamic thresholding is used in a special definition of families, which addresses the question of orthologues versus paralogues (Tatusov et al., 2000). This analysis involves only complete genomes to construct the directed graph of nearest-neighbour relationships. Cliques of at least three bidirectional nearest-neighbour sequences form a COG (cluster of orthologous groups). About

Table 1
Comparison of several sequence clusterings and family classifications

Database, version | Reference | Basis set | Method/description | Number of clusters | Global coverage | Explicit alignments | Explicit domains | Result validation
ADOPS | Hanke et al. (1999) | Pattern | Associative memory | Not applicable | Yes | Yes | No | Visual
Blocks+, 23 Jan 2000 | Henikoff et al. (2000) | Profile | HMMs based on Prosite, Prints, Pfam, ProDom, Domo | 2334 | No | Yes | No |
COGs | Tatusov et al. (2000) | Pair | Cliques of nearest neighbours | 2111 | Complete genomes | ClustalW | Manual | Manual
Domo | Gracy and Argos (1998) | Pair | Empirical rules | 8877 | Yes | ClustalW | Yes |
HSSP, PDB-ISL, IMPALA | Dodge et al. (1998); Teichmann et al. (2000); Schaeffer et al. (1999) | Profile | Iterative profile alignment | 10979 (HSSP); 1187 structures + 105 domains (IMPALA) | 3D structures | Yes | No | Thresholds calibrated against structural benchmark
Interpro | http://www.ebi.ac.uk/interpro/ | Manual | Consensus clusters compiled from Prosite, Prints, Pfam, Blocks?? | 2990 | No | No | Manual | Manual
Pfam-A 5.2 | Bateman et al. (2000) | Profile | HMMs from manually curated seed alignments | 2128 | No | Yes | Yes | Manual, exclusion of domain assignment overlaps
Picasso | Holm (2000) | Pair/profile | Profile–profile alignment | 10k + 20k singletons | Yes | Yes | Yes | Neighbourhood analysis; profile–profile score to merge clusters
Prints-S 4.0 | Attwood et al. (2000) | Pattern | Multiple fingerprints | 1310 | No | No | Manual |
ProClass (PIR) | Wu et al. (1999) | | | 2000 superfamilies, 10,000 families, 300 homology domains | No | Yes | Yes | Manual
ProDom 2000.1 (Pfam-B) | Corpet et al. (2000) | Profile | Iterative profile alignment and greedy elimination | 51,303 + 12,3649 singletons | Yes | Yes | Yes | Radius of gyration and diameter reported for each cluster
Prosite 16, 8 Apr 2000 | Hofmann et al. (2000) | Pattern, profile | | 1040 | No | No | No | Manual
Protomap 2.0 | Yona et al. (2000) | Pair | Hierarchical clustering of strongly connected components | | Yes | No | No | Prosite pattern match frequencies per cluster
Teiresias | Rigoutsos et al. (1999) | Pattern | Exhaustive enumeration of recurrent patterns | | Yes | No | No |
SMART | Schultz et al. (2000) | Profile | HMMs for signalling, extracellular, and chromatin-associated domains | over 400 | No | No | Yes | Manual
Systers 2 | Krause et al. (2000) | Pair | Single linkage clustering | 9734 perfect, 1783 nested, 1142 overlapping, 43,582 singletons | Yes | ClustalW | No | Clusters classified as perfect, nested, or overlapping

Fig. 1. To test the stability of a popular iterative profile search strategy (PSI-Blast), neighbour list overlaps are plotted as a function of sequence identity between two queries. 100% overlap means that the iteration started from different seeds converges on an identical set of sequences and produces identical profiles. Usually, a core set of common members is reliably identified when sequence identity is above 30%, but there may be some differences on the fringes. During iteration, decisions on the inclusion or exclusion of sequences used to update the profile (position-specific scoring matrix) are based on fixed numerical thresholds (e.g., statistical significance of alignment score). About half of the remote homologue pairs ("20%" bin) have zero overlap, i.e., PSI-Blast failed to find a connection between two homologous families. Intermediate values of overlap usually correspond to huge families and program limitations (maximally 2000 neighbours kept). The test set was constructed from 35 known structures, which represent different FSSP fold types and which have several homologues in the range from 100 to 25% identity and remote homologues with less than 25% sequence identity (Holm and Sander, 1999). The test set yielded 145 pairwise comparisons between a seed sequence and a remote homologue, with at most one homologue per 10% wide bin in sequence identity (horizontal axis). PSI-Blast was run with default parameters for 20 iteration cycles. Neighbour lists were truncated at an e-value of 1.0. Changing cutoff values did not change the qualitative appearance of the graph (data not shown).

70% of proteins from completely sequenced genomes map to COGs. This coverage obviously depends on how diverse a set of genomes is used to define a COG. Pattern discovery works even in unaligned datasets (Brazma et al., 1998). The task can be formulated as that of finding the most concise pattern which is present in all or most sequences of an input data set. For example, given a large set of unaligned zinc finger proteins, Pratt automatically extracted the C2H2 consensus (Jonassen et al., 1995). Patterns of limited complexity can be enumerated in a large database. Teiresias (Rigoutsos et al., 1999) systematically enumerates all maximal patterns in an unclassified sequence database that have at least a minimum number of occurrences. However, the connection of pattern instances to family membership still needs to be established.
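The idea of enumerating patterns with a minimum number of occurrences can be illustrated with a toy example. The sketch below is not the Teiresias or Pratt algorithm: it only enumerates fixed-width patterns with single-residue wildcards (written "." as in the H.H.H motif), and the function name, the pattern width, the minimum number of literal positions and the example sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def enumerate_patterns(sequences, width=5, min_literals=3, min_support=2):
    """Toy enumeration of fixed-width patterns with '.' wildcards.

    The first and last positions of a pattern are always literal, at least
    `min_literals` positions are literal overall, and a pattern is reported
    only if it occurs in at least `min_support` different sequences.
    Returns {pattern: set of indices of the sequences containing it}.
    """
    support = defaultdict(set)
    inner = range(1, width - 1)
    fewest_inner = max(min_literals - 2, 0)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            for n_inner in range(fewest_inner, width - 1):
                for chosen in combinations(inner, n_inner):
                    literal = set(chosen) | {0, width - 1}
                    pattern = "".join(
                        res if pos in literal else "."
                        for pos, res in enumerate(window)
                    )
                    support[pattern].add(idx)
    return {p: s for p, s in support.items() if len(s) >= min_support}


# Hypothetical unaligned sequences; the first two carry a histidine triad.
toy_sequences = ["AAHVHLHVV", "GGHIHMHCC", "KKKKKKK"]
found = enumerate_patterns(toy_sequences)
print(found)
```

As noted in the text, finding a recurrent pattern is only the first step: whether the sequences that share it belong to one family still has to be established separately.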

3. Topographical maps of sequence space

All-against-all comparison of protein sequences, using a traditional database search tool such as Blastp or Fasta, yields a view of the geometry of protein space. Neighbour lists of each sequence induce a representation of protein space as a (weighted directed) graph whose vertices (nodes) are the sequences. The weight of an edge connecting two sequences represents their degree of similarity (the weights are the expectation values of the similarities between the sequences). Clusters of related proteins correspond to strongly connected components in this digraph (Grundy, 1998; Yona et al., 1998). Due to spurious similarities and domain chaining, the majority of sequences belong to one huge connected component at biologically interesting levels of similarity.

Fig. 2. A graph representation of neighbour relationships reveals connected components, also known as single linkage clusters. Neighbours or links are determined by pairwise comparison and a suitable threshold of similarity. Sequences which have nothing in common can end up in the same cluster if there are multidomain proteins in the set. Here, the top cluster contains three types of sequences (square, blob, circle). The blob sequences are actually composed of two domains (square and circle). Explicit domain cutting of the sequences (middle) results in two perfect, i.e., fully connected, clusters at the bottom.

The problem of domain chaining can be largely avoided by using a very high threshold as in Systers (Krause and Vingron, 1998). Systers finds connected components by single linkage clustering. The following types of clusters (sets of neighbours of a seed sequence) can be obtained. In a perfect cluster, every member is a neighbour of every other member (the cluster is fully connected). A nested cluster is a proper subset of another set. Maximal clusters are not contained in any other set. A pair of overlapping maximal clusters have common members and unique members. In Systers, casualties of domain chaining are found in overlapping clusters.

Protomap (Yona et al., 1998) performs a hierarchical clustering, varying the threshold of statistical significance stepwise from very high (1e-100) to quite permissive (1). At each step, the algorithm is applied to the classes of the previous classification to obtain the next classification at the more permissive threshold. Connections between clusters which are not strongly connected are rejected, while clusters that are strongly connected get merged. The criteria for merging were optimized empirically. Rejected connections may reflect genuine though distant homologies.

ADOPS is based on an associative memory network (Hanke et al., 1999). Families are memorized as prototype vectors (17mer profiles) of conserved regions. Similarity and kinship, as well as degree of distance between the conserved protein segments, are visualized as neighbourhood relationships on a two-dimensional topographical map.

In each of these clusterings, the connectivity graph can be browsed visually (Fig. 2). Judgment is required to assess when a neighbourhood changes from one biological class to another as a result of spurious sequence similarities or domain chaining. In the next section, we develop a connection between the analysis of neighbour sets or graphs and explicit domain decomposition as a result of taking into account explicit multiple alignments.
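Single linkage clustering, and the domain chaining it suffers from, can be sketched with a union-find structure over pairwise hits. This is an illustrative sketch only, not the Systers or Protomap implementation: links are treated as undirected, and the function name, the e-value cutoffs and the example hits are assumptions.

```python
def single_linkage_clusters(n_sequences, hits, max_evalue=1e-30):
    """Connected components of the graph whose edges are the pairwise hits
    with an e-value at or below `max_evalue` (links treated as undirected).

    hits: iterable of (i, j, evalue) with sequence indices i and j.
    Returns a list of sets, largest cluster first.
    """
    parent = list(range(n_sequences))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, evalue in hits:
        if evalue <= max_evalue:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(n_sequences):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=lambda c: (-len(c), min(c)))


# Hypothetical data: 0 and 1 contain domain A only, 3 and 4 contain domain B
# only, and 2 is a two-domain protein AB that chains the two groups together.
hits = [(0, 1, 1e-50), (0, 2, 1e-40), (2, 3, 1e-45), (3, 4, 1e-60), (4, 5, 1e-3)]
clusters = single_linkage_clusters(6, hits, max_evalue=1e-30)
print(clusters)
```

In this toy data set the two-domain protein pulls the A-only and B-only sequences into one component, while a stricter cutoff (for example `max_evalue=1e-48`) splits them again at the price of leaving sequence 2 as a singleton.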

4. Domain decomposition

The neighbourhood graphs discussed above correspond to a large set of pairwise alignments. These pairwise alignments can be combined (resolving inconsistencies on the way) into a much smaller number of multiple alignments with many members. Each multiple alignment can be transformed into a profile model of the family, which recalls the original members (and possibly more). The most economical description of the protein universe employs a minimal set of profiles that identify all sequences in the database; these notions are related to minimal encoding costs in information theory. Fig. 3 illustrates three ways of generating covering sets of multiple alignments: (1) reducing redundancy using only complete sequences, (2) using a greedy fragment elimination algorithm, and (3) using an algorithm that produces 'bifurcating' multiple alignments.

The first method is a simple heuristic (Hobohm et al., 1992). A representative set is a subset of the sequence database that contains no neighbours (properly defined), and every sequence in the complement of this subset (redundant sequences) is a neighbour of at least one representative. For example, the threshold for neighbours can be expressed in terms of sequence identity, and a series of (nested) representative sequence subsets can be derived at 90, 80, ..., 40% identity thresholds (Holm and Sander, 1998; Park et al., 1999). The generating process is as follows. The database is sorted by length, and each query sequence from the database is compared to all previous representatives. If a similarity exceeding the threshold is found, then the query is redundant with respect to the representative. If the query is unique, it is added to the representative set. Due to length sorting, query sequences are only compared to longer representatives. Consequently, each redundant sequence is fully contained (aligned over the whole length) in the representative.
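The generating process of the first method can be sketched as a greedy loop over length-sorted sequences. The sketch below is a simplified illustration, not the published Hobohm implementation: the ungapped sliding identity is a toy stand-in for a real alignment method, and the function names, the 80% threshold and the example sequences are assumptions.

```python
def ungapped_identity(query, rep):
    """Best fraction of identical residues when the shorter `query` slides,
    without gaps, along the longer `rep` (toy stand-in for an alignment)."""
    if not query or len(query) > len(rep):
        return 0.0
    best = 0.0
    for offset in range(len(rep) - len(query) + 1):
        matches = sum(q == r for q, r in zip(query, rep[offset:]))
        best = max(best, matches / len(query))
    return best


def select_representatives(sequences, threshold=0.8, identity=ungapped_identity):
    """Greedy selection of a representative set.

    Sequences are processed longest first, so each query is only compared
    to representatives that are at least as long. Returns the list of
    representatives and a mapping from each redundant sequence to the
    representative that absorbed it.
    """
    representatives = []
    redundant_to = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in representatives:
            if identity(seq, rep) > threshold:
                redundant_to[seq] = rep
                break
        else:
            # No representative is similar enough: the query is unique.
            representatives.append(seq)
    return representatives, redundant_to


# Hypothetical sequences: a point variant and a fragment of the first entry,
# plus one unrelated sequence.
toy_db = ["MKTAYIAKQR", "KTAYIAKQ", "MKTAYIAKQW", "GGGGSGGGGS"]
reps, redundant_to = select_representatives(toy_db, threshold=0.8)
print(reps)
print(redundant_to)
```

Because the loop stops at the first representative above the threshold, a redundant sequence is assigned to one representative only, even if it is similar to several.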
Many representatives may share regions of high similarity, but embedded in different sequences so that overall sequence identity drops below the threshold. This problem of domains is accentuated at larger sequence distances. For an economical yet complete description of the protein universe, it is necessary to break up the sequences into domains.

The second method, ProDom (Corpet et al., 1999), defines domain borders based on the local alignments obtained by recursive PSI-Blast homology searches. First, profiles are constructed for domains with known boundaries and the matches are pulled out from the sequence database by profile searching. If the extracted domain is not terminal in a protein sequence, the remaining sequence is cut into two parts that become two independent entries in the search database. Other families are built iteratively, also based on PSI-Blast. Each time a domain family is found, the corresponding fragments are extracted, as in the first step, from the search database, the size of which thus decreases. The process stops when PSI-Blast does not find any similarity between the remaining sequences. The greedy elimination strategy means that the order of processing the database determines which cluster a sequence (segment) ends up in. At larger evolutionary distances, local sequence alignment methods may only detect a shrinking region of similarity, so that alignment borders no longer correspond to structural domains but represent conserved motifs, e.g., an active site, thereby inducing a problem of excessive fragmentation.
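The bookkeeping step of the greedy elimination strategy, in which an extracted domain leaves behind zero, one or two independent fragments, can be sketched as follows. This is an illustrative sketch and not the ProDom code: the entry representation as (identifier, begin, end) with half-open coordinates, the function name and the `min_length` parameter are assumptions.

```python
def excise_domain(entry, dom_start, dom_end, min_length=1):
    """Remove the matched domain [dom_start, dom_end) from a database entry
    and return the leftover fragments as new, independent entries.

    entry: (seq_id, begin, end), a half-open interval on the parent sequence.
    min_length: leftovers shorter than this are discarded (assumed option).
    """
    seq_id, begin, end = entry
    if not (begin <= dom_start < dom_end <= end):
        raise ValueError("domain must lie inside the entry")
    leftovers = []
    if dom_start - begin >= min_length:
        leftovers.append((seq_id, begin, dom_start))  # N-terminal remainder
    if end - dom_end >= min_length:
        leftovers.append((seq_id, dom_end, end))      # C-terminal remainder
    return leftovers


# Hypothetical 300-residue protein "P1".
print(excise_domain(("P1", 0, 300), 100, 200))  # internal domain: two leftovers
print(excise_domain(("P1", 0, 300), 0, 120))    # terminal domain: one leftover
```

Each leftover would then be searched again as an entry in its own right, which is why the order of processing determines the final fragmentation.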

Fig. 3. Schematic stacked sequences (multiple alignments) illustrating sequence coverings defined in terms of (A) mapping of complete sequences, (B) mapping of sequence fragments, and (C) hierarchical mapping. The dark patterns are aligned to the cluster seed sequence (representative) at the top. The greedy elimination strategy (B) builds on two assumptions, namely, that an iterative profile search finds the complete homologous family and that local sequence alignment stops at physical domain borders. Strategy (C) yields a set of representative sequences which fully cover the entire sequence database. One sequence can be part of more than one cluster (multiple alignment). Family unification is based on motifs (which can represent a partial alignment to a structural domain) while full alignment coverage is provided by the bifurcating alignments.

The third method, Picasso (Holm, 2000), starts from highly overlapping sequence neighbourhoods revealed by all-against-all pairwise Blast alignment. Overlaps are reduced by merging sequences or parts of sequences (domains, motifs) into multiple alignments. At the end of the process, each part of a sequence is covered by at least one multiple alignment. Merging proceeds hierarchically, starting from many small clusters (multiple alignments) defined at high confidence. The decision of merging is based on the score of profile–profile comparison, which is sensitive and selective enough to reach down to about 15% pairwise sequence identity (Fig. 4). Families unified through a short conserved sequence motif are associated with multiple full-length alignments describing different subfamilies. Domains that are mobile modules are identified based on their association with different sets of neighbours.

Fig. 4. Progress of hierarchic clustering by Picasso with explicit multiple alignments and profile–profile comparison. Multiple alignments are built by gathering more and more distant relatives around a seed sequence. Random pairs would peak around 5% identity. In contrast, hierarchic unification yields a radial distribution with a steep edge around 10% identity and maxima at 15–35% identity. The range below 25–30% identity is traditionally called the twilight zone, where sequence identity alone is insufficient to discriminate between biologically related and unrelated sequences. To overcome this, Picasso uses more sensitive profile–profile comparison. Visual inspection confirms that most pairs in the twilight zone are biologically reasonable as they share sequence patterns that are highly conserved throughout the entire cluster. Fragments that are more than 90% identical to a representative sequence were removed at the outset, leading to the trough at 90–100%. The first unification step used a cutoff of 40% identity over the full sequence length. Subsequent unification steps used Blast e-value cutoffs accepting partial length matches. Sequence identity in the histograms is computed as the number of identical amino acids divided by the number of amino acids aligned to the master sequence. Comparisons of the seed sequence to itself are excluded, so the total area under the curves increases as clusters are merged.

Many common methods of domain cutting look for sequences in a multiple alignment that are second but not first neighbours of each other (such as sequences A and B compared to a sequence AB). However, nested domain composition similarities (such as A contained in AB contained in ABC), spurious alignments and spurious sequences (fragments) have made it difficult to formulate clear and universal domain cutting criteria (Guan and Du, 1998; Park and Teichmann, 1998; Gracy and Argos, 1998; Sonnhammer and Kahn, 1994). For example, Domo (Gracy and Argos, 1998) estimates the locations of domain borders in long sequences by transitively mapping the positions of known N- or C-termini in aligned sequences. This means that the dataset has to be cleaned of fragments, which can be a difficult task given the uncertainties of exon prediction.

These problems are alleviated by the analysis of neighbour graphs, extended to second and third sequence neighbours for the identification of evolutionarily mobile modules. Two proteins share an evolutionarily mobile module if, in addition to their common neighbours, they have positionally disjoint sets of unique neighbours. This definition of a mobile module coincides with the definition of overlapping maximal clusters which have common and unique neighbours (Fig. 5). In Picasso v.0, we tested each pair (A, B) of multiple alignments (PSI-Blast generated clusters) which shared members, and noted the borders between the common and unique blocks. Because a common sequence may align over the whole length to the seed sequence, the common blocks were mapped transitively from B to A and from A to B. A consensus cutting for each seed sequence was then generated by sorting suggested domains by length and only accepting a new domain cut if it was nested within a previous domain cut.
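The set-level part of this definition can be sketched directly from neighbour lists. The sketch below is not the Picasso implementation: it only separates nested from maximal clusters and reports pairs of overlapping maximal clusters, and it omits the positional test on alignment blocks. The function name and the example proteins are assumptions.

```python
from itertools import combinations


def classify_clusters(neighbours):
    """Classify clusters, i.e. neighbour sets of seed sequences.

    neighbours: dict mapping each seed to the set of its neighbours,
        including the seed itself.
    Returns (nested, maximal, overlapping):
      nested      - seeds whose cluster is a proper subset of another cluster
      maximal     - seeds whose cluster is not contained in any other cluster
      overlapping - pairs of maximal clusters that have common members and
                    unique members on both sides
    """
    sets = {seed: frozenset(members) for seed, members in neighbours.items()}
    nested = {
        seed for seed, members in sets.items()
        if any(members < other for other in sets.values())
    }
    maximal = [seed for seed in sets if seed not in nested]
    overlapping = [
        (a, b) for a, b in combinations(maximal, 2)
        if sets[a] & sets[b] and sets[a] - sets[b] and sets[b] - sets[a]
    ]
    return nested, maximal, overlapping


# Hypothetical proteins: p = domains X+M, q = domains M+Y,
# x1 = domain X only, y1 = domain Y only. M plays the mobile module.
toy_neighbours = {
    "p": {"p", "q", "x1"},
    "q": {"q", "p", "y1"},
    "x1": {"x1", "p"},
    "y1": {"y1", "q"},
}
nested, maximal, overlapping = classify_clusters(toy_neighbours)
print(nested, maximal, overlapping)
```

A pair reported as overlapping is only a candidate: confirming a mobile module additionally requires that the unique neighbours map to regions disjoint from the common block, which needs the alignment coordinates.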

5. Quality control

There is a need for a systematic evolutionary classification of all protein sequences, and several systematic, global clusterings have been proposed in recent years. Quality control is a key issue. Can carefully designed automatic, algorithmic approaches match the quality or improve the consistency of manually curated collections? Comparison between family classifications is not straightforward because of their different definitions, scopes and purposes (Table 1). In sequence analysis, there is the fundamental problem that statistical significance does not guarantee a biologically significant relationship. If the problem is too complex to formalize, manual curation by experts is the only solution. However, given a model (and objective function), automatic methods can yield an optimal, internally consistent solution. Currently, hybrid solutions are popular, with databases using each other as sources. In general, we identify the following problem areas.

Accuracy of domain borders: Sequence-based domain definition, focussing on conserved blocks in a multiple alignment, does not necessarily reproduce structural domains (Elofsson and Sonnhammer, 1999). For example, adenosine deaminases have long inserts between the first and second beta strands of a (βα)8 barrel fold, which had led to a truncated profile in Pfam 4 that missed out part of the active site (since corrected). For experimental structure determination, it is important to delineate targets that assume a stable, native fold.

Error vs. coverage: Thresholds are calibrated on known structures, and usually a threshold is chosen which yields the highest coverage at an acceptable (low) error rate. Depending on the method of clustering, a wrong member can pull in many other sequences unrelated to the seed. Hierarchic unification is "safer" than closure starting from an arbitrary seed sequence (e.g. PSI-Blast). The mean radius and diameter are indicators of the diversity of the members of a cluster. Direct proof of the validity of clusters with large radius in sequence space has to await the availability of 3D structures.

Optimal profile centres: The search for protein families can be formulated elegantly in terms of identifying strongly connected components in graphs of pairwise similarity relationships. When profiles are used to model families, one gets a nasty combinatorial problem because the composition of the seed alignment influences the profile and this can lead to instability of iterative profile searching. Between different classifications, the number of clusters varies depending on the radius of unification but also depending on the underlying sequence database. The minimal set of clusters, which corresponds to homologous unified families, is likely ultimately to be based on structural information (Fig. 6).
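The calibration described under "Error vs. coverage" can be sketched as a scan over score thresholds on a labelled benchmark. This is a minimal sketch under stated assumptions, not the calibration procedure of any particular database: the error rate is taken to be the fraction of accepted pairs that are not true homologues, and the function name and the example scores are hypothetical.

```python
def pick_threshold(scored_pairs, max_error_rate=0.01):
    """Choose the score threshold with the highest coverage whose error rate
    stays at or below `max_error_rate`.

    scored_pairs: iterable of (score, is_true_homologue), e.g. from a
        benchmark of structurally defined families; higher score = better.
    Error rate is defined here (an assumption) as the fraction of accepted
    pairs that are false; coverage as the fraction of true pairs accepted.
    Returns (threshold, coverage, error_rate), or None if nothing qualifies.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    total_true = sum(1 for _, is_true in ranked if is_true)
    true_seen = false_seen = 0
    best = None
    for i, (score, is_true) in enumerate(ranked):
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
        # Evaluate only after the last pair of a run of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        error_rate = false_seen / (true_seen + false_seen)
        coverage = true_seen / total_true if total_true else 0.0
        if error_rate <= max_error_rate and (best is None or coverage > best[1]):
            best = (score, coverage, error_rate)
    return best


# Hypothetical benchmark: three true homologue pairs and two false ones.
benchmark = [(50, True), (40, True), (30, False), (20, True), (10, False)]
print(pick_threshold(benchmark, max_error_rate=0.25))
```

Accepting every pair scoring at least the returned threshold reproduces the reported coverage and error rate on the benchmark; how well they carry over to unlabelled data depends on how representative the benchmark is.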

Fig. 6. The distribution of sequences (dots) in "sequence space" is uneven. Efficient algorithms exist for finding the neighbours of a query sequence. The sets of neighbours of similar sequences are highly overlapping. Therefore, all sequences can be covered by a subset of neighbourhoods which are centred on a representative subset of sequences (black dots in second panel). A partitioning of sequence space into disjoint sets results, for example, from mapping each redundant sequence to its closest representative (third panel). Each sequence now belongs to one cluster. Ideally, clusters should represent groups of homologues, i.e., families, at maximal unification range for maximal information gain in function and structure assignment. Profiles (position-specific scoring matrices or hidden Markov models) can detect a statistically significant relationship to more diverse sequences than pairwise comparison (profile "centres" marked by * in fourth panel).

Fig. 5. (A) Domain chaining illustrated by eight proteins containing five different domain types (1ayx, glucoamylase; 1bec, T-cell antigen receptor; 1clc, endoglucanase CelD; 1ctn, chitinase A; 1exg, cellulose-binding domain of exo-1,4-beta-D-glycanase; 1nar, narbonin; 1nkr, killer cell inhibitory receptor fragment; 1tf4A, endo-1,4-beta-D-glucanase; see Holm and Sander, 1999). Domain types are drawn in a uniform pattern and parts of multidomain proteins are joined by thick lines. The vertical arrangement of domains inside each box represents a multiple alignment. For example, there are four immunoglobulin-like domains at the far left. (B) Graph representation of the eight proteins. The numbers are the cardinality of each node. Alternative covering sets are indicated by the shaded vertices (nodes). All other nodes are first neighbours of the shaded ones. The slanted lines going through nodes indicate multidomain proteins. (C) Neighbour list representation of the eight proteins. On each row, the grey cells represent a cluster, i.e., the set of neighbours of the seed sequence on the left. Clusters (rows) have been sorted by cardinality. Dark rows are maximal clusters and light grey rows are nested clusters which are contained in a maximal cluster. The overlapping maximal clusters form the leftmost covering set in (B), which also identifies those proteins which contain mobile modules. (D) Schematic view of clusters expanded to the level of explicit multiple alignments (profiles), and the corresponding graph view. Maximal clusters are shaded. Vertically aligned bars represent common alignment blocks, white segments are unique. A closed neighbourhood (left): This set of profiles has a single maximal cluster represented by B. C and D share a block which is absent from B or not detected by the alignment method. This block is uniquely associated with the common block that unifies C and D with B, and there is insufficient information to determine whether it is a mere subfamily-specific insertion or an independent structural domain. Mobile modules (right): Here, overlapping maximal clusters are formed due to domain chaining. Mobile modules are identified by overlapping maximal clusters, and confirmed by the fact that D and A map to disjoint regions compared to the common block between B and C. Separating the domains results in closed neighbourhoods (maximal cluster representatives shaded).
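The neighbour list analysis of panel (C) can be sketched in a few lines: clusters are sorted by cardinality, a cluster is nested if it is contained in a maximal cluster, and overlapping maximal clusters hint at domain chaining. This is an illustrative sketch under simplifying assumptions (neighbour lists are given as plain sets, and the proteins `A` to `D` are invented); `classify_clusters` and `overlapping_maximal` are hypothetical names, not functions of any published program.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def classify_clusters(
    neighbours: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Split seeds into maximal clusters and clusters nested in a maximal one."""
    # Each cluster is the seed plus its neighbours.
    clusters = {seed: frozenset(members | {seed}) for seed, members in neighbours.items()}
    # Sort by decreasing cardinality (ties broken by seed name) so that any
    # cluster containing another one is visited first.
    ordered = sorted(clusters, key=lambda seed: (-len(clusters[seed]), seed))
    maximal: List[str] = []
    nested: Dict[str, str] = {}
    for seed in ordered:
        container = next((m for m in maximal if clusters[seed] <= clusters[m]), None)
        if container is None:
            maximal.append(seed)
        else:
            nested[seed] = container
    return maximal, nested


def overlapping_maximal(
    neighbours: Dict[str, Set[str]], maximal: List[str]
) -> List[Tuple[str, str]]:
    """Pairs of maximal clusters that share members (candidate domain chaining)."""
    return [
        (a, b)
        for a, b in combinations(maximal, 2)
        if (neighbours[a] | {a}) & (neighbours[b] | {b})
    ]


# Invented toy data: B and C are linked, A is only linked to B, D only to C.
neighbour_lists = {
    "A": {"A", "B"},
    "B": {"A", "B", "C"},
    "C": {"B", "C", "D"},
    "D": {"C", "D"},
}
maximal, nested = classify_clusters(neighbour_lists)
overlaps = overlapping_maximal(neighbour_lists, maximal)
print(maximal, nested, overlaps)
```

Confirming that an overlap is really due to mobile modules would additionally require the alignment coordinates, as in panel (D); the set analysis alone only flags candidates.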

Fig. 7. Sequence–structure alignment of cytidylyltransferase on template structure of tyrosyl-tRNA synthetase. The alignment was anchored on a conserved HiGH sequence motif (arrows in top row) and optimized with respect to an empirical solvation potential (Bork et al., 1995). The overall sequence identity is very low and profile searching methods (which ignore long-sequence-range residue interactions) are unable to detect the similarity. Notation: 2ts1, the 3D template sequence; 1cozA, cytidylyltransferase structure aligned to 2ts1; pred, alignment predicted before the cytidylyltransferase structure was solved.

6. Coverage in structural genomics

Structural genomics aims to solve enough structures to bring any protein sequence within modelling distance of a known 3D structure, which is an instance of the covering problem in global sequence clustering. The strong conservation of 3D structure, or fold, between homologues means that if the 3D structure of only one family member is known, then by implication one can derive a 3D model of all family members using model building by homology. At present, the radius of family unification based on sequence–sequence comparison is actually larger than the radius of convergence of automatic model building software (Sanchez et al., 2000; Vriend, 1990; Bates and Sternberg, 1999). Until these methods fully exploit empirical sequence–structure potentials for screening and alignment optimization (Fig. 7), good alignments are rarely achieved below 40% sequence identity between modelled protein and 3D template. It is also noteworthy that even in the case of perfect alignment, increasing modelling distance also increases the structural differences (insertions, backbone shifts) that structure refinement programs would need to overcome to generate accurate models. Given a family classification, prioritized lists of targets for structure determination can be generated from family attributes, such as species distribution, functional importance or unknown function, and predicted (in)solubility. In the seminal February 1999 Structural Genomics Initiative meeting organized by Moult and Sander under the auspices of NIH (http://www.nigms.nih.gov/news/meetings/structural_genomics_targets.html), classification groups were asked to present top-30 target lists of structurally unknown protein families. In this context, global classification directs attention to families which might have been overlooked in manually compiled profile libraries. For example, Picasso identified a number of large, universally conserved families that were not described at that time in either Prosite or Pfam.
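As a rough illustration of how a prioritized target list might be derived from family attributes, the sketch below filters out families with a known structure or predicted insolubility and ranks the rest. The `Family` record, its fields, the example families and the ranking order (wide species distribution first, then unknown function) are all assumptions made for illustration; the text does not prescribe a particular scoring scheme.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Family:
    # All fields are hypothetical attributes of a protein family.
    name: str
    has_known_structure: bool
    n_species: int           # proxy for species distribution
    function_known: bool
    predicted_soluble: bool


def target_list(families: Sequence[Family], top_n: int = 30) -> List[Family]:
    """Rank structurally unknown, predicted-soluble families for structure determination."""
    candidates = [
        f for f in families
        if not f.has_known_structure and f.predicted_soluble
    ]
    # Assumed priority: wider species distribution first, then families of
    # unknown function (False sorts before True), then name for stability.
    candidates.sort(key=lambda f: (-f.n_species, f.function_known, f.name))
    return candidates[:top_n]


# Invented example families.
families = [
    Family("fam_a", False, 120, True, True),
    Family("fam_b", True, 300, True, True),
    Family("fam_c", False, 120, False, True),
    Family("fam_d", False, 15, False, False),
    Family("fam_e", False, 40, False, True),
]
targets = [f.name for f in target_list(families, top_n=2)]
print(targets)
```

Other orderings are equally defensible, for instance ranking families of known functional importance ahead of those of unknown function; the sort key is the only part that would need to change.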

7. Conclusions

Global organization of protein sequences into families is needed to direct functional and structural genomics and to reap the harvest of these initiatives. The benefits of describing all protein families are more sensitive detection by profile searches, faster searches against a smaller database (the profile library), and improved consistency in function and structure assignment. The field offers a number of challenging computational problems. Sequence search methods fail to detect remote homologues consistently. In practice, however, a covering set of profiles is sufficient for nearest neighbour assignment and does not require solving the homology detection problem. Hierarchical unification ensures that each sequence is consistently mapped with its nearest relatives. Important aspects of the domain decomposition problem can be solved using graph- and set-theoretic analysis.

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.

Andrade, M.A., Brown, N.P., Leroy, C., Hoersch, S., de Daruvar, A., Reich, C., Franchini, A., Tamames, J., Valencia, A., Ouzounis, C., Sander, C., 1999. Automated genome sequence analysis and annotation. Bioinformatics 15, 391–412.

Attwood, T.K., Croning, M.D.R., Flower, D.R., Lewis, A.P., Mabey, J.E., Scordis, P., Selley, J.N., Wright, W., 2000. PRINTS-S: the database formerly known as PRINTS. Nucl. Acids Res. 28, 225–227.

Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., Sonnhammer, E.L.L., 2000. The Pfam protein families database. Nucl. Acids Res. 28, 263–266.

Bates, P.A., Sternberg, M.J., 1999. Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins 37 (S3), 47–54.

Bork, P., Holm, L., Koonin, E., Sander, C., 1995. The cytidylyltransferase superfamily: identification of nucleotide-binding site and fold prediction. Proteins 22, 259–266.

Bork, P., Koonin, E.V., 1998. Predicting functions from protein sequences – where are the bottlenecks? Nat. Genet. 18, 313–318.

Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D., 1998. Approaches to the automatic discovery of patterns in biosequences. J. Comput. Biol. 5, 279–305.

Brenner, S.E., Chothia, C., Hubbard, T.J., 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073–6078.

Corpet, F., Gouzy, J., Kahn, D., 1999. Recent improvements of the ProDom database of protein domain families. Nucl. Acids Res. 27, 263–267.

Corpet, F., Servant, F., Gouzy, J., Kahn, D., 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucl. Acids Res. 28, 267–269.

Dayhoff, M.O., Barker, W.C., Hunt, L.T., 1983. Establishing homologies in protein sequences. Methods Enzymol. 91, 524–545.

Dodge, C., Schneider, R., Sander, C., 1998. The HSSP database of protein structure-sequence alignments and family profiles. Nucl. Acids Res. 26, 313–315.

Eddy, S.R., 1998. Profile hidden Markov models. Bioinformatics 14, 755–763.

Elofsson, E., Sonnhammer, E.L.L., 1999. A comparison of sequence and structure protein domain families as a basis for structural genomics. Bioinformatics 15, 480–500.

Gracy, J., Argos, P., 1998. Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment. Bioinformatics 14, 164–173; II. Delineation of domain boundaries from sequence similarities. Bioinformatics 14, 174–187.

Grundy, W.N., 1998. Homology detection via family pairwise search. J. Comput. Biol. 5, 479–491.

Guan, X., Du, L., 1998. Domain identification by clustering sequence alignments. Bioinformatics 14, 783–788.

Hanke, J., Lehrmann, G., Bork, P., Reich, J.G., 1999. Associative database of protein sequences. Bioinformatics 15, 741–748.

Henikoff, J.G., Greene, E.A., Pietrokovski, S., Henikoff, S., 2000. Increased coverage of protein families with the Blocks Database servers. Nucl. Acids Res. 28, 228–230.

Hobohm, U., Scharf, M., Schneider, R., Sander, C., 1992. Selection of representative protein data sets. Protein Sci. 1, 409–417.

Hofmann, K., Bucher, P., Falquet, L., Bairoch, A., 1999. The PROSITE database, its status in 1999. Nucl. Acids Res. 27, 215–219.

Holm, L., 2000. Picasso: generating a covering set of protein family profiles. Bioinformatics, in press.

Holm, L., Sander, C., 1997. An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins 28, 72–82.

Holm, L., Sander, C., 1998. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14, 423–429.

Holm, L., Sander, C., 1999. Protein folds and families: sequence and structure alignments. Nucl. Acids Res. 27, 244–247.

Jonassen, I., Collins, J.F., Higgins, D.G., 1995. Finding flexible patterns in unaligned protein sequences. Protein Sci. 4, 1587–1595.

Karplus, K., Barrett, C., Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846–856.

Krause, A., Stoye, J., Vingron, M., 2000. The SYSTERS protein sequence cluster set. Nucl. Acids Res. 28, 270–272.

Krause, A., Vingron, M., 1998. A set-theoretic approach to database searching and clustering. Bioinformatics 14, 430–438.

Kyrpides, N.C., Ouzounis, C.A., 1998. Errors in genome reviews. Science 281, 1457.

Lima, C.D., Klein, M.G., Hendrickson, W.A., 1997. Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science 278, 286–290.

Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G., Chothia, C., 2000. SCOP: a structural classification of proteins database. Nucl. Acids Res. 28, 257–259.

Orengo, C.A., Pearl, F.M., Bray, J.E., Todd, A.E., Martin, A.C., Lo Conte, L., Thornton, J.M., 1999. The CATH Database provides insights into protein structure/function relationships. Nucl. Acids Res. 27, 275–279.

Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., Chothia, C., 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284, 1201–1210.

Park, J., Holm, L., Heger, A., Chothia, C., 1999. RSDB: representative sequence databases with high information content. Bioinformatics 16, 458–464.

Park, J., Teichmann, S.A., 1998. DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics 14, 144–150.

Patel, S., Latterich, M., 1998. The AAA team: related ATPases with diverse functions. Trends Cell Biol. 8, 65–71.

Pearson, W.R., 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185–219.

Rigoutsos, I., Floratos, A., Ouzounis, C., Gao, Y., Parida, L., 1999. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins 37, 264–277.

Sanchez, R., Pieper, U., Mirkovic, N., de Bakker, P.I., Wittenstein, E., Sali, A., 2000. MODBASE, a database of annotated comparative protein structure models. Nucl. Acids Res. 29, 250–253.

Schaeffer, A.A., Wolf, Y.I., Ponting, C.P., Koonin, E.V., Aravind, L., Altschul, S.F., 1999. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific matrices. Bioinformatics 15, 1000–1011.

Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C., Sander, C., 1994. GeneQuiz: a workbench for sequence analysis. ISMB 2, 348–353.

Schultz, J., Copley, R.R., Doerks, T., Ponting, S.P., Bork, P., 2000. SMART: a web-based tool for the study of genetically mobile domains. Nucl. Acids Res. 28, 231–234.

Seraphin, B., 1992. The HIT protein family: a new family of proteins present in prokaryotes, yeast and mammals. DNA Seq. 3, 177–179.

Sonnhammer, E.L., Kahn, D., 1994. Modular arrangement of proteins as inferred from analysis of homology. Protein Sci. 3, 482–492.

Tatusov, R.L., Galperin, M.Y., Natale, D.A., Koonin, E.V., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucl. Acids Res. 28, 33–36.

Teichmann, S.A., Chothia, C., Church, G.M., Park, J., 2000. Fast and reliable assignment of protein structures to sequences. Bioinformatics 22, 117–124.

Vriend, G., 1990. WHAT IF: a molecular modelling and drug design program. J. Mol. Graph. 8, 52–56.

Wu, C.H., Shivakumar, S., Huang, H., 1999. ProClass protein family database. Nucl. Acids Res. 27, 272–274.

Yona, G., Linial, N., Tishby, N., Linial, M., 1998. A map of the protein space – an automatic hierarchical classification of all protein sequences. ISMB 1998, 212–221.

Yona, G., Linial, N., Linial, M., 2000. ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucl. Acids Res. 28, 49–55.