Towards a Covering Set of Protein Family Profiles

Progress in Biophysics & Molecular Biology 73 (2000) 321–337

Review

Andreas Heger, Liisa Holm*

Structural Genomics Group, EMBL-EBI, Cambridge CB10 1SD, UK

*Corresponding author. Fax: +44-1223-494470.

Abstract

Evolutionary classification leads to an economical description of the protein sequence universe because attributes of function and structure are inherited in protein families. Efficient strategies of functional and structural genomics therefore target one representative from each family. Enumerating all families and establishing family membership consistently based on sequence similarities are nontrivial computational problems. Emerging concepts and caveats of global sequence clustering are reviewed. Explicit multiple alignments coupled with neighbourhood analysis lead to domain segmentation, and hierarchical unification helps to resolve conflicts and validate clusters. Eventually, every part of every sequence will be assigned to a domain family which is uniquely associated with a fold and a molecular function. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Domains; Homology; Sequence alignment; Structural genomics

Contents

1. Evolutionary classification
2. Discovering families
3. Topographical maps of sequence space
4. Domain decomposition
5. Quality control
6. Coverage in structural genomics
7. Conclusions
References

0079-6107/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.
PII: S0079-6107(00)00013-4

1. Evolutionary classification

The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution to the computational assignment of protein structure and function to uncharacterized sequences: functional and structural information can be transferred between homologous proteins. Homologues carry the memory of common ancestry in their amino acid sequences as a result of functional constraints that have persisted through successive generations. Sequence similarity searching is today the most powerful tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.

Annotation by similarity is typically based on a nearest neighbour approach. Neighbours of a query sequence are detected using fast sequence database search programs such as Blast (Altschul et al., 1997) or Fasta (Pearson, 2000). Many implementations worldwide follow the model of GeneQuiz (Scharf et al., 1994) and pick the highest-scoring hit with informative annotation attached to it to generate a plausible function of the query protein. There are numerous possibilities of error in this approach (Andrade et al., 1999; Kyrpides and Ouzounis, 1998; Bork and Koonin, 1998). For example, sequence databases generally contain poor positional pointers of functional domains, and an erroneous inference will result if the annotation of the database hit refers to the presence of a particular functional domain but the query protein matches in a different region. More difficult to check and correct are the quality and integrity of second-hand, similarity-derived functional annotation in the search databases. In particular, it is not possible to trace back a chain of annotations by similarity, nor to propagate changes in annotation to dependent sequence entries in current sequence databases.
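The nearest neighbour rule described above, and the regional mismatch that can mislead it, can be illustrated with a minimal sketch. The `Hit` record, its field names and the list of uninformative keywords are hypothetical and chosen only for illustration; they do not reproduce GeneQuiz or the output format of any search program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Hypothetical database hit; field names are illustrative only."""
    subject_id: str
    score: float
    annotation: str                      # free-text description of the database entry
    subject_start: int                   # aligned region on the database sequence (1-based, inclusive)
    subject_end: int
    domain_start: Optional[int] = None   # position of the annotated functional domain, if recorded
    domain_end: Optional[int] = None


# Assumption: a crude keyword filter stands in for "informative annotation".
UNINFORMATIVE = ("hypothetical", "unknown", "putative protein")


def is_informative(annotation: str) -> bool:
    text = annotation.lower().strip()
    return bool(text) and not any(word in text for word in UNINFORMATIVE)


def covers_domain(hit: Hit) -> bool:
    """True if the aligned region overlaps the annotated domain.

    When no domain position is recorded the check cannot be made,
    so the hit is accepted as is.
    """
    if hit.domain_start is None or hit.domain_end is None:
        return True
    return hit.subject_start <= hit.domain_end and hit.domain_start <= hit.subject_end


def transfer_annotation(hits, check_region=True):
    """Return the annotation of the highest-scoring informative hit, or None."""
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not is_informative(hit.annotation):
            continue
        if check_region and not covers_domain(hit):
            continue
        return hit.annotation
    return None


hits = [
    Hit("a", 300.0, "hypothetical protein", 1, 100),
    Hit("b", 250.0, "protein kinase", 400, 500, domain_start=10, domain_end=200),
    Hit("c", 120.0, "ABC transporter", 1, 200, domain_start=20, domain_end=180),
]
naive = transfer_annotation(hits, check_region=False)
checked = transfer_annotation(hits)
```

In this toy input the naive rule transfers the annotation of hit `b`, although the alignment lies outside the annotated domain; the region check falls through to hit `c`. The check is only possible when domain positions are recorded, which the text notes is often not the case.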
Protein family classification opens a way out of this muddle. Grouping proteins into families is useful in two ways. First, it leads to more sensitive detection of new members and improved discrimination against spurious hits based on the essential conserved features in a family as expressed by profiles (position-specific scoring matrices or hidden Markov models) or patterns (regular expressions). Second, having established family membership, the query sequence can be placed in the context of the evolutionary tree of the family for accurate functional inference. It is also easier to spot inconsistent second-hand annotations in the tree context. The physiological role, cellular function and substrate specificity of homologous proteins can diverge remarkably during evolution (Holm and Sander, 1997). A query sequence rooted inside a homogeneous branch (in terms of direct experimental annotation) is likely to have the same function, while one between different functional subclasses may represent a novel subfamily. It may be convenient to collapse the tree to hierarchic discrete subclasses that reflect major evolutionary changes in the functional specialization of subfamilies.

Many large families have dedicated databases, where new sequences are inserted into existing or newly created subfamilies manually (e.g. AAA family, http://yeamob.pci.chemie.uni-tuebingen.de/AAA; Patel and Latterich, 1998) or automatically (GPCRDBsup, http://jura.ebi.ac.uk:8901/html/; Heger, unpublished). Libraries of profile models have been generated around sets of particular interest, such as all known structures (Dodge et al., 1998; Teichman et al., 2000; Schaeffer et al., 1999) and large families. For example, Pfam 5.2 (Bateman et al., 2000) contains 2128 HMMs which cover about 64% of all sequences in SwissProt release 38. The coverage of complete genomes is lower.
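The difference between the two family descriptors named above, a pattern (regular expression) and a profile (position-specific scoring matrix), can be sketched as follows. The four aligned sequences are invented toy data, and the uniform background frequencies, add-one pseudocounts and ungapped scanning are simplifying assumptions, not the procedure of any particular profile library.

```python
import math
import re

# Toy aligned family members (hypothetical sequences, for illustration only).
ALIGNMENT = ["HVHLH", "HIHVH", "HVHIH", "HLHVH"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)   # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                  # assumption: add-one smoothing

# A pattern is a regular expression: it either matches or it does not.
PATTERN = re.compile(r"H.H.H")


def has_pattern(sequence):
    return PATTERN.search(sequence) is not None


def build_profile(alignment):
    """Per-column log-odds scores: a minimal position-specific scoring matrix."""
    profile = []
    for column in zip(*alignment):
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        profile.append(scores)
    return profile


def best_profile_score(profile, sequence):
    """Best ungapped window score of the profile along the sequence."""
    width = len(profile)
    if len(sequence) < width:
        return float("-inf")
    return max(
        sum(col[aa] for col, aa in zip(profile, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    )


profile = build_profile(ALIGNMENT)
member = "AAHVHLHAA"     # contains the full motif
divergent = "AAHVQLHAA"  # central histidine replaced
unrelated = "AAAAAAAAA"
```

With these inputs `has_pattern(divergent)` is false, while `best_profile_score(profile, divergent)` still lies between the scores of `member` and `unrelated`: the pattern gives a yes/no answer, whereas the profile gives a graded score that a threshold must then interpret.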
A complete covering (a profile library representing every family) would reveal the evolutionary origins of genomes and would also reduce the risk of wrongly assigning marginal hits because the full tree context is available.

In recent years, interest in protein sequence classification has been fuelled by the challenge of interpreting genome sequences, and an increasing number of groups are active in the field (Table 1). Here, we review emerging concepts (Krause and Vingron, 1998; Yona et al., 1998; Holm, 2000) and caveats of computational approaches to globally organize protein sequence databases into protein families.

2. Discovering families

Historically, families have been identified one by one, based on similarities to individual proteins under study by individual scientists. The process starts from the compilation of a multiple alignment of similar sequences. Methods for finding similar sequences and the thresholds deemed safe to infer homology from similarity differ between sequence classifications. The usage of the terms family and superfamily (unified family) is also not uniform, and the terms can represent different levels of the functional hierarchy. The PIR definition of superfamilies (Dayhoff et al., 1983) is conservative in terms of sequence identity, while structure-based classifications unify remote homologues whose structural and functional features suggest a common evolutionary origin despite very low sequence identities (Holm and Sander, 1999; Orengo et al., 1999; Lo Conte et al., 2000).

From a multiple alignment, one can identify conserved patterns which are diagnostic of family membership or a functional site (Attwood et al., 2000; Hofmann et al., 1999). For example, the histidine triad family was named after a highly conserved H.H.H motif (Seraphin, 1992).
Later, this family was unified with UDP-glucosyl transferases and Ap4A-phosphorylases and shown to be a hydrolase, and the invariant part of the pattern was reduced to a single histidine required for catalysis (Lima et al., 1997). In current sequence databases, the extended family can be detected by iterative profile searching (Altschul et al., 1997).

Profiles (Eddy, 1998) and patterns have different advantages and limitations. In principle, a profile yields a probability of family membership while a pattern is either present or absent. Profiles are coupled to explicit multiple alignments, and there are unsolved problems related to the statistics of gapped alignment scores and algorithmic convergence of the search for an optimal profile. For example, thresholding the neighbourhood radius in terms of e-values in PSI-Blast leads to nonidentical profiles at the end of iteration depending on the seed sequence and the distribution of neighbours in sequence space (Fig. 1). Furthermore, many biologically genuine homologues get statistically insignificant scores, as has been shown in benchmarks against structurally defined families (Brenner et al., 1998; Park et al., 1998; Karplus et al., 1998).

Dynamic thresholding is used in a special definition of families, which addresses the question of orthologues versus paralogues (Tatusov et al., 2000). This analysis involves only complete genomes to construct the directed graph of nearest-neighbour relationships. Cliques of at least three bidirectional nearest-neighbour sequences form a COG (cluster of orthologous groups). About

Table 1
Comparison of several sequence clusterings and family classifications

Columns: Database version; Reference; Basis set; Method/description; Number of clusters; Global coverage; Explicit alignments; Explicit domains; Result validation

ADOPS; Hanke et al. (1999); Pattern; Associative memory; Not applicable; Yes; Yes; No; Visual
Blocks+23 Jan Henikoff
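The graph construction described in the text for COGs, a directed graph of nearest-neighbour relationships from which cliques of bidirectional nearest neighbours are extracted, can be sketched as below. This is a minimal illustration under stated assumptions, not the published COG procedure: the input dictionaries `best_hit` and `genome_of` are hypothetical, only cliques of exactly three sequences are enumerated, and neither dynamic thresholding nor any later merging of clusters is modelled.

```python
def symmetric_best_hits(best_hit, genome_of):
    """Collect pairs of proteins that are each other's nearest neighbour.

    best_hit maps (protein, other_genome) -> the protein's top-scoring hit
    in that genome (a directed edge); genome_of maps protein -> its genome.
    Both structures are hypothetical inputs for this sketch.
    """
    pairs = set()
    for (protein, _target_genome), partner in best_hit.items():
        if partner == protein:
            continue  # ignore self-hits
        if best_hit.get((partner, genome_of[protein])) == protein:
            pairs.add(frozenset((protein, partner)))
    return pairs


def triangles(pairs):
    """Enumerate cliques of three mutually bidirectional nearest neighbours."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for pair in pairs:
        a, b = tuple(pair)
        # Any protein adjacent to both ends of an edge closes a triangle.
        for c in neighbours[a] & neighbours[b]:
            found.add(frozenset((a, b, c)))
    return found


# Toy example: three genomes A, B, C; c2 is a paralogue whose best hits
# are not reciprocated, so it stays outside the triangle.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
best_hit = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",
    ("a1", "C"): "c1", ("c1", "A"): "a1",
    ("b1", "C"): "c1", ("c1", "B"): "b1",
    ("c2", "A"): "a1", ("c2", "B"): "b1",
}
pairs = symmetric_best_hits(best_hit, genome_of)
groups = triangles(pairs)
```

Because best hits are recorded per genome, every triangle found this way spans three different genomes. The non-reciprocated hits of `c2` illustrate how the bidirectional requirement separates a candidate orthologue from a paralogue in the same genome.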
