Vol. 18 no. 7 2002 Pages 922–933

Target space for structural genomics revisited

Jinfeng Liu 1,2 and Burkhard Rost 2,3,∗

1 Department of Pharmacology, Columbia University, 630 West 168th Street, New York, USA
2 CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, NY 10032, New York, USA
3 Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St. Nicholas Avenue, New York, NY 10032, USA

Received on June 5, 2001; revised on January 4, 2002; accepted on February 7, 2002

ABSTRACT

Motivation: Structural genomics eventually aims at determining structures for all proteins. However, in the beginning experimentalists are likely to focus on globular proteins to achieve a rapid basic coverage of protein sequence space. How many proteins will structural genomics have to target? How many proteins will be excluded since we already have structural information for these or since they are not globular? We have to answer these questions in the context of our target selection for the North-East Structural Genomics Consortium (NESG).

Results: We estimated that structural information is available for about 6–38% of all proteins; 6% if we require high accuracy in comparative modelling, 38% if we are satisfied with having a rough idea about the fold. Excluding all regions that are not globular, we found that structural genomics may have to target about 48% of all proteins. This corresponded to a similar percentage of residues of the entire proteomes (52%). We explored a number of different strategies to cluster protein space in order to find the number of families representing these 48% of structurally unknown proteins. For the subset of all entirely sequenced eukaryotes, we found over 18 000 fragment clusters each of which may be a suitable target for structural genomics.

Availability: All data are available from the authors; most results are summarized at: http://cubic.bioc.columbia.edu/genomes/RES/2002 bioinformatics/

Contact: E-mail: [email protected]

∗ To whom correspondence should be addressed.

© Oxford University Press 2002

INTRODUCTION

Structural genomics to determine all native protein structures

In 2000, the National Institutes of Health (NIH) in the USA began to finance pilot projects for large-scale protein structure determination (structural genomics). Two major objectives of structural genomics have often been given. First, experimentally determine one protein structure for each natural protein (NIGMS, 2001). Second, determine one structure for all missing links in pathways and biological mechanisms (Blundell and Mizuguchi, 2000; Burley et al., 1999; Christendat et al., 2000; Gaasterland, 1998a,b; Lima et al., 1997; Moult and Melamud, 2000; Rost, 1998; Shapiro and Harris, 2000; Teichmann et al., 1999; Thornton, 2001). These two objectives correspond to the two aspects of genome sequencing: (i) the mass of data, and (ii) the completeness of entirely sequenced organisms. One expected technical benefit from structural genomics is the development of techniques and protocols for large-scale expression, purification, crystallization and structure determination. An important benefit for molecular biology may be the determination of the structural scaffolds for most basic functional elements. A considerable increase in the fraction of proteins for which we have some structural information may also advance the determination of function for single proteins or entire proteomes. It is commonly assumed that the scaffolds of protein folds constitute one of the ‘basic units’ for evolution. If so, structural genomics will also help to better understand evolution. Structural genomics focuses on structural modules or domains. However, isolated domains do not always suffice to understand function. Instead, understanding function often requires studying complexes composed of many proteins. The difficulty of determining structures for large complexes will be prohibitive for the first round of structural genomics.

Determine one structure for each family of closely related proteins

The safest strategy to go about determining structures for all native proteins is to simply express, purify, crystallize and x-ray all protein sequences one by one, just in the way large-scale genome sequencing operates. However, sequencing is technically much simpler than structure determination. None of the necessary steps (express, purify, crystallise, x-ray) has ever been accomplished on the scale of ‘all proteins in a proteome’. Consequently, we have to find a way of focusing on some representative fraction of all proteins. Resources such as CATH (Orengo et al., 1997), FSSP (Holm and Sander, 1999), HSSP (Schneider et al., 1997), or SCOP (Lo Conte et al., 2000) illustrate that fewer than 1000 folds and about 2500 families are representative for over 20 000 structures deposited in PDB (Berman et al., 2000). Hence, the conceptually simple refinement of the selection strategy is to determine one structure for each unknown fold. Unfortunately, this straightforward concept hides a number of severe problems. The first is that the absolute majority of similar folds have less than 12% pairwise sequence identity (Rost, 1997, 1999; Yang and Honig, 2000b), i.e. populate the midnight zone of sequence comparisons in which we cannot detect the fold similarity from sequence alone. Hence, we would have to determine the fold to find the set of representative folds. One way around this vicious circle is to reformulate the goal: determine one structure for each family of proteins that are related by sequence. The levels of pairwise sequence similarity that imply similarity in structure are well established (Abagyan and Batalov, 1997; Brenner et al., 1998; Muller et al., 1999; Park et al., 1997, 1998; Rost, 1999; Sander and Schneider, 1991; Yang and Honig, 2000b). Thus, it may seem that all bioinformatics has to do is to cluster all proteins into families of proteins with similar structures, exclude all clusters with known structures and define the remaining list as the target list for structural genomics (Vitkup et al., 2001). In fact, this procedure describes the current modus operandi of structural genomics initiatives fairly well. Additionally, most groups exclude clusters that are particularly problematic due to the presence of membrane regions, and/or long regions of low-complexity.

The age of structural genomics has begun

The currently active structural genomics groups differ in their focus. Most groups focus on particular organisms: Mycoplasma genitalium and Mycoplasma pneumoniae by BSGC (Kim, 2001), Caenorhabditis elegans by JCSG (Wilson, 2001), Mycobacterium tuberculosis by TBSGC (Terwilliger, 2001), Caenorhabditis elegans and Pyrococcus furiosus by SECSG (Wang, 2001), Saccharomyces cerevisiae by the YSG (YSG, 2001), Thermus thermophilus by SRG (Yokoyama and Kuramitsu, 2001), Homo sapiens by PSF (Umbach, 2001). Two groups focus on particular protein types (short proteins from eukaryotes by NESG, Montelione, 2001; disease-related and ‘easy’ proteins by MCSG, Joachimiak, 2001), and one on particular functional types (enzymes by NYSGRC; Burley, 2001). The nine initiatives currently financed by the National Institutes of Health (NIH) in the USA together intend to add about 2000 structures over the next four years. Given that almost 3000 new protein chains have been added to PDB (Berman et al., 2000) over the last 12 months, this number may appear small. However, of the 3000 structures added to PDB in 2001, only about 500 belonged to families of unknown structure (Berman et al., 2000; Eyrich et al., 2001a,b). Since the structural genomics initiatives set out to determine structures exclusively for such families of unknown structures, their yield would double the number of families for which structures will be added until 2005. Another implicit goal of all structural genomics initiatives is to reduce the costs of determining a protein structure from its current value of about $100K/protein. Interestingly, the US-based pilot groups receive about this amount from the NIH to determine their projected 2000 structures. However, the goal of the first round of structural genomics is not primarily to determine as many structures as possible; rather, it is to pioneer the development of techniques that will be required for a cost-efficient large-scale structure determination.

Existing methods that cluster sequence-space

Over the last years, a number of groups have presented different approaches to cluster sequence space. CATH (Orengo et al., 1997), FSSP (Holm and Sander, 1999), and SCOP (Lo Conte et al., 2000) group proteins of known structure according to their fold. These classifications can then be extended to homologous proteins for which we do not experimentally know structure, a concept pioneered in the HSSP database (Sander and Schneider, 1991). When we want to group proteins into families without knowing the structure of any protein in that family, the problem becomes how to define the boundaries of which proteins to include. For example, should proteins A and F in Figure 1 become part of the same family, or should we try to chop both A and F into domains and build a family labelled ‘Domain 1’ in Figure 1? PFAM (Bateman et al., 2000; Sonnhammer et al., 1997) is an expert annotated database of protein families that tries to build multiple alignments of regions in proteins that are believed to constitute domains. One limitation of PFAM is that not all known proteins are included yet. Even more limited in that respect is the similar approach toward listing all domains of secreted proteins in SMART (Ponting et al., 1999). COG (Tatusov et al., 2000, 1997) builds clusters of orthologous groups (COGs: proteins in different species that evolved from a common ancestral protein) or orthologous sets of paralogues (proteins from the same organism which are believed to be related by duplication) from at least three species. The authors try to split multi-domain clusters through pair relations. ProtoMap (Linial and Yona, 2000; Yona et al., 1998, 1999, 2000) is an automatic, hierarchical classification of the entire SWISS-PROT database (Bairoch and Apweiler, 2000a) that is based on pairwise relations. The particular algorithm introduced by ProtoMap for merging and splitting groups of pairwise related proteins yields an implicit
separation into clusters with single and multiple domains. An attempt at combining sequence-based and structure-based classifications is implemented in BioSpace that first clusters all proteins of known structures and then pulls in proteins of unknown structures in a way similar to the ProtoMap algorithm. Finding consensus motifs in alignments and then cutting according to some statistical criteria is the concept that leads to the automatic classification of all proteins in ProDom (Corpet et al., 2000). The particular problem of ProDom is that the domains found tend to be shorter than those assigned from known protein structures. The basic idea of using boundaries in alignments to identify domains has also been implemented by other groups (Enright and Ouzounis, 2000; Marcotte et al., 1999). In particular, the GeneRAGE (Enright and Ouzounis, 2000) algorithm appears to yield domains that resemble structural domains. ProClass classifies proteins into families based on PROSITE sequence motifs (Bairoch et al., 1997; Hofmann et al., 1999) and PIR super-families (Barker et al., 2000). Domains are not explicitly detected by ProClass; rather, they are taken from previous annotations (from experts, PFAM, or ProDom). Picasso (Heger and Holm, 2000) is another approach clustering protein space based on pairwise relations. It seems that Picasso splits domains in a way similar to the GeneRAGE algorithm. The idea of mapping the space of all proteins implies that we have some sort of metric that defines a distance between two groups. The problem with this concept is that we can only measure the similarity, not the distance, between two proteins. For example, assume proteins A and B are both 100 residues long. If they have 33 pairwise identical residues, we can infer that they have similar structures (Rost, 1999). If they only have 25 pairwise identical residues, we know that the odds are one in ten that A and B have similar structure. However, these odds reflect our lack of knowledge of the relation between A and B rather than their actual structural similarity. In fact, A and B may structurally be more similar than another pair of proteins with 33 identical residues. Furthermore, assume we have a globin, an immunoglobulin and a TIM-barrel. We know that the three are not similar. However, we cannot unambiguously define a distance relationship that concludes something such as the globin is more similar to a TIM-barrel than it is to the immunoglobulin. Amongst all the clustering attempts, ProtoMap appears to be the one that most successfully introduces a kind of distance metric (Linial and Yona, 2000).

Fig. 1. Concepts of clustering and domain splitting. Regions in which the six proteins A–F have significant pairwise sequence similarity are marked as black lines (A). The particular pairs of ‘significant similarity’ are given in the matrix (B: grey boxes mark similar pairs). The six proteins group into two ‘minimal-size’ clusters, with protein F belonging to both. The first of the two clusters constitutes one HSSP file (Holm and Sander, 1999; Schneider et al., 1997). The ‘maximal-size’ clustering assumes that we fail to dissect proteins into domains and want to ascertain that no two clusters have residual similarity. One way of dissecting proteins into domains is the simple triangular inequality: F ∼ E (read ‘similar to’), F ∼ A, but A ≁ E (read ‘not similar to’), which yields a split of F into two domains. Note that C is not split into two domains because its similarity to D is assumed to be on the borderline, i.e. below some given threshold (indicated by light grey in B).

Here, we re-evaluated earlier estimates (Liu and Rost, 2001; Teichmann et al., 1999; Vitkup et al., 2001) for the number of structural families to target by structural genomics efforts. We also presented two clustering strategies that illustrated problems with the simple concept of ‘one structure per family’. In particular, our maximal-size clusters illustrated that we fail to cluster sequence-space if we do not dissect the sequences into structural domains before we start clustering. Our preliminary implementation of a domain-dissection approach suggested that structural genomics initiatives might have to target over 18 000 fragment clusters in eukaryotes alone. This estimate resulted from the proteins that we selected in our second round for the target selection of the North-East Structural Genomics Consortium (NESG; http://www.nesg.org).

METHODS

Source of sequences

We obtained the sequences for the entire proteomes of the 30 organisms we analysed from the public domain. All ORFs were downloaded from ftp://ncbi.nlm.nih.gov/genbank/genomes/, except for Homo sapiens (from SWISS-PROT release 39 and TrEMBL database release 15), Drosophila melanogaster (from http://www.fruitfly.org/, release 2), and Caenorhabditis elegans (from http://www.sanger.ac.uk/Projects/C_elegans/wormpep/, wormpep 65).

Prediction methods

Search for similar proteins. We detected similar sequences in two ways. (1) Run PSI-BLAST (Altschul et al., 1997) searches against all known sequences contained in SWISS-PROT (Bairoch and Apweiler, 2000b), TrEMBL (Bairoch and Apweiler, 2000b), and PDB
(Berman et al., 2000). For simplicity, we refer to the combination of these three databases as the set BIG. We first searched against a filtered version of BIG and then used the final profile to search against the unfiltered BIG (Jones, 1999; Przybylski and Rost, 2002). We included all hits below a PSI-BLAST E-value of 10^-3. We tested various thresholds for ‘significant sequence similarity to protein of known structure’. Firstly, we included all protein pairs with more than 50% pairwise identical residues (corresponding to ‘high accuracy in comparative modelling’). Secondly, we included all pairs above the refined HSSP-curve (medium accuracy in comparative modelling) relating the length of the alignment to the respective pairwise sequence identity/similarity (Rost, 1999; Sander and Schneider, 1991). Thirdly, we included all pairs with PSI-BLAST E-values below 10^-3 (for most of these, comparative modelling supposedly identifies the basic fold schematically).

Predict membrane proteins. We used only the filtered MaxHom alignments (Rost, 1999) for predicting membrane regions by the program PHDhtm (Rost, 1996; Rost et al., 1995, 1996) using the default threshold of 0.8. We adjusted the total number of membrane proteins according to the false positive rate (1.6%) and false negative rate (3%) published in the original paper (Rost et al., 1996):

\[ n = \frac{1 - FP}{1 - FN - FP} \cdot n_{pred} - \frac{FP}{1 - FN - FP} \cdot n_{total} \qquad (1) \]

where n was the final number of membrane proteins we reported, FP and FN were the false positive and false negative rates respectively, n_pred was the number of predicted membrane proteins in the genome, and n_total was the total number of proteins in the genome. Note: our notion of ‘membrane proteins’ is restricted to integral helical membrane proteins. In particular, we ignored proteins anchoring helices in the membrane or those inserting beta-strands (porins) since these classes of proteins cannot be identified from sequence information alone.

Predicting signal peptides. We predicted signal peptides using the program SignalP (Nielsen et al., 1996, 1997). We considered a protein to contain a signal peptide if the ‘mean S’ value in the prediction was above the default threshold. The accuracy of SignalP was estimated to be around 90% (Emanuelsson et al., 2000; Nielsen et al., 1997). We excluded archaebacteria from the analysis since SignalP was developed for prokaryotes and eukaryotes.

Predicting coiled-coil helices. We used the program COILS (Lupas, 1996; Lupas et al., 1991) to predict coiled-coil regions, with the window-size set to 28 residues and the threshold for probability set to 0.9.

Identifying regions of low-complexity (SEG). We labelled regions of low-complexity using the program SEG (Wootton and Federhen, 1993, 1996) using the default parameters.

Identifying regions with no regular secondary structure (NORS). Using the filtered MaxHom alignments, we used PHDsec (Rost, 1996; Rost and Sander, 1993, 1994) to predict secondary structures. We considered stretches of more than 70 consecutive residues with less than 12% predicted helix or strand as ‘NORS’ (Liu et al., 2002).

Operational definition for removing fragments from the ‘to-do’ list. Many proteins of known structure contain regions of low-complexity (Romero et al., 1998; Saqi, 1995). However, proteins that contain almost no high-complexity regions constitute, at best, low-priority targets for structural genomics. We removed all proteins that had fewer than 50 residues in non-membrane, non-coiled, non-signal peptide, non-SEG, or non-NORS regions.

Clustering sequence space. In order to cluster sequence space for eukaryotes, we tested the following three approximations (Figure 1). (1) Maximal cluster size: merge all proteins that have some local similarity (BLAST E-value < 10^-3) to one another into one cluster; merge clusters as long as they have common members. (2) Minimal cluster size: given any two proteins A and B, group these into one cluster if the sequence similarity between the pair is above a threshold (BLAST E-value < 10^-3). While the maximal clustering is independent of the starting point, the final clusters resulting from the minimal clustering do differ. We followed the algorithm encoded in GeneRAGE (Enright and Ouzounis, 2000) by starting from single-domain proteins (Figure 1). Once we compiled the minimal-size clusters, we took the domains implied by the clustering and split those further.

RESULTS

We have some idea about structure for 6–38% of all proteins

We have explicit experimental information about structure for less than 0.3% of all entirely sequenced proteomes. The answer to the question for which fraction of entire proteomes we can predict structure by comparative modelling depends on the accuracy we require for the model. One extreme point is to model only proteins for which the respective experimental structure has more than 50% pairwise identical residues. At that level, models are typically very accurate (<3 Å Cα-rmsd) (Eyrich et al., 2001a,b; Marti-Renom et al., 2000, 2001). For all the 30 proteomes that we analysed (Appendix, Table W), we found that about 6% of the proteins can be modelled at
this level of accuracy (Figure 2, left panel, black bars). Next, we tested a level of average accuracy at which the models provide a good idea of the basic fold (around 5–6 Å Cα-rmsd) (Eyrich et al., 2001a,b; Marti-Renom et al., 2000, 2001). At that ‘cartoon-level’ of model accuracy, we found similar structural regions for about 20% of all proteins (Figure 2, left panel, grey bars). Finally, we dropped the requirement for model accuracy entirely, and tested a threshold at which the model most often captures basic features of the respective structure. At that level, we found structurally known regions in 38% of all proteins (Figure 2, central panel, grey bars). Note in particular the extreme increase in coverage when using PSI-BLAST searches against the BIG database. The reason for this non-linear behaviour was that pairs of fairly diverged sequences dominated most structural families (Appendix, Figure W).

30–40% of all proteins contain non-globular regions

We found at least one membrane helix for about 22% of all proteins (Figure 2, right panel, black bars). About half of all predicted membrane proteins had more than five helices (Liu and Rost, 2001). While the percentage of helical membrane proteins was similar between all three kingdoms (archaea, eukaryotes, and prokaryotes), we found significantly more proteins with coiled-coil regions in eukaryotes (eukaryotes > 10%; prokaryotes + archaea < 5%, total about 8%; Figure 2, right panel, striped bars). Most coiled-coil proteins consisted of a single 28 residue coil (Liu and Rost, 2001). We also found that the percentage of long NORS regions (Methods) differed significantly between eukaryotes and the other two kingdoms: eukaryotes had about 25% NORS proteins, prokaryotes and archaea only about 3%, bringing the total percentage to 16% (Liu et al., 2002). Initially, structural genomics initiatives will discard all those proteins. The total percentage of proteins with membrane helices, coiled-coils, or NORS regions totalled to about 30–40% (Figure 2, central panel).

About 48% of all proteins constitute targets for structural genomics

Even when avoiding membrane regions, experimentalists may still want to determine the structure for the globular region of a membrane protein. We assumed rather daringly that any region of more than 50 consecutive residues without: (i) membrane helices, (ii) coiled-coil helices, (iii) low-complexity stretches, (iv) similarity to a known structure, and (v) for which we predicted some regular secondary structure could be of interest to structural biology. After this reduction, we found about 48% of all proteins (slightly less for eukaryotes) to contain regions that could be of interest for structural genomics (Figure 2, centre). Remarkably, the respective number for the subset of all human proteins we used (31K) was only 35%.

The immediate to-do list corresponds to about 54% of all residues

When estimating the percentage of proteins that structural genomics targets, we need to define arbitrary thresholds for when we consider the unwanted or structurally known regions to span enough of a protein to discard this from the to-do list. When estimating the percentage of residues in the entire proteomes that may become targets for structural genomics, we needed no assumptions about thresholds for ‘minimal globular regions’. Rather, we could simply count all residues in transmembrane helices, coiled-coil helices, low-complexity stretches, signal peptides, NORS regions, and regions for which comparative modelling could provide an idea about structure. We found that on average structural genomics will have to contribute to adding in structural information for about 54% of all residues (Figure 3). On the per-residue level, the subset of human (47%) did not differ as significantly from the average as for the protein level. This might suggest that the difference between human and others on the protein level has some reason other than that our subset was overly biased.

Eukaryotes cluster into over 170 000 fragments

We did not have the CPU resources to cluster all proteomes. Instead, we only had results for Methanococcus jannaschii, Saccharomyces cerevisiae, and the results for all known eukaryotic proteomes (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, subset of Homo sapiens, and Saccharomyces cerevisiae). The 6357 Saccharomyces cerevisiae proteins fall into 3796 maximal-size and into 5448 minimal-size clusters (Table 1). The largest single maximal-size cluster contained 1351 proteins. The simple domain-splitting algorithm similar to GeneRAGE (Enright and Ouzounis, 2000) first separated and then grouped the minimal-size clusters into 3638–6867 clusters (Table 1). The data were similar for Methanococcus jannaschii (Table 1). When splitting ALL eukaryotes, the situation changed dramatically (Table 1): The 97K eukaryotic proteins fall into 22K maximal-size clusters with the largest single cluster containing almost half of all the proteins (46K). This result demonstrated that the maximal-size clustering was not reasonable. We were surprised by the separation of the 97K eukaryotic proteins into more than 170K fragments, i.e. by finding almost twice as many minimal-size fragment-clusters as proteins for the eukaryotes. The majority of these 170K fragments spanned over 80–150 residues (Figure 4). Overall, the length of the consensus region in each cluster corresponded to the length distribution of structural domains. The particular algorithm implemented in ProDom (Corpet et al., 2000) that uses

Fig. 2. Estimate for the percentage of protein targets. Left panel: Percentages of proteins in respective proteome for which we found similarities to proteins of known structure above (1) pairwise sequence identities of 50% (PIDE), and (2) above the refined HSSP-threshold, e.g. given by ‘more than 33% pairwise identity over 100 residues aligned’ (Rost, 1999). Right panel: Percentages of proteins predicted with membrane helices (HTM), coiled-coil regions (COILS), and signal peptides (SignalP) in all proteomes. Centre: The lowest threshold for which we can somehow reliably predict aspects of structure through comparative modelling is an E-value in PSI-BLAST of 10^-3. At this level, we found about 38% of all proteins to have similarity to known structures. To exclude all these proteins for target selection might be deemed highest priority. Next, we identified all the proteins without any globular region longer than 50 residues (UNWANTED). The sum over PSI-BLAST + UNWANTED marks the percentage of proteins that are certainly not interesting for target selection in the first round of structural genomics. For all proteomes this number added to about 52% leaving about 48% of all proteins as putative targets.
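
The bookkeeping behind the central panel of Figure 2 can be illustrated with a minimal Python sketch. It assumes that each protein is summarised by two hypothetical fields, the best PSI-BLAST E-value against a protein of known structure and the length of its longest globular stretch, and it reads the excluded set as the union of the PSI-BLAST and UNWANTED categories. The field names, the toy proteome and the union reading are assumptions made for illustration; this is not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional

E_VALUE_CUTOFF = 1e-3   # PSI-BLAST E-value threshold quoted in the caption
MIN_GLOBULAR = 50       # a globular region must be longer than this many residues


@dataclass
class Protein:
    # Hypothetical summary record; both fields are assumptions of this sketch.
    best_e_value: Optional[float]  # best E-value to a known structure, None if no hit
    longest_globular: int          # longest stretch free of membrane, coil, SEG, NORS


def has_known_structure(p: Protein) -> bool:
    """True if a hit to a known structure falls below the E-value cutoff."""
    return p.best_e_value is not None and p.best_e_value < E_VALUE_CUTOFF


def is_unwanted(p: Protein) -> bool:
    """True if the protein has no globular region longer than MIN_GLOBULAR."""
    return p.longest_globular <= MIN_GLOBULAR


def target_percentage(proteome: List[Protein]) -> float:
    """Percentage of proteins that are neither structurally known nor unwanted."""
    targets = [p for p in proteome
               if not has_known_structure(p) and not is_unwanted(p)]
    return 100.0 * len(targets) / len(proteome)


# Toy proteome with made-up values, for illustration only.
toy_proteome = [
    Protein(best_e_value=1e-10, longest_globular=200),  # structure known: excluded
    Protein(best_e_value=None, longest_globular=30),    # no globular region: excluded
    Protein(best_e_value=0.5, longest_globular=120),    # hit too weak: remains a target
    Protein(best_e_value=None, longest_globular=51),    # remains a target
]
print(target_percentage(toy_proteome))
```

Two of the four toy proteins survive both filters, so the sketch reports half of this toy proteome as putative targets.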

evolutionary relations to split proteins into domains yields fragments that are too short. The differences between the fragments generated by the GeneRAGE-type algorithm that we implemented and structural domains from PrISM (Yang and Honig, 1999, 2000a,b,c) indicated that the fragments we found were, on average, too long rather than too short.

Table 1. Clustering and domain splitting of selected proteomes

Set                        Nprot     NminP     NmergeP   NmaxP     Largest   NminD     NmaxD
Methanococcus jannaschii    1 735     1 432     1 070     1 211        72     1 459     1 229
Saccharomyces cerevisiae    6 357     5 448     3 337     3 796     1 351     6 867     3 638
Eukaryotes                 97 421   170 186              22 112    46 318
Eukaryotic targets                   18 127    15 003

Set: ‘Eukaryotes’: arabidopsis, worm, fly, yeast, and human; ‘Eukaryotic targets’ is the subset of clusters that may be targeted by structural genomics (at least one stretch of 50 residues without homologue of known structure, membrane regions, low-complexity residues, or NORS regions). Nprot: the number of predicted proteins from the respective original publication. NminP: the number of ‘minimal-size’ clusters. NmergeP: the number of ‘minimal-size’ clusters after merging the clusters again with pairwise BLAST E-value of 10^-3. NmaxP: the number of ‘maximal-size’ clusters. Largest: number of proteins in largest single cluster. NminD: the number of ‘minimal-size’ domain clusters. NmaxD: the number of ‘maximal-size’ domain clusters.

More than 16 000 targets for structural genomics were found in eukaryotes alone

The five eukaryotic proteomes corresponded to over 170K minimal-size clusters. Next, we extracted the consensus regions for all these 170K clusters, and removed all clusters that did not have at least one fragment of more than 50 consecutive residues without a homologue of known
structure (according to PSI-BLAST), transmembrane or coiled-coil helix, low-complexity or NORS region. This reduction yielded 107 410 eukaryotic fragments of potential interest to structural genomics (Table 1). An all-against-all for these 107 410 fragments resulted in 18 127 minimal-size clusters (Table 1), the largest of which contained 81 eukaryotic proteins. Finally, we mapped the 18 127 consensus regions to the Pfam-A database (Bateman et al., 2000; Sonnhammer et al., 1997). Most of our clusters did not correspond to any of the known 2267 Pfam-A families (82–85%, Table 2). The 3213 clusters for which HMMer found similarities of any protein in that cluster to Pfam matched in 1208 distinct Pfam families. Most of these Pfam families (57%) matched exclusively in one of our target clusters, 77% (935) matched in at most two clusters (Table 2). At most 210 of the 3213 clusters that matched in Pfam matched to more than one family. This number may provide an upper bound estimate for the error of our clustering if we assume that all Pfam families constitute structural domains. Thus, about 7% of our 18 127 clusters may be problematic. Consequently, we expect that we have about 17 000 targets for structural genomics in eukaryotes.

Fig. 3. Estimate for the percentage of residues in putative targets. Right panel: Percentages of residues in transmembrane helices (HTM), coiled-coil helices (COILS), signal peptides (SignalP), low-complexity regions (SEG) and regions without regular secondary structure (NORS). Note: these numbers do not necessarily add up, since coiled-coil regions are occasionally detected by SEG. Left panel: Percentages of residues for which PSI-BLAST found similarities to known structures below an E-value of 10^-3, and percentage of UNWANTED residues, i.e. those that have any of the regions listed on the right panel. These are unwanted in that they may seriously hamper a high-throughput structural genomics effort. Interestingly, the percentage of residues for putative targets was rather similar to the percentage of proteins (Figure 2).

DISCUSSION AND CONCLUSIONS

About 48% of all proteins in the 30 proteomes constitute possible targets

Let us assume that structural genomics will have to experimentally determine structures for all proteins for which we do not have information about structure through experiments or through comparative modelling based on experimentally known homologues. We have explicit experimental information about structure for only a marginal fraction of all the proteins in currently
928 Target space for structural genomics revisited

Table 2. Eukaryotic target clusters and PFAM

# Number of clusters Percentage of Percentage of Pfam clusters families

BLAST HMMer BLAST HMMer BLAST HMMer

0 15 443 14 914 85.2 82.3 1 2 565 3 003 14.2 16.6 56.6 57.0 2 107 191 0.6 1.1 23.0 20.4 3 11 19 0.1 0.1 8.7 9.9 4 1 0.0 3.7 3.6 5 0 0 4.7 4.8

#: Number of Pfam families that were matched by the same cluster (columns 2–5) or number of clusters matched by one Pfam family (columns 6–7). For each of the 107 410 potential eukaryotic target fragments that we grouped into 18 127 clusters (Table 1) we searched Pfam-A with two Fig. 4. Distribution of fragment lengths for eukaryotes. We found methods. (1) Align target fragments by pairwise BLAST (BLAST E-val of − 170K clusters for the 97K eukaryotic proteins (Table 1). We 10 3, columns labelled BLAST), and (2) align each Pfamfamilyby −2 suspected that this number was inappropriately high due to an over- HMMer to all target clusters (Pfam E-value of 10 , columns labelled > splitting of the clustering algorithmapplied (Enright and Ouzounis, HMMer). Most target clusters ( 82%) had no corresponding Pfamentry. Using BLAST, 2683 of our target clusters matched to one protein in 1141 2000). However, we could not verify this suspicion when comparing distinct Pfam families; 56.6% of these Pfam families matched exclusively the lengths of the 170 186 fragment clusters to that of structural in one of our clusters. Using HMMer to find similarities, 3213 of our target domains from PrISM. clusters matched to one protein in 1208 distinct Pfam families; 57% of these Pfam families matched exclusively in one of our clusters. sequenced proteomes (<0.3%). Hence, the number of targets for structural genomics is not given by ‘all- (35%) for the subset of the 23 K human sequences that we structurally known’, rather it is given by ‘all-models’, i.e. analysed. Comparative modelling predicts structure only by the number of proteins for which we can obtain struc- for the fragments that correspond to known structures. tural information through comparative modelling. 
The The average protein length in PDB is clearly lower size of structural families increases exponentially when than the average length of the proteins found in entirely lowering the threshold for detecting structural similarities sequenced proteomes. Thus, we might expect that the (Appendix, Figure W). Lower thresholds imply lower percentage of all residues to target by structural genomics accuracy in comparative modelling. Thus, the estimate for is significantly higher than the percentage of proteins. In the number of targets for structural genomics is extremely fact, this expectation has recently been verified (Vitkup sensitive to the accuracy we require in comparative et al., 2001). Surprisingly, we found that the 48% of all modelling to remove a protein from the potential target putative protein targets corresponded to about 52% of list. While we have highly accurate information for only the entire residue mass of all proteomes (Figure 3). This 6% of all proteins, we have low-accuracy information about structure for about 38%. In the first round of significant difference between our results (Figure 2 and structural genomics, we may want to optimise the yield of Figure 3) and the results published previously (Vitkup ‘new structures’. Hence, the low-accuracy number (38%) et al., 2001) might have two reasons. Firstly, we used appears to be a reasonable choice. PSI-BLAST searches against the BIG database rather than pairwise BLAST searches against PDB (note that About half of all proteins constitute targets for the due to the small size of PDB, PSI-BLAST and BLAST first round searches against PDB basically yield the same results). Initially structural genomics may want to try avoiding Secondly, we marked all residues for which we predicted experimental problems by targeting proteins that are as membrane or coiled-coil helices, and low-complexity globular as possible. We found that about 48% of all the or NORS regions. 
For all eukaryotic proteomes that proteins contained fragments of over 50 residues that were we analysed, these regions added to almost half of the not similar to known structures and did not contain prob- ‘residue mass’ excluded from the target list of structural lematic regions (membrane, coiled-coil, low-complexity, genomics (Figure 3). Thus, our results suggested that no regular secondary structure, or signal peptides, Fig- about half of all the proteins in entire proteomes constitute ure 2). Interestingly, this fraction was significantly lower potential targets for structural genomics.
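The selection rule described in this section (keep a protein as a putative target if it contains a fragment of over 50 residues that is neither similar to a known structure nor flagged as membrane, coiled-coil, low-complexity, NORS or signal peptide) can be illustrated with a minimal sketch. It is not the pipeline used for the analysis: the function names, the boolean per-residue mask and the toy proteins are assumptions made for illustration only, and the sketch simply reports the two quantities compared above, the fraction of proteins and the fraction of residues that remain targets.

```python
from typing import List, Tuple

# Hypothetical threshold taken from the text: fragments of *over* 50 residues.
MIN_TARGET_LENGTH = 50


def target_fragments(excluded: List[bool],
                     min_length: int = MIN_TARGET_LENGTH) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs of non-excluded residues longer than min_length.

    excluded[i] is True if residue i is similar to a known structure or lies in
    an unwanted region (membrane, coiled-coil, low-complexity, NORS, signal peptide).
    """
    fragments = []
    start = None
    for i, flag in enumerate(excluded):
        if not flag and start is None:
            start = i                      # a free run begins
        elif flag and start is not None:
            if i - start > min_length:     # the run just ended; keep it if long enough
                fragments.append((start, i))
            start = None
    if start is not None and len(excluded) - start > min_length:
        fragments.append((start, len(excluded)))
    return fragments


def target_fractions(masks: List[List[bool]]) -> Tuple[float, float]:
    """Fraction of proteins with at least one target fragment, and fraction of residues in them."""
    n_target_proteins = sum(1 for mask in masks if target_fragments(mask))
    n_target_residues = sum(end - start
                            for mask in masks
                            for start, end in target_fragments(mask))
    n_residues = sum(len(mask) for mask in masks)
    return n_target_proteins / len(masks), n_target_residues / n_residues


# Toy data (invented): protein A has 80 excluded residues followed by 120 free ones;
# protein B has two free stretches of 40 residues, both too short to be targets.
protein_a = [True] * 80 + [False] * 120
protein_b = [False] * 40 + [True] * 20 + [False] * 40
protein_fraction, residue_fraction = target_fractions([protein_a, protein_b])
```

In this toy case one of two proteins is a target, but only 120 of 300 residues are, which shows why the percentage of proteins and the percentage of residues need not agree.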


Clustering raised more questions than it answered

How to best cluster all known sequences depends on the reason for the clustering. In the context of structural genomics, the reason appears clear: find a representative set of targets. However, this seemingly straightforward concept hides a can of worms. The first problem is that of a hierarchy: The HSSP database that relates all known structures to known sequences (Sander and Schneider, 1991) implicitly treats the protein of experimentally known structure as the ‘master-representative’ of the structural family for that structure. If we use this concept, we find 4600 families in yeast, 1431 in Methanococcus jannaschii and about 30 000 families in all eukaryotes (data not shown). However, different structural genomics initiatives favour different organisms. Hence, we want to generate clusters without ‘master-representatives’. The obvious problem with this task is to find the basic unit for the clustering. If we assume that the ‘building blocks’ are structural domains, the problem becomes to dissect proteins of unknown structure into structural domains. Arguing that we cannot accomplish this, we identified the ‘maximal-size’ protein clusters; by construction there is no sequence similarity between any two of the single-linkage clusters. We found 1211 such clusters in Methanococcus jannaschii with the largest cluster containing 72 proteins (Table 1); for yeast the largest of the 3796 clusters contained over 1300 proteins. For all eukaryotes, the number of clusters appeared reasonable (22 112) but the largest cluster contained more than 46K proteins. These results suggested two conclusions. Firstly, sequence space appeared to be more continuous than we might have anticipated because almost half of all proteins are connected to one another by some local structural similarity. This may imply that domains were shuffled considerably during evolution (Apic et al., 2001a,b) and/or that structural domains are not the appropriate ‘building blocks’. Secondly, the ‘maximal-size’ clustering obviously failed entirely to generate a reasonable map of sequence space when we did not split proteins into domains. Thus, we have to find some way to dissect proteins into domains. A particular way applicable to all protein sequences was suggested by Enright and Ouzounis (2000). For Methanococcus jannaschii this clustering/domain-splitting algorithm yielded about 1400 clusters (Table 1). The authors of GeneRAGE (Enright and Ouzounis, 2000) published a similar number, suggesting that their implementation of the major concept did not differ substantially from ours. The number of minimal-size clusters for yeast also appeared reasonable (Table 1). Interestingly, the numbers we obtained with and without explicitly starting from the already split domains did not differ very much (for yeast 3337 vs. 3638, Table 1). When we applied the algorithm to the 97K eukaryotic proteins, we obtained over 170K fragment clusters. Obviously, the number appears rather high, suggesting that the algorithm might split proteins into regions that are too short. However, we found that the length distribution of the respective fragments was surprisingly similar to typical structural domains (Figure 4). Thus, the 170K fragments may indeed constitute the base for clustering eukaryotic sequences. We continued by excluding all the clusters that appeared of no interest to initial structural genomics approaches. Thus, we obtained 45051 fragment clusters containing 107K eukaryotic fragments. Next, we re-applied our minimal-size clustering by comparing all 107K against each other. This yielded 18 127 fragment clusters, the largest of which contained 81 proteins. Most of these clusters (82%) did not match to any Pfam family (Bateman et al., 2000) (Table 2); 99% of all the clusters that matched in Pfam matched one or two Pfam families. Matches to more than two Pfam families might constitute errors in defining our clusters; the problem, in particular, was that our domain-splitting approach missed many domains. Splitting clusters further is likely to increase the number of putative eukaryotic targets. A step missing from our analysis that works in the opposite direction is the attempted merging of some of the clusters through PSI-BLAST rather than pairwise relations.

Structural genomics for eukaryotes may have to target 3000–17 000 protein fragments

We could not reach a firm conclusion as to the number of putative targets for structural genomics. One extreme answer was: fewer than 3000! This number was based on the observation that the current PDB consists of about 2600 sequence-unique families which allow inferring low-resolution information about structure for about half of the proteins in all the proteomes we analysed. Assuming that a similar number of structures would fill in all unknowns, we need 2600 new structures to fill the white spaces. Another possible answer was: about 17 000 for eukaryotes alone! This number resulted when grouping the fragment clusters for eukaryotes that had more than 50 residues without known structure, membrane- or coiled-coil helices, and NORS- or low-complexity regions (Table 2). How many proteins will have to be added for prokaryotes and archaebacteria? To approach the answer to this question, we will first have to complete our clustering of all known proteomes. Clearly, our estimate puts the ball-park figure substantially higher than what was previously suggested (Vitkup et al., 2001). While Vitkup and colleagues proposed a similar number (17 600 for all species), their estimate was valid for a level of modelling accuracy that covers less than 10% of all residues in current proteomes. In contrast, our estimate of 17 000 for eukaryotes was valid for an accuracy level at which over 45% of all residues were already covered. Furthermore, we excluded many fragments that were not excluded by Vitkup et al. (NORS, coiled-coil, transmembrane helices, and signal peptides). Nevertheless, our results confirmed the work of Vitkup and colleagues in that structural genomics has a long way to go. If our estimates are correct, the first pilot phase of structural genomics will, at best, pave one fifth of the way by 2005.

ACKNOWLEDGEMENTS

Thanks to Dariusz Przybylski (Columbia) and to Volker Eyrich (Columbia) for providing programs. We are grateful to our hard-working wet-lab colleagues from the North-East Structural Genomics Initiative (NESG), in particular to Guy Montelione (Rutgers) for his continued support and optimism. The work of JL and BR was supported by the grants 1-P50-GM62413-01 and RO1-GM63029-01 from the National Institutes of Health. Last but not least, thanks to all those who deposit their experimental data in public databases, to those who maintain these databases, and to those heroes who will make structural genomics come true through their dedication and experiments.

SUPPLEMENTARY MATERIAL

For Supplementary Material, please refer to Bioinformatics Online.

REFERENCES

Abagyan,R.A. and Batalov,S. (1997) Do aligned sequences share the same fold? J. Mol. Biol., 273, 355–368.
Altschul,S., Madden,T., Shaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D. (1997) Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Apic,G., Gough,J. and Teichmann,S.A. (2001a) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol., 310, 311–325.
Apic,G., Gough,J. and Teichmann,S.A. (2001b) An insight into domain combinations. Bioinformatics, 17, S83–89.
Bairoch,A. and Apweiler,R. (2000a) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Bairoch,A. and Apweiler,R. (2000b) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Bairoch,A., Bucher,P. and Hofmann,K. (1997) The PROSITE database, its status in 1997. Nucleic Acids Res., 25, 217–221.
Barker,W.C. et al. (2000) The protein information resource (PIR). Nucleic Acids Res., 28, 41–44.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Blundell,T.L. and Mizuguchi,K. (2000) Structural genomics: an overview. Prog. Biophys. Mol. Biol., 73, 289–295.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073–6078.
Burley,S.K. (2001) New York Structural Genomics Research Consortium, http://www.nysgrc.org/, New York Structural Genomics Research Consortium (NYSGRC).
Burley,S.K., Almo,S.C., Bonanno,J.B., Capel,M., Chance,M.R., Gaasterland,T., Lin,D., Sali,A., Studier,F.W. and Swaminathan,S. (1999) Structural genomics: beyond the human genome project. Nature Genet., 23, 151–157.
Christendat,D. et al. (2000) Structural proteomics of an archaeon. Nat. Struct. Biol., 7, 903–909.
Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res., 28, 267–269.
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
Enright,A.J. and Ouzounis,C.A. (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451–457.
Eyrich,V., Martí-Renom,M.A., Przybylski,D., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001a) EVA: continuous automatic evaluation of protein structure prediction servers. WWW document, http://cubic.bioc.columbia.edu/eva, Columbia University.
Eyrich,V., Martí-Renom,M.A., Przybylski,D., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001b) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243.
Gaasterland,T. (1998a) Structural genomics: bioinformatics in the driver’s seat. Nat. Biotechnol., 16, 625–627.
Gaasterland,T. (1998b) Structural genomics taking shape. TIGS, 14, 135.
Heger,A. and Holm,L. (2000) Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol., 73, 321–337.
Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215–219.
Holm,L. and Sander,C. (1999) Protein folds and families: sequence and structure alignments. Nucleic Acids Res., 27, 244–247.
Joachimiak,A. (2001) Midwest Center for Structural Genomics. http://www.mcsg.anl.gov/, Midwest Center for Structural Genomics (MCSG).
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Kim,S.-H. (2001) Berkeley Structural Genomics Center. http://www.strgen.org/, Berkeley Structural Genomics Center.
Lima,C.D., Klein,M.G. and Hendrickson,W.A. (1997) Structure-based analysis of catalysis and substrate definition in the HIT . Science, 278, 286–290.
Linial,M. and Yona,G. (2000) Methodologies for target selection in structural genomics. Prog. Biophys. Mol. Biol., 73, 297–320.


Liu,J. and Rost,B. (2001) Comparing function and structure between entire proteomes. Protein Sci., 10, 1970–1979.
Liu,J., Tan,H. and Rost,B. (2002) Eukaryotes full of loopy proteins? J. Mol. Biol., submitted.
Lo Conte,L., Ailey,B., Hubbard,T.J., Brenner,S.E., Murzin,A.G. and Chothia,C. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res., 28, 257–259.
Lupas,A. (1996) Prediction and analysis of coiled-coil structures. Meth. Enzymol., 266, 513–525.
Lupas,A., Van Dyke,M. and Stock,J. (1991) Predicting coiled coils from protein sequences. Science, 252, 1162–1164.
Marcotte,E.M., Pellegrini,M., Thompson,M.J., Yeates,T.O. and Eisenberg,D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83–86.
Marti-Renom,M.A., Stuart,A., Fiser,A., Sanchez,R., Melo,F. and Sali,A. (2000) Comparative protein structure modeling of and genomes. Annual Review of Biophysics and Biomolecular Structure, 29, 291–325.
Marti-Renom,M.A., Madhusudhan,M.S., Fiser,A. and Sali,A. (2001) Accuracy of comparative modelling. http://pipe.rockefeller.edu/~eva//cm/res/accuracy.html, Rockefeller University.
Montelione,G.T. (2001) Northeast Structural Genomics Consortium. http://www.nesg.org/, Northeast Structural Genomics Consortium (NESG).
Moult,J. and Melamud,E. (2000) From fold to function. Curr. Opin. Struct. Biol., 10, 384–389.
Muller,A., MacCallum,R.M. and Sternberg,M.J. (1999) Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol., 293, 1257–1271.
Nielsen,H., Engelbrecht,J., von Heijne,G. and Brunak,S. (1996) Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins, 24, 165–177.
Nielsen,H., Engelbrecht,J., Brunak,S. and von Heijne,G. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
NIGMS (2001) Structural genomics initiatives. http://www.nigms.nih.gov/funding/psi/psi research centers.html, National Institute of General Medical Sciences (NIGMS).
Orengo,C.A., Michie,A.D., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol., 273, 349–354.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res., 27, 229–232.
Przybylski,D. and Rost,B. (2002) Alignments grow, secondary structure prediction improves. Proteins: Struct. Funct. Genet., 46, 195–205.
Romero,P., Obradovic,Z., Kissinger,C., Villafranca,J.E., Garner,E., Guilliot,S. and Dunker,A.K. (1998) Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput., 3, 437–448.
Rost,B. (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Meth. Enzymol., 266, 525–539.
Rost,B. (1997) Protein structures sustain evolutionary drift. Fold. & Des., 2, S19–S24.
Rost,B. (1998) Marrying structure and genomics. Structure, 6, 259–263.
Rost,B. (1999) Twilight zone of protein sequence alignments. Protein Eng., 12, 85–94.
Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584–599.
Rost,B. and Sander,C. (1994) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19, 55–72.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Prediction of helical transmembrane segments at 95% accuracy. Protein Sci., 4, 521–533.
Rost,B., Casadio,R. and Fariselli,P. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci., 5, 1704–1718.
Sander,C. and Schneider,R. (1991) Database of homology-derived structures and the structural meaning of . Proteins: Struct. Funct. Genet., 9, 56–68.
Saqi,M. (1995) An analysis of structural instances of low complexity sequence segments. Prot. Eng., 8, 1069–1073.
Schneider,R., de Daruvar,A. and Sander,C. (1997) The HSSP database of protein structure-sequence alignments. Nucleic Acids Res., 25, 226–230.
Shapiro,L. and Harris,T. (2000) Finding function through structural genomics. Curr. Opin. Biotech., 11, 31–35.
Sonnhammer,E.L., Eddy,S.R. and Durbin,R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Struct. Funct. Genet., 28, 405–420.
Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.
Tatusov,R.L., Galperin,M.Y., Natale,D.A. and Koonin,E.V. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28, 33–36.
Teichmann,S.A., Chothia,C. and Gerstein,M. (1999) Advances in structural genomics. Curr. Opin. Struct. Biol., 9, 390–399.
Terwilliger,T. (2001) Mycobacterium tuberculosis (TB) Structural Genomics Consortium. http://www.doe-mbi.ucla.edu/TB/, Mycobacterium tuberculosis (TB) Structural Genomics Consortium.
Thornton,J. (2001) Structural genomics takes off. Trends Biochem. Sci., 26, 88–89.
Umbach,P. (2001) Protein Structure Factory. http://www.rzpd.de/psf/, Protein Structure Factory.
Vitkup,D., Melamud,E., Moult,J. and Sander,C. (2001) Completeness in structural genomics. Nat. Struct. Biol., 8, 559–566.
Wang,B.-C. (2001) Southeast Collaboratory for Structural Genomics. http://secsg.org/secsg/default.html, Southeast Collaboratory for Structural Genomics (SECSG).
Wilson,I.A. (2001) Joint Center for Structural Genomics. http://www.jcsg.org/, Joint Center for Structural Genomics.


Wootton,J.C. and Federhen,S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem., 17, 149–163.
Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Meth. Enzymol., 266, 554–571.
Yang,A.S. and Honig,B. (1999) Sequence to structure alignment in comparative modeling using PrISM. Proteins: Struct. Funct. Genet., Suppl, 66–72.
Yang,A.S. and Honig,B. (2000a) An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J. Mol. Biol., 301, 665–678.
Yang,A.S. and Honig,B. (2000b) An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J. Mol. Biol., 301, 679–689.
Yang,A.S. and Honig,B. (2000c) An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J. Mol. Biol., 301, 691–711.
Yokoyama,S. and Kuramitsu,S. (2001) Structurome Research group, RIKEN. http://www.riken.go.jp/engn/r-world/research/lab/harima/group-s/index.html, Structurome Research group.
Yona,G., Linial,N. and Linial,M. (1999) ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins: Struct. Funct. Genet., 37, 360–378.
Yona,G., Linial,N. and Linial,M. (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res., 28, 49–55.
Yona,G., Linial,N., Tishby,N. and Linial,M. (1998) A map of the protein space – an automatic hierarchical classification of all protein sequences. Ismb, 6, 212–221.
YSG (2001) Yeast Structural genomics. http://genomics.eu.org/, Genoscope Evry Orsay-Gif-Saclay.
