Did Evolution Leap to Create the Protein Universe? Burkhard Rost
Total Page:16
File Type:pdf, Size:1020Kb
409 Did evolution leap to create the protein universe? Burkhard Rost The genomes of over 60 organisms from all three kingdoms of Although most of these analyses utilize a wealth of biological life are now entirely sequenced. In many respects, the inventory data, they are not explicitly based on the fact that we have of proteins used in different kingdoms appears surprisingly entire genome sequences from representatives of all three similar. However, eukaryotes differ from other kingdoms in that kingdoms of life: eukarya, bacteria (prokaryotes) and they use many long proteins, and have more proteins with archaea. What do we learn from generating lists of parts of coiled-coil helices and with regions abundant in regular the whole [10,11•]? secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only Here, I focus on findings from methods that endeavor to once in one particular pathway. Many proteins do not have capture overall features for entire organisms. I challenge close homologues in different species (orphans) and there the assumption that bioinformatics is slowly but steadily could even be folds that are specific to one species. This view approaching the point at which we can smoothly move implies that protein fold space is discrete. An alternative model through neighborhoods of protein relationships in order to suggests that structure space is continuous and that modern generate an atlas of the fates and functions of proteins in proteins evolved by aggregating fragments of ancient proteins. the context of the cell. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop Overview of proteomes: catalogs of structure better methods for comparative proteomics. and function Eukaryotes have many very long proteins Addresses Genomes differ significantly in their nucleotide composition CUBIC, Department of Biochemistry and Molecular Biophysics, [12•]. In contrast, the amino acid compositions of the Columbia University, 650 West 168th Street, BB217, New York, entire proteomes of 28 organisms from all three kingdoms NY 10032, USA and Columbia University Center for Computational are similar ([13•]; Figure 1a). The length of proteins differs Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St Nicholas Avenue, New York, NY 10032, USA; significantly between the three kingdoms (Figure 1b); in e-mail: [email protected] particular, about 7% of eukaryotic proteins are longer than 1000 residues, whereas less than 2% of all proteins in Current Opinion in Structural Biology 2002, 12:409–416 archaea and prokaryotes are that long. 0959-440X/02/$ — see front matter © 2002 Elsevier Science Ltd. All rights reserved. We now realize that many proteins were overlooked in the Abbreviations initial annotation of genome projects, as demonstrated by ADS antecedent domain segment the annual growth of the estimated number of proteins in NORS no regular secondary structure worm [14], yeast [15] and human. Some groups specialize ORF open reading frame in hunting for these overlooked proteins [16•]. In microbial rmsd root mean square deviation genomes, however, the number of proteins appears signif- Introduction icantly over-estimated [12•]. If so, the differences in the protein length distributions for short proteins might not Natura non facit saltus (Nature does not make leaps) hold up (Figure 1b). Nevertheless, the corrections in the number of proteins, as suggested by Skovgaard, Krogh and Attributed to Tito Lucrezio Caro (Titus Lucretius colleagues [12•], alter the distributions for long proteins Carus), 34 AD ?; only marginally (data not shown). Some of the incorrectly Gottfried Wilhelm von Leibniz, 1698; annotated short proteins might actually be short RNAs [17] Charles Darwin, 1859. and others might be pseudo-genes [18]. Some perceive New York City as a place jammed with Despite the complete set of sequences, comparisons humans. However, for those who live there, the metropolis remain guesses feels like a village because life evolves around neighbor- Comparing proteomes on the basis of residue composition hoods. Scientists explaining our behavior from a ‘systems’ or protein length constitutes an extremely dumb realization perspective can argue that we belong to various groups of comparative genomics. Unfortunately, more meaningful defined by our address, our work or even the shape of our comparisons across kingdoms typically require focusing on noses. We may possibly feel that such classifications do not subsets of the complete proteomes for which we have explain who we really are. In analogy, we cannot expect annotations or involve predictions of limited accuracy, or catalogues of entirely sequenced genomes to convey both. For instance, statements about protein families, folds understanding of protein structure, function or evolution. and functions either are restricted to some arbitrary subset The literature bursts with examples of detailed studies of of proteins for which we can infer features by homology or how protein structure and function co-evolve [1–7,8•,9]. are limited by the accuracy of prediction methods. 410 Sequences and topology Figure 1 Comparing proteomes. (a) Amino acid usage (a) 15% in each of the three kingdoms. The height of 10% the letter is proportional to the frequency of Archaea the respective amino acid (overall 5% percentages given on the right). The lower rows show the number of codons for each of the amino acids and the age rank (1 is oldest, 10% 20 newest), as estimated by Trifonov et al. Prokaryotes 5% [88]. The variation of the frequency within the kingdoms is as insignificant as that between the kingdoms [13•]. (b) Distribution of protein 10% length. The data are binned in intervals of ten Eukaryotes residues. The inset gives the cumulative 5% length; it illustrates the significantly higher Number of codons 4 2 2 2 2 4 2 3 2 6 1 2 4 2 6 6 4 4 1 2 proportion of long proteins in eukaryotes. Hypothetical age 1 15 4 6 16 1 17 12 13 8 18 10 7 14 11 5 9 3 20 19 Both (a) and (b) are based on an analysis of 238 326 proteins in 63 complete proteomes (b) (12 archaea, 46 prokaryotes and 5 eukaryotes 100 [89]). (c) The left panel shows the percentage 4 95 of proteins for which we can predict structure through comparative modeling. Three types of 90 models are distinguished by applying different cut-offs for the required level of sequence 3 85 similarity [90]. At levels above 70% pairwise Archaea 80 sequence identity, models reach an average Prokaryotes accuracy around 2 Å rmsd (‘good models’, 2 75 Eukaryotes dark bars). At a level of sequence similarity Cumulative percentage Cumulative that corresponds to 32% pairwise identical 70 500 1000 1500 2000 residues over 100 residues aligned (‘HSSP’ distance of 0 [13•]), models differ about Percentage of proteins Percentage 1 3–5 Å (‘OK models’, gray bars). At a PSI-BLAST threshold of 10–3, most models identify the fold correctly (‘model gives fold’, white bars). The right panel shows the 0 percentage of proteins predicted with NORS 0 200 400 600 800 1000 regions [51] (white bars), signal peptides Number of residues in protein (gray bars, note: all these proteins are extracellular), transmembrane helices (striped bars) and coiled-coil helices (dark bars). The (c) NORS region Good model of 3D graphs in (c) are based on 30 proteomes, as Signal peptide given in [13•]. Note that the lines give the OK model of 3D Membrane helix spread between the minima and maxima Model gives fold Coiled coil found in the respective kingdom. Archaea Prokaryotes Eukaryotes 0102030400 5 1015202530 Percentage of proteins Current Opinion in Structural Biology Consider the idea to investigate whether or not particular proteomes [4]. It is much easier to infer aspects of structure folds are used more often in some organisms than in from sequence similarity than to infer aspects of function. others. Firstly, we know neither which fraction of all exist- Thus, the classification of proteomes by function relies ing folds we know already (estimates range from 10 to 50% even more on guesses than classification by structure. [3,13•,19•,20•,21]) nor whether the types of folds we know are representative. Secondly, we can, at best, infer the folds Popular folds also most populated in proteomes for half of all proteins in entirely sequenced genomes Gerstein pioneered the structural census of proteomes (Figure 1c) [9,10,19•,22–24,25•,26]. Thirdly, we have no [10,27] by analyzing which folds are most often used in way to predict novel folds in the context of entire proteins of different organisms. When analyzing the Did evolution leap to create the protein universe? Rost 411 universe of protein sequences, we observe that some types Proteins with coiled-coil regions are often insoluble of proteins belong to larger families than others [13•,28–33]. because such regions are frequently responsible for aggre- As expected, folds used in these large protein families also gation; they often indicate structural proteins. However, dominate proteomes [2,3,10,11•,34–37]. The major difference the high fraction of coiled-coil proteins in eukaryotes between the types of proteins that populate the largest might originate from the role of coiled-coil regions in sequence-based family and those that populate the largest protein–DNA interactions and in regulation, transcription fold-based family was the under-representation of membrane and translation. The problem with this explanation is that, proteins with