Structural Characterization of the Human Proteome
Total Page:16
File Type:pdf, Size:1020Kb
Article Structural Characterization of the Human Proteome Arne Mu¨ller,1 Robert M. MacCallum,1,4 and Michael J.E. Sternberg1,2,3 1Biomolecular Modelling Laboratory, Cancer Research UK, London, United Kingdom; 2Department of Biological Sciences, Structural Bioinformatics Group, Imperial College of Science, Technology and Medicine, South Kensington, London, United Kingdom This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific for multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single cellular organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at http://www.sbg. bio.ic.ac.uk. [Supplemental material is available online at http://www.genome.org.] The interpretation and exploitation of the wealth of biologi- maintaining a high coverage. In the absence of a match to cal knowledge that can be derived from the human genome these characterized motifs/domains, suggestion for a func- (Lander et al. 2001; Venter et al. 2001)requires an analysis of tional annotation comes from a homology to a previously the three-dimensional structures and the functions of the en- functionally annotated sequence. However, transfer of func- coded proteins (the proteome). Comparison of this analysis tion via an identified homology is problematic and the extent with those of other eukaryotic and prokaryotic proteomes will of the difficulty has been recently quantified (e.g., Devos and identify which structural and functional features are common Valencia 2000; Wilson et al. 2000; Todd et al. 2001). Below and which confer species specificity. In this paper, we present 30% pair-wise sequence identity, two proteins often may have an integrated analysis of the proteomes of human and 13 quite different functions even if their structures are similar. other species considering the folds of globular domains, the Because of this problem, global bioinformatics analyses of ge- presence of transmembrane proteins, and the extent to which nomes generally do not use functional transfer from distant the proteomes can be functionally annotated. This integrated homologies for annotation. However, specific analyses by hu- approach enables us to consider the relationship between man experts still extensively employ this strategy, particularly these different aspects of annotation and thereby enhance as any suggestion of function can be refined from additional previous analyses of the human and other proteomes (e.g., information or from further experiments. Koonin et al. 2000; Frishman et al. 2001; Iliopoulos et al. A powerful source of additional information is available 2001), including the seminal papers reporting the human ge- when the three-dimensional coordinates of the protein are nome sequence (Lander et al. 2001; Venter et al. 2001). known. The structure often provides information about the A widely used first step in a bioinformatics-based func- residues forming ligand-binding regions that can assist in tional annotation is to identify known sequence motifs and evaluating the function and specificity of a protein. For ex- domains from manually curated databases such as PFAM/ ample, recently we have shown that spatial clustering of in- INTERPRO (Bateman et al. 2000)and PANTHER (Venter et al. variant residues can assist in assessing the validity of function 2001). This strategy was used in the original analyses of the transfer in this twilight zone (Aloy et al. 2001). At higher human proteome (Lander et al. 2001; Venter et al. 2001). levels of identity, knowledge of structure can assist in analyz- These annotations tend to be reliable, as these libraries have ing ligand specificity and the effect of point mutations. been carefully constructed to avoid false positives whilst A valuable tool in exploiting three-dimensional informa- tion is the databases of protein structure in which domains with similar three-dimensional architecture are grouped to- 3Present address: Stockholm Bioinformatics Center, Department gether. Here, we use the structural classification of proteins of Biochemistry and Biophysics, Stockholm University, S-106 91 (SCOP)(Conte et al. 2000).In SCOP, protein domains of Stockholm, Sweden. known structure that are likely to be homologs are grouped by 4 Corresponding author. an expert into a common superfamily based on their struc- E MAIL [email protected]; FAX +44-(0)20-7594-5264. Article and publication are at http://www.genome.org/cgi/doi/10.1101/ tural similarity together with functional and evolutionary gr.221202. considerations. SCOP is widely regarded as an accurate assess- 12:1625–1641 ©2002 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/02 $5.00; www.genome.org Genome Research 1625 www.genome.org Müller et al. ment of which domains are homologs. However, SCOP re- mains subjective and one cannot exclude the possibility that two domains placed within the same superfamily only share a common fold as a result of convergent evolution and there- fore are not homologous. The above considerations have led us to focus our analy- sis on the following three objectives: (i)to estimate the extent to which the known proteomes can be annotated in terms of structure and function and how reliable we consider these annotations to be; (ii)to place the occurrence of particular SCOP structural superfamilies in terms of their biological and species-specific contexts; and (iii)to derive evolutionary in- sights from frequency-based analysis of homologous SCOP domains. Strategy for Structural and Functional Annotations Protein sequences from the human genome and from 13 other species were analyzed (for details, see Methods). The main strategy was to use the sensitive protein sequence simi- larity search program PSI-BLAST (Altschul et al. 1997)to scan each protein sequence against a database composed of a nonredundant set of sequences augmented by the sequences of each SCOP domain and, to ensure up-to-date coverage, each protein entry of the PDB (Berman et al. 2000). Functional annotation was considered as sequence matches to the PFAM domain library (excluding families of unknown function), which is now part of the more general INTERPRO database. In the absence of a match to these char- acterized motifs/domains, we need to evaluate functional an- notation via transfer from homology. To represent this ap- proach computationally, we simply considered a functional annotation if a homolog contains some textual description of function (see legend to Fig. 1A). Thus, the total of the pro- teome that can be functionally annotated is the sections that are assigned to a PFAM domain or, if no assignment to PFAM, the sections that are homologous to a protein with a text functional description. Status of Structural and Functional Annotations Figure 1A shows the structural annotation status of the pro- teomes expressed as the fraction of the total residues in each proteome. We use the residue fraction to include situations when only part of a protein sequence is annotated, as one cannot quantify this as a fraction of domains because one does not know the number of domains in unannotated re- gions. Thirty-nine percent of the human proteome can be structurally annotated from either having a known protein Figure 1 Structural and functional annotation of the proteomes. (A) structure or via a PSI-BLAST detectable homology to a known Coverage for each species is reported as the fraction of the residues in the proteome that are annotated. This allows for partial coverage of structure. This percentage is higher than that for yeast, fly, any sequence. Structural annotation is a homology to a known struc- and worm and is comparable to the coverage of many bacteria ture. Functional annotation is when there is no structural annotation and archaea. A further 38% of the human genome falls into but there is a homology to an entry from SwissProt or PIR that has a the category of functional annotation without structure. Be- description other than those that contain any of the following words: cause nearly every protein structure has some functional an- “hypothetical”, “probable”, “putative”, “predicted”. Any homology notation, the total functional annotation of the human pro- denotes a sequence similarity to a structurally or functionally unan- notated protein, such as one described as hypothetical. Nonglobular teome is 77%. The remainder are (i)either homologous to denotes remaining sequence regions that were predicted as trans- another protein of unknown function, or (ii)orphan regions membrane, signal peptide, coiled-coils, or low-complexity. Remain- without any detectable homology, or (iii)an unannotated, ing residues are classified as orphans. (B) Structural and functional nonglobular region (a region of low complexity, coiled-coil, annotations that cover the entire protein sequence. For structural or a transmembrane segment). annotation, we required that >95% of the sequence was structurally We also consider how many protein sequences can be annotated and there was no unannotated segment of >30 residues.