Integration of Genome Data and Protein Structures: Prediction Of

125 Integration of genome data and protein structures: prediction of protein folds, protein interactions and ‘molecular phenotypes’ of single nucleotide polymorphisms Shamil Sunyaev*†‡§, Warren Lathe III*†# and Peer Bork*†¶ With the massive amount of sequence and structural data the issue of integrating structural analysis with research on being produced, new avenues emerge for exploiting the human genetic variation. information therein for applications in several fields. Fold distributions can be mapped onto entire genomes to learn Predicting the number of protein folds in about the nature of the protein universe and many of the genomes using homology searches interactions between proteins can now be predicted solely on Soon after the first complete genome sequences became the basis of the genomic context of their genes. Furthermore, publicly available, it was realised that they represent by utilising the new incoming data on single nucleotide datasets for the analysis of protein folds that are statisti- polymorphisms by mapping them onto three-dimensional cally far more natural than compared to the PDB [1]. structures of proteins, problems concerning population, Furthermore, iterative database search methods such as medical and evolutionary genetics can be addressed. PSI-BLAST [2] had been developed and a number of reports revealed that the folds of more than 30% of the Addresses proteins in prokaryotic genomes can be reliably predicted *European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, using homology-based techniques [3–6]. These iterative 69117 Heidelberg, Germany methods are at least twice as sensitive as pairwise protein † Max-Delbrueck Centre for Molecular Medicine (MDC), comparisons [7]. This increased sensitivity can some- Robert-Roessle-Strasse 10, 13122 Berlin, Germany ‡Engelhardt Institute of Molecular Biology (IMB), Vavilova 32, times identify the common evolutionary origin of 117984 Moscow, Russia proteins that were previously thought to belong to differ- §e-mail: [email protected] ent superfamilies and whose structural similarities were # e-mail: [email protected] assumed to be a result of parallel evolution (analogous ¶e-mail: [email protected] rather than homologous structures). An example of a Current Opinion in Structural Biology 2001, 11:125–130 newly identified nontrivial homology relationship 0959-440X/01/$ — see front matter involves enzymes of the TIM-barrel fold from the central Published by Elsevier Science Ltd. metabolism that have been shown to share a common evolutionary origin despite being grouped into 12 distinct Abbreviations SCOP superfamilies [8,9]. cSNP coding SNP PDB Protein Data Bank SNP single nucleotide polymorphism Fold predictions for complete genomes allowed a new estimate of the total number of protein folds and of the Introduction number of protein folds in individual genomes [10]. The The past several years have been marked by the success- estimate qualitatively agrees with that of earlier studies ful completion of numerous genome projects, ranging [11,12]. The statistical analysis [13] shows that the nature from the short genomes of prokaryotes to our own. As a of the fold distribution is universal for all genomes and result of this extraordinary growth, different types of agrees with earlier theoretical considerations [14]. genomic data (Figure 1), including sequences of complete Although these estimates give a much clearer picture of genomes, complete sets of proteins (proteomes) and data the nonuniform fold distribution and the low number of on genetic variation, have become available for analysis. folds, it remains to be shown whether the physical con- The simultaneous growth of structural data also provides straints of the polypeptide chain or the specific features of the possibility of performing this analysis from a structural protein evolution led to the current fold repertoire. perspective. In this review, we focus on the impact of genomic data on studies of protein structures and interac- Predicting protein interactions using genomic tions and, perhaps more importantly, on the new context applications of the structural research and the challenges The availability of numerous complete genome raised by these applications. Currently, three trends sequences also enables comparative analysis to predict bridge the structural and genomics fields: analysis of pro- various functional features at the protein level. The tein folds in complete genomes; use of genome genomic context of genes reveals the physical, functional information for predicting protein–protein interactions; and genetic interactions of the respective gene and structural analyses of disease mutations and single products. Several strategies have been used to explore nucleotide polymorphisms (SNPs). We only briefly intro- genomic context. First, gene fusion in one species is duce the first two topics, because they have been indicative of a protein–protein interaction between the extensively covered in many reviews, and mostly focus on gene products of the two fused genes and can be 126 Folding and binding Figure 1 The growth of structural data (counted in PDB Growth of known 3D structures entries) in recent years is compared with the 15,000 growth of genomic data, represented by the number of completely sequenced genomes 10,000 and by the number of known human SNPs. Although all types of presented data 5,000 entries accumulate fast, the growth of genomic data, especially SNP data, outperforms structural Number of PDB 0 data accumulation. Data kindly provided by 1995 1996 1997 1998 1999 2000 A Brookes (Karolinska Institute), S Sherry Year (NCBI) and M Huynen (EMBL). Growth of the number of completely sequenced genomes 40 30 20 10 genomes Number of 0 1995 1996 1997 1998 1999 2000 Year Growth of SNP data 2.50E+06 2.00E+06 1.50E+06 1.00E+06 5.00E+05 submissions 0.00E+00 Number of SNP 1998(3) 1998(4) 1999(1) 1999(2) 1999(3) 1999(4) 2000(1) 2000(2) 2000(3) 2000(4) Quarters of a year Current Opinion in Structural Biology assumed for all the orthologues of the two genes in other Structural analysis of allelic variants species [15,16]. Second, conservation of gene neighbour- A very important part of genomic research, which was hood in some divergent species also strongly indicates underestimated at the beginning of the genomics era, is the occurrence of an interaction between the two encoded the analysis of genetic variation in populations. The stud- gene products, even if they are not neighbours in many ies of genetic variation are now appreciated in the context other genomes [17,18]. In fact, entire subcellular systems of the human genome project and several consortia can be identified if neighbourhood information for dif- around the world have already identified more than two ferent genes is systematically merged [19]. Third, the million DNA variants in the human population (Figure 1). co-occurrence (this has also been coined phylogenetic Therefore, it is time to integrate these data with informa- profile or COG pattern) of two genes (and their ortho- tion from other resources, such as three-dimensional logues) in the same subset of species indicates that the structures of proteins. The application of structural data two genes interact [20–22]. Fourth, the presence of to research on genetic variation can propel studies on the shared regulatory elements hints at the co-regulation of identification of the genetic roots of phenotypic variation the respective downstream genes and, hence, at the and bring new insights to the fields of population and genetic or functional interaction of the respective gene evolutionary genetics. In turn, structural research may products [23]. also benefit from using mutation data. Catalogues of nat- urally occurring mutations with known phenotype All these strategies and the respective methods can association might, in some cases, substitute for experi- be combined to increase their sensitivity [24]. Although ments on site-directed mutagenesis in studies of protein context-based methods are not yet as powerful as classical folding and binding. homology-based function prediction methods [25] and dif- ferences between prokaryotic and eukaryotic evolution In the following, we give a brief introduction to the have to be considered, the power of genomic context nature of mutation and polymorphism data, and then analysis will increase with each new genome published. review the first approaches to combine them with structural Integration of genome data and protein structures Sunyaev, Lathe and Bork 127 information in both specific case studies and general human disease phenotypes are now believed to be of a analyses of amino-acid-replacing SNPs mapped to 3D complex nature, involving common DNA variants structures of proteins. together with environmental factors. The identification of variants that increase susceptibility to human diseases is Mutations one of the key problems in medical genetics. Second, For a long time, data on genetic variation were available analysis of genetic variation in a population can help in the only for particular loci mainly corresponding to genes pursuit of solutions to many problems in evolutionary associated with simple monogenic diseases. Numerous genetics. Third, knowledge of genetic variation in the locus-specific databases contain mutation data, often from modern human population is a clue to our understanding patients with genetic disorders. These databases report not of human origins and features of the human population

Integration of Genome Data and Protein Structures: Prediction Of

Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus Angustifolius L.)

Genetic and Environmental Exposures Constrain Epigenetic Drift Over the Human Life Course

Small Variants Frequently Asked Questions (FAQ) Updated September 2011

A Pilot Analysis of DNA Methylation in Candidate Genes in Brain

Dbsnp Resources in the UCSC Database Today We Will Discuss

A Roadmap for Metagenomic Enzyme Discovery

Structural Genomics: an Approach to the Protein Folding Problem

1 a Primer on Molecular Biology

Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants

Dbsnp Columns

Interactive Tutorial 1: SNP Database Resources

The Role of Protein Structure in Genomics