Proteome-Scale Prediction of Structure and Function

Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press Resource The Proteome Folding Project: Proteome-scale prediction of structure and function Kevin Drew,1 Patrick Winters,1 Glenn L. Butterfoss,1 Viktors Berstis,2 Keith Uplinger,2 Jonathan Armstrong,2 Michael Riffle,3 Erik Schweighofer,4 Bill Bovermann,2 David R. Goodlett,5 Trisha N. Davis,3 Dennis Shasha,6 Lars Malmstro¨m,7 and Richard Bonneau1,4,6,8 1Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York 10003, USA; 2IBM, Austin, Texas 78758, USA; 3Department of Biochemistry, Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA; 4Institute for Systems Biology, Seattle, Washington 98103, USA; 5Medicinal Chemistry Department, University of Washington, Seattle, Washington 98195, USA; 6Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, New York 10003, USA; 7Institute of Molecular Systems Biology, ETH Zurich, Zurich CH 8093, Switzerland The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an in- tegration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid). We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions. [Supplemental material is available for this article.] Annotation of protein structure and function is a fundamental Current computational protein annotation methods can be challenge in biology. Accurate structure annotations give re- broadly grouped into four categories: (1) Primary sequence-feature searchers a three-dimensional (3D) view of protein function at the annotationmethodspredictgeneralfeaturesofaproteinsuchas molecular level that enables specific point-mutation analysis or disorder content (DISOPRED) ( Jones and Ward 2003), secondary the design of custom inhibitors to disrupt function. Accurate func- structure (PSIPRED) ( Jones 1999), transmembrane helices (TMHMM) tion annotations give researchers specific testable hypotheses about (Krogh et al. 2001), coiled-coils (COILS) (Lupas et al. 1991), or signal the role a protein plays in the cell and also allow biologists to better peptides (SignalP) (Bendtsen et al. 2004) and can be efficiently interpret the role of uncharacterized genes from high-throughput applied to full proteomes but have limited ability to describe the experiments (e.g., mass spectrometry co-IP, yeast two-hybrid specific function of individual proteins in a cellular context. (2) screens, microarray, RNAi screens, or forward-genetic screens). Sequence comparison methods (e.g., BLAST) (Altschul et al. 1997) Unfortunately, experimental annotation efforts fail to cover large including methods and databases that organize proteins into portions of all proteomes and are often focused on model organ- families (CATH [Orengo et al. 1997], SCOP [Murzin et al. 1995], isms. Current computational function prediction methods can ex- Pfam [Finn et al. 2008]) and fold recognition methods (FFAS) tend coverage to any sequenced genome (i.e., non-model organ- ( Jaroszewski et al. 2005), can generate putative function annota- isms) but can only annotate proteins that have high sequence tions (for a complete review, see Lee et al. 2007) but are limited in similarity to well-characterized proteins. Motivated by the obser- their application to the set of proteins with sequence detectable vation that structure is more conserved than sequence (Chothia and homologs or significant sequence matches. (3) Several groups have Lesk 1986), our method extends function annotation coverage to used machine learning techniques to integrate high-throughput many unannotated protein domains by comparing computation- experimental data, such as gene expression and protein–protein ally predicted structures to well-characterized proteins with known interactions, to predict protein function (Marcotte et al. 1999; structures. Bader and Hogue 2002; Hazbun et al. 2003; Troyanskaya et al. 2003; Lee et al. 2004; Pena-Castillo et al. 2008), but these data sets are not generally available for all organisms and therefore limit these methods’ coverage. (4) Several studies have shown that 8Corresponding author. protein structures derived from either large-scale protein experi- E-mail [email protected]. mental determination (Matthews 2007; Dessailly et al. 2009) or Article published online before print. Article, supplemental material, and pub- lication date are at http://www.genome.org/cgi/doi/10.1101/gr.121475.111. large-scale protein prediction (homology modeling, fold recogni- Freely available online through the Genome Research Open Access option. tion, and de novo) significantly increased the annotation coverage 21:1981–1994 Ó 2011 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/11; www.genome.org Genome Research 1981 www.genome.org Downloaded from genome.cshlp.org on September 24, 2021 - Published by Cold Spring Harbor Laboratory Press Drew et al. of proteomes and often provided site-specific information such as Next, we use the fold recognition algorithm FFAS03 probable functional sites and surfaces (Bonneau et al. 2004; Ginalski ( Jaroszewski et al. 2005) to match more evolutionarily distant se- et al. 2004; Malmstro¨m et al. 2007; Pieper et al. 2009; Zhang et al. quences in the PDB, producing an additional 58,000 domains with 2009). Homology modeling is increasingly productive as more significant matches to domains in the PDB. We then use additional structures are solved, however, many proteins and protein domains methods, including identifying Pfam domains (Finn et al. 2008), an lack detectable homologous proteins in the Protein Data Bank (PDB) algorithm for predicting domains from multiple sequence align- (Marsden et al. 2007). De novo structure prediction methods do not ments (MSA), and a heuristic-based algorithm for delineating do- require sequence homology with known structures and can, in main boundaries to identify additional putative domains (Chivian principle, provide annotation coverage to proteins unreachable by et al. 2003, 2005). These methods combined to identify an additional homology modeling. Unfortunately, de novo prediction methods 325,000 domains. In total, our domain prediction produced nearly require vast computational resources, and because of this, published 700,000 domains for the 389,000 query proteins, which serve as the pipelines that include de novo structure prediction are incapable of basis for our domain centric annotation of these proteins. We then keeping pace with incoming genomic data. used the Rosetta de novo protocol (Rohl et al. 2004) to predict the 3D This study focuses on this fourth type of annotation method structure of those domains lacking structure annotation (i.e., Pfam, via structure prediction. Here we describe the proteome folding MSA, and heuristic domains) and that are less than 150 residues in pipeline (PFP), a protein-domain-level fold and function annota- length. We predicted de novo folds for 57,000 domains on IBM’s tion method that combines de novo (Rosetta) (Rohl et al. 2004), fold World Community Grid (requiring more than 100,000 yr of CPU recognition, and sequence-based structure assignment methods time, resulting in one of the largest repository of protein structure into a single pipeline that significantly extends functional and predictions publicly available). The final step in the pipeline classifies structural proteome annotation coverage for 94 complete genomes predicted protein domains into structural superfamilies (SCOP). (as well as several new protein families from recent metagenomics PDB-BLAST and FFAS03 domains are assigned the superfamily of the studies). The computational cost of performing de novo structure PDB structure to which they matched, while a logistic regression prediction was distributed on a grid of more than 1.5 million CPUs model is used to classify de novo structures into SCOP superfamilies worldwide (World Community Grid; http://www.wcgrid.org). De based on the Rosetta predictions and estimated error associated with novo structures and sequence-based comparisons were used to these structure predictions (Malmstro¨m et al. 2007). In all, our assign predicted protein domains into the Structural Classification pipeline classified more than 250,000 domains into SCOP super- of Proteins (SCOP), a hierarchical classification of protein 3D families of which nearly 43,000 are considered both confident (based structure (Murzin et al. 1995). We demonstrate our ability to in- on our benchmarks) and novel (FFAS03 and de novo). tegrate these predicted SCOP classifications with information from the Gene Ontology (GO) database (Ashburner

Load more