Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures
Total Page:16
File Type:pdf, Size:1020Kb
Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures M.P. Liang, D.L. Brutlag, R.B. Altman Pacific Symposium on Biocomputing 8:204-215(2003) AUTOMATED CONSTRUCTION OF STRUCTURAL MOTIFS FOR PREDICTING FUNCTIONAL SITES ON PROTEIN STRUCTURES M.P. LIANG, D.L. BRUTLAG, R.B. ALTMAN Stanford University, 251 Campus Drive X-215, Stanford, CA 94305-5479, USA E-mail: {mliang, brutlag, altman}@smi.stanford.edu Structural genomics initiatives are beginning to rapidly generate vast numbers of protein structures. For many of the structures, functions are not yet determined and high-throughput methods for determining function are necessary. Although there has been extensive work in function prediction at the sequence level, predicting function at the structure level may provide better sensitivity and predictive value. We describe a method to predict functional sites by automatically creating three dimensional structural motifs from amino acid sequence motifs. These structural motifs perform comparably well with manually generated structural motifs and perform better than sequence motifs. Automatically generated structural motifs can be used for structural-genomic scale function prediction on protein structures. 1 Introduction 1.1 Structural Genomics With the sequencing of the human genome and many other organisms, rapid determination of gene and protein function is becoming increasingly important. Structural genomics initiatives are helping elucidate the function of these gene products by developing high-throughput methods for determining structures for all unique protein folds. These new structure targets are specifically selected not to have sequence similarity to existing proteins1-4. With the anticipated explosion of available structures, it is imperative to develop computational methods for high- throughput function prediction on protein structures. 1.2 Sequence Motifs There has been extensive work in identifying conserved residues in protein sequences with similar function. Amino acid sequence patterns that represent these conserved residue positions can be created from multiple alignments of sequences with similar function. These patterns, or sequence motifs, can be used to assign function to sequences that contain the pattern. Numerous sequence motif databases have been established with different methods for creating sequence motifs. Some of the databases are manually curated by experts, while others are automatically derived. The BLOCKS+ database provides an integration of many sequence motif databases by generating blocks of ungapped alignment of sequences from families in the databases5, 6. The eMotif database uses sequence alignments from the BLOCKS+ database to create sequence motifs (eMotifs) at various specificities for a family of protein sequences. By providing motifs at different specificities for a block of sequences, eMotifs can represent families with high specificity and yet provide sensitivity by combining multiple motifs7, 8. Although sequence motifs can provide insight into protein function, when new proteins do not have significant sequence similarity with known proteins, using only sequence information fails. Proteins that do not have high sequence similarity may still have similar function because of conservation of physicochemical properties at the structural level9, 10. 1.3 Structural Motifs Analogous to sequence motifs, structural motifs provide a description of conserved properties in the three dimensional structure of proteins sharing molecular function. Investigators have devised different techniques to construct and define structural motifs; each technique emphasizes different conserved properties. Wallace et al have developed a system, PROCAT, for identifying catalytic sites by geometric orientation of residues with known functional importance. By using previous knowledge of the critical residues involved in the catalytic activity, a structural motif representing the conserved relative positions of those residues is constructed. This motif can be used to scan a new protein structure for occurrence of the catalytic site using a geometric hashing algorithm11, 12. Fetrow and Skolnick have developed Fuzzy Functional Forms (FFF) for representing distances between interesting residues. The critical residues involved in a functional site are identified by careful examination of the literature. Examples of known structures containing these residues are used to find mean distance and variance between the residues. The structural motif representing the conserved distance and variance of the residues is used to identify functional sites on protein structures13, 14. Wei and Altman have developed the FEATURE system that describes the physicochemical environment around functional sites. The environment, characterized by observing the frequency of physicochemical properties in radial shells around the site of interest, represents a structural motif used for predicting the functional sites15, 16. Unlike PROCAT and FFF, FEATURE does not require the conserved properties to be known in advance, but can discover them automatically given a training set. Once a set of examples of the functional site and a background control set is provided, FEATURE can automatically build the structural motif without having to manually identify which residues are considered important. We introduce a new method, SeqFEATURE, that automatically creates structural motifs around functional sites by characterizing the structural environment around sequence motifs using FEATURE. Since the three dimensional structures of protein are more conserved than the sequence of amino acids10, focusing on the protein structure should provide better sensitivity in identifying protein function. 2 Methods 2.1 SeqFEATURE overview SeqFEATURE is a method for automatically building a structural motif from sequence motifs (Figure 1). The sequence motifs can be obtained from any of the sequence motif databases; here we use the eMotif database. Protein structures that contain the sequence motif are identified and a training set for the FEATURE system is automatically generated. FEATURE then uses the training set to construct a structural motif describing the physicochemical environment around the sequence motif. Site description Motif Databases FEATURE training FEATURE scan Training Set SeqFEATURE Protein Sites Gridding pdbid (x y z) Feature Extraction Structure Site Selection pdbid (x y z) Sequence … Motifs Non-Sites Feature Extraction Non-site Selection pdbid (x y z) Model pdbid (x y z) prop-vol: value Functional … prop-vol: value volume values Sites prop-vol: value prop-vol: value Hits (x y z) score (x y z) score properties Score Calculation … Score Calculation Figure 1: Data flow diagram for SeqFEATURE. Sequence motifs are selected by querying a sequence motif database. These motifs are fed into SeqFEATURE for automatic selection of sites and non-sites to create a training set. The training set is used by FEATURE for creating a model of the microenvironment around the functional site. The model is used to predict functional sites on new protein structures. 2.2 FEATURE system The FEATURE system is detailed and evaluated elsewhere16; here we describe it briefly. The FEATURE system is used to build structural motifs and predict functional sites. FEATURE builds a physicochemical model of the environment around functional sites. The model represents the statistical distribution of physicochemical properties at radial distances from the site of interest. The physicochemical properties range from atom names and atom properties to residue names and secondary structure (Table 1). By using properties at the atomic level and not just at the residue level, FEATURE provides a closer view of the chemistry necessary for function. Table 1: List of FEATURE’s physicochemical properties AtomName C N O S ANY OTHER ChemicalGroup Hydroxyl Amide Amine Carbonyl RingSystem Peptide AtomProperties VDWVolume Charge NegCharge PosCharge ChargeWithHis Hydrophobicity Mobility SolventAccessibility ResidueName ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL HOH OTHER ResidueProperties Hydrophobic Charged Polar NonPolar Basic Acidic SecondaryStructure 3Helix 4Helix 5Helix Bridge Strand Turn Bend Coil Het Unknown FEATURE requires a training set composed of positive examples of the functional site and negative background control examples. Each example is a specific 3D coordinate in a protein structure locating the site of interest. The environment around each site of interest is divided into radial shell volumes. FEATURE then creates a feature vector v for each site by counting the occurrence of physicochemical properties within each volume around the site. The structural motif of the functional site is constructed by comparing the distribution of feature vectors from the positive examples (sites) with those from the negative examples (non-sites). Properties at each volume are evaluated as statistically more present or absent in the functional site by using the Wilcoxon rank sum test. Finally, using a Bayesian inference method, FEATURE predicts functional sites at specific locations on protein structures. It places a 3D grid over the new protein structure and calculates the feature vector described above from the environment around each grid point. The grid point is given a score proportional to the likelihood that it is a functional site given its feature vector v. p(Site | v) Score = log (1) p(Site) By assuming that the