New Kernels for Protein Structural Motif Discovery and Function Classiﬁcation

New Kernels for Protein Structural Motif Discovery and Function Classification Chang Wang [email protected] Dept. of Computer Science, University of Massachusetts, Amherst, MA 01003, USA Stephen D. Scott [email protected] Dept. of Computer Science, University of Nebraska, Lincoln, NE 68588-0115, USA Abstract structural comparisons. We present new, general-purpose kernels for There are also methods for predicting function from protein structure analysis, and describe how structure. Many of them compare the structure to apply them to structural motif discov- of a protein with unknown function to the struc- ery and function classification. Experiments ture of proteins with known function in structural show that our new methods are faster than databases, such as CATH (Orengo et al., 1997) and conventional techniques, are capable of find- SCOP (Murzin et al., 1995). Other methods, such as ing structural motifs, and are very effective in SITE (Zhang et al., 1999), FFFs (Fetrow & Skolnick, function classification. In addition to strong 1998) and superfamily active site templates (Meng cross-validation results, we found possible et al., 2004) use structural motif-related information new oxidoreductases and cytochrome P450 to search for function in an unknown structure. reductases and a possible new structural mo- A structural motif is a conserved sub-structural pat- tif in cytochrome P450 reductases. tern that is common to a set of proteins sharing similar structures or functions. Most biological actions of proteins depend on structural motifs. Discovery of 1. Introduction motifs is a complex process including feature extrac- A goal of structural genomics is to determine pro- tion, structure comparison, discovery and evaluation. teins’ three-dimensional structures from their gene se- The feature selection step extracts features to be used quences. The challenge, once the structure is deter- for pattern discovery from proteins. Structure com- mined, is to extract useful biological information about parison is the most difficult step. Many methods have the biochemical and biological role of the protein in been devised, including pairwise structure alignment the organism. With the rapid expansion in the num- using dynamic programming or superposition to mini- ber of known protein structures, prediction of func- mize RMSD. Other methods, such as geometric hash- tion based on structure has become one of the major ing (Holm & Sander, 1995) and 3D coordinate tem- aims of bioinformatics. It provides useful information plates (Wallace et al., 1996) have also been applied. to biochemical experiments and further improves the After structural comparison, patterns matching the in- performance of genome analysis. put structures are found and evaluated to see whether they are possible structural motifs. Lately, many new Primary sequence can often be used to infer function. methods have been proposed for this problem. For ex- However, some protein functions cannot be identified ample, SPratt2 (Jonassen et al., 2002) discovers motifs solely by primary sequence-based methods. In such in an unsupervised fashion. Trilogy (Bradley et al., cases, functional similarities are found from structure 2002) handles sequence and structure simultaneously comparisons. Many methods, including SSAP (Taylor and symmetrically in the search process. & Orengo, 1989), DALI (Holm & Sander, 1993), and CE (Shindyalov & Bourne, 1998), have been used for We introduce new kernels for three-dimensional structural analysis. Our results have applications in motif Appearing in Proceedings of the 22 nd International Confer- discovery and in function classification. As with some ence on Machine Learning, Bonn, Germany, 2005. Copy- other structural methods, we represent a 3D structure right 2005 by the author(s)/owner(s). as a set of its components in 3D space. We show that New Kernels for Protein Structural Motif Discovery and Function Classification these new methods are sensitive enough to identify acids, the distances from the outer amino acids to the some remote structural similarities that are missed by central amino acids, and distances between the two Cs regularly-used approaches. in the motif’s center. We compute similarity between two motif structures using these features. Our first result is a new method for structural motif discovery. In some cases of motif discovery, the func- Our final result is a kernel (K3Dball) designed specifi- tional motif of a protein can be described by defining cally for tertiary structure comparison. We define the the structure’s size, shape, etc. But more often, the similarity between two protein structures S and T as motif itself is also not completely known, and the re- the sum of structural similarities between any two 3D searcher has only a more or less rough idea of what to balls of S and T that have similar constituents. It is look for (Schmollinger et al., 2004). Thus it is difficult similar to DALI, CE, etc., in that we make compar- to specify what to look for in advance. Further, often isons between entire three-dimensional structures (i.e. the results of motif discovery are sensitive to the size of it is an entire structure-based method as opposed to the structure (in terms of number of residues) that is an active site-based method). specified. If the sought structure size is too small, then In our experiments, we test our methods on struc- one risks missing some of the regulatory patterns in a tural superfamilies from CATH and two function su- motif. Conversely, if the structure size is set too large, perfamilies: thiol/disulfide oxidoreductases and cy- the motif will likely include some irrelevant parts. tochrome P450 reductases. For the two function Our approach is different from other methods, in that families, many thiol/disulfide oxidoreductases have a we do not seek conserved fragments or commonly used thioredoxin (Trx) fold (Martin, 1995). If a 3D struc- geometrically-defined cells. We assume that a sim- ture is known, one can easily determine whether a ple function is mediated primarily by one amino acid. given protein possesses a fold. However, some pro- Thus we focus on identifying small conserved substruc- teins without the fold also have redox function, such tures, each centered on a single amino acid. We define as PDB-1d4u. Cytochrome P450 reductase is found in the size of the substructure as a fixed-radius ball in 3D the endoplasmic reticulum of most eukaryotic cells and space rather than as a fixed number of residues. We is an integral component of the monooxygenase system 1 use our new kernel KP attern Sim to measure similarity transferring electrons from NADPH to cytochrome between pairs of substructures. To avoid missing can- P450 via FMN and FAD co-factors. Cytochrome P450 didate motifs, we examine the substructure centered reductase may also donate electrons to heme oxyge- at each residue. The highly conserved substructures nase, cytochrome b5, and the fatty acid elongation are candidate motifs. system, and can reduce cytochrome c. For this family, no conserved motif is known. In our second result, we tune KP attern Sim for ap- plication to redox function prediction. Here we We show that our kernels are sensitive to the fold leverage known information about the superfamily of in tertiary structure, although they are not designed thiol/disulfide oxidoreductases. Most oxidoreductases for fold identification. They also capture similarities have a CxxC primary sequence motif2 at their active in thiol/disulfide oxidoreductases beyond the Trx-fold site. We use this to tune KP attern Sim to oxidoreduc- that are missed by DALI and CE. As a result, they tases, resulting in a new kernel KRedox F unc. Each can be used to find new thiol/disulfide oxidoreduc- substructure we consider consists of all residues that tases, since some such proteins that do not possess lie in a fixed-radius ball in 3D space. The residue at Trx-fold might be missed by traditional methods. We the center of the ball is called the central amino acid also successfully apply our kernels to P450 reductases, and the other residues in the ball are called the outer identifying several possible candidates in PDB. Since amino acids. For thiol/disulfide oxidoreductases, both K3Dball and KP attern Sim do not require any orienta- the Cs in each CxxC motif are seen as central residues. tion of the 3D structures or any other prior informa- The outer residues include the residues between two Cs tion about the protein families, our methods should be and other amino acids in a fixed-radius ball centered on applicable to many protein families. each C. K measures similarity between sub- Redox F unc Our motif discovery method offers two advantages. structures by comparing the types of the outer amino First, it doesn’t require any prior knowledge. Sec- 1 While a version of KP attern Sim is positive semidefi- ond, it is very sensitive to small motifs and can also nite, what we use may not be (Section 2). But for clarity, find large motifs by combining small motifs that are we use “kernel” to refer to all our similarity measures. 2 close to each other in 3D space. Our kernel-based Sometimes a serine replaces one cysteine, but for clar- protein function classification methods also have ad- ity we will refer to it always as the CxxC motif. vantages. First, they are simple and very fast: using New Kernels for Protein Structural Motif Discovery and Function Classification KP attern Sim, KRedox F unc and K3Dball are each about of the same type, then the similarity is zero, else the 100 times faster than DALI and CE, and can quickly similarity is computed using the following procedure: search PDB. Second, they are very sensitive while still first we compute the Gaussian RBF value of the dis- maintaining low false positive rates. tance from S[i] to S[1] and the distance from T [j] to T [1], then we divide the value by the product of dis- The rest of this paper is as follows.

New Kernels for Protein Structural Motif Discovery and Function Classiﬁcation

RNA Structural Motif Recognition Based on Least-Squares Distance

Cross-Talk Between Overlap Interactions in Biomolecules: a Case Study of the Β-Turn Motif

Comparison of GHT-Based Approaches to Structural Motif Retrieval

For Use of the Conserved Helix-Turn-Helix Motif in DNA Binding (Escherichia Coli/Operator Recognition/Hydroxylamine Mutagenesis/Tetracycline Resistance) PAUL J

Computational Design of Asymmetric Three-Dimensional RNA Structures and 2 Machines

Computational Prediction of RNA Structural Motifs Involved in Posttranscriptional Regulatory Processes

An Approach Used in DNA-Binding Helix-Turn-Helix Motif Prediction with Sequence Information Wenwei Xiong, Tonghua Li*, Kai Chen and Kailin Tang

Disclosing the Impact of Carcinogenic Sf3b Mutations on Pre-Mrna Recognition Via All-Atom Simulations

Automated Construction of Structural Motifs for Predicting Functional Sites on Protein Structures

The Structure and Function of DNA G-Quadruplexes

Structure of a Left-Handed DNA G-Quadruplex

Structural Motifs in Ribosomal Rnas: Implications for RNA Design And