With the Advent of Large-Scale Genome Sequencing Programmes and the Completion of the Human

A Networked Multi-Class SVM for Protein Secondary Structure Prediction

YUTAO QI, FENG LIN & KIN KEONG WONG
Bioinformatics Research Center, School of Computer Engineering
Nanyang Technological University
SINGAPORE

Abstract: A new classification method, networked multi-class Support Vector Machines (SVM), is proposed for the prediction of protein secondary structure. Classification is based on the analysis of physicochemical properties of a protein generated from its sequence. A multi-class SVM is more powerful than the standard bi-class one, since its empirical recognition rate is higher and it uses far fewer support vectors. For this reason we investigate the implementation of multi-class Support Vector Machines for this task. A prediction tool based on a multi-class SVM network is developed. With a trained multi-class SVM network, this tool can perform secondary structure prediction for protein sequences and, eventually, for protein sequence databases.

Key-Words: Computational Biology, Protein Secondary Structure Prediction, Support Vector Machines, Multi-class SVM

1 Introduction

With the advent of large-scale genome sequencing programs and the completion of the human genome project, protein structure determination has become a main challenge of structural molecular biology. Theoretical methods for protein structure prediction are important since they offer the possibility of supporting experimental methods as well as of validating the scientific theory behind the principles of protein structure. Fully automated protein structure prediction is a significant challenge and has been a final goal of research in the field, since it decouples the quality of the prediction from the expertise of the person using the method.

Currently, most prediction servers are based on a homology principle that states that similar sequences will most likely have a similar structure [1]. Thus, the performance of these servers automatically increases with the continuous growth of the number of resolved structures. On the other hand, there are algorithmic developments that lead to more sensitive methods. Building new prediction methods by fusion of existing ones appears particularly relevant for protein secondary structure prediction. Two main reasons can be put forward to support this assertion. First, the numerous methods already available to predict the secondary structure are based on different principles. Second, they use, in addition to the amino acid sequences or profiles of multiple alignments, data from different knowledge sources. Consequently, whenever protein secondary structure is to be predicted, several sets of conformational scores are available, which can be expected not to be fully correlated. Indeed, most of the current best prediction systems implement conformational score combinations, which can take many forms.
Recently, a relatively new classification method, the support vector machine (SVM), has been used for the prediction of protein–protein interaction, protein fold recognition, and protein structure prediction. Instead of directly analyzing sequences, the SVM classification method used in these studies is based on the analysis of physicochemical properties of a protein generated from its sequence [2]. Such an approach may be extended to the classification of proteins into functional classes, thus facilitating the prediction of the function of these proteins. Proteins of a specific functional class share common structural features essential for performing similar functions such as transport, metabolism, and binding-induced signal transduction. The same structural features responsible for protein folding are thus expected to be part of the elements that determine protein function. Therefore, as a method successfully used in protein fold recognition, SVM may potentially be used for protein function classification.

Figure 1. Bi-class vs. Multi-class SVMs

In this paper, we establish an automatic protein secondary structure prediction tool by combining protein secondary structure models with a multi-class SVM. Figure 1 shows the basic architectures for bi-class and multi-class SVM prediction. In the implementation section, we briefly explain how it can be of practical use to study the generalization capabilities of multi-class networking models. The corresponding theorems and formulae are then applied to the multivariate affine regression model, which leads to the specification of the new multi-class SVM. Initial experimental results are also given.
The study of the standard bi-class SVM is usually performed in two steps: first the linear case, corresponding to the specification of the maximal margin hyperplane, then the non-linear one, by introduction of kernels satisfying Mercer's conditions. In the same way as a linear SVM shares the architecture of the perceptron, a linear multi-class SVM is a multivariate linear regression model, i.e. a set of hyperplanes of cardinality equal to the number of classes.

2 Support Vector Machines: Bi-Class & Multi-Class

The support vector machine (SVM) is a training algorithm for learning classification and regression rules from data; for example, the SVM can be used to learn polynomial, radial basis function (RBF) and multi-layer perceptron (MLP) classifiers. SVMs arose from statistical learning theory, the aim being to solve only the problem of interest without solving a more difficult problem as an intermediate step. SVMs are based on the structural risk minimization principle, closely related to regularization theory. This principle incorporates capacity control to prevent over-fitting and is thus a partial solution to the bias-variance trade-off dilemma [3].

Two key elements in the implementation of SVM are the techniques of mathematical programming and kernel functions. The parameters are found by solving a quadratic programming problem with linear equality and inequality constraints, rather than by solving a non-convex, unconstrained optimization problem. The flexibility of kernel functions allows the SVM to search a wide variety of hypothesis spaces. Here we focus on SVMs for two-class classification, the classes being P, N for $y_i = +1, -1$ respectively. This can easily be extended to $k$-class classification by constructing $k$ two-class classifiers. The geometrical interpretation of support vector classification (SVC) is that the algorithm searches for the optimal separating surface, i.e. the hyperplane that is, in a sense, equidistant from the two classes. This optimal separating hyperplane has many nice statistical properties. SVC is outlined first for the linearly separable case. Kernel functions are then introduced in order to construct non-linear decision surfaces. Finally, for noisy data, when complete separation of the two classes may not be desirable, slack variables are introduced to allow for training errors.

The above descriptions also apply to any multi-class discriminant system obtained by combining a multivariate model with Bayes' estimated decision rule. Here we turn to the specific case of multi-class SVMs. The study of the standard bi-class SVM is usually done in two steps: first the linear case, by the optimal hyperplane, then the non-linear one, by introduction of kernels satisfying Mercer's conditions. Indeed, the specification of the training procedure does not explicitly take into account the nature of the kernel, although bounds on the generalization error of kernel machines have been derived. In the same way as a linear SVM shares the architecture of the perceptron, a multi-class linear SVM is a multivariate linear regression model, that is, a set of hyperplanes of cardinality equal to the number of classes [4]. To apply the inductive principle to multi-class SVMs, and consequently to determine the objective function of the training procedure, we must thus bound the covering numbers of the multivariate linear or affine model.

As for the choice between an architecture based on binary SVMs and a multi-class SVM, two strong arguments speak in favor of the latter. First, the empirical recognition rate of multi-class SVMs is higher. Second, multi-class SVMs use far fewer support vectors [4]. The lower the number of support vectors, the lower the number of terms in the sum and, consequently, the lower the time required to compute the outputs. As we defined the protein secondary structure prediction problem as a three-category classification, the output size of the networked multi-class SVMs is set to 3 for the three types of secondary structure: alpha helices (H), beta strands (E) and coils (C). As we must fix the input size too, we supply a fixed-size window of the protein sequence, normally of 13 residues.

3 Implementation of the Multi-class SVM based Network

A multi-class linear SVM is a multivariate linear regression model, or a set of hyperplanes of cardinality equal to the number of classes Q. We thus have H = {h}, with

$$\forall x \in X, \quad h(x) = Wx + b = \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_Q^T \end{bmatrix} x + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_Q \end{bmatrix}$$

The choice of an optimization method is an issue in its own right, since dealing with Q classes multiplies the number of dual variables by (Q−1). The algorithm includes a decomposition method. The prediction is local and based on a sliding window over the primary structure, i.e. the amino acid sequence. Precisely, the goal of training consists in associating each window's content with the conformational state of the central residue. This is standard in statistical approaches.

[...] coils, we choose the number of categories to be 3. The size of the sliding window is also very important, and we normally set it to 13. With the other multi-class SVM parameters specified, the network is ready to be trained.

The training data need to be prepared in a specific format before training. Since the data are prepared for training, we need the exact conformational detail of the training data set. We prepared the training data set by calculating the secondary structure using DSSP for those proteins with known structure in PDB; the raw data file contains data in the following format:

354 20 3
[email protected]:323
AACHVACNCPCANDEEYHCPVCHHHCCLNLNFDCHCADLWVFCCE…..
CECEEEEEEECHHHCCEEEEEEECCEEEEEEEECCCCCEEECECC……
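The data preparation and the linear model described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tool: each protein record is assumed to consist of an identifier line, a sequence line and a state line, as the sample suggests; the residue ordering in `AMINO_ACIDS`, the zero-padding at the sequence ends, the function names and the argmax decision rule are all hypothetical choices made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types (ordering is an assumption)
STATES = "HEC"                        # helix, strand, coil
WINDOW = 13                           # window size quoted in the text


def parse_records(text):
    """Parse a header of three integers followed by three lines per
    protein: identifier, sequence, secondary structure states."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_windows, n_inputs, n_outputs = (int(tok) for tok in lines[0].split())
    body = lines[1:]
    if len(body) % 3 != 0:
        raise ValueError("expected three lines per protein")
    records = []
    for i in range(0, len(body), 3):
        ident, seq, states = body[i:i + 3]
        if len(seq) != len(states):
            raise ValueError(f"length mismatch in record {ident!r}")
        records.append((ident, seq, states))
    return (n_windows, n_inputs, n_outputs), records


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector.
    Padding and unknown residues become all zeros (assumption)."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec


def windows(seq, states, size=WINDOW):
    """Yield (features, label) pairs, one per residue: the flattened
    one-hot window centred on the residue and the index of its state."""
    half = size // 2
    padded = "-" * half + seq + "-" * half
    for centre, state in enumerate(states):
        feats = []
        for residue in padded[centre:centre + size]:
            feats.extend(one_hot(residue))
        yield feats, STATES.index(state)


def predict_state(W, b, x):
    """Evaluate h(x) = Wx + b with one row of W per class and return the
    state with the largest output (an assumed argmax decision rule)."""
    outputs = [sum(w_j * x_j for w_j, x_j in zip(row, x)) + b_k
               for row, b_k in zip(W, b)]
    return STATES[outputs.index(max(outputs))]
```

With a window of 13 residues and 20 residue types, each feature vector has 260 components, so under these assumptions W would be a 3 × 260 matrix and b a vector of length 3.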
In the training data file, the first line contains three integers: the number of sliding windows, the number of inputs (the number of amino acid types) and the number of output states, respectively. Then, for each protein sequence, there are three lines of information: the first line is [...]

The objective function of the multi-class SVM takes the confidence interval into account. The training procedure associated with this model consists in solving the following quadratic programming problem:

$$\min_{h \in H} \left\{ \frac{1}{2} \sum_{1 \le k < l \le Q} \|w_k - w_l\|^2 + C \sum_{i=1}^{N} \sum_{k=1}^{Q} \xi_{ik} \right\}$$

[...]

>gi|129369|sp|P04637|P53_HUMAN Cellular tumor antigen p53 (Tumor suppressor p53)
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA
PPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT
CPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTF
RHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGR
DRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEL
KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

[...] databases will be extremely high, so that prediction will be very time-consuming in this case. This leads to the need for high performance computing. In the authors' previous paper on high performance computing in protein secondary structure prediction, they proposed a parallelized version of the well-known DSC method [6]; the next development in this project is therefore to develop its high performance version.

References:

[1] Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. Prediction of protein secondary structure and active sites using alignment of homologous sequences. J. Mol. Biol., 195, 1987, 957-961.

[2] Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V., Comparing support vector machines with Gaussian kernels to radial basis function classifiers, IEEE Trans. Signal Processing, 1997, 45:2758-2765.

[3] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

[4] E.J. Bredensteiner, K.P. Bennett, Multi-category classification by support vector machines, Comput. Optim. Appl. 12 (1/3), 1999, pp. 53–79.

[5] Hall PA. Tumor suppressors: a developing role for p53, Curr Biol 7(3), 1997, R144-R147

[6] Y. Qi, F. Lin & K. K. Wong, High Performance Computing in PSSP, WSEAS Transactions on Circuits and Systems, Issue 3, Volume 2, July 2003, ISSN 1109-2734.
