Research Article Prediction of Four Kinds of Simple Supersecondary Structures in Protein by Using Chemical Shifts
Hindawi Publishing Corporation
The Scientific World Journal
Volume 2014, Article ID 978503, 5 pages
http://dx.doi.org/10.1155/2014/978503

Feng Yonge
College of Science, Inner Mongolia Agriculture University, Hohhot 010018, China

Correspondence should be addressed to Feng Yonge; [email protected]

Received 7 May 2014; Revised 3 June 2014; Accepted 4 June 2014; Published 18 June 2014

Academic Editor: Hao Lin

Copyright © 2014 Feng Yonge. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Knowledge of supersecondary structures can provide important information about the spatial structure of a protein. Several approaches have been developed for the prediction of protein supersecondary structure. However, the features used by these approaches are primarily based on amino acid sequences. In this study, a novel model is presented to predict protein supersecondary structure using chemical shift (CS) information derived from nuclear magnetic resonance (NMR) spectroscopy. Using these CSs as inputs to quadratic discriminant analysis (QD), we achieve an overall prediction accuracy of 77.3% in threefold cross-validation, which is competitive with the same method applied to amino acid compositions. Moreover, our findings suggest that the particular combination of chemical shifts used influences the accuracy of prediction.

1. Introduction

The prediction of protein structure has long been one of the most important research topics in the field of bioinformatics. However, it is very difficult to predict the spatial structure directly from the protein sequence. Therefore, the prediction of supersecondary structure is an important step in the prediction of protein spatial structure. Supersecondary structural motifs are composed of a few secondary structural elements (namely, α or β) connected by loops. At present, there are four kinds of simple supersecondary structures, namely, α-loop-α, α-loop-β, β-loop-α, and β-loop-β. These motifs play an important role in protein folding and stability because a large number of them occur in protein spatial structures. Many studies have focused on exploring methods for protein supersecondary structure prediction [1, 2]. In 1995, Sun et al. predicted protein supersecondary structure and achieved an accuracy of between 70 and 80% by using neural networks [3]. Chou and Blinn presented a method for predicting beta turns [4–6], alpha turns [7], and all the tight turns [6]. Cruz et al. identified β-hairpin and non-β-hairpin [8]. Hu and Li identified four kinds of simple supersecondary structures in 2088 proteins and achieved an accuracy of 78–83% [9]. Zou et al. also predicted four kinds of simple supersecondary structures from 3088 proteins by using a support vector machine [10], achieving an overall accuracy of 78%. The features used in these studies were mainly derived from amino acid compositions or dipeptide compositions.

The nuclear magnetic resonance (NMR) technique plays an important role in the determination of three-dimensional structures of biological macromolecules. NMR chemical shifts encode subtle information about the local chemical environment of nuclear spins. For many years, there has been growing interest in accessing this information and utilizing it for biomolecular structure determination [11, 12]. Recent progress was made by combining chemical shifts with protein structure prediction programs [13–20], showing that chemical shift information is a powerful parameter for the determination of protein structure. In this paper, we utilized chemical shifts as parameters to predict four kinds of simple supersecondary structures in protein by the method of quadratic discriminant analysis. Using the benchmark dataset, we achieved an average sensitivity of 76.3%, an average specificity of 74.3%, and an overall prediction accuracy of 77.3% in threefold cross-validation by using six CSs (C, Cα, Cβ, Hα, HN, N) as features. Moreover, we performed the prediction by combining different chemical shifts as features. The results showed that redundant information has a great influence on the accuracy.

2. Materials and Methods

2.1. Database. The chemical shifts of all six nuclei (C, Cα, Cβ, Hα, HN, N) in proteins were extracted from the re-referenced protein chemical shift database (namely, RefDB [21]). The following steps were performed to construct the dataset. Firstly, only proteins with assigned CSs for all six nuclei were considered. Secondly, only proteins with supersecondary structure information in ArchDB40 [22] were retained. Finally, we utilized the PISCES program [23] to remove highly similar sequences. After strictly following the aforementioned procedures, 114 proteins were obtained which have both CSs and supersecondary structures. Among the 114 proteins, 92% (105 sequences) have less than 25% sequence identity, and the sequence identity of the remainder ranges from 25 to 30%. The appendix lists the 114 proteins used in this study. In total, we obtained 90 α-loop-α (HH), 89 α-loop-β (HE), 97 β-loop-α (EH), and 122 β-loop-β (EE) motifs, including the β-link and β-hairpin.

2.2. Feature Parameter. In the four data subsets {HH, HE, EH, EE}, we calculated the averaged CSs of six nuclei for a sequence $S$ of length $L$ using the following formula:

\[
\overline{CS}_i = \frac{1}{L}\sum_{k=1}^{L} CS_i^{(k)}, \tag{1}
\]

where $i$ = C, Cα, Cβ, Hα, HN, N and $CS_i^{(k)}$ is the chemical shift of nucleus $i$ at the $k$th residue. Therefore, a sequence $S$ can be converted into a six-dimensional vector $R : \{\overline{CS}_i\}$.

2.3. Prediction Algorithm. Designing an efficient and accurate prediction algorithm is the key step in protein supersecondary structure prediction. Quadratic discriminant analysis [24] is a powerful algorithm that has been widely applied in genomic and proteomic bioinformatics. Thus, we used it here to perform prediction.

2.4. Quadratic Discriminant Analysis (QD). For a sequence $S$ to be classified, we calculated the averaged CSs of six nuclei using (1). So, the sequence $S$ is converted into a six-dimensional vector $R$:

\[
R = \{\overline{CS}_i\} \quad (i = \mathrm{C}, \mathrm{C}\alpha, \mathrm{C}\beta, \mathrm{H}\alpha, \mathrm{HN}, \mathrm{N}). \tag{2}
\]

Here we integrated the six-dimensional vector by using the quadratic discriminant analysis function. Consider a sequence $S$ to be classified into one of four groups (HH, HE, EH, EE). The discriminant function between group $\nu_1$ and group $\nu_2$ is defined by

\[
\eta_{\nu_1\nu_2} = \ln P(\nu_1 \mid R) - \ln P(\nu_2 \mid R). \tag{3}
\]

According to Bayes' Theorem, we deduce

\[
\begin{aligned}
\eta_{\nu_1\nu_2} &= \ln\frac{p_{\nu_1}}{p_{\nu_2}} - \frac{\delta_{\nu_1} - \delta_{\nu_2}}{2} - \frac{1}{2}\ln\frac{|\Sigma_{\nu_1}|}{|\Sigma_{\nu_2}|} \\
&= \left(\ln p_{\nu_1} - \frac{\delta_{\nu_1}}{2} - \frac{1}{2}\ln|\Sigma_{\nu_1}|\right) - \left(\ln p_{\nu_2} - \frac{\delta_{\nu_2}}{2} - \frac{1}{2}\ln|\Sigma_{\nu_2}|\right).
\end{aligned} \tag{4}
\]

The result can be generalized to four groups directly and described as follows. Set

\[
\xi_\nu = \ln p_\nu - \frac{\delta_\nu}{2} - \frac{1}{2}\ln|\Sigma_\nu| \quad (\nu = \mathrm{HH}, \mathrm{HE}, \mathrm{EH}, \mathrm{EE}), \tag{5}
\]

where

\[
\delta_\nu = (R - \mu_\nu)^{T}\,\Sigma_\nu^{-1}\,(R - \mu_\nu). \tag{6}
\]

Here $p_\nu$ denotes the number of samples in group $\nu$, $\delta_\nu$ is the squared Mahalanobis distance between $R$ and $\mu_\nu$ with respect to $\Sigma_\nu$ (note: $\mu_\nu$ and $|\Sigma_\nu|$ are calculated on the training set), and $\mu_\nu$ denotes the chemical shift values of six nuclei $R : \{\overline{CS}_i\}$ averaged over group $\nu$; $|\Sigma_\nu|$ is the determinant of matrix $\Sigma_\nu$. The components of the six-dimensional vector $\mu_\nu$ can be written as

\[
\mu_{\nu,i} = \frac{1}{p_\nu}\sum_{m=1}^{p_\nu} \overline{CS}_i^{(m)}, \tag{7}
\]

where $\nu$ = HH, HE, EH, EE; $i$ = C, Cα, Cβ, Hα, HN, N; and $m$ runs over the sequences in group $\nu$. $\Sigma_\nu$ is the 6 × 6 covariance matrix, quantifying correlations between the chemical shifts of six nuclei:

\[
\Sigma_\nu =
\begin{pmatrix}
\sigma^{\nu}_{1,1} & \sigma^{\nu}_{1,2} & \cdots & \sigma^{\nu}_{1,6} \\
\sigma^{\nu}_{2,1} & \sigma^{\nu}_{2,2} & \cdots & \sigma^{\nu}_{2,6} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma^{\nu}_{6,1} & \sigma^{\nu}_{6,2} & \cdots & \sigma^{\nu}_{6,6}
\end{pmatrix}, \tag{8}
\]

where the element

\[
\sigma^{\nu}_{i,j} = \frac{1}{p_\nu}\sum_{m=1}^{p_\nu}\left(\overline{CS}_i^{(m)} - \mu_{\nu,i}\right)\left(\overline{CS}_j^{(m)} - \mu_{\nu,j}\right). \tag{9}
\]

Here $\nu$ = HH, HE, EH, EE; $i, j$ = C, Cα, Cβ, Hα, HN, N. From (4) and (5), we conclude

\[
\eta_{\nu_1\nu_2} = \xi_{\nu_1} - \xi_{\nu_2}. \tag{10}
\]

Because (3) and (10) give $\ln P(\nu_1 \mid R) - \ln P(\nu_2 \mid R) = \xi_{\nu_1} - \xi_{\nu_2}$, it follows that $P(\nu_1 \mid R)$ is the maximum of $P(\nu \mid R)$ if $\xi_{\nu_1}$ is the maximal one among $\xi_\nu$ ($\nu$ = HH, HE, EH, EE). Then, we predict that $S$ belongs to group $\nu_1$.

2.5. Correction in the Error Allowed Scope. A sequence $S$ is predicted for four kinds of supersecondary structures by using (1)–(10). If $\xi_{\nu_1}$ is the maximal one among $\xi_\nu$ ($\nu$ = HH, HE, EH, EE), then we predict that $S$ belongs to group $\nu_1$. However, there are only slight differences among the $\xi_\nu$ ($\nu$ = HH, HE, EH, EE). To correct the predicted results, we define the coefficient of the error allowed scope as

\[
\rho = \frac{\xi_{\mathrm{corr}} - \xi_{\mathrm{wro}}}{\xi_{\mathrm{corr}}}, \tag{11}
\]

where $\xi_{\mathrm{corr}}$ is the value of $\xi$ for the class to which $S$ itself belongs and $\xi_{\mathrm{wro}}$ is the value of $\xi$ for another class to which $S$ is predicted to belong. For example, if $S$ is a supersecondary structure of class HH, then $\xi_{\mathrm{corr}}$ is $\xi_{\mathrm{HH}}$ and $\xi_{\mathrm{wro}}$ is the maximum among $\xi_{\mathrm{HE}}$, $\xi_{\mathrm{EH}}$, and $\xi_{\mathrm{EE}}$.

Table 1: The predicted accuracies by using six CSs as features (3-fold cross-validation).

Class structure   SN (%)   SP (%)   Average SN (%)   Average SP (%)   Total (%) (ρ < 0.4)
HH                73.0     71.0
EH                75.8     78.1     76.3             74.3             77.3
HE                69.0     66.7
EE                87.5     81.4

indicating that CSs are highly informative with regard to supersecondary structures.

Generally speaking, chemical shift measurements can be incomplete for a multitude of reasons. Often, chemical shifts can only be assigned partially or are missing. To assess the impact of incomplete chemical shift assignments, we performed the prediction by using combinations of different chemical shifts as features. The results are shown in Table 2.

2.6. Performance Evaluation. In statistical prediction, the independent dataset test, the cross-validation test, and the jackknife test can be used to examine a predictor for its effectiveness in practical application. Among the three test methods, the jackknife test is deemed to be the least arbitrary, since it always yields a unique result for a given benchmark dataset [25], and it has been widely used to examine the performance of various predictors [26–37]. However, in this study we used threefold cross-validation to examine the performance of our method; in order to reduce the computational time, we randomly divided the training set into three parts, two of which are for training and the rest