Detecting Succinylation Sites from Protein Sequences Using Ensemble Support Vector Machine Qiao Ning1, Xiaosa Zhao1, Lingling Bao1, Zhiqiang Ma1* and Xiaowei Zhao2*

Ning et al. BMC Bioinformatics (2018) 19:237 https://doi.org/10.1186/s12859-018-2249-4 RESEARCH ARTICLE Open Access Detecting Succinylation sites from protein sequences using ensemble support vector machine Qiao Ning1, Xiaosa Zhao1, Lingling Bao1, Zhiqiang Ma1* and Xiaowei Zhao2* Abstract Background: Lysine succinylation is a new kind of post-translational modification which plays a key role in protein conformation regulation and cellular function control. To understand the mechanism of succinylation profoundly, it is necessary to identify succinylation sites in proteins accurately. However, traditional methods, experimental approaches, are labor-intensive and time-consuming. Computational prediction methods have been proposed recent years, and they are popular because of their convenience and high speed. In this study, we developed a new method to predict succinylation sites in protein combining multiple features, including amino acid composition, binary encoding, physicochemical property and grey pseudo amino acid composition, with a feature selection scheme (information gain). And then, it was trained using SVM (Support Vector Machine) and an ensemble learning algorithm. Results: The performance of this method was measured with an accuracy of 89.14% and a MCC (Matthew Correlation Coefficient) of 0.79 using 10-fold cross validation on training dataset and an accuracy of 84.5% and a MCC of 0.2 on independent dataset. Conclusions: The conclusions made from this study can help to understand more of the succinylation mechanism. These results suggest that our method was very promising for predicting succinylation sites. The source code and data of this paper are freely available athttps://github.com/ningq669/PSuccE. Keywords: Predict succinylation sites, Multiple features, Grey pseudo amino acid composition, Information gain, SVM, Ensemble learning algorithm Background succinylated substrate proteins along with succinylation As a type of widespread reversible post-translational modi- sites, so more and more focus is put on this field [6–23]. fication, lysine succinylation plays a significant role in both Many biological experimental methods have been de- eukaryotic and prokaryotic cells [1–3]. In succinylation pro- veloped to identify succinylated protein or succinylation cedure, the succinyl group (-CO-CH2-CH2-CO-) is cova- sites, such as high performance liquid chromatography lent bonding to specific lysine residues in proteins which assays, spectrophotometric assays and liquid might lead to substantial chemistry changes to proteins [4]. chromatography-mass spectrometry [24, 25]. However, Besides, lysine succinylation can induce mutations of these experimental approaches are inconvenient, charge in the environment with PH value (hydrogen ion time-consuming and costly, especially for large-scale concentration) range from − 1to+1andpromotestruc- data sets. Therefore, efficient computational prediction tural and functional adjustment to substrate proteins [5]. It methods for the succinylated sites are urgently needed. is extremely important to understand the molecular mech- Currently, numerous computational classifiers have been anism of succinylation in biological systems by identifying developed to identify PTM (Post Translation Modifica- tion) sites using various types of two-class machine * Correspondence: [email protected]; [email protected] learning algorithms [26–29]. We proposed a computa- 1School of Information Science and Technology, Northeast Normal tional predictor, SucPred (2015), based on the combin- University, Changchun 130117, China 2Key Laboratory of Intelligent Information Processing of Jilin Universities, ation of a kind of semi-supervised learning algorithm Northeast Normal University, Changchun 130117, China (Psol) and SVM classifier. This predictor took advantage © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ning et al. BMC Bioinformatics (2018) 19:237 Page 2 of 9 of four types of sequence features, including autocorrel- enough succinylation data, the performances of Succin- ation function, encoding based on grouped weight, nor- Site still have room for improvement. malized van der Waals volume and position weight To solve problems mentioned above, we developed a amino acids composition. Xu et al. (2015) built a pre- new predictor, which was proposed to predict succinyla- dictor called iSuc-PseAAC based on SVM using Pseudo tion sites in protein using the same data set with Succin- amino acid composition. And then, Xu et al. (2015) de- Site. We used multiple efficient feature descriptors to veloped another predictor named SucFind. It was con- derive informative features, including amino acid com- structed based on SVM with k-spaced amino acid pairs position (AAC), binary encoding (BE), physicochemical and AAindex features. More recently, Hasan et al. property (PCP) and grey pseudo amino acid composition (2016) proposed an approach SuccinSite based on Ran- (GPAAC) and we showed the flow chart in Fig. 1. Fi- dom Forest classifier. SucStruct predictor was built by nally, we obtained promising results with an accuracy of Lopez et al. (2017) using structural properties of amino 89.14% and a MCC of 0.79 using 10-fold cross validation acids [30]. Thereafter, using profile bigram [31], on training data set and an accuracy of 84.5%, a MCC of PSSM-Suc [32] was introduced for identifying succinyla- 0.2 on independent test set. These results demonstrated tion lysine sites by Lopez et al. (2017). Besides, they pro- that this predictor is promising to predict lysine succiny- posed Success predictor (2018) using evolutionary lation sites and could serve as a helpful tool to the information of amino acids [33]. Thereafter, they (2018) community. used secondary structure information to further enhance the succinylation prediction [34]. Although these Methods methods have already been developed to predict succi- As demonstrated in compliance with Chou’s 5-step rule nylation sites, there are some problems existing. First of [38] in a series of recent publications [6–12], we should all, the data set used in SucPred and iSuc-PseAAC was follow the following five guidelines to establish a useful obtained from CPLM database [35] and the data set of sequence-based predictor for a biological system: (a) se- SucFind was derived from several lysine modification da- lect or construct a valid benchmark data set to train and tabases and some relevant articles [36, 37], which are test the predictor; (b) formulate these protein sequence small and they didn’t cover novel succinylation data re- samples with an effective mathematical expression that cently found. Besides, though the SuccinSite contains can truly reflect their intrinsic correlation with the target Fig. 1 The flow chart of PSuccE Ning et al. BMC Bioinformatics (2018) 19:237 Page 3 of 9 to be predicted; (c) introduce or develop a powerful al- amino acid. pc(xi) means the probability that amino acid gorithm to operate the prediction; (d) properly perform xi appears at position c. cross-validation tests to objectively evaluate the antici- pated accuracy of the predictor; (e) establish a General Pseudo amino acid composition user-friendly web-server for the predictor that is access- With the rapid growth of the amount of biological se- ible to the public. Below, we are going to describe how quences in the post-genome era, one of the most signifi- to deal with these steps one-by- one. cant but also most difficult problems in computational biology is how to convert a biological sequence into a nu- Datasets merical vector, yet still retain significant sequence-order In this study, succinylation data was derived from Uni- information or key pattern characteristic, which is because ProtKB/Swiss-Prot database and NCBI protein sequence almost all the existing machine-learning algorithms can database as Hasan et al. [29] did. After removing pro- only handle vector instead of sequence samples [22]. teins that have more than 30% sequence identity to any However, a vector that is defined in a discrete model may other proteins in this dataset using CH-HIT, 2322 succi- completely lose all the sequence-order information. To nylation proteins including 5009 experimentally verified avoid this, the pseudo amino acid composition or PseAAC lysine succinylation sites were obtained. Then, 124 pro- [42] was proposed. Ever since the concept of Chou’s teins were randomly separated from the 2322 proteins as PseAAC [43, 44] was put forward, it has penetrated into an independent test set for testing, and the remaining nearly all the areas of computational proteomics [45–50], proteins were training data set. We referred the experi- many biomedicine and drug development areas [51]. Be- mentally verified lysine succinylation sites as positive cause of its widely and increasingly usage, two powerful sites, and all the lysine sites that lie on the same proteins open access soft-wares, named ‘propy’ [43]and‘PseAAC-- as succinylation sites but don’t

Load more