Investigation and Identification of Protein G-Glutamyl Carboxylation Sites
Total Page:16
File Type:pdf, Size:1020Kb
Lee et al. BMC Bioinformatics 2011, 12(Suppl 13):S10 http://www.biomedcentral.com/1471-2105/12/S13/S10 PROCEEDINGS Open Access Investigation and identification of protein g-glutamyl carboxylation sites Tzong-Yi Lee*, Cheng-Tsung Lu†, Shu-An Chen†, Neil Arvin Bretaña†, Tzu-Hsiu Cheng, Min-Gang Su, Kai-Yao Huang From Asia Pacific Bioinformatics Network (APBioNet) Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB2011/ISCB-Asia 2011) Kuala Lumpur, Malaysia. 30 November - 2 December 2011 Abstract Background: Carboxylation is a modification of glutamate (Glu) residues which occurs post-translation that is catalyzed by g-glutamyl carboxylase in the lumen of the endoplasmic reticulum. Vitamin K is a critical co-factor in the post-translational conversion of Glu residues to g-carboxyglutamate (Gla) residues. It has been shown that the process of carboxylation is involved in the blood clotting cascade, bone growth, and extraosseous calcification. However, studies in this field have been limited by the difficulty of experimentally studying substrate site specificity in g-glutamyl carboxylation. In silico investigations have the potential for characterizing carboxylated sites before experiments are carried out. Results: Because of the importance of g-glutamyl carboxylation in biological mechanisms, this study investigates the substrate site specificity in carboxylation sites. It considers not only the composition of amino acids that surround carboxylation sites, but also the structural characteristics of these sites, including secondary structure and solvent-accessible surface area (ASA). The explored features are used to establish a predictive model for differentiating between carboxylation sites and non-carboxylation sites. A support vector machine (SVM) is employed to establish a predictive model with various features. A five-fold cross-validation evaluation reveals that the SVM model, trained with the combined features of positional weighted matrix (PWM), amino acid composition (AAC), and ASA, yields the highest accuracy (0.892). Furthermore, an independent testing set is constructed to evaluate whether the predictive model is over-fitted to the training set. Conclusions: Independent testing data that did not undergo the cross-validation process shows that the proposed model can differentiate between carboxylation sites and non-carboxylation sites. This investigation is the first to study carboxylation sites and to develop a system for identifying them. The proposed method is a practical means of preliminary analysis and greatly diminishes the total number of potential carboxylation sites requiring further experimental confirmation. Introduction cofactor in the post-translational conversion of Glu resi- Carboxylation is a post-translational modification (PTM) dues to g-carboxyglutamate (Gla) residues [3]. Carboxy- of glutamate (Glu) residues in proteins that is primarily lation is catalyzed by g-glutamyl carboxylase [4] and involved in the blood clotting cascade specifically occur- proceeds in the lumen of the endoplasmic reticulum [5]. ring in factors II, VII, IX, and X, protein C, protein S, as The vitamin K-dependent carboxylase transforms Glu to well in some bone proteins [1,2]. Vitamin K is a critical Gla when carbon dioxide (CO2) is added at the g-posi- tion in the presence of oxygen (O2) and reduced vitamin K [2]. The carboxylated proteins can be activated when * Correspondence: [email protected] Gla domain binds Ca2+[6]. Since Glu is a weak Ca2+ † Contributed equally Department of Computer Science and Engineering, Yuan Ze University, chelator and Gla is a much stronger one, the transfor- Chung-Li 320, Taiwan mation by the vitamin K-dependent carboxylase greatly © 2011 Lee et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Lee et al. BMC Bioinformatics 2011, 12(Suppl 13):S10 Page 2 of 11 http://www.biomedcentral.com/1471-2105/12/S13/S10 increases the Ca2+-binding capacity of a protein [7]. Stu- tertiary structures. Accordingly, two tools, RVP-Net dies conducted over the last few years have revealed [21,22] and PSIPRED [23], are utilized to calculate the that the g-glutamyl carboxylated proteins in vertebrates ASA values and the secondary structures of amino acids can be categorized into three main groups [8]. The first in protein sequences, respectively. group comprises the carboxylated proteins with an Each feature is examined to evaluate its capacity to amino terminal Gla domain, and includes vitamin K- differentiate carboxylation sites from non-carboxylation dependent blood coagulation factors and co-regulators sites. A support vector machine (SVM) is utilized to of blood coagulation [9]. The second group is composed establish a predictive model with various features, of osteocalcin and matrix Gla protein (MGP) [10,11], including the positional weighted matrix of amino acids, and includes three and five Gla residues, respectively, amino acid composition, ASA, and secondary structure. which are critical to the regulation of bone growth and Five-fold cross-validation is utilized to test the predictive extraosseous calcification [12,13]. The third group is the performance of the generated SVM model. To prevent g-glutamyl carboxylase, itself, which includes Gla resi- the generated model from an overestimation of predic- dues [14]. tive power, homologous sequences are eliminated by Morris et al. used mass spectrometry to reveal the applying a window length of 2n+1 to carboxylation sites. processive behavior of vitamin K-dependent carboxyla- After building and evaluating the model, the selected tion wherein multiple carboxylations occur during a sin- models providing the best accuracy are further tested gle substrate binding event [6]. The efficient using a data set of independent testing. The indepen- carboxylation of native substrates depends on the bind- dent test set, which is not included in the training set, is ing of a conserved region to the carboxylase with a sub- then adopted to evaluate whether the selected model is micromolar affinity constant [15,16]. Owing to the over-fitting to the training data. Finally, the training fea- biological importance of g-glutamyl carboxylation, more tures and window length providing the highest predic- attention has been paid to mass spectrometric analyses tive accuracy for the model are used to construct a web- in order to identify experimentally confirmed carboxyla- based system for identifying g-glutamyl carboxylation. tion sites. Nevertheless, in vitro identification of protein carboxylation sites requires a very large amount of time Material and methods and effort. Meanwhile, in silico methods have the poten- Data collection and preprocessing tial to characterize carboxylated sites before experiments A comprehensive PTM resource dbPTM [24], which are conducted. Additionally, in silico identification pre- collects PTM data from release 15.0 of UniProtKB [17] sents a more feasible means of preliminary analysis with and release 8.0 of HPRD [18], comprises of 1123 Gla the potential to significantly diminish the number of residues in 182 protein entries from multiple organisms. potential carboxylation sites requiring further experi- After the non-experimental sites, annotated as “by simi- mental verification. larity”, “potential” and “probable” in the “MOD_RES” Given the importance of g-glutamyl carboxylation in fields of UniProtKB, have been removed, 463 experi- biological mechanisms, this work focuses on investigat- mental carboxylation sites from 134 carboxylated pro- ing the substrate site specificity of carboxylation sites. teins are obtained. It is observed that the carboxylation The experimentally validated g-glutamyl carboxylations site occurs on Glu residues. In this investigation, all car- were mainly gathered from UniProtKB, a universal pro- boxylated Glu residues are regarded as the positive set. tein resource [17]. Numerous experimental carboxyla- On the other hand, Glu residues, found in the experi- tion sites in humans have been taken from HPRD [18], mental carboxylated proteins, which are not annotated which contains curated data that the authors manually as carboxylation sites are regarded as the negative set. extracted from the PubMed literature database. Amino Consequently, a total of 854 non-carboxylated Glu sites acid side chains that undergo post-translational modifi- are defined as negative set. This work aims to investi- cation tend to be accessible on protein surfaces [19]. gate the structural characteristics of protein carboxyla- Therefore, this investigation not only examines the tion sites; with reference to a previous work [25], amino acid composition around carboxylation sites, but features such as amino acid composition, accessible sur- also considers structural characteristics, including sol- face area (ASA), and secondary structure are explored. vent-accessible surface area (ASA) and secondary struc- The extracted features are used to construct a predictive ture. In order to characterize the structural properties of model and are evaluated in terms of its ability to differ- the tertiary structures of carboxylated proteins, experi- entiate carboxylation sites from non-carboxylation sites. mentally identified carboxylation sites are mapped to