Harnessing the Evolutionary Information on Oxygen Binding Proteins Through Support Vector Machines Based Modules
Total Page:16
File Type:pdf, Size:1020Kb
Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules Citation: Muthukrishnan, Selvaraj and Puri, Munish 2018, Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules, BMC research notes, vol. 11: 290, pp. 1-8. DOI: http://www.dx.doi.org/10.1186/s13104-018-3383-9 © 2018, The Authors Reproduced by Deakin University under the terms of the Creative Commons Attribution Licence Downloaded from DRO: http://hdl.handle.net/10536/DRO/DU:30113373 DRO Deakin Research Online, Deakin University’s Research Repository Deakin University CRICOS Provider Code: 00113B Muthukrishnan and Puri BMC Res Notes (2018) 11:290 https://doi.org/10.1186/s13104-018-3383-9 BMC Research Notes RESEARCH NOTE Open Access Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules Selvaraj Muthukrishnan1,2* and Munish Puri2,3* Abstract Objectives: The arrival of free oxygen on the globe, aerobic life is becoming possible. However, it has become very clear that the oxygen binding proteins are widespread in the biosphere and are found in all groups of organisms, including prokaryotes, eukaryotes as well as in fungi, plants, and animals. The exponential growth and availability of fresh annotated protein sequences in the databases motivated us to develop an improved version of “Oxypred” for identifying oxygen-binding proteins. Results: In this study, we have proposed a method for identifying oxy-proteins with two diferent sequence similar- ity cutofs 50 and 90%. A diferent amino acid composition based Support Vector Machines models was developed, including the evolutionary profles in the form position-specifc scoring matrix (PSSM). The fvefold cross-validation techniques were applied to evaluate the prediction performance. Also, we compared with existing methods, which shows nearly 97% recognition, but, our newly developed models were able to recognize almost 99.99 and 100% in both oxy-50 and 90% similarity models respectively. Our result shows that our approaches are faster and achieve a better prediction performance over the existing methods. The web-server Oxypred2 was developed for an alternative method for identifying oxy-proteins with more additional modules including PSSM, available at http://bioinfo.imtec h.res.in/servers/muthu/oxypred2/home.html. Keywords: Oxygen binding proteins, Hemoglobin, Myoglobin, Leghemoglobin, Erythrocruorin, Hemerythrin, Hemocyanin, Support Vector Machines, Confusion matrix, ROC Analysis Introduction characteristics and structure with unique oxygen-binding Oxygen is an essential part of the atmosphere and is nec- capacity [1–11]. essary to sustain the most terrestrial life of living organ- A number of computational methods have been pro- isms as it used in respiration and regulation of a variety posed for identifying functional proteins on their pri- of cellular functions. Te oxygen binding proteins (oxy- mary sequences using machine learning approaches proteins) of various organisms considerably difer from [12–14]. Tese methods are always needful to improve one another and classifed mainly on their structure and or to fnd new features for identifying protein family and physiochemical properties as hemoglobin, hemocyanin, their classes, sub-classes to avoid negative prediction or hemerythrin, myoglobin, leghemoglobin, and eryth- to reduce false positive rates. rocruorin. Each oxy-proteins have its own functional In 2007, Muthukrishnan et al. developed Oxypred method for predicting oxygen-binding proteins using the *Correspondence: [email protected]; [email protected]; munish. simple amino (AC) and dipeptide composition (DC). Te [email protected] growing of protein sequence databases and availability of 1 Institute of Microbial Technology (CSIR-IMTECH), Sector‑39A, newly annotated sequences of oxy-proteins in the post Chandigarh 160036, India 2 Protein Biotechnology Laboratory, Department of Biotechnology, genomic era, retrospectively encouraged us to introduce Punjabi University, Patiala, India a new improved version of forged oxypred method. An Full list of author information is available at the end of the article attempt was made to include a recently generated highly © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Muthukrishnan and Puri BMC Res Notes (2018) 11:290 Page 2 of 8 non-redundant dataset in the development of Oxypred2 was generated using GPSR package available at http:// with a diferent protein features [15]. Recently, it has www.imtec h.res.in/ragha va/gpsr/. We applied GPSR observed that the use of evolutionary profle in the form programs for PSI-BLAST searches against the non- of a position-specifc scoring matrix (PSSM) predicted redundant (nr) database using diferent iterations with various functional proteins with a higher accuracy [16, a cutof E value 0.001 [34, 35]. Further, each value has 17]. Hence, we applied many approaches, including the been normalized the range between 0 and 1 by the fol- PSSM based evolutionary profle to improve prediction lowing equation, quality of oxy-proteins. (Value Minimum) In this study, recently generated two diferent cut- Normalized value − = (Maximum Minimum) (1) of non-redundant datasets 50 and 90% were applied to − develop Oxypred2. Te diference between current and previous study refected that PSSM and Hybrid approach, In 0–1 value, minimum scores consider as “0,” and the confusion matrix analysis, prediction score graphs, and maximum scores become “1”. ROC analysis has been added as extra features. Te many diferent prediction features are always important to understand their functional behavior Evaluation models aspects [18–21]. Here, we compared prediction perfor- We applied fvefold cross-validation techniques, as it was mance of 50 and 90% similarity datasets in all modules done by many investigators with SVM as the prediction to fnd the best identifcation of oxy-proteins. Te pre- engine. In this technique, the dataset was divided into diction results and their complete analysis show that the fve sets consisting of nearly equal number of sequences, developed method Oxypred2 is an improved version and where four sets used for training and remaining set for alternative method for identifying oxy-proteins. testing. Te training and testing set was carried out fve times in such a way that each part was used once for test- Main text ing, and the whole process was repeated 20 times. Methods Te objectives of our classifeds are to discriminate the Datasets oxy-protein from those of negative discipline, and the fol- Te two diferent datasets sequences (90 and 50%) lowing terminology used to evaluate of our classifer as, were extracted from UniProt databases by searching the individual keyword of oxy-proteins [22]. Te fnal • True positive (TP)—a protein is identifed as an oxy- dataset contains 2498 and 5474 sequences as in 50 and protein by both classifer and oxy-proteins model. 90% respectively. In sub-class, 47–114 erythrocruorin, • True negative (TN)—a protein is not identifed as a 42–154 hemocyanin, 1378–2585 hemerythrin, 957–2462 oxy-protein by either the classifer or oxy-protein hemoglobin, 34–34 leghemoglobin and 40–125 myo- model. globin as in both 50 and 90% datasets respectively. Due • False positive (FP)—a protein is identifed as posi- to less availability, 90% leghemoglobin dataset used for tively as oxy-protein by the classifer, but not by the 50% dataset. Te independent non-oxy protein datasets oxy-protein model. were constructed according to the size of oxy-proteins by • False negative (FN)—a protein is identifed as oxy- selecting randomly as 2565 and 5499 on 50 and 90% cut- protein by the oxy-protein model but not by the clas- of datasets respectively. sifer. Support Vector Machines In order to assess the prediction performances, accu- In this study, free downloadable package of SVM-light racy (ACC), Mathew’s correlation coefcient (MCC), was used to generate modules [23, 24]. It has been suc- sensitivity (Sen) and specifcity (Sep) were calculated cessfully applied to numerous classifcation and pattern using standard Eqs. (2–5) [36–38], recognition problems such as classifcation of protein TP TN Accuracy (ACC) + secondary structure, subcellular localization, DNA-bind- = TP TN FP FN (2) ing, ATP-binding and transporter family protein predic- + + + tions [25–33]. TP PSSM‑profle Sensitivity (SN) = TP FN (3) Te PSSM profle provides the evolutionary informa- + tion about residues conservation at a given position in a protein sequence. Te construction of PSSM profle Muthukrishnan and Puri BMC Res Notes (2018) 11:290 Page 3 of 8 TN Ala and Arg residues are very less (− 3%) in Hcy-50 and Specificity (SP) Leg-50 sequences respectively. In 90% oxy-datasets, Ala = TN FP (4) + residue is 2% more in Ery-90 and leg-90, Glu, Lys, and Val are present 3% more in heme, myo, and leg proteins TP TN FP FN respectively. Ala, Glu, and Arg are less 2% in hcy, ery and MCC × − × = √(TP FP)(TP FN)(TN FP)(TN FN) leg proteins, results shown in Fig. 1, Additional fle 1: + + + + Figure S1 and Additional fle 2: Figure S2. In sub-classes, (5) sequence length profle of oxy-50 and 90 were compared, Results found most of the sequences of heme and hemo proteins Determining the relative amino acid composition will belong to the range between 101 and 200. Te other pro- give a characteristic profle for protein [39]. Here, we cal- teins are distributed in diferent length ranges (Addi- culated average AC composition of oxy-proteins accord- tional fle 3: Figure S3).