Prediction of RNA-Binding Proteins from Primary Sequence by a Support Vector Machine Approach

Downloaded from rnajournal.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press BIOINFORMATICS Prediction of RNA-binding proteins from primary sequence by a support vector machine approach LIAN YI HAN,1 CONG ZHONG CAI,1,2 SIEW LIN LO,3 MAXEY C.M. CHUNG,3 and YU ZONG CHEN1 1Department of Computational Science, National University of Singapore, Singapore 117543 2Department of Applied Physics, Chongqing University, Chongqing 400044, People’s Republic of China 3Department of Biochemistry, National University of Singapore, Singapore, 117597 ABSTRACT Elucidation of the interaction of proteins with different molecules is of significance in the understanding of cellular processes. Computational methods have been developed for the prediction of protein–protein interactions. But insufficient attention has been paid to the prediction of protein–RNA interactions, which play central roles in regulating gene expression and certain RNA-mediated enzymatic processes. This work explored the use of a machine learning method, support vector machines (SVM), for the prediction of RNA-binding proteins directly from their primary sequence. Based on the knowledge of known RNA- binding and non-RNA-binding proteins, an SVM system was trained to recognize RNA-binding proteins. A total of 4011 RNA-binding and 9781 non-RNA-binding proteins was used to train and test the SVM classification system, and an independent set of 447 RNA-binding and 4881 non-RNA-binding proteins was used to evaluate the classification accuracy. Testing results using this independent evaluation set show a prediction accuracy of 94.1%, 79.3%, and 94.1% for rRNA-, mRNA-, and tRNA-binding proteins, and 98.7%, 96.5%, and 99.9% for non-rRNA-, non-mRNA-, and non-tRNA-binding proteins, respectively. The SVM classification system was further tested on a small class of snRNA-binding proteins with only 60 available sequences. The prediction accuracy is 40.0% and 99.9% for snRNA-binding and non-snRNA-binding proteins, indicating a need for a sufficient number of proteins to train SVM. The SVM classification systems trained in this work were added to our Web-based protein functional classification software SVMProt, at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Our study suggests the potential of SVM as a useful tool for facilitating the prediction of protein–RNA interactions. Keywords: RNA-binding proteins; RNA–protein interactions; rRNA; mRNA; tRNA; snRNA; support vector machine INTRODUCTION al.1999; Marcotte et al.1999), phylogenetic profile (Pelle- grini et al.1999), gene neighbor (Dandekar et al.1998; Knowledge regarding how proteins interact with each other Overbeek et al.1999), and interacting domain profile pair and with other molecules is essential in the understanding (Eisen et al.1998) methods. of cellular processes (Siomi and Dreyfuss 1997; Draper Although progress has been made in the development of 1999; Lengeler 2000; Downward 2001).With the accumu- predictive methods for protein–protein interactions, insuf- lation of sequence information, attention has been paid to ficient attention has been paid to the development of pre- the development of methods for the prediction of protein dictive methods for protein–RNA interactions.Most cellu- function (Fetrow and Skolnick 1998) and interactions lar RNAs work in concert with protein partners, and pro- (Dandekar et al.1998; Overbeek et al.1999; Bock and tein–RNA interactions are critically important in regulation Gough 2001) from sequence.Several computational meth- of different steps of gene expression (Siomi and Dreyfuss ods have been developed for the prediction of protein– 1997).Moreover, binding of proteins to some catalytic RNA protein interactions using support vector machines (SVM; molecules is known to activate or enhance the activity of Bock and Gough 2001) and for the prediction of protein– these molecules (Frank and Pace 1998).Therefore, predic- protein interaction maps by Rosetta/gene fusion (Enright et tion of protein–RNA interactions is of significance in a more comprehensive understanding of how cellular pro- Reprint requests to: Yu Zong Chen, Department of Computational cesses and networks work. Science, National University of Singapore, Blk SOC1, Level 7, 3 Science RNA recognition by proteins is primarily mediated by Drive 2, Singapore 117543; e-mail: [email protected]; fax: 65-6774-6756. Article and publication are at http://www.rnajournal.org/cgi/doi/ certain classes of RNA binding domains and motifs (Draper 10.1261/rna.5890304. 1999; Fierro-Monti and Mathews 2000; Peculis 2000; Perez- RNA (2004), 10:355–368.Published by Cold Spring Harbor Laboratory Press.Copyright © 2004 RNA Society. 355 Downloaded from rnajournal.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press Hanetal. Canadillas and Varani 2001).Hence, as in the case of pro- casting (Liong and Sivapragasam 2002), cancer diagnosis tein–protein interactions (Casari et al.1995; Pawson 1995; (Furey et al.2000; Ramaswamy et al.2001; Fritsche 2002), Elcock and McCammon 2001), correlated patterns of se- microarray gene expression data analysis (Brown et al. quence and substructure in RNA-binding proteins can be 2000), inhibitor classification (Burbidge et al.2001), pre- recognized to bind to specific RNA sequences and folds. diction of protein solvent accessibility (Yuan et al.2002), The SVM approach, successfully used for the prediction of protein fold recognition (Ding and Dubchak 2001), protein protein–protein interactions from primary sequences (Bock secondary structure prediction (Hua and Sun 2001), pre- and Gough 2001), is therefore expected to be applicable for diction of protein–protein interaction (Bock and Gough recognizing this pattern and thus predicting RNA-binding 2001) and protein functional class classification (Karchin et proteins from protein primary sequence. al.2002; Cai et al.2003a).These studies have demonstrated In the present study, we explored the use of SVM for the that SVM is consistently superior to other supervised learn- prediction of RNA-binding proteins from protein primary ing methods including classification methods (Brown et al. sequence.The SVM method was used for the prediction of 2000; Burbidge et al.2001; Cai et al.2002b).In the present individual classes of rRNA-, mRNA-, and tRNA-binding study, SVM was further tested regarding its capability to proteins, as well as all RNA-binding proteins.There are predict protein–RNA interactions. other groups of RNA-binding proteins, such as snRNA- binding and snoRNA-binding proteins, with small numbers RESULTS AND DISCUSSION of proteins and fewer available sequences (Tomasevic and Peculis 1999; Singh 2002).A search of protein family and Overall prediction accuracy sequence databases revealed a total of 60 sequences of The numbers and prediction results of specific classes of snRNA-binding proteins and 21 sequences of snoRNA- RNA-binding proteins and non-class members are given in binding proteins, which is fewer than the 80–100 sequences Table 1.In the able, TP stands for true positive (correctly typically needed to properly train an SVM protein classifi- predicted RNA-binding proteins of a specific class), FN for cation system (Cai et al.2003a).Nevertheless, to evaluate its false negative (specific class of RNA-binding proteins incor- performance on classification of a small protein class, SVM rectly predicted as non-class members), TN for true nega- was used for the prediction of snRNA-binding proteins. tive (correctly predicted non-class members), and FP for Proteins of small RNA-binding classes as well as other false positive (non-class members incorrectly predicted as a RNA-binding proteins were included in training and testing specific class of RNA-binding proteins).The predicted sen- the SVM classification of all RNA-binding proteins. sitivity (SE) for rRNA-, mRNA-, tRNA-, and snRNA-bind- SVM is a relatively new and promising algorithm for ing proteins and all RNA-binding proteins, which measures binary classification by means of supervised learning which the overall prediction accuracy for each class of RNA-bind- was originally developed by Vapnik and his coworkers (Vap- ing proteins, is 94.1%, 79.3%, 94.1%, 41.0%, and 97.8%, nik 1995; Burges 1998) and applied to a wide range of respectively.The predicted specificity ( SP) for non-rRNA-, problems including text categorization (Drucker et al.1999; non-mRNA-, non-tRNA-, and non-snRNA-binding proteins Kim et al.2001; de Vel et al.2001), hand-written digit and all non-RNA-binding proteins, which measures predic- recognition (Vapnik 1995), tone recognition (Thubthong tion accuracy for each group of non-RNA-binding proteins, is and Kijsirikul 2001), image classification and object detec- 98.7%, 96.5%, 99.9%, 99.7%, and 96.0%, respectively. tion (Ben-Yacoub et al.1999; Karlsen et al.2000; Papageor- A direct comparison with results from previous protein giou and Poggio 2000; Huang et al.2002), flood stage fore- studies is inappropriate, because of the differences in the TABLE 1. Prediction accuracies and number of positive and negative samples in the training, testing, and independent evaluation set of rRNA-, mRNA-, tRNA-, and snRNA-binding proteins and of all RNA-binding proteins Training set Testing set Independent evaluation set positive negative positive negative Protein family positive negative TP FN TN FP TP FN SE (%) TN FP SP (%) Q (%) RNA-binding 2161 2965 1844 6 6802 14 437 10 97.8 4685 196 96.0 96.1 rRNA-binding 708 972 1243 2 9031 13 95 6 94.1 4931 66 98.7 98.6 mRNA-binding 277 2106 129 0 10164 0 130 34 79.3 5833 213 96.5 96.0 tRNA-binding 94 792 114 0 9295 2 48 3 94.1 5028 5 99.9 99.8 snRNA-binding 33 1988 7 0 10373 1 9 11 41.0 6133 18 99.7 99.5 Predicted results are given in TP (true positive), FN (false negative), TN (true negative), FP (false positive), sensitivity SE = TP/(TP + FN), specificity SP = TN/(TN+FP), and Q (overall accuracy, Q=(TN+TP)/(TP+FN+TN+FP)).

Load more