Sequence-Based Prediction of Protein Protein Interaction Using a Deep-Learning Algorithm

Tanlin Sun1, Bo Zhou1, Luhua Lai1,2,3 and Jianfeng Pei1*
Sun et al. BMC Bioinformatics (2017) 18:277. DOI: 10.1186/s12859-017-1700-2. Research Article, Open Access.

* Correspondence: [email protected]
1Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China. Full list of author information is available at the end of the article.

Abstract

Background: Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPIs to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPIs have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested.

Results: We used a stacked autoencoder, a type of deep-learning algorithm, to study sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods.

Conclusions: To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

Keywords: Deep learning, Protein-protein interaction

Background

Protein-protein interactions (PPIs) play critical roles in many cellular biological processes, such as signal transduction, immune response, and cellular organization. Analysis of PPIs is therefore of great importance and may shed light on drug target detection and aid in therapy design [1]. Biochemical assays, chromatography, and similar small-scale experimental methods have long been used to identify novel PPIs, but these contribute only a low coverage of the whole PPI database due to their poor efficacies [2]. High-throughput technologies, such as yeast two-hybrid screens (Y2H) [3] and mass spectrometric protein complex identification (MS-PCI) [4], have generated copious data; however, they are expensive and time consuming. In addition, these methods may not be applicable to proteins from all organisms and often produce false-positive results [5]. Therefore, high-throughput computational methods are needed to identify PPIs with high quality and accuracy.

Recently, many computational methods have been developed to solve this problem. Of these, some have attempted to mine new protein information, whereas others have involved the development of new machine-learning algorithms. For protein information mining, Shen et al. regarded any three continuous amino acids as a unit and calculated the frequencies of those conjoint triads in protein sequences. They demonstrated that PPIs could be predicted from sequences alone [6]. Several other methods, such as autocovariance (AC) [7] and amino acid index distribution [8], were developed to extract features such as physicochemical properties, frequencies, and locations of amino acids to represent a protein sequence. Considering the high dimensionality of these features, dimension-reduction techniques have been used. For machine-learning algorithms, support vector machines (SVM) and their derivatives [9, 10], random forests [11], and neural networks [12] have been applied. However, most studies provided only the results of cross-validation and did not test prediction results using external datasets [6, 10, 13, 14].
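To make the conjoint-triad coding of Shen et al. [6] concrete, the sketch below computes triad-frequency features for a single protein sequence. It is a minimal illustration rather than the exact coding used in any of the cited papers: the seven-group amino-acid classification, the simple frequency normalization, and all function names are assumptions chosen for clarity.

```python
from itertools import product

# A commonly used 7-class grouping of the 20 standard amino acids
# (assumed here for illustration; [6] defines its grouping by dipole
# and side-chain volume).
GROUPS = {"AGV": 0, "ILFP": 1, "YMTS": 2, "HNQW": 3, "RK": 4, "DE": 5, "C": 6}
AA_TO_GROUP = {aa: g for aas, g in GROUPS.items() for aa in aas}

# All 7 * 7 * 7 = 343 possible conjoint triads, in a fixed order.
TRIADS = list(product(range(7), repeat=3))
TRIAD_INDEX = {t: i for i, t in enumerate(TRIADS)}

def conjoint_triad_vector(sequence: str) -> list[float]:
    """Encode a protein sequence as 343 normalized triad frequencies."""
    counts = [0] * len(TRIADS)
    # Map residues to group indices, skipping non-standard residues (e.g. U, X).
    groups = [AA_TO_GROUP[aa] for aa in sequence if aa in AA_TO_GROUP]
    for i in range(len(groups) - 2):
        counts[TRIAD_INDEX[(groups[i], groups[i + 1], groups[i + 2])]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

if __name__ == "__main__":
    # A protein pair can be represented by concatenating the two vectors.
    vec_a = conjoint_triad_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    vec_b = conjoint_triad_vector("MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS")
    pair_feature = vec_a + vec_b  # 686-dimensional input for a pair classifier
    print(len(pair_feature))
```

Autocovariance (AC) coding [7], which the present study found most effective, follows the same overall pattern but summarizes correlations of physicochemical property values along the sequence instead of triad counts.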
Deep-learning algorithms, which mimic the deep neural connections and learning processes of the human brain, have received considerable attention due to their successful applications in speech and image recognition [15, 16], natural language understanding [17], and decision making [18]. Compared with traditional machine-learning methods, deep-learning algorithms can handle large-scale raw and complex data and automatically learn useful and more abstract features [19]. In recent years, these algorithms have been applied in bioinformatics to manage the increasing amounts and dimensions of data generated by high-throughput techniques [20–24]. For genome regulation function prediction, for example, Xiong et al. applied a deep neural network model to predict DNA variants causing aberrant splicing; their method was more accurate than traditional models [25]. The DeepBind model constructed by Alipanahi and colleagues using convolutional networks could predict sequence specificities of DNA- and RNA-binding proteins and identify binding motifs [26]. Identifying the functional effects of noncoding variants is a major challenge in human genetics; DeepSEA, developed by Zhou et al., could directly learn a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of the chromatin effects of sequence alterations with single-nucleotide sensitivity [27]. After that, the DanQ model constructed by Quang and coworkers achieved more than a 50% relative improvement over other models for predicting the function of non-coding DNA [28]. For protein function prediction, Spencer et al. used a deep belief network (DBN) to predict protein secondary structures and achieved an accuracy of 80.7% [29]. Sheng and colleagues improved the prediction accuracy to 84% using deep convolutional neural fields [30]. Heffernan et al.'s algorithm can predict not only secondary structures, but also backbone angles and solvent accessible surface areas [31]. A more detailed summary of the applications of deep-learning algorithms in computational biology can be found in a recent review [32].

In this study, we applied a stacked autoencoder (SAE) to sequence-based human PPI prediction. Models based on protein sequence autocovariance coding achieved the best results on 10-fold cross-validation (10-CV) and on predicting hold-out test sets. The best model had an average accuracy of 97.19% for the whole benchmark dataset. Various external test sets were constructed and predicted using our model, and the prediction accuracies for these ranged from 87.99% to 99.21%. In addition, we trained and tested PPI models on other species,
Datasets

Benchmark dataset

We obtained Pan's PPI dataset from http://www.csbio.sjtu.edu.cn/bioinf/LR_PPI/Data.htm [14]. In this dataset, the positive samples (PPIs) are from the Human Protein Reference Database (HPRD, 2007 version), with duplicated interactions removed (36,630 pairs remained). Negative samples (non-interacting pairs) were generated by pairing proteins found in different subcellular locations. The protein subcellular location information was taken from the Swiss-Prot database, version 57.3, according to the following criteria. (1) Only human proteins were collected. (2) Sequences annotated with ambiguous or uncertain subcellular location terms, such as "potential", "probable", "probably", "maybe", or "by similarity", were excluded. (3) Sequences annotated with two or more locations were excluded for lack of uniqueness. (4) Sequences annotated with "fragment" were excluded, and sequences with fewer than 50 amino acid residues were removed because they may represent fragments.

In total, 2,184 unique proteins from six subcellular locations (cytoplasm, nucleus, endoplasmic reticulum, Golgi apparatus, lysosome, and mitochondrion) were obtained. By randomly pairing those proteins with others found in different subcellular locations, along with the addition of negative pairs from [33], a total of 36,480 negative pairs were generated. We removed protein pairs containing unusual amino acids, such as U and X, to yield 36,545 positive samples and 36,323 negative samples, which form the benchmark dataset. The interaction networks and the degree distributions of the positive and negative sample sets of the benchmark dataset are shown in Additional file 1: Figures S1 and S2.

We mixed the positive and negative samples in the benchmark dataset and randomly selected 7,000 pairs (3,493 positive samples and 3,507 negative samples) as a hold-out test set for model validation; the remainder formed the pre-training set (33,052 positive samples and 32,816 negative samples). The pre-training set was trained and tested using 10-CV, and the best models were selected to predict the hold-out test set. To test the robustness of the model, a non-redundant test set ('NR-test set') was formed by removing pairs in the hold-out test set with a pairwise identity ≥25% to those in the pre-training set. After the network architecture and training parameters were selected, we trained on the whole benchmark dataset to construct our final PPI prediction model and used it to predict the external test sets.
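A rough sketch of the negative-pair construction and hold-out split described above is given below. It is an illustrative outline under assumed data structures (a protein-to-location dictionary and a set of known interacting pairs); the helper names, the random seed, and the duplicate handling are not taken from the paper.

```python
import random

def make_negative_pairs(location_of: dict[str, str], n_pairs: int,
                        known_interactions: set[frozenset]) -> list[tuple[str, str]]:
    """Randomly pair proteins annotated to different subcellular locations,
    skipping known interacting pairs and duplicates (illustrative only)."""
    proteins = list(location_of)
    negatives: set[frozenset] = set()
    while len(negatives) < n_pairs:
        a, b = random.sample(proteins, 2)
        if location_of[a] == location_of[b]:
            continue  # only cross-location pairs are taken as non-interacting
        pair = frozenset((a, b))
        if pair in known_interactions:
            continue
        negatives.add(pair)
    return [tuple(p) for p in negatives]

def holdout_split(samples: list, holdout_size: int = 7000, seed: int = 0):
    """Shuffle the mixed positive/negative samples and set aside a hold-out
    test set; the remainder forms the pre-training set used for 10-fold CV."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[:holdout_size], shuffled[holdout_size:]
```

Building the NR-test set would additionally require pairwise sequence-identity calculations (e.g., with an external alignment or clustering tool) to filter hold-out pairs with ≥25% identity to the pre-training set; that step is not shown here.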