Mining and Ensembling Heterogeneous Information for Protein Interaction Predictions
University of Wollongong Research Online
Faculty of Engineering and Information Sciences - Papers: Part B
2020

Authors: Huaming Chen; Yaochu Jin; Lei Wang, University of Wollongong; Chi-Hung Chi; Jun Shen, University of Wollongong

Recommended Citation
Chen, Huaming; Jin, Yaochu; Wang, Lei; Chi, Chi-Hung; and Shen, Jun, "HIME: Mining and Ensembling Heterogeneous Information for Protein Interaction Predictions" (2020). Faculty of Engineering and Information Sciences - Papers: Part B. 4201. https://ro.uow.edu.au/eispapers1/4201

Abstract
Research on protein-protein interactions (PPIs) data paves the way towards understanding the mechanisms of infectious diseases; however, improving the prediction performance for inter-species PPIs remains a challenge. Since a single type of sequence data, such as amino acid composition, may be insufficient for high-quality prediction of protein interactions, we have investigated a broader range of heterogeneous information from sequence data. This paper proposes a novel framework for PPIs prediction based on a Heterogeneous Information Mining and Ensembling (HIME) process to effectively learn from the interaction data. In particular, the proposed approach introduces an ensemble process together with substantial features, which together yield better performance on the PPIs prediction task.
The performance of the proposed framework is validated on real protein interaction datasets. Extensive experiments show that HIME achieves higher performance than all existing methods reported in the literature so far.

Keywords: heterogeneous, HIME, information, interaction, protein, predictions, mining, ensembling

Disciplines: Engineering | Science and Technology Studies

Publication Details
Chen, H., Jin, Y., Wang, L., Chi, C. & Shen, J. (2020). HIME: Mining and Ensembling Heterogeneous Information for Protein Interaction Predictions. IEEE International Joint Conference on Neural Networks (pp. 1-8). United States: IEEE.

This conference paper is available at Research Online: https://ro.uow.edu.au/eispapers1/4201

HIME: Mining and Ensembling Heterogeneous Information for Protein-Protein Interactions Prediction

Huaming Chen∗, Yaochu Jin†, Lei Wang∗, Chi-Hung Chi‡, Jun Shen∗
∗School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia
†Department of Computer Science, University of Surrey, United Kingdom
‡Data61, CSIRO, Australia

Index Terms: biological data, heterogeneous information, protein interaction, neural networks

I. INTRODUCTION

Analyzing and understanding protein-protein interactions (PPIs) is of great importance to the study of infectious diseases, especially for inter-species interactions such as those between humans and pathogens [1], [2], also termed human-pathogen protein-protein interactions. It has been one of the most active topics in the study of disease mechanisms. Since infectious diseases are still dominant causes of death, research on infectious diseases has drawn on data from different perspectives to examine biomedical hypotheses and propose potential therapeutics. Extensive research has been conducted over a long period of biological development and examination.

As a result of decades of wet-lab experiments in biology, the production of biological data, e.g. protein interactions, has exploded. Although there is still a substantial need for further experiments, the accumulated data has benefited research on disease mechanisms to a limited extent. One of the earliest studies concerned anthrax, which was identified as being primarily caused by interactions between humans and Bacillus anthracis. Bacillus anthracis is a bacterial pathogen, and researchers want to fully understand its mechanisms through the protein interaction map between Bacillus anthracis and Homo sapiens (the host).

However, experimental results on protein-protein interactions are still very limited. The picture of protein-protein interaction relationships remains incomplete, because identifying interactions demands a huge amount of time and resources for wet-lab experiments. Meanwhile, the nature of interaction data between different species leaves a huge number of latent interactions to be further examined and verified as positive or negative by biologists. The identification of protein-protein interactions is traditionally conducted by in vitro and in vivo methods, which include affinity purification, yeast two-hybrid assay, affinity purification-mass spectrometry (AP-MS), nuclear magnetic resonance (NMR) and mass spectrometry methods. These methods are considered costly in both time and resources.

To effectively generate high-fidelity PPIs prior to biological experiments, numerous studies have introduced computational methods to facilitate the process. The identified interaction data plays an important role in these studies by providing the interaction relationships. One major category builds machine learning-based models with different protein data, such as protein sequence data [3], gene ontology data [4], and protein structure data [5], for the prediction of protein interactions. Among these, sequence information is considered the main protein information because it has accumulated on a large scale. Specifically, the physical and biochemical characteristics of a protein are uniquely determined by its sequence information. By analyzing the protein sequence information hosted by the Universal Protein Resource (UniProt), past studies have indicated that combining machine learning-based models with protein sequence data mining benefits the prediction and analysis of protein interactions [1], [6], [7].

More recently, Soyemi et al. comprehensively reviewed the relevant data on inter-species/host-parasite protein interactions [8], though a quantitative evaluation is still lacking. Inspired by the ideas in [1], [8], in this paper we focus on constructing a computational framework for evaluating machine learning-based models on the prediction of human-pathogen protein-protein interactions (HP-PPIs). One major reason is that, to the best of our knowledge, no study has conducted a comprehensive quantitative exploration of human-pathogen protein-protein interactions from both the database perspective and the comparison of computational models [1], [6]–[8]. In this work, we further propose an ensemble machine learning-based model that mines the heterogeneous information of protein data. The proposed framework achieves high prediction performance.

In summary, the novel contributions made in this work are:
• We conduct an extensive review of the existing databases for human-pathogen interactions since the 2000s. By doing so, several human-pathogen protein-protein interaction datasets are carefully curated from the selected databases.
• We perform comprehensive experiments to collect a wide range of prediction performances for different machine learning-based models, which also include the methods from the literature focusing on the prediction

deal with the avalanche of newly sequenced protein data, the feature representation methods of protein sequence data were well designed as one of the important components for machine learning-based PPIs prediction models [3], [9], [14], [15]. Because sequence data is the most abundant data benefiting from high-throughput technology development, it would be beneficial to understand its performance in computational models and develop a more efficient model for HP-PPIs prediction.

Although there have been other literature reviews on the topic of inter-species PPIs prediction, such as host-pathogen