A Deep Learning-Based Predictor for Surface Accessibility of Transmembrane Protein Residues

crystals Article TMP-SSurface: A Deep Learning-Based Predictor for Surface Accessibility of Transmembrane Protein Residues Chang Lu 1,2 , Zhe Liu 1,2, Bowen Kan 1,2, Yingli Gong 1,2, Zhiqiang Ma 1,2,3,* and Han Wang 1,2,3,* 1 School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; [email protected] (C.L.); [email protected] (Z.L.); [email protected] (B.K.); [email protected] (Y.G.) 2 Institute of Computational Biology, Northeast Normal University, Changchun 130117, China 3 Department of Computer Science, College of Humanities & Sciences of Northeast Normal University, Changchun 130117, China * Correspondence: [email protected] (Z.M.); [email protected] (H.W.) Received: 4 November 2019; Accepted: 29 November 2019; Published: 1 December 2019 Abstract: Transmembrane proteins (TMPs) play vital and diverse roles in many biological processes, such as molecular transportation and immune response. Like other proteins, many major interactions with other molecules happen in TMPs’ surface area, which is important for function annotation and drug discovery. Under the condition that the structure of TMP is hard to derive from experiment and prediction, it is a practical way to predict the TMP residues’ surface area, measured by the relative accessible surface area (rASA), based on computational methods. In this study, we presented a novel deep learning-based predictor TMP-SSurface for both alpha-helical and beta-barrel transmembrane proteins (α-TMP and β-TMP), where convolutional neural network (CNN), inception blocks, and CapsuleNet were combined to construct a network framework, simply accepting one-hot code and position-specific score matrix (PSSM) of protein fragment as inputs. TMP-SSurface was tested against an independent dataset achieving appreciable performance with 0.584 Pearson correlation coefficients (CC) value. As the first TMP’s rASA predictor utilizing the deep neural network, our method provided a referenceable sample for the community, as well as a practical step to discover the interaction sites of TMPs based on their sequence. Keywords: transmembrane protein; surface accessibility; deep learning 1. Introduction Transmembrane protein (TMP) is one of the most important types of membrane proteins (MPs) that span the entire biological membranes in the whole molecular life cycle as a gateway or receptor. They involve in diverse biological processes, such as cell mechanics regulation [1], signal transduction [2], molecule transport [3], etc. Special interest in TMPs also arises from the fact that they associate with many types of diseases, such as autism [4], dyslipidemia [5], epilepsy [6], and various types of cancers [7–9]. Since TMPs play numerous roles in basic physiology and pathophysiology, TMPs are major targets for more than one-third of known drugs on the current therapeutics market [10]. On one side of the membrane, TMPs INTERACT with ligands, including protons, metal ions, enzyme, drug-like compound, etc. On the other side, they interact with proteins, RNAs, or other molecules to trigger a series of molecular reactions and eventually control the cell functions. The interaction interface is always located on the surface areas of TMPs, according to the statistics [11]. The surface accessibility of the residues in the protein can be measured by the relative accessible surface area (rASA), Crystals 2019, 9, 640; doi:10.3390/cryst9120640 www.mdpi.com/journal/crystals Crystals 2019, 9, 640 2 of 13 which refers to the relative surface area of the residues exposed to the environment surrounding the protein [12]. As a valuable structural property, predicting rASA based on primary sequence is a rewarding task for TMPs in structure prediction, function annotation, and drug discovery [13]. In recent years, several sequence-based methods have been developed to predict the surface accessibility of residues for TMPs. Thijs B. et al. firstly published a knowledge-based method (ProperTM) to predict and analyzed the burial state (burial or exposure) of transmembrane residues within TMPs [14]. After that, several machine learning-based methods have been developed. Depending on their functionality, these methods can be roughly grouped into two categories: burial state identifier and surface area predictor. Identifying the burial state of residues is a binary classification problem; it predicts whether residues are exposed to the surface or buried inside of the TMP, such as TMX [15], Yao et al. (2011) [16], TMexpoSVC [17]. Predicting the real value of surface area is a regression problem; these tools predict the accessible surface area (ASA) value or the relative accessible surface area (rASA) value of residues, such as ASAP [18], MPRAP [19], Yao et al. (2012) [20], TMexpoSVR [17], and MemBrain-Rasa [21,22]. A summary of these methods is listed in Table1 in chronological order. Table 1. The summaries of existing methods for predicting surface accessibility of transmembrane protein (TMP) residues. Method Year Samples Algorithm TMP Type Seq Region Measure ProperTM [14] 2004 59 knowledge α-TMP TM region Burial state ASAP [18] 2006 73 SVR all TMP TM region ASA TMX [15] 2007 43 SVC α-TMP TM region Burial state MPRAP [19] 2010 80 SVR α-TMP full sequence rASA Yao et al. (2011) [16] 2011 53 SVM α-TMP TM region Burial state Yao et al. (2012) [20] 2012 122 RF all TMP TM region ASA TMexpoSVR [17] 2013 110 SVR α-TMP TM region rASA TMexpoSVC [17] 2013 110 SVC α-TMP TM region Burial state MenBrain-Rasa [21,22] 2015 80 SVR α-TMP full sequence rASA Although considerable achievements have been made in the field of TMP surface accessibility prediction, there are still several issues that deserved to be further improved. First of all, none of the mentioned methods could predict the rASA of the whole sequence of all kinds of TMPs. On the one hand, except for MPRAP and MenBrain-Rasa, most predictors can only be applied within transmembrane regions of TMPs, which focus only on the lipid-accessible surface while ignoring the water-accessible surface. It is worth pointing out that the prediction of rASA on the full sequence is more challenging than those that only apply to transmembrane residues. On the other hand, most methods only focus on α-helical TMPs while ignoring β-barrel TMPs—including the only two full sequence predictors. Although β-barrel TMPs just account for a small proportion of TMPs, it is also essential to be studied and should not be ignored. Up to now, ASAP and Yao et al. (2012) are the only two predictors that can be applied to both α-helical and β-barrel TMPs, but it is a pity that they can only be used to predict transmembrane regions of TMPs. Thus, it is meaningful to design a more powerful full sequence predictor to predict rASA for all kinds of TMPs. Besides, previous predictors relied heavily on the features derived from third-party tools, such as position-specific score matrix (PSSM) [23], Z-coordinate, secondary structure [24], and so on. Although these features contribute to the improvement of the predictor performance [25–27], their weakness cannot be ignored. On the one hand, using these third-party tool-derived features will make the predictor slow and may lead to uncontrollable failure. For example, MemBrain-Rasa uses six types of features, four of whom relied on the third-part tools, and seven out of 50 proteins cannot get a reliable prediction result from MemBrain-Rasa on the independent test. On the other hand, expertise in TMPs is always required to successfully use the previous methods, which may confuse the non-professional users and hinder the exploration of the biological significance of the prediction process. Since most Crystals 2019, 9, 640 3 of 13 previous predictors can only be applied in the transmembrane regions of TMPs, the topology structure of TMPs must be known before using them. However, it is difficult for non-professional researchers to determine the topology of TMPs. Based on this consideration, we tried to describe the protein fragment with features as concise as possible to make the predictor simpler and more efficient. After a series of experiments, we selected two types of encoding schemes to represent the protein fragment: one-hot code [28–30] and a position-specific scoring matrix (PSSM), where the former one encodes the residues arrangement and the latter one reflect the evolutionary profile. However, reducing the number of features will inevitably result in less information that may get by the predictor and cause performance deterioration. As a promising solution, decreasing the dependency on sophisticated features, a deep learning-based method was introduced in this study for its ability to discover the structural features from the sequence. The proposed method was a deep learning network that combines a convolutional neural network (CNN), inception network, and CapsuleNet. In this study, we proposed a sequence-based rASA predictor (TMP-SSurface) for the full sequence of all types of TMPs, that achieved considerable performance while simplifying the input features as much as possible. Only one-hot code and PSSM were used as the input features of a new proposed deep learning-based regression method, which combined the inception network with CapsuleNet. The experimental result showed that the performance of TMP-SSurface achieved a Pearson correlation coefficient (CC) of 0.581 on the independent validation, which was slightly better than the results of today’s best predictor, but much more simple than it. TMP-SSurface is accessible freely in http://icdtools.nenu.edu.cn/tmp_ssurface. The datasets used in the experiment and project of the predictor could be downloaded from the web-server. 2. Results and Discussion 2.1. Feature Analysis We tried several features, such as topology structure, physicochemical properties, and Z-coordinate. Although these features contributed to the predictor more or less, they were not as significant as one-hot code and PSSM.

Load more