A Brief Review of Machine Learning Techniques for Sites Prediction

Farzaneh Esmaili, Mahdi Pourmirzaei, Shahin Ramazi, Elham Yavari

{f.esmaili, m.poormirzaie, s.ramazi, e.yavari}@modares.ac.ir

Tarbiat Modares University

Abstract

Reversible Post-Translational Modifications (PTMs) have vital roles in extending the functional diversity of and effect meaningfully the regulation of protein functions in prokaryotic and eukaryotic organisms. PTMs have happened as crucial molecular regulatory mechanisms that are utilized to regulate diverse cellular processes. Nevertheless, among the most well-studied PTMs can say mainly types of proteins are containing phosphorylation and significant roles in many biological processes. Disorder in this modification can be caused by multiple diseases including neurological disorders and cancers. Therefore, it is necessary to predict the phosphorylation of target residues in an uncharacterized sequence. Most experimental techniques for predicting phosphorylation are time-consuming, costly, and error-prone. By the way, computational methods have replaced these techniques. These days, a vast amount of phosphorylation data is publicly accessible through many online databases.

In this study, at first, all datasets of PTMs that include phosphorylation sites (p-sites) were comprehensively reviewed. Furthermore, we showed that there are basically two main approaches for phosphorylation prediction by machine learning: End-to-End and conventional. We gave an overview for both of them. Also, we introduced 15 important feature extraction techniques which mostly have been used for conventional machine learning methods.

1. Introduction 1.1 Post-translational modifications Post-translational modifications (PTMs) are biochemical reactions occurring on a protein after its translation [1], [2] which change regulate physicochemical properties, maturity, and activity of most proteins [3], [4]. PTMs have key roles in different biological regulatory mechanisms like metabolic pathways, DNA damage response, transcriptional regulation, signaling pathways, protein– protein interactions, apoptosis, cell death, insulin signaling, immune response, and aging [5], [6]. Often proteins in advanced organisms contain the multiple PTMs that occur in the amino acids of proteins as enzymatically or non-enzymatically [7], [8]. Most amino acids are included in these modifications and play significant roles in the process of protein synthesis. PTMs include cutting, folding, and ligand-binding and adding a modifying group to one or more amino acids, changing the chemical nature of an amino acid [9], [10]. In recent years, an increasing volume of PTMs data is available because of the improvements in mass spectrometry (MS) based high-throughput proteomics. Such advancements have revolutionized the study of PTMs[11]. There are more than 600 types of PTMs [12] that affect many aspects of cellular functionalities, such as metabolism, signal transduction, activity, stability, and localization of various proteins [13], [14]. Recent studies have shown that each modification leads to a multitude of effects on the structural and therefore, the function of the proteins [15]. Moreover, PTMs are affected by

1 activity state, localization, turnover, and interactions of target proteins with other proteins [13], [16]. PTMs have key roles in many cellular processes ranges from gene expression, signal transduction, DNA repair, cell cycle control, apoptosis to metabolism [17], [18]. PTMs include Phosphorylation, , ubiquitination, nitrosylation, SUMOylation, , , and S- nitrosylation as well as numerous others in most cellular activities [8], [13], [18]–[21]. Dysregulation in PTMs is contributed to cancer, diabetes, cardiovascular disease, and neurological disorder [22]–[27]. Protein folding and aggregation can be regulated via various cellular processes like different types of stress, molecular crowding, or the local micro-environment, although it was reported that three types of PTMs (phosphorylation, ubiquitination, or sumoylation) have vital roles in protein folding and aggregation, and disorders in these modifications cause different neurodegenerative diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), and frontotemporal dementia (FTD)[28]. In recent years, computational methods for distinguishing the PTMs have been extremely appealing to scientists. Researchers use experimental data analysis to train the machine learning methods to predict. Currently, most of the existing computational methods are used to predict the phosphorylation and glycosylation sites [29] . Phosphorylation is highly conserved and is central to the regulation of various cellular processes and has significant effects on the stability of the modified proteins.

1.2 Phosphorylation There are hundreds of classes of PTMs, but phosphorylation is one of the most studied which was first discovered in 1906 by Phoebus Levene in the protein vitellin (phosvitin)[30]. Phosphorylation occurs covalently by adding a phosphoryl group with a −2 charge at physiological PH in , , , and residues which can greatly impact folding, function, stability, and subcellular localization of the protein [28], [30], [31]. More than one-third of human proteins contain phosphorylation modification and it is regulated by approximately 568 human protein and 156 protein , as a result, it has a significant role in the control of biological processes like proliferation, differentiation, and apoptosis [32], [33]. This modification plays key roles in eukaryotic signaling and biological processes including protein synthesis, cell division, signal transduction, DNA repair, environmental stress response, regulation of transcription, apoptosis, cellular motility, immune response, metabolism, cell growth, development, cellular differentiation, and aging [33], [34]. Site mutations or dysregulation of activity, their hyperactivity, malfunction or overexpression and also, hyperphosphorylation of proteins is associated with certain disease states such as cancers, Alzheimer’s disease (AD), Parkinson's disease (PD), frontotemporal dementia (FTD), and various pathways involving the immune system [28], [31], [33], [35]. Thus, using kinase inhibitors can be valuable to the treatment of cancer. Therefore, it could be assumed that various enzymes and receptors are activated and deactivated by the phosphorylation/ events due to specific kinases and phosphatases. Fig 1 shows p-sites schema.

2

Fig 1. Schema of protein phosphorylation [36].

The number of known phosphorylated sites has grown since 2003, it rose from 2,000 to more than 500,000 known sites in valid databases [37]. These methods are generally difficult, slow, costly, and have a high error rate. With the advancement of technology and the emergence of new computational methods in recent years, computational methods and machine learning have replaced the previous methods. In this paper, at the beginning, valid databases for predicting phosphorylation sites were introduced and then, preprocessing techniques with evaluation metrics reviewed. Furthermore, 15 feature extraction methods in three types of the structural level, sequential based and Physicochemical property-based were described. Finally, we described two generally machine learning based approaches for p-sites prediction, and also available tools of p-site prediction briefly mentioned

2. Databases Post-translational modification (PTM) databases are constantly evolving due to the advent of technology. Sources for predicting PTM are expanded by providing accurate information in databases. These databases contain different organisms such as viruses, animals, sapiens, etc. that have been collected manually and experimentally. For instance, the information of HPRD is collected manually and it contains more than 93700 input data. [38].

Considering the area and variety of different PTMs, databases are arranged into specific and general databases. General PTM databases investigate wide domain of data for different type of PTM modifications. On the other hand, specific databases are constructed based on special types of PTMs like phosphorylation.

Databases such as EPSD [39], sysPTM [37], swiss-prot [40], HPRD [38], dbPTM [41], RESID [42], Lymphos2 [43], phoshpho3d [44], phosphoELM [45], Regphos [46] contain a wide range of data for different types of PTMs. In the following, two important databases for p-sites are introduced. Furthermore, Table 1 investigates both general and specific databases according to their general statistics.

3

2.1 EPSD Eukaryotic Phosphorylation Site Database (EPSD) is one of the most specific, large and comprehensive databases for p-sites which was updated in 2020. EPSD updated from two databases of dbPPT [47] and dbPAF [48], which include roughly 82,000 p-sites for 20 plants and more than 483,000 p-sites from seven types of animals and fungi, respectively. Moreover, EPSD collected p-sites in 13 additional databases including PhosphoSitePlus [49], Phospho.ELM [50], UniProt [51], PhosphoPep [52], BioGRID [53], dbPTM, FPD [54], HPRD, MPPD [55], P3DB [56], PHOSIDA [57], PhosPhAt [58] and SysPTM [37]. At all, this database contains 1,616,804 experimentally known p-sites in approximately 209,300 phosphoproteins of 68 eukaryotes (18 animal, 24 plants, 19 fungi and 7 protists) and more than 1,6 million p-sites. Fig 2 the left side shows distribution of animal’s dataset and the right side shows p-sites on three different plants, fungi and animal category. Furthermore, fig 3 represents number of S, T and Y sites in different categories and fig 4 demonstrate distribution of p-sites in plants and fungi organisms.

2.2 dbPTM Database post-translation modification (dbPTM) is a general database that integrates PTM’s data from 30 databases and 92 648 research articles. dbPTM covers 130 types of PTMs in more than 1000 organisms [36]. The dbPTM is consists of three types of PTM sites including phosphorylation, glycosylation and sulfation which are gathered from Swiss-prot [40], PhosphoELM [50] and O-GLYCBASE [59].

Table 1 Contains two general and specific databases which cover p-sites. Also, it provides useful information about each one. This table is inspired from [36].

* Type of database can be secondary or primary; secondary databases are the integration of other databases.

** Primary databases are independent.

General statistics

Type of data Type Acronym Number of URL Number of and database phosphorylation covered organisms sites

Experimental and S: ~ 908 800 More than 1,000 Predicted dbPTM http://dbptm.mbc.nctu.edu.tw organisms P: ~ 557 700 Secondary*

S: ~700 000 Experimental Generaldatabase BioGRID 71 organisms https://orcs.thebiogrid.org ** P: ~ 419 400 Primary

Phosphosite S: ~ 483 700 Experimental

26 organisms https://www.phosphosite.org Plus S: ~ 20 200 Primary

PTMCode Experimental 19 organisms S: ~ 316 500 http://ptmcode.embl.de v2 Secondary

4

P: ~ 45 300

S: ~296 900 Experimental qPTM Human http://qptm.omicsbio.info/ P: ~19 600 Secondary

S: ~285 700 Experimental PLMD 176 organisms http://plmd.biocuckoo.org P: ~53 500 Secondary

S: ~189 900 Experimental CPLM 122 organisms http://cplm.biocuckoo.org P: ~ 45 700 Secondary

Saccharomyces S: ~121 900 Experimental YAAM http://yaam.ifc.unam.mx cerevisiae P: ~680 Secondary

S: ~93 700 Experimental HPRD Human http://www.hprd.org P: ~30 000 Primary

9 organisms S: ~80 000 Experimental PHOSIDA http://www.phosida.com P: ~28 700 Secondary

S: ~10 600 Experimental http://www.dsimb.inserm.fr/dsi PTM-SD 7 model organisms P: ~842 Secondary mb_tools/PTM-SD

S: ~ 478 100 Experimental http://lifecenter.sgst.cn/SysPT sysPTM 6 organisms P: ~ 53 200 Secondary M/

S: ~ 58 500 Experimental WERAM 8 organisms http://weram.biocuckoo.org P: ~20 000 Secondary

S:~1,616,800 Experimental EPSD 68 organisms http://epsd.biocuckoo.cn P: ~209,300 Secondary

Experimental Phosphorylationdatabases PhosphoNE S: ~950,000 Human and Predicted http://www.phosphonet.ca T P: ~20,000 Secondary

Experimental Human, mouse and S: ~111,700 http://140.138.144.141/~RegPh RegPhos and Predicted rat P: ~ 18,700 os

Secondary

Phospho.EL Mainly model S: ~42,900 Experimental http://phospho.elm.eu.org M organims P: ~8,600 Secondary

5

Mainly model S: ~42,500 Experimental Phospho3D http://www.phospho3d.org organisms P: ~8,700 Secondary

200 prokaryotic S: ~19,300 Experimental http://dbpsp.biocuckoo.cn/indE dbPSP organisms P: ~8,600 Secondary xp.php

Experimental S: ~17,800 pTestis Mouse and Predicted http://ptestis.biocuckoo.org P: ~3,900 Secondary

Experimental S: ~15,500 LymPHOS Human and Predicted https://www.lymphos.org P: ~4,900 Primary

S: ~14,600 Experimental P3DB 9 plant organisms http://www.p3db.org P: ~6,400 and Predicted

Fig 2. Figure (left) shows number of phosphorylation animal’s sites distributed by different types of animals and figure (right) shows p-sites on three species animal, fungi and plants; All figures are based on EPSD dataset.

6

Fig 3. Number of S, T and Y in animal, plants and fungi organisms.

Fig 4. Distribution of p-sites in EPSD dataset for (left) fungi organism (right) plants organism.

3. Preprocessing

To predict p-sites, there are two main steps to prepare datasets.

1. Data collection 2. Data preprocessing 3.1. Data collection

Positive data collection: The S, T, and Y amino acids as p-sites and the positive samples are usually compiled from the aforementioned databases (e.g., EPSD and dbPTM).

Negative data collection: S, T, and Y amino acids existing in experimental peptides without any phospho-groups are considered as negative samples.

7

Data gathering is most challenging when selecting the negative dataset and thus, there are two major strategies accessible to choose the negative dataset.

• From phosphoproteins, the negative samples at random of the target residue that did not undergo the phosphorylation are selected.

• From non- phosphoproteins with none of their target residues have undergone specific phosphorylation, (based on experimental evidence) are selected as the negative set.

3.2 Data preprocessing

After constructing the primary positive and negative datasets, one important task is removing inconsistent/redundant samples to gain a more reliable dataset. This step varies from study to study. One can distinguish three main policies in the literature for removing inconsistent/redundant proteins:

1- Removing redundant phosphoproteins.

2- Removing identical subsequences within the positive and negative sets.

3- Removing identical subsequences between the positive and negative datasets.

The Cluster Database at high identity with tolerance (CD-HIT) program is designed to reduce homology and filtering out similar sequences. According to different phosphorylation prediction studies [60]–[63], threshold of identity to consider a pair of sequences to be similar/redundant differs, and this threshold is considered to range from 30% to 100% [64]. In the following fig 5 shows preprocessing process.

3.3 Class imbalanced problem There is a common problem in some machine learning datasets that happens when there are different ratios of distribution in each class. Thus, the dataset is imbalanced and we encounter class imbalance problem. In other words, a dataset that has unequal samples in classes are imbalanced. This is not a problem when the difference is not big. However, when one or more classes are infrequent, many models do not work too well at identifying the minority classes. For example, in p-site prediction, mostly, preprocessed phosphorylation datasets are imbalanced and the number of the negative samples is much greater than positive samples.

In the following, three most used approaches to deal with class Imbalanced problem are introduced: Up sampling: It generates additional data for minority class either by making copies of the minimum class or by creating synthetic data which can represent samples of minimum class.

Down sampling: It removes data from the majority class either by picking them up randomly or using approaches to select appropriately to handle the issue. Usually, this has been a basic method in p-site prediction, which balanced negative samples by randomly selecting them to become equal to positive samples in numbers.

Customize loss function: This is a group/series of methods to deal with imbalance problem in machine learning. These sorts of methods tried to customize the loss function by assigning larger weights to minority classes in order to overcome the issue. However, recently, by emerging Deep Neural Networks (DNN), big training set has become crucial and important to them. Furthermore, customized losses demonstrated better performance and have attracted more attention than up sampling and down sampling approaches.

8

Fig 5. Data preprocessing flow containing balancing step.

4. Evaluation

The evaluation metrics of protein p-sites are classified into five methods using different attributes: Accuracy (ACC), Sensitivity (SN), Specificity (SP) the Matthews Coefficients of Correlation (MCC), and the area under the ROC curve (AUC). These metrics are evaluated with a confusion matrix that compares the actual target values with those predicted by a model. The number of rows and columns in this matrix depends on the number of classes. From the confusion matrix we end up with four values:

True positive (TP): Indicates the number of positive samples that the model classified them correctly. False Positive (FP): Indicates the number of negative samples that the model classified them incorrectly. True Negative (TN): Indicates the number of negative samples that the model classified them correctly. False Negative (FN): Indicates the number of positive samples that the algorithm classified them incorrectly.

The ACC metric defined in (1) as the ratio of both true positive (TP) and true negative (TN) to the total number of cases examined which is False negatives (FN) and false positive (FP).

푇푃 + 푇푁 퐴푐푐푢푟푎푐푦 = 푇푃 + 푇푁 + 퐹푃 + 퐹푁 (1)

The SN or Recall is the proportion of true positive prediction to all positive cases (2). 푇푃 푅푒푐푎푙푙 = 푇푃 + 퐹푁 (2)

The SP defined in (3). It calculates the proportion of samples which got predicted true to all negative samples.

푇푁 푆푝푒푐푖푓푖푐푖푡푦 = 푇푁 + 퐹푃 (3)

The Precision metric defined in (4). It calculates the proportion of true positive samples to all cases were predicted as the positive.

9

푇푃 푃푟푒푐푖푠푖표푛 = 푇푃 + 퐹푃 (4)

The F1-score is combination of Precision and Recall which is defined as in (5). This metric facilitates the process. It can be used to compare the performance of methods with a single number.

2 × 푃푟푒푐푖푠푖표푛 × 푅푒푐푎푙푙 퐹1 = 푃푟푒푐푖푠푖표푛 + 푅푒푐푎푙푙 (5)

Two SN and SP measures are used to plot the ROC curve. And AUC is used to determine the model performance. Furthermore, for binary classification, there is a more elegant solution: Treat the true class and the predicted class as two (binary) variables, and compute their Correlation Coefficient (6). The higher the correlation between true and predicted values, the better the prediction. This is the rechristened MCC when applied to classifiers.

푇푃 × 푇푁 − 퐹푃 × 퐹푁 푀퐶퐶 = √(푇푃 + 퐹푁)(푇푃 + 퐹푃)(푇푁 + 퐹푁)(푇푁 + 퐹푃)

(6)

Fig 6. Flowchart evaluation protocol and calculating evaluation metrics.

4.1 Model evaluation Basically, dataset splits into two sets: a Train-Valid and a test set. Then, the Train-Valid set splits into two subsets: a train set and a valid set. This part can be done by k-fold Cross-Validation or a simple Train-Valid split.

The basic procedure is that the train set is used to train models and the valid set is used for the evaluating of the trained models. After selecting and picking up the best model with respect to the valid set result, we need to evaluate it on the test set. If the valid

10 set evaluation results are different from the test set, it means that the model is overfitted on the valid set. By the way, at the end, we should report the test set and there shouldn't have been much difference between the valid set and the test set results.

Cross-validation is a resampling method used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. Specific values for k can be chosen.

Considering the scenario of 5-Fold cross validation (k=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to evaluate the model and the rest are used to train the model. In the second iteration, 2nd fold is used as the valid set while the rest serve as the training set. This process is repeated until each fold of the 5 folds has been used as the valid set. Each sample is given the opportunity to be used in the hold-out set one time and used to train the model k-1 times.

The k-fold Cross-Validation is usually used when the amount of Train-Valid data is limited. On the contrary, when dealing with huge amounts of data, we do not need to have a big valid set. In other words, the proportion of Train-Valid split sometimes can go below 1% for the valid set. This approach is mostly used when massive amounts of data are accessible. But for low data regimes, they usually split with proportions of 70%-30%. In this way, it’s better to use k-fold validation method.

In summary, we should use k-fold for low data regimes and use Train-Valid split with small percentages for valid sets when we have access to lots of data. Most of the methods for classified p-sites also use k-fold. Fig 6 demonstrate evaluation metrics and k- fold cross validation Train-Valid split process.

5. Methods for predicting phosphorylation sites In following sections, we reviewed the presented methods for classification of p-sites by dividing them into two main categories: Computational and machine learning. Likewise, machine learning methods are also divided into two approaches: Conventional and End-to-End.

5.1 Computational methods Innovative methods based on statistical approaches have been used in many studies. In [65], a statistically repetitive method, a set of phosphorylated peptide sequences to extract the patterns and a set of peptide sequences to evaluate the predictions were used. They mapped two sets of sequences to the position-weight matrix so that in the matrices, the number of repetitions of each residue was determined from 6 positions higher to 6 positions lower than each phosphorylated site (it means their window size for each peptide is 13 amino acids length). Then they formed a binary matrix based on these two matrices. This final matrix indicates the probability of observing a specific residue around a phosphorylated site by examining this matrix and comparing it with other phosphorylated sites. They extracted some of the residues found around most phosphorylated sites and used them to predict new p-sites.

In a work [66], presented a new method for predicting phosphorylated sites by collecting four background data sets including phosphorylated and non-phosphorylated sequences. They chose a given length of 13 for windows around phosphorylated sites. Initially, they formed weight-position matrices; then, they extracted patterns. By scoring those patterns and deleting some of them, they finally reported a series of patterns as the output during an iterative cycle.

In [67], with citing that the number of patterns to be examined around each position are growing exponentially based on the length of the window, they refer to two algorithms which developed to find phosphorylation patterns, namely the Motif-X algorithm and The MoDL algorithm. They supposed that these algorithms do not detect all patterns and some patterns remain hidden from biologists. Thus, they introduced a new algorithm called Motif-ALL to discover and report all possible patterns.

5.2 Machine learning methods Most of algorithms used for phosphorylation prediction are based on machine learning. Moreover, by explosions of the Deep Learning (DL) in the early of 2010s, machine learning gets popular even more than before. Machine learning is generally the ability of machine to do actions based on prior knowledge and experience [68]. There are more than 40 different computational methods for predicting phosphorylated sites many of which use various machine learning methods including: linear regression and logistic regression, support machine vector, random forests and k-nearest Neighbor [61].

In general, there are two main directions to predict Phosphorylation in machine learning: Conventional and End-to-End.

11

5.2.1 Conventional approach For protein phosphorylation prediction, various types of conventional approaches have been studied. The most common of these approaches uses different and various methods as feature extraction [69]. In this paper, we reviewed almost 16 feature extraction approaches which extracted according physicochemical, sequences, and structural features and convert each sequence into numerical vectors for feeding them as an input to classified by a model. We have tried to introduce more important and practical methods of feature extraction in this paper, but it is clear that there may be various other methods for information extraction. In the following, the most important ones are explained. 5.2.1.1 Physicochemical property-based features Average Accumulated Hydrophobicity (ACH): Average cumulative hydrophobicity (ACH) quantifies the tendency of amino acid surrounding S,T or Y to be exposed to solvent [79]. ACH is computed by averaging the cumulative hydrophobicity indices around the p-site for different sliding windows. It should be mentioned that every site is located in the center of the sliding windows [75], [80].

Encoding scheme Based on Attribute Grouping (EBAG): Encoding scheme of protein sequences was considered hydrophobicity attribute and divided amino acid residue into 4 classes based on physicochemical property. The hydrophobic class c1 = {A, F, G, I, L, M, P, V, W} polar class c2 = {C, N, Q, S, T, Y} acidic class c3 = {D, E} and basic class c4 = {H, K, R} were discussed [70], [71]. [62], [72] used this approach and transform each sequence as binary sequences.

Position Weight Amino Acid composition (PWAA): Position information of each amino acid is another key point that shall be considered in feature extraction. PWAA approach can reveal sequence order information around P, S and Y sites. PWAA can be declare from this formula [72].

퐿 1 |푗| 퐶 = ∑ 푥 (푗 + ) , 푗 = −퐿, … , 퐿 푖 퐿(퐿 + 1) 푖,푗 퐿 푗= −퐿 (7)

5.2.1.2 Sequence-based features Composition of K-Spaced Amino Acid Pairs (CKAAP): The encoding of CKAAP is pretty easy, that can directly be calculated from the sequence pieces of phosphorylated and non-phosphorylated sites. CKSAAP is a critical encoding scheme feature selection in lots of prediction tasks, especially in representing short sequence residue in protein sequence or pieces. A sequence piece may contain 400 types (AxA, AxC, AxD, …, OxO) of K-spaced amino acid pairs (i.e. the pairs separated by K other amino acids) [73] . [74] used 6-spaced AAP for selecting important features and in a research [62] CKAAP equation was proposed as follows in formula 8.

푁푢푚 (퐴푖퐴푗) 푓 = 푖, 푗 = 1,2, … 21 푖,푗 퐿 − 퐾 − 1 (8)

Amino acid occurrence Frequency (AF): This method is the most common one in feature selections methods, which calculates each amino acid frequency in protein sequences [74].

Furthermore [74] proposed AF formula 9 as follows: 푐 푣 = 푖 푖 = 1, … . , 20 푖 푙푒푛(푠푒푞) (9)

K-Nearest Neighbor (KNN): The most popular feature selection method which is used in many different machine learning problems even in PTM and phosphorylation classification is KNN. It is classified sequences based on their distance. The algorithm classifies sequences by looking at k of nearest neighbor sequences by finding out majority votes from nearest neighbors that have similar attributes and the shortest distance as those used to map the items. In [74], they did 7-nearest neighbors of both positive and negative datasets and the average distance between both datasets were considered as features. In [75], they did 6-nearest neighbors for feature extracting. Furthermore, in order to calculate distances between sequences they used the similarities between each of

12 the sequences in BLOSUM64 matrix. Respectively, the average distance between positive and negative datasets were calculated as KNN score.

Overlapping Properties (OP): OP clustered each protein based on their chemical attributes [80].

Sequence Features (SF): SF is a simple way to extract information from a sequence amino acid according to encoding them into 20-bits. The number of feature vectors for each sequence is based on the window size where have chosen [80], [86].

Quasi Sequence Order (QSO): It reflects the sequence order using four physicochemical properties: hydrophobicity, hydrophilicity, polarity, and side-chain volume. Features are derived from a distance matrix created by computing the distance between each pair of the 20 amino acids using Schneider-Wrede physicochemical distance matrix [80], [87], [88].

Amino acid index property (AIP): AIP [89] is a database which represents protein numerical indices of physicochemical and biochemical properties of amino acids. A work [90] chose 15 and another work [62] chose 31 important and effective amino acid indices and producing feature vectors.

Position Specific Propensity Matrix (PSPM): Position specific scores are a measurement to quantify the resemblance between sequences and display representation between protein sequences which were first proposed by [9]. PSSM specifies the scores to observe the particular p-site at specific positions and a pair order. A study [62] utilized PSPM feature as input of their model for S, T and Y sites. 5.2.1.3 Structural level features Protein Disorder Features (DF): All PTM modifications include p-sites located within disorder positions [76]. Protein disorders were used as features in many works such as: [77], which created phosphorylation predictor-DISPHO and used protein disorders as features.

In work [75], disorder information was extracted using VSL2B [78] and the disorder scores for both positive and negative dataset were calculated and the average scores were used for different window sizes as final features.

Shannon Entropy: Entropy in information theory is a numerical measure of the amount of information or randomness of a random variable. To be more precise, the entropy of a random variable is the average value (Expected value) of the amount of information obtained from observing it. In other word, when the entropy of a random variable is high, we have more ambiguity about that random variable. Therefore, by observing the definite result of that random variable, more information is obtained, so when the entropy of a random variable is high, more information will be obtained from its definite observation [81]. In science and engineering in general, entropy is a measure of the degree of ambiguity or disorders [82]. Claude Shannon, in his revolutionary paper A Mathematical Theory of Communication in 1948, introduced Shannon entropy and became the founder of information theory. [83] used sequence-wise entropy as features. Furthermore, in a research by [84], they used Shannon entropy as evolutionary feature based vectors.

Relative Entropy (RE): It is known for Kullback Leibler is aggregating entropies for more than 20 sites in proteins. In [83], they used RE as feature which represented information from a protein sequence.

Accessible Surface Area (ASA): Amino acids can be exposed based on 3-dimensional protein structure. Therefore, p-sites may be exposed amino acids that can interact with enzymes [80]. On the other hand, in [85], they proposed an Rvp-net algorithm which is used to extract ASA features from protein sequences before converting them into sliding windows.

5.2.2 Conventional methods After feature extraction, models which need to predict p-sites should be implemented. We reviewed ML based methods in this area, as we mentioned, the most popular model which is currently used for predicting sites is Support Vector Machines (SVM). Support Vectors are a set of points in the n-dimensional space of data that define the boundaries of categories. The data is bounded and categorized based on them, and by moving one unit of the data, category output may change. The SVM is a kind of maximum margin classifier. The maximum margin classifier has a simple function, in which data is separated by a hyperplane; provided they have the highest margin over the data. But the maximum margin classifier cannot be used on all datasets, because the data must be linearly separable, which is not the case. If the data cannot be separated linearly, it can be transferred to a higher dimension. By converting and mapping data to a higher dimension, they are transformed from nonlinear separators to linear separators. This is because as the dimension increases, the data becomes more fragmented and open, and this dimension increase can be continued until they become linearly separated [91], [92]. The SVM is widely used in bioinformatics, especially in PTM issues. By this method, the protein sequences are filtered and converted to feature vectors of constant length and subjected to classified into correct classes which one is considered as phosphorylated and zero is non-phosphorylated [72], [75], [86].

13

Random forests (RF) are one of the well-known and important machine learning algorithms that are used in many issues in the field of bioinformatics and life sciences. Random forest is a supervised learning algorithm. As the name implies, this algorithm builds a forest randomly. Forests are actually a group of decision trees. Random forest (RF) makes several decision trees and merges them together to make more accurate and stable predictions [93].

As we mentioned, in recent years, kinases specific methods were considered because in general, some protein prediction sites remain unexplored and kinases give us a lightening way to find them. [94] Netphos and NetphosK [95] both used DNN based on consensus sequences and combined it with mass spectrometry experimental methods. In Quokka framework [34] they used Logistic Regression to classified 43 S/T and 22 Y kinases family sites. [96] proposed the classification method based on SVM and used consensus sequence structure as features for four kinases groups and families. The best accuracies which the model could predict were from 83 to 95% at the kinase family level, and 76–91% at the kinase groups. [97] proposed a method for four kinases family based on RF which extract features with Auto Covariance (AC) and performed over 90% accuracy.

To recognize protein p-sites in non-kinases proteins, a research by [72] proposed method based on SVM in viruses. They used EBAG and PWAA for extracting physicochemical and sequence information of viral proteins over p-sites. They used 10-fold- cross-validation for different window sizes from 15 – 27 lengths. They got the best results for window size 23 with accuracy of 88.8%, 95.2% and 97.1% for Phosphoserine, Phosphothreonine and Phosphotyrosine respectively.

Furthermore, work [74] combined AF and CKAAP together and provided AF-CKAAP as features. They believed SVM can classify rice protein p-sites. Their work was named Rice_Phospho 1.0 which achieved 82% accuracy.

Paper [75] proposed Granularity Support Vector Machine (GSVM) for predicting p-sites. They used KNN, AF, DF and ACH of every phosphorylation position for building training samples. For partitioning data in high-dimensional feature spaces they used kernel fuzzy c-means clustering and by applying this method, they tried to extract features for dataset representation in a sample space. The method was applied on plants and animal dataset types and could achieve 80% and 85% accuracy, respectively.

In [98] by their method PhospredRF, they used information extracted from PSSW and trained individuals RF with odd window sizes from 9 – 25 amino acids, they got approximately 70% accuracy for 26 protein sequences. RF-phos-1.0 transformed each amino acid to vectors by using eight kinds of algorithms (H, RE, ASA, OP, ACH, SF, QSO and the sequence order coupling number of each sequence) based on 9 amino acid windows size. It was mentioned that the best feature for S and T sites is SF. Then these features are used as RF input with 10-fold-cross validation. The accuracy for the model is approximately 80% for S, T and Y sites [83]. Moreover, in the RF-phos-2.0, their RF model has improved by using window sizes of 7 to 21 amino acids [80].

Microbial Phosphorylation Site predictor (MPsites) was proposed to recognize microbial p-sites with different kind of sequence features. In order to convert each sequence to numerical vectors, they used various sequences encoding strategies including AF, BE, AIP, PWAA. They used naïve bayes, SVM, neural networks, decision tree and RF algorithms to recognize S and T p-sites. Results showed that RF has better performance than the other algorithms. It got 68% accuracy for S sites and 75% accuracy for T sites [99]. Fig 7 shows feature extraction process and in the following fig 8 demonstrate conventional approach for flow for p-sites predictions.

2 3 4 i k

Phos Phos Phos

( 2) ( 2) ( 2) ( 2) ( 2) ( 2)

Fig 7. A common procedure for feature extracting.

14

uilding the Predictor Selecting Appropriate Classifier s

andom orest ( ) Support ector machine (S M) Artificial neural networks (ANN)

Parameter Tuning Optimizing the Classifier (s) Parameters indow Size Optimization

eature Selection Selecting a Subset of Most Informative Descriptions for Predicting the PTMs

edesigning the Classifier

Fig 8. The procedure of p-site prediction by conventional machine learning methods.

5.2.3 End-to-End approach End-to-End learning becomes a hot topic in machine learning field by taking the advantage of DL. DNN for short, is almost the same as the previous Artificial Neural Networks (ANNs) with minor modifications to be more effective and practical in representation learning. Similar to the human brain, each DNN’s layer (or group of layers) could be used to learn the hierarchical abstraction for downstream task. In other words, we just feed the input sequences to a DNN and it does the feature selections inside of the layers by itself. That is why it is called End-to-End learning.

DL has made great success in solving problems that have resisted the best efforts of the AI community for years especially in different biological problems [100]–[106] . In recent years, there has been breakthroughs in DL which is the field applied to PTM classification especially protein phosphorylation prediction. Generally, these architectures are used for feature extraction and classification tasks at the same time. DL methods are multi-level representation learning methods that are achieved by stacking simple but non-linear layers. As mentioned earlier, the main aspect of these approaches is that the layer of features is not designed by human engineers or manually. These layers are acquired from input data to extract best patterns accurately and quickly.

Among all DL architectures, Convolutional Neural Networks (CNN), recurrent neural networks (RNN) and long short memory (LSTM) are the famous ones. However, CNN has attracted more attention in PTM field [60] [61] [84].

A research by [60] provided a DL architecture called MusiteDeep to predict general and kinases-specific families’ places. They used the window size of 33 amino acids for input sequences and presented multi-layer CNN and attention layers architecture. Sequences of proteins containing phosphorylated or non-phosphorylated sites are considered as binary classification problems. In paper [61], in contrast to multi-layer models of MusiteDeep, they used dense CNN blocks that can show different representations of sequences. They called it DeepPhos which could improve the performance of MusiteDeep using different window sizes of 15, 33 and 51. Moreover, the phosTransfer [84] is a DL based framework which constructs pre-train architecture with CNNs based on kinases hierarchy and transfer learning. They believed that protein kinases within the same subfamily, family and group probably share similar local sequential and structural patterns and with this presumption they used pre-train feature extractions for fine tuning for lower levels of kinases. The phosTransfer can achieve 0.89 AUC scores on average.

The DeepPPSite is a DL model based for predicting p-sites by considering the sequence information [62]. They used stacked LSTM architecture with one hot encoding, PSPM, EBGW, CKSAAP and AIP input features and could achieve 0.358 MCC value for S, 0.356 for T and 0.350 for Y p-sites.

Furthermore, there has been some works such as [107] which used hybrid architectures. they presented a specific End-to-End CNN- LSTM architecture and called it DeepIPs, to accurately predict p-sites in host cells infected with SARS-COVID19 [108], [109].

15

They utilize two approaches in Natural Language Processing as word embedding layers. First is Supervised Embedding Layers and the second one is Unsupervised Embedding Layers based on the Glove [110], the fastText [111], [112] and the Word2vec [113] pre-train word embedding methods. 80.45 for S/T and 75.22 for Y accuracy was achieved.

[114] compared and analyzed human engineered representations and deep representations for the reorganization of Phosphoserine sites (PhosS). They used DNN architecture like CNN, LSTM and RNN models for feature representation from sequences and compare their performance with human-engineered representations.

Even though, most DL approaches have built their architecture with large volumes of data, a work [115] with little amount of data from only two kinases family and with considering windows with length of 9, proposed a simple DNN architecture. Their model has been able to achieve nearly 80% accuracy. It means that DL can also perform well on low data regions. Fig 9 shows End-to- End method flow for p-sites prediction.

nd to nd

MP D K ATAMAA N SPDN K A T NIP K MP T KD A DAP KKI NS CN KC NK I N TD SD MATA NI KK KSK K DSP K KD IMA NAA KN Preprocessed dataset

yper Designing deep Parameters Training models learning arcitecture Tuning

Selecting predictive model valuating models

Fig 9. The procedure of p-site prediction by End-to-End methods

6. Tools for protein phosphorylation prediction Due to the high cost and low speed of using experimental methods to recognize p-sites, in recent years many computational online tools have been developed to help and increase the quality. Table 2 introduced conventional and End-to-End models and publicly accessible online tools for protein p-sites prediction.

16

Table 2 Introduce models and tool for p-site prediction.

* Refers to define general type or kinases type of proteins, G for general sets and K for kinases family.

Type / Acronym Method URL G/K* Description

Netphos Conventional ANN http://www.cbs.dtu.dk/services/NetPhos/ K

NetphosK Conventional ANN http://www.cbs.dtu.dk/services/NetPhosK/ K

[96] Conventional SVM http://www.ngri.re.kr/proteo/PredPhospho.html K

[97] Conventional RF ------K

[72] Conventional SVM ------G

Rice-phospho 1.0 Conventional RF http://bioinformatics.fafu.edu.cn/rice_phospho1.0 G

GSVM Conventional SVM ------G

Phosphorylation onlinetools and models

RF-phos-1.0 Conventional RF ------G

RF-phos-2.0 Conventional RF http://bcb.ncat.edu/RF Phos/ G

Deepphos End to End CNN https://github.com/USTCHIlab/DeepPhos G/K

https://www.musite.net/ CNN + MusitDeep End to End G/K attention https://github.com/duolinwang/MusiteDeep_web

[115] End to End DNN ------G

PhosTransfer [84] Conventional CNN https://github.com/yxu132/PhosTransfer K

deepPsites [62] Conventional LSTM https://github.com/saeed344/DeepPPSite G

https://github.com/linDinggroup/DeepIPs. deepIPs CNN- End to End G [107] LSTM http://lin-group.cn/server/DeepIPs/

GPS 5.0 [116] Conventional LR http://gps.biocuckoo.cn K

MPSite [99] Conventional RF http://kurata14.bio.kyutech.ac.jp/MPSite/ G

17

https://ssc.umt.edu.pk/LifeSciences/Our- iPhosT-PseAAC Conventional ANN G Research-Project.aspx

Quokka [34] Conventional LR http://quokka.erc.monash.edu/#webserver K

PhosContext2vec Conventional SVM http://phoscontext2vec.erc.monash.edu/ K/G [117]

PhosphoSVM [86] Phosphorylation SVM http://sysbio.unl.edu/PhosphoSVM/ G

phos_pred [118] Phosphorylation RF http://bioinformatics.ustc.edu.cn/phos_pred/ G

7. Conclusion Almost all proteins contain phosphorylation, which is responsible for critical functions in the cell. Various diseases can be caused by disruptions of this modification. The discovery of phosphorylation by high-throughput experimental methods is labor-intensive and time-consuming. Therefore, it's important to have powerful methods and tools to predict phosphorylation. This paper briefly introduced some popular databases consisting of general PTM and specific type phosphorylation. Moreover, we introduced two important databases, EPSD and dbPTM and compared them in order to analyze their distribution of p-sites.

Furthermore, we have given a brief overview of protein p-sites prediction by machine learning. In fact, machine learning approaches are mainly divided into classical methods which mostly use feature extraction, and End-to-End methods. In addition to machine learning, we slightly discussed computational methods as well. Computational methods have statically basis which is slow and have a high time complexity. On the other hand, machine learning algorithm which are quite popular these days have attracted a lot of attention for p-sites prediction including like SVM, Logistic Regression, and Random Forest. In conventional methods, SVM has shown better performance, although, it is clear that feature extraction step would have significant impact on the final result. Therefore, this study introduced 15 important and mostly used feature extraction methods based on the structural level, sequential based and Physicochemical property-based categories. Contrastingly, CNN and RNN based models which have been famous in the End-to-End learning, can predict p-sites directly form the raw input sequences without any feature extraction. Therefore, researchers who turned to D ’s End-to-End approaches for research direction, have reason to believe that feature extraction methods are time-consuming and need expert knowledge but the End-to-End ones do not require specialized knowledge.

Next, evaluating the methods by different metrics for predicting p-sites approaches were reported to give standard metrics for comparison.

In contrast to many machine learning domains, considering methods on different datasets with different preprocessing, as well as different splitting in The Train set and Test set for p-sites prediction, it is not easy to accurately compare them and choose the best method. Thus, for the fair and precise competition, it is suggested in this article that preparing a uniform, unique and well-defined dataset for p-site prediction by machine learning can be considered as a major step for the future research of this field.

8. References

[1] P. Craveur, J. ebehmed, and A. . de revern, “PTM-SD: a database of structurally resolved and annotated posttranslational modifications in proteins,” Database, vol. 2014, 2014.

[2] A. Sreedhar, . K. iese, and T. itosugi, “ nzymatic and metabolic regulation of succinylation,” Genes Dis.,

18

vol. 7, no. 2, pp. 166–171, 2020.

[3] J. Li et al., “SysPTM 2.0: an updated systematic resource for post-translational modification,” Database, vol. 2014, 2014.

[4] M. Audagnotto and M. Dal Peraro, “Protein post-translational modifications: In silico prediction tools and molecular modeling,” Computational and Structural Biotechnology Journal. 2017, doi: 10.1016/j.csbj.2017.03.004.

[5] . Zhou, . Du, . Xue, . Miao, T. ei, and P. Zhang, “Identification of malonylation, succinylation, and glutarylation in serum proteins of acute myocardial infarction patients,” PROTEOMICS–Clinical Appl., vol. 14, no. 1, p. 1900103, 2020.

[6] H. Huang et al., “iPTMnet: an integrated resource for protein post-translational modification network discovery,” Nucleic Acids Res., vol. 46, no. D1, pp. D542–D550, 2018.

[7] S. Johnson, Algorithms to detect motifs and predict post-translational modification sites. Wake Forest University, 2015.

[8] . A. Khoury, . C. aliban, and C. A. loudas, “Proteome-wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database,” Sci. Rep., vol. 1, no. 1, pp. 1–5, 2011.

[9] Y. Xu et al., “Prediction of posttranslational modification sites from amino acid sequences with kernel methods,” J. Theor. Biol., vol. 344, pp. 78–87, 2014.

[10] N. arriol‐Mathis et al., “Annotation of post‐translational modifications in the Swiss‐Prot knowledge base,” Proteomics, vol. 4, no. 6, pp. 1537–1550, 2004.

[11] K.-Y. Huang et al., “dbPTM in 20 9: exploring disease association and cross-talk of post-translational modifications,” Nucleic Acids Res., vol. 47, no. D1, pp. D298–D308, 2019.

[12] H. Xu et al., “PTMD: a database of human disease-associated post-translational modifications,” Genomics. Proteomics Bioinformatics, vol. 16, no. 4, pp. 244–251, 2018.

[13] . Duan and D. alther, “The roles of post-translational modifications in the context of protein interaction networks,” PLoS Comput. Biol., vol. 11, no. 2, p. e1004049, 2015.

[14] M. Alleyn, M. reitzig, . ockey, and N. Kolliputi, “The dawn of succinylation: a posttranslational modification,” Am. J. Physiol. Physiol., vol. 314, no. 2, pp. C228–C232, 2018.

[15] V. Pejaver, W. Hsu, F. Xin, A. K. Dunker, . N. Uversky, and P. adivo ac, “The structural and functional signatures of proteins that undergo multiple events of post‐translational modification,” Protein Sci., vol. 23, no. 8, pp. 1077–1093, 2014.

[16] P. Minguez et al., “PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins,” Nucleic Acids Res., vol. 43, no. D1, pp. D494–D502, 2015.

[17] M. ang, . Jiang, and X. Xu, “A novel method for predicting post-translational modifications on serine and threonine sites by using site-modification network profiles,” Mol. Biosyst., vol. 11, no. 11, pp. 3092–3100, 2015.

[18] M. Strumillo and P. eltrao, “Towards the computational design of protein post-translational regulation,” Bioorg. Med. Chem., vol. 23, no. 12, pp. 2877–2882, 2015.

[19] . S. Johnson, “Protein modification by SUMO,” Annu. Rev. Biochem., vol. 73, no. 1, pp. 355–382, 2004.

[20] I. Ahmad et al., “MAP es: An efficient method to analyze protein sequence around post‐translational modification sites,” J. Cell. Biochem., vol. 104, no. 4, pp. 1220–1231, 2008.

[21] P. Nickchi, M. Jafari, and S. Kalantari, “P IMAN .0: Post-translational modification Enrichment, Integration and Matching ANalysis,” Database, vol. 2015, 2015.

[22] K. S. Kamath, M. S. asavada, and S. Srivastava, “Proteomic databases and tools to decipher post-translational modifications,” J. Proteomics, vol. 75, no. 1, pp. 127–144, 2011.

[23] T. M. Karve and A. K. Cheema, “Small changes huge impact: the role of protein posttranslational modifications in cellular homeostasis and disease,” J. Amino Acids, vol. 2011, 2011.

[24] S. Schedin‐ eiss, . inblad, and . O. T ernberg, “The role of protein glycosylation in Alzheimer disease,” FEBS J., vol. 281, no. 1, pp. 46–62, 2014.

19

[25] K. J. alkenberg and . . Johnstone, “ istone deacetylases and their inhibitors in cancer, neurological diseases and immune disorders,” Nat. Rev. Drug Discov., vol. 13, no. 9, pp. 673–691, 2014.

[26] . Park, J. Tan, . arcia, . Kang, . Salvesen, and Z. Zhang, “ egulation of histone acetylation by autophagy in Parkinson disease,” J. Biol. Chem., vol. 291, no. 7, pp. 3531–3540, 2016.

[27] D. Popovic, D. ucic, and I. Dikic, “Ubiquitination in disease pathogenesis and treatment,” Nat. Med., vol. 20, no. 11, pp. 1242–1253, 2014.

[28] S. Tenreiro, K. ckermann, and T. . Outeiro, “Protein phosphorylation in neurodegeneration: friend or foe?,” Front. Mol. Neurosci., vol. 7, p. 42, 2014.

[29] N. lom, T. Sicheritz‐Pontén, . upta, S. ammeltoft, and S. runak, “Prediction of post‐translational glycosylation and phosphorylation of proteins from the amino acid sequence,” Proteomics, vol. 4, no. 6, pp. 1633–1649, 2004.

[30] P. A. evene and C. . Alsberg, “The cleavage products of vitellin,” J. Biol. Chem., vol. 2, no. 1, pp. 127–133, 1906.

[31] K. . arber and J. inehart, “The abcs of ptms,” Nat. Chem. Biol., vol. 14, no. 3, pp. 188–192, 2018.

[32] C. Peng et al., “The first identification of lysine malonylation substrates and its regulatory enzyme,” Mol. Cell. proteomics, vol. 10, no. 12, 2011.

[33] . Ardito, M. iuliani, D. Perrone, . Troiano, and . o Muzio, “The crucial role of protein phosphorylation in cell signalingand its use as targeted therapy ( eview),” International Journal of Molecular Medicine. 2017, doi: 10.3892/ijmm.2017.3036.

[34] F. Li et al., “ uokka: A comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome,” Bioinformatics, 2018, doi: 10.1093/bioinformatics/bty522.

[35] . M. erguson and N. S. ray, “Kinase inhibitors: the road ahead,” Nat. Rev. Drug Discov., vol. 17, no. 5, pp. 353–377, 2018.

[36] S. amazi and J. Zahiri, “Posttranslational modifications in proteins: resources, tools and prediction methods,” Database, vol. 2021, 2021.

[37] H. Li et al., “SysPTM: a systematic resource for proteomic research on post-translational modifications,” Mol. Cell. Proteomics, vol. 8, no. 8, pp. 1839–1849, 2009.

[38] T. tempspacetempspaceS Keshava Prasad et al., “ uman protein reference database—2009 update,” Nucleic Acids Res., vol. 37, no. suppl_1, pp. D767–D772, 2009.

[39] S. Lin et al., “ PSD: A well-annotated data resource of protein phosphorylation sites in eukaryotes,” Briefings in Bioinformatics. 2021, doi: 10.1093/bib/bbz169.

[40] B. Boeckmann et al., “The SWISS-P OT protein knowledgebase and its supplement Tr M in 2003,” Nucleic Acids Res., vol. 31, no. 1, pp. 365–370, 2003.

[41] K.-Y. Huang et al., “dbPTM 20 6: 0-year anniversary of a resource for post-translational modification of proteins,” Nucleic Acids Res., vol. 44, no. D1, pp. D435–D446, 2016.

[42] J. S. aravelli, “The SID Database of Protein Modifications as a resource and annotation tool,” Proteomics, vol. 4, no. 6, pp. 1527–1533, 2004.

[43] T. D. Nguyen, O. Vidal-Cortes, O. allardo, J. Abian, and M. Carrascal, “ ymP OS 2.0: An update of a phosphosite database of primary human T cells,” Database, 2015, doi: 10.1093/database/bav115.

[44] A. Zanzoni, G. Ausiello, A. Via, P. F. Gherardini, and M. Helmer-Citterich, “Phospho3D: A database of three- dimensional structures of protein phosphorylation sites,” Nucleic Acids Res., 2007, doi: 10.1093/nar/gkl922.

[45] H. Dinkel et al., “Phospho. M: A database of phosphorylation sites-update 20 ,” Nucleic Acids Res., 2011, doi: 10.1093/nar/gkq1104.

[46] K. Y. Huang et al., “ egPhos 2.0: An updated resource to explore protein kinase-substrate phosphorylation networks in mammals,” Database, 2014, doi: 10.1093/database/bau034.

[47] H. Cheng, W. Deng, Y. Wang, J. en, Z. iu, and . Xue, “dbPPT: a comprehensive database of protein phosphorylation in plants,” Database, vol. 2014, 2014.

20

[48] S. Ullah et al., “dbPA : an integrative database of protein phosphorylation in animals and fungi,” Sci. Rep., vol. 6, no. 1, pp. 1–9, 2016.

[49] P. ornbeck, . Zhang, . Murray, J. M. Kornhauser, . atham, and . Skrzypek, “PhosphoSitePlus, 20 4: mutations, PTMs and recalibrations,” Nucleic Acids Res., vol. 43, no. D1, pp. D512–D520, 2015.

[50] H. Dinkel et al., “Phospho. LM: a database of phosphorylation sites—update 20 ,” Nucleic Acids Res., vol. 39, no. suppl_1, pp. D261–D267, 2010.

[51] U. Consortium, “UniProt: a worldwide hub of protein knowledge,” Nucleic Acids Res., vol. 47, no. D1, pp. D506–D515, 2019.

[52] B. Bodenmiller et al., “PhosphoPep—a phosphoproteome resource for systems biology research in Drosophila Kc167 cells,” Mol. Syst. Biol., vol. 3, no. 1, p. 139, 2007.

[53] R. Oughtred et al., “The io ID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions,” Protein Sci., vol. 30, no. 1, pp. 187–200, 2021.

[54] Y. Bai et al., “ PD: A comprehensive phosphorylation database in fungi,” Fungal Biol., vol. 121, no. 10, pp. 869–875, 2017.

[55] . J. de rui n, “Medicago truncatula proteomics: introduction,” Model Legum. Medicago truncatula, p. 1069, 2020.

[56] Q. Yao et al., “P3D 3.0: rom plant phosphorylation sites to protein networks,” Nucleic Acids Res., vol. 42, no. D1, pp. D1206–D1213, Jan. 2014, doi: 10.1093/nar/gkt1135.

[57] . nad, J. unawardena, and M. Mann, “P OSIDA 20 : the posttranslational modification database,” Nucleic Acids Res., vol. 39, no. suppl_1, pp. D253–D260, 2010.

[58] P. Durek et al., “PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update,” Nucleic Acids Res., vol. 38, no. suppl_1, pp. D828–D834, 2010.

[59] . upta, . irch, K. apacki, S. runak, and J. . ansen, “O-GLYCBASE version 4.0: a revised database of O- glycosylated proteins,” Nucleic Acids Res., vol. 27, no. 1, pp. 370–372, 1999.

[60] D. Wang et al., “MusiteDeep: A deep-learning framework for general and kinase-specific phosphorylation site prediction,” Bioinformatics, vol. 33, no. 24, pp. 3909–3916, 2017, doi: 10.1093/bioinformatics/btx496.

[61] F. Luo, M. Wang, Y. Liu, X.-M. Zhao, and A. i, “DeepPhos: prediction of protein phosphorylation sites with deep learning,” Bioinformatics, vol. 35, no. 16, pp. 2766–2773, 2019, doi: 10.1093/bioinformatics/bty1051.

[62] S. Ahmed, M. Kabir, M. Arif, Z. U. Khan, and D.-J. u, “DeepPPSite: a deep learning-based model for analysis and prediction of phosphorylation sites using efficient sequence information,” Anal. Biochem., vol. 612, p. 113955, 2021.

[63] H. D. Ismail, A. Jones, J. . Kim, . . Newman, and . K. C. Dukka, “Phosphorylation sites prediction using andom orest,” 20 5, doi: 0. 09 ICCA S.20 5.7344726.

[64] . i and A. odzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006.

[65] D. Schwartz and S. P. ygi, “An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets,” Nat. Biotechnol., vol. 23, no. 11, pp. 1391–1398, 2005.

[66] . C. Chen, K. Aguan, C. . ang, . T. ang, N. . Pal, and I. . Chung, “Discovery of protein Phosphorylation motifs through exploratory data analysis,” PLoS One, 2011, doi: 10.1371/journal.pone.0020025.

[67] Z. He, C. ang, . uo, N. i, and . u, “Motif-All: Discovering all phosphorylation motifs,” BMC Bioinformatics, 2011, doi: 10.1186/1471-2105-12-S1-S22.

[68] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science (80-. )., vol. 349, no. 6245, pp. 255–260, 2015.

[69] S. Jamal, . Ali, P. Nagpal, A. rover, and S. rover, “Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins,” J. Transl. Med., vol. 19, no. 1, pp. 1–11, 2021.

[70] S. C. an and X. . Zhang, “Characterizing the microenvironment surrounding phosphorylated protein sites,” Genomics,

21

Proteomics Bioinforma., 2005, doi: 10.1016/S1672-0229(05)03029-9.

[71] Z. H. Zhang, Z. H. Wang, Z. . Zhang, and . X. ang, “A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine,” FEBS Lett., 2006, doi: 10.1016/j.febslet.2006.10.017.

[72] S.-Y. Huang, S.-P. Shi, J.-D. Qiu, and M.-C. iu, “Using support vector machines to identify protein phosphorylation sites in viruses,” J. Mol. Graph. Model., vol. 56, pp. 84–90, 2015.

[73] Z. Chen, Y.-Z. Chen, X.-F. Wang, C. Wang, R.-X. an, and Z. Zhang, “Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs,” PLoS One, vol. 6, no. 7, p. e22930, 2011.

[74] S. Lin et al., “ ice-Phospho 1.0: A new rice-specific S M predictor for protein phosphorylation sites,” Sci. Rep., 2015, doi: 10.1038/srep11940.

[75] . Cheng, . Chen, and . Zhang, “Prediction of phosphorylation sites based on granular support vector machine,” Granul. Comput., 2021, doi: 10.1007/s41066-019-00202-5.

[76] A. K. Dunker et al., “The unfoldomics decade: An update on intrinsically disordered proteins,” 2008, doi: 10.1186/1471-2164-9-S2-S1.

[77] L. M. Iakoucheva et al., “The importance of intrinsic disorder for protein phosphorylation,” Nucleic Acids Res., 2004, doi: 10.1093/nar/gkh253.

[78] Z. Obradovic, K. Peng, S. ucetic, P. adivo ac, and A. K. Dunker, “ xploiting heterogeneous sequence properties improves prediction of protein disorder,” 2005, doi: 0. 002 prot.20735.

[79] T. Zhang, . Zhang, K. Chen, S. Shen, J. uan, and . Kurgan, “Accurate sequence-based prediction of catalytic residues,” Bioinformatics, vol. 24, no. 20, pp. 2329–2338, 2008.

[80] . D. Ismail, A. Jones, J. . Kim, . . Newman, and D. . Kc, “ -Phos: A Novel General Phosphorylation Site Prediction Tool ased on andom orest,” Biomed Res. Int., 2016, doi: 10.1155/2016/3281590.

[81] C. . Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mob. Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.

[82] J. A. Capra and M. Singh, “Predicting functionally important residues from sequence conservation,” Bioinformatics, vol. 23, no. 15, pp. 1875–1882, 2007.

[83] . D. Ismail, A. Jones, J. . Kim, . . Newman, and . K. C. Dukka, “Phosphorylation sites prediction using Random orest,” in 2015 IEEE 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2015, pp. 1–6.

[84] Y. Xu, C. Wilson, A. Leier, T. T. Marquez- ago, J. hisstock, and J. Song, “PhosTransfer: A Deep Transfer Learning Framework for Kinase-Specific Phosphorylation Site Prediction in ierarchy,” Adv. Knowl. Discov. Data Min., vol. 12085, p. 384, 2020.

[85] S. Ahmad, M. M. romiha, and A. Sarai, “ P-net: online prediction of real valued accessible surface area of proteins from single sequences,” Bioinformatics, vol. 19, no. 14, pp. 1849–1851, 2003.

[86] . Dou, . ao, and C. Zhang, “PhosphoS M: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine,” Amino Acids, vol. 46, no. 6, pp. 1459–1469, 2014.

[87] K.-C. Chou, “Some remarks on protein attribute prediction and pseudo amino acid composition,” J. Theor. Biol., vol. 273, no. 1, pp. 236–247, 2011.

[88] D.-S. Cao, Q.-S. Xu, and Y.-Z. iang, “propy: a tool to generate various modes of Chou’s PseAAC,” Bioinformatics, vol. 29, no. 7, pp. 960–962, 2013.

[89] S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama, and M. Kanehisa, “AAindex: Amino acid index database, progress report 2008,” Nucleic Acids Res., 2008, doi: 10.1093/nar/gkm998.

[90] M. M. asan, M. M. ashid, M. S. Khatun, and . Kurata, “Computational identification of microbial phosphorylation sites by the enhanced characteristics of sequence information,” Sci. Rep., 2019, doi: 10.1038/s41598-019-44548-x.

[91] . Drucker, C. J. C. urges, . Kaufman, A. Smola, and . apnik, “Support vector regression machines,” Adv. Neural Inf. Process. Syst., vol. 9, pp. 155–161, 1997.

22

[92] . S. Noble, “ hat is a support vector machine?,” Nat. Biotechnol., vol. 24, no. 12, pp. 1565–1567, 2006.

[93] M. Pal, “ andom forest classifier for remote sensing classification,” Int. J. Remote Sens., vol. 26, no. 1, pp. 217–222, 2005.

[94] N. lom, S. ammeltoft, and S. runak, “Sequence and structure-based prediction of eukaryotic protein phosphorylation sites,” J. Mol. Biol., vol. 294, no. 5, pp. 1351–1362, 1999.

[95] M. Hjerrild et al., “Identification of phosphorylation sites in protein kinase A substrates using artificial neural networks and mass spectrometry,” J. Proteome Res., 2004, doi: 10.1021/pr0341033.

[96] J. . Kim, J. ee, . Oh, K. Kimm, and I. S. Koh, “Prediction of phosphorylation sites using S Ms,” Bioinformatics, 2004, doi: 10.1093/bioinformatics/bth382.

[97] W. Liu et al., “Prediction of kinase-specific phosphorylational interactions using random forest,” Chemom. Intell. Lab. Syst., 2013, doi: 10.1016/j.chemolab.2013.05.005.

[98] S. aner ee, S. asu, D. hosh, and M. Nasipuri, “Phospred : Prediction of protein phosphorylation sites using a consensus of random forest classifiers,” 20 5, doi: 0. 09 I MCON.20 5.73445 4.

[99] M. M. asan, M. M. ashid, M. S. Khatun, and . Kurata, “Computational identification of microbial phosphorylation sites by the enhanced characteristics of sequence information,” Sci. Rep., vol. 9, no. 1, pp. 1–9, 2019.

[100] A. Elnaggar et al., “ProtTrans: towards cracking the language of ife’s code through self-supervised deep learning and high performance computing,” arXiv Prepr. arXiv2007.06225, 2020.

[101] A. Nambiar, M. eflin, S. iu, S. Maslov, M. opkins, and A. itz, “Transforming the language of life: transformer neural networks for protein prediction tasks,” in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2020, pp. 1–8.

[102] S. ebb, “Deep learning for biology.,” Nature, vol. 554, no. 7690, pp. 555–558, 2018.

[103] M. ainberg, D. Merico, A. Delong, and . J. rey, “Deep learning in biomedicine,” Nat. Biotechnol., vol. 36, no. 9, pp. 829–838, 2018.

[104] C. Angermueller, T. Pärnamaa, . Parts, and O. Stegle, “Deep learning for computational biology,” Mol. Syst. Biol., vol. 12, no. 7, p. 878, 2016.

[105] G.- . ei, “Protein structure prediction beyond Alpha old,” Nat. Mach. Intell., vol. 1, no. 8, pp. 336–337, 2019.

[106] J. Jumper et al., “ ighly accurate protein structure prediction with Alpha old,” Nature, pp. 1–11, 2021.

[107] H. Lv, F.- . Dao, . Zulfiqar, and . in, “DeepIPs: comprehensive assessment and computational identification of phosphorylation sites of SARS-CoV-2 infection using a deep learning-based approach,” Brief. Bioinform., 2021.

[108] C. O. Barnes et al., “SA S-CoV-2 neutralizing antibody structures inform therapeutic strategies,” Nature, vol. 588, no. 7839, pp. 682–687, 2020.

[109] B. Hu, H. Guo, P. Zhou, and Z.- . Shi, “Characteristics of SA S-CoV-2 and COVID- 9,” Nat. Rev. Microbiol., pp. 1– 14, 2020.

[110] J. Pennington, . Socher, and C. D. Manning, “ love: lobal vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[111] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “ nriching word vectors with subword information,” Trans. Assoc. Comput. Linguist., vol. 5, pp. 135–146, 2017.

[112] A. Joulin, . rave, P. o anowski, M. Douze, . Jégou, and T. Mikolov, “ asttext. zip: Compressing text classification models,” arXiv Prepr. arXiv1612.03651, 2016.

[113] T. Mikolov, K. Chen, . Corrado, and J. Dean, “ fficient estimation of word representations in vector space,” arXiv Prepr. arXiv1301.3781, 2013.

[114] S. Naseer, . ussain, . D. Khan, and N. asool, “Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations,” Anal. Biochem., vol. 615, p. 114069, 2021.

[115] . . umbanra a, . Mahesworo, T. . Cenggoro, A. udiarto, and . Pardamean, “An valuation of Deep Neural

23

Network Performance on Limited Protein Phosphorylation Site Prediction Data ScienceDirect An Evaluation of Deep Neural Network Performance on Limited An Evaluation of Deep Neural Network Performance on Limited Protein Phosphor,” Procedia Comput. Sci., vol. 157, no. September, pp. 25–30, 2019, doi: 10.1016/j.procs.2019.08.137.

[116] C. Wang et al., “ PS 5.0: an update on the prediction of kinase-specific phosphorylation sites in proteins,” Genomics. Proteomics Bioinformatics, vol. 18, no. 1, pp. 72–80, 2020.

[117] . Xu, J. Song, C. ilson, and J. C. hisstock, “PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction,” Sci. Rep., vol. 8, no. 1, pp. 1–14, 2018.

[118] . ei, P. Xing, J. Tang, and . Zou, “PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only,” IEEE Trans. Nanobioscience, vol. 16, no. 4, pp. 240–247, 2017.

24