Determinants and prediction of protein degradation

Miguel Correa Marrero

1 Introduction

Biologists used to consider proteins as rather static elements that would only be replaced if they were damaged. When stable isotopes started being used to trace metabolic processes [62], this view changed to a more dynamic one in which biomolecules are continuously being synthesized and degraded. Although our understanding of protein degradation has improved greatly since those early days, this form of post-transcriptional regulation has received less attention than others, to the point that it has been called a “missing dimension” [54]. As a result, the physicochemical, structural and sequence characteristics that underlie the broad range of protein half-lives [80] are not well understood.

Even though protein degradation has been found to contribute less to the control of protein concentrations in the cell than other forms of regulation [38], this does not make it any less of a key mechanism in homeostasis. The roles of protein degradation include the irreversible removal of proteins to adapt to new physiological conditions, the removal of aberrant and otherwise damaged proteins, and the maintenance of an adequate amino acid pool [36]; it thus affects many different cellular processes [48],[60],[79]. A dramatic example of the importance of protein turnover is found in the work of Hirata et al. [26], in which artificially prolonging the half-life of the transcription factor Hes7 by 8 minutes during development disrupts somite segmentation. Other examples are the implication of mutations that decrease protein stability in the pathogenesis of certain neuroendocrine tumours [83], or the loss of body mass in AIDS and cancer patients resulting from small increases in half-lives [11].

A complete understanding of the regulation of biological networks and the dynamics of proteomes requires information on how regulation of gene expression works at all levels, including protein degradation, the end point of gene expression.
However, proteome-wide determination of half-lives is expensive and time-consuming, and this situation will probably not change in the near future. Therefore, there is interest in constructing a computational tool that predicts half-lives. Such a tool would also tell us which characteristics of a protein determine its half-life. This information could prove useful in several ways. For example, it could be used to obtain parameters for mathematical models that attempt to predict the behaviour of biological systems. In a more applied setting, it could be employed to stabilize proteins for enzyme replacement therapy and make the treatment more efficient by reducing the dosage necessary. It could also be exploited in metabolic engineering and synthetic biology to manipulate pathways.

There have been several lines of research on the determinants of protein degradation. Some early studies focused on protein thermodynamic stability [46],[45]. However, these studies seem to have gone largely unnoticed, and since then attention has focused on sequence signals. One of the first proposed sequence signals for protein degradation was the N-end rule [2], which states that a protein’s half-life is a function of its N-terminal residue. However, the N-end rule alone does not fully explain half-lives [43] and has even been shown not to apply in Mycoplasma pneumoniae [44]. It has also been proposed that PEST regions (roughly defined as protein segments enriched in proline, glutamic acid, serine and threonine) lead to rapid degradation [59], although many of them are known to be conditional signals [56]. Other motifs, such as the destruction box [19] or the KEN box [53], have also been proposed to regulate half-life. It would seem, however, given the broad variety of substrates that the proteolytic machinery needs to degrade, that half-lives should also be influenced by a range of generic characteristics and their interplay, and not merely by a small set of sequence motifs.
Furthermore, many of these studies have been performed on a rather small set of proteins, limiting their explanatory power. These arguments point to the need for a global analysis of protein degradation.

Such an approach has already been attempted by several authors by exploiting large-scale datasets. There have been both univariate statistical approaches and machine learning approaches. Amongst the former we can count the study by Tompa et al. [74], which uses a yeast dataset [3] to find relationships between half-lives and a number of properties by simple linear regression. Another study, by van der Lee et al. [78], used several datasets to focus on the influence of structural disorder by binning the data into several categories. In contrast to these, machine learning approaches allow us to perform a multivariate analysis. Unfortunately, previous machine learning studies of the problem [29],[68],[49] have all used a dataset [84] obtained by a flawed methodology [1],[85], raising doubts about the significance of their results. Furthermore, they have used classification algorithms (with varying interpretability), whereas regression would be more useful, given that half-life is not a discrete category but a continuous quantity. Perhaps surprisingly, there is only one computational tool, ProtParam [17], that attempts to give an estimate of half-life. However, it is based solely on the traditional N-end rule.

Overall, these studies have reached little consensus. In this work, we attempt to address the problem of predicting protein half-life and uncovering its determinants. We do this by using machine learning, which allows us to combine a large number of protein characteristics to create a predictive model. We use support vector regression to try to learn a simple, interpretable model of protein degradation rates from datasets obtained by a reliable methodology, together with careful inspection of the data to avoid the pitfalls previous attempts have fallen into. We integrate a large number of possibly relevant features, focusing on those that can be derived from the amino acid sequence, in order to try to answer some open questions in proteolysis, such as the relevance of sequence signals, structural disorder or post-translational damage.

2 Materials and methods

2.1 Datasets

Human leukemia dataset, collected by Kristensen et al. [33]. The data was obtained from the myelomonocytic leukemia THP-1 cell line, from both proliferating and differentiating cells. We have used exclusively the measurements taken under proliferation conditions, as more data is available for them and the authors do not find a significant difference in half-lives between the two conditions. They use the pulsed SILAC (stable isotope labeling with amino acids in cell culture) technique to collect the data. Two different cell populations are used in this technique, one grown on light and another on medium amino acids. The growth medium of the former is replaced by another containing heavy amino acids. Protein degradation is then measured as a decrease in medium amino acids, with the population grown on light amino acids acting as a control.

Yeast dataset, obtained from the work of Helbig et al. [24]. In this study, Saccharomyces cerevisiae was grown in chemostat cultures under nitrogen-limited conditions. Once steady state was achieved, 15N was supplied to the cells instead of 14N, leading to gradual incorporation of the 15N isotope into newly synthesized proteins. Protein turnover rates can then be calculated by following the evolution of the 14N signal intensity. It should be mentioned, though, that nitrogen limitation might have triggered faster degradation of preexisting proteins in order to maintain an adequate amino acid pool [77].

Protein sequences, together with manual annotation for subcellular location, were retrieved from UniProtKB/Swiss-Prot [75] for the leukemia dataset, and from the Saccharomyces Genome Database [6] for the yeast dataset.

2.2 Data cleaning and inspection

A number of proteins (a particularly substantial number in the leukemia dataset) could not be identified unambiguously during the collection of the data and were assigned a group of possible identities. These proteins were removed, as we cannot be certain about their amino acid sequence. Sequences containing ambiguous amino acids were also removed for this reason. Proteins that had been assigned different measurements of half-life in the same dataset were also filtered out. In the yeast dataset, 90% confidence intervals had been calculated. For about half of the proteins, the lower bound of the confidence interval was negative. These proteins were discarded, as the measurements seemed unreliable. Finally, after removing outliers, 296 proteins were left in the yeast dataset and 464 in the leukemia dataset.

BLASTClust [9] was used to avoid biasing the training process with redundant sequences. BLASTClust performs clustering by running BLAST to perform all possible pairwise alignments. Sequences were clustered together and considered to be potentially redundant if they were 95% identical over 90% of the length of each sequence. Sequences in a cluster would only be considered redundant if they showed very similar half-lives, since similar sequences with rather different half-lives could contribute important information.

We inspected the data for possible experimental biases relating to subcellular location, since these had already had an impact on previous machine learning approaches. In order to do so, we performed a series of GO enrichment analyses for the cellular component ontology using the BiNGO plugin [42] for Cytoscape [63] with default parameters. We searched for overrepresented terms in the whole dataset, using the whole genome as background. Furthermore, the dataset was split into quartiles according to half-lives, and we also searched for overrepresented terms in each of these, using both the whole genome and the whole dataset as background. Using the dataset as background allows us to control for biases introduced during the collection of the data.

2.3 GO enrichment analyses

We intended to explore whether a protein’s half-life has any relationship to the function it performs in the cell. We used the DAVID Functional Annotation Clustering tool [27] to search for overrepresented molecular function and biological process clusters. Each dataset was split into tertiles according to half-lives; we searched for overrepresented terms in each of them using the corresponding dataset as background.

2.4 Sequence preprocessing

Many proteins undergo post-translational modifications that affect their sequence. Some of them affect the identity of the N-terminal amino acid, which according to the N-end rule has an effect on protein half-life. In order to take this into account, sequences are preprocessed before features are calculated for them. SignalP 4.1 [52] is used to predict N-terminal signal peptides and their cleavage site in each sequence. If the default D-value cutoff of 0.45 is reached, the protein is split at the predicted cleavage site. Afterwards, to mimic the effect of methionine aminopeptidases [28],[4], the initial methionine is removed if the residue adjacent to it is valine, glycine, proline, alanine, serine, threonine or cysteine.
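The preprocessing rule described above can be sketched as follows. This is only an illustrative sketch, not the code used in this work: the function names and the `cleavage_site` argument (the length of a predicted signal peptide, as would be read from a predictor's output) are assumptions introduced here.

```python
from typing import Optional

# Residues that, when adjacent to the initial methionine, lead to its removal
# (the list given in the text).
PERMISSIVE_SECOND_RESIDUES = set("VGPASTC")


def remove_initial_methionine(sequence: str) -> str:
    """Drop the leading Met if the second residue is in the permissive set."""
    if (len(sequence) >= 2 and sequence[0] == "M"
            and sequence[1] in PERMISSIVE_SECOND_RESIDUES):
        return sequence[1:]
    return sequence


def preprocess(sequence: str, cleavage_site: Optional[int] = None) -> str:
    """Cut off a predicted signal peptide, then apply the methionine rule.

    `cleavage_site` is a hypothetical argument: the number of residues in the
    predicted signal peptide, or None when no peptide passed the cutoff.
    """
    if cleavage_site is not None:
        sequence = sequence[cleavage_site:]
    return remove_initial_methionine(sequence)
```

The N-terminal residue of the returned sequence is then the one that would be used as a feature.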

2.5 Feature extraction

2.5.1 Features related to sequence preprocessing

The presence or absence of a signal peptide is used as a feature. The identity of the N-terminal amino acid is used as well.

2.5.2 Subcellular location features

It has been observed that nuclear proteins show, on average, a lower half-life than cytoplasmic proteins [78] (supplementary information). In order to investigate the possible influence of subcellular location in more depth, we defined a small set of possible subcellular locations based on upper-level GO cellular component terms. This set included the terms plasma membrane, extracellular region, extracellular space, cytoplasm, nucleus, endoplasmic reticulum, mitochondrion, Golgi apparatus, peroxisome and vacuole. For each of these terms there is a Boolean variable that specifies whether the protein has been manually annotated as located in that compartment.

2.5.3 Physicochemical features

This set of features encompasses several gross descriptors of the protein: length of the polypeptide chain, net charge (and its absolute value), isoelectric point and grand average of hydropathicity (GRAVY index) [35].

2.5.4 Composition features

This set contains features that describe the amino acid composition of the protein in terms of counts and fractions. We have used not only the standard 20-letter amino acid alphabet, but also pseudo-amino acid alphabets. The rationale is that there is, to a certain extent, some redundancy between different amino acids. It might well be, then, that different amino acids contribute jointly to a property that is relevant to protein half-life, and using an encoding that aggregates different amino acids could boost the signal and improve the performance of the predictor. The evolutionary information contained in similarity matrices can be exploited to this end. In order to build these alphabets, we have clustered Euclidean vectors derived by singular value decomposition [?] from the BLOSUM62 matrix [25] using the k-means algorithm, running it with k (the desired number of clusters) ranging from 2 to 20. Each run creates an alphabet of a different size. Two alphabets are selected from the results, not by clustering performance metrics but by looking for a good trade-off between size (smaller alphabets are preferred since they condense more information) and biological meaning (it is desirable that certain amino acids with special properties do not share a letter with others). The chosen alphabets are the one with 8 letters, since it is the smallest alphabet in which proline has its own letter, and the one with 10 letters, because it is the smallest alphabet in which tryptophan, the most conserved amino acid, is represented by a single letter. Details about the composition of each alphabet can be found in table 1 and table 2.

Table 1: Composition of the clusters forming the 8-letter alphabet

Amino acid clusters in the 8-letter alphabet
  Ile, Met, Leu, Val
  Asp, Asn
  Phe, Trp, Tyr
  Ala, Gly, Ser, Thr
  Glu, Lys, Arg, Gln
  Pro
  Cys
  His

Table 2: Composition of the clusters forming the 10-letter alphabet

Amino acid clusters in the 10-letter alphabet
  Ile, Met, Leu, Val
  Asp, Asn
  Phe, Tyr
  Ala, Gly, Ser, Thr
  Lys, Arg
  Gln, Glu
  Pro
  Cys
  His
  Trp
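As an illustration of how composition features over a reduced alphabet might be computed, the sketch below encodes the 8-letter alphabet of table 1 with one-letter amino acid codes. The function and variable names are assumptions made for this example; it is not the original implementation.

```python
from collections import Counter

# Clusters of the 8-letter alphabet from table 1, in one-letter codes.
CLUSTERS_8 = ["IMLV", "DN", "FWY", "AGST", "EKRQ", "P", "C", "H"]

# Map every amino acid to the index of the cluster it belongs to.
AA_TO_CLUSTER = {aa: i for i, cluster in enumerate(CLUSTERS_8) for aa in cluster}


def reduced_composition(sequence):
    """Return per-cluster counts and fractions for a non-empty sequence.

    Raises KeyError on non-standard residues; such sequences are assumed to
    have been removed during data cleaning.
    """
    counts = Counter(AA_TO_CLUSTER[aa] for aa in sequence)
    count_vector = [counts.get(i, 0) for i in range(len(CLUSTERS_8))]
    fraction_vector = [c / len(sequence) for c in count_vector]
    return count_vector, fraction_vector
```

The 10-letter alphabet of table 2 can be handled in the same way by swapping in its cluster list.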

2.5.5 Structural features

PSIPRED 3.5 [31] was used to predict secondary structure, using a version of the UniRef90 database from which low-complexity regions, transmembrane regions and coiled-coil segments had been filtered out. This prevents repetitive sequences from entering the PSI-BLAST sequence profiles that PSIPRED uses to make predictions, which would cause the prediction to fail.

NetSurfP 1.0 [51] was used to predict whether each amino acid in the sequence is exposed to the solvent or not. This is used not only to calculate, for example, the number of exposed or buried amino acids; we also take into account the composition of these two fractions with both the full and the reduced amino acid alphabets. In addition, we use this information to search for stretches of exposed hydrophobicity, following the method devised by Fredrickson et al. A sliding window of width 5 is moved over the sequence. If all the residues in the window are predicted to be exposed and the average Kyte-Doolittle hydrophobicity score is greater than zero, the segment is considered hydrophobic.

TMHMM 2.0c [70] was also used to predict transmembrane regions. Previous machine learning studies of the topic had concluded that transmembrane regions play an important role in protein half-life [29],[68],[49].
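The sliding-window search for exposed hydrophobic stretches can be sketched as below. The hydropathy values are the standard Kyte-Doolittle scale; the function name and the representation of the exposure prediction as a list of Booleans are assumptions made for this example, not a description of the original code.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def exposed_hydrophobic_windows(sequence, exposed, width=5):
    """Return the start indices of windows that are fully exposed and whose
    mean Kyte-Doolittle score is greater than zero.

    `exposed` holds one Boolean per residue (True = predicted exposed).
    """
    hits = []
    for start in range(len(sequence) - width + 1):
        if not all(exposed[start:start + width]):
            continue  # at least one buried residue in this window
        window = sequence[start:start + width]
        mean_score = sum(KYTE_DOOLITTLE[aa] for aa in window) / width
        if mean_score > 0:
            hits.append(start)
    return hits
```

Features such as the number of hydrophobic windows, or simply whether any exist, can then be derived from the returned list.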

2.5.6 Features related to structural disorder

Structural disorder and its implication in protein degradation rates have been studied by several authors, with little consensus on its importance [74],[84],[22]. IUPred 1.0 [10] was used in long mode to predict structurally disordered regions. The long mode adjusts the parameters to predict regions of at least 30 consecutive residues of predicted disorder. The default cut-off value of 0.5 was used to determine whether a residue is intrinsically disordered or not. Features derived from the prediction include the number of disordered residues, the average disorder score, and the presence of terminal and internal disordered segments that can serve as initiation regions for the proteasome. Van der Lee et al. [78] defined terminal segments as disordered segments that start at the first or last residue, are longer than 30 residues and contain no more than three contiguous structured residues. Internal segments are disordered segments that start anywhere but at the first or last residue, are longer than 40 residues and contain no more than four contiguous structured residues.

We also take into account low complexity regions (LCRs), which are regions with a biased amino acid composition. LCRs are frequently lumped together with other forms of structural disorder, but they are rather poorly understood and seem to have different properties from regular disordered regions [34]; they have therefore been considered separately. LCRs were predicted using the SEG algorithm [82] with default settings.
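One possible reading of the terminal-segment definition is sketched below: starting from a terminus, the segment is extended over disordered residues, bridging runs of at most three structured residues, and its length is then compared with the 30-residue threshold. How gaps are bridged and where the segment is taken to end are interpretations made for this example, as are all function names.

```python
def terminal_segment_length(disordered, max_gap=3):
    """Length of the disordered segment that starts at the first residue.

    `disordered` holds one Boolean per residue. Runs of up to `max_gap`
    structured residues are bridged; the segment ends at the last disordered
    residue before a longer structured run.
    """
    if not disordered or not disordered[0]:
        return 0
    end = 0   # index of the last disordered residue in the segment
    gap = 0   # length of the current run of structured residues
    for i, is_disordered in enumerate(disordered):
        if is_disordered:
            end = i
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                break
    return end + 1


def has_terminal_disordered_segment(scores, cutoff=0.5, min_length=30, max_gap=3):
    """True if either terminus starts a disordered segment longer than
    `min_length` residues, given per-residue disorder scores."""
    disordered = [score > cutoff for score in scores]
    n_term = terminal_segment_length(disordered, max_gap)
    c_term = terminal_segment_length(disordered[::-1], max_gap)
    return n_term > min_length or c_term > min_length


def fraction_disordered(scores, cutoff=0.5):
    """Fraction of residues whose disorder score exceeds the cut-off."""
    return sum(score > cutoff for score in scores) / len(scores)
```

Internal segments could be detected analogously by scanning for segments that start away from the termini, with the 40-residue and four-residue thresholds.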

2.5.7 Sequence motifs

PEST motifs are thought to lower the half-life of proteins containing them. We used the EMBOSS program epestfind [58] to predict PEST regions. This program considers PEST motifs to be protein segments of at least 12 residues that contain at least one proline, one glutamate or aspartate and one serine or threonine. These segments are flanked by arginine, histidine or lysine residues. The quality of these regions is assessed based on their hydrophobicity and their enrichment in the relevant amino acids. Only regions that obtain a score greater than 5.0 (“potential” motifs) are considered for our analysis.

N-glycosylation motifs were also taken into account by counting the number of occurrences of the pattern N[^P][ST], as obtained from PROSITE [65].
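Counting N-glycosylation motifs can be illustrated with a regular expression. The sketch assumes the pattern is an asparagine, followed by any residue except proline, followed by serine or threonine; whether overlapping occurrences should be counted is not stated in the text, so counting them (via a lookahead) is a choice made for this example.

```python
import re

# A zero-width lookahead lets overlapping occurrences be counted separately.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST])")


def count_n_glycosylation_motifs(sequence):
    """Number of positions at which the N-glycosylation pattern starts."""
    return len(N_GLYCOSYLATION.findall(sequence))
```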

2.5.8 Pattern-based features

We used spectrum kernels [37], as implemented in Shogun [69]. The spectrum kernel measures the similarity of two strings (in this case, protein sequences) based on k-mer counts. The value of k is initially set to 3, since tripeptide composition is thought to influence folding kinetics [30]. This setting is varied later to observe its influence on the model.
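To make the definition concrete, the following is a minimal pure-Python sketch of a spectrum kernel: the inner product of the k-mer count vectors of two sequences. It is meant only to illustrate the idea and is not the Shogun implementation; the function names are assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())
```

Evaluating this function for every pair of sequences yields the kernel matrix that can be added to the kernel matrix computed from the remaining features.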

2.6 Regression methods

The machine learning algorithm used to learn a model of protein degradation rates was ε-support vector regression (ε-SVR), as implemented in scikit-learn [50]. The generalization performance of the ε-SVR depends on two parameters, ε and C. ε determines the width of the so-called ε-insensitive zone, within which errors are not penalized, while C determines the trade-off between the flatness of the model and the extent to which deviations larger than ε are tolerated. The ε-SVR tries to learn a function that describes the data by minimising a cost function. This cost function is only positive when the value the function assigns to a training example deviates from its target value by more than ε.

The data is randomly split into training (90%) and test (10%) sets. Features are linearly rescaled to the range [0, 1] to prevent features with greater numeric ranges from dominating those with smaller ranges. The scale is determined on the training set. As a starting point, we create models by calculating the unweighted sum of the kernel matrix of a linear, polynomial or RBF kernel and that of a spectrum kernel (k=3). The resulting kernel matrices are tested for positive semidefiniteness by checking that all of the eigenvalues are above 10^-8, in order to avoid the problems that could arise from using ill-conditioned matrices. The parameters ε, C and, when using polynomial or RBF kernels, the pertinent kernel parameter, were optimized by minimizing the training mean squared error through a grid search inside a 10-fold cross-validation loop. The regressor is then trained with the estimated optimal parameters using the precomputed kernel matrices.

We also tried to find subsets of proteins that might follow different modes of degradation, using a procedure we will call hierarchical regression. We clustered proteins together using complete linkage agglomerative clustering with Euclidean distances. One support vector regression machine is trained for each cluster. The test examples are then assigned to their cluster and shown to the corresponding predictor to test generalization performance.
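Two of the ingredients described in this section are simple enough to spell out: the ε-insensitive loss, and feature rescaling with the scale fixed on the training set. The sketch below is in pure Python for illustration only; the models in this work were built with scikit-learn, and the helper names here are assumptions.

```python
def epsilon_insensitive_loss(targets, predictions, epsilon):
    """Per-example loss: zero inside the epsilon-insensitive zone, and the
    excess absolute deviation outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(targets, predictions)]


def fit_min_max(train_column):
    """Determine the scaling range of one feature on the training set only."""
    return min(train_column), max(train_column)


def apply_min_max(column, low, high):
    """Rescale a feature linearly using the range found on the training set.

    Test values outside the training range fall outside [0, 1].
    """
    if high == low:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(x - low) / (high - low) for x in column]
```

Fixing the range on the training set and reusing it on the test set keeps information about the test examples out of the training process.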

2.7 Feature selection

The goals of adding a feature selection step are twofold. One is to obtain a biological interpretation of the model; the other is to improve the performance of the model by removing features that harm it. In the procedure we have implemented, the first step is to remove features that remain constant over all examples. Afterwards, we calculate the Spearman correlation between the remaining features and half-lives. The corresponding p-values are corrected for multiple testing using the Benjamini-Hochberg procedure. Features that have a p-value above 0.05 are discarded. The feature with the lowest p-value is then included in the model. The features that pass these initial steps are then added to the model according to a scoring measure. This measure contains one term that defines the relevance of the feature to the prediction and another that takes into account its redundancy. The score is calculated according to the following formula:

\[
S(f_i) = \rho(f_i, Y) - p \, \frac{1}{N} \sum_{f_j \in F} \rho(f_i, f_j) \qquad (1)
\]

where f_i is the feature being scored, ρ is the Spearman correlation, Y the response variable, p a redundancy multiplier, N the number of features in the model, F the set of features in the model, and f_j any feature in the model. In this way, we attempt to maximize the relevance and minimize the redundancy of the features that we put into the model. While there are still features to add, the candidates are ranked according to this score and the top-ranked one is selected. In each iteration of the procedure, a support vector regressor is trained and tested with all the features included so far. The parameters C and ε are left fixed at values that performed well in previous runs, to reduce calculation time. Once the procedure is finished, we select the model that has shown the lowest test mean squared error.
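The scoring measure of equation 1 can be sketched in pure Python as follows. This is a literal reading of the formula, with signed Spearman correlations and an arbitrary default for the redundancy multiplier p; the function names are assumptions, and constant features are assumed to have been removed beforehand, as described above, since they would make the correlation undefined.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def score(candidate, selected, response, p=1.0):
    """Relevance to the response minus p times the mean correlation with the
    features already in the model (`selected` must be non-empty)."""
    relevance = spearman(candidate, response)
    redundancy = sum(spearman(candidate, f) for f in selected) / len(selected)
    return relevance - p * redundancy
```

In each iteration, the candidate with the highest score would be moved into `selected` before the regressor is retrained.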

Table 3: Results obtained on the leukemia dataset using different kernels and all the features.

               Linear kernel   Polynomial kernel   RBF kernel
C              0.217           0.652               0.435
Epsilon        0.440           0.237               0.220
Degree         -               5                   -
Sigma          -               -                   0.315
Training MSE   0.392           0.383               0.383
Test MSE       0.298           0.406               0.367

Table 4: Results obtained on the yeast dataset using different kernels and all the features.

               Linear kernel   Polynomial kernel   RBF kernel
C              0.217           9.565               12.173
Epsilon        0.254           0.576               0.339
Degree         -               5                   -
Sigma          -               -                   0.316
Training MSE   14.667          13.820              14.401
Test MSE       19.042          33.457              25.219

3 Results and discussion

3.1 Predictions correlate moderately with the measured values

Using all of the features, we obtain the most predictive models for both datasets using linear kernels, as shown in tables 3 and 4. Consequently, the models with the optimal number of features were trained using linear kernels. As can be seen in figures 1 and 2, the models are quite robust to the choice of parameters: if the values of C and ε are changed to nearby ones, the predictions do not change much.

We trained models with a linear kernel alone or with a combination of a linear kernel and a spectrum kernel, with different settings for k. The best result was obtained in both cases for the model with a linear kernel alone, as can be seen in tables 5 and 6. It seems, thus, that the information provided by the spectrum kernel is irrelevant to the problem. These models achieve a slightly higher performance than those that use all features. The predictions that the models yield are moderately correlated with the measured values (figure 3). This shows that protein half-lives depend, at least partially, on information that can be derived from the primary sequence. Once again, the models are robust to the choice of parameters (see figures 4 and 5).

We observe a high correlation between the measured values and the residuals of the predictions, as seen in figure 6. Predictions for short-lived and long-lived proteins are worse than for medium-lived ones. This is most probably because most proteins are medium-lived; thus, the regressor learns to predict these better.

We divided the predicted examples into tertiles according to the residual of the prediction and searched for clusters of overrepresented GO terms in each of them. This was done in order to find out whether the predictor is biased towards a particular kind of protein (for example, whether prediction performance is worse for transmembrane proteins). We found no enrichment in any tertile, indicating that there is no such bias.

Figure 1: Contour plots of the mean squared error over different combinations of parameters for a) the training set and b) the test set, for the leukemia dataset with a linear kernel, a spectrum kernel and all the features.

Figure 2: Contour plots of the mean squared error over different combinations of parameters for a) the training set and b) the test set, for the yeast dataset with a linear kernel.

Table 5: Results obtained on the leukemia dataset with the selected features. The sum of a linear kernel and a spectrum kernel with increasing values of k is used.

               No spectrum   Spectrum       Spectrum       Spectrum
               kernel        kernel (k=1)   kernel (k=2)   kernel (k=3)
C              0.217         0.217          0.217          0.217
Epsilon        0.390         0.390          0.508          0.356
Training MSE   0.384         0.384          0.380          0.378

Table 6: Results obtained on the yeast dataset with the selected features. The sum of a linear kernel and a spectrum kernel with increasing values of k is used.

               No spectrum   Spectrum       Spectrum       Spectrum
               kernel        kernel (k=1)   kernel (k=2)   kernel (k=3)
C              10.652        2.826          2.391          15.0
Epsilon        0.712         0.610          0.712          0.780
Training MSE   14.901        14.183         13.837         13.947
Test MSE       15.752        38.174         37.841         34.308

3.2 Hierarchical regression does not improve performance

As previously explained, we broke down the data into clusters and trained a regressor for each, in order to test whether there are subgroups of proteins that are degraded differently. As shown in tables 7, 8 and 9, attempting to create models for different clusters of proteins did not show any advantage over not doing so. Furthermore, many of the resulting models were flat lines, always yielding predictions close to the mean of the corresponding dataset.

Figure 3: Scatterplots of measured values versus predicted values for the optimized predictors. To the left, results in the leukemia dataset; to the right, those for the yeast dataset.

Figure 4: Contour plots of the mean squared error over different combinations of parameters for a) the training set and b) the test set, for the leukemia dataset with a linear kernel and optimized number of features.

Figure 5: Contour plots of the mean squared error over different combinations of parameters for a) the training set and b) the test set, for the yeast dataset with a linear kernel and optimized number of features.

Figure 6: Scatterplots of measured values versus residuals of the predictions. To the left, results in the leukemia dataset; to the right, those for the yeast dataset.

Table 7: Results of hierarchical regression on the leukemia dataset, using 2 clusters

            Training set size   Test set size   Test MSE
Cluster 1   6                   3               0.154
Cluster 2   384                 41              0.440

These results suggest that there are no subgroups of proteins that follow different modes of degradation, given that this procedure offers no advantage.

Table 8: Results of hierarchical regression on the yeast dataset, using 2 clusters

            Training set size   Test set size   Test MSE
Cluster 1   47                  23              25.0816
Cluster 2   219                 7               14.540

Table 9: Results of hierarchical regression on the yeast dataset, using 3 clusters

            Training set size   Test set size   Test MSE
Cluster 1   219                 22              16.6279
Cluster 2   39                  7               14.799
Cluster 3   8                   1               0.003

3.3 GO enrichment analyses provide information on cell type and experimental conditions

3.3.1 Leukemia dataset

As previously explained, we carried out a series of GO enrichment analyses to explore whether a protein’s half-life is related to its function in the cell. When looking for clusters of molecular function terms in the whole dataset using the genome as background (table 10), we observe a prominent cluster related to detoxification of reactive oxygen species (ROS). This is not surprising, as it is known that cancer cells usually display high levels of ROS [39]. Interestingly, there is also a cluster of terms related to protein degradation. Searching for clusters of biological process terms reveals enrichment related to mRNA and rRNA processing, protein targeting and, again, protein degradation and ROS metabolism. However, searching for enriched clusters in the different tertiles does not provide much information.

Cellular component enrichment shows that the proteins of this dataset are overwhelmingly cytoplasmic. The different quartiles seem to point to a tendency for mitochondrial proteins to be short-lived, and for nuclear proteins to be long-lived.

Table 10: Enriched clusters of molecular function terms on the whole leukemia dataset, using the whole genome as background

Annotation Cluster 1 (Enrichment Score: 3.951)
Term                                                             Count   Fold Enrichment   Corrected p-value (BH)
oxidoreductase activity, acting on sulfur group of donors        9       9.134             8.388E-4
disulfide oxidoreductase activity                                6       13.194            0.006
protein disulfide oxidoreductase activity                        4       11.309            0.226

Annotation Cluster 2 (Enrichment Score: 2.789)
Term                                                             Count   Fold Enrichment   Corrected p-value (BH)
threonine-type peptidase activity                                6       11.874            0.010
threonine-type endopeptidase activity                            6       11.874            0.010
endopeptidase activity                                           12      1.266             0.9743

Annotation Cluster 3 (Enrichment Score: 2.198)
Term                                                             Count   Fold Enrichment   Corrected p-value (BH)
aminoacyl-tRNA ligase activity                                   6       5.053             0.250
ligase activity, forming aminoacyl-tRNA and related compounds    6       5.053             0.250
ligase activity, forming carbon-oxygen bonds                     6       5.053             0.250

3.4 Yeast dataset

Searching for clusters on the whole dataset against the genome shows terms related to regulation of translation and high biosynthetic activity of amino acids. Molecular function clusters reflect this biosynthetic activity, showing terms related to pyridoxal phosphate or electron carriers. We also find terms related to fermentation, even though the yeast is growing under aerobic conditions in a glucose-rich medium. This is the so-called Crabtree effect [7]. Once again, inspection of the different tertiles does not provide much information. The only enrichment found is in the third tertile, which contains clusters related to the aforementioned biosynthetic activity.

3.5 Controlling aggregation is key in the leukemia dataset

A full list of the selected features can be seen in table 11. The single most predictive feature for half-life is the fraction of disordered residues, which correlates positively. Indeed, other features that are highly correlated with the amount of disorder in the protein, such as the GRAVY index, receive a high score in the feature selection procedure.

Disorder-promoting amino acids tend to be hydrophilic [5]. This is reflected in the fact that a greater ratio of hydrophilic amino acids contributes positively to half-life, as disorder does. Likewise, the presence of hydrophobic stretches exposed to the solvent correlates negatively. Disordered proteins offer a larger surface that can interact favourably with water, and are more soluble. This relationship between disorder and increased solubility had already been observed in silico by Klus et al. [32]. Previously, Santner et al. had already fused highly disordered proteins to recombinant proteins to improve their solubility [61]. Furthermore, Mann et al. [45] observed a relationship between protein hydrophobicity and half-life.

Also notable is the possible role that ROS seem to play. We observe that the exposed amount of certain amino acids, such as leucine or aromatic amino acids, correlates negatively with half-life. The side chains of these amino acids are a target of ROS [72],[71],[18]. Meanwhile, the exposed amount of amino acids that are not a target for ROS, such as serine, contributes positively to half-life. This effect is probably exacerbated by the increased production of ROS.

Both structural disorder (and the concomitant greater solubility) and vulnerability to ROS are connected to the aggregation propensity of the protein, as hydrophobic surfaces allow the formation of protein aggregates. Proteins that are oxidized by ROS suffer structural changes and exhibit greater hydrophobicity than their non-damaged counterparts, leading them to form aggregates. This supports the observations by de Baets et al. [8] that, on average, proteins with a short half-life have a higher aggregation propensity than long-lived ones. De Baets et al. argue that there would be no evolutionary pressure to reduce the aggregation propensity of these proteins, given that they are quickly degraded. We consider that the argument should be turned around: this would indicate an evolutionary pressure to recognize and quickly degrade proteins likely to form aggregates. This is supported by the recognition of exposed hydrophobicity by the proteolytic machinery [16],[64] and by the rapid degradation of subunits with hydrophobic interfaces when they dissociate from their complex [20]. This view has also been put forward by Gsponer et al. [21], who frame it as part of a global strategy in the cell to minimize aggregation.

Furthermore, it is well established that proline, polar and charged residues flanking aggregation-nucleating stretches can lower their aggregation propensity. These so-called gatekeeper residues can prevent the formation of aggregates through charge repulsion and steric hindrance [57]. Removing gatekeeper residues results in higher aggregation propensities. We observe that lower amounts of buried potential gatekeeper residues, such as proline, serine or glutamic acid, are correlated with a lower half-life. Although it cannot be ascertained from this analysis alone, this suggests that gatekeeper residues also play a role in protein turnover, as would be expected from the findings we have described.

Interestingly, we do not find the negative effect on half-life that van der Lee et al. [78] find disordered segments to have. On the contrary, we find the reverse effect, which can be attributed to their contribution to the solubility of the protein. This is despite their study using this same dataset.
This discrepancy is probably caused by their lax filtering of the data: whereas they use 3971 proteins, we use 464. As previously explained, many of the proteins in this dataset could not be identified unambiguously, so their results might be a product of this ambiguity. Another reason could be the purported influence of the composition complexity of these regions on how effectively they serve as initiation sites for degradation [12]. This influence is attributed to binding preferences of the proteasome. We could not take this into account because of the input format the support vector regressor needs: not every protein has disordered regions, and it would not make sense to simply assign a composition complexity of such regions to proteins that lack them.

Furthermore, proteins with PEST regions do not show a shorter half-life in our data, but a longer one. It has been shown that PEST regions are frequently disordered [66]. It has also been proposed that PEST regions shorten a protein’s half-life only under certain conditions (e.g. after phosphorylation) [56]. A possible conclusion is that PEST regions, disordered as they are, add to the solubility of the protein (and therefore to its half-life) under normal conditions. However, a fraction of PEST regions might induce degradation under certain conditions, perhaps also depending on their context within the protein.

The feature selection procedure did not find features related to the N-end rule to be relevant. This matter is discussed in further detail in the Conclusions.

Finally, it is worth mentioning our findings with regard to subcellular location. Proteins in the plasma membrane are found to last longer; this could be explained by the fact that membrane proteins need to be internalized by endocytosis in order to be degraded [41]. Thus, there is less chance for them to come into contact with proteolytic machinery, prolonging their lifespan. In contrast, proteins in the mitochondria have a shorter half-life. This is probably explained by the mitochondria being a major source of ROS [47].
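Two of the sequence-derived features discussed in this section are simple to compute. The GRAVY index is the mean Kyte-Doolittle hydropathy [35] over all residues, and the composition features are fractions of residues belonging to a group such as Ile+Met+Leu+Val. The Python sketch below is given for orientation only; the function names and example peptides are illustrative and are not taken from the pipeline used in this work.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Grand average of hydropathy: mean hydropathy over all residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if not values:
        raise ValueError("empty sequence")
    return sum(values) / len(values)

def group_fraction(sequence, group="IMLV"):
    """Fraction of residues that belong to a residue group, e.g. Ile+Met+Leu+Val."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(aa in group for aa in sequence) / len(sequence)

# Hypothetical peptides: a hydrophobic one and a charged one.
hydrophobic_score = gravy("LLVIA")
charged_score = gravy("KKEED")
```

A positive GRAVY value indicates a predominantly hydrophobic sequence and a negative value a hydrophilic one, which is the direction of the negative correlation with half-life reported here.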

Feature name    Spearman correlation    Corrected p-value (BH)    MSE on test set
Fraction of disordered residues    0.404    2.38E-14    0.307
Located in plasma membrane    0.164    5.78E-3    0.311
Length of β-sheets    -0.158    7.78E-3    0.311
GRAVY index    -0.348    1.02E-10    0.309
Average disorder score    0.392    1.27E-13    0.306
Fraction of Glu+Gln (10 letter alphabet)    0.24    1.99E-5    0.314
Fraction of β-sheets    -0.215    1.70E-4    0.307
Fraction of Ile+Met+Leu+Val (8 letter alphabet)    -0.327    1.73E-9    0.304
Fraction of Ile+Met+Leu+Val (10 letter alphabet)    -0.327    1.73E-9    0.303
Amount of disordered residues    0.374    1.80E-12    0.302
Number of C-terminal disordered residues    0.284    3.55E-7    0.302
Fraction of buried Ile+Met+Leu+Val (10 letter alphabet)    -0.291    1.65E-7    0.303
Fraction of structured residues (α-helix + β-sheets)    -0.271    1.12E-6    0.297
Fraction of residues in low-complexity regions    0.278    6.26E-7    0.298
Fraction of exposed Ser    0.264    2.46E-6    0.298
Fraction of residues in coils    0.271    1.12E-6    0.299
Fraction of buried Ile+Met+Leu+Val (8 letter alphabet)    -0.291    1.65E-7    0.299
Amount of exposed Leu    -0.252    7.42E-6    0.309
Presence of disordered N-terminal segment    0.251    7.52E-6    0.304
Fraction of exposed leucine    -0.242    1.83E-5    0.305
Fraction of buried Gln+Glu (8 letter alphabet)    0.242    1.83E-5    0.306
Fraction of Leu    -0.254    6.57E-6    0.306
Presence of internal disordered segment    0.215    1.70E-4    0.306
Number of N-terminal disordered residues    0.209    2.78E-4    0.306
Fraction of residues in exposed hydrophobic stretches    -0.2    5.65E-4    0.306
Located in the mitochondrion    -0.19    1.29E-3    0.313
Length of low complexity regions    0.276    7.85E-7    0.313
Fraction of buried Gln    0.202    4.96E-4    0.307
Fraction of residues in PEST regions    0.229    5.76E-5    0.305
Fraction of buried Leu    -0.24    1.99E-5    0.306
Fraction of buried Val    -0.176    2.95E-3    0.302
Amount of buried Asn    0.163    6.08E-3    0.302
Fraction of Glu+Gln+Lys+Arg (8 letter alphabet)    0.163    6.08E-3    0.301
Fraction of Pro (10 letter alphabet)    0.175    2.95E-3    0.3
Fraction of buried Val    -0.181    2.21E-3    0.301
Number of residues in internal disordered segments    0.254    6.47E-6    0.301
Fraction of buried Ile    -0.176    2.95E-3    0.3
Fraction of Ile    -0.161    6.34E-3    0.299
Fraction of exposed Ile+Met+Leu+Val (8 letter alphabet)    -0.182    2.21E-3    0.299
Fraction of exposed Ile+Met+Leu+Val (10 letter alphabet)    -0.182    2.21E-3    0.298
Length of PEST regions    0.233    3.79E-5    0.299
Fraction of buried Phe+Tyr (10 letter alphabet)    -0.218    1.45E-4    0.299
Fraction of Phe+Tyr (10 letter alphabet)    -0.221    1.12E-4    0.297
Fraction of Phe    -0.188    1.46E-3    0.296
Fraction of Phe+Tyr+Trp (8 letter alphabet)    -0.209    2.78E-4    0.296
Fraction of buried Phe+Tyr+Trp (8 letter alphabet)    -0.202    5.00E-4    0.296
Number of PEST regions    0.228    5.85E-5    0.296
Fraction of buried Glu+Gln+Lys+Arg (8 letter alphabet)    0.166    5.08E-3    0.296
Fraction of Pro (8 letter alphabet)    0.175    2.95E-3    0.296
Fraction of buried Asn    0.162    6.08E-3    0.296
Presence of disordered C-terminal segment    0.162    6.08E-3    0.298
Fraction of exposed Ile+Met+Leu+Val (8 letter alphabet)    0.178    2.64E-3    0.297
Fraction of exposed Met    -0.155    9.24E-3    0.296
Fraction of Tyr    -0.152    1.08E-2    0.293
Fraction of Pro    0.175    2.95E-3    0.293
Fraction of buried Phe    -0.177    2.87E-3    0.293
Fraction of exposed Ile+Met+Leu+Val (10 letter alphabet)    0.178    2.64E-3    0.292
Fraction of Gln    0.169    4.35E-3    0.296
Fraction of buried Tyr    -0.149    1.23E-2    0.296
Fraction of buried Pro    0.172    3.48E-3    0.296
Fraction of Glu    0.157    7.93E-3    0.296
Fraction of buried Trp (10 letter alphabet)    -0.129    3.49E-2    0.295
Fraction of Ser    0.173    3.35E-3    0.289
Amount of exposed Trp    -0.129    3.49E-2    0.289
Fraction of buried Pro (8 letter alphabet)    0.172    3.48E-3    0.289
Fraction of buried Glu    0.155    9.24E-3    0.289
Fraction of buried Pro (10 letter alphabet)    0.172    3.48E-3    0.289
Fraction of exposed Trp    -0.126    3.92E-2    0.287
Presence of exposed hydrophobic stretches    -0.128    3.76E-2    0.288
Fraction of exposed Trp (10 letter alphabet)    -0.126    3.92E-2    0.287
Fraction of buried Ser    0.124    4.25E-2    0.284
Fraction of exposed Ala    -0.127    3.92E-2    0.283
Amount of buried Ser    0.139    2.08E-2    0.284
Amount of buried Gln    0.162    6.08E-3    0.284
Amount of buried Met    0.126    3.92E-2    0.282
Amount of exposed Ile+Met+Leu+Val (10 letter alphabet)    -0.13    3.42E-2    0.282
Amount of exposed Ile+Met+Leu+Val (8 letter alphabet)    -0.13    3.42E-2    0.283
Amount of buried Gln+Glu (10 letter alphabet)    0.159    7.29E-3    0.283
Amount of Asn    0.15    1.21E-2    0.283
Amount of Gln+Glu (10 letter alphabet)    0.153    1.02E-2    0.283
Amount of Pro    0.147    1.34E-2    0.283
Amount of Pro (8 letter alphabet)    0.147    1.34E-2    0.283
Amount of buried Glu    0.138    2.30E-2    0.282
Amount of Pro (10 letter alphabet)    0.147    1.34E-2    0.283
Amount of Glu    0.135    2.56E-2    0.282
Amount of buried Pro    0.146    1.38E-2    0.283
Amount of buried Pro (10 letter alphabet)    0.146    1.38E-2    0.283
Amount of buried Pro (8 letter alphabet)    0.146    1.38E-2    0.284
Length of coils    0.135    2.56E-2    0.284
Fraction of Glu+Lys+Gln+Arg (8 letter alphabet)    0.122    4.72E-2    0.283
Fraction of buried Glu+Lys+Gln+Arg (8 letter alphabet)    0.124    4.25E-2    0.283

Table 11: Features that entered the feature selection procedure in the leukemia dataset. The order corresponds to the one in which they were added to the model.
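The two statistics reported per feature in Table 11, the Spearman correlation with half-life and the Benjamini-Hochberg (BH) corrected p-value, can be illustrated with a short Python sketch. It uses only the standard library, the helper names and toy numbers are hypothetical, and the raw p-values are assumed to come from a separate significance test; it is not the code used to produce the table.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy usage with invented raw p-values for four features.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because Spearman correlation works on ranks, it is insensitive to the scale of a feature, which makes it convenient for comparing features as different as residue counts and disorder scores.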

3.6 Nitrogen limitation shifts degradation preferences in the yeast dataset

The feature selection procedure yields a very different answer to the question of which features are important for prediction when applied to the yeast dataset (Table 12). The reason probably lies in the experimental conditions under which the data was collected.

As previously explained, the data was obtained under nitrogen-limited conditions. Nitrogen limitation is enough to cause nutritional stress and induce autophagy in yeast [40] [67]. The cell starts to degrade its own proteins in the vacuole, which contains an array of proteases, in order to maintain adequate amino acid levels and synthesize proteins necessary for survival. Certain vesicles, called autophagosomes, are responsible for collecting bulk cytosolic material and delivering it to the vacuole for degradation. Under these conditions, the cell upregulates amino acid biosynthetic pathways [81], which agrees with the results from the GO enrichment analysis.

One of the disagreements with the leukemia dataset is that proteins with transmembrane helices are degraded more quickly. The reason is probably that the plasma membrane contributes at least partially to the formation of autophagosomes [55]. Thus, membrane proteins come into contact with proteolytic machinery much more often. Since many of the vacuolar proteases seem to recognize mostly hydrophobic residues in their substrates [23], transmembrane proteins would be particularly easy to degrade. Also, since autophagosomes capture cytosolic material, this would make proteins in the mitochondria longer-lived by comparison.

Another disagreement is that disorder seems to contribute to a shorter half-life in yeast. There is a large body of evidence indicating that disordered proteins are more vulnerable to proteolysis. For example, it is known that protease cleavage sites tend to overlap with regions with no electron density in crystal structures [14], which are predicted to be intrinsically disordered. In addition, proteolysis rates increase upon substrate unfolding [13]. This suggests that, under these particular conditions, disordered proteins are degraded more quickly because they come into contact with vacuolar proteases.

Features related to amino acid composition might give clues about the overall substrate preferences of vacuolar proteases. Although as a whole they are thought to have very broad, indiscriminate substrate specificity (since it would seem that the vacuole evolved to efficiently degrade all sorts of proteins), it might be that particular proteases do have more substrate specificity [23]. However, they have been studied mostly with synthetic substrates rather than the actual substrates they might find in the vacuole, so this question cannot be fully answered, only speculated on.

Feature name    Spearman correlation    Corrected p-value (BH)    MSE on test set
Amount of Ala    0.335    6.00E-06    18.488
Number of transmembrane helices    -0.226    3.28E-03    17.842
Fraction of buried Glu    0.195    1.19E-02    17.773
Fraction of Ala    0.303    4.50E-05    17.756
Fraction of Val    0.208    7.63E-03    18.17
Located in mitochondria    0.187    1.61E-02    18.483
Number of residues in β-sheets    0.255    1.01E-03    18.583
Fraction of residues in transmembrane helices    -0.232    3.23E-03    18.29
Fraction of buried Ala    0.293    8.00E-05    18.151
Fraction of disordered residues    -0.194    1.24E-02    18.496
Presence of N-terminal disordered segment    -0.167    2.93E-02    18.906
Fraction of residues in β-sheets    0.157    4.04E-02    19.033
Number of residues in transmembrane helices    -0.227    3.28E-03    18.916
Fraction of buried Ala    0.321    1.20E-05    18.406
Fraction of buried Asn    -0.181    1.94E-02    18.404
Fraction of exposed Thr    0.18    1.94E-02    18.651
Fraction of Glu    0.187    1.61E-02    18.602
Number of residues in N-terminal disordered segment    -0.158    3.95E-02    18.534
Amount of exposed Ala    0.209    7.53E-03    18.659
Located in cytoplasm    0.179    1.94E-02    19.131
Fraction of buried Val    0.191    1.43E-02    19.196
Fraction of buried Ala+Gly+Ser+Thr (8 letter alphabet)    0.21    7.53E-03    19.351
Amount of buried Lys    0.206    7.92E-03    19.101
Amount of exposed Thr    0.239    2.35E-03    19.046
Fraction of buried Ala+Gly+Ser+Thr (10 letter alphabet)    0.21    7.53E-03    19.088
Charge    -0.165    3.22E-02    19.132
Amount of buried Glu    0.227    3.28E-03    19.112
Amount of Glu    0.224    3.54E-03    19.119
Fraction of Ala+Gly+Ser+Thr (10 letter alphabet)    0.175    2.29E-02    19.075
Fraction of structured residues (α-helix + β-sheets)    0.235    2.75E-03    19.099
Fraction of exposed Ile+Met+Leu+Val (8 letter alphabet)    -0.152    4.81E-02    19.128
Amount of Val    0.259    9.80E-04    19.133
Fraction of Ala+Gly+Ser+Thr (8 letter alphabet)    0.175    2.29E-02    19.152
Amount of Lys    0.203    8.33E-03    19.088
Amount of buried Val    0.258    9.80E-04    19.104
Number of residues in α-helix    0.186    1.61E-02    19.121
Amount of exposed His    0.159    3.92E-02    19.122
Amount of Asp    0.173    2.38E-02    19.117
Fraction of buried Ala+Gly+Ser+Thr (10 letter alphabet)    0.246    1.55E-03    19.112
Amount of exposed His (10 letter alphabet)    0.159    3.92E-02    19.111
Amount of Ala+Gly+Ser+Thr (8 letter alphabet)    0.246    1.55E-03    19.109
Amount of Thr    0.223    3.54E-03    19.107
Number of exposed hydrophobic segments    0.195    1.19E-02    19.12
Amount of buried Asp    0.172    2.53E-02    19.126
Amount of exposed His (8 letter alphabet)    0.159    3.92E-02    19.125
Amount of buried Gln+Glu (10 letter alphabet)    0.186    1.61E-02    19.121
Amount of Gly    0.203    8.33E-03    19.116
Amount of Gln+Glu (10 letter alphabet)    0.182    1.89E-02    19.117
Amount of buried Gly    0.201    8.88E-03    19.114
Amount of buried Thr    0.171    2.53E-02    19.116
Amount of Leu    0.179    1.94E-02    19.11
Amount of buried Leu    0.18    1.94E-02    19.109
Amount of buried Glu+Lys+Gln+Arg (8 letter alphabet)    0.183    1.88E-02    19.115
Amount of Ala+Gly+Ser+Thr (10 letter alphabet)    0.227    3.28E-03    19.106
Amount of buried Lys+Arg (10 letter alphabet)    0.171    2.53E-02    19.111
Amount of Ala+Gly+Ser+Thr (8 letter alphabet)    0.227    3.28E-03    19.102
Amount of Pro    0.16    3.92E-02    19.122
Amount of Lys+Arg (10 letter alphabet)    0.165    3.22E-02    19.129
Amount of exposed Ala+Gly+Ser+Thr (10 letter alphabet)    0.171    2.53E-02    19.114
Amount of Pro (10 letter alphabet)    0.16    3.92E-02    19.134
Amount of exposed Ala+Gly+Ser+Thr (8 letter alphabet)    0.171    2.53E-02    19.12
Amount of Pro (8 letter alphabet)    0.16    3.92E-02    19.138
Amount of Glu+Lys+Gln+Arg (8 letter alphabet)    0.181    1.94E-02    19.145
Amount of Ile+Met+Leu+Val (10 letter alphabet)    0.205    7.92E-03    19.146
Amount of Ile+Met+Leu+Val (8 letter alphabet)    0.205    7.92E-03    19.146
Amount of buried Pro    0.157    3.95E-02    19.17
Amount of buried Ile+Met+Leu+Val (8 letter alphabet)    0.205    7.92E-03    19.17
Amount of buried Ile+Met+Leu+Val (10 letter alphabet)    0.205    7.92E-03    19.168
Amount of Ile    0.179    1.94E-02    19.171
Amount of buried Pro (8 letter alphabet)    0.157    3.95E-02    19.194
Amount of buried Pro (10 letter alphabet)    0.157    3.95E-02    19.216
Amount of buried Ile    0.178    1.94E-02    19.22
Total amount of exposed residues    0.155    4.23E-02    19.206
Total amount of buried residues    0.189    1.54E-02    19.202
Length    0.186    1.61E-02    19.201

Table 12: Features that entered the feature selection procedure in the yeast dataset. The order corresponds to the one in which they were added to the model.

4 Conclusions

We have provided an exhaustive analysis of protein degradation rates in two different cell types, under different experimental conditions. We have shown that it is possible to predict, at least partially, protein half-lives based on their physicochemical and sequence characteristics (with a Pearson correlation coefficient of 0.538 on the leukemia dataset and 0.422 on the yeast dataset).

Our results on the leukemia dataset provide evidence for a link between protein degradation and aggregation propensity. Based on current evidence, we subscribe to the view that quicker degradation of aggregation-prone proteins serves the purpose of shifting the chemical equilibrium away from oligomeric and aggregate forms, in order to prevent the potentially toxic effects of these forms. It should be kept in mind, however, that several physiological processes are known to depend on the formation of aggregates [15]. Our study does not provide information on how proteins implicated in these processes are regulated. Interestingly, we have also found no evidence that PEST regions, long associated with rapid degradation of proteins containing them, have this effect on a global scale. Rather, we have found the opposite effect.

We have also observed how nutritional conditions can alter which characteristics are important for prediction and the influence they have on a protein’s half-life. Our results on the yeast dataset indicate that nitrogen limitation is not a good experimental condition to study protein degradation under physiological conditions, given the nutritional stress it causes. This also highlights the difficulty of creating a tool for prediction of protein half-life.

One obvious question to ask is why the power of the predictions is limited. One source of these limitations is the half-life data itself. Large-scale datasets that have been collected using a trustworthy methodology are scarce. Even when the methodology is considered trustworthy, the data can be rather noisy. Another is that the predictors and annotations that the study depends on can introduce some noise themselves. Both of these reasons limit the extent to which machine learning algorithms can learn from the training examples. In addition, we were unable to incorporate one feature (proposed in [12]) that might have proved useful because of the input format required by the machine learning algorithms used.

Thus, one way to improve the performance of the predictors would be simply to gather more data using a method known to be reliable, such as pulsed SILAC. It is especially important to gather more data on short-lived and long-lived proteins; as shown before, our predictions for such proteins are less trustworthy. Gathering more data would also make it possible to obtain good estimates of mutual information between half-life and the different features. This would help uncover potential non-monotonic relationships between them, whereas our feature selection procedure (based on Spearman correlation) is better suited for monotonic relationships.

However, there are more aspects that might be explored. For example, given the links between half-lives and solubility and aggregation propensity, we might incorporate information from existing predictors for these characteristics. This would also make it possible to confirm whether gatekeeper residues do play a role in protein degradation.

We have not found features related to the N-end rule to be relevant. This is probably due to the sequence preprocessing step, which is an oversimplification of how the N-terminal end of a protein is processed in the cell. Degradation signals in the N-terminal end are known to depend on numerous proteases [80] and post-translational modifications [73]. Portraying these accurately could prove to be a challenge.

This study has focused on characteristics of the protein that can be easily derived from the primary structure. However, a more sophisticated study should take into account more sources of information. For example, it is not difficult to see that the protein-protein interaction network could have a role in determining half-lives; as we explained before, certain proteins are rapidly degraded once they dissociate from a complex. Therefore, the role of protein-protein interactions in this matter needs to be examined more closely. Post-translational modifications, such as sumoylation, should also be accounted for [76]. It could also be interesting to search for possible sequence signals in short-lived or long-lived proteins using motif discovery tools.
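The remark above on non-monotonic relationships can be made concrete with a toy example in Python. The sketch uses a naive plug-in estimate of mutual information on discrete values, which is an assumption of this illustration and only one of several possible estimators; with real half-life data, continuous features would need binning or a nearest-neighbour estimator, and considerably more samples. All names and values are hypothetical.

```python
from collections import Counter
from math import log2, sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) for discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Toy U-shaped relationship: y is fully determined by x, but not monotonically.
xs = list(range(-5, 6))
ys = [x * x for x in xs]
rho = spearman(xs, ys)
mi = mutual_information(xs, ys)
```

In this example the rank correlation vanishes because the two arms of the U cancel out, while the mutual information is clearly positive. A feature with this kind of dependence would be missed by a screening step based on Spearman correlation alone.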

Finally, our study has been based solely on two eukaryotic cell types under one condition each, far from universal. How the determinants of protein degradation vary across different branches of life and conditions is an open question that could be studied using this same approach.

References

[1] Beatriz Alvarez-Castelao, Carmen Ruiz-Rivas, and José G Castaño. A critical appraisal of quantitative studies of protein degradation in the framework of cellular proteostasis. Biochemistry research international, 2012, 2012.

[2] Andreas Bachmair, Daniel Finley, and Alexander Varshavsky. In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234(4773):179–186, 1986.

[3] Archana Belle, Amos Tanay, Ledion Bitincka, , and Erin K O'Shea. Quantification of protein half-lives in the budding yeast proteome. Proceedings of the National Academy of Sciences, 103(35):13004–13009, 2006.

[4] Jean-Paul Boissel, Thomas J Kasper, and H Franklin Bunn. Cotranslational amino-terminal processing of cytosolic proteins. Cell-free expression of site-directed mutants of human hemoglobin. Journal of Biological Chemistry, 263(17):8443–8449, 1988.

[5] Andrew Campen, Ryan M Williams, Celeste J Brown, Jingwei Meng, Vladimir N Uversky, and A Keith Dunker. TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder. Protein and peptide letters, 15(9):956, 2008.

[6] J Michael Cherry, Eurie L Hong, Craig Amundsen, Rama Balakrishnan, Gail Binkley, Esther T Chan, Karen R Christie, Maria C Costanzo, Selina S Dwight, Stacia R Engel, et al. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic acids research, page gkr1029, 2011.

[7] Herbert Grace Crabtree. The carbohydrate metabolism of certain pathological overgrowths. Biochemical Journal, 22(5):1289, 1928.

[8] Greet De Baets, Joke Reumers, Javier Delgado Blanco, Joaquin Dopazo, Joost Schymkowitz, and Frederic Rousseau. An evolutionary trade-off between protein turnover rate and protein aggregation favors a higher aggregation propensity in fast degrading proteins. PLoS computational biology, 7(6):e1002090, 2011.

[9] I Dondoshansky and Y Wolf. BLASTClust (NCBI software development toolkit). NCBI, Bethesda, Md, 2002.

[10] Zsuzsanna Dosztányi, Veronika Csizmok, Peter Tompa, and István Simon. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. , 21(16):3433–3434, 2005.

[11] Franklin H Epstein, William E Mitch, and Alfred L Goldberg. Mechanisms of muscle wasting: the role of the ubiquitin–proteasome pathway. New England Journal of Medicine, 335(25):1897–1905, 1996.

[12] Susan Fishbain, Tomonao Inobe, Eitan Israeli, Sreenivas Chavali, Houqing Yu, Grace Kago, M Madan Babu, and Andreas Matouschek. Sequence composition of disordered regions fine-tunes protein half-life. structural & molecular biology, 22(3):214–221, 2015.

[13] Angelo Fontana, P Polverino de Laureto, Barbara Spolaore, Erica Frare, Paola Picotti, and Marcello Zambonin. Probing protein structure by limited proteolysis. Acta Biochimica Polonica, 51:299–322, 2004.

[14] Angelo Fontana, Patrizia Polverino de Laureto, Barbara Spolaore, Erica Frare, and Marcello Zambonin. Detecting disordered regions in proteins by limited proteolysis. Instrumental Analysis of Intrinsically Disordered Proteins: Assessing Structure And Conformation, pages 569–626, 2010.

[15] Douglas M Fowler, Atanas V Koulov, William E Balch, and Jeffery W Kelly. Functional amyloid–from bacteria to humans. Trends in biochemical sciences, 32(5):217–224, 2007.

[16] Eric K Fredrickson, Joel C Rosenbaum, Melissa N Locke, Thomas I Milac, and Richard G Gardner. Exposed hydrophobicity is a key determinant of nuclear quality control degradation. Molecular biology of the cell, 22(13):2384–2395, 2011.

[17] Elisabeth Gasteiger, Christine Hoogland, Alexandre Gattiker, Marc R Wilkins, Ron D Appel, , et al. Protein identification and analysis tools on the server. In The proteomics protocols handbook, pages 571–607. Springer, 2005.

[18] Silvia Gebicki and Janusz M Gebicki. Formation of peroxides in amino acids and proteins exposed to oxygen free radicals. Biochem. J, 289:743–749, 1993.

[19] Michael Glotzer, Andrew W Murray, and Marc W Kirschner. Cyclin is degraded by the ubiquitin pathway. Nature, 349(6305):132–138, 1991.

[20] AL Goldberg and JF Dice. Intracellular protein degradation in mammalian and bacterial cells. Annual review of biochemistry, 43(1):835–869, 1974.

[21] Jörg Gsponer and M Madan Babu. Cellular strategies for regulating functional and nonfunctional protein aggregation. Cell reports, 2(5):1425–1437, 2012.

[22] Jörg Gsponer, Matthias E Futschik, Sarah A Teichmann, and M Madan Babu. Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science, 322(5906):1365–1368, 2008.

[23] Karen A Hecht, Allyson F O'Donnell, and Jeffrey L Brodsky. The proteolytic landscape of the yeast vacuole. Cellular logistics, 4(1), 2014.

[24] Andreas O Helbig, Pascale Daran-Lapujade, Antonius JA van Maris, Erik AF de Hulster, Dick de Ridder, Jack T Pronk, Albert JR Heck, and Monique Slijper. The diversity of protein turnover and abundance under nitrogen-limited steady-state conditions in Saccharomyces cerevisiae. Molecular BioSystems, 7(12):3316–3326, 2011.

[25] Steven Henikoff and Jorja G Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992.

[26] Hiromi Hirata, Yasumasa Bessho, Hiroshi Kokubu, Yoshito Masamizu, Shuichi Yamada, Julian Lewis, and Ryoichiro Kageyama. Instability of Hes7 protein is crucial for the somite segmentation clock. Nature genetics, 36(7):750–754, 2004.

[27] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1):1–13, 2009.

[28] S Huang, RC Elliott, PS Liu, RK Koduri, LC Blair, KM Bryan, P Ghosh-Dastidar, B Einarson, and RL Kendall. Specificity of cotranslational amino-terminal processing of proteins in yeast. Biochemistry, 26(25):8242–8246, 1987.

[29] Tao Huang, Xiao-He Shi, Ping Wang, Zhisong He, Kai-Yan Feng, LeLe Hu, Xiangyin Kong, Yi-Xue Li, Yu-Dong Cai, and Kuo-Chen Chou. Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PloS one, 5(6):e10972, 2010.

[30] Susan Idicula-Thomas and Petety V Balaji. Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Science, 14(3):582–592, 2005.

[31] David T Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of molecular biology, 292(2):195–202, 1999.

[32] Petr Klus, Benedetta Bolognesi, Federico Agostini, Domenica Marchese, Andreas Zanzoni, and Gian Gaetano Tartaglia. The cleverSuite approach for protein characterization: predictions of structural properties, solubility, chaperone requirements and RNA-binding abilities. Bioinformatics, 30(11):1601–1608, 2014.

[33] Anders R Kristensen, Joerg Gsponer, and Leonard J Foster. Protein synthesis rate is the predominant regulator of protein expression during differentiation. Molecular systems biology, 9(1), 2013.

[34] Bandana Kumari, Ravindra Kumar, and Manish Kumar. Low complexity and disordered regions of proteins have different structural and amino acid preferences. Molecular BioSystems, 11(2):585–594, 2015.

[35] Jack Kyte and Russell F Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1):105–132, 1982.

[36] Stewart H Lecker, Alfred L Goldberg, and William E Mitch. Protein degradation by the ubiquitin–proteasome pathway in normal and disease states. Journal of the American Society of Nephrology, 17(7):1807–1819, 2006.

[37] Christina S Leslie, , and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific symposium on biocomputing, volume 7, pages 566–575, 2002.

[38] Jingyi Jessica Li and Mark D Biggin. Statistics requantitates the central dogma. Science, 347(6226):1066–1067, 2015.

[39] Geou-Yarh Liou and Peter Storz. Reactive oxygen species in cancer. Free radical research, 44(5):479–496, 2010.

[40] Jun Ma, Rui Jin, Xiaoyu Jia, Craig J Dobry, Li Wang, Fulvio Reggiori, Ji Zhu, and Anuj Kumar. An interrelationship between autophagy and filamentous growth in budding yeast. Genetics, 177(1):205–214, 2007.

[41] Jason A MacGurn, Pi-Chiang Hsu, and Scott D Emr. Ubiquitin and membrane protein turnover: from cradle to grave. Annual review of biochemistry, 81:231–259, 2012.

[42] Steven Maere, Karel Heymans, and Martin Kuiper. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21(16):3448–3449, 2005.

[43] Tobias Maier, Marc Güell, and Luis Serrano. Correlation of mRNA and protein in complex biological samples. FEBS letters, 583(24):3966–3973, 2009.

[44] Tobias Maier, Alexander Schmidt, Marc Güell, Sebastian Kühner, Anne-Claude Gavin, Ruedi Aebersold, and Luis Serrano. Quantification of mRNA and protein and integration with protein turnover in a bacterium. Molecular systems biology, 7(1), 2011.

[45] David F Mann, Karen Shah, David Stein, and Gary A Snead. Protein hydrophobicity and stability support the thermodynamic theory of protein degradation. Biochimica et Biophysica Acta (BBA)-Protein Structure and Molecular Enzymology, 788(1):17–22, 1984.

[46] G McLendon and Eric Radany. Is protein turnover thermodynamically controlled? Journal of Biological Chemistry, 253(18):6335–6337, 1978.

[47] M Murphy. How mitochondria produce reactive oxygen species. Biochem. J, 417:1–13, 2009.

[48] Michele Pagano, Sun W Tam, Anne M Theodoras, Peggy Beer-Romero, Giannino Del Sal, Vincent Chau, P Renée Yew, Giulio F Draetta, and Mark Rolfe. Role of the ubiquitin-proteasome pathway in regulating abundance of the cyclin-dependent kinase inhibitor p27. Science, 269(5224):682–685, 1995.

[49] Ralph Patrick, Kim-Anh L Cao, Melissa Davis, Bostjan Kobe, and Mikael Bodén. Mapping the stabilome: a novel computational method for classifying metabolic protein stability. BMC systems biology, 6(1):60, 2012.

36 [50] Fabian Pedregosa, Ga¨el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Pret- tenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Ma- chine learning in python. The Journal of Machine Learning Research, 12:2825–2830, 2011.

[51] Bent Petersen, Thomas N Petersen, Pernille Andersen, Morten Nielsen, and Claus Lundegaard. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Structural Bi- ology, 9(1):51, 2009.

[52] Thomas Nordahl Petersen, Søren Brunak, , and Hen- rik Nielsen. Signalp 4.0: discriminating signal peptides from transmem- brane regions. Nature methods, 8(10):785–786, 2011.

[53] Cathie M Pfleger and Marc W Kirschner. The ken box: an apc recog- nition signal distinct from the d box targeted by cdh1. Genes & devel- opment, 14(6):655–665, 2000.

[54] Julie M Pratt, June Petty, Isabel Riba-Garcia, Duncan HL Robertson, Simon J Gaskell, Stephen G Oliver, and Robert J Beynon. Dynamics of protein turnover, a missing dimension in proteomics. Molecular & Cellular Proteomics, 1(8):579–591, 2002.

[55] Brinda Ravikumar, Kevin Moreau, Luca Jahreiss, Claudia Puri, and David C Rubinsztein. Plasma membrane contributes to the formation of pre-autophagosomal structures. Nature cell biology, 12(8):747–757, 2010.

[56] Martin Rechsteiner and Scott W Rogers. Pest sequences and regulation by proteolysis. Trends in biochemical sciences, 21(7):267–271, 1996.

[57] Joke Reumers, Frederic Rousseau, and Joost Schymkowitz. Multiple evolutionary mechanisms reduce protein aggregation. Open Biol, 2:176–184, 2009.

[58] Peter Rice, Ian Longden, Alan Bleasby, et al. Emboss: the european molecular biology open software suite. Trends in genetics, 16(6):276–277, 2000.

[59] Scott Rogers, Rodney Wells, and Martin Rechsteiner. Amino acid sequences common to rapidly degraded proteins: the pest hypothesis. Science, 234(4774):364–368, 1986.

[60] D Thomas Rutkowski, Stacey M Arnold, Corey N Miller, Jun Wu, Jack Li, Kathryn M Gunnison, Kazutoshi Mori, Amir A Sadighi Akha, David Raden, and Randal J Kaufman. Adaptation to er stress is mediated by differential stabilities of pro-survival and pro-apoptotic mrnas and proteins. PLoS biology, 4(11):e374, 2006.

[61] Aaron A Santner, Carrie H Croy, Farha H Vasanwala, Vladimir N Uversky, Ya-Yue J Van, and A Keith Dunker. Sweeping away protein aggregation with entropic bristles: intrinsically disordered protein fusions enhance soluble expression. Biochemistry, 51(37):7250–7262, 2012.

[62] Rudolf Schoenheimer et al. The dynamic state of body constituents. 1946.

[63] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13(11):2498–2504, 2003.

[64] Reshma Shringarpure and Kelvin JA Davies. Protein turnover by the proteasome in aging and disease. Free Radical Biology and Medicine, 32(11):1084–1089, 2002.

[65] Christian JA Sigrist, Lorenzo Cerutti, Nicolas Hulo, Alexandre Gattiker, Laurent Falquet, Marco Pagni, Amos Bairoch, and Philipp Bucher. Prosite: a documented database using patterns and profiles as motif descriptors. Briefings in bioinformatics, 3(3):265–274, 2002.

[66] Gajinder Pal Singh, Mythily Ganapathi, Kuljeet Singh Sandhu, and Debasis Dash. Intrinsic unstructuredness and abundance of pest motifs in eukaryotic proteomes. PROTEINS: Structure, Function, and Bioinformatics, 62(2):309–315, 2006.

[67] Qingxuan Song and Anuj Kumar. An overview of autophagy and yeast pseudohyphal growth: integration of signaling pathways during nitrogen stress. Cells, 1(3):263–283, 2012.

[68] Xiaofeng Song, Tao Zhou, Hao Jia, Xuejiang Guo, Xiaobai Zhang, Ping Han, and Jiahao Sha. Sprotp: a web server to recognize those short-lived proteins based on sequence-derived features in human cells. PloS one, 6(11):e27836, 2011.

[69] Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtěch Franc. The shogun machine learning toolbox. The Journal of Machine Learning Research, 11:1799–1802, 2010.

[70] Erik LL Sonnhammer, Gunnar Von Heijne, , et al. A hidden markov model for predicting transmembrane helices in protein sequences. In Ismb, volume 6, pages 175–182, 1998.

[71] Earl R Stadtman and Barbara S Berlett. Reactive oxygen-mediated protein oxidation in aging and disease. Chemical research in toxicology, 10(5):485–494, 1997.

[72] ER Stadtman and RL Levine. Free radical-mediated oxidation of free amino acids and amino acid residues in proteins. Amino acids, 25(3-4):207–218, 2003.

[73] Takafumi Tasaki, Shashikanth M Sriram, Kyong Soo Park, and Yong Tae Kwon. The n-end rule pathway. Annual review of biochemistry, 81:261, 2012.

[74] Peter Tompa, J Prilusky, I Silman, and JL Sussman. Structural disorder serves as a weak signal for intracellular protein degradation. Proteins: Structure, Function, and Bioinformatics, 71(2):903–909, 2008.

[75] UniProt et al. Uniprot: a hub for protein information. Nucleic Acids Research, page gku989, 2014.

[76] Kristina Uzunova, Kerstin Göttsche, Maria Miteva, Stefan R Weisshaar, Christoph Glanemann, Marion Schnellhardt, Michaela Niessen, Hartmut Scheel, Kay Hofmann, Erica S Johnson, et al. Ubiquitin-dependent proteolytic control of sumo conjugates. Journal of Biological Chemistry, 282(47):34167–34175, 2007.

[77] Ramunas M Vabulas and F Ulrich Hartl. Protein synthesis upon acute nutrient restriction relies on proteasome function. Science, 310(5756):1960–1963, 2005.

[78] Robin van der Lee, Benjamin Lang, Kai Kruse, Jörg Gsponer, Natalia Sánchez de Groot, Martijn A Huynen, Andreas Matouschek, Monika Fuxreiter, and M Madan Babu. Intrinsically disordered segments affect protein half-life in the cell and during evolution. Cell reports, 8(6):1832–1844, 2014.

[79] Gerben van Ooijen, Laura E Dixon, Carl Troein, and Andrew J Millar. Proteasome function is required for biological timing throughout the twenty-four hour cycle. Current Biology, 21(10):869–875, 2011.

[80] Alexander Varshavsky. The n-end rule pathway and regulation by proteolysis. Protein science, 20(8):1298–1345, 2011.

[81] Sheree A Wek, Shuhao Zhu, and Ronald C Wek. The histidyl-trna synthetase-related sequence in the eif-2 alpha protein kinase gcn2 interacts with trna and is required for activation in response to starvation for different amino acids. Molecular and cellular biology, 15(8):4497–4506, 1995.

[82] John C Wootton and Scott Federhen. Statistics of local complexity in amino acid sequences and sequence databases. Computers & chemistry, 17(2):149–163, 1993.

[83] Chunzhang Yang, Joey C Matro, Kristin M Huntoon, Y Ye Donald, Thanh T Huynh, Stephanie MJ Fliedner, Jan Breza, Zhengping Zhuang, and Karel Pacak. Missense mutations in the human sdhb gene increase protein degradation without altering intrinsic enzymatic function. The FASEB Journal, 26(11):4506–4516, 2012.

[84] Hsueh-Chi Sherry Yen, Qikai Xu, Danny M Chou, Zhenming Zhao, and Stephen J Elledge. Global protein stability profiling in mammalian cells. Science, 322(5903):918–923, 2008.

[85] Jonathan W Yewdell, Joshua R Lacsina, Martin C Rechsteiner, and Christopher V Nicchitta. Out with the old, in with the new? comparing methods for measuring protein degradation. Cell biology international, 35(5):457–462, 2011.
