Determinants and Prediction of Protein Degradation

Miguel Correa Marrero

1 Introduction

Biologists used to regard proteins as rather static elements that were replaced only when damaged. When stable isotopes started being used to trace metabolic processes [62], this view gave way to a more dynamic one in which biomolecules are continuously synthesized and degraded. Although our understanding of protein degradation has improved greatly since those early days, this form of post-transcriptional regulation has received less attention than others, to the point that it has been called a "missing dimension in proteomics" [54]. As a result, the physicochemical, structural and sequence characteristics that underlie the broad range of protein half-lives [80] are not well understood.

Even though protein degradation has been found to contribute less to the control of protein concentrations in the cell than other forms of regulation [38], it is no less a key mechanism of homeostasis. Its roles include the irreversible removal of proteins to adapt to new physiological conditions, the removal of aberrant and otherwise damaged proteins, and the maintenance of an adequate amino acid pool [36]; it thus affects many different cellular processes [48],[60],[79]. A clear, dramatic example of the importance of protein turnover is found in the work of Hirata et al. [26], in which artificially prolonging the half-life of the transcription factor Hes7 by 8 minutes during development disrupts somite segmentation. Other examples are the implication of mutations that decrease protein stability in the pathogenesis of certain neuroendocrine tumours [83] and the loss of body mass in AIDS and cancer patients resulting from small increases in half-lives [11].

A complete understanding of the regulation of biological networks and the dynamics of proteomes requires information on how gene expression is regulated at all levels, including protein degradation, the end point of gene expression. However, proteome-wide determination of half-lives is expensive and time-consuming, and this situation will probably not change in the near future. There is therefore interest in building a computational tool that predicts half-lives. Such a tool would also tell us which characteristics of a protein determine its half-life. This information could prove useful in several ways. For example, it could be used to obtain parameters for mathematical models that attempt to predict the behaviour of biological systems. In a more applied setting, it could be employed to stabilize proteins for enzyme replacement therapy and make the treatment more efficient by reducing the dose required. It could also be exploited in metabolic engineering and synthetic biology to manipulate pathways.

There have been several lines of research on the determinants of protein degradation. Some early studies focused on protein thermodynamic stability [46],[45]. However, these studies seem to have gone largely unnoticed, and since then attention has focused on sequence signals. One of the first sequence signals proposed for protein degradation was the N-end rule [2], which states that a protein's half-life is a function of its N-terminal residue. However, the N-end rule alone does not fully explain half-lives [43] and has even been shown not to apply in Mycoplasma pneumoniae [44]. It has also been proposed that PEST regions (roughly defined as protein segments enriched in proline, glutamic acid, serine and threonine) lead to rapid degradation [59], although many of them are known to be conditional signals [56].
Other motifs, such as the destruction box [19] or the KEN box [53], have also been proposed to regulate half-life. Given the broad variety of substrates that the proteolytic machinery needs to degrade, however, it would seem that half-lives should also be influenced by a range of generic characteristics and their interplay, and not merely by a small set of sequence motifs. Furthermore, many of these studies were performed on a rather small set of proteins, which limits their explanatory power. These arguments point to the need for a global analysis of protein degradation.

Several authors have already attempted such an approach by exploiting large-scale datasets. There have been both univariate statistical approaches and machine learning approaches. Among the former is the study by Tompa et al. [74], which uses a yeast dataset [3] to find relationships between half-lives and a number of properties by simple linear regression. Another study, by van der Lee et al. [78], used several datasets to examine the influence of structural disorder by binning the data into several categories. In contrast to these, machine learning approaches allow a multivariate analysis. Unfortunately, previous machine learning studies of the problem [29],[68],[49] have all used a dataset [84] obtained by a flawed methodology [1],[85], which raises doubts about the significance of their results. Furthermore, they used classification algorithms (of varying interpretability), whereas regression would be more useful, given that half-life is not a discrete category but a continuous quantity. Perhaps surprisingly, there is only one computational tool, ProtParam [17], that attempts to estimate half-life. However, it is based solely on the traditional N-end rule. Overall, these studies have reached little consensus.

In this work, we attempt to address the problem of predicting protein half-life and uncovering its determinants.
We do this by using machine learning, which allows us to combine a large number of protein characteristics into a predictive model. We use support vector regression to try to learn a simple, interpretable model of protein degradation rates from datasets obtained by a reliable methodology, and we inspect the data carefully to avoid the pitfalls that previous attempts have fallen into. We integrate a large number of possibly relevant features, focusing on those that can be derived from the amino acid sequence, in order to try to answer some open questions in proteolysis, such as the relevance of sequence signals, structural disorder or post-translational damage.

2 Materials and methods

2.1 Datasets

Human leukemia dataset, collected by Kristensen et al. [33]. The data were obtained from the myelomonocytic leukemia THP-1 cell line, from both proliferating and differentiating cells. We have used exclusively the measurements taken under proliferation conditions, as more data are available for them and the authors do not find a significant difference in half-lives between the two conditions. The authors use the pulsed SILAC (stable isotope labeling with amino acids in cell culture) technique to collect the data. Two different cell populations are used in this technique, one grown on light and another on medium amino acids. The growth medium of the former is replaced by another containing heavy amino acids. Protein degradation is then measured as a decrease in medium amino acids, with the population grown on light amino acids acting as a control.

Yeast dataset, obtained from the work of Helbig et al. [24]. In this study, Saccharomyces cerevisiae was grown in chemostat cultures under nitrogen-limited conditions. Once steady state was achieved, 15N was supplied to the cells instead of 14N, leading to gradual incorporation of the 15N isotope into newly synthesized proteins.
Then, by following the evolution of the 14N signal intensity, protein turnover rates can be calculated. It should be mentioned, though, that nitrogen limitation might have triggered faster degradation of these preexisting proteins in order to maintain an adequate amino acid pool [77].

Protein sequences, together with manual annotation of subcellular location, were retrieved from UniProtKB/Swiss-Prot [75] for the leukemia dataset and from the Saccharomyces Genome Database [6] for the yeast dataset.

2.2 Data cleaning and inspection

A number of proteins (a particularly substantial number in the leukemia dataset) could not be identified unambiguously during data collection and were assigned a group of possible identities. These proteins were removed, as we cannot be certain of their amino acid sequence. Sequences containing ambiguous amino acids were removed for the same reason. Proteins that had been assigned different half-life measurements within the same dataset were also filtered out. The yeast dataset includes 90% confidence intervals. For about half of the proteins, the lower bound of the confidence interval was negative. These proteins were discarded, as the measurements seemed unreliable. Finally, after removing outliers, 296 proteins were left in the yeast dataset and 464 in the leukemia dataset.

BLASTClust [9] was used to avoid biasing the training process with redundant sequences. BLASTClust performs clustering by running BLAST on all possible pairwise alignments. Sequences were clustered together and considered potentially redundant if they were 95% identical over 90% of the length of each sequence. Sequences in a cluster were only considered redundant if they showed very similar half-lives, since similar sequences with rather different half-lives could contribute important information.
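
The redundancy rule described above can be illustrated with a minimal Python sketch. It is not the code used in this work. It assumes the clusters are available as text with one cluster per line and whitespace-separated identifiers, and it introduces a hypothetical relative tolerance, rel_tol, to stand in for "very similar half-lives", since the text gives no numerical threshold. The function names are likewise illustrative.

```python
def parse_clusters(text):
    """Parse cluster text: one cluster per line, IDs separated by whitespace.

    The input layout is an assumption made for this sketch.
    """
    return [line.split() for line in text.splitlines() if line.strip()]


def filter_redundant(clusters, half_life, rel_tol=0.1):
    """Keep one representative per group of similar half-lives in each cluster.

    clusters  : list of lists of protein IDs (sequence-similar groups)
    half_life : dict mapping protein ID -> measured half-life
    rel_tol   : hypothetical threshold; two half-lives count as "very similar"
                if they differ by at most rel_tol times the larger of the two
    """
    kept = []
    for cluster in clusters:
        representatives = []
        # Sorting makes the choice of representative deterministic.
        for pid in sorted(cluster):
            hl = half_life[pid]
            is_redundant = any(
                abs(hl - half_life[rep]) <= rel_tol * max(hl, half_life[rep])
                for rep in representatives
            )
            if not is_redundant:
                representatives.append(pid)
        kept.extend(representatives)
    return kept
```

In this sketch, a cluster member is dropped only when an already retained member of the same cluster has a half-life within the tolerance. Similar sequences with clearly different half-lives are all kept, in line with the rationale given above.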
We inspected the data for possible experimental biases related to subcellular location, since such biases already affected previous machine learning approaches. To do so, we performed a series of GO enrichment analyses for the cellular component ontology using the BiNGO plugin [42] for Cytoscape [63] with default parameters. We searched for overrepresented terms in the whole dataset, using both the whole genome and the dataset itself as background. Furthermore, the dataset was split into quartiles according to half-life, and we searched for overrepresented terms in each quartile, using both the whole genome and the whole dataset as background.
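
The quartile-based analysis above can be illustrated with a minimal Python sketch. This is not BiNGO itself, and it does not reproduce its multiple-testing correction. It assumes an overrepresentation test based on the upper tail of the hypergeometric distribution, with the whole dataset as background. The function names and the toy annotation in the usage note are hypothetical.

```python
from math import comb


def split_quartiles(half_life):
    """Sort protein IDs by half-life and split them into four groups."""
    ordered = sorted(half_life, key=half_life.get)
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(4)]


def enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N : number of proteins in the background
    K : background proteins annotated with the term
    n : proteins in the tested group (e.g. one quartile)
    k : proteins in the tested group annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def term_enrichment(group, background, annotated):
    """P-value for overrepresentation of one term in group vs background."""
    k = sum(1 for pid in group if pid in annotated)
    K = sum(1 for pid in background if pid in annotated)
    return enrichment_p(k, len(group), K, len(background))
```

As a usage example, with eight proteins of increasing half-life of which the three shortest-lived carry a given term, term_enrichment(split_quartiles(half_life)[0], list(half_life), annotated) tests whether the term is overrepresented in the first quartile. When many terms are tested, the resulting p-values would still need correction for multiple testing.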
