Constraints and Consequences of the Emergence of Amino Acid Repeats in Eukaryotic Proteins
Total Page:16
File Type:pdf, Size:1020Kb
ARTICLES Constraints and consequences of the emergence of amino acid repeats in eukaryotic proteins Sreenivas Chavali1 , Pavithra L Chavali1,2,5, Guilhem Chalancon1,5, Natalia Sanchez de Groot1, Rita Gemayel3,4, Natasha S Latysheva1, Elizabeth Ing-Simmons1, Kevin J Verstrepen3,4, Santhanam Balaji1 & M Madan Babu1 Proteins with amino acid homorepeats have the potential to be detrimental to cells and are often associated with human diseases. Why, then, are homorepeats prevalent in eukaryotic proteomes? In yeast, homorepeats are enriched in proteins that are essential and pleiotropic and that buffer environmental insults. The presence of homorepeats increases the functional versatility of proteins by mediating protein interactions and facilitating spatial organization in a repeat-dependent manner. During evolution, homorepeats are preferentially retained in proteins with stringent proteostasis, which might minimize repeat-associated detrimental effects such as unregulated phase separation and protein aggregation. Their presence facilitates rapid protein divergence through accumulation of amino acid substitutions, which often affect linear motifs and post-translational-modification sites. These substitutions may result in rewiring protein interaction and signaling networks. Thus, homorepeats are distinct modules that are often retained in stringently regulated proteins. Their presence facilitates rapid exploration of the genotype– phenotype landscape of a population, thereby contributing to adaptation and fitness. Repetitive sequences are an important source of genetic variation and the transcription factor Runx2 is linked to variations in canine skull are prevalent across all eukaryotic genomes. Amino acid homore- morphology16. Thus, amino acid HRs can drive varied molecular peats (HRs) of various lengths have been implicated in diverse human effects, depending on their length and amino acid type, and the bio- diseases (for example, abnormal polyglutamine (polyQ) expansion chemical or molecular functions of the HR-containing proteins, thus in huntingtin leads to Huntington’s disease1). HR-mediated patho- ultimately leading to either detrimental or beneficial outcomes for genicity can arise because of diverse molecular reasons. For instance, organismal phenotypes17. anomalous polyalanine (polyA) length in the transcription factor Amino acid HRs are prevalent in eukaryotic proteins and comprise Foxl2 alters the protein’s subcellular localization2, whereas abnormal ~15% of any eukaryotic proteome18–20 (Supplementary Note 1). polyQ length in the androgen receptor affects the protein’s abun- Such HRPs are involved in a wide variety of functions including dance and stability and its interactions in the cell3–5. Importantly, DNA and RNA binding, signaling and adhesion19–21. In this study, overexpression of wild type (WT) homorepeat-containing proteins we investigated the following questions: Why are HRPs prevalent (HRPs) such as ataxin-1 and androgen receptor with nonpathogenic in eukaryotic proteomes, although altered expression or abnormal HR lengths tends to phenotypically mimic the detrimental effects of HR length is often detrimental? Why are HRs found in only certain HRPs with abnormally long HRs4,6. This effect is primarily mediated proteins? If HRs are important, what benefits do they provide to by the potential of repeats to form aggregates and sequester soluble an organism? © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 proteins into deposits, thus resulting in a loss of protein activity or gain of toxicity due to the aggregates1,7. RESULTS Although homorepeats can have negative consequences, the pres- Analyses of budding yeast proteome ence of HRs can also be beneficial8. PolyQ in the fungal RNA-binding We carried out a comparative genomics study investigating multiple protein Whi3 is required for cell-cycle control, whereas polyA in the large-scale data sets, using budding yeast as a model organism. We fly transcription factor Exd, polyserine in the human lysyl hydroxy- first identified 805 yeast proteins with amino acid HRs encompass- lase Jmdj6 and polyhistidine in diverse human transcription factors ing at least one continuous stretch of five or more identical amino influence subcellular localization9–12. Whereas polyQ and polypro- acids (Fig. 1a,b). This cutoff was based on the observations that (i) line (polyP) have been linked to transcriptional activation, polyA the occurrence of stretches with five or more identical amino acids in insect Hox-encoding genes aids in transcriptional repression13,14. is statistically significant compared to randomized sequences with PolyQ length variation in White Collar-1 tunes the circadian rhythm similar amino acid compositions (Fig. 1c) and (ii) the smallest HR in Neurospora15, and variation in the polyQ/polyA length ratio in length whose length variation is implicated in human diseases is five 1MRC Laboratory of Molecular Biology, Cambridge, UK. 2Li Ka Shing Centre, Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK. 3VIB Laboratory for Systems Biology, Leuven, Belgium. 4CMPG Laboratory for Genetics and Genomics, KU Leuven, Leuven, Belgium. 5These authors contributed equally to this work. Correspondence should be addressed to S.C. ([email protected]) or M.M.B. ([email protected]). Received 14 February; accepted 23 June; published online 14 August 2017; doi:10.1038/nsmb.3441 NATURE STRUCTURAL & MOLECULAR BIOLOGY VOLUME 24 NUMBER 9 SEPTEMBER 2017 765 ARTICLES a b Proteome of S. cerevisiae c 0.04 Z score = 64.8 (6,393 proteins) –3 Homorepeat (HR) P < 1.0 × 10 AAAAA Repeat unit: A 0.03 Repeat length: 5 Proteins without Proteins with homorepeats (nonHRPs) homorepeats (HRPs) N C 0.02 4,938 proteins 805 proteins Density 0.01 250 e Nonpolar Negative E.g., arsenate E.g., chromatin-remodeling- 200 0 Polar uncharged reductase Arr2 complex subunit Snf5 90 100 120 140 160 810 Positive 150 No. of HRPs d GO-term description No. of genes FDR 100 Regulation of transcription 167 1.0 × 10–16 No. of proteins Regulation of RNA 130 2.7 × 10–15 50 metabolic process Negative regulation of cellular 66 –7 0 biosynthetic process 9.0 × 10 Ile –6 Ser Gln Glu Asp Asn Lys Pro Leu Thr Ala His Phe Arg Gly Cys Val Tyr Cell-wall organization 69 2.1 × 10 Type of amino acid HRs f g h 0.04 –4 Z score = 2.3 Phenotypes Proteins 50 –5 45 –2 Growth P = 1.0 × 10 NonHRP 40 Normal conditions (6) HRP 30 = 1.1 × 10 = 4.0 × 10 30 P 0.02 With chemical insult (136) P Density Morphological 20 15 Organizational (6) 10 resistance is conferred 0 Cellular (247) No. of phenotypes per gene 0 No. of small molecules to which 0 100 120 140 160 180 n = 3,620 n = 620 n = 3,455 n = 600 CLES: 63.6% Essential genes NonHRP (3,620) HRP (620) CLES: 63.8% Figure 1 Proteins with homorepeats in yeast and their physiological importance. (a) Schematic of a protein bearing a homorepeat. (b) Yeast proteins with HRs (HRPs) and without HRs (nonHRPs). (c) Enrichment of homorepeats of five or more amino acids in length in yeast was tested by generating 1,000 shuffled proteomes, by using permutation tests. Each protein in the yeast proteome was shuffled for the sequence order by keeping the length and amino acid composition constant. The null expectation in the shuffled proteomes (gray histogram) and the actual observation (red arrow) of HRPs are shown. The Z score indicates the distance of the actual observation to the mean of random expectations in terms of the number of s.d. The P value was estimated as the ratio of random observations greater than or equal to the number of actually observed HRPs to the total number of random samples. (d) GO biological processes enriched in HRPs. (e) Distribution of proteins with different amino acid–repeat types. (f) Enrichment of HRPs among essential genes, tested with permutation tests by performing 10,000 iterations. The random expectation (gray histogram) and the actual observation (red arrow) of essential HRPs are shown. (g,h) Box plots of distributions of pleiotropic effects (g) and the number of small molecules that a gene confers resistance to (h), among HRPs and nonHRPs. The black line within the box represents the median, and the box limits represent the first and third quartiles. The notches correspond to ~95% confidence intervals for the median. Whiskers (dashed lines) show the data points up to 1.5× the interquartile range from the box. Values beyond the whiskers were considered outliers and are not shown to improve visualization. The effect size is indicated by the common language effect size (CLES), which describes the probability that a randomly sampled data point from a distribution A will be greater than a data point sampled from distribution B. n represents number of proteins in each class. © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 residues (polyaspartate (polyD), in the cartilage oligomeric complex HRs are prevalent in essential and highly pleiotropic genes that protein COMP22). HRPs in yeast are involved in vital regulatory proc- buffer environmental perturbations esses such as transcriptional regulation and nucleic acid metabolism, Genes that are essential for growth in rich medium show an enrich- similarly to HRPs in other organisms19 (Fig. 1d). Repeats of polar ment to contain HRs (Fig. 1f), thus suggesting that HRPs are important amino acids are prevalent in yeast HRPs (Fig. 1e). Using multiple for cell fitness. We constructed a yeast gene-to-phenotype (G2P) net- genome-scale data sets, we compared yeast HRPs with a set of 4,938 work consisting of 4,220 genes and 395 phenotypes describing growth proteins lacking HRs (nonHRPs) for their phenotypes after perturba- (in normal conditions and after chemical insult) and morphological tion, their adaptability to different environmental insults, and their (organizational and cellular) features after genetic alteration (deletion regulatory and evolutionary properties (Supplementary Fig. 1). To and overexpression).