a r t i c l e s

Constraints and consequences of the emergence of amino acid repeats in eukaryotic

Sreenivas Chavali1 , Pavithra L Chavali1,2,5, Guilhem Chalancon1,5, Natalia Sanchez de Groot1, Rita Gemayel3,4, Natasha S Latysheva1, Elizabeth Ing-Simmons1, Kevin J Verstrepen3,4, Santhanam Balaji1 & M Madan Babu1

Proteins with amino acid homorepeats have the potential to be detrimental to cells and are often associated with human diseases. Why, then, are homorepeats prevalent in eukaryotic proteomes? In yeast, homorepeats are enriched in proteins that are essential and pleiotropic and that buffer environmental insults. The presence of homorepeats increases the functional versatility of proteins by mediating interactions and facilitating spatial organization in a repeat-dependent manner. During evolution, homorepeats are preferentially retained in proteins with stringent proteostasis, which might minimize repeat-associated detrimental effects such as unregulated phase separation and protein aggregation. Their presence facilitates rapid protein divergence through accumulation of amino acid substitutions, which often affect linear motifs and post-translational-modification sites. These substitutions may result in rewiring protein interaction and signaling networks. Thus, homorepeats are distinct modules that are often retained in stringently regulated proteins. Their presence facilitates rapid exploration of the genotype– phenotype landscape of a population, thereby contributing to adaptation and fitness.

Repetitive sequences are an important source of genetic variation and the transcription factor Runx2 is linked to variations in canine skull are prevalent across all eukaryotic genomes. Amino acid homore- morphology16. Thus, amino acid HRs can drive varied molecular peats (HRs) of various lengths have been implicated in diverse human effects, depending on their length and amino acid type, and the bio- diseases (for example, abnormal polyglutamine (polyQ) expansion chemical or molecular functions of the HR-containing proteins, thus in huntingtin leads to Huntington’s disease1). HR-mediated patho- ultimately leading to either detrimental or beneficial outcomes for genicity can arise because of diverse molecular reasons. For instance, organismal phenotypes17. anomalous polyalanine (polyA) length in the transcription factor Amino acid HRs are prevalent in eukaryotic proteins and comprise Foxl2 alters the protein’s subcellular localization2, whereas abnormal ~15% of any eukaryotic proteome18–20 (Supplementary Note 1). polyQ length in the androgen receptor affects the protein’s abun- Such HRPs are involved in a wide variety of functions including dance and stability and its interactions in the cell3–5. Importantly, DNA and RNA binding, signaling and adhesion19–21. In this study, overexpression of wild type (WT) homorepeat-containing proteins we investigated the following questions: Why are HRPs prevalent (HRPs) such as ataxin-1 and androgen receptor with nonpathogenic in eukaryotic proteomes, although altered expression or abnormal HR lengths tends to phenotypically mimic the detrimental effects of HR length is often detrimental? Why are HRs found in only certain HRPs with abnormally long HRs4,6. This effect is primarily mediated proteins? If HRs are important, what benefits do they provide to by the potential of repeats to form aggregates and sequester soluble an organism? © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 proteins into deposits, thus resulting in a loss of protein activity or gain of toxicity due to the aggregates1,7. RESULTS Although homorepeats can have negative consequences, the pres- Analyses of budding yeast proteome ence of HRs can also be beneficial8. PolyQ in the fungal RNA-binding We carried out a comparative genomics study investigating multiple protein Whi3 is required for cell-cycle control, whereas polyA in the large-scale data sets, using budding yeast as a model organism. We fly transcription factor Exd, polyserine in the human lysyl hydroxy- first identified 805 yeast proteins with amino acid HRs encompass- lase Jmdj6 and polyhistidine in diverse human transcription factors ing at least one continuous stretch of five or more identical amino influence subcellular localization9–12. Whereas polyQ and polypro- acids (Fig. 1a,b). This cutoff was based on the observations that (i) line (polyP) have been linked to transcriptional activation, polyA the occurrence of stretches with five or more identical amino acids in insect Hox-encoding aids in transcriptional repression13,14. is statistically significant compared to randomized sequences with PolyQ length variation in White Collar-1 tunes the circadian rhythm similar amino acid compositions (Fig. 1c) and (ii) the smallest HR in Neurospora15, and variation in the polyQ/polyA length ratio in length whose length variation is implicated in human diseases is five

1MRC Laboratory of Molecular Biology, Cambridge, UK. 2Li Ka Shing Centre, Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK. 3VIB Laboratory for Systems Biology, Leuven, Belgium. 4CMPG Laboratory for Genetics and Genomics, KU Leuven, Leuven, Belgium. 5These authors contributed equally to this work. Correspondence should be addressed to S.C. ([email protected]) or M.M.B. ([email protected]). Received 14 February; accepted 23 June; published online 14 August 2017; doi:10.1038/nsmb.3441

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 765 a r t i c l e s

a b Proteome of S. cerevisiae c 0.04 Z score = 64.8 (6,393 proteins) –3 Homorepeat (HR) P < 1.0 × 10 AAAAA Repeat unit: A 0.03 Repeat length: 5 Proteins without Proteins with homorepeats (nonHRPs) homorepeats (HRPs) N C 0.02 4,938 proteins 805 proteins Density

0.01 250 e Nonpolar Negative E.g., arsenate E.g., chromatin-remodeling- 200 0 Polar uncharged reductase Arr2 complex subunit Snf5 90 100 120 140 160 810 Positive 150 No. of HRPs d GO-term description No. of genes FDR 100 Regulation of transcription 167 1.0 × 10–16

No. of proteins Regulation of RNA 130 2.7 × 10–15 50 metabolic process Negative regulation of cellular 66 –7 0 biosynthetic process 9.0 × 10 Ile –6 Ser Gln Glu Asp Asn Lys Pro Leu Thr Ala His Phe Arg Gly Cys Val Tyr Cell-wall organization 69 2.1 × 10 Type of amino acid HRs f g h

0.04 –4

Z score = 2.3 Phenotypes Proteins 50 –5 45 –2 Growth P = 1.0 × 10 NonHRP 40 Normal conditions (6) HRP 30

30 P = 1.1 × 10 0.02 With chemical insult (136) P = 4.0 × 10

Density Morphological 20 15 Organizational (6) 10 resistance is conferred

0 Cellular (247) No. of phenotypes per

0 No. of small molecules to which 0 100 120 140 160 180 n = 3,620 n = 620 n = 3,455 n = 600 CLES: 63.6% Essential genes NonHRP (3,620) HRP (620) CLES: 63.8%

Figure 1 Proteins with homorepeats in yeast and their physiological importance. (a) Schematic of a protein bearing a homorepeat. (b) Yeast proteins with HRs (HRPs) and without HRs (nonHRPs). (c) Enrichment of homorepeats of five or more amino acids in length in yeast was tested by generating 1,000 shuffled proteomes, by using permutation tests. Each protein in the yeast proteome was shuffled for the sequence order by keeping the length and amino acid composition constant. The null expectation in the shuffled proteomes (gray histogram) and the actual observation (red arrow) of HRPs are shown. The Z score indicates the distance of the actual observation to the mean of random expectations in terms of the number of s.d. The P value was estimated as the ratio of random observations greater than or equal to the number of actually observed HRPs to the total number of random samples. (d) GO biological processes enriched in HRPs. (e) Distribution of proteins with different amino acid–repeat types. (f) Enrichment of HRPs among essential genes, tested with permutation tests by performing 10,000 iterations. The random expectation (gray histogram) and the actual observation (red arrow) of essential HRPs are shown. (g,h) Box plots of distributions of pleiotropic effects (g) and the number of small molecules that a gene confers resistance to (h), among HRPs and nonHRPs. The black line within the box represents the median, and the box limits represent the first and third quartiles. The notches correspond to ~95% confidence intervals for the median. Whiskers (dashed lines) show the data points up to 1.5× the interquartile range from the box. Values beyond the whiskers were considered outliers and are not shown to improve visualization. The effect size is indicated by the common language effect size (CLES), which describes the probability that a randomly sampled data point from a distribution A will be greater than a data point sampled from distribution B. n represents number of proteins in each class.

© 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 residues (polyaspartate (polyD), in the cartilage oligomeric complex HRs are prevalent in essential and highly pleiotropic genes that protein COMP22). HRPs in yeast are involved in vital regulatory proc- buffer environmental perturbations esses such as transcriptional regulation and nucleic acid metabolism, Genes that are essential for growth in rich medium show an enrich- similarly to HRPs in other organisms19 (Fig. 1d). Repeats of polar ment to contain HRs (Fig. 1f), thus suggesting that HRPs are important amino acids are prevalent in yeast HRPs (Fig. 1e). Using multiple for cell fitness. We constructed a yeast gene-to-phenotype (G2P) net- genome-scale data sets, we compared yeast HRPs with a set of 4,938 work consisting of 4,220 genes and 395 phenotypes describing growth proteins lacking HRs (nonHRPs) for their phenotypes after perturba- (in normal conditions and after chemical insult) and morphological tion, their adaptability to different environmental insults, and their (organizational and cellular) features after genetic alteration (deletion regulatory and evolutionary properties (Supplementary Fig. 1). To and overexpression). HRPs were associated with a higher number of ensure that the nonHRPs did not contain any repetitive sequences, phenotypes than nonHRPs (Fig. 1g), even after minimization of the we did not consider proteins with imperfect repeats (n = 650). To redundancy in the G2P network by merging phenotypes sharing ≥50% control for enrichments in the trends that were associated with HRPs, of genes (Supplementary Fig. 2a). Investigation of the fitness data we also compared them to different groups of relevant nonHRPs, showed that deletion of nonessential HRPs, compared with nonHRPs, such as highly disordered proteins, proteins with similar functions, made the cells sensitive to a higher number of chemical insults and essential and highly pleiotropic proteins, length-matched proteins stresses (Fig. 1h). Therefore, HRs are prevalent in proteins whose and proteins with nonrepetitive amino acid bias (discussed below presence tends to buffer environmental perturbations and chemical and in Online Methods). insults. Thus, proteins with HRs tend to be essential, to influence a

766 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

a b e NonHRP HRP Proteins with low intrinsic disorder (≤30%) NonHRP HRP

120 60 100 80 100 60 –18 –6 –25 –6

80 –7 –8 60 80 80 40 60 40 40 60 P = 1.0 × 10 P = 2.0 × 10 P = 1.0 × 10 P = 4.0 × 10

40 P = 1.3 × 10 P = 1.5 × 10 40 20 20 in PPI network 40 interactions (PPIs) 20 in genetic network No. of protein-protein No. of link communities 20 in PPI network No. of link communities No. of genetic interactions interactions (PPIs)

0 0 0 0 No. of protein-protein 20 No. of link communities

= 577 = 3,146 n = 3,944 n = 719 = 3,146 n = 577 n = 719 0 n n n n = 3,944 0 n = 2,979 n = 236 n = 2,979 n = 236 CLES: 65.8% CLES: 69.5% CLES: 65.8% CLES: 71.5% CLES: 64.3% CLES: 65.3%

c Ribosome/preribosome particle d f Proteins with high intrinsic disorder (>30%) NonHRP

Transcription Nop14 HRP Other Nop2 Nop1 1 Signaling mRNA Nop7 Nog1 2 processing Pob3 Nsa1 Mrt4 FACT complex Translation Puf6 Loc1 Spt16 Urn1 Nhp6a Aim4–Urn1 complex Rix1 Kre33 Rrp1 MutS–alpha complex Aim4 Gis2 100 Rrp5 Nhp6b 20 60 Taf12 Dbp9

Cka1 –5 Taf5 Ubp3 Dbp10 Ckb2 Bre5 –5 –9 Taf6 SAGA complex Protein kinase 80 Ckb1 CK2 complex 15 Gcn5 Ref2 Cka2 Spt7 40 Rpa43 Spt5 Rna14 Rpa135 60 RNA polymerase I Rpb9 Rna15 mRNA cleavage- P = 4.0 × 10 Rpo21 Rpb8 10 P = 7.0 × 10 P = 4.2 × 10 Rpa49 factor complex Rpo26 Rpb7 Cft2 Pap1 Rpa190 40 Ctr9 Leo1

Pph3 Psy2 Rtr1 Rpb5 in PPI network 20 Rtf1 5 Protein phosphatase Spt4 interactions (PPIs) Rpb4 No. of protein-protein

Paf1 complex 20 No. of link communities PP4 complex Psy4 Tfg1 Paf1

Rpb3 No. of different GO functions Tfg2 Rpb2 0 YHR143W-A Asr1 Rpb11 0 n = 4,354 n = 748 0 n = 965 n = 483 n = 965 n = 483 RNA polymerase II and associated proteins CLES: 64.4% CLES: 56.4% CLES: 59.6%

Figure 2 HRPs are functionally versatile and mediate interactions. (a,b) Box plots of distributions of genetic interactions and protein-protein interactions (a) and link communities in genetic and protein-protein interaction networks (b). (c) Network illustrating link communities in the protein interaction network of the polyD-containing transcriptional regulator Spt5, which affects diverse biological processes. (d) Box plots of different GO terms associated with genes (functional versatility). In a,b and d, box plots are as in Figure 1. Statistical significance was assessed with Wilcoxon rank-sum tests with false discovery rate (FDR) correction for multiple testing. The proportion of intrinsic disorder in a protein can influence the propensity of proteins to interact with multiple partners and hence protein functionality60. To investigate the effect of protein disorder on the functional versatility of HRPs, we classified nonHRPs and HRPs on the basis of intrinsic disorder content, as those with (i) low disorder (≤30%) and (ii) high disorder (>30%) fractions, and drew comparisons. (e,f) Box plots of the distribution of the numbers of protein-protein interactions and link communities in the protein- protein interaction network among HRPs and nonHRPs with low (e) and high (f) disorder content. Box plots are as in Figure 1. Statistical significance was assessed with Wilcoxon rank-sum tests with FDR correction for multiple testing, and effect sizes are displayed as CLES. Irrespective of the extent of protein disorder, HRPs tend to have more interactions and participate in diverse processes. n represents number of proteins in each class.

range of growth and morphological phenotypes (i.e., to be pleiotropic) terms associated with each gene24 (Fig. 2d). These trends were not © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 and to confer adaptability to diverse environmental insults. confounded by the extent of intrinsic protein disorder (Fig. 2e,f), protein length (Supplementary Fig. 2b) or differences in biological HRPs affect multiple biological processes through molecular functions between HRPs and nonHRPs (Supplementary Fig. 2c). interactions Importantly, these trends were also stronger than those for proteins To understand the molecular basis of the observed pleiotropy in with nonrepetitive amino acid bias (Supplementary Fig. 2d). HRPs proteins with HRs, we examined both genetic and biochemical were overrepresented among proteins with low solubility and those interaction networks. HRPs tended to have a higher number of that formed intracellular phase-separated structures such as stress genetic and protein interactions (Fig. 2a) than nonHRPs, thus granules, P bodies and heritable proteins (Fig. 3a–d). These obser- potentially explaining their influence on multiple phenotypes. vations collectively suggested that HRPs tend to spatially organize A network analysis of link communities, which represent intercon- protein-protein interactions within cells. nected sets of genes that participate in similar biological processes23, Whereas HRPs had a comparable number of target genes in revealed that proteins with HRs participated in more communities terms of the protein-DNA and protein-RNA interaction networks than nonHRPs (Fig. 2b). Thus HRPs often connect diverse biologi- (Supplementary Fig. 2e,f), an analysis of the Gene Perturbation cal processes by acting as intercommunity hubs, as exemplified by Network25 revealed that deletion of HRP regulators, as compared with the polyD-containing transcriptional regulator Spt5 (Fig. 2c). These their nonHRP counterparts, affected expression of a larger number observations were consistent with the higher multifunctionality of targets (Fig. 3e). Thus, proteins with HRs regulate a larger part of index of HRPs, as defined by the number of (GO) the genome and/or transcriptome than nonHRPs.

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 767 a r t i c l e s

a NonHRP HRP b c d mRNP

Proteins (%) 0.2 0 20 40 60 80 0.12 Z score = 3.0 0.15 Z score = 3.1 Z score = 3.6 P = 4.1 × 10–3 P = 3.4 × 10–3 P = 1.6 × 10–3 High 0.08 0.1 P = 7.6 × 10–7 0.1 Low OR: 3.6 Density Density Density 0.04 0.05 Solubility Normal n = 1,018 n = 178 0 0 0 0 5 10 15 20 25 0 5 10 15 0 5 10 15 Stress-granule proteins P-body proteins No. of heritable proteins

e f 1 2 –3 –2 300 60 –5 100 30 –3 Ancestral 80 –10 protein 80 60 200 40 20 60 P = 5.5 × 10 P = 2.2 × 10 P = 1.2 × 10 P = 1.5 × 10 Versus 40 P = 1.6 × 10 Paralogs 40 100 20 10 20 interactions NonHRP HRP 20 No. of protein-protein with altered expression No. of downstream genes No. of genetic interactions

0 No. of phenotypes per gene 0 0 0 0 No. of different GO functions

= 456 = 152 = 350 = 596 = 596 = 655 n n n = 408 n = 408 n n = 350 n n n n = 655 CLES: 75.3% Z score: 4.5 Z score: 2.3 Z score: 2.9 Z score: 6.7

Figure 3 HRPs form higher-order assemblies, and homorepeats might contribute to the functional versatility of HRPs. (a) Percentage of proteins with high, low and normal solubility among HRPs and nonHRPs. Statistical significance was assessed with chi-squared test, and effect sizes are displayed as odds ratios (OR). (b–d) Enrichment of HRPs among stress-granule proteins (b), P-body proteins (c) and heritable proteins (d) was tested with permutation tests, by performing 10,000 randomizations. The Z score indicates the distance of the actual observation to the mean of random expectations in terms of the number of standard deviations. P values were estimated as the ratios of randomly observed proteins greater than or equal to the number of observed HRPs to the total number of random samples (10,000). (e) Box plot showing the distribution of genes whose expression was altered after deletion of HRP and nonHRP regulators (from the gene-perturbation network). Statistical significance was estimated with Wilcoxon rank-sum tests, and effect sizes are displayed as CLES. (f) Distribution of pleiotropic effects, number of genetic interactions, protein-protein interactions and different GO terms among divergent paralog pairs in which one is an HRP and the other a nonHRP. Box plots are as in Figure 1. P values were estimated with Wilcoxon matched-pairs tests and FDR-corrected for multiple testing. Effect sizes are indicated by Z scores. n represents number of proteins in each class.

HRs might contribute to the trends observed in HRPs ated a haploid mutant strain by deleting the longest stretch of HR in To assess the roles and contributions of HRs to the observed trends, the protein (polyQ; amino acids 217–268; ∆HR) and compared it with we computed likelihood ratios on the basis of conditional probabili- the WT strain (Fig. 4a) to ensure that SNF5 ∆HR was expressed from ties. We found that HRPs had higher likelihood ratios than nonHRPs its endogenous promoter and was regulated in its natural context. (Supplementary Fig. 3). This result suggested that the presence of We examined the effect of the HR on growth under two different HRs might contribute to these features in HRPs and that the emer- carbon sources: glucose (yeast extract peptone dextrose (YPD) gence of HRs within proteins might facilitate the acquisition of these medium) and glycerol (yeast extract peptone glycerol (YPG) medium). features. To investigate this possibility, we examined paralog pairs that The ∆HR mutant displayed a minor delay in growth kinetics in YPD, arose by gene duplication and studied divergent pairs, wherein one whereas the delay was more pronounced in YPG (Fig. 4b). Moreover, © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 paralog contained the HR and the other did not (HRP–nonHRP pairs; we observed an increase in the budding index (Supplementary Fig. 5a), Fig. 3f). Among such paralog pairs, the paralog with an HR tended a result indicating an increase in the doubling time. Thus, polyQ to have a higher number of (i) phenotypes, (ii) genetic and protein in Snf5 might influence growth rates by either directly or indirectly interactions and (iii) GO functions than did the paralog that did not affecting the cell cycle. Deletion of polyQ did not affect the protein contain an HR (Fig. 3f). Thus, our observations collectively suggested abundance (Fig. 4c) or subcellular localization of Snf5. However, the that the presence of HRs is associated with, and may influence, a large WT protein formed punctate structures, whereas the ∆HR protein number of phenotypes and biological processes. Compared with the tended to be diffuse (Fig. 4d). This result was consistent with our length of the repeat, the type of amino acid repeat might have a greater genome-scale analysis (Fig. 3a–d) and suggested that the polyQ in influence on the molecular effects of HRs and on fitness (as suggested Snf5 is involved in forming protein assemblies in the cell that may by likelihood-ratio estimates; Supplementary Figs. 3 and 4). facilitate protein-protein interactions. To test this possibility, we experimentally determined the interac- PolyQ repeat in Snf5 affects fitness and confers adaptability to tome of WT and ∆HR Snf5, in YPD and YPG by using affinity cap- diverse environmental insults ture followed by mass spectrometry (Fig. 4e). We identified a total of To establish the endogenous roles of an amino acid homorepeat 146 interactions for WT Snf5, of which 130 had not been previously in a protein, we investigated a pleiotropic hub in the protein inter- reported (Supplementary Data Set 1 and Supplementary Fig. 5b,c). action network, Snf5, which is a core component of the Swi/Snf Because we detected all members of the Swi/Snf members in both chromatin-remodeling complex and contains a polyQ HR. We gener- strains and both growth conditions, polyQ in Snf5 does not appear

768 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

a Snf5 e YPD YPG HA/IgG Dynabeads g TF Pol 217 268 450 538 PolyQ+ PolyQ+ WT Lysis N C Trypsin digestion Targets YPD YPG PolyQ Biological ATF2 530 672 – – SNF5 domain HR PolyQ PolyQ process ∆ X X LC–MS analysis BAR1 Cell cycle SCW11 PolyQ+ PolyQ− YPD YPG WT ∆HR SER3 X 77 74 MET28 Cell-wall organization MET14 35 12 PolyQ-mediated MNN1 interactions CYS4 DNA binding ARG4 YPD YPD YPG YPG SIT1 Metabolism (13) Transport (9) Metabolism PHO5 b Response to 1.0 YPD WT stimulus (4) Protein HO PST1 ∆HR ∆Td = 8.8 min Cellular-component synthesis (11) Response to biogenesis and Biological stimulus CIT2 processes Unknown (1) 0.5 organization (11) CWP1 YPG DNA repair (6) Transport ENB1 Transcription (12) FRE1 Cell cycle (10)

Absorbance (600 nm) T = 218.9 min 0.1 ∆ d TAT1 Chromatin 0 24 48 72 remodeling (12) 2–∆∆Ct Time (h) 0.5 1.0 1.5 2.52.0 c vs f WT ∆HR YPD YPG YPD YPG MW MW h (kDa) WT ∆HR WT ∆HR WT WT (kDa) ∆ HR ∆ HR 110 2 Anti-HA Control UV (40 J/m ) HU (100 mM) H3K4me3 90 (Snf5) 17 WT 17 H3K27ac 45 Anti-tubulin Total H3 17 ∆HR d Anti-HA Anti-tubulin Merge WT i HA HRP: Snf5 Fitness 1 m 4 µm µ Merge

YPD ∆HR HA HR: PolyQ

Protein assemblage Gene expression Merge

WT TF HA Pol

Cell cycle 4 µm Merge

YPG ∆HR G1

HA Protein-protein interactions M S Merge

1.5 G2 P = 2.9 × 10–3 P = 5.3 × 10–10 PolyQ-mediated effects DNA-damage response 1.0 Processes affected directly or indirectly by Snf5 via polyQ-dependent protein-protein interactions 0.5 WT Complexes: INO80, RSC, MCM HA area/nuclear area HR YPD YPG ∆ 0 © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017

Figure 4 Homorepeat (polyQ) in Snf5 affects fitness and adaptability by mediating protein-protein interactions and influencing diverse biological processes. (a) WT or polyQ HR-deleted (∆HR) strains expressing Snf5 tagged with HA were grown in yeast peptone with 2% glucose (YPD) and yeast peptone with 4% glycerol (YPG). (b) Mean growth curves of WT or ∆HR (n = 4 independent cultures per strain) . Doubling time (Td) was derived from the slope of the best fit obtained by linear regression. (c) Immunoblot analysis of HA-tagged Snf5 with loading control (tubulin). MW, molecular weight. (d) Immunofluorescence analysis of Snf5 with anti-HA (red) and anti-tubulin (green). Magnified images of HA staining (right) and the ratio of HA area/nuclear area for ~40 cells in each condition (bottom) are shown. Statistical significance was assessed with Wilcoxon rank-sum tests. (e) The pie chart depicts polyQ-dependent interactions of Snf5 (observed in WT but absent in ∆HR) in YPD and YPG. The bottom chart shows the diverse biological processes affected by polyQ-dependent interactors (numbers in parentheses). Snf5 interactors were identified with a minimum of three unique peptides in at least two of the three experiments. (f) Immunoblot analysis of histone modifications from Snf5 WT or ∆HR strains, with total histone H3 as a loading control. H3K3me3, K3-trimethylated H3; H3K27ac, K27-acetylated H3. (g) Heat map showing differences in the transcript levels of Snf5 targets (in WT and ∆HR strains), quantified by qPCR. Black borders represent statistically significant alterations in gene expression relative to WT (two-tailed Student’s t test; FDR-corrected for multiple testing). Each target is linked to its biological process on the left. (h) Colony-formation assay of WT or ∆HR SNF5 yeast strains in serial dilution assay, after exposure to genotoxic stress. HU, hydroxyurea. (i) Schematic illustrating that polyQ HR in Snf5 contributes to fitness by mediating interactions with proteins involved in diverse biological processes and formation of higher-order assemblages.

to affect formation of the Swi/Snf complex. However, 12/89 (~13.5%) YPG were mediated in a polyQ-dependent manner (Supplementary interactions in YPD, and a striking 35/109 (~32%) interactions in Fig. 5d). These polyQ-dependent interactions involved proteins

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 769 a r t i c l e s

b a c NonHRP HRP

Relative translational rate (ribosomal density × Protein abundance ribosomal occupancy) (copies per cell) Protein half-life (min) 0 0.2 0.4 0.6 0.8 1.0 0 400 800 1,200 0 50 100 150 Proteostasis

n = 4,162 n = 1,811 n = 2,442 –28 P = 2.9 × 10 P = 9.4 × 10–4 P = 3.9 × 10–10 CLES: 71.8% CLES: 65.4% CLES: 69.3% n = 740 n = 350 n = 496

Factors influencing protein synthesis Factors influencing protein degradation

d e h i Proteins with long- Poly(A) tail 5′ UTR N-terminal disorder (%) 0 5 10 15 20 25 AAAAAA AAAAAA n = 871 n = 180 60 n = 2,386

–83 –12 80 0.8 –3 P = 2.2 × 10 OR: 2.4

–36 n = 492 40 60 0.4 P = 4.6 × 10 OR: 9.56 P = 2.7 × 10 j Proteins with endo- P = 7.2 × 10 40 20 proteolytic sites (%) Structural features 0 Translation initiation 0 20 40 60 Genes (%)

20 −0.3 Highly disordered proteins (%) 0 (average PARS score) n = 2,386 RNA secondary structure n = 2,442 n = 496 P = 3.6 × 10–62 OR: 5.6 −0.6 0 OR: 6.3 Long Short n = 1,884 n = 273 n = 492 Poly(A)-tail length CLES: 62.9% k l Proteins with destruction box (%) f g Codon usage 60 0 5 10 15 20 25 tRNA availability (demand)

(supply) –40 Coding region n = 2,442 40 P = 5.5 × 10–5 OR: 1.6 AUG UAG AAAA Global n = 496

Average normalized P = 4.4 × 10 0.8 translational efficiency (nTE) 20 m Proteins with KEN box (%) –5 0.16 0.20 0.24 0 4 8 12 Sequence features 0.6

Proteins with PEST motif (%) n = 2,442 n = 4,162 0 0.4 –2 P = 3.2 × 10 n = 2,442 n = 496 P = 1.3 × 10 OR: 1.5 –9 P = 9.8 × 10 n = 496 OR: 3.9 0.2 n = 740

–61 (average PARS score) 0 CLES: 65.6% n P = 1.2 × 10 RNA secondary structure 40 OR: 4.7 Local Translation elongation −0.2 n = 2,386 0.22 30 © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 n = 2,056 n = 303 n = 492 CLES: 64.9% 20 0.20 Proteins (%) 10 for each codon)

0.18 Combinatorial features Local nTE (average 0 200 400 0 None 1 2 3 >3 Distance from AUG No. of degradation signals

Figure 5 HRPs are stringently regulated at multiple levels during synthesis and degradation. (a–c,e–g) Box plots showing distribution of protein abundance (a), translational rate (b), protein half-life (c), 5′ untranslated region (UTR) (e), coding-region mRNA secondary structures (f) and translational efficiency among HRPs and nonHRPs (g). Statistical significance was assessed with Wilcoxon rank-sum tests, with FDR-corrected P values, and effect sizes are displayed as CLES. In g, the local translational efficiency (TE), represented as the average TE for each codon from AUG to 400 codons for HRPs and nonHRPs, is shown. (d,i–n) Bar plots showing percentages of HRPs and nonHRPs with long and short poly(A) tails (d), long-N-terminal disorder (i), endoproteolytic sites (j), PEST motifs (k), destruction boxes (l), KEN boxes (m) and combinatorial degradation signals (n). (h) Percentage of proteins that are highly disordered (protein disorder >30%) among HRPs and nonHRPs. Box plots are as in Figure 1. Statistical significance was assessed with Fisher’s exact tests with multiple testing correction, and effect sizes are displayed as odds ratios. In n, the odds ratio was estimated for the presence of more than one degradation signal.

770 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

a b In line with this possibility, and in agreement with the loss of interactions with members of other chromatin-remodeling com- plexes and transcription factors, deletion of polyQ in Snf5 resulted Protein abundance (copies per cell) in a global decrease in active histone marks such as trimethylated K4 NonHRP NonHRP 0 500 1,500 2,500 3,500 and acetylated K27 in histone H3 (Fig. 4f). A number of known Snf5 targets display significantly altered expression levels between WT and n = 478 ∆HR in both carbon sources, with more pronounced effects in YPG NonHRP HRP P = 2.6 × 10–3 (Fig. 4g). The WT strains were able to grow after exposure to UV or in n = 141 the presence of hydroxyurea, but the ∆HR mutants were susceptible in CLES: 73.6% both conditions (Fig. 4h), thus suggesting that polyQ in Snf5 confers c d adaptability and fitness in response to genotoxic insult. Thus, the presence of polyQ expands the functional versatility

Relative translational rate Paralogs in S. cerevisiae of Snf5 beyond that of the Snf5 domain, by facilitating interactions (ribosomal density × ribosomal occupancy) Protein half-life (min) with proteins from diverse biological processes in a polyQ-dependent 0 0.2 0.4 0.6 0.8 1.0 0 50 100 150 manner (Fig. 4i). In this way, the polyQ HR within Snf5 influences different phenotypes and determines fitness and adaptability in n = 948 n = 551 diverse environments. P = 2.6 × 10–3 P = 8.8 × 10–3 n = 277 n = 171 The activity of HRPs is highly regulated in cells CLES: 71.1% CLES: 72.2% Given the functional versatility, pleiotropic effects and the ability of proteins with HRs to spatially organize proteomes, we investigated how the activity of HRPs is regulated in cells. Protein activity can be e ~700 Ma f regulated by (i) controlling steady-state abundance (coarse tuning)

Protein abundance in S. pombe and/or (ii) altering chemical states by post-translational modifica- 3 S. pombe S. cerevisiae (×10 copies per cell) tions (PTMs; fine-tuning). Proteins with HRs were generally less 0 10 20 30 abundant, had lower synthesis rates and were turned over more rap- idly than nonHRPs (Fig. 5a–c). Such differences collectively affected pnonHRP cnonHRP n = 943 P = 4.1 × 10–4 the abundance and the amount of time that HRPs spent in the cell. n = 88 Importantly, this stringent control of HRPs was more pronounced pnonHRP cHRP CLES: 66.4% than that of highly disordered nonHRPs and nonHRPs with non- repetitive amino acid bias, as discerned from synthesis and degra- g h dation rates (Supplementary Fig. 6a–c). Such a tight regulation is critical for fitness, because overexpression of HRPs is more often toxic

Ribosomal density in S. pombe Protein half-life in S. pombe Orthologs in S. pombe (Supplementary Fig. 6d). (ribosome/kb of mRNA) (×102 min) 0 2 4 6 8 10 24 6 8 10 12 Transcripts of HRPs were found to be enriched for shorter polyA tails and have more complex 5′-untranslated-region structures n = 1,065 n = 1,000 (Fig. 5d,e), both of which decrease translation-initiation rates26,27. P = 4.8 × 10–10 P = 0.2 They more often have secondary structures in the coding region n = 97 n = 88 (Fig. 5f) and contain suboptimal codons, even after accounting for CLES: 74.2% CLES: NA the codon ramp (Fig. 5g). These features might slow both transla- 28,29 Figure 6 Homorepeats tend to be retained in proteins that are stringently tion elongation and global translation efficiency . At the protein regulated. (a–d) NonHRPs with HRP paralogs and nonHRPs whose level, HRPs tend to be more enriched than nonHRPs in high disorder paralogs never acquired an HR in S. cerevisiae (a) and box plots of content (Fig. 5h), long disordered segments at the N terminus and distributions of protein abundance (b), translation rate (c) and protein internal regions (Fig. 5i,j), and putative PEST, D-box and KEN-box © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 half-life (d). (e) Classification of nonHRPs in S. pombe (pnonHRPs) motifs (Fig. 5k–m), all of which facilitate rapid protein turnover30–34. into those whose one-to-one orthologs in S. cerevisiae were either The presence of more than one molecular feature within HRPs nonHRPs (pnonHRP–cnonHRP) or HRPs (pnonHRP–cHRP). (f–h) Box plots of distributions of protein abundance (f), ribosomal density for (Fig. 5n) suggested that their activity may be regulated through rapid protein synthesis rates (g) and protein half-life (h) for the two classes degradation in diverse conditions by different mechanisms. The ability of pnonHRPs in S. pombe. Box plots are as in Figure 1. Statistical to more stringently regulate the abundance of HRPs can control the significance was assessed with Wilcoxon rank-sum tests with FDR equilibrium between the soluble and phase-separated structures, and correction for multiple testing, and effect sizes are displayed as CLES. ensure that the spatially organized proteome remains dynamic and NA, not applicable. n represents number of proteins in each class. under control32,35. In terms of fine-tuning protein activity, HRPs were found to with diverse functions such as chromatin remodeling, transcription, have more PTMs (Supplementary Fig. 6e) than nonHRPs, thus cell-cycle control and DNA repair (Fig. 4e). Therefore, Snf5 might suggesting that they are regulated by a variety of signaling proteins influence different cellular processes through its polyQ-dependent that may be active in different conditions. The distribution of interactions with the different complexes and proteins in the cell PTM sites indicated that they occur within and immediately (Supplementary Note 2). In other words, the repeat-dependent mem- around homorepeats (Supplementary Fig. 6f). These PTMs bership of Snf5 in different assemblies might provide distinct molecu- might act as interaction switches and determine the member- lar contexts (for example, transcription, DNA repair and cell-cycle ship of HRPs among specific complexes36. Diverse PTMs, such control) for realizing the biochemical function and activity of Snf5. as phosphorylation and acetylation, not only promote protein

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 771 a r t i c l e s

degradation37–39 but also regulate the dynamic and reversible trinucleotide repeats (encoding amino acid HRs) might contribute assembly of phase-separated structures40. However, our obser- to slow translational elongation. Additionally, HRs of certain vations do not rule out the direct involvement of HRs in amino acid types and/or lengths can directly influence protein proteostasis. For instance, mRNA secondary structures formed by solubility41 and stability42.

a b c

Genes (%) Homorepeat 0 5 15 25 35 Seq A –11 Seq A′ Protogenes P = 1.1 × 10 OR: 2.9 Microsporidia Mixiomycetes Muromycotina Leotiomycetes Pezizomycetes Conserved residues Eurotiomycetes Agaricomycetes Chytridiomycota Pucciniomycetes Sordariomycetes

Sequence identity (%) = × 100 Tremellomycetes Dothideomycetes Wallemiomycetes Saccharomycetes Paralogs Aligned residues Ustilaginomycetes

7 Schizosaccharomycetes NonHRP (n = 2,077) Sequence identity (%) = × 100 = 50% 14 HRP (n = 341) Basidiomycota Ascomycota d

10 5

(–log P ) 0 Significance 60 NonHRP HRP

55

50

% sequence 45 identity (median) 40 0 1,000 No. of

orthologs 3,000 PIRID URPU LACBI YARLI PICST THIHE HYPAI PHLGI GIBZA SPIPN NEOFI AGABI DICSQ FOMPI COPCI SCLS1 KLULA RHIOR HYPJE MIXOS LEPMJ MUCCI NEUT9 PHYBL LODEL ASPCL TRAVS FUSO4 BOTFB PASFU BATDE ASPFN ASPTN ASPFU STEHR HETAN PUNST CANTE SPAPN ASPAC GLOTR VERDA CRYPA DEKBR DEBHA NECHA PHACH AURDE USTMA TREME CRYNE COCLU PENCH EURHE HYPVG WALSE ASPOR NEUCR CRYGA CANGA SCHPO PHANO A SCHCO PUCGR EMEND ASHGO TUBMM FOMME MYCGR CANAW WOLCO MAGGR CONPW Fungal species e f ~700 Ma ~300 Ma 39 S. cerevisiae strains ~180 Ma

Sequence identity (%) Upstream Within Downstream 30 40 60 80 100

S. cerevisiae E. gossypii S. pombe

– – – n = 1,219 –100 –1 0 +1 +100 P = 3.2 × 10–2 3 + – – n = 102

CLES: 60.6% 2

S. cerevisiae Y. lipolytica S. pombe © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017

Density of 1 – – – n = 1,090 P = 1.6 × 10–3 amino acid subsitutions + − – n = 78 0 –100 –50 0 50 100 HR not acquired HR acquired CLES: 65.0% Distance to HR (amino acids) Figure 7 HRPs preferentially arise by gene duplication and rapidly diverge across different timescales. (a) Proportion of HRPs and nonHRPs that arise by de novo gene birth (protogenes) or gene duplication (paralogs). Statistical significance was assessed with Fisher’s exact tests with multiple testing correction, and effect size is represented by odds ratio. (b) Estimation of sequence identity among yeast orthologs and paralogs by considering only aligned positions and disregarding gaps that might have arisen because of HRs. (c) Fungal phylogenetic classes studied here. (d) Sequence divergence among yeast HRPs and nonHRPs with their one-to-one orthologs in 73 of 74 fungal species (with at least 100 orthologs for HRPs in each species). The color of the abbreviated species names corresponds to phylogenetic classes in c. Taxonomic and phylogenetic details of the fungal species are provided in Supplementary Note 3. Median divergence of HRPs and nonHRPs with their orthologs in each species is shown (middle). Statistical significance was assessed by comparing the distribution of divergence of yeast HRPs and nonHRPs with their orthologs in each species by using Wilcoxon rank-sum tests (top). The black line shows the P-value cutoff corrected for multiple testing. Bottom, number of orthologs of yeast HRP (red) and nonHRP (gray) in each species. (e) Classification of S. cerevisiae proteins into those that have or have not acquired HRs, on the basis of the HR status of their corresponding orthologs in E. gossypii and Y. lipolytica, by using S. pombe as an outgroup (left). Box plots showing the distributions of sequence identities of yeast proteins that have or have not acquired HRs (right). Box plots are as in Figure 1. Statistical significance was assessed with Wilcoxon rank-sum test, and effect sizes are displayed as CLES. (f) Density of amino acid substitutions within (0 in the x axis) and 100 amino acids on either side of the HR, among yeast strains. n represents number of proteins in each class.

772 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

a c NonHRP HRP the pnonHRP–cnonHRP orthologs in S. pombe, were significantly less abundant and had lower translational rates (Fig. 6f,g). However, * Putative linear motif the half-lives were comparable (Fig. 6h). This result suggested that

80 100 60 during evolution, HRs were probably retained in already stringently

regulated proteins. This possibility is consistent with the idea that –4 –7 60 –42 80 preexisting proteins that are stringently regulated are better poised to 40 60 tolerate the emergence of HRs, because their regulation readily mini- 40

40 mizes their negative effects (for example, uncontrolled aggregation, P = 1.9 × 10 P = 3.7 × 10 P = 2.7 × 10 20 Putative linear motifs

20 in PPI network thus leading to protein sequestration and loss of function). This pos- 20 interactions (PPIs) No. of protein-protein Genes with amino acid

No. of link communities sibility may also explain why the presence of HRs is tolerated in only 0 0 0 certain proteins. substitutions in linear motifs (%) n = 1,971 n = 647 n = 841 n = 473 n = 841 n = 473 OR: 3.7 CLES: 56.6% CLES: 59.0% HR-containing proteins diverge rapidly across different timescales b d HR-containing proteins were more represented among paralogs (gene * duplicates) than among protogenes, which originate de novo (Fig. 7a). PTM This result was in line with the observation that HRs tend to be retained 10 100 70 in preexisting genes that are under stringent proteostasis. To inves-

tigate whether HR-containing proteins evolve differently from those –2 –2 80 –2 50 without an HR, we computed the sequence identity of one-to-one 60 5 orthologs of yeast HRPs and nonHRPs across different evolutionary 30

40 P = 3.2 × 10 P = 3.2 × 10 P = 2.6 × 10 timescales (74 fungal species; ~1 billion years of evolutionary distance;

20 in PPI network Supplementary Note 3). Only the alignable regions were considered, interactions (PPIs) Genes with PTM-site

No. of protein-protein 10 No. of link communities and gaps were not considered (Fig. 7b). S. cerevisiae HRPs, compared amino acid substitution (%) 0 0 0 n = 1,714 n = 520 n = 113 n = 50 n = 113 n = 50 Post-translational modifications (PTMs) with nonHRPs, showed lower sequence identity among their corre- OR: 1.5 CLES: 60.6% CLES: 60.6% sponding orthologs across almost all the fungal species (Fig. 7c,d), thus Figure 8 Amino acid substitutions in HRPs often affecting functionally suggesting that HRPs diverged faster than nonHRPs across species. relevant sites. (a,b) Distribution of HRPs and nonHRPs with substitutions The average median difference between nonHRPs and HRPs was 4.6%, within putative linear motifs, identified with ANCHOR61 (a) and which corresponded to substitutions of ~14 amino acids for a protein experimentally determined PTM sites (b). (c,d) Box plots showing 300 amino acids long. The saturation effect (in which same position can distributions of protein-protein interactions and link communities in the be mutated several times, given sufficient evolutionary time) resulting protein interaction network among HRPs and nonHRPs with amino acid from the several variable sites may be one of the reasons for the lin- substitutions in putative linear motifs (c) and PTM sites (d). Box plots are as in Figure 1. Statistical significance was assessed with Wilcoxon rank- ear shift observed in the rate of change of sequence identity of HRPs sum tests, and effect sizes are displayed as CLES (c and d) or Fisher’s compared with nonHRPs over long evolutionary timescales (Fig. 7d). exact test, with odds ratios indicating the effect size (a and b), with To investigate whether HRs contribute to the rapid divergence of HRPs, multiple testing correction. n represents number of proteins in each class. we first identified S. cerevisiae proteins that acquired an HR at two different time points: from the time of speciation from the common Although the reported trends may be applicable to a large number ancestor of S. cerevisiae and (i) Eremothecium gossypii (~180 Ma) or of HRPs, they are unlikely to apply to every single HRP. Nevertheless, (ii) Yarrowia lipolytica (~300 Ma), using S. pombe as an outgroup. In features such as pleiotropic effects, adaptability, functional versatility both situations, orthologs that acquired HRs in S. cerevisiae diverged and proteostasis can collectively differentiate HRPs and nonHRPs more than those that did not acquire an HR (Fig. 7e). Similarly, paralogs through an approach based on random-forest machine learning that acquired an HR diverged faster than their nonHRP counterparts (Supplementary Fig. 7). (Supplementary Fig. 8a). These results suggested that the presence of HRs is associated with and may drive the rapid divergence of Stringent proteostasis is prerequisite for HR retention HR-containing proteins. © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 during evolution At shorter timescales of divergence (approximately thousands of During evolution, has the emergence of HRs in proteins led to their years), among the 39 different strains of S. cerevisiae the regions stringent proteostasis, or have HRs been retained in proteins that are around HRs tended to accumulate more amino acid substitutions already stringently regulated? By comparing groups of nonHRPs that than did the rest of the protein (Fig. 7f), although the density of do or do not have HRP paralogs (Fig. 6a), we found that nonHRPs with such substitutions between HRPs and nonHRPs over the entire HRP paralogs were low in abundance, had slower translational rates protein was comparable. This observation supported the emerging and displayed shorter half-lives (Fig. 6b–d) than those of the group of view that substitutions are more frequently found in and around nonHRPs that did not have HRP paralogs. This result suggested that repeat-containing regions in the genome43–46 and highlighted the although HRs may emerge in any protein, they might be preferentially potential effects of HRs in proteome evolution. Collectively, these retained in proteins that are already stringently regulated. findings suggested that HR-containing proteins tend to acquire We then investigated the one-to-one orthologs of Saccharomyces more amino acid substitutions and evolve rapidly, irrespective of the cerevisiae in Schizosaccharomyces pombe (estimated divergence timescale considered. time of ~700 million years). We classified the nonHRP orthologs of S. pombe (pnonHRPs) according to those that are still nonHRPs HR-associated variation might rewire interactions and facilitate in S. cerevisiae (pnonHRP–cnonHRP) and those that have HRs in rapid adaptation S. cerevisiae (pnonHRP–cHRP; Fig. 6e) and compared their regu- To assess whether the HR-associated substitutions affect functional latory properties. The pnonHRP–cHRP orthologs, compared with sites in a protein, we integrated the amino acid–substitution data

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 773 a r t i c l e s

for the different the yeast strains with functional site information. and link communities in the protein interaction network, compared Compared with nonHRPs, amino acid substitutions within HRPs with nonHRPs that also have substitutions in such sites (Fig. 8c,d). more often map to putative linear motifs and PTM sites that mediate The differences remained significant even after controlling for the protein-peptide interactions (Fig. 8a,b). HRPs with substitutions in number of protein interactions or the density of linear-motif resi- these functionally relevant sites have higher numbers of interactions dues or PTM sites (Supplementary Fig. 8b–e). Furthermore, the

Constraints

HR-speci c bene ts Stringent proteostasis HR-speci c costs

Abundance *

* * *

* *

Divergence

years years 6

6 Synthesis Degradation 10 10 Time * *

Rapid sequence Random detrimental divergence mutations

Phenotypes

HRP

e e

Molecular interactions Time scal Time Time scal Time Physiological Nonfunctionally HR promiscuous

High pleiotropy Nonfunctional promiscuity Minutes Minutes Emergence of amino acid homorepeats in proteins

Multifunctionality Protein aggregation

Environmental perturbations

Implications !

Normal growth Positive consequences Negative consequences

Positive effects on tness Negative effects on tness © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 Non genetic Genetic

Target-gene-expression Rewiring molecular Macromolecular assemblies Cell memory evolution interactions Daughter cell *

Mother cell - Signaling networks (PTMs) - Stress granules - Heritable proteins - Domain-peptide interactions - P bodies - Mnemons Target of HR-containing TF (linear motifs)

0 6 0 6

HRPs as facilitators of adaptability Life span of organism Across generations 10 –10 years 10 –10 years

Minutes 106 years

Figure 9 Constraints, consequences and implications of the emergence of homorepeats in proteins. Top, stringent proteostasis facilitates retention of HRs in proteins, thus favoring HR-specific benefits on fitness and alleviating negative effects on fitness, at different timescales. Bottom, how HRPs facilitate evolvability and adaptability at different timescales through homorepeat-dependent nongenetic and genetic effects.

774 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

conditional probability of finding (i) amino acid substitutions and individual, it does not affect the entire population. However, if one (ii) those that affect functionally relevant sites was higher in polypep- of the individuals carrying a specific HR-associated mutation has tide segments containing HRs than that of finding an HR in polypep- better fitness when environments change, that specific mutation tide segments with amino acid substitutions (Online Methods and may be positively selected and benefit the species. Thus, HRs can Supplementary Fig. 8f,g). These observations collectively suggested be considered as evolutionary modules that confer evolvability and that HR-associated variation has the potential to alter protein inter- adaptability to organisms in a population. action networks and rewire the targets of signaling pathways. Thus, The beneficial effects of HRPs raise the evolutionary conundrum HRs act as distinct modules that increase the genetic diversity of a of why their presence is restricted to certain proteins. Our results population (i.e., standing genetic variation) by affecting functionally suggested that HRs tend to be retained in proteins whose abundance relevant sites around an HR in key proteins. and activity are under stringent control, possibly to minimize HR- A more detailed investigation of essential and highly pleiotropic associated deleterious effects such as the sequestration of proteins HRPs revealed that they might form a rapidly evolvable part of the through nonfunctional promiscuous interactions and protein aggre- proteome (Supplementary Figs. 9 and 10). Thus, HR-associated gation. Hence, stringent proteostasis might be a requirement for the variation of pleiotropic and critical hub proteins among individuals in tolerance of HRs in a protein. a population may facilitate the adaptation of an organism to diverse HRs in a protein might facilitate adaptability at different times- environments by permitting the rapid exploration of the genotype– cales (Fig. 9). At shorter timescales, HRPs facilitate adaptability phenotype landscape. through nongenetic mechanisms. For instance, they facilitate formation of assemblies, such as stress granules, that permit DISCUSSION adaptation under diverse conditions54. They also act as heritable From a biochemical and cell biological perspective, our observations proteins55 and contribute to transgenerational adaptability by revealed that the presence of HRs has the potential to increase the forming physical entities that encode molecular memory, as in functional versatility of proteins by facilitating repeat-dependent the case of polyQ/polyasparagine-containing mnemons of yeast interactions and their spatial organization in cells by forming Whi3p56. At longer timescales, they might facilitate adaptability assemblies. From a genetics and evolutionary perspective, HRs through genetic mechanisms. Recently, we have shown that HR facilitate rapid protein divergence, might rewire interactions and length variation in transcription factors facilitates adaptation favor the emergence of standing genetic variation upon which through rapid expression divergence of their targets41. Here, we selection could operate. The emergence of homorepeats in a found that HR-associated amino acid substitutions more often protein can have both positive and negative consequences at affect linear motifs and PTMs. Such variation might fuel the different timescales (Fig. 9). evolution of interaction sites and permit rapid evolution of new Our findings showed that HRs are often present in proteins that substrates for kinases or modifying enzymes of signaling pathways. influence diverse phenotypes and buffer environmental insults. The Thus, the presence of an HR might lead to the rewiring of molec- existence of repeat-dependent interactions highlights an underappre- ular networks such as signaling (through PTM-site variation) ciated component of protein interaction networks and opens up new and domain-peptide interaction networks (through linear-motif directions for future research in the area of repeat edgetics, which is variation)57,58. In this manner, HRPs can generate genotypic and similar to the concept of single–amino acid edgetics47. HR-dependent phenotypic diversity and facilitate organismal adaptation to new interactions can provide new molecular contexts for a protein at environmental niches. different times and in different conditions in which the biochemi- Although we reported general trends for proteins with HRs, and cal function of the HRP can be realized in a spatially localized and provided experimental validation for specific cases, every trend regulated manner in a cell. However, in certain situations such as observed here might not be applicable to all proteins with homore- altered protein abundance or abnormal repeat-length variation, the peats. Furthermore, in addition to the effects of repeats at the protein high interaction potential of HRs can lead to negative consequences level (i.e., the proteome), HRs may also have various consequences at such as sequestration of other proteins via nonfunctional promiscu- the nucleic acid level (i.e., the genome and transcriptome). Although ous interactions and protein aggregation48. the distribution of the type and length of HRs varies across eukaryo- At longer timescales, the presence of HRs leads to rapid diver- tes, our findings in yeast provide important insights for understand- © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 gence of proteins due to variations in repeat length8 and the accu- ing their relevance in other species including humans. mulation of more amino acid substitutions. Diverse mechanisms The current understanding of the role of proteostasis in confer- such as recurrent repair after replication stalling at nucleotide ring robustness and evolvability primarily focuses on molecular repeats, unequal crossing-over after recombination, transcription- chaperones, such as Hsp90, involved in protein folding and coupled repair and the higher error rates of nonreplicative DNA enabling accumulation of cryptic variation in a population59. polymerases that are recruited at such sites during repair might all The findings presented here suggest the existence of another collectively contribute to nucleotide-repeat-induced mutagenesis strategy wherein stringent proteostasis facilitates the retention and lead to the rapid sequence divergence of HRPs43,46,49–52. of homorepeats, thus leading to rapid protein divergence that In addition, the lower expression levels of HRPs might also influ- may drive the rewiring of interaction networks and signaling ence the rapid evolutionary rate of these proteins53. The association pathways. Given that HRPs are more prevalent among paralogs of homorepeats with disordered regions might also influence the (i.e., gene duplicates), the genetic redundancy coupled with rate of divergence. Detailed analysis of the nature of amino acid the rapid mutation accumulation of HRPs may provide a ‘low substitutions may help to uncouple the influence of disordered risk–high gain’ scenario for accelerating evolution and adap- regions and homorepeats on the rapid divergence of proteins with tation to diverse environments. The synergy among stringent HRs. Thus, the presence of an HR results in standing genetic vari- proteostasis, increased evolvability and functional versatility ation of a population. Although a deleterious consequence of the make HRPs important for fitness and crucial contributors to HR-associated amino acid substitution might be detrimental for an the functioning of cells.

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 775 a r t i c l e s

Methods 18. Karlin, S., Brocchieri, L., Bergman, A., Mrazek, J. & Gentles, A.J. Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl. Acad. Sci. USA 99, Methods, including statements of data availability and any associated 333–338 (2002). accession codes and references, are available in the online version of 19. Albà, M.M. & Guigó, R. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 14, 549–554 (2004). the paper. 20. Faux, N.G. et al. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res. 15, 537–551 (2005). Note: Any Supplementary Information and Source Data files are available in the online 21. Faux, N.G. et al. RCPdb: an evolutionary classification and codon usage database version of the paper. for repeat-containing proteins. Genome Res. 17, 1118–1127 (2007). 22. Délot, E., King, L.M., Briggs, M.D., Wilcox, W.R. & Cohn, D.H. Trinucleotide Acknowledgments expansion mutations in the cartilage oligomeric matrix protein (COMP) gene. Hum. We thank B. Lenhard, C. Semple, M. Torrent, A.J. Venkatakrishnan, B. Lang and Mol. Genet. 8, 123–128 (1999). members of our group for stimulating discussions and comments on this study. 23. Ahn, Y.Y., Bagrow, J.P. & Lehmann, S. Link communities reveal multiscale complexity in networks. , 761–764 (2010). This work was supported by the Medical Research Council (MC_U105185859; Nature 466 24. Koch, E.N. et al. Conserved rules govern genetic interaction degree across species. S.C., N.S.L., S.B. and M.M.B.), the Human Frontier Science Program Genome Biol. 13, R57 (2012). (RGY0073/2010; M.M.B.), a European Molecular Biology Organization Long 25. Kemmeren, P. et al. Large-scale genetic perturbations reveal regulatory networks term fellowship (S.C.), the Young Investigator Program (M.M.B. and K.J.V.), and an abundance of gene-specific repressors. Cell 157, 740–752 (2014). ERASysBio+ (GRAPPLE; S.C., S.B. and M.M.B.), Marie Curie actions (FP7- 26. Munroe, D. & Jacobson, A. mRNA poly(A) tail, a 3′ enhancer of translational PEOPLE-2011-IEF-299105; N.S.d.G.) and Cancer Research UK (P.L.C.). M.M.B. initiation. Mol. Cell. Biol. 10, 3441–3455 (1990). is supported as a Lister Institute Prize Fellow. We thank C. Taylor and C. d’Santos 27. Jackson, R.J., Hellen, C.U. & Pestova, T.V. The mechanism of eukaryotic translation from the CRUK-Cambridge Institute proteomics core facility. initiation and principles of its regulation. Nat. Rev. Mol. Cell Biol. 11, 113–127 (2010). 28. Wen, J.D. et al. Following translation by single ribosomes one codon at a time. AUTHOR CONTRIBUTIONS Nature 452, 598–603 (2008). S.C. and M.M.B. conceived the project and wrote the manuscript, incorporating 29. Gingold, H. & Pilpel, Y. Determinants of translation efficiency and accuracy. Mol. input from all authors. S.C., K.J.V., S.B. and M.M.B. designed the study; S.C., Syst. Biol. 7, 481 (2011). G.C., N.S.L., E.I.-S. and S.B. collected the data sets and performed computational 30. van der Lee, R. et al. Intrinsically disordered segments affect protein half-life in investigations; S.C., P.L.C., N.S.d.G. and R.G. designed and performed the the cell and during evolution. Cell Rep. 8, 1832–1844 (2014). experiments. All authors participated in interpreting the results. S.C. led the study, 31. Glotzer, M., Murray, A.W. & Kirschner, M.W. Cyclin is degraded by the ubiquitin and M.M.B. supervised the project. pathway. Nature 349, 132–138 (1991). 32. Gsponer, J., Futschik, M.E., Teichmann, S.A. & Babu, M.M. Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. COMPETING FINANCIAL INTERESTS Science 322, 1365–1368 (2008). The authors declare no competing financial interests. 33. Pfleger, C.M. & Kirschner, M.W. The KEN box: an APC recognition signal distinct from the D box targeted by Cdh1. Genes Dev. 14, 655–665 (2000). Reprints and permissions information is available online at: http://www.nature.com/ 34. Rogers, S., Wells, R. & Rechsteiner, M. Amino acid sequences common to rapidly reprints/index.html. Publisher’s note: Springer Nature remains neutral with regard to degraded proteins: the PEST hypothesis. Science 234, 364–368 (1986). jurisdictional claims in published maps and institutional affiliations. 35. Gsponer, J. & Babu, M.M. Cellular strategies for regulating functional and nonfunctional protein aggregation. Cell Rep. 2, 1425–1437 (2012). 1. La Spada, A.R. & Taylor, J.P. Repeat expansion disease: progress and puzzles in 36. Woodsmith, J., Kamburov, A. & Stelzl, U. Dual coordination of post translational disease pathogenesis. Nat. Rev. Genet. 11, 247–258 (2010). modifications in human protein networks. PLoS Comput. Biol. 9, e1002933 2. Moumné, L. et al. Differential aggregation and functional impairment induced by (2013). polyalanine expansions in FOXL2, a transcription factor involved in cranio-facial 37. Mateo, F. et al. Degradation of cyclin A is regulated by acetylation. Oncogene 28, and ovarian development. Hum. Mol. Genet. 17, 1010–1019 (2008). 2654–2666 (2009). 3. Gatchel, J.R. & Zoghbi, H.Y. Diseases of unstable repeat expansion: mechanisms 38. Qian, M.X. et al. Acetylation-mediated proteasomal degradation of core histones and common principles. Nat. Rev. Genet. 6, 743–755 (2005). during DNA repair and spermatogenesis. Cell 153, 1012–1024 (2013). 4. Tsuda, H. et al. The AXH domain of Ataxin-1 mediates neurodegeneration through 39. Tyers, M., Tokiwa, G., Nash, R. & Futcher, B. The Cln3-Cdc28 kinase complex of its interaction with Gfi-1/Senseless proteins. Cell 122, 633–644 (2005). S. cerevisiae is regulated by proteolysis and phosphorylation. EMBO J. 11, 5. Cortes, C.J. et al. Polyglutamine-expanded androgen receptor interferes with TFEB 1773–1784 (1992). to elicit autophagy defects in SBMA. Nat. Neurosci. 17, 1180–1189 (2014). 40. Bergeron-Sandoval, L.P., Safaee, N. & Michnick, S.W. Mechanisms and consequences 6. Monks, D.A. et al. Overexpression of wild-type androgen receptor in muscle of macromolecular phase separation. Cell 165, 1067–1079 (2016). recapitulates polyglutamine disease. Proc. Natl. Acad. Sci. USA 104, 41. Gemayel, R. et al. Variable glutamine-rich repeats modulate transcription factor 18259–18264 (2007). activity. Mol. Cell 59, 615–627 (2015). 7. Nasrallah, I.M., Minarcik, J.C. & Golden, J.A. A polyalanine tract expansion in Arx 42. Fishbain, S. et al. Sequence composition of disordered regions fine-tunes protein forms intranuclear inclusions and results in increased cell death. J. Cell Biol. 167, half-life. Nat. Struct. Mol. Biol. 22, 214–221 (2015). 411–416 (2004). 43. McDonald, M.J., Wang, W.C., Huang, H.D. & Leu, J.Y. Clusters of nucleotide 8. Gemayel, R., Vinces, M.D., Legendre, M. & Verstrepen, K.J. Variable tandem repeats substitutions and insertion/deletion mutations are associated with repeat sequences. accelerate evolution of coding and regulatory sequences. Annu. Rev. Genet. 44, PLoS Biol. 9, e1000622 (2011). 445–477 (2010). 44. Lenz, C., Haerty, W. & Golding, G.B. Increased substitution rates surrounding low- 9. Stevens, K.E. & Mann, R.S. A balance between two nuclear localization sequences complexity regions within primate proteins. Genome Biol. Evol. 6, 655–665 © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 and a nuclear export sequence governs extradenticle subcellular localization. (2014). Genetics 175, 1625–1636 (2007). 45. Huntley, M.A. & Clark, A.G. Evolutionary analysis of amino acid repeats across 10. Wolf, A. et al. The polyserine domain of the lysyl-5 hydroxylase Jmjd6 mediates the genomes of 12 Drosophila species. Mol. Biol. Evol. 24, 2598–2609 subnuclear localization. Biochem. J. 453, 357–370 (2013). (2007). 11. Salichs, E., Ledda, A., Mularoni, L., Albà, M.M. & de la Luna, S. Genome-wide 46. McDonald, M.J. et al. Mutation at a distance caused by homopolymeric guanine analysis of histidine repeats reveals their role in the localization of human proteins repeats in Saccharomyces cerevisiae. Sci. Adv. 2, e1501033 (2016). to the nuclear speckles compartment. PLoS Genet. 5, e1000397 (2009). 47. Dreze, M. et al. ‘Edgetic’ perturbation of a C. elegans BCL2 ortholog. Nat. Methods 12. Lee, C. et al. Protein aggregation behavior regulates cyclin transcript localization 6, 843–849 (2009). and cell-cycle control. Dev. Cell 25, 572–584 (2013). 48. Woerner, A.C. et al. Cytoplasmic protein aggregates interfere with nucleocytoplasmic 13. Galant, R. & Carroll, S.B. Evolution of a transcriptional repression domain in an transport of protein and RNA. Science 351, 173–176 (2016). insect Hox protein. Nature 415, 910–913 (2002). 49. Panigrahi, G.B., Lau, R., Montgomery, S.E., Leonard, M.R. & Pearson, C.E. Slipped 14. Gerber, H.P. et al. Transcriptional activation modulated by homopolymeric glutamine (CTG)*(CAG) repeats can be correctly repaired, escape repair or undergo error-prone and proline stretches. Science 263, 808–811 (1994). repair. Nat. Struct. Mol. Biol. 12, 654–662 (2005). 15. Michael, T.P. et al. Simple sequence repeats provide a substrate for phenotypic 50. Mar Albà, M., Santibáñez-Koref, M.F. & Hancock, J.M. Amino acid reiterations in variation in the Neurospora crassa circadian clock. PLoS One 2, e795 yeast are overrepresented in particular classes of proteins and show evidence of a (2007). slippage-like mutational process. J. Mol. Evol. 49, 789–797 (1999). 16. Fondon, J.W. III & Garner, H.R. Molecular origins of rapid and continuous 51. Shah, K.A. & Mirkin, S.M. The hidden side of unstable DNA repeats: mutagenesis morphological evolution. Proc. Natl. Acad. Sci. USA 101, 18058–18063 at a distance. DNA Repair (Amst.) 32, 106–112 (2015). (2004). 52. Shah, K.A. et al. Role of DNA polymerases in repeat-mediated genome instability. 17. Gidalevitz, T., Ben-Zvi, A., Ho, K.H., Brignull, H.R. & Morimoto, R.I. Progressive Cell Rep. 2, 1088–1095 (2012). disruption of cellular protein folding in models of polyglutamine diseases. Science 53. Zhang, J. & Yang, J.R. Determinants of the rate of protein sequence evolution. 311, 1471–1474 (2006). Nat. Rev. Genet. 16, 409–420 (2015).

776 VOLUME 24 NUMBER 9 SEPTEMBER 2017 nature structural & molecular biology a r t i c l e s

54. Narayanaswamy, R. et al. Widespread reorganization of metabolic enzymes into 58. Hancock, J.M. & Simon, M. Simple sequence repeats in proteins and their reversible assemblies upon nutrient starvation. Proc. Natl. Acad. Sci. USA 106, significance for network evolution. Gene 345, 113–118 (2005). 10147–10152 (2009). 59. Jarosz, D.F., Taipale, M. & Lindquist, S. Protein homeostasis and the phenotypic 55. Chakrabortee, S. et al. Intrinsically disordered proteins drive emergence and manifestation of genetic diversity: principles and mechanisms. Annu. Rev. Genet. inheritance of biological traits. Cell 167, 369–381.e12 (2016). 44, 189–216 (2010). 56. Caudron, F. & Barral, Y. A super-assembly of Whi3 encodes memory of deceptive 60. Ekman, D., Light, S., Björklund, A.K. & Elofsson, A. What properties characterize encounters by single cells during yeast courtship. Cell 155, 1244–1257 the hub proteins of the protein-protein interaction network of Saccharomyces (2013). cerevisiae? Genome Biol. 7, R45 (2006). 57. Levy, E.D., Landry, C.R. & Michnick, S.W. How perfect can protein interactomes 61. Dosztányi, Z., Mészáros, B. & Simon, I. ANCHOR: web server for predicting protein be? Sci. Signal. 2, pe11 (2009). binding regions in disordered proteins. Bioinformatics 25, 2745–2746 (2009). © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017

nature structural & molecular biology VOLUME 24 NUMBER 9 SEPTEMBER 2017 777 ONLINE METHODS belonging to divergent pairs (Fig. 6a–d). To test whether HRs facilitated rapid Classification of the yeast proteome. This analysis was based on the following evolvability, we tested for differences in the distribution of sequence identities seven parameters. among similar (nonHRP–nonHRP) and divergent (HRP–nonHRP) paralog pairs The presence of homorepeats. Yeast proteins were classified as HRPs if they (Supplementary Fig. 8a). contained at least one continuous stretch of five or more identical amino acids. NonHRPs were used as the control group for all genome-scale comparisons. To Classification of yeast proteins and their orthologs. One-to-one orthologs ensure that we used the most appropriate control set, we disregarded all non- of yeast proteins across different fungal proteomes were obtained from OMA HRPs that contained imperfect repeats (n = 650). Imperfect repeats included browser68 (Supplementary Table 1). To test whether, during evolution, HRs all repeats of unit sizes two and longer, with similarity scores of 70% or higher have tended to emerge and be retained in already stringently regulated between repeating units. These were identified with TREKS62. A compen- proteins across species, we classified S. pombe nonHRP (pnonHRP) orthologs of dium of different genome-scale data sets analyzed in this study is provided in S. cerevisiae proteins into those whose S. cerevisiae orthologs have not acquired Supplementary Table 1. HRs (pnonHRP–cnonHRP) and those that have acquired HRs (pnonHRP– Disorder fraction. The disorder status of every residue in the yeast proteome cHRP). Protein-regulatory features measured in S. pombe were compared for was inferred with DISOPRED2 (ref. 63), and the disorder content was averaged these two classes of pnonHRPs (Fig. 6e–h). over the entire protein length and assigned as the disorder fraction of a protein. To test whether HRs facilitate rapid divergence of HRPs across evolutionary Repeats are known to overlap with disordered regions64. To examine whether timescales, we identified S. cerevisiae proteins that (i) acquired an HR from the protein disorder confounded our observations on the functional attributes and time of the common ancestor of S. cerevisiae and E. gossypii (~180 Ma), and that of proteostasis of HRPs, we compared distributions of diverse features of HRPs S. cerevisiae and Y. lipolytica (~300 Ma), by using S. pombe as the outgroup and with low (≤30%) and high intrinsic disorder (>30%) with nonHRPs with similar (ii) did not acquire an HR. We then compared the distribution of protein sequence intrinsic disorder content (Fig. 2e,f and Supplementary Figs. 6 and 7). identity between these two groups (Fig. 7e). Protein length. To examine whether the functional attributes of HRPs were independent of protein length, we selected HRPs and nonHRPs with similar Gene Ontology enrichment analysis. GO biological-process enrichment among lengths and compared distributions of connectivity in the protein interaction HRPs, essential HRPs and nonHRPs, highly pleiotropic HRPs and nonHRPs was network and the number of link communities in which each protein participated obtained with the DAVID server69. FDR values, corrected for multiple testing, (Supplementary Fig. 2b). highlighted the significance of the enrichment of the GO terms. Nonrepetitive amino acid bias. We compared the distributions of different features among HRPs and nonHRPs containing nonrepetitive amino acid bias Calculation of sequence identity. We obtained one-to-one orthologs of yeast (Supplementary Figs. 2d and 6) to investigate whether the observed trends for proteins in 74 fungal species from OMA browser68. For each pairwise align- physiological importance, functional versatility and stringent proteostasis were ment, we computed the percentage sequence identity with the Biostrings pack- similar between these two groups. Sequences with compositional amino acid age in R by considering only the aligned regions, ignoring gaps, to ensure that bias in nonHRPs were identified in LPS-annotate with default parameters65,66 gaps arising from the emergence of repeats did not confound the calculation of but with a stringent detection P-value cutoff of <1 × 10−7 (approximately twice sequence identity. the default detection threshold). Functional similarity. To rule out the confounding effects of differences in the Statistical analysis for genome-level computational investigations. Differences functional versatility of HRPs resulting from the differences in biological func- in the distribution of continuous variables among different classes of proteins tions between HRPs and nonHRPs, we randomly selected nonHRPs with biologi- were assessed with the nonparametric Wilcoxon rank-sum test or Wilcoxon cal functions similar to those of HRPs and drew comparisons. We first assigned matched-pairs test, as appropriate. Medians and confidence intervals (CI = GO biological processes to HRPs through GOSlim annotation retrieved from 1.58(IQR/√n), where IQR is the interquartile range, and n is the sample size of The Saccharomyces Genome Database (SGD)67. Approximately 97 GO terms each group) were computed. Median differences between the compared groups were assigned to HRPs. For each GO term, we counted the number of HRPs. We and CLES70 were calculated as estimates of the effect size. CLES describes the then obtained, at random, equivalent numbers of nonHRPs for each GO term and probability that a randomly sampled data point from a distribution A, will be compiled them into a single list of functionally similar nonHRPs. We compared greater than a data point sampled from distribution B and was computed as the differences in distributions for the number of protein-protein interactions, previously described71. Briefly, on the basis of the recommendation from ref. 72, genetic interactions, link communities in protein-protein-interaction networks we first computed the U statistic: and genetic networks between the two sets of functionally similar groups of HRPs n( n + )1 and nonHRPs, for each of the 100 randomizations (Supplementary Fig. 2c). UW= − s s Essentiality for growth. To examine whether the physiological and functional 2 effects, regulation and evolution of HRPs were influenced by HRP enrichment where W is the Wilcoxon test statistic, and ns is the smaller of na and nb (the

© 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 in essential genes, we compared the distributions of different features between sample size of each data set). CLES is then given by: essential HRPs and essential nonHRPs (Supplementary Fig. 9).  U  High pleiotropy. We classified yeast proteins into those with low, medium and CLES (%)=1 − ×100  ×  high pleiotropic effects, on the basis of the number of phenotypes to which they  ()na n b  were linked in the yeast G2P network, by using tertile cutoffs. To examine whether For comparisons done with Wilcoxon matched-pairs tests (Fig. 3f), Z scores high pleiotropic effects of HRPs influenced the physiological and functional were estimated with the R coin package. Differences in distributions of categorical effects, regulation and evolution of HRPs, we compared the distributions of dif- variables were tested with Fisher’s exact test or chi-squared test. Effect sizes were ferent features between highly pleiotropic HRPs and highly pleiotropic nonHRPs estimated with odds ratios. Enrichment of HRPs in different classes of proteins (Supplementary Fig. 10). (for example, stress-granule proteins) was assessed with permutation tests. In each permutation, every HRP was replaced with a random gene. The number of Classification of yeast paralogs. Paralog protein pairs were identified as such randomly obtained genes that overlapped with a specific class of proteins described in Supplementary Table 1. On the basis of the presence or absence of was noted for 10,000 randomizations. From this result, we estimated the Z score, homorepeats, paralog pairs were classified as similar pairs, in which both the par- which indicates the distance of the actual observation to the mean of random alogs lacked HRs (nonHRP–nonHRP) or as divergent pairs, in which one paralog expectations in terms of the number of standard deviations. P values were esti- contained the HR and the other did not (HRP–nonHRP). To examine whether the mated as the ratios of the randomly observed proteins greater than or equal to emergence of HRs within a protein facilitated the acquisition of physiological and the number of actually observed HRPs to the total number of random samples functional attributes of HRPs, we compared members of divergent pairs (Fig. 3f). (10,000). The null expectation of the homorepeats of five or more amino acids in To test whether HRs emerged in proteins that were already stringently regu- the yeast proteome, Z score and P value was obtained by permutation tests with lated during evolution, we compared nonHRPs of similar pairs with nonHRPs 1,000 shuffled proteomes by shuffling each protein for sequence order, keeping

nature structural & molecular biology doi:10.1038/nsmb.3441 the length and amino acid composition constant. Tests for association between Cell growth rate measurement. Using an overnight culture, we achieved an the length of the homorepeats and different features were done with Pearson’s initial OD600 of 0.001 in YPD or YPG. The growth of the cells was monitored in correlation. The extent and the type (positive/negative) of correlation were esti- a Tecan 2000 pro plate reader measuring the absorbance at 600 nm every 10 min mated with Pearson’s correlation coefficients. Correction for multiple testing was for 3 d. For budding-index measurements, 0.5 × 106 cells in stationary phase were done with the Benjamini–Hochberg method (FDR) to control for the FDR for re-inoculated into 5 ml of YPD or YPG, and samples were collected after the comparisons between similar classes (such as HRPs versus nonHRPs), by using indicated time points. Samples were then analyzed with a Vi-cell analyzer, images the same test (such as Wilcoxon rank-sum test or Fisher’s exact test) but differ- were acquired, and mitotic cells were quantified with Fiji (ImageJ) software. ent data sets pertaining to similar biological attributes such as physiological and Results depicted are means of 3 independent experiments for 50 fields each. functional relevance and proteostasis (Supplementary Table 1). For Bayesian inferences, features were classified as either categorical Immunofluorescence. Cells in exponential phase were fixed in 4% formaldehyde (for example, essentiality) or quantitative (for example, number of pheno- for 1 h. Spheroplasts of fixed cells (prepared in 3.2 µl of β-mercaptoethanol and types per gene) depending on the data type (Supplementary Fig. 3). On the 5 µl of 5 mg/ml zymolase) were permeabilized in 100 µl PBS with 0.05% Tween-20. basis of tertile cutoffs, quantitative features were classified into three bins as Approximately 25 µl of this suspension was then loaded into 96-well plates and low, medium or high. For features for which HRPs had significantly higher allowed to settle for 5 min. Cells were blocked in 1 mg/ml BSA in PBS for 30 min values than nonHRPs (for example, number of phenotypes per gene) we in a humid chamber. After cells were washed three times with PBS, primary- selected the ‘high’ bin, whereas for features in which HRPs had significantly antibody cocktail (1:500 of anti-HA (05-902R, Merck Millipore) and anti-tubulin lower values than nonHRPs (for example, protein abundance) we selected the (T9026, Sigma-Aldrich)) was added and incubated for 1 h at room tempera- ‘low’ bin. Conditional probabilities for finding a feature given that the pro- ture. Subsequently, cells were washed, and secondary-antibody mix containing tein contained an HR, and for finding a protein with an HR given a specific rabbit Alexa Fluor 555 (ab150074, Abcam) and mouse Alexa 488 (ab150113, feature, were estimated. On the basis of the conditional probabilities, we Abcam) was added (1:1,000 in blocking solution) and incubated in the dark for obtained likelihood ratio (LR) estimates as the ratio of the probability of 1 h. Samples were finally washed in PBS and mounted in propyl gallate solution finding a feature given that the protein contained an HR divided by the prob- with DAPI. Samples were stored in 4 °C until visualization. Images were acquired ability of finding a feature given that the protein did not contain an HR. All with an LSM710 microscope (Zeiss), and images were rendered and quantified statistical analyses were performed in R. with Imaris software (Bitplane). The ability of the features associated with HRPs, pertaining to (i) pleiotropy (number of phenotypes per gene and essentiality), (ii) adaptability (resistance to Immunoblotting and immunoprecipitation analyses. Yeast cells expressing small molecules), (iii) functional versatility (disorder fraction, number of protein- HA-tagged WT or ∆HR SNF5 grown in YPD or YPG were lysed with a glass protein interactions, number of link communities in the protein-protein interaction bead beater in HEPES buffer containing 100 mM KCl, 150 mM KOAc, protease- network and number of GO functions) and (iv) proteostasis (protein abun- inhibitor cocktail and 0.1% Triton X-100. Samples were then centrifuged at dance, relative translational rate and protein half-life) to distinguish between 4 °C, at 10,000 r.p.m. for 10 min, and the lysate was collected. For immunoblot HRPs and non-HRPs was assessed with a random forest (RF) model (details in analysis, 30 µg of total protein from each condition was loaded on a 4–12% legend of Supplementary Fig. 7). NuPAGE gel, and transfer was performed with an iblot 1.0 system, according Trained models were evaluated on the basis of the test set, and performance to the manufacturer’s instructions (Invitrogen). Primary antibodies used were summary metrics (precision, recall and F1 scores) were calculated. Precision anti-HA (Millipore, 1:1,000 in milk) and anti-tubulin (Sigma, 1:2,000 in milk). reflects the ability of the classifier not to label a nonHRP as an HRP or vice versa For immunoprecipitations, 10 mg of protein from each condition was precleared and is computed as follows: with 20 µl of agarose beads for 30 min. The precleared lysate was incubated with truepositives 50 µl of HA-tagged Dynabeads and incubated at 4 °C for 4 h, then washed four Precision = times with lysis buffer and three times with 25 mM ammonium bicarbonate. (true positives+false positives) Yeast histones were extracted with the established protocol75 and immunoblotted Recall reflects the ability of the classifier to find all the positive samples and in a 4–20% gel. is computed as follows: truepositives NanoLC–MS/MS analysis. Bead-bound proteins were digested twice by the addi- Recall = (truepositives+falsenegatives) tion of trypsin. Peptides were acidified with formic acid to a final concentration of 5% and desalted before injection. Separation was achieved with an Acclaim F1 is the weighted harmonic mean of the precision and recall. Features PepMap 100 column (C18, 3 µl, 100 Å, Thermo Scientific) with an internal were ranked by relative importance according to their mean decrease in diameter of 75 µm and a capillary length of 25 cm. A flow rate of 300 nl/min impurity, weighted by the probability of reaching the associated node. Feature was used with a solvent gradient of 5% B to 45% B in 50 min followed by an importance values were normalized to the maximum value. Modeling was increase to 95% B in 25 min. Solvent A was 0.1% formic acid, 2% MeCN and 5% 73 © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017 performed with the scikit-learn Python library . Because the disorder DMSO (v/v) in water, and solvent B had a similar composition but contained fraction was one of the important features that distinguished HRPs and 80% MeCN. nonHRPs, an additional RF model was trained to distinguish between HRPs The mass spectrometer was operated in positive-ion mode with an nth-order (n = 611) and highly disordered proteins (HDPs; nonHRPs with disorder fraction double-play method to automatically switch between Orbitrap-MS and LTQ >30%; n = 770). The data set was first balanced by subsampling HDPs to reach Velos-MS/MS acquisition. Survey full-scan MS spectra (from 400 to 1,600 m/z) n = 650. RF hyperparameters were tuned by using ten-fold cross-validation were acquired in the Orbitrap with resolution (R) 60,000 at 400 m/z (after accu- and grid search (yielding optima: criterion = ‘entropy’, max_features = ‘log2’, mulation to a target of 1,000,000 charges in the LTQ). The method allowed for max_depth = 50). sequential isolation of the 20 most intense ions for fragmentation in the linear ion trap. Charge-state screening was enabled, and precursors with an unknown Strains and culture conditions. Yeast genetic manipulations were performed charge state or a charge state of 1 were excluded. Acquired RAW files were ana- according to standard methods74 in the endogenous Snf5 locus in the S288c lyzed with the Sequest76 and Mascot77 search engines running under Proteome strain. Primers used for amplifications and generating strains are listed in Discoverer version 1.4. The S. cerevisiae protein database used for searches Supplementary Table 2. Deletion of the repeat region in the mutant (∆HR) was was retrieved in canonical form from the UniProt Knowledge Base (as of 27 verified by sequencing with primers flanking the repeat region. Additionally, we July 2014). Variable modifications included oxidized methionine residues and created HA6-tagged versions of the SNF5 WT and ∆HR, from plasmid pYM14 deamidated glutamine and asparagine. Search conditions allowed for a single (Euroscarf). Fresh colonies, grown at 30 °C for 3 d in solid YPD, were picked and missed cleavage with an MS-mode fragment tolerance of 40 p.p.m. coupled with incubated overnight in liquid YPD at 30 °C. These cells were used to start cultures 0.5 Da for MS/MS ions. An FDR of 1% was applied in all cases. IgG was used that were grown overnight in YPD (YP with 2% glucose) or YPG (YP with 4% as a negative control. Proteins identified in IgG even with a single peptide were glycerol) and 30 °C, for different experiments. not considered. A minimum of three unique peptide counts in two independent

doi:10.1038/nsmb.3441 nature structural & molecular biology experiments was required to be considered a hit (Supplementary Data Set 1). 62. Jorda, J. & Kajava, A.V. T-REKS: identification of Tandem REpeats in sequences Interactions that were present in WT but absent in ∆HR were classified as ‘polyQ- with a K-meanS based algorithm. Bioinformatics 25, 2632–2638 (2009). 63. Ward, J.J., McGuffin, L.J., Bryson, K., Buxton, B.F. & Jones, D.T. The DISOPRED dependent interactions’. server for the prediction of protein disorder. Bioinformatics 20, 2138–2139 (2004). Target-gene expression analysis. To examine the effect of Snf5 polyQ dele- 64. Simon, M. & Hancock, J.M. Tandem and cryptic amino acid repeats accumulate in tion on target-gene expression, we selected 18 functionally diverse nonessential disordered regions of proteins. Genome Biol. 10, R59 (2009). 65. Harrison, P.M. & Gerstein, M. A method to assess compositional bias in biological targets with high fold changes in expression after SNF5 gene deletion, as com- sequences and its application to prion-like glutamine/asparagine-rich domains in 25,78 pared with WT, in one or both of the previous studies Yeast cells were lysed eukaryotic proteomes. Genome Biol. 4, R40 (2003). in lithium acetate–SDS solution, and RNA was extracted with TRIzol reagent, 66. Harbi, D., Kumar, M. & Harrison, P.M. LPS-annotate: complete annotation of according to the manufacturer’s instructions (Life Technologies). Reverse compositionally biased regions in the protein knowledgebase. Database (Oxford) 2011, baq031 (2011). transcription was performed with a RevertAid H Minus First Strand cDNA 67. Cherry, J.M. et al. Saccharomyces Genome Database: the genomics resource of Synthesis Kit (Thermo Scientific). The concentration of the cDNA generated budding yeast. Nucleic Acids Res. 40, D700–D705 (2012). was adjusted, and qPCR was performed with SYBR Green PCR Master Mix 68. Altenhoff, A.M., Schneider, A., Gonnet, G.H. & Dessimoz, C. OMA 2011: orthology (Life Technologies). The reactions (five replicates per gene) were performed inference among 1000 complete genomes. Nucleic Acids Res. 39, D289–D294 (2011). in an Eco-Illumina qPCR thermocycler (Illumina). The measured geometric 69. Huang, W., Sherman, B.T. & Lempicki, R.A. Systematic and integrative analysis of mean of three different reference genes (ALG9, TAF10 and TFC1 (ref. 79)) were large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 used to normalize the data. The mRNA quantification was performed with (2009). the ∆Ct method. The variation between WT and ∆HR in YPD and YPG was 70. McGraw, K.O. & Wong, S.P. A common language effect size statistic. Psychol. Bull. −∆∆Ct 111, 361–365 (1992). calculated as 2 . 71. Weatheritt, R.J., Gibson, T.J. & Babu, M.M. Asymmetric mRNA localization contributes to fidelity and sensitivity of spatially localized systems. Nat. Struct. Statistical analysis for gene-level experimental investigations. The doubling Mol. Biol. 21, 833–839 (2014). 72. Grissom, R.J. & Kim, J.J. Effect Sizes for Research: Univariate and Multivariate time (Td) of WT or ∆HR SNF5 strains in YPD and YPG was obtained from growth Applications (Routledge, 2012). curves of the slopes of the best fit obtained by linear regression. The statistical 73. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. significance of differences in the budding index of cells after re-inoculation of 12, 2825–2830 (2011). stationary-phase cultures from WT or ∆HR in YPD or YPG after 1, 3 and 4 h 74. Amberd, D.C., Burke, D. & Strathern, J.N. Methods in Yeast Genetics: a Cold Spring was assessed with ANOVA. The statistical significance of differences in the dis- Harbor Laboratory Course Manual (Cold Spring Harbor Laboratory Press, 2005). 75. Rossmann, M.P. & Stillman, B. Immunoblotting histones from yeast whole-cell tribution of HA area/nuclear area between the WT and ∆HR Snf5 in YPD and protein extracts. Cold Spring Harb. Protoc. 2013, 625–630 (2013). YPG was assessed with Wilcoxon rank-sum tests. The statistical significance 76. Eng, J.K., McCormack, A.L. & Yates, J.R. An approach to correlate tandem mass for expression changes of Snf5 targets between WT and ∆HR was assessed by spectral data of peptides with amino acid sequences in a protein database. J. Am. Student’s t test, and the P values were corrected for multiple testing with the FDR Soc. Mass Spectrom. 5, 976–989 (1994). 77. Perkins, D.N., Pappin, D.J., Creasy, D.M. & Cottrell, J.S. Probability-based protein method designed by Benjamini, Hochberg and Yekutieli. identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999). Data availability. Source data of computational and experimental studies are 78. Hu, Z., Killion, P.J. & Iyer, V.R. Genetic reconstruction of a functional transcriptional regulatory network. Nat. Genet. 39, 683–687 (2007). available with the paper online (Supplementary Tables 1 and 2, Supplementary 79. Teste, M.A., Duquenne, M., François, J.M. & Parrou, J.L. Validation of reference Data Sets 1 and 2 and Supplementary Note 4). Other data are available upon genes for quantitative expression analysis by real-time RT-PCR in Saccharomyces request. cerevisiae. BMC Mol. Biol. 10, 99 (2009). © 2017 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2017

nature structural & molecular biology doi:10.1038/nsmb.3441