cells

Article Computational Assessment of Bacterial Structures Indicates a Selection Against Aggregation

Anita Carija †, Francisca Pinheiro †, Valentin Iglesias and Salvador Ventura * Institut de Biotecnologia i Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain * Correspondence: [email protected] These authors contributed equally. †  Received: 5 July 2019; Accepted: 6 August 2019; Published: 8 August 2019 

Abstract: The aggregation of compromises cell fitness, either because it titrates functional proteins into non-productive inclusions or because it results in the formation of toxic assemblies. Accordingly, computational proteome-wide analyses suggest that prevention of aggregation upon misfolding plays a key role in sequence evolution. Most proteins spend their lifetimes in a folded state; therefore, it is conceivable that, in addition to sequences, protein structures would have also evolved to minimize the risk of aggregation in their natural environments. By exploiting the AGGRESCAN3D structure-based approach to predict the aggregation propensity of >600 Escherichia coli proteins, we show that the structural aggregation propensity of globular proteins is connected with their abundance, length, essentiality, subcellular location and quaternary structure. These data suggest that the avoidance of protein aggregation has contributed to shape the structural properties of proteins in bacterial cells.

Keywords: protein aggregation; protein structure; protein function; cell proteostasis; Escherichia coli

1. Introduction Proteins are central components of almost all biological processes, being involved in a variety of complex interactions in the crowded cellular environment [1]. The establishment of non-functional protein–protein interactions has a detrimental impact on cell fitness, both because these contacts sequester proteins into inactive complexes [2] and because it can lead to the aggregation or co-aggregation of proteins into toxic soluble and insoluble assemblies [3]. Importantly, it is increasingly evident that, instead of being an unusual feature of a reduced set of proteins, aggregation is a generic property of many polypeptides [4]. Accordingly, hundreds of unrelated proteins have been reported to aggregate under stress or during ageing [5–7]. It has been suggested that this intrinsic propensity to establish anomalous interactions and aggregate is encoded in the amino acid sequence [8–10] and, therefore, a variety of complementary methods have been developed to predict those propensities from the linear sequence [11]. Large-scale analysis using these algorithms has led to the hypothesis that proteins have evolved sequence adaptations to counteract their natural propensity to aggregate [12–14]. Because in these studies aggregation is analysed along sequences, they mostly measure the aggregation potential of the unfolded state; indeed, the aggregation-prone regions (APRs) that these algorithms identify and evaluate are blocked in properly folded proteins, either because they are buried inside the hydrophobic core or engaged in the series of cooperative non-covalent interactions that sustain the secondary and tertiary protein structure [15]. These sticky sequences might, however, become accessible in case the protein fails to fold due to translational errors. Accordingly, prevention of mistranslation-induced protein misfolding is thought to constraint the evolution of sequences [16,17]. However, it is worth to point out that protein aggregation is not always deleterious and different

Cells 2019, 8, 856; doi:10.3390/cells8080856 www.mdpi.com/journal/cells Cells 2019, 8, 856 2 of 20 of mistranslation-induced protein misfolding is thought to constraint the evolution of sequences [16,17]. CellsHowever,2019, 8, 856it is worth to point out that protein aggregation is not always deleterious and 2 of 20 different organisms have exploited the structural/mechanical properties of protein aggregates for functionalorganisms purposes have [18,19]. exploited the structural/mechanical properties of protein aggregates for functional Proteinpurposes misfolding [18,19]. is not always a requisite for aggregation, and different proteins have been shown to aggregateProtein from misfolding their initial is not functional always a requisiteconformations for aggregation, [20,21], a propensity and different that proteins would havebe been exacerbatedshown in tothe aggregate crowded from cellular their initialmilieu functional[22]. This conformations suggests that[ 20many,21], aproteins propensity might that be would be kinetically,exacerbated but not thermodynamically, in the crowded cellular stable milieu in their [22]. native This suggests states under that many physiological proteins might conditions be kinetically, [23,24]. butIndeed, not thermodynamically,the aggregated state stable of ain protein their native is usually states more under stable physiological than its conditionsnative state [23 [25].,24]. Indeed, According to this view, a solution of proteins that have a certain propensity to aggregate but are the aggregated state of a protein is usually more stable than its native state [25]. According to this view, correctly folded constitutes a system in a transient state, that might ultimately reach the global free a solution of proteins that have a certain propensity to aggregate but are correctly folded constitutes energy minimum of the aggregated state [23]. Therefore, it is conceivable that proteins have evolved a system in a transient state, that might ultimately reach the global free energy minimum of the strategies to delay/prevent this deleterious transition, adjusting the solubilities of their folded states aggregated state [23]. Therefore, it is conceivable that proteins have evolved strategies to delay/prevent to those required for function. this deleterious transition, adjusting the solubilities of their folded states to those required for function. In the present work, we address the relationship between the predicted aggregation propensities In the present work, we address the relationship between the predicted aggregation propensities of protein structures and a set of intrinsic parameters relevant for their function in the cell. To this of protein structures and a set of intrinsic parameters relevant for their function in the cell. To this aim, we exploited a recently developed algorithm, AGGRESCAN3D (A3D) [26–29], to analyse 619 aim, we exploited a recently developed algorithm, AGGRESCAN3D (A3D) [26–29], to analyse 619 high-resolution Escherichia coli protein structures displaying lower than 40% . high-resolution Escherichia coli protein structures displaying lower than 40% sequence homology. A3D is an evolution of our previous AGGRESCAN algorithm [30]. The AGGRESCAN amino acid A3D is an evolution of our previous AGGRESCAN algorithm [30]. The AGGRESCAN amino acid aggregation scale was experimentally derived from protein aggregation measurements in vivo in E. aggregation scale was experimentally derived from protein aggregation measurements in vivo in coli cells [31]. This placed AGGRESCAN among the most accurate algorithms for the prediction of E. coli cells [31]. This placed AGGRESCAN among the most accurate algorithms for the prediction of intracellular protein aggregation, especially when analysing E. coli data [32]. A3D also implements intracellular protein aggregation, especially when analysing E. coli data [32]. A3D also implements these in vivo derived intrinsic propensities but they are strongly modulated by the structural context these in vivo derived intrinsic propensities but they are strongly modulated by the structural context in which they are located, allowing prediction of the aggregation properties of proteins in their folded in which they are located, allowing prediction of the aggregation properties of proteins in their folded states [26–29]. By analogy to AGGRESCAN, A3D is expected to be accurate when analysing E. coli states [26–29]. By analogy to AGGRESCAN, A3D is expected to be accurate when analysing E. coli protein structures. Overall, our structural analysis suggests that the avoidance of protein aggregation protein structures. Overall, our structural analysis suggests that the avoidance of protein aggregation has contributed to shape the properties of present proteins in this organism. has contributed to shape the properties of present proteins in this organism.

2. Material2. Material and Methods and Methods

2.1. Selection2.1. SelectionCriteria for Criteria the An foralysis the Analysisof Bacterial of BacterialProteome Proteome We employedWe employed a dataset a consisting dataset consisting of 1103 proteins of 1103 proteins,, including, including, to the best to theof our best knowledge, of our knowledge, the the most completemost complete set of setproteins of proteins experimentally experimentally detected detected in inthe the bacterial bacterial cytosol atat the the time time we we performed performedthe the analysis analysis [33 ].[33]. This This is the is the same same dataset datase wet usedwe used previously previously to address to address the relationshipthe relationship between the betweenaggregation the aggregation propensity propensity of linear of sequences linear se andquences different and protein different properties protein [34 ],properties facilitating [34], comparison facilitatingbetween comparison both studies. between The both dataset studies. was Th usede dataset to search was through used to thesearch Protein through Data the Bank Protein (PDB) [35] in Data Bankorder (PDB) to identify [35] in proteins order to from identify the dataset protei whichns from have the availabledataset which solved have crystal available/NMR structures. solved E. coli crystal/NMRproteins structures. with available E. coli three-dimensionalproteins with available structure three-di in PDBmensional format, structure with X-ray in resolutionPDB format,3.5, were ≤ with X-rayconsidered resolution in this ≤3.5, analysis. were Furthermore,considered in available this analysis. three-dimensional Furthermore, structures available covering three-<90% of dimensionalprotein structures primary cove sequencering <90% were of excluded,protein primary as well sequence as proteins were displaying excluded, sequence as well as homology proteins >40%, displayingresulting sequence in a sethomology of 619 bacterial >40%, proteinsresulting (Table in a S2).set of The 619 original bacterial 1103 proteins sequences (Table dataset S2). was The supposed original to1103 correspond sequences only dataset to cytosolic was supposed proteins to [correspond33], but we only could to identifycytosolic also proteins membrane [33], but proteins we in it could identify(Table S2). also Unless membrane otherwise proteins indicated, in it membrane (Table S2). proteins Unless were otherwise not taken indicated, into account membrane for the di fferent proteinsanalysis were not performed taken into in account this work, for asthe they different differ analysis significantly performed from the in non-membranethis work, as they proteins differin terms significantlyof charge from andthe non-membrane hydrophobicity. proteins in terms of charge and hydrophobicity.

2.2. Structural2.2. Structural and Sequential and Sequential Aggregation Aggregation Propensity Propensity Predictions Predictions PredictionsPredictions for the structures for the structures set were set perfor weremed performed using the using A3D the algorithm A3D algorithm which implements which implements structure-basedstructure-based approach approach and predicts and predicts aggregation aggregation propensity propensity of initially of initiallyfolded states folded [26]. states Selected [26]. Selected proteinsproteins were submitted were submitted to A3 toD A3Din the in the‘Static ‘Static mode’ mode’ and and 10 10 Ǻ waswasselected selected as as a distancea distance for aggregationfor aggregationanalysis analysis (default (default sphere sphere radius). radius). The The average average scores scores for for each each protein protein in in the the set set were were obtained. obtained.The The average average score score allows allows comparing comparing the the solubility solubility of proteinof protein structures structures diff differingering in size. in size. It also It allows assessing changes in solubility promoted by amino acid substitutions in a particular protein structure; Cells 2019, 8, 856 3 of 20 this value was assumed to reflect the Structural Aggregation Propensity (STAP) of any given protein. To perform sequential aggregation analysis, we used the linear predictor AGGRESCAN [30] on top of the sequences corresponding to the analysed protein structures. Normalized a4v Sequence Sum for 100 residues (Na4vSS) values were obtained, which reflect the average protein aggregation propensities of the sequences once corrected for their size.

2.3. Datasets Protein abundance data for the set of 619 bacterial proteins was taken from PaxDb [36]. The obtained values were then log10 transformed for statistical analysis. Information about protein length, protein subcellular location, number of transmembrane segments and protein active form were obtained from Swiss-Prot Protein knowledgebase [37]. Ontology (GO) enrichment analysis was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [38]. The essentiality of bacterial proteins for cellular fitness was derived from the reported data [39,40] and updated with the data available at server EcoGene 3.0 [41]. The RegulonDB was used to obtain the known E. coli operon structure set [42]. To assess the compositional determinants of STAP in the bacterial proteome, amino acids present on the surface of the 10% least and the 10% most aggregation-prone proteins were identified by A3D, and the frequency of appearance of each individual amino acid among the total set of surface amino acids belonging to the two mentioned groups was calculated. Amino acid frequencies for aggregation-prone and soluble E. coli proteins were compared using Wilcoxon test.

2.4. Definition of the Supersaturation Index (SSI) The structural supersaturation index (SSI), as the sum:

(C + A) SSI = (1) 2 where C represents normalized protein concentration and A is the normalized STAP score. The logarithm of the protein abundance levels derived from PaxDb was used in order to determine C for each polypeptide from the dataset. These values were then normalized by rescaling them between 0 and 1. (Ci min(Ci ... Cn)) C = − (2) (max(Ci ... Cn) min(Ci ... Cn)) − Cmin = the minimum value of protein concentration from the dataset; Cmax = the maximum value of protein concentration from the dataset. The propensity of proteins to aggregate represented by A3D average score was normalized in the same manner by rescaling the values between 0 and 1.

(Ai min(Ai ... An)) A = − (3) (max(Ai ... An) min(Ai ... An)) − Amin = the minimum A3D score from the dataset; Amax = the maximum A3D score from the dataset.

2.5. Statistical Analysis Statistical significance was determined using Wilcoxon test. Analyses were performed using KaleidaGraph software (Synergy Software, Reading, PA, USA). p-values < 0.01 were considered statistically significant (* statistically significant at p 0.01; ** statistically significant at p 0.001; ≤ ≤ *** statistically significant at p 0.0001). The cumulative frequencies plots presented in the manuscript ≤ give information on the percentage of proteins (y-axis) that display a value equal or lower to a value, x. Fold enrichment indicates how much higher is the proportion of hits in relation to the background sample (E. coli proteome). For every GO term, the fold enrichment is the number of positive hits Cells 2019, 8, 856 4 of 20 Cells 2019, 8, 856 4 of 20 among our list (nl) between the number of annotated proteins in our list (pl); and subsequently divided into the rate of hits of that GO term in background (np) between the number of total proteins l l amongin the our background list (n ) between (pb): the number of annotated proteins in our list (p ); and subsequently divided p into the rate of hits of that GO term in background (n ) between푙 the number of total proteins in the b 푛 background (p ): 푝푙 푛푙푝푏 nl 퐹표푙푑 퐸푛푟𝑖푐ℎ푚푒푛푡 = 푏 =l b푏 푙 (4) pl 푛 n p푛 푝 = =푏 Fold Enrichment b 푝 (4) n nbpl pb

3.3. Results Results

3.1.3.1. A3D A3D Analysis Analysis Rationale Rationale TheThe A3D A3D algorithm algorithm uses the uses protein the protein 3D structure 3D structureas an input, as which an input, is subsequently which is energetically subsequently minimizedenergetically using minimized the FoldX using force fieldthe FoldX [43]. Then, force an field aggregation [43]. Then, propensity an aggregation score is propensity calculated forscore all is thecalculated spheres with for all a 10the Å spheres radius inwith the a protein 10 Å radius structure. in the The protein variables structure. that contribute The variables to the that A3D contribute score areto (i) the the A3D experimentally-derived score are (i) the experimentally individual amino-derived acid individual aggregation amino propensities; acid aggregation (ii) the surface propensities area ; exposure(ii) the surface of the amino area exposure acids in o thef the sphere; amino and acids (iii) in the the e sphereffective; and distance (iii) the between effective adjacent distance residues between andadjacent the central residues amino and acid the incentral the sphere.amino acid Therefore, in the sphere. the A3D Therefore, score is structurallythe A3D score corrected is structurally and, incorrected contrast to and, sequence-based in contrast to aggregation sequence- predictors,based aggregation provides predictors, information provides on the STAP information of proteins on in the theirSTAP functional of proteins folded in their states functional [26]. Figure folded1 illustrates states [26] how. Figure A3D allows1 illustrates to discard how A3D the contribution allows to discard of thethe buried contribution APRs detected of the buried by sequence-based APRs detected algorithms by sequence (Figure-based1A), algorithms and how residues (Figure scattered1A), and inhow theresidues sequence scattered can come in together the sequence in the structure can come upon together folding in to the form structure an aggregation-prone upon folding surface,to form as an identifiedaggregation by A3D-prone (Figure surface1B)., as identified by A3D (Figure 1B).

Figure 1. Analysis of structural aggregation propensity (STAP) for representative soluble and aggregation-prone proteins. (A) Comparison between the aggregation propensity of representative solubleFigure (PDB: 1. Analysis 3T36:A; leftof structural panel) and aggregation aggregation-prone propensity proteins (STAP) (PDB: for 2HO9:A; representative right panel) soluble using and sequence-basedaggregation-prone predictors proteins. and ( A3D.A) Comparison Aggregation-prone between the residues aggregation are indicated propensity in di ffoferent representative colours forsoluble each predictor: (PDB: 3T36:A; AGGRESCAN left panel) (black),and aggregation TANGO- (grey)prone andproteins A3D (PDB: (red). 2HO9:A; (B) A3D right analysis panel) of theusing abovesequence mentioned-based proteins.predictors The and A3D A3D. average Aggregation scores-prone for PDBs residues 3T36:A are and indicated 2HO9:A in aredifferent –1.061 colours and –0.421,for each respectively. predictor: The AGGRESCAN protein surface (black), is coloured TANGO according (grey) and to A3D A3D score (red). in ( aB) gradient A3D analysis fromblue of the (high-predicted solubility) to white (negligible impact on protein aggregation) to red (high-predicted above mentioned proteins. The A3D average scores for PDBs 3T36:A and 2HO9:A are –1.061 and – aggregation propensity). 0.421, respectively. The protein surface is coloured according to A3D score in a gradient from blue (high-predicted solubility) to white (negligible impact on protein aggregation) to red (high-predicted aggregation propensity).

Cells 2019, 8, 856 5 of 20

3.2. Relationship Between Protein Abundance and Structural Aggregation Propensity Despite that non-functional interactions might be detrimental, statistically, the large number of non-functional contacts a protein can establish clearly outweighs its functional interactions. This is especially true for abundant proteins, since, according to the law of mass action, the probability to establish a non-functional interaction should be proportional to the protein abundance [2]. Indeed, most protein aggregation processes are strongly dependent on the initial protein concentration [44]. Therefore, an abundant protein with an aggregation-prone surface is expected to be more dangerous than a low-abundant protein with the same surface stickiness [12]. Thus, a negative relationship between protein STAP and protein abundance can be expected. We explored this relationship using E. coli as a model organism. We obtained the abundance data for the 612 proteins in our structural dataset and we proceeded by log10 transforming the reported abundance for statistical analysis, since the number of mRNAs in the bacterial cytosol encoding a given protein can vary greatly from 1 to 100,000 [45]. The negative correlation between STAP and abundance is low (r = 0.21), but highly significative (p = 7.50 10 7). × − The comparison of the STAP distribution in the 25% most and least abundant proteins in this dataset is shown in Figure2A–C and illustrates how highly abundant proteins and proteins present at low concentrations exhibit, indeed, differential aggregation propensities (p = 3.26 10 4), with low × − abundant proteins displaying surfaces that can support much higher aggregation load than those of high abundant polypeptides (Figure2D). The probability density function of abundant proteins illustrates the absence of polypeptides with very high aggregation propensity in this group, with a concomitant enrichment in highly soluble proteins (Figure2C). In addition, low abundant proteins seem not to be under selection for protein aggregation, as the STAP of the 25% of the least abundant proteins is significantly higher than that of the conjunct of the remaining 75% protein structures (p = 7.77 10 5). × − The higher solubility of abundant proteins would likely work to prevent non-functional interactions in these proteins, even if they become concentrated at specific sub-cytosolic locations. Different studies have reported a relationship between the abundance of proteins and the aggregation propensity of linear protein sequences [12,34,46,47]. This has been explained in terms of natural selection acting on protein unfolded states in order to minimize aggregation in case of eventual misfolding. In order to test whether this relationship also exists in our dataset, we used our linear aggregation-predictor AGGRESCAN, which uses exactly the same intrinsic amino acid propensities that A3D without taking into account the structural context, to calculate the aggregation properties of the correspondent 612 sequences. As expected, the 25% most abundant proteins display more soluble sequences than the 25% least abundant (Figure3). However, the significance of the solubility differences between these two subclasses is lower when considering sequences (p = 3.91 10 3) than × − when we consider structures (p = 3.26 10 4). Thus, it seems that, at least in E. coli, in addition to × − misfolding, structural misassembly, driven by the establishment of non-native interactions between folded protein structures, might also constrain the evolution of proteins [2]. Overall, it can be concluded that proteins’ STAP seems to be adjusted to the gene expression levels required for an optimal cell function. Therefore, mutations that decrease the proteins’ surface solubility, or increases in protein expression levels might exacerbate the probability to aggregate [12]. Cells 2019, 8, 856 6 of 20 Cells 2019, 8, 856 6 of 20 Cells 2019, 8, 856 6 of 20

Figure 2. Relationship between STAP and protein abundance. (A) Cumulative distribution of STAP FigureFigure 2. 2.Relationship Relationship between between STAP STAP and and protein protein abundance. abundance. (A ()A C) umulativeCumulative distribution distribution of of STAP STAP for the 25% most (red line) and the 25% least (blue line) abundant proteins. (B) Box plot showing forfor the the 25% 25% most most (red (red line) line) and and the the 25% 25% least least (blue (blue line) line) abundant abundant proteins. proteins. (B ()B Box) Box plot plot showing showing the the the comparison between the A3D average score of the 25% most (red) and the 25% least (blue) comparisoncomparison between between the the A3D A3D average average score score of of the the 25% 25% most most (red) (red) and and the the 25% 25% least least (blue) (blue) abundant abundant abundantproteins.proteins. proteins.(C ()C Probability) Probability (C) Probability density density function function density of of functionthe the 25% 25% most of most the (red) 25%(red) and mostand the the (red) 25% 25% andleast least the(blue) (blue) 25% abundant leastabundant (blue) abundantproteins.proteins. proteins.(D ()D A3D) A3D analysis ( Danalysis) A3D of analysisof representative representative of representative low low-abundance-abundance low-abundance (PDB: (PDB: 2WSX:A; 2WSX:A; (PDB: upper upper 2WSX:A; panel) panel) upperand and high high panel)- - andabundanceabundance high-abundance proteins proteins proteins(PDB: (PDB: 1DFO:A; 1DFO:A; (PDB: 1DFO:A;lower lower panel). panel). lower Views Views panel). of of opposite Views opposite of sides oppositesides of of each each sides protein protein of each are are proteinshown. shown. are shown.The A3D The average A3D average scores for scores 2WSX:A for 2WSX:A and 1DFO:A and 1DFO:A are 0.052 are and 0.052 −0.799, and while0.799, log while10 values log ofvalues their of The A3D average scores for 2WSX:A and 1DFO:A are 0.052 and −0.799,− while log10 values10 of their theirprotein protein abundance abundance are − are1.523 1.523and 3.420, and 3.420,respectively. respectively. Colour Colourcode is codeas in isFigure as in 1. Figure 1. protein abundance are −1.52− 3 and 3.420, respectively. Colour code is as in Figure 1.

Figure 3. Relationship between sequential aggregation propensity and protein abundance. Figure 3. Relationship between sequential aggregation propensity and protein abundance. (A) (A)Figure Cumulative 3. Relationship distribution between of Na4vSS sequential values for aggregation the 25% most propensity (red line) and and protein the 25% abundance. least abundant (A) CumulativeCumulative distribution distribution of of Na4vSS Na4vSS values values for for the the 25% 25% most most (red (red line) line) and and the the 25% 25% least least abundant abundant proteins (dashed blue line). (B) Box plot representing the Na4vSS values for the 25% most (red) and the 25% least (blue) abundant proteins.

Cells 2019, 8, 856 7 of 20

Cells 2019proteins, 8, 856 (dashed blue line). (B) Box plot representing the Na4vSS values for the 25% most (red) and7 of 20 the 25% least (blue) abundant proteins.

3.3.3.3. Relationship Relationship B Betweenetween Protein Protein Length Length and and Structural Structural Aggregation Aggregation Propensity PreviousPrevious studies studies have have suggested suggested that that the the length length of of a a given given protein protein might might be be an important determinantdeterminant of of its its aggregation aggregation propensity. propensity. In In this this way, way, it it has been shown that longer proteins were moremore likely to to co co-aggregate-aggregate and and be be sequestered sequestered inin vivo by artificiallyartificially designed aggregation-proneaggregation-prone polypeptides [48] [48].. Also, Also, it it has has been been reported reported that that the the least least soluble soluble proteins proteins of of three three distinct distinct eukaryotic eukaryotic organismsorganisms share share several common common traits, traits, one one of them being generally longer longer in in comparison comparison to to highly highly solublesoluble proteins proteins [49] [49].. The The comparison comparison of of the the 25% 25% shortest shortest and and the the 25% longest polypeptides polypeptides in in our protein set set,, indicates that both groups exhibitexhibit didifferentfferent STAPSTAP ((pp= = 2.652.65 × 1010−9)) ( (FigureFigure 44A–C).A–C). ShortShort × − proteins exhibit a wider distribution of A3D average scores (SD(SD = 0.0.30),30), indicating that the dynamic rangerange of aggregation propensitiespropensities in in this this group group is is larger larger when when compared compared to to the the group group composed composed of theof the25% 25% longest longest polypeptides polypeptides (SD (=SD0.18) = 0.18 (Figure) (Figure4C). 4C).

Figure 4. Relationship between STAP and protein length. (A) Comparison between cumulative Figure 4. Relationship between STAP and protein length. (A) Comparison between cumulative distribution of STAP for the 25% longest (red line) and the 25% shortest proteins (blue line). (B) Box distribution of STAP for the 25% longest (red line) and the 25% shortest proteins (blue line). (B) Box plot representing the A3D average score of the 25% longest (red) and the 25% shortest (blue) proteins. plot representing the A3D average score of the 25% longest (red) and the 25% shortest (blue) proteins. (C) Probability density function of the 25% longest (red) and the 25% shortest proteins (blue). (C) Probability density function of the 25% longest (red) and the 25% shortest proteins (blue). (D) (D) Cumulative distribution of protein length for the 25% proteins with the highest STAP (solid Cumulative distribution of protein length for the 25% proteins with the highest STAP (solid line) and line) and the 25% proteins with the lowest STAP (dashed line). the 25% proteins with the lowest STAP (dashed line).

Cells 2019, 8, 856 8 of 20 Cells 2019, 8, 856 8 of 20

ToTo confirm confirm the the relationship relationship betw betweeneen size size and and solubility, solubility, we we compared compared the the size size of of the the 25% 25% m mostost solublesoluble proteins, proteins, according according to to their their A3D A3D average average score, score, and and the the 25% 25% least least soluble. soluble. These These two two sub subsetssets correspondcorrespond to to proteins proteins with with different different size properties (p == 9.119.11 × 10−77) )( (FigureFigure 4DD).). As expected, the × − groupgroup of of low low aggregation aggregation-prone-prone proteins proteins is is enriched enriched in in small small polypeptides, polypeptides, whereas whereas the the proteins proteins in in thethe high high aggregation aggregation-prone-prone set set are are clearly clearly larger larger ( (FigureFigure 4D).D). A visual inspection of the aggregation surfacessurfaces of of short short and and long long proteins proteins suggests suggests that that the the higher higher number number of of exposed exposed aggregation aggregation-prone-prone structuralstructural patches patches in in the the longest longest proteins proteins could could be be an an important important determin determinantant of of their their higher higher ST STAP.AP. TheThe presence presence of of these these patches patches likely likely responds respond to thes factto that the these fact proteins that these can interactproteins simultaneously can interact simultaneouslywith a larger number with of a partners larger number or through of largerpartners protein or through interfaces, larger but this protein functionality interfaces, also but involves this functionalitya higher risk also of aggregation. involves a higher risk of aggregation. WWee used AGGRESCAN toto calculatecalculate the the aggregation aggregation propensities propensities of of all all the the protein protein sequences sequences and andthen then grouped grouped them them as described as described above. above The. T overallhe overall aggregation aggregation propensities propensities of the of 25%the 25% shortest shortest and andthe longestthe longest protein protein sequences sequences do not do di ffnoter significantlydiffer significantly (p = 0.72) (p (Figure= 0.72) 5().Figure The di 5)ff. erencesThe differences observed observedwhen considering when considering structural structural or sequential or sequential aggregation aggregation propensities propensities are not surprising are not surprising if we think if thatwe thinkmany that of the many sequential of the sequential aggregation-prone aggregation regions-prone are regions hidden are inside hidden the structureinside the when structure the protein when theis folded protein (Figure is folded1A) ( andFigure that 1A many) and structuralthat many aggregation-prone structural aggregation regions-prone at the regions protein at the surface protein are surfacecomposed are composed of residues of that residues are not that consecutive are not consecutive in the sequence. in the sequence. In any case, In any at least case, in at our least bacterial in our bacterialprotein dataset, protein the dataset, apparent the relationshipapparent relationship between aggregation between aggregation and size is bestand explainedsize is best if explained we consider if wethat consider the selective that the pressure selective acts pr onessure protein acts structures on protein instead structures of/in instead addition of/in to sequences. addition to sequences.

FigureFigure 5. 5. Relationship Relationship between between sequential sequential aggregation aggregation propensity propensity and andprotein protein length. length. (A) Cumulative(A) Cumulative distribution distribution of Na4vSS of Na4vSS values values for the for25% the longest 25% longest(red line) (red and line) the 25% and shortest the 25% p shortestroteins (dashedproteins blue (dashed line). blue (B) line). Box plot(B) Box showing plot showing the comparison the comparison between between the Na4vSS the Na4vSS values values of the of 25% the longest25% longest (red) (red)and the and 25% the shortest 25% shortest (blue) (blue) proteins. proteins. 3.4. Relationship Between Structural Aggregation Propensity and Protein Function 3.4. Relationship Between Structural Aggregation Propensity and Protein Function The set of in an operon share a common gene expression regulation and are generally The set of genes in an operon share a common gene expression regulation and are generally connected by their biological function. As a result, proteins encoded by the same operon are suggested connected by their biological function. As a result, proteins encoded by the same operon are to be present in similar amounts in the cell [33]. The observed association between STAP and abundance suggested to be present in similar amounts in the cell [33]. The observed association between STAP would imply that polypeptides in the same operon should have related aggregation properties. To test and abundance would imply that polypeptides in the same operon should have related aggregation this hypothesis, we first ascribed the proteins in our complete dataset to individual operons. 234 out of properties. To test this hypothesis, we first ascribed the proteins in our complete dataset to individual 619 proteins could be ascribed to a particular E. coli operon (Table S1). We calculated A3D average operons. 234 out of 619 proteins could be ascribed to a particular E. coli operon (Table S1). We scores and, as hypothesized, found out that the standard deviation of the A3D average score between calculated A3D average scores and, as hypothesized, found out that the standard deviation of the proteins regulated by the same operon is lower in 93% of the cases (38 out of 41) than the standard A3D average score between proteins regulated by the same operon is lower in 93% of the cases (38 deviation in the complete set of proteins that could be ascribed to a particular operon (Figure6). out of 41) than the standard deviation in the complete set of proteins that could be ascribed to a particular operon (Figure 6).

Cells 2019, 8, 856 9 of 20 Cells 2019, 8, 856 9 of 20

In the loss of function scenario, the impact of protein aggregation on cellular fitness would be In the loss of function scenario, the impact of protein aggregation on cellular fitness would be ultimately associated with the particular protein activity. Therefore, it is expected that evolution ultimately associated with the particular protein activity. Therefore, it is expected that evolution would select for an overall decreased aggregation propensity in the proteins from operons involved would select for an overall decreased aggregation propensity in the proteins from operons involved in in essential cellular functions [50]. To explore this possibility, the bacterial operons were divided into essential cellular functions [50]. To explore this possibility, the bacterial operons were divided into two two groups according to their A3D average scores, those with lower (LA operons) and higher groups according to their A3D average scores, those with lower (LA operons) and higher aggregation aggregation propensity (HA operons) than the mean propensity of the complete operon protein set propensity (HA operons) than the mean propensity of the complete operon protein set (A3D average (A3D average score = −0.880). The essentiality of approximately half of the proteins in each subset has score = 0.880). The essentiality of approximately half of the proteins in each subset has been annotated been annotated− via genetic footprinting or knockout experiments [39,40]. Importantly, considering via genetic footprinting or knockout experiments [39,40]. Importantly, considering only the annotated only the annotated polypeptides, the majority of proteins regulated by LA operons are essential, polypeptides, the majority of proteins regulated by LA operons are essential, whereas most of those whereas most of those in HA operons are non-essential (Table S1, Figure 6B). This supports the view in HA operons are non-essential (Table S1, Figure6B). This supports the view that, as a trend, the that, as a trend, the structures of essential bacterial proteins suffer a stronger selection against structures of essential bacterial proteins suffer a stronger selection against aggregation than those of aggregation than those of non-essential ones. non-essential ones.

Figure 6. The variation in A3D average scores for known bacterial operons. (A) The standard deviation of A3D average scores in 41 analysed operons was calculated and operons were divided Figure 6. The variation in A3D average scores for known bacterial operons. (A) The standard into 10 classes. The dashed line indicates the standard deviation in the complete set of proteins (0.4). deviation of A3D average scores in 41 analysed operons was calculated and operons were divided Low standard deviation within an operon indicates that the aggregation propensity of its proteins is into 10 classes. The dashed line indicates the standard deviation in the complete set of proteins (0.4). similar. (B) Number of essential (green) and non-essential (orange) proteins codified by operons with Low standard deviation within an operon indicates that the aggregation propensity of its proteins is low (left panel) and high (right panel) aggregation propensity. similar. (B) Number of essential (green) and non-essential (orange) proteins codified by operons with low (left panel) and high (right panel) aggregation propensity.

3.5. Effect of Subcellular Location on the Structural Aggregation Propensity

Cells 2019, 8, 856 10 of 20

3.5. Effect of Subcellular Location on the Structural Aggregation Propensity It is of great importance that the protein maintains its biological function in its native state no Cells 2019, 8, 856 10 of 20 matter in which subcellular location this protein resides [46,47,51]. Bacterial proteins can populate other subcellularIt is of great compartments importance that apart the fromprotein the maintains cytosol, its like biological the periplasm function in and its native the inner state no and outer membranes.matter Presumably,in which subcellula theirr aggregationlocation this protein properties resides would [46,47,51] be. adapted Bacterial toproteins the specific can populat environmente in thoseother subcellular subcellular locations. compartments A3D apart analysis from showsthe cytosol, that like proteins the periplasm residing and in the the inner bacterial and outer cytoplasm membranes. Presumably, their aggregation properties would be adapted to the specific environment and the periplasm possess the lowest STAP of the complete dataset (Figure7A,B). After excluding in those subcellular locations. A3D analysis shows that proteins residing in the bacterial cytoplasm proteinsand with the periplasm transmembrane possess the domains, lowest ST theAP distributionof the complete of dataset the 20% (Figure proteins 7A,B). After with excluding the lowest A3D averageproteins scores withand 20%transmembrane proteins with domains the,highest the distribution A3D average of the 20% scores proteins in these with two the compartments lowest A3D was analysed.average We scores found and out 20% that proteins soluble with proteins the highest were A3D enriched average inscores both in the these cytosol two compartments and the periplasm, whereaswas enrichment analysed. inWe this found last out compartment that soluble was proteins not observed were enriched among in both the aggregation-prone the cytosol and the proteins (Figureperiplasm,7C). This whereas is consistent enrichment with in thethis factlast compartment that the periplasm, was not observed in contrast among to the the aggregation cytosol,- lacks a prone proteins (Figure 7C). This is consistent with the fact that the periplasm, in contrast to the sophisticated cellular system able to control protein quality and avoid aggregation [52], and is separated cytosol, lacks a sophisticated cellular system able to control protein quality and avoid aggregation from the[52] outside, and is separated solution from by athe highly outside permeable solution by a outer highly membrane permeable outer that membrane provides that limited provides protection againstlimited environmental protection against variations. environmental variations.

Figure 7. Relationship between STAP and protein subcellular location. (A) The bacterial proteins Figure 7. Relationship between STAP and protein subcellular location. (A) The bacterial proteins were dividedwere divided into four into groups four groups based basedon the on protein the protein localization localization according according to UniProt. to UniProt. (B) A(B) cumulative A distributioncumulative of STAP distribution of proteins of STAP located of proteins in the located cytoplasm in the cytoplasm (C; purple), (C; purple), periplasm periplasm (P; dark (P; dark grey), outer membranegrey), (OM; outer red) membrane and inner (OM; membrane red) and inner (IM; membrane blue). (C (IM;) Proteins blue). (C with) Proteins 20% lowestwith 20% (left lowest panel) (left and 20% proteinspanel) highest and 20% STAP proteins (right highest panel) STAP belonging (right panel) to the belonging dataset to in the which dataset proteins in which with proteins transmembrane with segmentstransmembrane (TS) were segments excluded, (TS) according were excluded, to GO according terms. to OnlyGO terms. GO Only cellular GO cellular component component terms with terms with p-value < 0.05 and false discovery rate (FDR) <0.05 were plotted. (D) Diagram for the p-value < 0.05 and false discovery rate (FDR) <0.05 were plotted. (D) Diagram for the analysed IM analysed IM proteins showing the number of TS and STAP. (E) Analysis of the molecular functions proteins showing the number of TS and STAP. (E) Analysis of the molecular functions associated to IM

proteins with A3D average score <0 (left panel) and A3D average score 0 (right panel) according to ≥ GO terms proposed by PANTHER. Cells 2019, 8, 856 11 of 20 Cells 2019, 8, 856 11 of 20

associated to IM proteins with A3D average score <0 (left panel) and A3D average score ≥0 (right Notpanel) surprisingly, according to inner GO terms membrane proposed by (IM) PANTHER. proteins exhibit the highest theoretical aggregation propensities of all bacterial proteins, according to A3D analysis (Figure7A,B). The gram-negative bacterialNot inner surprisingly, membrane inner is a membrane semipermeable (IM) proteins shield exhibit that preserves the highest the theoretical cytoplasm aggregation environment. The proteinspropensities associated of all bacterial with this proteins, membrane according can haveto A3D a variableanalysis ( numberFigure 7A, ofB). TS The per gram protein-negative [53]. These regionsbacterial are stableinner membrane in the hydrophobic is a semipermeable environment shield that of thispreserves lipid the bilayer cytoplasm due toenvironment. their enrichment The in apolarproteins residues. associated A visual with this inspection membra ofne theircan have structures a variable shows number that of theTS per highly protein aggregation-prone [53]. These regions are stable in the hydrophobic environment of this lipid bilayer due to their enrichment in surfaces identified by A3D in many of these proteins sharply coincide with regions embedded in the apolar residues. A visual inspection of their structures shows that the highly aggregation-prone membrane, since the membrane width could be traced in their A3D coloured structure (Figure8A,B). surfaces identified by A3D in many of these proteins sharply coincide with regions embedded in the Interestingly,membrane, when since thethe A3Dmembrane average width scores could of be inner traced membrane in their A3D proteins coloured were structure plotted (Figure as a distribution, 8A,B). the existenceInterestingly, of two when protein the A3D groups average became scores evident of inner (Figure membrane7B). We proteins foundthat were the plotted main di as ff aerence betweendistribution, these two the groupsexistence is of the two number protein of groups TS (Figure became7D). evident In fact, (Figure the inner 7B). We membrane found that group the main contains a numberdifference of proteins between without these two TS groups and a lowis the STAP. number An of analysis TS (Figure of their 7D). structuresIn fact, the indicatesinner membrane that they are devoidgroup of largecontains aggregation-prone a number of proteins patches without suggesting TS and thata low they STAP. associate An analysis to the of membrane their structures transiently (Figureindicates8C,D). that Proteins they are with devoid a single of large TS aggregation also display-prone significant patches predictedsuggesting solubility,that they associate likely because to apartthe from membrane the short transiently TS anchored (Figure to8C, theD). Proteins membrane, with usuallya single TS an alsoα-helix, display they significant also exhibit predicted globular solubility, likely because apart from the short TS anchored to the membrane, usually an α-helix, they domains facing either the cytosol or the periplasm (Figure7D). Proteins exhibiting two or more TS also exhibit globular domains facing either the cytosol or the periplasm (Figure 7D). Proteins account for the most aggregation prone structures in the analysed sub-proteome. However, it is exhibiting two or more TS account for the most aggregation prone structures in the analysed sub- worthproteome. to mention However, that it these is worth high to theoreticalmention that propensitiesthese high theoretical do not propensities necessarily do translate not necessaril intoy a high aggregationtranslate ininto the a cell, high where aggregation IM proteins’ in the hydrophobic cell, where IM stretches proteins remain' hydrophobic protected stretches inside the remain lipophilic membraneprotected environment. inside the lipophilic membrane environment.

FigureFigure 8. A3D 8. A3D analysis analysis of of selected selected representativerepresentative inner inner membrane membrane proteins. proteins. (A) Aquaporin(A) Aquaporin Z (PDB: Z (PDB: 1RC2:A;1RC2:A; A3D A3D average averag scoree score= =0.417); 0.417); ((BB)) NitrateNitrate/nitrite/nitrite transporter transporter NarK NarK (PDB: (PDB: 4JR9 4JR9:A;:A; A3D A3Daverage average scorescore= 0.051); = 0.051); (C )(C Acyl-CoA) Acyl-CoA thioester thioester hydrolasehydrolase YbgC YbgC (PDB: (PDB: 1S5U:A; 1S5U:A; A3D A3D average average score score = −0.664);= 0.664); − and (D) Septum site-determining protein MinD (PDB: 3Q9L:A; A3D average score = 1.031). The dashed − line indicates the existence of two protein groups among inner membrane proteins, exhibiting different STAPs. Lateral view and top view of each protein together with the views of opposite sides are shown. Colour code is as in Figure1. Cells 2019, 8, 856 12 of 20

and (D) Septum site-determining protein MinD (PDB: 3Q9L:A; A3D average score = −1.031). The dashed line indicates the existence of two protein groups among inner membrane proteins, exhibiting different STAPs. Lateral view and top view of each protein together with the views of opposite sides Cells 2019are, shown.8, 856 Colour code is as in Figure 1. 12 of 20

An analysis of the molecular functions associated to IM proteins with lower and higher predictedAn analysis aggregation of the propensity molecular indicates functions that associated they seem to to IM play proteins different with cellular lower roles, and the higher low predictedaggregation aggregation-prone ones propensity displaying indicates preferentially that theycatalytic seem activity, to play and different the high cellular aggregation roles, the-prone low aggregation-proneproteins possessing ones mainly displaying the role preferentiallyof transporters catalytic (Figure 7 activity,E). and the high aggregation-prone proteinsAs IM possessing proteins, mainly outer membrane the role of transportersproteins (OM) (Figure are located7E). in a hydrophobic environment, and consequently,As IM proteins, they are outer usually membrane thought proteins to have (OM) a high are aggregation located in a tendency. hydrophobic However, environment, their STAP and consequently,lies between those they areof cytoplasmic usually thought and toIM have proteins a high (Figure aggregation 7A,B). tendency.In fact, the However, outer membrane their STAP acts lies as betweena permeable those barrier of cytoplasmic to hydrophilic and substances. IM proteins In (Figure general,7A,B). outer In membrane fact, the outer proteins membrane display a acts β-barrel as a permeablestructure that barrier encloses to hydrophilic a hydrophilic substances. cavity surrounded In general, outerby a hydrophobic membrane proteins outer layer display embedded a β-barrel in structurethe membrane. that encloses A3D structural a hydrophilic predictions cavity surrounded are able to by capture a hydrophobic this particular outer architecture. layer embedded In the in thepredictions membrane. it can A3D be easily structural seen a predictionshigh aggregation are able-prone to capture fringe in this the particular outside, re architecture.strained to the In exact the predictionsboundary embedded it can be easily in the seen membrane a high aggregation-prone, flanked by low aggregation fringe in the-prone outside, regions restrained that protrude to the exact out boundaryof the membrane embedded (Figure in the 9). membrane, Seen from flankedabove, or by below, low aggregation-prone it can be observed regions that the that residues protrude flanking out of thethe membrane cavity at both (Figure inner9). Seen and from outer above, sides or conform below, it surfaces can be observed of low aggregation that the residues propensity. flanking This the cavityparticular at both assembly inner and is outer achieved sides by conform alternating surfaces hydrophobic of low aggregation and hydrophilic propensity. segments This particular in the assemblysequence is[54] achieved. This byexplains alternating why hydrophobic when linear and predictors hydrophilic were segments used to in analyse the sequence the aggregation [54]. This explainspropensities why whenof these linear proteins, predictors they were were used wrongly to analyse predicted the aggregation as being propensities even more soluble of these than proteins, the theycytoplas weremic wrongly proteins predicted [34]. In reality, as being in even these more proteins, soluble the than scattered the cytoplasmic aggregation proteins-prone [regions34]. In reality, in the insequence these proteins, come together the scattered in the aggregation-prone structure to allow regions their stable in the insertion sequence in come the togethermembrane in the, where structure they toremain allow protected their stable from insertion aggregation. in the membrane, where they remain protected from aggregation.

Figure 9. A3D analysis of selected representative outer membrane proteins. (A) Usher protein FimD (PDB:Figure 3RFZ:B; 9. A3D A3Danalysis average of selected score = representative0.565); (B) Outer outer membrane membrane protein proteins. G (PDB: (A) 2F1C:X; Usher protein A3D average FimD − score(PDB:= 3RFZ:B;0.275); A3D(C) Long-chain average score fatty acid = −0.565); transport (B) protein Outer membrane (PDB: 1T16:A; protein A3D average G (PDB: score 2F1C:X;= 0.350) A3D − − and (D) Fe(3+) dicitrate transport protein FecA (PDB: 1KMO:A; A3D average score = 0.296). Lateral − view and top view of each protein together with the views of opposite sides are shown. Colour code is as in Figure1. Cells 2019, 8, 856 13 of 20

average score = −0.275); (C) Long-chain fatty acid transport protein (PDB: 1T16:A; A3D average score = −0.350) and (D) Fe(3+) dicitrate transport protein FecA (PDB: 1KMO:A; A3D average score = −0.296). Lateral view and top view of each protein together with the views of opposite sides are shown. Colour Cells 2019code, 8, 856 is as in Figure 1. 13 of 20

3.6. Compositional Determinants of Protein Structural Aggregation Propensity 3.6. Compositional Determinants of Protein Structural Aggregation Propensity To understand whether there is any compositional bias in the surfaces of the 10% most and 10% leastTo aggregation understand- whetherprone E. therecoli structures, is any compositional we compared bias the in frequency the surfaces of ofthe the 2010% natural most amino and 10% acids leastat their aggregation-prone surfaces. ElevenE. out coli structures,of twenty amino we compared acids did the not frequency show any of significant the 20 natural enrichment amino acids in any at of theirthese surfaces. two groups Eleven (Figure out of 10 twenty). However, amino we acids found did notout showthat charged any significant residues enrichment (glutamic acid, in any lysine of theseand twoarginine) groups were (Figure significantly 10). However, enriched we in found the soluble out that structures charged’ residuessubset (Figure (glutamic 10). acid,It is generally lysine andaccepted arginine) that were charge significantly–charge interactions enriched in are the important soluble structures’ for protein subset solubility (Figure by inducing10). It is generallylong-range acceptedrepulsion that between charge–charge alike-charged interactions species. are This important has been for clearly protein demonstrated solubility by by inducing introducing long-range multiple repulsioncharged between side chains alike-charged in proteins species., a strategy This that has provides been clearly an increased demonstrated resilience by introducing towards aggregation, multiple chargedparticularly side chains at elevated in proteins, temperatures a strategy that[55,56] provides. Conversely, an increased neutralization resilience towards of surface aggregation, charges is particularlysuggested at to elevated be required temperatures for efficient [55,56 amyloid]. Conversely, fibril neutralization formation [57] of and surface chemical charges and is suggested mutational toneutralization be required for of effi chargescient amyloid has indeed fibril formation been shown [57] andto promote chemical aggregation and mutational [58] neutralization. Importantly, of the chargesenrichment has indeed in these been charged shown residues to promote is no aggregationt associated [58 with]. Importantly, differences the in enrichment the pI of soluble in these and chargedaggregation residues-prone is not proteins associated (p = with0.228). di fferences in the pI of soluble and aggregation-prone proteins (p = 0.228).

Figure 10. Comparison of amino acid composition in soluble and aggregation-prone protein Figure 10. Comparison of amino acid composition in soluble and aggregation-prone protein surfaces. Plot of p-values correlating surface amino composition for two protein groups: 10% surfaces. Plot of p-values correlating surface amino composition for two protein groups: 10% least least and 10% most aggregation-prone proteins. In black: amino acids that do not show significant and 10% most aggregation-prone proteins. In black: amino acids that do not show significant differences. In blue: amino acids that are more abundant in soluble proteins. In red: amino acids that differences. In blue: amino acids that are more abundant in soluble proteins. In red: amino acids that are more abundant in aggregation-prone proteins. are more abundant in aggregation-prone proteins. The aromatic residues (phenylalanine, tyrosine and tryptophan) were enriched in aggregation-prone structuresThe (Figurearomatic 10 residues). This (phenylalanine, can be related tyrosine to the fact and that tryptophan) exposed were aromatic enriched residues in aggregation facilitate - protein–proteinprone structures interactions (Figure 10 and). This indeed can be these related three to residues the fact arethat more exposed frequent aromatic at protein residues interfaces facilitate thanprotein at their–protein surfaces interactions [59]. This propertyand indeed can these be partially three r explainedesidues are by more their frequent ability to at establish protein both interfacesπ–π orthanπ–cation at the interactions,ir surfaces [59] their. This flat surfacesproperty andcan thebe partially entropic benefitexplained of hidingby their them ability from to waterestablish inside both interfaces.π–π or π However,–cation interactions, these properties their flat also surfaces imply an and increased the entropic probability benefit of of establishing hiding them competing from water non-functionalinside interfaces. interactions. However, The these trade-o propertiesff between also proper imply and an anomalous increased probabili interactionsty of will establishing explain whycompeting we did not non find-functional any hydrophobic interactions. aliphatic The residue trade-off enriched between in aggregation-prone proper and anomalous structures, interactions since theywill will explain increase why the we aggregation did not find potential any hydrophobic without providing aliphatic aresidue significant enriched counteracting in aggregation functional-prone advantage.structures, We since also identified they will glycine increase and the proline aggr enrichedegation in potential the surface without of aggregation-prone providing a significant proteins (Figurecounteracting 10). These functional two residues advantage. are bad Weβ also-sheet identified formers glycine and their and presenceproline enriched might counteract in the surface the of enrichmentaggregation in aromatic-prone proteins residues ( byFigure diminishing 10). These the protein two residues probability are to bad form β- intermolecularsheet formers β and-sheets their contacts leading to the formation of protein aggregates [60]. Cells 2019, 8, 856 14 of 20

presence might counteract the enrichment in aromatic residues by diminishing the protein probability to form intermolecular β-sheets contacts leading to the formation of protein aggregates Cells 2019, 8, 856 14 of 20 [60].

3.7.3.7. Relationship Relationship Between Between Functional Functional Protein Protein Assemblies Assemblies and and Protein Protein Structural Structural Aggregation Aggregation Propensity Propensity AsAs we we described described above,above, there is is a anegative negative correlation correlation between between protein protein abundance abundance and andSTAP. STAP.Indeed, Indeed, it has it has been been shown shown that that “supersaturated” “supersaturated” proteins, proteins, which are are maintained maintained at at a higha high concentrationconcentration relative relative to to their their solubility solubility are are among among the the most most aggregation aggregation susceptible susceptible proteins proteins in in diffdifferenterent proteomes proteomes [61 [61]]. The. The presence presence of of supersaturated supersaturated proteins proteins in in the the cell cell may may respond respond to to the the requirementsrequirements to exert to exert their their biological biological functions. functions. The stoichiometry The stoichiometry of multimeric of multimeric proteins proteins might imply might thatimply their that monomeric their monomeric subunits would subunits be supersaturatedwould be supersa relativeturated to proteinsrelative thatto proteins are active that in are the active cell as in monomers.the cell as We monomers. calculated We a new calculated parameter, a new the parameter, structural supersaturationthe structural supersaturation index (SSI), for index the subunits (SSI), for ofthe oligomeric subunits and of foroligomeric monomeric and proteinsfor monomeric in our dataset proteins (Figure in our 11 datasetA). This (Figure parameter 11A). reflects This parameter the risk ofreflects a folded the protein risk of subunit a folded to aggregateprotein subunit at its physiological to aggregate concentrationat its physiological in the concentration cell. The structures in the ofcell. monomericThe structures proteins of monomeric displayed significantlyproteins displayed lower SSIsignificantly values than lower that SSI of oligomericvalues than protein that of subunitsoligomeric (p protein= 3.29 subunits10 7). Next,(p = 3.29 we × 10 analysed−7). Next, the we oligomeric analysed the state oligomeric of protein state subunits of protein displaying subunits displaying the 25% × − lowestthe 25% and lowest highest and SSI highest (Figure 11SSIB). (Figure The low 11B). SSI The group low contained SSI group 33.0% contained of the monomers33.0% of the and monomers 21.0% ofand the oligomers21.0% of the in oligomers the dataset, in respectively.the dataset, respectively. In contrast, theIn contrast, high SSI the group high encompassed SSI group encompassed 14.6% of the14.6% monomers of the monomers and 30.0% ofand the 30.0% oligomers of the inoligomers the dataset, in the respectively. dataset, respectively. Thus, it becomes Thus, it clear becomes that the clear formationthat the of formation multimeric of proteinsmultimeric comes proteins at a higher comes risk at a of higher aggregation. risk of Then,aggregation why these. Then, proteins why thesedo notprotein aggregates doin not vivo aggregateunder physiological in vivo under conditions? physiological Mainly, because conditions? the sticky Mainly, interfaces because needed the sticky for theinterfaces assembly needed of the monomeric for the assembly subunits of into the a quaternarymonomeric structure subunits are into protected a quaternary in the native structure state are andprotected accordingly in the the native aggregation state and risk accordingly is only transitory. the aggregation This implies, risk however, is only thattransitory. cellular This conditions implies, orhowever, genetic changes that cellular that favour conditions multimers or genetic dissociation changes would that favour also favour multimers non-functional dissociation interactions, would also asfavour it occurs non in- thefunctional case of human interactio transthyretinns, as it occurs or superoxide in the case dismutase of human 1, whose transthyretin dissociation or superoxide leads to theirdismutase aggregation 1, whose and thedissociation onset of amyloidosis leads to their [62 aggregation,63]. and the onset of amyloidosis [62,63].

Figure 11. Relationship between structural supersaturation index and protein active form. (A) A comparison between cumulative distributions of SSI values for the subunits of proteins active as monomers (dashed black line) or as oligomers (solid red line). (B) A bar graph representing the percentage of monomers (black) or oligomers (red) from the dataset that display low (left panel) and high (right panel) structural supersaturation indexes (SSI). Cells 2019, 8, 856 15 of 20

Figure 11. Relationship between structural supersaturation index and protein active form. (A) A comparison between cumulative distributions of SSI values for the subunits of proteins active as Cells 2019monomers, 8, 856 (dashed black line) or as oligomers (solid red line). (B) A bar graph representing the 15 of 20 percentage of monomers (black) or oligomers (red) from the dataset that display low (left panel) and high (right panel) structural supersaturation indexes (SSI). 3.8. Protein Structural Aggregation Propensity and Bacterial Symmetric Complexes Self-Assembly 3.8. Protein Structural Aggregation Propensity and Bacterial Symmetric Complexes Self-Assembly A recent study has shown that, even when they are in the native and properly folded states, bacterialA symmetricrecent study protein has shown complexes that, even are atwhen risk they of aggregation are in the native [64]. The and authors properly introduced folded states point, mutationsbacterial atsymmetric the surface protein of 12 complexes distinct symmetricare at risk of complexes aggregation from [64]E.. The coli authors, resulting introduc in 73ed di pointfferent mutations at the surface of 12 distinct symmetric complexes from E. coli, resulting in 73 different variants. Some of these mutants resulted in the formation of aggregates when they were expressed variants. Some of these mutants resulted in the formation of aggregates when they were expressed intracellularly. Biophysical measurements and electron microscopy revealed that the aggregated intracellularly. Biophysical measurements and electron microscopy revealed that the aggregated mutants self-assembled in their folded states. Because these mutations were introduced at the surface mutants self-assembled in their folded states. Because these mutations were introduced at the surface of the symmetric protein complexes they could indeed impact their STAP. Therefore, we analysed of the symmetric protein complexes they could indeed impact their STAP. Therefore, we analysed whetherwhether A3D A3D predictions predictions were were able able to to detect detect anyany didifferencefference between the the wild wild type type (wt) (wt) proteins proteins and and thosethose mutants mutants that that formed formed self-assembled self-assembled entitiesentities intracellularly.intracellularly. As As it it is is shown shown in in Figure Figure 12 12, in, inall all cases,cases, the the average average STAP STAP of of protein protein complexes complexes formingforming aggregates was was higher higher than than the the ST STAPAP of ofthe the correspondentcorrespondent wt forms.wt forms. This This suggests suggests that thethat STAP the of ST proteinAP of complexes protein complexes is an important is an determinant important of theirdeterm assemblyinant of theirand accordingly assembly and that accordingly mutations that impacting mutations this impacting structural this property structural might property trigger foldedmight functional trigger folded complexes functional to self-assemble complexes to into self- higher-orderassemble into structureshigher-order in thestructures cell. in the cell.

FigureFigure 12. 12. STAP STAP predictions predictions for for bacterial bacterial self-assembledself-assembled homomers with with dihedral dihedral symmetry symmetry and and theirtheir point point mutants mutants that aggregate.that aggregate.(A) Bar (A graphs) Bar graphs for indicated for indicated supramolecular supramolecular complexes complexes comparing A3Dcomparing average scores A3D averagefor wt proteins scores and for wtheirt proteins mutants and forming their mutantseither fibres forming or focis. either Corresponding fibres or focis. PDB accession numbers for analysed proteins are indicated on top of the bar graphs. A3D analysis of representative supramolecular assemblies: (B) dihedral octamer (PDB: 1YAC; A3D average score = 0.567) and its aggregating mutant (D92L, E94L, K98L, K101L; A3D average score = 0.409) − − and (C) dihedral decamer (PDB: 2IV1; A3D average score = 0.690) and its foci-forming mutant (K24L, − K25L, D26L; A3D average score = 0.535). Top-bottom views of each protein are shown. Colour code − is as in Figure1. Cells 2019, 8, 856 16 of 20

4. Discussion It is now well established that, in addition to stability and function, the avoidance of aggregation is an important driver of the evolution of proteins. However, the large majority of studies that provide support to this hypothesis have measured aggregation along protein sequences [34,46,47]. Therefore, they essentially evaluate the aggregation potential of unfolded or at least partially unfolded states. These conformations are of course relevant for aggregation during or immediately after translation or in the case of eventual misfolding; nevertheless, the time that globular proteins expend in these transient states is relatively short in comparison to the one in which they remain in the native state. For a long time, it has been assumed that the attainment of a stable folded conformation in which aggregation-prone sequences are shielded from the solvent constitutes a permanent protection against aggregation. However, it is now clear that proteins can aggregate from their native states, being thus kinetically, but not thermodynamically, stable [23,24]. The present study suggests that, as for protein sequences, aggregation might influence the evolution of protein structures. We acknowledge that the structures we studied here represent only a fraction of the soluble E. coli proteome. However, we expect that the conclusions we delineate would remain valid when more structures are available, since the significance of the observed differences was almost independent of the proportion of proteins selected to build up the subsets in the different analysis and, indeed, those relative to protein abundance are totally consistent with the ones obtained previously analysing a dataset of 397 E. coli structures, from which only 172 had abundance data [2]. It is important to note, however, that our analyses assume minor structural differences to exist between the used static structures and the conformation the respective proteins might adopt in the in vivo environment. Thus, it will be worth re-evaluating them when computational means would allow to model these functional conformations. It has been suggested that, in the cell, globular proteins can remain soluble for long periods of time only because of the presence of high kinetic barriers that separate the native and the aggregated states [24]. Despite speculative, it is tempting to propose that the STAP of proteins has been shaped to keep these barriers high enough to allow them functioning at the concentrations required for an optimal cell fitness (Figure 13A,B). In this context, amino acid changes that, despite not affecting protein stability, would increase STAP, without providing a compensatory fitness benefit, would be purged out by natural selection, since they would decrease the energy barrier between folded and aggregated states and therefore potentially allow aggregation in a biologically relevant time (Figure 13C). The study of the constraints imposed by protein aggregation on natural protein structures according to their abundance, function or tertiary/quaternary organization may uncover novel protein design principles with different potential applications. In this context, our analysis explains why the in vitro concentration of therapeutic globular proteins, like antibodies, above their natural abundance levels results, in many cases, in slow, but irreversible aggregation, precluding their marketing. Designed mutations that decrease the aggregation propensity of their protein surface, without impacting their stability, are anticipated to allow them to remain soluble and functional in the same conditions [65]. Cells 2019, 8, 856 17 of 20 Cells 2019, 8, 856 17 of 20

FigureFigure 13. Schematic 13. Schematic representation representation of the of energy the energy diagrams diagrams for aggregation for aggregation processes. processes.The The ensemble of foldedensemble proteins of folded in solution proteins in is solution considered is considered as one as thermodynamic one thermodynamic system. system. The aggregated aggregated state representsstate inrepresents all cases in all the cases global the global minimum minimum of of Gibbs Gibbs energy. energy. However, However, to reach to reach this state, this proteins state, proteins have to overcome kinetic barriers that differ in their activation energies (Eact). (A) Proteins with low have to overcome kinetic barriers that differ in their activation energies (Eact). (A) Proteins with low STAP, such as highly abundant, short, essential and proteins active as monomers are protected from STAP, such as highly abundant, short, essential and proteins active as monomers are protected from aggregation by a high energy barrier. (B) Proteins with higher STAP, such as low abundant, long, aggregationnon-essential by a high and proteins energy active barrier. in oligomeric (B) Proteins forms withhave to higher cross a STAP, lower energy such as barrier low and abundant, thus, long, non-essentialthey can andpotentially proteins access active the aggregated in oligomeric state more forms frequently. have to (C cross) Mutations a lower that energy increase barrier the STAP and thus, they canwithout potentially conferring access additional the aggregated functional advantages state more would frequently. be purged (C) out Mutations by natural that selection, increase since the STAP withoutthey conferring decrease the additional energy barrier functional for aggregation advantages from the would initially be purgedfolded state. out by natural selection, since they decrease the energy barrier for aggregation from the initially folded state. Supplementary Materials: The following are available online at www.mdpi.com/2073-4409/8/8/856/s1, Table S1: List of low-aggregation and high-aggregation propensity operons, Table S2: List of the analysed structures together with the information about protein length, abundance, the presence of transmembrane segments, active Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4409/8/8/856/s1, form and essentiality. Table S1: List of low-aggregation and high-aggregation propensity operons, Table S2: List of the analysed structures togetherAuthor with theContributions: information Conceptualization, about protein length, S.V.; Methodology, abundance, S. theV., A.C. presence and F.P. of; transmembraneFormal Analysis, A.C. segments, F.P. active form andand essentiality. V.I.; Writing-Original Draft Preparation, S.V and A.C.; Writing-Review & Editing, S.V. and F.P.; Supervision, S.V.; Funding Acquisition, S.V. Author Contributions: Conceptualization, S.V.; Methodology, S.V., A.C. and F.P.; Formal Analysis, A.C., F.P. and V.I.; Writing-OriginalFunding: Please Draft add: This Preparation, research was S.V. funded and A.C.; by MINECO Writing-Review grant number & Editing, BIO2016 S.V.-78310 and-R and F.P.; by Supervision, ICREA S.V.; Fundinggrant Acquisition, number ICREA S.V. -Academia-2016. Funding:ConflictsPlease of add: Interest: This The research authors was declare funded no conflict by MINECO of interest grant. The founding number sponsors BIO2016-78310-R had no role andin the by design ICREA grant numberof ICREA-Academia-2016. the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the Conflictsdecision of Interest: to publishThe the authors results. declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

References

1. Deeds, E.J.; Ashenberg, O.; Gerardin, J.; Shakhnovich, E.I. Robust protein protein interactions in crowded cellular environments. Proc. Natl. Acad. Sci. USA 2007, 104, 14952–14957. [CrossRef][PubMed] Cells 2019, 8, 856 18 of 20

2. Levy, E.D.; De, S.; Teichmann, S.A. Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc. Natl. Acad. Sci. USA 2012, 109, 20461–20466. [CrossRef][PubMed] 3. Hartl, F.U.; Hayer-Hartl, M. Converging concepts of protein folding in vitro and in vivo. Nat. Struct. Mol. Biol. 2009, 16, 574–581. [CrossRef][PubMed] 4. Chiti, F.; Dobson, C.M. Protein Misfolding, Amyloid Formation, and Human Disease: A Summary of Progress Over the Last Decade. Annu. Rev. Biochem. 2017, 86, 27–68. [CrossRef][PubMed] 5. Chapman, E.; Farr, G.W.; Usaite, R.; Furtak, K.; Fenton, W.A.; Chaudhuri, T.K.; Hondorp, E.R.; Matthews, R.G.; Wolf, S.G.; Yates, J.R.; et al. Global aggregation of newly translated proteins in an Escherichia coli strain deficient of the chaperonin GroEL. Proc. Natl. Acad. Sci. USA 2006, 103, 15800–15805. [CrossRef][PubMed] 6. O’Connell, J.D.; Tsechansky, M.; Royall, A.; Boutz, D.R.; Ellington, A.D.; Marcotte, E.M. A proteomic survey of widespread protein aggregation in yeast. Mol. Biosyst. 2014, 10, 851–861. [CrossRef][PubMed] 7. David, D.C.; Ollikainen, N.; Trinidad, J.C.; Cary, M.P.; Burlingame, A.L.; Kenyon, C. Widespread protein aggregation as an inherent part of aging in C. elegans. PLoS Biol. 2010, 8, 47–48. [CrossRef][PubMed] 8. Chiti, F.; Stefani, M.; Taddei, N.; Ramponi, G.; Dobson, C.M. Rationalization of the effects of mutations on peptide andprotein aggregation rates. Nature 2003, 424, 805–808. [CrossRef] 9. Ventura, S. Sequence determinants of protein aggregation: Tools to increase protein solubility. Microb. Cell Fact. 2005, 4, 11. [CrossRef] 10. Reumers, J.; Maurer-Stroh, S.; Schymkowitz, J.; Rousseau, F. Protein sequences encode safeguards against aggregation. Hum. Mutat. 2009, 30, 431–437. [CrossRef] 11. Trainor, K.; Broom, A.; Meiering, E.M. Exploring the relationships between protein sequence, structure and solubility. Curr. Opin. Struct. Biol. 2017, 42, 136–146. [CrossRef][PubMed] 12. Tartaglia, G.G.; Pechmann, S.; Dobson, C.M.; Vendruscolo, M. Life on the edge: A link between gene expression levels and aggregation rates of human proteins. Trends Biochem. Sci. 2007, 32, 204–206. [CrossRef] [PubMed] 13. Gsponer, J.; Babu, M.M. Cellular Strategies for Regulating Functional and Nonfunctional Protein Aggregation. Cell Rep. 2012, 2, 1425–1437. [CrossRef][PubMed] 14. Rousseau, F.; Serrano, L.; Schymkowitz, J.W.H. How evolutionary pressure against protein aggregation shaped chaperone specificity. J. Mol. Biol. 2006, 355, 1037–1047. [CrossRef][PubMed] 15. Espargaró, A.; Castillo, V.; de Groot, N.S.; Ventura, S. The in vivo and in vitro aggregation properties of globular proteins correlate with their conformational stability: The SH3 case. J. Mol. Biol. 2008, 378, 1116–1131. [CrossRef] 16. Drummond, D.A.; Wilke, C.O. Mistranslation-Induced Protein Misfolding as a Dominant Constraint on Coding-Sequence Evolution. Cell 2008, 134, 341–352. [CrossRef] 17. Drummond, D.A.; Wilke, C.O. The evolutionary consequences of erroneous protein synthesis. Nat. Rev. Genet. 2009, 10, 715–724. [CrossRef] 18. Falsone, A.; Falsone, S.F. Legal but lethal: Functional protein aggregation at the verge of toxicity. Front. Cell. Neurosci. 2015, 9, 1–16. [CrossRef] 19. Bruce, J.B.; West, S.A.; Griffin, A.S. Functional amyloids promote retention of public goods in bacteria. Proc. R. Soc. B 2019, 286.[CrossRef] 20. Marcon, G.; Plakoutsi, G.; Canale, C.; Relini, A.; Taddei, N.; Dobson, C.M.; Ramponi, G.; Chiti, F. Amyloid formation from HypF-N under conditions in which the protein is initially in its native state. J. Mol. Biol. 2005, 347, 323–335. [CrossRef] 21. Garcia-Pardo, J.; Graña-Montes, R.; Fernandez-Mendez, M.; Ruyra, A.; Roher, N.; Avilés, F.X.; Lorenzo, J.; Ventura, S. Amyloid formation by human carboxypeptidase D transthyretin-like domain under physiological conditions. J. Biol. Chem. 2014, 289, 33783–33796. [CrossRef][PubMed] 22. McGuffee, S.R.; Elcock, A.H. Diffusion, crowding & protein stability in a dynamic molecular model of the bacterial cytoplasm. PLoS Comput. Biol. 2010, 6, e1000694. 23. Gazit, E. The “correctly folded” state of proteins: Is it a metastable state? Angew. Chem. Int. Ed. 2002, 41, 257–259. [CrossRef] 24. Ciryam, P.; Kundra, R.; Morimoto, R.I.; Dobson, C.M.; Vendruscolo, M. Supersaturation is a major driving force for protein aggregation in neurodegenerative diseases. Trends Pharmacol. Sci. 2015, 36, 72–77. [CrossRef] [PubMed] Cells 2019, 8, 856 19 of 20

25. Buell, A.K.; Dhulesia, A.; White, D.A.; Knowles, T.P.J.; Dobson, C.M.; Welland, M.E. Detailed Analysis of the Energy Barriers for Amyloid Fibril Growth. Angew. Chem. Int. Ed. 2012, 51, 5247–5251. [CrossRef][PubMed] 26. Zambrano, R.; Jamroz, M.; Szczasiuk, A.; Pujols, J.; Kmiecik, S.; Ventura, S. AGGRESCAN3D (A3D): Server for prediction of aggregation properties of protein structures. Nucleic Acids Res. 2015, 43, W306–W313. [CrossRef][PubMed] 27. Pujols, J.; Peña-Díaz, S.; Ventura, S. AGGRESCAN3D: Toward the Prediction of the Aggregation Propensities of Protein Structures. In Computational Drug Discovery and Design; Gore, M., Jagtap, U.B., Eds.; Springer: New York, NY, USA, 2018; pp. 427–443. ISBN 978-1-4939-7756-7. 28. Kuriata, A.; Iglesias, V.; Pujols, J.; Kurcinski, M.; Kmiecik, S.; Ventura, S. Aggrescan3D (A3D) 2.0: Prediction and engineering of protein solubility. Nucleic Acids Res. 2019, 47, 300–307. [CrossRef] 29. Kuriata, A.; Iglesias, V.; Kurcinski, M.; Ventura, S.; Kmiecik, S. Structural bioinformatics Aggrescan3D standalone package for structure-based prediction of protein aggregation properties. Bioinformatics 2019, btz143. 30. Conchillo-Solé, O.; de Groot, N.S.; Avilés, F.X.; Vendrell, J.; Daura, X.; Ventura, S. AGGRESCAN: A server for the prediction and evaluation of “hot spots” of aggregation in polypeptides. BMC Bioinform. 2007, 8, 65. [CrossRef] 31. de Groot, N.S.; Pallarés, I.; Avilés, F.X.; Vendrell, J.; Ventura, S. Prediction of “hot spots” of aggregation in disease-linked polypeptides. BMC Struct. Biol. 2005, 5, 18. [CrossRef] 32. Belli, M.; Ramazzotti, M.; Chiti, F. Prediction of amyloid aggregation in vivo. EMBO Rep. 2011, 12, 657–663. [CrossRef][PubMed] 33. Ishihama, Y.; Schmidt, T.; Rappsilber, J.; Mann, M.; Hartl, F.U.; Kerner, M.J.; Frishman, D. Protein abundance profiling of the Escherichia coli cytosol. BMC Genom. 2008, 9, 102. [CrossRef][PubMed] 34. De Groot, N.S.; Ventura, S. Protein aggregation profile of the bacterial cytosol. PLoS ONE 2010, 5, e9383. [CrossRef][PubMed] 35. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.; Bourne, P. The . Nucleic Acids Res. 2000, 28, 235–242. [CrossRef][PubMed] 36. Wang, M.; Weiss, M.; Simonovic, M.; Haertinger, G.; Schrimpf, S.P.; Hengartner, M.O.; von Mering, C. PaxDb, a Database of Protein Abundance Averages Across All Three Domains of Life. Mol. Cell. Proteom. 2012, 11, 492–500. [CrossRef] 37. Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.C.; Estreicher, A.; Gasteiger, E.; Martin, M.J.; Michoud, K.; O’Donovan, C.; Phan, I.; et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31, 365–370. [CrossRef][PubMed] 38. Dennis, G.; Sherman, B.T.; Hosack, D.A.; Yang, J.; Gao, W.; Lane, H.; Lempicki, R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4, R60. [CrossRef] 39. Gerdes, S.; Scholle, M.D.; Campbell, J.W.; Balázsi, G.; Ravasz, E.; Daugherty, M.D.; Somera, A.L.; Kyrpides, N.C.; Anderson, I.; Gelfand, M.S.; et al. Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J. Bacteriol. 2003, 185, 5673–5684. [CrossRef] 40. Baba, T.; Ara, T.; Hasegawa, M.; Takai, Y.; Okumura, Y.; Baba, M.; Datsenko, K.A.; Tomita, M.; Wanner, B.L.; Mori, H. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: The Keio collection. Mol. Syst. Biol. 2006, 2.[CrossRef] 41. Zhou, J.; Rudd, K.E. EcoGene 3.0. Nucleic Acids Res. 2013, 41, D613–D624. [CrossRef] 42. Huerta, A.M.; Salgado, H.; Thieffry, D.; Collado-Vides, J. RegulonDB: A database on transcriptional regulation in Escherichia coli. Nucleic Acids Res. 1998, 26, 55–59. [CrossRef][PubMed] 43. Schymkowitz, J.; Borg, J.; Stricher, F.; Nys, R.; Rousseau, F.; Serrano, L. The FoldX web server: An online force field. Nucleic Acids Res. 2005, 33, W382–W388. [CrossRef][PubMed] 44. Meisl, G.; Kirkegaard, J.B.; Arosio, P.; Michaels, T.C.T.; Vendruscolo, M.; Dobson, C.M.; Linse, S.; Knowles, T.P.J. Molecular mechanisms of protein aggregation from global fitting of kinetic models. Nat. Protoc. 2016, 11, 252–272. [CrossRef][PubMed] 45. Selinger, D.W.; Cheung, K.J.; Mei, R.; Johansson, E.M.; Richmond, C.S.; Blattner, F.R.; Lockhart, D.J.; Church, G.M. RNA expression analysis using a 30 resolution Escherichia coli genome array. Nat. Biotechnol. 2000, 18, 1262–1268. [CrossRef][PubMed] 46. Monsellier, E.; Ramazzotti, M.; Taddei, N.; Chiti, F. Aggregation propensity of the human proteome. PLoS Comput. Biol. 2008, 4, e1000199. [CrossRef][PubMed] Cells 2019, 8, 856 20 of 20

47. Tartaglia, G.G.; Vendruscolo, M. Correlation between mRNA expression levels and protein aggregation propensities in subcellular localisations. Mol. Biosyst. 2009, 5, 1873–1876. [CrossRef] 48. Olzscha, H.; Schermann, S.M.; Woerner, A.C.; Pinkert, S.; Hecht, M.H.; Tartaglia, G.G.; Vendruscolo, M.; Hayer-Hartl, M.; Hartl, F.U.; Vabulas, R.M. Amyloid-like aggregates sequester numerous metastable proteins with essential cellular functions. Cell 2011, 144, 67–78. [CrossRef] 49. Albu, R.F.; Chan, G.T.; Zhu, M.; Wong, E.T.C.; Taghizadeh, F.; Hu, X.; Mehran, A.E.; Johnson, J.D.; Gsponer, J.; Mayor, T. A feature analysis of lower solubility proteins in three eukaryotic systems. J. Proteom. 2015, 118, 21–38. [CrossRef] 50. Chen, Y.; Dokholyan, N. V Natural selection against protein aggregation on self-interacting and essential proteins in yeast, fly, and worm. Mol. Biol. Evolut. 2008, 25, 1530–1533. [CrossRef][PubMed] 51. Linding, R.; Schymkowitz, J.; Rousseau, F.; Diella, F.; Serrano, L. A comparative study of the relationship between protein structure and β-aggregation in globular and intrinsically disordered proteins. J. Mol. Biol. 2004, 342, 345–353. [CrossRef][PubMed] 52. Dougan, D.A.; Mogk, A.; Bukau, B. Protein folding and degradation in bacteria: To degrade or not to degrade? That is the question. Cell. Mol. Life Sci. 2002, 59, 1607–1616. [CrossRef][PubMed] 53. Santoni, V.; Molloy, M.; Rabilloud, T. Membrane proteins and proteomics: Un amour impossible? Electrophoresis 2000, 21, 1054–1070. [CrossRef] 54. Cowan, S.W.; Schirmer, T.; Rummel, G.; Steiert, M.; Ghosh, R.; Pauptit, R.A.; Jansonius, J.N.; Rosenbusch, J.P. Crystal structures explain functional properties of two E. coli porins. Nature 1992, 358, 727–733. [CrossRef] [PubMed] 55. Lawrence, M.S.; Phillips, K.J.; Liu, D.R. Supercharging proteins can impart unusual resilience. J. Am. Chem. Soc. 2007, 129, 10110–10112. [CrossRef][PubMed] 56. Miklos, A.E.; Kluwe, C.; Der, B.S.; Pai, S.; Sircar, A.; Hughes, R.A.; Berrondo, M.; Xu, J.; Codrea, V.; Buckley, P.E.; et al. Structure-based design of supercharged, highly thermoresistant antibodies. Chem. Biol. 2012, 19, 449–455. [CrossRef][PubMed] 57. Jeppesen, M.D.; Westh, P.; Otzen, D.E. The role of protonation in protein fibrillation. FEBS Lett. 2010, 584, 780–784. [CrossRef][PubMed] 58. Morshedi, D.; Ebrahim-Habibi, A.; Moosavi-Movahedi, A.A.; Nemat-Gorgani, M. Chemical modification of lysine residues in lysozyme may dramatically influence its amyloid fibrillation. Biochim. Biophys. Acta-Proteins Proteom. 2010, 1804, 714–722. [CrossRef][PubMed] 59. Mohamed, R.; Degac, J.; Helms, V. Composition of overlapping protein-protein and protein-ligand interfaces. PLoS ONE 2015, 10, e0140965. [CrossRef][PubMed] 60. Villar-Piqué, A.; Ventura, S. Modeling amyloids in bacteria. Microb. Cell Fact. 2012, 11, 166. [CrossRef] [PubMed] 61. Ciryam, P.; Tartaglia, G.G.; Morimoto, R.I.; Dobson, C.M.; Vendruscolo, M. Widespread Aggregation and Neurodegenerative Diseases Are Associated with Supersaturated Proteins. Cell Rep. 2013, 5, 781–790. [CrossRef][PubMed] 62. Hörnberg, A.; Eneqvist, T.; Olofsson, A.; Lundgren, E.; Sauer-Eriksson, A.E. A comparative analysis of 23 structures of the amyloidogenic protein transthyretin. J. Mol. Biol. 2000, 302, 649–669. [CrossRef][PubMed] 63. Chattopadhyay, M.; Durazo, A.; Sohn, S.H.; Strong, C.D.; Gralla, E.B.; Whitelegge, J.P.; Valentine, J.S. Initiation and elongation in fibrillation of ALS-linked superoxide dismutase. Proc. Natl. Acad. Sci. USA 2008, 105, 18663–18668. [CrossRef][PubMed] 64. Garcia-Seisdedos, H.; Empereur-Mot, C.; Elad, N.; Levy, E.D. Proteins evolve on the edge of supramolecular self-assembly. Nature 2017, 548, 244–247. [CrossRef][PubMed] 65. Gil-Garcia, M.; Bañó-Polo, M.; Varejão, N.; Jamroz, M.; Kuriata, A.; Díaz-Caballero, M.; Lascorz, J.; Morel, B.; Navarro, S.; Reverter, D.; et al. Combining Structural Aggregation Propensity and Stability Predictions To Redesign Protein Solubility. Mol. Pharm. 2018, 15, 3846–3859. [CrossRef][PubMed]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).