<<

bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Insights on protein thermal stability: a graph representation of molecular interactions

Mattia Miotto1,2,3, Pier Paolo Olimpieri1, Lorenzo Di Rienzo1, Francesco Ambrosetti1,4, Pietro Corsi5, Rosalba Lepore6,7, Gian Gaetano Tartaglia8,*, and Edoardo Milanetti1,2

1Department of Physics, Sapienza University of Rome, Rome, Italy 2Center for Life Nanoscience, Istituto Italiano di Tecnologia, Rome, Italy 3Soft and Living Matter Lab, Institute of Nanotechnology (CNR-NANOTEC), Rome, Italy 4Bijvoet Center for Biomolecular Research, Faculty of Science - Chemistry, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands 5Department of Science, University Roma Tre, Rome, Italy 6Biozentrum, University of Basel, Klingelbergstrasse 50–70, CH-4056 Basel, Switzerland 7SIB Swiss Institute of Bioinformatics, Biozentrum, University of Basel, Klingelbergstrasse 50–70, CH-4056 Basel, Switzerland 8Centre for Genomic Regulation (CRG) and Institucio´ Catalana de Recerca i Estudis Avanc¸ats (ICREA) *[email protected]

ABSTRACT

Understanding the molecular mechanisms of thermal stability is a challenge in protein biology. Indeed, knowing the temperature at which proteins are stable has important theoretical implications, which are intimately linked with properties of the native fold, and a wide range of potential applications from drug design to the optimization of activity. Here, we present a novel graph-theoretical framework to assess thermal stability based on the structure without any a priori information. In our approach we describe proteins as energy-weighted graphs and compare them using ensembles of interaction networks. Investigating the position of specific interactions within the 3D native structure, we developed a parameter-free network descriptor that permits to distinguish thermostable and mesostable proteins with an accuracy of 76% and Area Under the Roc Curve of 78%.

Introduction Temperature is one of most crucial factors organisms have to deal with in adapting to extreme environments1 and plays a key role in many complex physiological mechanisms2. Indeed a fundamental requirement to ensure life at high temperatures is that the organisms maintain functional and correctly folded proteins2–4. Accordingly, evolution shapes energetic and structural placement of each residue-residue interaction for the whole protein to withstand thermal stress. Studying thermostability is fundamental for several reasons ranging from theoretical to applicative aspects5, such as gaining insight on the physical and chemical principles governing protein folding6–8, and improving the thermal resistance of to speed up chemical reactions in biopharmaceutical and biotechnological processes9, 10. Despite the strong interest in thermostability11–13, its prediction remains an open problem. A common descriptor used to quantify the thermal stability of proteins is the melting temperature (Tm), defined as the temperature at which the concentration of the protein in its folded state equals the concentration in the unfolded state. To date, computational approaches, both sequence- and structure-based, have exploited statistical analysis8, 14, 15, molecular dynamics16, 17 and machine learning18, 19 to predict the melting temperature. Most of the studies are based on comparative analyses between pairs of homologues belonging to organisms of different thermophilicity20, 21. Predicting the stability of a protein ab initio using a structure based-approach has never been achieved so far. Lack of success in this area is mostly due to limitations in our knowledge about the relationship between thermal resistance and role of the interactions that stabilize a protein structure22. Some differences in terms of amino acid composition or spatial arrangement of residues have been reported8, 23–25. One of most notable differences involves the salt bridges: hyperthermostable proteins have stronger electrostatic interactions than their mesostable counterparts26. Recently Folch et al.22, 27 reported that distinct salt bridges may be differently affected by the temperature and this might influence the geometry of these interactions as well as the compactness of the protein. Core packing seems related to thermal resistance at least to some extent28. Yet, a lower number of cavities and a higher average relative contact order (i.e. a measure of non-adjacent amino acid proximity within a bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

folded protein) have been also observed while comparing thermostable proteins with their mesostable paralogs and orthologs6 . Noteworthy, the hydrophobic effect and residue hydrophobicity seem to play a rather marginal role on protein stabilization29–31, while they are considered the main forces driving . Here, we present a new analysis based on the graph theory that allows us to reveal important characteristics of the energetic reorganization of intramolecular contacts between mesostable and thermostable proteins. In light of our results and to promote their application, we have designed a new computational method able to classify each protein as thermostable or as mesostable without using other information except for the three-dimensional structure.

Results Uncovering the differences in energetic organization Aiming at the comprehension of the basic mechanisms that allow proteins to remain functional at high temperature, we focused on the non-bonded interactions between residues that play a stabilizing role in structural organization32. In particular, we investigated how different thermal properties are influenced by the energy distribution at different layers of structural organization. We analyzed the interactions occurring in proteins of the Twhole dataset, containing the union of the Tm dataset 33 (proteins with known melting temperature taken from ProTherm database ) and the Thyper dataset (proteins belonging to o 34 hyperthermophilic organisms, with Tenv ≥ 90 C collected from Protein Data Bank ), for a total number of 84 proteins (see Methods). To describe the role of single residues in the complex connectivity of whole protein, we adopted a graph theory approach describing each protein by the Residue Interaction Network (RIN): each residue is represented as a node and links between residues are weighed with non-bonded energies (as described in Methods). At first, we investigated the relationship between thermostability and energy distribution of intramolecular interactions. To this end, the Tm dataset was divided into eight groups according to protein Tm and for each group the energy distribution was evaluated, as shown in Figure 1a. The general shape of the density functions is almost identical between the eight cases, independently from the thermal properties of the macromolecules, and this is clearly due to the general folding energetic requirements. A strong dependence between thermal stability and the percentage of strong interactions is evident looking at the disposition of the density curves (Figure 1a): the higher the thermal stability the higher the probability of finding strong interactions. Yet, less thermostable proteins possess a larger number of weak interactions. In particular, as shown in Figures 1a-c, it is possible to identify three ranges of energies that correspond to three peaks of probability density, i.e. a very strong favorable energy region (E < −70 kCal/mol), a strong favorable energy region between −70 kCal/mol and −13 kCal/mol, and a strong unfavorable interaction region (E > 11 kCal/mol). More formally, for a protein the probability of having an interaction with energy E, P(E), in the three ranges linearly depends on the protein melting temperature with correlation coefficients of 0.90, 0.85, 0.87, respectively (Figure 1b). In order to have strong-signal sets, we reduced the division in just two groups, classifying proteins as mesostable or thermostable if their melting temperatures are, respectively, lower or higher than 70oC, which is the optimal reaction temperature of thermophilic enzymes35–40. In this way the energy distributions in Figure 1a are calculated only for the mesostable and thermostable distributions in Figure 1c (Twhole dataset). The two-group division allows us to include the hyperthermophilic proteins in our analysis, since their Tm is higher than the threshold. The two resulting distributions, found to be significantly different with a p-value of 4.2×10−46 (nonparametric test of Kolmogorov-Smirnov41), have an expected value at −0.5 kCal/mol and negative interactions have a probability of more than 60% to be found. Regions below −13 kCal/mol and above 11 kCal/mol represent the 6.6% and 5.2% of the total energy for thermostable and mesostable proteins respectively. Typically, such energies require the presence of at least one polar or charged amino acid and in particular Arg, Asp, Glu and Lys are involved in more than 90% of the interactions. Noteworthy, the small fraction of energies centered near -120 kCal/mol (see Figure 1a) are due to polar or charged amino acid interactions taking place at short distance. Next, we investigated the residue Strength, defined for each node as the sum of the weights of its links (see Methods). The two Strength distributions for mesostable and thermostable proteins are shown in Figure 1d. Even in this case, they are different according to Kolmogorov-Smirnov test with a p-value of 1.9 × 10−9. For the first time, our analysis provides both a general intuition on the protein folding and a specific insight on thermal stability. Even if strong positive and strong negative peaks have a comparable height (Figure 1c), the rearrangement of protein side chains masks the positive interactions, substantially preventing the condensation of unfavorable interactions in a single residue, as testified by the small probability of finding a residue with a positive Strength. Indeed, for the whole dataset, there is more than 97% of probability of finding a residue with negative Strength. The most frequent value is found at -27 kCal/mol, with a change in the slope of the density functions around -70 kCal/mol and 5 kCal/mol, corresponding to the regions with negative and positive Strengths. At the Strength level of organization, a difference between thermostable and mesostable proteins is found. Indeed, residues belonging to the group of thermostable proteins show a higher probability of having high negative Strength values with respect to the mesostable ones, testifying an overall higher compactness of thermostable protein

2/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 1 (color online) a) Probability density distributions of total interaction energies for the eight subsets defined in the Tm dataset. Each distribution is built using a group of proteins whose melting temperatures lie in the same range. The eight ranges, from lower to higher Tm, are represented by colors from dark blue to dark red. The density functions exhibit a dependence with the melting temperatures ranges and peak heights increase with the temperatures. Inset shows the energy distribution in log-scale obtained using all proteins. b) Correlation between the area of each density peak and the average Tm for the eight dataset groups. c) Probability density distributions in log-scale of total interaction energies for mesostable (blue) and thermostable (red) proteins belonging to the Twhole dataset. Inset shows the energy distribution in log-scale obtained using all proteins. d) Probability density distributions in log-scale of Strength network parameter for mesostable (blue) and thermostable (red) proteins belonging to the Twhole dataset. Inset shows the Strength distribution in log-scale obtained using all proteins. e) Schematic representation of the strong favorable and unfavorable interactions both for a mesostable (left) and a thermostable network (right).

3/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(a) T ranges (b) 1 2 m3 4 5 6 7 8

1010 Group 50 55 60 65 70 80 105

55

Group 1 2 3 4 0 5 0 6 7 8 Energy difference (%) Energy difference

-5−5 erence (%) Di ff erence Energy

-10 strand − −10 strand

helix−helix helix−loop loop−loop helix−strand loop−strand strand−strand

helix-helix helix-loop loop-loop helix-strand loop-strand strand-strand Figure 2 a) (color online) For each class of interaction, we report the difference in percentage between actual energy and the expected value of the specific group assuming an uniform distribution of energies. b) Cartoon representation showing a region of a thermostable (top) and mesostable (bottom) protein. Strong interactions between charged-charged amino acid belonging to loop-helix secondary structure are shown in yellow sticks. Loop secondary structures are indicated in red for the thermostable

protein and in blue in the mesostable one. All the interactions are represented in yellow.

strand − fold. In particular, the probability of finding a node having a Strength below -70 kCal/mol is 19.9% and 16.1% for thermostable loop and mesostable respectively. Figure 1e shows a schematic representation of the organization of strong energies both for mesostable proteins and thermostable proteins. In fact, the most important finding of our analyses is that thermostable proteins have more favorable energies concentrated in a few specific residues. In contrast, mesostable proteins tend to have a less organized negative residue-residue interactions network. Given this different way to rearrange amino acidic side chains between proteins with different thermal properties, we mapped the energetic distribution on the protein secondary structures in order to study how energetic allocation is reflected on a higher level of organization. 42 To do so, we retrieved the secondary structures (helices, strands, loops) for all the proteins of the Tm dataset using DSSP and assigned each residue-residue interaction occurring in a protein to six possible classes (helix-helix, helix-strand, helix-loop, strand-strand, strand-loop and loop-loop), according to the secondary structure residues belong to (see Methods).

The goal of the analysis is to determine whether there is a secondary structure class containing more energy than one would

loop − expect by chance. To this end, we estimated the difference in energy of a specific class with respect to the average energy (i.e. loop energy that the same class would had if uniformly assigned). In the group of mesostable proteins, pairing between residues of the same structure (helix-helix, strand-strand and loop-loop) is associated with higher-than-average energies, while mixed combinations (helix-strand, loop-strand and loop-helix) have lower energies. As shown in Figure 2a, a surplus is committed to helix-helix interactions, while helix-strand has a negative difference of about 8.5%. When the thermal resistance is taken into account the most significant distinction between the mesostable and thermostable groups is in the helix-loop pairing. Thermostable proteins have a larger shift than the mesostable counterparts. These results suggest a stabilizing role of helix-loop interactions and we argue that thermostable proteins preferentially gather their energy to this specific class (Figure 2b).

Assessing protein thermal stability In the light of our findings on the difference on the energy organization between mesostable and thermostable proteins,we looked for a way to assess the thermal resistance of a protein given its structure. The simplest way to quantify the impact of energy distribution on the thermal resistance is the comparison with a protein of same structure but different energy organization,

i.e. a homologue (and indeed this has been widely done43). Ideally, differences between two homologous proteins with

strand − different thermal stability are attributable only to their different thermal resistance. The more pronounced reorganization of helix the interactions in thermostable proteins confirms that they undergo an evolutionary optimization process which introduces fold-independent correlations in the spatial distribution of the interactions. By contrast, mesostable proteins have not these correlations, thus with respect to thermal stability, their energy organization can be considered more random.

4/16

loop − helix

helix − helix 5 0 10 − 5

− 10 Energy difference (%) difference Energy mesostable peaks, both in the range of positive energies and in negative one. Anyway the 3D rearrangement of protein side chains masks the positive interactions, substantially preventing the condensation of unfavorable interactions in a single residue, as evident because of the absence of a peak in the positive strength region ( Fig 1B). On the other hand residues belonging tothermostable proteins show an higher probability to have strength in the negative regions, testifying an overall more compact fold of thermostable proteins. Given this statistically different way to reorganize their amino acidic side chains between proteins with different thermal properties, we then map the energetic distribution on the protein secondary structures in order to study how energetic allocation is reflected on a higher level of organization. To do so, we retrieved the secondary structures (helices, 33 mesostablestrands, loops) peaks, both for all in the the range proteins of positive of the energiesTm dataset and in using negative DSSP one. Anywayabd assigned the 3D rearrangement each interaction of protein occurring side in a protein to six mesostablechainspossible masks peaks, classes the both positive in (helix-helix, the rangeinteractions, of positive helix-strand, substantially energies and preventing helix-loop, in negative the one. condensation strand-strand, Anyway the of 3D unfavorable strand-loop rearrangement interactions of and protein loop-loop), in side a single residue, according to the secondary chainsas evident masks the because positive of interactions, the absence substantially of a peak preventingin the positive the condensation strength region of unfavorable ( Fig 1B). interactions On the other in a single hand residue, residues belonging mesostablestructure peaks, a residue both in is the allocated range of positive to (see energies Methods). and in negative one. Anyway the 3D rearrangement of protein side astothermostable evident because proteins of the absence show anof a higher peak in probability the positive to strength have strength region in( Fig the1 negativeB). On the regions, other hand testifying residues an belonging overall more compact tothermostablechains masks proteins the positive show interactions, an higher probability substantially to have preventing strength thein the condensation negative regions, of unfavorable testifyinginteractions an overall more in a compact single residue, fold ofThe thermostable goal of the proteins. analysis Given is this to statistically estimate thedifferent difference way to reorganize in energy their of amino a specific acidic class side chains with between respect proteins to the energy that the same foldasclass of evident thermostable would because have proteins. of the proportionally Givenabsence this of statistically a peak to in the thedifferent number positive way strength ofto reorganize links. region Through their ( Fig amino1B). this acidic On approach, the side other chains hand between a positive residues proteins belonging difference means that the class has withwithtothermostable different different thermal thermal proteins properties,properties, show an we higher then we map thenprobability the map energetic the to have energetic distribution strength distribution in on the the negative protein on the secondaryregions, protein testifying secondary structures an in overallstructures order tomore study in compact order to study howhowfoldmore energetic of energetic thermostable energy allocation allocation than proteins. is reflected expected is reflected Given on a thisif higher on randomly statistically a higher level of level organization.different assigned. of organization. way AsTo to reorganizedo shown so, To we do retrieved in their so, Figure aminowe the retrieved secondaryacidic2B, a side the surplus structures secondarychains between of (helices, about structures proteins 7% (helices, is committed to helix-helix strands, loops) for all the proteins of the T dataset using33 DSSP33 abd assigned each interaction occurring in a protein to six strands,withinteractions, different loops) for thermal all while the proteinsproperties, helix-strand of the weT thenm dataset has mapm a the using negative energetic DSSP difference distributionabd assigned on around each the protein interaction the secondary 8.5%. occurring When structures in a the protein in thermal order to six to study resistance is taken into account possiblepossiblehow energetic classes classes (helix-helix, allocation (helix-helix, is helix-strand, reflected helix-strand, on helix-loop, a higher helix-loop, level strand-strand, of organization. strand-strand, strand-loop To dostrand-loop so, and we loop-loop), retrieved and loop-loop), accordingthe secondary to according the structures secondary to (helices, the secondary the most significant distinctions between the two33 groups are helix-loop and loop-loop. In the first case indeed, thermostable structurestructurestrands, a residue loops) a residue for is allocated all is the allocated proteins to (see to Methods). of (see the Methods).Tm dataset using DSSP abd assigned each interaction occurring in a protein to six possibleproteinsTheThe goal goal classes of have the of analysis(helix-helix, the 20 analysis times is to estimatehelix-strand, is bigger to estimate the shift difference helix-loop, the than difference in the strand-strand,energy mesostables, in of energy a specific strand-loop of a class whilespecific with and these class respect loop-loop), with ones to the respect have according energy to twice that the to theenergy the the same secondary shift that the for same loop-loop of the thermal classclassstructureresistant would would ahave residue proteins. have proportionally proportionally is allocated These to to the results (see to number the Methods). number suggest of links. of Through links.a stabilizing Through this approach, this role approach, afor positive helix-loop a difference positive interactions, meansdifference that the means class considering that has the class that has thermostable proteins moremorepreferentially energyThe energy goal than of than expected the gather expected analysis if randomly their is if to randomly estimate energy assigned. theassigned. to differencethis As shown specific As in inshown Figure energy class, in2 of FigureB, whilea a specific surplus2B, loop-loop a ofclass surplus about with 7% of respectinteractions is about committed to 7% the is energy tocommitted are helix-helix more that the energeticto same helix-helix in less resistant proteins interactions,interactions,class(Figure would while2). have while helix-strand proportionally helix-strand has a to has negative the a number negative difference of difference links. around Through the around 8.5%. this the approach, When 8.5%. the When athermal positive the resistance thermaldifference is resistance taken means into that is account thetaken class into has account thethemore most most energy significant significant than distinctions expected distinctions if between randomly between the assigned. two the groups two As groups are shown helix-loop are in Figure helix-loop and2 loop-loop.B, a and surplus loop-loop. In of the about first In case 7% the is indeed, first committed case thermostable indeed, to helix-helix thermostable proteinsproteinsinteractions, have have 20 while times 20 times helix-strand bigger bigger shift has thanshift a the negative than mesostables, the difference mesostables, while around these while the ones these 8.5%. have ones When twice have the the thermal twiceshift for the resistance loop-loop shift for is of loop-loop taken the thermal into of account the thermal resistant proteins. These results suggest a stabilizing role for helix-loop interactions, considering that thermostable proteins resistanttheRandomization most significant proteins. These distinctions of results protein between suggest networks the a stabilizing two groups role are helix-loop for helix-loop and loop-loop. interactions, In the considering first case indeed, that thermostable thermostable proteins preferentiallyproteinsTo further have gather 20 evaluate theirtimes energy bigger the to shift thisstructural specificthan the class, mesostables, properties while loop-loop while of proteins these interactions ones with have are more twice different energetic the shift thermal in for less loop-loop resistant properties, ofproteins the thermal we modeled each protein as a (FigurepreferentiallybioRxiv2). preprint gather doi: https://doi.org/10.1101/354266 their energy to this specific; this version class, posted while August loop-loop1, 2018. The copyright interactions holder for are this morepreprint energetic (which was not in less resistant proteins (FigureresistantResidue2 proteins.). Interactioncertified These by results Networkpeer review) suggest is (RIN),the aauthor/funder. stabilizing where All rolerights each forreserved. helix-loop residue No reuse is interactions,allowed represented without permission. considering as a node that thermostable and links between proteins residues are weighed with preferentially gather their energy to this specific class, while loop-loop interactions are more energetic in less resistant proteins Randomization(Figurenon-bonded2). of energies protein networks (as described in the Methods section). Ideally, the most concise way to assess the impact of energy ToRandomizationdistribution further evaluate on the of the structural protein thermal properties networks resistance of proteins limiting with different fold dependent thermal properties, effects we would modeled be each the protein comparison as a of homologous proteins with ResidueTo further Interaction evaluate Network the structural(RIN), where properties each residue of proteinsis represented with as different a node and thermal links between properties, residues we are modeled weighed each with protein as a Randomizationvery similar structures; of protein networks in other words when two homologous proteins are compared, the differences between them may be non-bondedResidueTo further Interaction energies evaluate (as the Network described structural (RIN), in properties the where Methods each of section). proteins residue Ideally,with is represented different the most thermal as concise a node properties, way and to links assess we between modeled the impact residues each of energy protein are weighed as a with distributionnon-bondedResidueattributable Interaction on the energies only thermal Network to (as resistance their described (RIN), different limiting where in the foldthermal each Methods dependent residue resistance. section). is effects represented would Ideally, Unfortunately, as be a the thenode comparison most and links concise only betweenof few homologous way cases residuesto assess of proteins are homologous the weighed impact with with of proteins energy with known melting verydistributionnon-bondedtemperature, similar structures; onenergies the solved thermal in (as other described with wordsresistance a when reasonablein the limiting two Methods homologous fold resolution section). dependent proteins Ideally, and effects are with compared,the would most no concise be missing the the differences comparison way residues to assess between of the arehomologous them impact available; may of be proteins energy furthermore with it is surely more attributable only to their different thermal resistance. Unfortunately, only few cases of homologous proteins with known melting verydistribution similar on structures; the thermal in resistance other words limiting when fold two dependent homologous effects proteins would are be the compared, comparison the ofdifferences homologous between proteins them with may be temperature,meaningful solved to with assess a reasonable the thermal resolution properties and with no of missing a protein residues regardless are available; of similarity furthermore with it is surely other more known proteins. Anyway it is useful attributablevery similar only structures; to their in different other words thermal when resistance. two homologous Unfortunately, proteins only are compared,few cases of the homologous differences proteins between(c) with them known may be melting meaningfulto preserve to assess the the comparative thermal properties approach, of a protein regardless because of able similarity to underline with other known differencesEnergy distribution proteins. for 8-8.5 Anyway dueÅ range only it is usefulto thermal stability. In light of these temperature,attributable only solved to their with different a reasonable thermal resistance. resolution Unfortunately, and with noE = -9.2 missingonly kCal/mol few residuescases of homologous are available; proteins furthermore with known it is melting surely more to preserve the comparative approach, because able to underline differences due only to thermal stability. In light of these meaningfultemperature,considerations, to solved assess we with the designthermal a(b) reasonable a properties procedure resolution of a thatprotein and withcompare regardless no missing a given8.4 of residues similarity protein are with available; with other its known furthermore modifiedtrue proteins.energy versionit is Anyway surely where more it is useful chemical interactions have considerations, we design a procedure that compare a given protein with its modified version where chemical interactions have meaningful to assess the thermal properties of a protein regardless of similarity with other known proteins.random energy Anyway it is useful energiestoenergies preserve typical typical the of mesostable comparative of mesostable proteins. approach, proteins. because able to underline differences due only to thermal stability. In light of these to preserve the comparative approach, because able to underline differences due only to thermal stability. In light of these considerations,We adoptedWe adopted a randomization we design a randomization a procedure strategy that that provide strategy compare a way a that given to providecompare protein with each a way its real modified protein to compare versionnetwork where eachwith an chemical real ensemble protein interactions of network have with an ensemble of

considerations, we design a procedure that compare a given protein with its modifiedDensity version where chemical interactions have re-weightedenergiesre-weighted typical ones. The ones. of proceduremesostable The generatesprocedure proteins. an ensemble generates of random an ensemble networks having of random the same networks number of nodeshaving and the links same but number of nodes and links but withenergies newWe physically adopted typical of allowable a mesostable randomization weights proteins. (i.e. strategy energies), that whichprovide are a extracted wayE’ = -1.4 to kCal/mol fromcompare the mesostable each real energy protein distribution network withusing an the ensemble of with new physically allowable weights (i.e. energies),8.4 which are extracted from the mesostable energy distribution using the interactionre-weightedWe adopted distance ones. as a randomization constraint The procedure for the strategy generates sampling. that Thisan provide ensemble has the a purpose way of random to ofcompare disrupting networks each the real having evolutionary protein the same network optimization number with and of an nodes shouldensemble and linksof but havewithre-weightedinteraction a greater new physically effect ones. distance on The highly allowable procedure as organized constraint weights generates proteins. (i.e. for an All energies),ensemble the steps sampling. of theof which random randomization Thisare networks extracted has process the having from purpose is schematically the samemesostable of number disrupting illustrated energy of nodes in the distribution Figure and evolutionary links3. but using the optimization and should with new(a) physically allowable weights (i.e. energies), which are extracted from the mesostable -30 -15 0 energy 15 30 distribution using the Ininteraction particular,have a greater given distance a link effect as characterized constraint on highly for by the an organized energy sampling. weight proteins. ThisEij hasand the by Alla purpose distance steps of of of disrupting theinteraction randomization thedEnergyij , evolutionary we [kCal/mol] substituted process optimization the energyis schematically and should illustrated in Figure 3. interaction distance as constraint for the sampling. This has the purpose of disrupting the evolutionary optimization and should withhave a new a greater one (Eij0 effect) extracted on highly from organized a energy distribution proteins. All defined steps in of the the specific randomization distance interval processdij isbelongs schematically to. Intervals illustrated were in Figure 3. haveIn particular, a greater effect given on highly a link organized characterized proteins. All by steps an energyof the randomization weight Eij processand by is schematically a distance illustrated of interaction in Figured3ij. , we substituted the energy definedIn particular, using a binning given a of link 0.5 characterizedA˚ width over the by distancean energy from weight 0 A˚ toEij 12andA˚, obtaining by a distance 24 different of interaction ranges.d Forij , each we substituted distance the energy Inwith particular, a new given one a ( linkE0 characterized) extracted by from an energy a energy weight distributionEij and by a distance defined of in interaction the specificdij , we distance substituted interval the energydij belongs to. Intervals were interval k, we generated a probabilityij density function rk(E), using only the energies values observed in such interval in the with a new one (Eij0 ) extracted from a energy distribution defined in the specific distance interval dij belongs(d) to. Intervals were with a new one (Eij0 ) extracted from a energy˚ distribution defined in the specific distance˚ interval˚dij belongs to. Intervals were mesostabledefineddefined usingproteins using a ofbinning a the binningTm ofdataset. 0.5 ofA˚ Thewidth 0.5 procedureA overwidth the is applied distanceover the 500 from timesdistance 0 forA˚ to each 12 from RIN,A˚, obtaining 0 soA thatto each 12 24 realA different, obtaining network ranges. is associated 24 For different each distance ranges. For each distance withdefined 500 random using networksa binning (rRINs). of 0.5 A˚ Thewidth randomization over the distance allows fromus to develop 0 A˚ to 12 a classifierA˚, obtaining based 24 on different the distance ranges. between For eachthe real distance intervalinterval k, k,we we generated generated a probability a probability density function densityr functionk(E), usingr onlyk(E) the, using energies only values the observed energies in values such interval observed in the in such interval in the networkinterval strength k, weand generated the mean a probability value of the density random function strengthr distribution.k(E), using The onlyT thescore, energies defined values in Eq. observed(1), is a in measure such interval of how in the mesostable proteins of the T dataset. The procedure is applied 500 timess for each RIN, so that each real network is associated muchmesostablemesostable the original proteins RIN proteins average of the ofTm strengthmdataset. the Tm value Thedataset. deviates procedure The from is procedure applied the expected 500 times is average applied for valueeach 500 RIN, of the times so rRIN that fordistribution: each each real network RIN, so is associated that each real network is associated (e) withwith 500 500 random random random networks networks networks (rRINs). (rRINs). (rRINs). The The randomization randomization The randomization allows allows us to us develop to allowsdevelop a classifier usa classifier to developbased based on the a on classifier distance the distance between based between the on real the the real distance between the real networks strength¯ strengthprotein and( ands¯ thes the) mean mean value value of of the the random random strength strength distribution. distribution. The TheTs score,Ts score, defined defined in Eq. in(1 Eq.), is( a1) measure, is a measure of how of how networkTs = strength and the mean value of the random strength distribution. The Ts score, defined(1) in Eq. (1), is a measure of how much the the original originals¯ RIN RIN average average strength strength value value deviates deviates from from the the expected expected average average value value of the of rRIN the rRIN distribution: distribution: much the originalprotein RIN average strength value deviates from the expected average value of the rRIN distribution: where s¯proteinDensity iss¯s¯protein theprotein average(s(¯s¯ - ofss the) ) strength parameter for the RIN; s¯ and s are the mean and standard deviation of the average TTss == T (1) (1) values of the rRINs¯ distribution.s¯protein (s We¯ uses) s score as a classifier setting the threshold value at 0; substantially we consider true proteins¯protein o all predictionsTs = for which theTs score is higher than 0 and the protein Tm is higher than 70 C or alternatively the Ts score is (1) s¯protein o lowerwhere thanss¯¯protein 0protein and theisis the the protein averageStrength averageTm [kCal/mol]is of of lower the the strength than strength70 parameterC parameter. The method for for the achieves the RIN; RIN;s¯ anands encouraging¯ ands ares theare mean averagethe mean and accuracy standard and standard across deviation a deviation seven-cross of the ofaverage the average validation of 72% plus or minus 3% (see Method). Figure 3.valuesTheFigure first of of step the the3. 3 ofThe the rRIN(color rRIN firstmethod’s step distribution. online) distribution. of procedure the Givenmethod’s is to a Weprocedurecalculate, protein We use use structure forT iss toT eachscore calculate,s score residues(a) as, for our asa pair, each classifier amethod the classifier residues minimal representspair, setting atom-atom thesetting minimalthe it asdistance, thresholdthe a atom-atom Residue threshold distance, Interaction value value at 0; Network atsubstantially 0; substantially(b), where we consider we consider true true which is indicatedwhich in yellow is indicated in panel in yellow (a). In inthis panel case, (a). the In minimal this case, distance the minimal between distance the two between residues the is two8.4 Angstrom. residues is The8.4 Angstrom. The o allwhereThe predictionseach scatter aminos¯protein plot acid for for inis becomes which Figurewhich the the average a4 the nodedepictsTsTscore andscore the theof is energetic the confusionis higher higher strength interactions than than matrix 0 and 0parameter betweenandwhere the the protein each amino protein for pointT acidsm theisT represents are higheris RIN; weighted higher thans¯ the linksand than 70 average connectingCs 70orareo alternativelyCTs or theand the alternatively nodes. each mean error the andT bars thescore standardT isscore deviation is of the average energy value relatedenergy to value such contact related tois replacedsuch contact with is another replaceds randomly with another extracted randomlyo from theextracted energy from distribution the energy of mesostable distributionm of mesostable s thelower standard(c) thanThe deviation 0first and step the of obtained protein the procedureTm fromis is lower theto calculate validation. than 70 theC minimalo. The The bottom method atom-atom left achieves distance square an for and encouraging each the residue top right pair, average squarewhich accuracy is contain indicated across respectively in a seven-cross proteins derivedlowervaluesproteins only than considering of derived 0 the and only energies rRINthe considering protein in distribution.the distance energiesTm is interval in lower the distance which Wethan corresponds interval use70 whichCT. tos The thecorrespondsscore minimal method as distanceto the a achieves minimal classifier of the specifydistance an encouraging of setting the specify the average threshold accuracy value across at a0; seven-cross substantially we consider true contact. Invalidation the panelyellowcontact. (b) Inofallon the energy the72% panel left values plus (b) of all panel. of orenergy all minus mesostable In values the 3% ofexample proteins all (see mesostable in here, Method). function proteins the of minimal distance in function are distance shown, of distance whilebetween are in shown, the the panel while two (c) residues inthe the panel is (c) 8.4 the Angstrom. The energy o density ofvalidationall energyvaluedensity predictions belonging related of of energy to 72% the to belonging distance such plus for contact range to orwhich the minus 8-8.5 distance is replaced Angstrom the range3%T (see8-8.5 is withs shown.score Angstrom Method). another The newis is one, shown.higher energy randomly The is represented new than energy extracted with 0 is represented and the from green the the line with proteinenergy the green distribution lineTm is of higher mesostable than 70 C or alternatively the Ts score is in the panel (d).Thein Performing the scatter panel (d). this plot Performing procedure in Figure for this each procedure residues4 depicts for pair each a newthe residues energy confusion pair organization a new energy matrix is definedorganization where and it is associatedeach defined point and to it a associated represents to a the average Ts and each error bar Theproteins scatter derived plot only in considering Figure 4 energiesdepicts in the the distance confusion interval matrixo which where corresponds each to point the minimal represents distance the of the average specificTs and4/12 each error bar new networkthelower of standardnew intramolecular network than deviation of 0 interactions. intramolecular and the obtained For interactions. eachprotein random from For networkT eachm theis random the validation. lower average network strength than the Theaverage parameter70 bottom strengthC is. calculated. The parameter left method square In is the calculated. panel and achieves In the the panel top right an encouraging square containaverage respectively accuracy across a seven-cross (e) is reportedthe thestandardcontact.(e) strength is reported In distribution deviation the strength middle obtained distribution of obtained panel iterating obtained the the from density procedure iterating the of for the energy validation. 500 procedure times. belonging Green for 500 The line times. to represents the bottom Green distance the line real left represents range mean square 8-8.5 the real andAngstrom mean the top is shown. right The square new contain respectively strength value,validationenergy whilestrength red value,is and representedof blue while72% region red and in with plusthe blue random the region orgreen strength in minus the line random distribution in the 3% strength right show (see distributionof the the Method).classification panel. show Performing the criterion: classification if this real criterion: procedurestrength if real for strength each pair, a new network of lies in red regionintramolecularlies the in protein red region is classified the interactions protein as thermostable is classified is established as while thermostable if characterizedit sets in while the blue if it by sets region a in new the it will blue energy be region labeled organization. it aswill mesostable. be labeled Reiterating as mesostable. the process many times, we The scatter plot in Figure 4 depicts the confusion matrix where each point represents the4/ average12 Ts and each error bar obtain an ensamble of random networks (d). Finally, for each random network the average Strength parameter is calculated. 4/12 thePanel standard(e) shows deviation the Strength distribution obtained obtained from iterating the the validation. procedure. Green The line represents bottom the left mean squareStrength value and of the the top right square contain respectively real network, while red and blue region in the random Strength distribution show the classification criterion: if real Strength lies in red region the protein is classified as thermostable while if it sets in the blue region it will be labeled as mesostable. 4/12

(a) Confusion matrix(a) Confusion matrix (b) ROC (b) ROC 5/16 Figure 4. a)ScatterFigure plot 4. ofa) theScatter melting plot temperature of the melting (Tm temperature) as a function (Tm of) as the a theoreticalfunction of descriptor the theoretical (Ts). descriptor Points represent (Ts). Points the represent the mean value of Tmeans over value the stratification of Ts over the procedure, stratification while procedure, error bars while on the error x axis bars are on given the x by axis the are standard given by deviation. the standard Vertical deviation. Vertical o o line at Tm = 70 lineC and at T horizontalm = 70 C lineand at horizontalTs = 0 are line shown at Ts in= green.0 are shownb)Plot inof green. the meanb)Plot ROC of thecurve. mean ROC curve.

5/12 5/12 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

We designed a procedure that compares a given protein with modified versions of itself where is preserved, while chemical interactions have energies typical of mesostable proteins and randomly assigned in a physical way, i.e. maintaining residue-residue distance information (see Methods). In this way, the randomization strategy provides a way to compare each real protein network with an ensemble of re-weighted cases, having the same number of nodes and links but with new weights (i.e. energies). These energies are extracted from the mesostable energy distribution using the interaction distance as constraint for the sampling. This procedure has the purpose of disrupting the evolutionary optimization and it is expected to have a larger effect on the highly organized network of thermostable proteins. By virtue of the different energy distribution between mesostable and thermostable proteins, sampling mesostable energies allows to properly assess the difference between the real thermostable protein network and its randomized counterpart. All steps of the randomization process are schematically illustrated in Figure 3. In particular, given a link characterized by an energy weight Ei j and by a distance of interaction di j, 0 we replaced the energy with a new one (Ei j ) extracted from a energy distribution defined for the specific distance interval di j belongs to. For each distance interval k, we generated a probability density function ρk(E), using only the energies values observed in such interval in the mesostable proteins. At the end of the process, for each real RIN, we generated an ensamble of random networks (rRINs). The randomization allows us to develop a classifier based on the distance between the real network Strength and the random Strength distribution. The Ts score, defined in Eq. 7 (see Methods), is a measure of how much the original RIN average Strength value deviates from the expected average value of the rRIN distribution. Note that our descriptor is general and parameter-free and can be computed for every kind of weighted graph. The Ts score can be used as a thermal stability classifier setting the threshold value at 0; substantially considering true all predictions for which the Ts score is higher o o than 0 and the protein Tm is higher than 70 C or alternatively the Ts score is lower than 0 and the protein Tm is lower than 70 C. A so defined method is completely parameter-free. It only requires a probability density of mesostable protein interactions. In order to evaluate a possible dependence of the method from the chosen dataset, we performed a cross-validation (7-folds see Method) using the Ts score computed with total energy Strength. The method achieves an average accuracy of 72% plus or minus 3% with a mean ROC curve characterized by an AUC value of 80% plus or minus 2%. The small error on both the performances (due to the dimensions of the dataset) indicates the independence of the method from the input information. Classifying on the basis of the 0 threshold of the Ts score, i.e. considering the Ts as a binary variable, loses part of the information contained in the descriptor. In order to have a more sensible classification, we evaluated three different scores, using the total energy (defined in Eq. 5) and specific interaction terms, i.e. the Coulomb and Lennard-Jones interactions (Eq. 3 and Eq. 4), and performed a clustering analysis. Figure 4(a) shows the hierarchical clustering obtained clustering all the proteins of our Twhole dataset using the Ward method as linkage function while the Manhattan distance among the three descriptors was used as distance metric. We also tested different metrics and clustering methods obtaining very similar results (data not shown). The optimal clustering cut was estimated using both the Connectivity, Dunn and Silhouette parameters, which indicates the two group division as the optimal one. We called these groups “Mesostable” (right group in Figure 4) and ”Thermostable” (left group). Indeed, the right cluster, containing 47 proteins, includes almost exclusively mesostable proteins (38), while the left cluster contains 26 thermostable proteins over the total 37 proteins. The overall accuracy of the method is 76%. We correctly assign the right thermal stability to 64 out of 84 proteins. The AUC of the ROC curve for the three Ts descriptors are 0.78, 0.79 and 0.68 (Figure 5a).

Key residues identification Here, we investigated the thermal resistance properties of proteins at the residue level. As protein stability is the result of the cooperative effects and the synergic actions of several residues, assessing the specific contribution of each amino acid is 44 i difficult . We define the Ts score for each residue according to Eq. 8, creating two groups of residues for each protein: with i Ts lower or higher than zero. We consider residues belonging to the first group to have a more stabilizing role than the ones in the second group. Consequently, along the lines of the global-protein classification procedure, we defined ”thermostable” (respectively ”mesostable”) residues belonging to the first (second) group. Using a total-energy based score, thermostable residues are the (11 ± 4)% of total residues. As expected, thermostable proteins have more thermostable residues with respect to mesostable ones (12% compared to 9%). Furthermore, by repeating the same analysis using Coulomb (C) and van der Waals (vdW)-based scores, we found that the average number of thermostable residues is the 11% and 16% of total residues, respectively. In the Coulomb network (see Figure 6a), the most frequent thermostable amino acids are the four charged amino acids: Arg, Asp, Glu and Lys, which cover the 96.6% and 96.1% of thermostable residues in thermostable and mesostable proteins respectively. Apolar and aromatic residues (Leu, Met, Phe and Tyr) are typically thermostable residues of the van der Waals network, including 53% and 54% of the total residues in mesostable and thermostable proteins, respectively (see Figure 6b). In order to investigate the role of each residue in the complexity of the whole system, we analyzed the properties of all residues using a graph-theory approach, calculating 8 network parameters, i.e. Betweenness Centrality, Closeness Centrality, Strength, Diversity Index, Mean Shortest Path, Hub Score, Clustering Coefficient and Degree (see Methods). A Principal

6/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 14 14 Mesostable Mesostable

12 Thermostable

12 Thermostable 10 10 8 6 8 4 6 2 0 1jji 1ftr 2sil 5hjj 1pii 1fsf 1iof 2dri 1rtb 4lyz 3ssi 1fvk 2izp 1j2v 1div 1hix 1jyd 1j4s 2zta 1i4n 1stn 1tca 1bni 1sfp 1bpi 3enj 1ino 3ufc 1orc 4gcr 2cro 3p8t 2e8f 5fb6 3dfq 1tpe 2prd 1r56 1rgg 1rop 1rhg 1rg8 1rn1 2zyz 1lnw 3chy 2lzm 1msi 1mjc 1y4y 2y3z 1chk 4blm 4ake 1ke4 1v6h 1udv 1npk 1onc 1spp 1c5g 1ekg 1a3y 1gtm 4n9h 4g03 1b8e 2dbo 2ebd 2gd1 2d4a 1qhe 5pep 1h09 2p68 1bd8 1poh 3d2a 2ysw 1pxw 1ew4 1cmb 1cm7 1h7m 4 * * ** * * * * * * * * *

Figure 4 (color online) Cluster of the Twhole dataset proteins with three Strength based descriptors, i.e. Coulomb, Lennard-

2 Jones and total energy. Stars indicate proteins on the Thyper dataset. The two groups are discriminated with a p-value of 2.6 × 10−6 (Fisher’s exact test).

0 Component Analysis (PCA) was performed in both kinds of network. In Figure 6c-d, all residues were projected along the first three principal components, which represent the 82% and 63% of the variance for van der Waals and Coulomb network 1jji 1ftr 2sil 5hjj 1pii 1fsf 1iof 2dri 1rtb 4lyz 3ssi 1fvk 2izp 1j2v 1div 1hix 1jyd 1j4s 2zta 1i4n 1stn 1tca 1bni 1sfp 1bpi 3enj 1ino 3ufc 1orc 4gcr 2cro 3p8t 2e8f 5fb6 3dfq 1tpe 2prd 1r56 1rgg 1rop 1rhg 1rg8 1rn1 2zyz 1lnw 3chy 2lzm 1msi 1mjc 1y4y 2y3z 1chk 4blm 4ake 1ke4

respectively1v6h 1udv (see Supplementary1npk Information). Thermostable residues1onc are neatly separated from others1spp 1c5g if we consider the1ekg 1a3y 1gtm 4n9h 4g03 1b8e 2dbo 2ebd 2gd1 2d4a 1qhe 5pep 1h09 2p68 1bd8 1poh 3d2a 2ysw 1pxw 1ew4 1cmb 1cm7 1h7m largest eigenvalue of the PCA in the Coulomb network and more weakly if we take into account the second and third ones. The Strength and the Closeness Centrality are the most relevant loadings along the first eigenvector both in favorable and unfavorable interactions. In vdW networks (Figure 6d), residue splitting in the PCA planes is less pronounced, even if a separation does occur along the second and third components, having the Clustering Coefficient and the Diversity Index as principal loadings (see Supplementary Information). In both kinds of network, independently from the protein of origin, thermostable residues populate the same regions of the PCA planes (green and orange dots in Figure6). Considering all dataset residues, about 25% of them are charged and only the 11% is classified as thermostable. A similar evidence was found for the four thermostable amino acids identified by vdW score. The role played by charged and aromatic amino acids in thermal resistance has been investigated in previous comparative studies45, 46. Generally, charged residues form highly energetic electrostatic cages which prevent water inclusion47, 48 while apolar and aromatic amino acids form short-ranged vdW interactions that confer stability to the overall structure49, 50. Here we identify key residues whose peculiar spatial disposition confers them a particular role in the stabilization of the protein. Notably, our approach, based on an heterogeneous dataset, permits us to confirm and generalize the stabilizing role of both the charged and apolar/aromatic residues formerly suggested by homologous-based studies. The mean shortest path (L) and the clustering coefficient (C) are able to catch the effect of the thermostable residues on maintaining these important structural motifs. The former provides information about the position of the residue in the network with the most central residues, having higher shortest path values. The latter quantifies the residue surrounding packing, being a ratio between the actual links and maximal number of possible links51–53. In Figure 6e (left panel), we projected all residues in the LC plane coloring in dark red the charged thermostable residues and in cyan the charged non-thermostable residues. Charged residues are concentrated in the region characterized by both small L and C values, with their thermostable subset tending to possess the smaller possible value of C. This means that thermostable residues have both to be exposed and surrounded by residue that make low energetic interaction between each others. In analogy with coulombian networks, we projected in the LC plane the four kinds of key residues identified in the vdW networks. Even if the signal is weaker, key residues in the thermostable van der Waals network (Leu, Met, Phe e Tyr) tend to possess a higher clustering coefficient, testifying the packing stabilizing effect of vdW interactions. Densities of C parameter are found to be

7/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(a) (b)

ROC - Coulomb + vdW

ROC - Coulomb

ROC - vdW

Figure 5 (color online) a) ROC curves of the three descriptors with the whole network Ts scores. b) ROC curves of the three i descriptors with the single-residue Ts scores. different with a p-value < 10−16 (nonparametric test of Kolmogorov-Smirnov). These finding allow us to divide residues in 8 groups: four groups are identified by the Coulomb interaction, i.e. thermostable charged/uncharged residues and non-thermostable charged/uncharged residues; while vdW interaction networks divide residue according to thermostable/non-thermostable being or not being in the Leu-Met-Phe-Tyr group. For each protein of the Twhole i dataset it is possible to compute the sum of the Ts scores in each of the 8 possible groups, obtaining a vector of 8 descriptors for each protein. Performing a linear regression with the four Coulomb-based vector component, the four vdW-based ones and with the whole eight-component vector we end up with a preliminary AUC of the ROC curves of 0.81 e 0.77 and 0.83 respectively (see Figure 5b), and we are currently developing a residue-specific approach for Tm prediction.

Discussion Proteins evolved to be functional in very distinct thermal conditions. The mechanisms proteins have developed to face thermal noise have been studied for a long time, given the influence that the comprehension of these mechanisms could exert on both the academic and theoretical industrial field. However, the complete understanding of the reasons that rule the fold stability is a challenging and unsolved problem in which a number of factors have to be taken into account. Comparative studies of homologous pairs have previously reported a change of content in the amino acids of thermostable proteins54 and Arg, Glu and Tyr are more frequently at the surface of thermophilic proteins with Tyr being involved in the formation of stabilizing aromatic clusters22, 27, 46. Undoubtedly, these studies have contributed to unveil mechanisms of thermal resistance typical of specific protein families, yet they do not provide a unifying, global theory describing the rules of adaptation at extreme conditions. In fact, distinct chemical physical characteristics or different combinations of attributes contribute differently to the stabilization of a protein family or fold, making not trivial to use homologous-based findings to infer the thermal stability of a given protein. At present, just two methods have been developed to predict the melting temperature of a given protein without need of 19 comparison with homologs and relying on a dataset of known Tm. Ku et al. proposed a sequence-based methods of statistical inference that relates the number of dipeptides within the amino acid sequence to the protein Tm allowing one to separate high thermostable from low thermostable proteins. More recently, Pucci et al.15 developed a method, based on the thermodynamic statistical potentials, that is able to predict the melting temperature of a given protein using as inputs the three-dimensional structure and the additional information of the optimal temperature (Tenv) the protein host organism lives in. Our work aims to represent a step toward the understanding of the thermal properties of a protein given its 3D structure. While the axiom thermophilic organisms have thermostable proteins is certainly correct, some mesophilic proteins may as well 31 be thermostable . Knowledge on the organism optimal growth temperature, Tenv, used to classify mesophiles and ,

8/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

(a) Freq. Amino Acids Freq. Thermostable Amino Acids - Thermostable proteins Freq. Thermostable Amino Acids - Mesostable proteins (b)

0.3 0.15

0.2 0.10 Frequency 0.1 Frequency 0.05

0.0 0.0 ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

(c) Thermostable residues of thermostable proteins Thermostable residues of mesostable proteins (d) PC1 vs PC2 PC1 vs PC2 5 10

5 0 0

-5 -5 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 -10 -5 0 5 10 -10 -5 0 5 10 PC1 vs PC3 PC1 vs PC3 5 10

5 0 0

-5 -5 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 -10 -5 0 5 10 -10 -5 0 5 10 PC2 vs PC3 PC2 vs PC3 5 10

5 Betweenness Centrality Mean Shortest Path 0 0 Closeness Centrality Hub Score Strength Clustering Coefficient -5 Diversity Index Degree -5 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15 -10 -5 0 5 10 -10 -5 0 5 10

(e) ShortestPath vs Clustering Coefficient ShortestPath vs Clustering Coefficient (f) Clustering Coefficient Clustering Coefficient 1.0 Negative Network Negative Network 8 8 0.9 0.8 0.6 0.7 0.4 4 4 0.5 0.2 0.0 0 0 0.3 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 0.0 0.2 0.4 0.6 0.5 0.7 0.9

Thermostable residues No thermostable charged residues No thermostable Tyr, Phe, Met and Leu residues

Figure 6 (color online) a-b) Frequencies of thermostable amino acids for the thermostable and mesostable proteins are shown in red and blue respectively. Frequencies of all the amino acids are shown in gray. c-d) Projection along the first three principal components of all residues. Thermostable residues for mesostable and thermostable proteins are indicated in green and in orange dots, respectively. e-f) All residues are mapped in LC space. In red Arg, Asp, Glu and Lys amino acids are shown as the most frequent thermostable residues of the Coulomb network. In yellow dots, Tyr, The, Leu and Met are shown as the most frequent thermostable amino acids identified by van der Waals based Ts score.

may be misleading with high value of correlation due to the fact that Tenv is always a lower-bound for Tm. The basic idea behind our method relies on the assumption that thermostable proteins undergo an optimization process during evolution that leads to specific structural arrangement of their energy interactions. Our analysis is based on a residue interaction network (RIN) in which the three dimensional structure of a protein is schematized as a graph with the residues acting as nodes and the molecular interactions as links. The graph representation is frequently adopted to study complex biological systems involving multiple interacting agent55–59. In our definition of network, links are weighted according to the sum of two nonbonded energetics terms: electrostatic and Lennard-Jones potential. The analysis of the distribution of energies (links) highlighted the correlation between the thermal stability of protein sets (grouped according to their Tm) and the probability of finding high intramolecular interactions, with a highest correlation of 0.90 considering eight groups of proteins (Figure 1). Unfortunately, neither it is possible to further divide the dataset in more groups due to the dataset dimension, nor we could not consider the energy distribution for the single protein because the small number of links makes the statistics noisy, especially in strong energy regions. Moreover, moving to higher orders of organization, e.g. considering the individual residual energies (Strength parameter), further reduces the data. For this reason, the next-up analysis were performed with a two-groups division of the dataset.

9/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Interestingly, we found that not only strong negative energies determine the thermal stability of a protein, but also strong positive interactions play a role. Such finding confirms the complex nature of the protein interaction network and in fact the stabilizing role of repulsive energies can be explained in cases where repulsion between a couple of residues results in a better spatial rearrangement of protein regions. In order to investigate the complex arrangement of the interactions in the 3D protein structure, we performed an analysis based on graph theory approach aimed at the comprehension of the favorable and unfavourable energies disposition. We determined the stabilizing contribution of each amino acid, defining the Strength of a residue (Eq. 6) as the sum of the energies of all the interactions (favorable and unfavorable) the residue establishes with all other residues. Indeed, this parameter gives an estimate of the residue significance in the overall protein architecture and can be used both as a local property of each individual amino acid and as a global average network feature of the entire protein. Moving to the higher level of organization we investigated the biological role of the secondary structure interactions in thermal stability. The interactions between residues belonging to alpha helixes and loops concentrate more energy in thermostable proteins than mesostable ones. Those results suggest that the thermal stability of a given protein is deeply linked both to the intensity of interactions and to their spatial disposition, and that both are fine-tuned during the evolutionary process. In order to assess the thermal stability, we investigated the network energy organization and compared it against an ensemble of randomized networks. The ensemble comparison has two main purposes: The first consists in overcoming the limitation of the need of pairs of homologous proteins for direct comparison. The second purpose, raised from the observation that thermostable proteins are enriched of high connected nodes (hubs) and have more organized networks of interactions respect mesostable proteins60–62, relies in the need introducing a quantitative measure of the evolutionary optimization process thermostable proteins underwent, i.e. the distance between real protein interaction network and a randomized one, in which we disrupt the optimization of energy achieved by thermostable proteins during evolution. As described in the method section, the energies of a network are always obtained from a distribution of mesostable protein interactions. In this way, the more the original network diverts from the ensemble, the higher the probability that the protein belongs to the thermostable class. Moreover, the comparison allows us to assess in a quantitative way the effect of the energetic topology of the protein. Using this protocol to build up the Ts parameter-free descriptor and performing a cluster analysis, we are able to discriminate between mesostable and thermostable proteins, with a maximum accuracy of 76% and an maximum Area Under The Curve (AUC) of 78%. At last, we investigated whether evolution acts on particular residues to optimise protein thermal stability or if stability is given by a cooperative effect with evolution acting on the whole protein. Our analysis identifies two sets of key (thermostable) residues according to the kind of energetic interactions the network is built with (Coulomb or van der Waals). Surprisingly, thermostable residue frequency in thermostable and mesostable proteins is comparable and they represent only a small subset of all residues. In order to better understand the theoretical aspects of thermostability and improve the classification to be used in more applicative fields, we created a new parameter dependent Ts score given by a linear combination of the Ts score of the eighth possible set of residues (see Results ). The improved performance of 83% of ROC’s AUC highlighted the promising features of the single residue approach.

Methods Datasets 63 Proteins with known melting temperature (Tm) were obtained from the ProTherm database . We selected all wild-type proteins o for which the following thermodynamic data and experimental conditions were reported: Tm ≥ 0 C; 6.5 ≤ pH ≤ 7.5 and no denaturants. Experimentally determined structures were collected from the PDB34 and filtered according to method (x-ray diffraction), resolution (below 3A˚) and percentage of missing residues (5% compared to the Uniprot64 sequence). Proteins for which experimentally determined structures were only available in a bound state, i.e. in complex with either a ligand or a ion, were excluded. Proteins were filtered using the CD-HIT software65 to remove proteins with chain sequence identity ≥ 40% to each other. The final dataset, hereinafter referred to as the Tm dataset, consisted of 71 proteins. Consistently with o 46, 66, 67 previous reported dataset, thermostable proteins (Tm ≥ 70 C) represent about a third of the overall dataset . In order to have a dataset as balanced as possible, we also manually collected a second, independent dataset consisting of proteins from hyperthermophilic organisms with optimal growth at T ≥ 90 oC and pH between 6.5 and 7.5. Experimentally determined structures were collected and filtered according to same criteria described above for the Tm dataset, leading to a total of 13 protein structures. This second dataset is referred to as the Thyper dataset. The union of the two dataset, referred as the Twhole dataset, accounts of 84 proteins (see Supplementary Information). CATH class and architecture for protein domains were checked: the most representative class for thermostable domains is Alpha Beta (84% respectively) with only a few Mainly Beta domains (11%) and just 1 Mainly Alpha domain (0.02%). On the other hand, Mainly Alpha and Mainly Beta constitutes the 21% and 26% of the mesostable domains, with Alpha Beta sets to 53%. 15 different folds are available for mesostable proteins, with the most representative one being 3-Layer(aba) Sandwich (19%) while the thermostable domains count 9 different folds, with 31% of 2-Layer Sandwich and 3-Layer(aba) Sandwich.

10/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Protein structures were minimized using the standard NAMD68 algorithm and the CHARMM force field69 in vacuum. A 1 fs time step was used and structures were allowed to thermalize for 10000 time steps.

Structural analysis Proteins from both the Tm and Thyper datasets were analyzed for their secondary structure content and architecture according to the CATH Protein Structure Classification database70. Per residue secondary structure assignment was done using the DSSP software71. In order to assess how the energy is distributed among protein secondary structure elements we assigned each couple of residues to a class of interaction, based on which secondary structure element they belong to. Possible interaction classes are: helix-helix, helix-strand, helix-loop, strand-strand, strand-loop and loop-loop. To evaluate the fraction of total energy proteins devolve to each class we defined the difference between the fraction of observed energy and a theoretical fraction:

i i i %∆E = %Eobs − %Ethe (1)

i where i represents the pair of secondary structures considered (e.g. helix-helix), %Eobs is the ratio between the energy of i class i and the total energy (Etot ). %Ethe estimates the expected fraction of energy for class i assuming equivalent distribution of energy among classed:

i i Nint Eint %Ethe = (2) Etot

i where Nint is the number of interactions of class i and Eint is the average energy value. In other words Eq. 2 gives the ratio between the number of interactions of class i and the total number of interactions. Energy distribution densities were calculated using the R density function with default parameters.

Network representation and analysis In the present work, protein structures are represented as Residue Interaction Networks (RINs), where each node represents a single amino acid aai. The nearest atomic distance between a given pair of residues aai and aa j is defined as di j. Two RIN 68, 69 nodes are linked together if di j ≤ 12 A˚ . Furthermore links are weighted by the sum of two energetic terms: Coulomb (C) and Lennard-Jones (LJ) potentials. The C contribution between two atoms, al and am, is calculated as:

C 1 qlqm Elm = (3) 4πε0 rlm

where ql and qm are the partial charges for atoms al and am, as obtained from the CHARMM force-field: rlm is the distance between the two atoms, and ε0 is the vacuum permittivity. The Lennard-Jones potential is instead given by:

" l m 12  l m 6# LJ √ Rmin + Rmin Rmin + Rmin Elm = εlεm − 2 (4) rlm rlm

l m where εl and εm are the depths of the potential wells of atom l and m respectively, Rmin and Rmin are the distances at which the potentials reach their minima. Therefore, the weight of the link connecting residues aai and aa j is calculated by summing the contribution of the single atom pairs as reported in equation 5.

" Ni Nj # C LJ Ei j = ∑∑ Elm + Elm (5) l m

where Ni and Nj are the number of atoms of the i-esime and j-esime residue respectively. Network analysis has been performed using the igraph package72 implemented in R73. For each RIN, the Strength local parameter74 is defined as:

i Naa si = ∑ Ei j (6) j=1

i where the Strength si of the i-esime residue is calculated as the sum of all energetic interactions for that residue (Naa). The 3D images of the protein networks were generated using Pymol75.

11/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Network randomization In order to distinguish mesostable from thermostable proteins, we compare the Strength calculated in the real network against the same parameter obtained from an ensemble of random RINs. More specifically, the Strength of each real network is compared against a distribution of mean Strength values from 500 randomized networks obtained from the real one using the procedure described below. Given a RIN link characterize by an energy weight Ei j and an interaction distance di j, we replace the energy value with a 0 new one (Ei j), randomly extracted from the energy distribution observed in mesostable proteins from the Tm dataset and in the same distance interval. Given the global range of interaction distances 0-12 A˚, twenty-four consecutive, non-overlapping distance intervals are obtained by dividing the entire range into a grid of bins using a bin width of 0.5 A˚.A Ts score, defined as:

Ts = s¯protein − (s¯− σ) (7) is calculated to estimate how much the original RIN mean Strength value deviates from the expected mean value of rRIN distribution. s¯protein is the average of the Strength parameter for the RIN; s¯ and σ are the mean and standard deviation of the i average values of the rRIN distribution. At the level of single residue, we define a Ts score, similarly to the one in Eq.7, as i Ts = si − (hsi − σhsi) (8) where si represents the Strength of residue i, hsi is the average Strength of residue i over the 500 randomized networks and σhsi is the standard deviation.

Performance evaluation We evaluated the performance of the Ts score in discriminating between termostable and mesostable proteins by a seven cross validation. The 49 mesostable proteins of the Tm dataset were divided in seven groups, guaranteeing that number of residues and Tm values were as broad distributed as possible. For each group of mesostable proteins: g 1. twenty-four density distribution ρk (E) are built, where g indicates the groups out of the seven created and E stands for the total energy defined in Eq. 5.

2. The remaining 42 mesostable proteins and the 22 thermostable ones together with the Thyper dataset proteins, are g randomized according to previous described procedure sampling the weights from the ρk (E).

3. All randomized proteins are classified as mesostable or thermostable proteins according to the obtained Ts score. This procedure ensures that the classification of the mesostable proteins is not biased by their own presence in the energy density distributions used in the randomization process. We used the R package pROC76 to plot the ROC curve and calculate the AUC values.

Clustering analysis 77 We clustered the Ts descriptors using the Euclidean distance and the Ward method as linkage function via the “hclust” function 73 of the “Stats” package of R . To better compare the different Ts score between them we normalize the data dividing each Ts score for the maximum of the absolute values. Finally, using the R package ”clValid”78, we performed an internal validation for the hierarchical cluster considering both the Connectivity, Dunn and Silhouette parameters.

Principal Component Analysis PCA was performed over eight graph-based descriptors using ”princomp” function of R software and the correlation matrix was used for the analysis79. Each descriptor has been computed using a specific function available in the R i-graph package. The involved descriptors and corresponding functions are: Betweenness Centrality (betweenness function)80; Closeness Centrality (closeness function)80; Strength, (strength function)74; Diversity Index, (diversity function)81; Mean Shortest Path, (distances function, with ”dijkstra” algorithm)82; Hub Score, (hubscore function)83; Clustering Coefficient, (transitivity function, with ”barrat” algorithm)74; Degree, (degree function)82.

Acknowledgements The authors dedicate this article to the memory of Professor Anna Tramontano, whose striking ideas lied the basis of the present work. The research leading to these results has been supported by Epigenomics flagship project EPIGEN. G.G.T. is funded by European Research Council (RIBOMYLOME 309545), Spanish Ministry of Economy and Competitiveness (BFU2014 − 55054 − P and BFU2017 − 86970 − P) and “Fundacio´ La Marato´ de TV3” (PI043296).

12/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Author contributions statement E.M. and R.L. conceived the study. G.G.T. contributed with additional ideas. M.M. and E.M. performed the calculations, M.M., P.P.O., L.D.R., F.A., P.C. and G.G.T. analysed the data. P.P.O. and F.A. collected the dataset. E.M., M.M., R.L. and G.G.T. wrote the manuscript. All authors reviewed the manuscript.

Conflict of interest The authors declare no conflict of interest.

References 1. Rothschild, L. J. & Mancinelli, R. L. Life in extreme environments. Nat. 409, 1092–1101 (2001). 2. Chen, P. & Shakhnovich, E. I. Thermal adaptation of viruses and bacteria. Biophys. J. 98, 1109–1118 (2010). 3. Mozhaev, V. V., Heremans, K., Frank, J., Masson, P. & Balny, C. High pressure effects on protein structure and function. Proteins 24, 81–91 (1996). 4. Talley, K. & Alexov, E. On the pH-optimum of activity and stability of proteins. Proteins 78, 2699–2706 (2010). 5. Huang, P. S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nat. 537, 320–327 (2016). 6. Robinson-Rechavi, M. & Godzik, A. Structural genomics of thermotoga maritima proteins shows that contact order is a major determinant of protein thermostability. Struct. 13, 857–860 (2005). 7. Brinda, K. V. & Vishveshwara, S. A network representation of protein structures: implications for protein stability. Biophys. J. 89, 4159–4170 (2005). 8. Amadei, A., Galdo, S. D. & D’Abramo, M. Density discriminates between thermophilic and mesophilic proteins. J. Biomol. Struct. Dyn. 1–9 (2017). 9. Daniel, R. The upper limits of enzyme thermal stability. Enzym. Microb. Technol. 19, 74–79 (1996). 10. Chen, Y.-C. et al. Thermal stability, storage and release of proteins with tailored fit in silica. Sci. Reports 7, 46568 (2017). 11. Razvi, A. & Scholtz, J. M. Lessons in stability from thermophilic proteins. Protein Sci. 15, 1569–1578 (2006). 12. Argos, P. et al. Thermal stability and protein structure. Biochem. 18, 5698–5703 (1979). 13. Bischof, J. C. & He, X. Thermal stability of proteins. Ann. N. Y. Acad. Sci. 1066, 12–33 (2005). 14. Pucci, F., Bourgeas, R. & Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: Introducing HoTMuSiC. Sci. Reports 6 (2016). 15. Pucci, F., Kwasigroch, J. M. & Rooman, M. SCooP: an accurate and fast predictor of protein stability curves as a function of temperature. Bioinforma. 33, 3415–3422 (2017). 16. Tavernelli, I., Cotesta, S. & Di Iorio, E. E. Protein dynamics, thermal stability, and free-energy landscapes: a investigation. Biophys. J. 85, 2641–2649 (2003). 17. Manjunath, K. & Sekar, K. Molecular dynamics perspective on the protein thermal stability: A case study using SAICAR synthetase. J. Chem. Inf. Model. 53, 2448–2461 (2013). 18. Wu, L.-C., Lee, J.-X., Huang, H.-D., Liu, B.-J. & Horng, J.-T. An expert system to predict protein thermostability using decision tree. Expert. Syst. with Appl. 36, 9007–9014 (2009). 19. Ku, T. et al. Predicting melting temperature directly from protein sequences. Comput. Biol. Chem. 33, 445–450 (2009). 20. Vogt, G., Woell, S. & Argos, P. Protein thermal stability, hydrogen bonds, and ion pairs. J. Mol. Biol. 269, 631–643 (1997). 21. Mozo-Villar´ıas, A., Cedano, J. & Querol, E. A simple electrostatic criterion for predicting the thermal stability of proteins. Protein Eng. Des. Sel. 16, 279–286 (2003). 22. Folch, B., Dehouck, Y. & Rooman, M. Thermo- and mesostabilizing protein interactions identified by temperature- dependent statistical potentials. Biophys. J. 98, 667–677 (2010). 23. Tartaglia, G. G., Cavalli, A. & Vendruscolo, M. Prediction of local structural stabilities of proteins from their amino acid sequences. Struct. 15, 139–143 (2007). 24. Vishveshwara, S., Brinda, K. V. & Kannan, N. Protein Structure: insights from graph theory. J. Theor. Comput. Chem. 01, 187–211 (2002).

13/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

25. Vijayabaskar, M. & Vishveshwara, S. Interaction energy based protein structure networks. Biophys. J. 99, 3704–3715 (2010). 26. Lee, C. W., Wang, H. J., Hwang, J. K. & Tseng, C. P. Protein thermal stability enhancement by designing salt bridges: a combined computational and experimental study. PLoS ONE 9, e112751 (2014). 27. Folch, B., Rooman, M. & Dehouck, Y. Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J Chem Inf Model. 48, 119–127 (2008). 28. Vogt, G. & Argos, P. Protein thermal stability: hydrogen bonds or internal packing? Fold Des 2, S40–46 (1997). 29. Priyakumar, U. D. Role of hydrophobic core on the thermal stability of proteins - molecular dynamics simulations on a single point mutant of Sso7d abstract. J. Biomol. Struct. Dyn. 29, 961–971 (2012). 30. Van den Burg, B. et al. Protein stabilization by hydrophobic interactions at the surface. Eur. J. Biochem. 220, 981–985 (1994). 31. Pucci, F. & Rooman, M. Physical and molecular bases of protein thermal stability and cold adaptation. Curr. Opin. Struct. Biol. 42, 117–128 (2017). 32. Chakrabarty, B. & Parekh, N. NAPS: Network analysis of protein structures. Nucleic Acids Res. 44, W375–W382 (2016). 33. Gromiha, M. M. et al. ProTherm, version 2.0: thermodynamic database for proteins and mutants. Nucleic Acids Res. 28, 283–285 (2000). 34. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000). 35. Brock, T. D. Life at high temperatures. Sci. 230, 132–138 (1985). 36. Brock, T. D. & Brock, T. D. The value of basic research: discovery of aquaticus and other extreme thermophiles. Genet. 146, 1207–1210 (1997). 37. Saiki, R. K. et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Sci. 239, 487–491 (1988). 38. Vieille, C. & Zeikus, G. J. Hyperthermophilic enzymes: Sources, uses, and molecular mechanisms for thermostability. Microbiol. Mol. Biol. Rev. 65, 1–43 (2001). 39. Vieille, C., Burdette, D. S. & Zeikus, J. G. Thermozymes. Biotechnol Annu. Rev 2, 1–83 (1996). 40. Serre, M.-C. & Duguet, M. Enzymes That Cleave and Religate DNA at High Temperature: The Same Story with Different Actors (Elsevier, 2003). 41. Marsaglia, G., Tsang, W. W. & Wang, J. Evaluating kolmogorov’s distribution. J. Stat. Softw. 8 (2003). 42. Touw, W. G. et al. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 43, D364–368 (2015). 43. Yang, H., Liu, L., Li, J., Chen, J. & Du, G. Rational design to improve protein thermostability: Recent advances and prospects. ChemBioEng Rev. 2, 87–94 (2015). 44. Sadeghi, M., Naderi-Manesh, H., Zarrabi, M. & Ranjbar, B. Effective factors in thermostability of thermophilic proteins. Biophys. Chem. 119, 256–270 (2006). 45. Pezzullo, M. et al. Comprehensive analysis of surface charged residues involved in thermal stability in alicyclobacillus acidocaldarius esterase 2. Protein Eng. Des. Sel. 26, 47–58 (2012). 46. Kannan, N. & Vishveshwara, S. Aromatic clusters: a determinant of thermal stability of thermophilic proteins. Protein Eng. 13, 753–761 (2000). 47. Sabarinathan, R., Aishwarya, K., Sarani, R., Vaishnavi, M. K. & Sekar, K. Water-mediated ionic interactions in protein structures. J. Biosci. 36, 253–263 (2011). 48. Levy, Y. & Onuchic, J. N. Water and proteins: A love-hate relationship. Proc. Natl. Acad. Sci. 101, 3325–3326 (2004). 49. Lanzarotti, E., Biekofsky, R. R., Estrin, D. A., Marti, M. A. & Turjanski, A. G. Aromatic–aromatic interactions in proteins: Beyond the dimer. J. Chem. Inf. Model. 51, 1623–1633 (2011). 50. Paiardini, A., Sali, R., Bossa, F. & Pascarella, S. ”hot cores” in proteins: Comparative analysis of the apolar contact area in structures from hyper/thermophilic and mesophilic organisms. BMC Struct. Biol. 8, 14 (2008). 51. Atilgan, A. R., Akan, P. & Baysal, C. Small-world communication of residues and significance for protein dynamics. Biophys. J. 86, 85–91 (2004).

14/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

52. Vendruscolo, M., Paci, E., Dobson, C. M. & Karplus, M. Three key residues form a critical contact network in a protein folding transition state. Nat. 409, 641–645 (2001). 53. Vendruscolo, M., Dokholyan, N. V., Paci, E. & Karplus, M. Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E 65 (2002). 54. Yokot, K., Satou, K. & ya Ohki, S. Comparative analysis of protein thermostability: Differences in amino acid content and substitution at the surfaces and in the core regions of thermophilic and mesophilic proteins. Sci. Technol. Adv. Mater. 7, 255–262 (2006). 55. Grewal, R. K. & Roy, S. Modeling proteins as residue interaction networks. Protein Pept. Lett. 22, 923–933 (2015). 56. Vendruscolo, M., Paci, E., Dobson, C. M. & Karplus, M. Three key residues form a critical contact network in a protein folding transition state. Nat. 409, 641–645 (2001). 57. Taylor, N. R. Small world network strategies for studying protein structures and binding. Comput. Struct Biotechnol J 5, e201302006 (2013). 58. Amor, B. R., Schaub, M. T., Yaliraki, S. N. & Barahona, M. Prediction of allosteric sites and mediating interactions through bond-to-bond propensities. Nat Commun 7, 12477 (2016). 59. Souza, V. P., Ikegami, C. M., Arantes, G. M. & Marana, S. R. Protein thermal denaturation is modulated by central residues in the protein structure network. FEBS J. 283, 1124–1138 (2016). 60. Jonsdottir, L. B. et al. The role of salt bridges on the temperature adaptation of aqualysin I, a thermostable subtilisin-like proteinase. Biochim. Biophys. Acta 1844, 2174–2181 (2014). 61. Kumar, S., Tsai, C. J. & Nussinov, R. Factors enhancing protein thermostability. Protein Eng. 13, 179–191 (2000). 62. Pucci, F. & Rooman, M. Improved insights into protein thermal stability: from the molecular to the structurome scale. Philos Trans A Math Phys Eng Sci 374 (2016). 63. Kumar, M. D. et al. ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. 34, D204–206 (2006). 64. Pundir, S., Martin, M. J. & O’Donovan, C. UniProt Protein Knowledgebase. Methods Mol. Biol. 1558, 41–55 (2017). 65. Huang, Y., Niu, B., Gao, Y., Fu, L. & Li, W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinforma. 26, 680–682 (2010). 66. Karshikoff, A. & Ladenstein, R. Proteins from thermophilic and mesophilic organisms essentially do not differ in packing. Protein Eng. 11, 867–872 (1998). 67. Parthasarathy, S. & Murthy, M. R. Protein thermal stability: insights from atomic displacement parameters (B values). Protein Eng. 13, 9–13 (2000). 68. Phillips, J. C. et al. Scalable molecular dynamics with NAMD. J Comput. Chem 26, 1781–1802 (2005). 69. Vanommeslaeghe, K. & MacKerell, A. D. Automation of the CHARMM General Force Field (CGenFF) I: bond perception and atom typing. J Chem Inf Model. 52, 3144–3154 (2012). 70. Sillitoe, I. et al. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 43, D376–381 (2015). 71. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolym. 22, 2577–2637 (1983). 72. Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal 1695 (2006). 73. Ihaka, R. & Gentleman, R. R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996). 74. Barrat, A., Barthelemy, M., Pastor-Satorras, R. & Vespignani, A. The architecture of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 101, 3747–3752 (2004). 75. L DeLano, W. The PyMOL Molecular Graphics System (2002) DeLano Scientific, Palo Alto, CA, USA. (2002). 76. Robin, X. et al. pROC: an open-source package for r and s to analyze and compare roc curves. BMC Bioinforma. 12, 77 (2011). 77. Ward, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963). 78. Brock, G., Pihur, V., Datta, S. & Datta, S. clValid: AnRPackage for cluster validation. J. Stat. Softw. 25 (2008). 79. Venables, W. & Ripley, B. Modern applied statistics with S-Plus (Springer-Verlag, 1997), second edn.

15/16 bioRxiv preprint doi: https://doi.org/10.1101/354266; this version posted August 1, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

80. Freeman, L. C. Centrality in social networks conceptual clarification. Soc. Networks 1, 215–239 (1978). 81. Eagle, N., Macy, M. & Claxton, R. Network diversity and economic development. Sci. 328, 1029–1031 (2010). 82. West, D. B. Introduction to Graph Theory (Prentice Hall, 2000). 83. Kleinberg, J. M. Authoritative sources in a hyperlinked environment. J. ACM 46, 604–632 (1999).

16/16