Supporting Information for Proteomics DOI 10.1002/pmic.200401138

Massimo Vergassola, Alessandro Vespignani and Bernard Dujon

Cooperative evolution in protein complexes of yeast from comparative analyses of its protein interaction network

© 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de

Supplementary Material for the manuscript “Cooperative evolution in protein complexes of yeast from comparative analyses of its protein interaction network”

Massimo Vergassola, Alessandro Vespignani and Bernard Dujon

1. Phylogenetic relationships among selected hemiascomycetes

The cartoon is a scheme of the evolutionary relationships among the hemiascomycete yeasts that have been sequenced. The branching order is inferred from the cladogram of 25S rDNA sequences, computed (courtesy of P. Durrens) with the Tajima-Nei distance method (Tajima, F. and Nei, M. Estimation of evolutionary distance between nucleotide sequences Mol. Biol. and Evol. 17, 269-285, (1984)) and maximum parsimony using the MEGA2 software (Kumar, S. et al. MEGA2: Molecular Evolutionary Analysis software Bioinformatics 17, 1244-1245, (2001)). The S. pombe sequence is taken as an outgroup. The length of the lines should not be taken as indicative of evolutionary distances. The order within the stricto sensu group, S. paradoxus, S. mikatae and S. kudriavzevii, sequenced in (Kellis, M. et al. Sequencing and comparison of yeast species to identify genes and regulatory elements Nature 423, 241-254 (2003)), has low bootstrap values due to the very strong similarity of their genomic sequences to S. cerevisiae. The pair S. kluyveri and S. castellii was sequenced in (Cliften, P. et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting Science 301, 71-76 (2003)).

2. Topology of the protein interaction networks for the various datasets

The topological properties of the networks constructed from the three different types of datasets discussed in the paper are presented here. Recall that (I) is the ensemble of the two-hybrid data in [2,3]; (II) is made of the TAP-tag data in [4]; and (III) is a large collection of both types of interactions, assembled at http://dip.doe-mbi.ucla.edu. Their respective numbers of nodes N and edges E are (2152; 2831), (1361; 3221) and (4713; 14846). The average connectivities are 2.63, 4.73 and 6.3, respectively. The values of the clustering coefficient are 0.16, 0.19 and 0.06, to be compared with the values 1.2 × 10^-3, 3.5 × 10^-3 and 1.3 × 10^-3 for the corresponding random networks having the same average connectivity.
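The average connectivities quoted above follow from N and E as 2E/N. The short check below is a sketch only: it assumes that the random-network values correspond to the standard estimate C ≈ <k>/N for a random graph with the same number of nodes and the same average connectivity, since the construction of the random networks is not detailed here. The function names are illustrative.

```python
# (N, E) pairs quoted in the text for data sets (I), (II) and (III).
networks = {"I": (2152, 2831), "II": (1361, 3221), "III": (4713, 14846)}


def average_connectivity(n_nodes, n_edges):
    """Mean degree of an undirected graph: each edge touches two nodes."""
    return 2.0 * n_edges / n_nodes


def random_clustering(n_nodes, n_edges):
    """Assumed random-graph estimate of the clustering coefficient, <k> / N."""
    return average_connectivity(n_nodes, n_edges) / n_nodes


for name, (n, e) in networks.items():
    k = average_connectivity(n, e)
    c = random_clustering(n, e)
    print(f"({name}) <k> = {k:.2f}, C_random = {c:.1e}")
```

Under this assumption the estimates round to the values quoted in the text for all three data sets.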

3. “Lost” proteins for the separate datasets (I) and (II)

As discussed in the body of the paper, proteins are dubbed “lost” if they do not appear in the list of bi-directional best hits for the comparative analysis between S. cerevisiae and one of the four yeasts of comparison. The table in the paper was obtained for the dataset (III), containing both TAP-tag and two-hybrid interactions. The values for the separate data sets (I) and (II), made of two-hybrid and TAP-tag interactions respectively, are reported in Table S.1.
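The definition of “lost” proteins can be illustrated with a minimal sketch. It assumes that similarity scores between proteins of the two species are already available as nested dictionaries; the scoring scheme, the identifiers and the function names are hypothetical and are not taken from the paper.

```python
def best_hit(scores):
    """Map each query to its highest-scoring target (scores[query][target])."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def bidirectional_best_hits(forward, backward):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd, bwd = best_hit(forward), best_hit(backward)
    return {(a, b) for a, b in fwd.items() if bwd.get(b) == a}


def lost_proteins(proteins, forward, backward):
    """Reference proteins that appear in no bi-directional best hit."""
    kept = {a for a, _ in bidirectional_best_hits(forward, backward)}
    return set(proteins) - kept


# Toy scores (hypothetical identifiers and values, for illustration only).
forward = {"A": {"x": 90.0, "y": 40.0}, "B": {"x": 70.0}, "C": {}}
backward = {"x": {"A": 88.0, "B": 65.0}, "y": {"A": 35.0}}
print(sorted(lost_proteins(["A", "B", "C"], forward, backward)))
```

In this toy example only the pair (A, x) is reciprocal, so B (whose best hit prefers A) and C (no hit at all) are counted as lost.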

Two-hybrid data set (I)

              C. glabrata        K. lactis          D. hansenii        Y. lipolytica
Connectivity  1 -> 3   4 -> 76   1 -> 3   4 -> 76   1 -> 3   4 -> 76   1 -> 3   4 -> 76
# Genes       1728     424       1728     424       1728     424       1728     424
# Lost        374      58        430      62        636      121       767      157
Fraction      22%      14%       25%      15%       37%      29%       44%      37%
Prob.         4.0 × 10^-4        2.1 × 10^-5        4.9 × 10^-3        2.0 × 10^-2

TAP-tag data set (II)

              C. glabrata        K. lactis          D. hansenii        Y. lipolytica
Connectivity  1 -> 5   6 -> 53   1 -> 5   6 -> 53   1 -> 5   6 -> 53   1 -> 5   6 -> 53
# Genes       1006     355       1006     355       1006     355       1006     355
# Lost        79       15        96       20        206      35        286      53
Fraction      8%       4%        10%      6%        21%      10%       28%      15%
Prob.         1.4 × 10^-2        1.7 × 10^-2        1.0 × 10^-5        2.8 × 10^-6

Table S.1: The distribution of proteins “lost” in the comparative analysis between S. cerevisiae and the yeasts indicated in the first row. The range of the protein connectivities in the two groups is indicated in the second row. The separation value was chosen as the average connectivity in the network. The two halves of the table refer to the two-hybrid data set (I) and the TAP-tag data set (II), respectively. The third to fifth rows report the total number of proteins, the number of those that are lost and the corresponding fraction for the two groups. The last row reports the probability that the difference between the “loss” rates in the two groups is due to chance (see Methods).

3. Footprinting profiles

The footprinting profiles are aimed at quantifying the co-occurrence of pairs of genes in the genomes of different species [10]. For a given pair, each species contributes zero if the two genes are both “present” or both “absent” and one otherwise. The footprinting distance between two genes is the sum of those contributions over the species under consideration (the four hemiascomycetes C. glabrata, K. lactis, D. hansenii and Y. lipolytica, in our case), and therefore ranges from 0 to 4. We defined “presence/absence” of a gene via the list of bi-directional best hits. The higher similarity of the phylogenetic profiles of interacting proteins is evident for all data sets.
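The definition above amounts to a Hamming distance between presence/absence profiles. A minimal sketch follows; the gene names and profiles are hypothetical, and presence is assumed to be encoded as a boolean per species.

```python
from collections import Counter

SPECIES = ("C. glabrata", "K. lactis", "D. hansenii", "Y. lipolytica")


def footprinting_distance(profile_a, profile_b):
    """Number of species in which exactly one of the two genes is present."""
    return sum(1 for s in SPECIES if profile_a[s] != profile_b[s])


def distance_histogram(pairs, profiles):
    """Fraction of gene pairs found at each distance 0, 1, ..., len(SPECIES)."""
    counts = Counter(footprinting_distance(profiles[a], profiles[b]) for a, b in pairs)
    total = sum(counts.values())
    return [counts[d] / total for d in range(len(SPECIES) + 1)]


# Hypothetical profiles (True = a bi-directional best hit exists in that species).
profiles = {
    "g1": {"C. glabrata": True, "K. lactis": True, "D. hansenii": False, "Y. lipolytica": True},
    "g2": {"C. glabrata": True, "K. lactis": False, "D. hansenii": False, "Y. lipolytica": False},
}
print(footprinting_distance(profiles["g1"], profiles["g2"]))
print(distance_histogram([("g1", "g2"), ("g1", "g1")], profiles))
```

Applying `distance_histogram` once to the interacting pairs and once to all pairs would give the two kinds of normalized histograms compared in this section.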

[Figs. S1-S3: histograms for the mixed data set, the TAP-tag data set and the two-hybrid data set, in that order. The horizontal axis runs from 0 to 4. Legend: “Interacting”, “All couples”.]

Figs. S1,2,3: Histograms of the footprinting distances for pairs of proteins linked in the protein interaction network of S. cerevisiae (“Interacting”) and for all possible pairs of proteins of S. cerevisiae (“All couples”). The first figure refers to the data set (III), the second to (II) and the last to (I).

4. Co-evolution analysis for the separate data sets (I) and (II)

We report here the results on the co-evolution within cliques of interlinked proteins for the separate data sets. Tables S2 and S3 correspond to Table 2 in the body of the paper and refer to data sets (I) and (II), made of two-hybrid and TAP-tag interactions, respectively. We limited the order of the cliques analyzed to 3 and 4, respectively. For higher orders the number of proteins drops below a hundred and the statistics are not reliable.
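For illustration, cliques of a given order (groups of proteins in which every pair is linked) can be enumerated by extending smaller cliques one node at a time. This is a sketch of one possible enumeration, not necessarily the procedure used for the tables; the function names and the toy edge list are hypothetical.

```python
def cliques_of_order(edges, order):
    """All groups of `order` nodes in which every pair of nodes is linked."""
    neighbours = {}
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    cliques = [(v,) for v in sorted(neighbours)]
    for _ in range(order - 1):
        extended = []
        for clique in cliques:
            # Candidates must be linked to every member of the current clique.
            common = set.intersection(*(neighbours[v] for v in clique))
            # Keeping only later nodes ensures each clique is listed once.
            extended.extend(clique + (w,) for w in sorted(common) if w > clique[-1])
        cliques = extended
    return cliques


def proteins_in(cliques):
    """Number of distinct proteins appearing in at least one clique."""
    return len({p for clique in cliques for p in clique})


# Toy network: four fully interlinked proteins plus one pendant protein.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("c", "d"), ("d", "e")]
for order in (2, 3, 4):
    found = cliques_of_order(edges, order)
    print(order, len(found), proteins_in(found))
```

With order 2 the function simply returns the list of interacting pairs.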

Data set (I), two-hybrid interactions

                        C. glabrata               K. lactis
# proteins interlinked  2            3            2            3
# groups                2073         776          1973         689
# proteins              1602         344          1151         322
z-score                 4.0 ± 0.3    4.6 ± 0.2    4.0 ± 0.3    5.5 ± 0.2
probability             3.2 × 10^-5  2.1 × 10^-6  3.2 × 10^-5  1.9 × 10^-8

                        D. hansenii               Y. lipolytica
# proteins interlinked  2            3            2            3
# groups                1455         628          1141         484
# proteins              1196         261          1002         228
z-score                 2.3 ± 0.3    2.5 ± 0.2    2.9 ± 0.3    2.8 ± 0.2
probability             1.1 × 10^-2  6.2 × 10^-3  1.9 × 10^-3  2.6 × 10^-3

Data set (II), TAP-tag interactions

                        C. glabrata                               K. lactis
# proteins interlinked  2             3             4             2             3             4
# groups                2901          2129          1558          2842          2068          1501
# proteins              1246          588           297           1221          583           283
z-score                 6.3 ± 0.3     7.3 ± 0.2     5.5 ± 0.1     7.0 ± 0.3     7.6 ± 0.2     4.5 ± 0.1
probability             1.5 × 10^-10  1.4 × 10^-13  1.9 × 10^-8   1.3 × 10^-12  1.5 × 10^-14  3.0 × 10^-6

                        D. hansenii                               Y. lipolytica
# proteins interlinked  2             3             4             2             3             4
# groups                2486          1715          1126          2165          1505          1009
# proteins              1085          523           256           968           467           237
z-score                 8.2 ± 0.3     7.7 ± 0.2     4.0 ± 0.1     6.9 ± 0.3     6.2 ± 0.2     2.5 ± 0.1
probability             1.2 × 10^-16  6.8 × 10^-15  3.2 × 10^-5   2.6 × 10^-12  2.8 × 10^-10  6.2 × 10^-3

Tables S2,3: Wilcoxon test for the dissimilarities observed in the normalized histograms corresponding to those in Fig. 1. The second and third rows report the total number of cliques and of proteins appearing therein. The last row reports the probability that the observed difference is due to chance. Its error bars, estimated from the standard deviation measured in 10,000 realizations of the random draws, are reported on the corresponding z-scores, i.e. the deviations from the mean normalized by the standard deviation, for the null Gaussian distribution in the Wilcoxon test.

Normalized histograms of the evolutionary distance D for protein cliques

[Figure: four panels of normalized histograms. Vertical axis: Probability; horizontal axis: Evolutive distance D.]

Caption: The four plots report the normalized histograms for the evolutive distance D of cliques identified within the S. cerevisiae protein interaction network. The observable D, defined in the text, conveys information on the strength of multi-point correlations in the evolution rates of proteins within a clique: low values of D are the signature of strong co-evolutive patterns. The curves refer to the comparison between S. cerevisiae and K. lactis. The red and black curves correspond to the original interacting cliques and to a randomized version thereof, respectively. The two curves clearly intersect, with a visible accumulation of probability at small differences for the set of interacting proteins. The displacement of the maximum of the curve to the right follows from the definition of D and the fact that the number of different pairs within a clique grows with its order.

4. Evolutive divergence rates vs connectivity: the role of protein concentrations

An issue that has recently attracted much attention is the possible dependence of the evolutive divergence rates of proteins on their connectivity. A statistically significant dependence was found in [8,9], and later criticized in [11,12] because no such behavior was found in data sets composed of two-hybrid interactions only. Table S4 reports the results for the comparative analysis of the four yeasts considered in this work, which clearly confirm the strong dependence on the data set. A possible origin of the discrepancy is the bias of mass-spectroscopy methods towards abundant proteins [16]. A direct illustration of the differences between mass-spectroscopy and two-hybrid methods is provided by Fig. S4. Because complexes involving rare proteins are under-assayed, the connectivity of rare proteins is depleted. The differences observed in Table S4 might be due to this effect coupled with the dependence of the evolutive divergence rates of proteins on their concentration. That dependence was first remarked in [17] and is confirmed in Fig. S5.

Although this mechanism is a plausible cause of the observed discrepancies, it should be remarked that over-expression of the proteins assayed in two-hybrid systems might yield non-physiological interactions and wipe out biologically relevant dependences on the concentrations. By contrast, the TAP-tag technique has the advantage of assaying the interactions of proteins at their physiological concentrations. Settling the issue will therefore require a more complete screening of the interaction network.

Data set (II), TAP-tag interactions

              C. glabrata                   K. lactis
Connectivity  1 -> 4    5 -> 9    10 -> 53  1 -> 4    5 -> 9    10 -> 53
# Genes       860       210       197       842       208       195
1 -> 4        -         2.60      3.68      -         2.93      4.61
5 -> 9        -         -         1.03      -         -         1.59

              D. hansenii                   Y. lipolytica
Connectivity  1 -> 4    5 -> 9    10 -> 53  1 -> 4    5 -> 9    10 -> 53
# Genes       737       198       185       661       188       173
1 -> 4        -         1.40      5.06      -         1.17      4.60
5 -> 9        -         -         3.00      -         -         2.86

Data set (I), two-hybrid interactions

              C. glabrata                   K. lactis
Connectivity  1 -> 3    4 -> 7    8 -> 76   1 -> 3    4 -> 7    8 -> 76
# Genes       1354      366       255       1298      362       249
1 -> 3        -         -0.68     0.66      -         -1.37     -0.13
4 -> 7        -         -         1.02      -         -         0.69

              D. hansenii                   Y. lipolytica
Connectivity  1 -> 3    4 -> 7    8 -> 76   1 -> 3    4 -> 7    8 -> 76
# Genes       1092      303       204       961       180       87
1 -> 3        -         -0.51     -0.32     -         0.01      0.30
4 -> 7        -         -         0.10      -         -         0.24

Table S4: Wilcoxon tests for the dissimilarities between the distributions of the evolutive ranks within groups having different connectivities. The range of the connectivities for the various groups is indicated in the second rows of the tables. The third rows report the total number of proteins in each group. The dissimilarities are quantified by the z-scores, i.e. the deviations from the mean normalized by the standard deviation, for the null Gaussian distribution in the Wilcoxon test. A positive value indicates that the group on the row has higher ranks than the one on the corresponding column, as is the case e.g. for the groups (1->4) and (5->9) for C. glabrata in the first table. The two halves of the table refer to the data sets (II) and (I), made of TAP-tag interactions and two-hybrid interactions, respectively. Note the strong dependence on the data set of the statistical significance of the observed dissimilarities between the groups.
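As an illustration of how such z-scores can be obtained, the sketch below computes the rank-sum statistic of one group against another and normalizes it with the mean and standard deviation of the null Gaussian distribution. It is a textbook version without tie correction, not necessarily the exact procedure of the Methods; the function names are illustrative. The one-sided Gaussian tail is included because it reproduces the probabilities quoted next to the z-scores in Tables S2,3 (e.g. z = 4.0 and z = 2.5).

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rank_sum_z(group_a, group_b):
    """z-score of the rank sum of group_a; positive if group_a has higher ranks."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])
    mean = n_a * (n_a + n_b + 1) / 2.0
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    return (w - mean) / sd


def gaussian_tail(z):
    """One-sided probability of exceeding z under a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


print(f"{gaussian_tail(4.0):.1e}")  # 3.2e-05
print(f"{gaussian_tail(2.5):.1e}")  # 6.2e-03
```

Swapping the two groups changes the sign of the z-score, which matches the row/column convention described in the caption.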

[Figure S4: abundance rank (vertical axis) versus connectivity rank (horizontal axis), with one curve labelled “Two-hybrid data” and one labelled “Mass spectroscopy data”.]

FIG. S4: Abundance rank versus connectivity rank in the two-hybrid and mass spectroscopy data sets. The two-hybrid data do not exhibit any clear correlation, while the mass spectroscopy data set shows a striking correlation.

[Figure S5: four panels, one each for C. glabrata, D. hansenii, K. lactis and Y. lipolytica. Vertical axis: abundance rank; horizontal axis: evolutive divergence rank.]

FIG. S5: Evolutive divergence rank versus abundance rank of proteins in the four hemiascomycetes analyzed in the paper. In all instances, more abundant proteins have a smaller evolutive divergence.