Normalized Histograms of the Evolutionary Distance D for Proteins Cliques
Supporting Information for Proteomics
DOI 10.1002/pmic.200401138

Massimo Vergassola, Alessandro Vespignani and Bernard Dujon
Cooperative evolution in protein complexes of yeast from comparative analyses of its interaction network

© 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
www.proteomics-journal.de

Supplementary Material for the manuscript “Cooperative evolution in protein complexes of yeast from comparative analyses of its protein interaction network”

1. Phylogenetic relationships among selected hemiascomycetes yeasts

The cartoon is a scheme of the evolutionary relationships among the hemiascomycetes yeasts that have been sequenced. The branching order is inferred from the cladogram of 25S rDNA sequences, computed (courtesy of P. Durrens) with the Tajima-Nei distance method (Tajima, F. and Nei, M. Estimation of evolutionary distance between nucleotide sequences Mol. Biol. and Evol. 17, 269-285, (1984)) and maximum parsimony, using the MEGA2 software (Kumar, S. et al. MEGA2: Molecular Evolutionary Genetics Analysis software Bioinformatics 17, 1244-1245, (2001)). The S. pombe sequence is taken as the outgroup. The lengths of the lines should not be taken as indicative of evolutionary distances. The order within the sensu stricto group (S. paradoxus, S. mikatae and S. kudriavzevii), sequenced in (Kellis, M. et al. Sequencing and comparison of yeast species to identify genes and regulatory elements Nature 423, 241-254 (2003)), has low bootstrap values because of the very strong similarity of their genomic sequences to that of S. cerevisiae. The pair S. kluyveri and S. castellii was sequenced in (Cliften, P. et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting Science 301, 71-76 (2003)).

2. Topology of the protein interaction networks for the various datasets

The topological properties of the networks constructed from the three different types of datasets discussed in the paper are presented here.
We recall that (I) is the ensemble of the two-hybrid data in [2,3]; (II) is made of the TAP-tag data in [4]; and (III) is a large collection of both types of interactions, assembled at http://dip.doe-mbi.ucla.edu. Their respective numbers of nodes N and edges E are (2152; 2831), (1361; 3221) and (4713; 14846). The average connectivities are 2.63, 4.73 and 6.3, respectively. The values of the clustering coefficient are 0.16, 0.19 and 0.06, to be compared with the values 1.2 × 10^-3, 3.5 × 10^-3 and 1.3 × 10^-3 for the corresponding random networks having the same average connectivity.

3. “Lost” proteins for the separate datasets (I) and (II)

As discussed in the body of the paper, proteins are dubbed “lost” if they do not appear in the list of bi-directional best hits for the comparative analysis between S. cerevisiae and one of the four yeasts of comparison. The table in the paper was obtained for the dataset (III), containing both TAP-tag and two-hybrid interactions. The values for the separate data sets (I) and (II), made of two-hybrid and TAP-tag interactions respectively, are reported below.

                C. glabrata         K. lactis           C. glabrata         K. lactis
Connectivity    1 -> 3    4 -> 76   1 -> 3    4 -> 76   1 -> 5    6 -> 53   1 -> 5    6 -> 53
# Genes         1728      424       1728      424       1006      355       1006      355
# Lost          374       58        430       62        79        15        96        20
Fraction        22%       14%       25%       15%       8%        4%        10%       6%
Prob.           4.0 × 10^-4         2.1 × 10^-5         1.4 × 10^-2         1.7 × 10^-2

                D. hansenii         Y. lipolytica       D. hansenii         Y. lipolytica
Connectivity    1 -> 3    4 -> 76   1 -> 3    4 -> 76   1 -> 5    6 -> 53   1 -> 5    6 -> 53
# Genes         1728      424       1728      424       1006      355       1006      355
# Lost          636       121       767       157       206       35        286       53
Fraction        37%       29%       44%       37%       21%       10%       28%       15%
Prob.           4.9 × 10^-3         2.0 × 10^-2         1.0 × 10^-5         2.8 × 10^-6

Table S.1: The distribution of proteins “lost” in the comparative analysis between S. cerevisiae and the yeasts indicated in the first rows. The range of the protein connectivities in the two groups is indicated in the second row. The separation value was chosen as the average connectivity in the network.
The two halves of the table refer to the two-hybrid data set (I) and the TAP-tag data set (II), respectively. The third to fifth rows report the total number of proteins, the number of those that are lost and the corresponding fraction for the two groups. The last row reports the probability that the difference between the “loss” rates in the two groups is due to chance (see Methods).

3. Footprinting profiles

The footprinting profiles are aimed at quantifying the co-occurrence of couples of genes in the genomes of different species [10]. Each species contributes zero if the two genes are both “present” or both “absent”, and one otherwise. The footprinting distance between two genes is the sum of those quantities over the species under consideration (in our case, the four hemiascomycetes C. glabrata, K. lactis, D. hansenii and Y. lipolytica). We defined “presence/absence” of a gene via the list of bi-directional best hits. The higher similarity of the phylogenetic profiles of interacting proteins is evident for all data sets.

[Figures S1, S2 and S3: histograms of the footprinting distance (horizontal axis, 0 to 4) for the mixed, TAP-tag and two-hybrid data sets; each panel shows two series, “Interacting” and “All couples”.]

Figs. S1,2,3: The histograms of the footprinting distances are presented for couples of proteins linked in the protein interaction network of S. cerevisiae and for all possible couples of proteins of S. cerevisiae. The first figure refers to the data set (III) and the last to (I).

4. Co-evolution analysis for the separate data sets (I) and (II)

We report here the results on the co-evolution within cliques of interlinked proteins for the separate data sets. The two following tables correspond to Table 2 in the body of the paper and refer to data sets (I) and (II), made of two-hybrid and TAP-tag interactions. We limited the order of the cliques analyzed to 3 and 4, respectively.
For higher orders the number of proteins drops below a hundred and the statistics are not reliable.

                        C. glabrata                 K. lactis
# proteins interlinked  2             3             2             3
# groups                2073          776           1973          689
# proteins              1602          344           1151          322
z-score                 4.0 ± 0.3     4.6 ± 0.2     4.0 ± 0.3     5.5 ± 0.2
probability             3.2 × 10^-5   2.1 × 10^-6   3.2 × 10^-5   1.9 × 10^-8

                        D. hansenii                 Y. lipolytica
# proteins interlinked  2             3             2             3
# groups                1455          628           1141          484
# proteins              1196          261           1002          228
z-score                 2.3 ± 0.3     2.5 ± 0.2     2.9 ± 0.3     2.8 ± 0.2
probability             1.1 × 10^-2   6.2 × 10^-3   1.9 × 10^-3   2.6 × 10^-3

                        C. glabrata                               K. lactis
# proteins interlinked  2             3             4             2             3             4
# groups                2901          2129          1558          2842          2068          1501
# proteins              1246          588           297           1221          583           283
z-score                 6.3 ± 0.3     7.3 ± 0.2     5.5 ± 0.1     7.0 ± 0.3     7.6 ± 0.2     4.5 ± 0.1
probability             1.5 × 10^-10  1.4 × 10^-13  1.9 × 10^-8   1.3 × 10^-12  1.5 × 10^-14  3.0 × 10^-6

                        D. hansenii                               Y. lipolytica
# proteins interlinked  2             3             4             2             3             4
# groups                2486          1715          1126          2165          1505          1009
# proteins              1085          523           256           968           467           237
z-score                 8.2 ± 0.3     7.7 ± 0.2     4.0 ± 0.1     6.9 ± 0.3     6.2 ± 0.2     2.5 ± 0.1
probability             1.2 × 10^-16  6.8 × 10^-15  3.2 × 10^-5   2.6 × 10^-12  2.8 × 10^-10  6.2 × 10^-3

Tables S2, S3: Wilcoxon test for the dissimilarities observed in the normalized histograms corresponding to those in Fig. 1. The second and third rows report the total number of cliques and of proteins appearing therein. The last row reports the probability that the observed difference is due to chance. Its error bars, estimated from the standard deviation measured in 10,000 realizations of the random draws, are reported on the corresponding z-scores, i.e. the deviations from the mean normalized by the standard deviation, for the null Gaussian distribution in the Wilcoxon test.
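The relation between the z-scores and the probabilities can be illustrated with a short sketch. It assumes that each probability is the one-sided upper tail of a standard Gaussian evaluated at the quoted z-score. This reading is not stated explicitly in the caption, but it appears consistent with the tabulated values. The function name `upper_tail_probability` is introduced here for illustration only.

```python
import math


def upper_tail_probability(z: float) -> float:
    """One-sided tail P(Z >= z) of a standard normal variable.

    Assumption (not stated in the source): the tabulated probability is
    this upper tail, evaluated at the z-score of the Wilcoxon statistic.
    """
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# A few z-scores taken from the tables above.
for z in (2.5, 4.0, 5.5):
    print(f"z = {z:.1f}  ->  p = {upper_tail_probability(z):.1e}")
```

Under this assumption, z = 2.5, 4.0 and 5.5 give about 6.2 × 10^-3, 3.2 × 10^-5 and 1.9 × 10^-8, in line with the corresponding table entries. The error bars on z would translate into a range of probabilities in the same way.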
Normalized histograms of the evolutionary distance D for protein cliques

[Figure: four panels of normalized histograms; horizontal axis “Evolutive distance D”, vertical axis “Probability”.]

Caption: The four plots report the normalized histograms of the evolutionary distance D for cliques identified within the S. cerevisiae protein interaction network. The observable D, defined in the text, conveys information on the strength of multi-point correlations in the evolution rates of proteins within a clique: low values of D are the signature of strong co-evolutionary patterns. The curves refer to the comparison between S. cerevisiae and K. lactis. The curves for the original interacting cliques and for a randomized version thereof are shown in red and black. The two curves clearly intersect, with a visible accumulation of probability at small distances for the set of interacting proteins. The displacement to the right of the maximum of the curve follows from the definition of D and from the fact that the number of different couples within a clique grows with its order.
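The last remark rests on a simple count, written out here as a brief illustration. The symbol n for the clique order is notation introduced for this note and is not taken from the paper. The definition of D itself is given in the main text and is not reproduced here.

```latex
\[
  \#\{\text{couples in a clique of order } n\} \;=\; \binom{n}{2} \;=\; \frac{n(n-1)}{2},
  \qquad
  n = 2,\ 3,\ 4 \;\Longrightarrow\; 1,\ 3,\ 6 .
\]
```

Any observable that accumulates contributions from all couples in a clique would therefore be expected to take larger typical values as n increases; the exact dependence follows from the definition of D.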