GBE

The Complex Evolutionary Dynamics of Hsp70s: A Genomic and Functional Perspective

Jacek Kominek1,y, Jaroslaw Marszalek1,2,y,Ce´cile Neuve´glise3, Elizabeth A. Craig2,*, and Barry L. Williams4,5,* 1Laboratory of Evolutionary Biochemistry, Intercollegiate Faculty of Biotechnology, University of Gdansk, Kladki, Poland 2Department of Biochemistry, University of Wisconsin–Madison 3INRA Micalis UMR1319, Biologie Inte´grative du Me´tabolisme Lipidique Microbien, Thiverval-Grignon, France 4Department of Zoology, Michigan State University

5Department of Microbiology and Molecular Genetics, Michigan State University Downloaded from *Corresponding authors: E-mail: [email protected]; [email protected]. yThese authors contributed equally to this work. Accepted: November 20, 2013 http://gbe.oxfordjournals.org/

Abstract Hsp70 molecular chaperones are ubiquitous. By preventing aggregation, promoting folding, and regulating degradation, Hsp70s are major factors in the ability of cells to maintain proteostasis. Despite a wealth of functional information, little is understood about the evolutionary dynamics of Hsp70s. We undertook an analysis of Hsp70s in the fungal clade . Using the well-characterized 14 Hsp70s of Saccharomyces cerevisiae, we identified 491 orthologs from 53 genomes. Saccharomyces cerevisiae Hsp70s fall into seven subfamilies: four canonical-type Hsp70 chaperones (SSA, SSB, KAR, and SSC) and three atypical Hsp70s (SSE, SSZ, and LHS) that at University of Wisconsin-Madison on June 20, 2014 play regulatory roles, modulating the activity of canonical Hsp70 partners. Each of the 53 surveyed genomes harbored at least one member of each subfamily, and thus establishing these seven Hsp70s as units of function and evolution. Genomes of some species contained only one member of each subfamily that is only seven Hsp70s. Overall, members of each subfamily formed a monophyletic group, suggesting that each diversified from their corresponding ancestral gene present in the common ancestor of all surveyed species. However, the pattern of evolution varied across subfamilies. At one extreme, members of the SSB subfamily evolved under concerted evolution. At the other extreme, SSA and SSC subfamilies exhibited a high degree of copy number dynamics, consistent with a birth–death mode of evolution. KAR, SSE, SSZ, and LHS subfamilies evolved in a simple divergent mode with little copy number dynamics. Together, our data revealed that the evolutionary history of this highly conserved and ubiquitous protein family was surprising complex and dynamic. Key words: molecular chaperones, multi gene family, gene duplication, birth-and-death, concerted evolution, Ascomycota genomes.

Introduction and protein homeostasis, it is not surprising that Hsp70 chap- Hsp70s, ancient and taxonomically ubiquitous molecular erone systems have been connected to a number of disease chaperones, form a protein family whose members function states, including protein misfolding diseases (Broadley and in all major cellular compartments, participating in an exten- Hartl 2009). sive array of cellular processes. Hsp70s transiently interact with Despite functional diversification among Hsp70s, funda- a wide variety of client proteins, impinging on virtually all mental sequence and structural features have been con- stages of a protein’s lifetime (Kampinga and Craig 2010; served. All Hsp70s possess a highly conserved, regulatory Kim et al. 2013). They promote protein folding, prevent pro- N-terminal nucleotide-binding domain (NBD) of approximately tein aggregation, foster protein degradation, disassemble pro- 40 kDa and a less well conserved, approximately 25 kDa, tein complexes, and modulate protein–protein interactions. C-terminal substrate-binding domain (SBD), which contains Given their diverse and central functions in cell physiology the binding site for client proteins (Mayer and Bukau 2005).

ß The Author(s) 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

2460 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

At its core, Hsp70 chaperone activity is characterized by a Gupta et al. 1997; Germot and Philippe 1999; Bettencourt cyclic bind and release interaction with client proteins, which and Feder 2001; Lin et al. 2001; Nikolaidis and Nei 2004; is regulated by the ATP binding and hydrolysis activity of the Brocchieri et al. 2008). Thus, sequence data available for anal- NBD. A conformational change generated upon ATP hydroly- ysis were limited. The recently completed genome sequencing sis stabilizes Hsp70s’ interaction with its client protein; for 53 species of the ancient fungal clade Ascomycota pro- exchange of ADP for ATP promotes client release and com- vides a unique opportunity to determine patterns of gene gain pletes the binding cycle. and gene loss, as well as mechanisms of gene family evolution Hsp70s do not function alone, but rather as part of com- (Dujon 2010). Genomic sequence divergence between the plex machinery. The client protein-binding and release cycle distantly related Ascomycota species S. cerevisiae and requires the activity of co-chaperones: J-proteins and nucleo- Schizosaccharomyces pombe is nearly as great as that tide release factors. J-proteins stimulate Hsp70s’ ATPase activ- between either of these fungi and human, thus providing a ity, promoting client binding. Nucleotide release factors deep evolutionary history for examination (Taylor and Berbee Downloaded from stimulate release of ADP, and thus facilitating ATP binding 2006). Advantageously, Ascomycota species have a small and release of client, and as a result promoting initiation of genome size, relatively complete genome sequence assem- a new Hsp70–client interaction cycle. Those Hsp70s capable blies are available due to low numbers of repetitive sequences of performing the full bind and release cycle with client pro- in these species, and synteny data are available for many teins are termed “canonical.” Despite sequence and structural clades. http://gbe.oxfordjournals.org/ similarity with canonical Hsp70s, some members of this family We found that the Hsp70s present in Ascomycota comprise have a limited ability to carry out the full binding and release seven monophyletic subfamilies that corresponded with pre- cycle, and are termed “atypical” (Shaner and Morano 2007). vious functional categorization in S. cerevisiae basedonsub- Atypical Hsp70s are known to act as co-chaperones for spe- cellular localization and function. All genomes harbored at cific canonical Hsp70 partners, acting as nucleotide release least one member from each subfamily, indicating that factors themselves or as modulators of J-protein co-chaperone seven homologs constitute the minimal number of Hsp70s activity. required for proper cell function in Ascomycota. However, Although Hsp70s are among the most studied genes at the

despite the uniformity of subfamily composition, dramatic at University of Wisconsin-Madison on June 20, 2014 molecular and cellular level, their evolutionary history is not differences with respect to gene copy number dynamics and well understood. It is known that Hsp70 gene copy number is the evolutionary mechanisms behind duplicate gene retention dynamic among eukaryotes species, for example, 14 in were uncovered within each subfamily. Saccharomyces cerevisiae,14inCaenorhabditis elegans, 11 in Drosophila melanogaster, and 17 in Homo sapiens (Daugaard et al. 2007). The most detailed functional and Materials and Methods evolutionary data for eukaryotes are available for S. cerevisiae Identification of Hsp70 Orthologs (Kampinga and Craig 2010). In this species, the ten canonical members of the Hsp70 family form four subfamilies, which Genomic data were obtained from the NCBI Genbank, were originally designated based on a combination of cellular BROAD Institute, Saccharomyces Genome Database, location, genetic analysis, and rudimentary sequence compar- Genolevures, DOE Joint Genome Institute, EBI Integr8, isons (Boorstein et al. 1994). Two subfamilies are present in Ensembl Genomes, Yeast Gene Order Browser, Genoscope, the cytosol: four paralogous SSA proteins reside in the nu- Sanger Institute, the Podospora anserina genome project, the cleus, as well as the cytosol; two paralogous SSB proteins Saccharomyces sensu-stricto resequencing efforts (Scannell bind cytosolic ribosomes near the polypeptide exit site. Two et al. 2011), and the lab of Ce´cile Neuve´glise (Neuve´glise C, subfamilies are strictly localized within organelles: A single unpublished data). Contig, scaffold, and open reading frame KAR protein is found in the lumen of the endoplasmic reticu- (ORF) data for each species were examined in local Blast lum (ER); three paralogous SSC proteins localize within the databases (Altschul et al. 1990). To assess the presence or mitochondrial matrix. Atypical S. cerevisiae Hsp70s form absence of Hsp70 orthologs in a species, BlastP and TBlastN three subfamilies: A single LHS protein functions as a nucleo- searches were performed using protein sequences of S. cere- tide release factor for KAR in the lumen of the ER (Steel et al. visiae Hsp70s:Ssa1p (YAL005C), Ssa2p (YLL024C), Ssa3p 2004); two paralogous SSE proteins are nucleotide release (YBL075C), Ssa4p (YER103W), Ssb1p (YDL229W), Ssb2p factors for both SSA and SSB proteins (Raviol et al. 2006); a (YNL209W), Kar2p (YJL034W), Ssc1p (YJR045C), Ssc3p/ single SSZ protein forms a heterodimer with the J-protein Ecm10p (YEL030W), Ssq1p (YLR369W), Sse1p (YPL106C), co-chaperone of SSB proteins, Zuo1, regulating its ability to Sse2p (YBR169C), Ssz1p (YHR064C), and Lhs1p (YKL073W) stimulate SSB’s ATPase activity (Huang et al. 2005). as queries. The genomic locations of the best hits obtained Previous studies aimed at understanding the evolution- using BlastP were cross-validated with the corresponding lo- ary diversification of the Hsp70 superfamily predate the cation of the best hits in raw genomic data obtained using genomic revolution (Hughes 1993; Gupta and Singh 1994; TBlastN. If a high-scoring hit could not be identified within the

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2461 Kominek et al. GBE

annotated ORFs, predicted ORFs were annotated manually group A constituted of sequences of three canonical Hsp70s using ORF finder from highest scoring genomic regions subfamilies (SSA, KAR, and SSC). All multiple sequence align- (http://www.ncbi.nlm.nih.gov/gorf/gorf.html, last accessed ments were manually edited to remove sites containing gaps. December 11, 2013). Subcellular localization of the analyzed Next, sequences from the analyzed group B, were added to Hsp70 homologs was predicted using the TargetP v1.1b soft- the group A alignment using the “–add” option in MAFFT v7 ware (Emanuelsson et al. 2007). (Katoh and Standley 2013). This option kept the group A alignment unchanged, with the exception of group-wide Sequence Alignments and Phylogenetic Reconstruction indels, as the group B sequences were aligned to it. Sequence positions in divergence analyses were standardized Nucleotide and amino acid sequences of the Hsp70 homologs relative to analyzed group B sequences from S. cerevisiae sub- were used to construct multiple sequence alignments using family homologs. Sites from each group were classified as MAFFT v7 (Katoh and Standley 2013). Alignments generating either invariant, while also allowing for a single amino acid using the E-INS-i, L-INS-i, and G-INS-i algorithms and the auto Downloaded from variant within a group, or variable. Sites were classified as option were compared, and the alignment with the largest conserved were invariant across both groups whereas relaxed number of nongap positions was selected for further analyzes. sites were invariant in reference group A, but variable in an- Phylogenetic analyses were carried out using both maximum alyzed group B, fixed; sites were variable in reference group A likelihood (ML) and Bayesian inference (BI) methods. The ML and invariant in analyzed group B, and switched sites were analyzes were performed using RAxML v7.4.2 (Stamatakis http://gbe.oxfordjournals.org/ invariant within each group, but contained different amino 2006) with 100 rapid bootstrap replicates, under the LG acids. Sites with gaps or those that were variable in both model of amino acid substitution with empirical amino acid groups were ignored. Sites from the fixed and switched clas- frequencies and 4 gamma distribution rate categories to esti- ses were further subclassified as conservative or radical, based mate rate heterogeneity (LG + G + F) (Le and Gascuel 2008), on the BLOSUM62 matrix. If the BLOSUM62 substitution score which was estimated as the best-fitting substitution model by for the two most prevalent amino acids in group A and group ProtTest v3 (Darriba et al. 2011). The BI analyses were per- B was strictly larger than zero, the site was classified as formed using MrBayes v3.2 (Ronquist et al. 2012) under the conservative. If the score was zero or negative, the site was mixed protein model of amino acid substitution and assuming at University of Wisconsin-Madison on June 20, 2014 classified as radical. All analyzes of sequence divergence were a gamma distribution of rate variation across sites. The performed using scripts written in BioPython (Cock et al. Bayesian analyses were performed with two independent 2009). runs, each with four chains, sampling every 100 generations. The analyses were run for 4,000,000 generations and the first 25% of sampled trees were discarded as burn-in before esti- Codon Usage Bias mation of posterior probabilities for branch support. The cal- Codon usage bias was estimated using the effective number culations were accelerated using the BEAGLE library for of codons (ENC) (Wright 1990), and the Codon Adaptation Statistical Phylogenetics (Ayres et al. 2012). Validation that Index (CAI) (Sharp and Cowe 1991). Both statistics were the MCMC chains reached stationarity was validated through calculated using the CodonW software (Peden 1999)for a combination of PRSF values near 1.0 from the saved gener- orthologs from five closely related Saccharomyces sensu- ations, standard deviation of the split frequencies below a stricto species (S. cerevisiae, S. paradoxus, S. mikatae, value of 0.001 from the saved generations, and the two S. kudriavzevii,andS. uvarum). CAI values were calculated independent runs converged on the same tree topology. using the reference set of optimal codons from S. cerevisiae.

Sequence Analyses Evolutionary Analyses Sequence identity levels were calculated as the arithmetic Synonymous (dS) and nonsynonymous (dN) substitution mean of all pairwise sequence comparisons in the untrimmed rates for the 14 Hsp70 orthologs from five closely related multiple sequence alignment of a given Hsp70 subfamily. Saccharomyces sensu-stricto species (S. cerevisiae, S. para- Amino acid usage was measured as the number of different doxus, S. mikatae, S. kudriavzevii,andS. uvarum) were deter- amino acids across all nongap sites in the multiple sequence mined using the ML approach as implemented in the PAML alignment of a given Hsp70 subfamily. Analysis of amino acid package v4.7 (Yang 2007) using the codeml program. sequence divergence of a given Hsp70 subfamily was per- AsingledN/dS value was assumed for the entire tree obtained formed by comparing two groups of Hsp70 sequences: with ML methods and mean dN and dS rates were obtained group A (reference group) and group B (analyzed group). In by averaging them across all branches of the tree. In all PAML the case of analyses comparing members of the SSQ1 and analyses, the codon frequencies were averaged from nucleo- SSC3 subfamilies, the sequences from group A and group B tide frequencies at the three codon positions (codon fre- were aligned together. In the cases of analyses comparing quency model F3x4). To avoid falling into suboptimal members of the SSE, SSZ, and LHS subfamilies, the reference likelihood peaks and ensure proper convergence, the analyses

2462 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

were carried out multiple times, using various initial settings of motifs using the dna-pattern program from the Regulatory the estimated parameters (dN/dS ratio and transition to trans- Sequence Analysis Tools (Thomas-Chollier et al. 2008). version rate kappa, k). Initial values of 0.1, 0.5, 1, and 5 were Motifs with six or fewer bases in length were required to used for both parameters in all 16 different combinations and perfectly match the searched sequence, whereas motifs of the final likelihood values from all runs were required to be more than six bases in length were allowed a single base identical. Statistical significance at P ¼ 0.05 for differences in mismatch within the searched sequence. codon usage and evolutionary rates was performed using the R package (R Development Core Team 2013), where the dis- tributions of parameter estimates were tested for nonnormal- Results and Discussion ity using the Shapiro–Wilk test and the difference between Seven Distinct Hsp70 Subfamilies Are Conserved groups was assessed using the Kolmogorov–Smirnov test. among Ascomycota Downloaded from Species Phylogeny As a first step in our evolutionary analysis of Hsp70s in Ascomycota, we sought to identify all Hsp70 genes in 53 Species phylogeny was based on Fitzpatrick et al. (2006) for Ascomycota genomes. Using the sequences of the 14 S. cer- the Basidiomycota and Pezizomycotina, Rhind et al. (2011) for evisiae Hsp70s in Blast searches, we identified 491 Hsp70s the Schizosaccharomyces, Maguire et al. (2013) for Candida, homologs. Thirty Hsp70 genes were also identified in 4 and Gordon et al. (2009) for the Saccharomycetaceae.For Basidiomycota genomes for use as an out-group in our protein http://gbe.oxfordjournals.org/ species closely related to Yarrowia lipolytica, a species tree phylogenetic analysis (fig. 1). We resolved orthologous rela- was deduced from the alignment of 912 single copy pro- tionships among Hsp70s through analysis of the reciprocal tein-coding genes (398,959 residues). Individual gene align- best Blast scores and the evolutionary position within the phy- ments for the 912 orthologs was performed with MUSCLE logeny. In addition, for groups of closely related species, we (Edgar 2004) edited using Gblocks (Castresana 2000)conca- verified orthology through validation of syntenic relationships tenated, and the phylogenetic tree was estimated by ML using among genes. All identified Hsp70 genes were clearly homol- PHYML v3.0 (Guindon and Gascuel 2003) assuming a JTT ogous to the well-characterized 14 Hsp70 genes present in substitution model with gamma-distributed rate variation S. cerevisiae (fig. 1). For clarity throughout this report, we at University of Wisconsin-Madison on June 20, 2014 and a proportion of invariant sites estimated from the data. designate identified Hsp70 genes in accordance with the names of their S. cerevisiae homologs, with the specific geno- Synteny Analyses mic annotations for each identified Hsp70 gene provided in Synteny of Hsp70 orthologs was determined using the Yeast supplementary figure S1, Supplementary Material online. Gene Order Browser (Byrne and Wolfe 2006)for Invariably, we found that each surveyed genome harbored Saccharomycetaceae species and the Candida Gene Order homologs of all seven Hsp70 subfamilies that correspond to Browser (Fitzpatrick et al. 2010) for CTG group species. In those recognized in S. cerevisiae, that is both the four canon- Schizosaccharomyces and Pezizomycotina,syntenywasas- ical subfamilies (SSA, SSB, KAR, and SSC) and the three atyp- sessed by manual annotation and homology comparisons for ical subfamilies (SSE, SSZ, and LHS). Genomes from 16 of the 20 genes upstream and downstream of each Hsp70 gene. In 53 analyzed Ascomycota contained only seven Hsp70 genes, the Yarrowia clade, the genomic location of ten genes up- in every case one from each subfamily. For the 37 species stream and downstream of each Hsp70 gene was compared having more than seven Hsp70s, gene copy number and evo- with Y. lipolytica. If synteny was not conserved, then the same lutionary dynamics varied significantly among Hsp70 subfami- analyses were performed with all other Yarrowia species. lies. As discussed later, some subfamilies have a stable number of paralogous copies, particularly when they resulted from Transcription Factor-Binding Sites Analysis genome-scale duplication events. Other subfamilies have Upstream (50-intergenomic sequences [50-IGS]) and down- highly variable gene copy number, resulting from independent stream (30-IGS) intergenic sequences of SSB orthologs from taxon-specific gene duplications. Regardless of their evolution- post-whole-genome duplication (WGD) species were ob- ary history, each subfamily was recovered as a monophyletic tained from the Yeast Gene Order Browser for S. cerevisiae, group with strong branch support on the phylogenetic tree, S. mikatae, S. kudriavzevii,andS. uvarum.The50-IGS and regardless of whether nucleotide or amino acid sequence 30-IGS of S. paradoxus were extracted directly from the geno- data sets were examined (fig. 1). Phylogenetic analyses also mic data, based on annotation of the surrounding genes. indicated that within each subfamily, gene trees generally Motifs for 16 transcription factor (TF) binding sites identified recapitulated the hypothesized species trees (figs. 1 and 2; previously as present in the promoters of SSB1 and SSB2 supplementary figs. S2–S7, Supplementary Material online). genes of S. cerevisiae were obtained from the study of The invariant presence of homologs of the seven subfami- (MacIsaac et al. 2006). All 50-IGS of SSB1 and SSB2 orthologs lies suggests that they comprised the ancestral combination from the five post-WGD species were then searched for these present in the common ancestor of Ascomycota, and that the

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2463 Kominek et al. GBE

AB *

** Downloaded from

** http://gbe.oxfordjournals.org/

** at University of Wisconsin-Madison on June 20, 2014

**

**

**

FIG.1.—Phylogenetic distribution, copy number dynamics, and protein phylogeny of Hsp70s in fungi. (A) Phylogenetic distribution of the Hsp70 orthologs from 7 subfamilies across the surveyed fungal species. Number of orthologs identified for each Hsp70 subfamily is indicated; in parenthesesthe number of orthologs including putative orthologs (pseudogenes or partial hits due to low quality genomic data (supplementary fig. S1, Supplementary Material online). The SSC column includes orthologs of SSC3 and SSQ1 genes. The species tree was based on various sources (see Materials and Methods for details). (B) Maximum-likelihood tree of amino acid sequences from all identified Hsp70 orthologs. Scale is in expected amino acid substitutions per site. **Bootstrap 95, *bootstrap 70. seven S. cerevisiae subfamilies have been experimentally dem- gene family (Boorstein et al. 1994; Daugaard et al. 2007). onstrated to comprise functionally distinct groups based on Thus, the seven Hsp70 subfamilies most likely constitute both subcellular localization and phenotypic effects of muta- units of function and evolution conserved across surveyed tions (Kampinga and Craig 2010) supports this idea. This fungal genomes. To more thoroughly examine evolutionary conclusion is also consistent with the previously postulated patterns and modes of sequence evolution within each of evolutionary basis for functional diversification of Hsp70 these Hsp70 subfamilies, we analyzed them individually by

2464 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

AB Downloaded from http://gbe.oxfordjournals.org/

C at University of Wisconsin-Madison on June 20, 2014

D

FIG.2.—Protein phylogeny and copy number dynamics of the SSA subfamily. (A) Bayesian tree of amino acid sequences from the SSA subfamily. Scale is in expected amino acid substitutions per site. **Posterior probability 0.95, *posterior probability 0.7. (B) Cladogram showing the hypothetical scenario of SSA evolution in the Yarrowia clade. Gene copies named A–G exhibit synteny across at least two species, gene copies named X1–X6 are taxon specific and gene copies named “p” are putative pseudogenes. (C) Cladogram showing the hypothetical scenario of SSA evolution in the CTG and Saccharomycetaceae clades, gene copies named X1 and X2 are taxon specific. (D) Copy number dynamics in the SSA subfamily. Number of species for each clade indicated in parenthesis.

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2465 Kominek et al. GBE

generating separate sequence alignments for each. This ap- by five ancestral duplications in combination with six lineage- proach led to more evolutionarily informative sites for each specific gene duplications and two losses—A total of 13 subfamily, and thus facilitated the more detailed studies events (fig. 2B and D). Interestingly, one of the ancestral du- described in the following sections. plications, which took place before the divergence of Y. lipolytica, Y. yakushimensis,andC. galli, generated a sub- Canonical SSA Hsp70s: Dynamic Copy Number Evolution telomeric copy that was further propagated by either a recip- rocal or nonreciprocal translocation to another subtelomere Among the 53 Ascomycota species surveyed, we identified (SSA-D and SSA-F). Additionally in C. galli, a third subtelomeric 117 homologs of SSA with two distinct patterns of gene copy (CgalSSA-X6) was identified, corresponding to a species- copy number evolution (figs. 1 and 2A). The 17 species of specific duplication. Based on this analysis, we conclude that Pezizomycotina all harbored a single SSA ortholog, whereas lineage-specific gains and losses were exceptionally common the 36 species belonging to Schizosaccharomyces, Yarrowia, for SSA homologs within the Yarrowia clade. Downloaded from CTG group, and Saccharomycetaceae had more than one SSA To explain even more complex copy number dynamics gene. The copy number varied in a taxon-specific manner, exhibited by species in the Saccharomycetaceae clade, we with individual genomes harboring between two and seven combined phylogenetic and synteny data with published SSA genes (figs. 1 and 2A). The pattern of SSA copy number genomic analyses (Gordon et al. 2009).Accordingtoourhy- variation is consistent with the occurrence of several gene pothetical scenario, a gene duplication that took place in the http://gbe.oxfordjournals.org/ duplication-plus-retention events (fig. 2). In some cases, a ancestor of Saccharomycetaceae accounted for formation of rather straightforward scenario can be envisioned. For exam- the gene pair orthologous to SSA2 and SSA3 of S. cerevisiae. ple, two independent ancestral gene duplications, each of Next, both the SSA2 and the SSA3 genes were duplicated in which took place in a common ancestor of either the CTG the WGD event (Wolfe and Shields 1997) ancestral to the group or Schizosaccharomyces, is sufficient to explain the pair S. cerevisiae lineage (fig. 2C). The post-WGD fate of SSA3 of SSA copies present in species belonging to these clades. was straightforward; the SSA3/SSA4 pair of ohnologs was While a single gene loss that took place in a species specific retained in most post-WGD species, with only C. glabrata lineage within the CTG-group, could explain the presence of

experiencing a loss of SSA3. The evolutionary history of at University of Wisconsin-Madison on June 20, 2014 a single SSA copy in Meyerozyma guilliermondi. However, SSA2 genes in post-WGD species is complex. Both phyloge- as discussed later, SSA evolution in the Yarrowia and netic and synteny data are consistent with the occurrence of Saccharomycetaceae clades exhibited very complex SSA at least three independent loss events (fig. 2C), resulting in gene dynamics. only one SSA2 gene copy being maintained in each species. Copy number in the Yarrowia clade ranges between two However, subsequently two independent duplications of and seven among the six species analyzed (fig. 2). The precise SSA2 gene took place. One was in the ancestor of the branching order among paralogs was not fully resolved, likely Saccharomyces sensu-stricto clade, which resulted in forma- because the number of informative amino acid sites was rel- tion of the SSA2/SSA1 paralogous gene pair currently present atively low (supplementary fig. S8, Supplementary Material in S. cerevisiae and closely related species. An independent, online). Moreover, a strict interpretation of copy number dy- species specific, gene duplication is sufficient to explain the namics based on the Bayesian gene tree resulted in a large presence of two SSA2 copies in Vanderwaltozyma polyspora number of gene retention and loss events. Therefore, we con- (called SSA2 and SSA-X2 in fig. 2C). Altogether, based on sidered alternative scenarios. Using synteny data to determine these analyses we concluded that SSA subfamily experienced the orthologous relationships among the SSA genes within highly dynamic copy number evolution driven by genomic this clade, we identified seven loci (marked as SSA-A to and/or gene-specific gain and loss events. SSA-G, fig. 2B) conserved across at least two Yarrowia species. These loci allowed us to determine orthologous relationships among 25 of the 31 SSA genes. For the remaining 6 genes Canonical SSA Hsp70s: Low Sequence Divergence (marked as SSA-X1 to SSA-X6, fig. 2B), we were unable to In contrast to the high copy number dynamics, sequence identify syntenic conservation; thus their orthologous relation- evolution of SSA genes was very slow, as illustrated by the ships remained uncertain. To reconstruct an evolutionary sce- short branches of the protein tree (figs. 1 and 2A)andthehigh nario that could explain the observed gene copy number level of mean pairwise amino acid sequence identity (~82%, dynamics, we placed 25 SSA genes for which orthologous table 1). To further characterize the mode of sequence evolu- relationships were determined onto the species tree. Next, tion, we analyzed the distribution of amino acid usage (u), we added the remaining 6 SSA genes, for which orthologous defined as the number of different amino acids observed relationships were not resolved, in a way that minimized per site in the sequence alignment (Breen et al. 2012). As the number of necessary evolutionary events (i.e., gene dupli- the amino acid usage can be used as a proxy for the substi- cation and gene loss) (fig. 2B). The results indicated that SSA tution rate (Breen et al. 2012), we fit a gamma distribution to copy number dynamics in Yarrowia cladecouldbeexplained thedata(fig. 3A). The gamma distribution has the shape

2466 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

Table 1 A Global Analysis of Hsp70 Subfamilies in Ascomycota α α Subfamily Sequence Statistics Cellular Localization α No. of Lengthb Identityc α Orthologsa α Canonical α SSA 117 645 þ 5 81.68 þ 4.81 Cytosol α SSB 61 613 þ 1 80.45 þ 8.80 Cytosol α KAR 53 674 þ 9 72.46 þ 6.99 ER SSC 53 658 þ 13 74.51 þ 8.08 Mitochondria SSQ 26 648 þ 8 59.63 þ 10.64 Mitochondria Atypical Downloaded from SSE 60 703 þ 18 55.62 þ 13.13 Cytosol SSZ 54 550 þ 15 49.44 þ 13.40 Cytosol LHS 52 947 þ 61 26.42 þ 12.76 ER

aNumber of identified orthologs, excluding putative orthologs and sequences B with unique internal deletions. bMean sequence length þ SD.

cMean pairwise amino acid sequence identity þ SD. http://gbe.oxfordjournals.org/ parameter a, which specifies the range of rate variation among sites. An a value lower than 1 indicates an L-shaped distribution, where most sites are invariable. Conversely, an a greater than 1 indicates a bell-shaped distribution, with the FIG.3.—Amino acid usage among Hsp70 subfamilies. (A) Distribution majority of sites having higher u values. of sites with indicated amino acid usage, defined as the number of differ- For SSA homologs, the distribution of amino acid usage ent amino acids observed per site in a multiple sequence alignment. (Inlet) at University of Wisconsin-Madison on June 20, 2014 was markedly L shaped (a ¼ 0.50, fig. 3A), indicating that Values of the shape parameter a of the gamma distribution fitted to the formostsitestheu value was very low (supplementary amino acid usage data for each Hsp70 subfamily, as indicated. fig. S9, Supplementary Material online). For example, for (B) Maximum amino acid usage covering the indicated proportion of 50% of sites u was equal to 1, indicating that those sites sites in each Hsp70 subfamily. were invariant; for 70% of sites u was lower or equal to 2 showing that, at most, one alternative amino acid was present (table 2). In contrast, the percentage of invariant codons in the sequence alignment (fig. 3B). These observations are was approximately 2-fold lower for the SSA3 and SSA4 pair consistent with a high level of constraint of the sequence and they exhibited much less biased codon usage. The higher variation, caused by strong purifying selection. sequence divergence of SSA3/SSA4 pair could have resulted To test whether the SSA sequences were subjected to the from longer evolutionary history due to their earlier origin same evolutionary selection pressures over a short timescale, (fig. 2C), whereas their higher codon bias could be explained we analyzed synonymous (dS) and nonsynonymous (dN)sub- by their relatively low expression levels observed under optimal stitution rates for a set of five species closely related to growth conditions (table 2). However, upon stress conditions S. cerevisiae (table 2) for which detailed genomic analysis or during transition to the stationary phase (the diauxic shift) areavailable(Scannell et al. 2011). As dS values were not all four SSA paralogs are expressed at comparable level saturated for any Hsp70 sequence (dS 1) in this data set, (Werner-Washburne et al. 1989), indicating that more com- we were able to calculate dN/dS ratios for all members of all plex mechanisms could be responsible for the observed differ- Hsp70s subfamilies (table 2). Each of the five analyzed ge- ences in codon bias. Yet, overall, our data suggest that at both nomes harbors four paralogous copies of SSA genes, with long- and short-term evolutionary timescales, members of SSA3 and SSA4 originating from the WGD and SSA1 and SSA subfamily evolved very slowly. SSA2 from the subsequent lineage-specific gene duplication We also examined whether concerted evolution, driven by event (fig. 2). The two pairs of SSA paralogs share very low dN/ gene conversion events, could explain the high sequence iden- dS values (table 2). In fact their dN/dS values were from 4- to tity observed among SSA subfamily members. In cases of con- 10-fold lower than those estimated for average protein coding certed evolution paralogous sequences from within a species gene harbored by the five analyzed species (Scannell et al. form sister branches on a gene tree, indicating that they are 2011). Yet, the two pairs of SSA paralogs differ markedly in more similar to each other than to their orthologs from other their sequence divergence, with SSA1 and SSA2 sequences species (Dover 1982). Such signs of concerted evolution were sharing a high percentage of invariant codons (56% and detected in only two cases for SSA subfamily members (three 67%, respectively) and a highly biased codon usage SSA paralogs in Y. yakushimensis and three paralogs in

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2467 Kominek et al. GBE

Table 2 Codon Usage and Evolutionary Rates of the Hsp70 Genes in Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii,andS. uvarum Hsp70 Codon Usage Protein Abundanced Evolutionary Rates

Totala Invariantb % Invariant CAIc ENC dN dS dN/dS SSA1 642 361 56.2 0.65 þ 0.06 29.85 þ 1.34 8,178 0.003 þ 0.002 0.259 þ 0.169 0.013 SSA2 639 426 66.7 0.78 þ 0.02 27.40 þ 0.43 7,986 0.002 þ 0.001 0.234 þ 0.110 0.010 SSA3 647 188 29.1 0.19 þ 0.01 47.29 þ 3.20 290 0.007 þ 0.003 0.292 þ 0.127 0.025 SSA4 642 175 27.3 0.21 þ 0.03 43.09 þ 1.57 475 0.006 þ 0.004 0.331 þ 0.220 0.019 SSB1 613 420 68.5 0.80 þ 0.02 27.57 þ 0.71 2,320 0.002 þ 0.001 0.186 þ 0.122 0.010 SSB2 613 405 66.1 0.77 þ 0.02 27.83 þ 0.40 2,791 0.002 þ 0.001 0.176 þ 0.083 0.012 KAR2 682 300 44.0 0.45 þ 0.04 35.81 þ 0.74 523 0.004 þ 0.003 0.238 þ 0.145 0.018 SSC1 651 285 43.8 0.51 þ 0.02 35.05 þ 1.36 709 0.004 þ 0.003 0.257 þ 0.186 0.016 Downloaded from SSQ1 654 185 28.3 0.14 þ 0.01 49.93 þ 2.62 26.4 0.014 þ 0.006 0.307 þ 0.123 0.045 SSC3 (d)e 643 181 28.1 0.20 þ 0.01 47.50 þ 0.92 86 0.027 þ 0.011 0.372 þ 0.151 0.073 SSC3 (c)f 630 281 44.6 0.52 þ 0.04 34.24 þ 0.31 n/a n/a n/a n/a SSE1 693 296 42.7 0.55 þ 0.05 34.04 þ 1.70 1,959 0.007 þ 0.004 0.289 þ 0.178 0.025 SSE2 693 180 26.0 0.20 þ 0.01 48.91 þ 1.62 155 0.017 þ 0.009 0.319 þ 0.170 0.053

SSZ1 538 186 34.6 0.44 þ 0.05 36.02 þ 2.09 1,040 0.011 þ 0.007 0.274 þ 0.194 0.038 http://gbe.oxfordjournals.org/ LHS1 881 191 21.7 0.15 þ 0.01 50.69 þ 2.08 20.3 0.038 þ 0.022 0.302 þ 0.179 0.125

NOTE.—n/a, not applicable. aTotal number of aligned codons. bNumber of identical codons in the analyzed species. cCAI shown as mean þ SD, for statistical significance see supplementary figure S10, Supplementary Material online. dProtein abundance in parts-per-million (Wang et al. 2012). eSSC3 (d)—divergent SSC3 orthologs from S. cerevisiae, S. paradoxus, S. kudriavzevii,andS. uvarum, that form a separate clade on the phylogenetic tree of the SSC subfamily (supplementary fig. S4, Supplementary Material online). fSSC3 (c)—gene-converted SSC3 orthologs from Candida glabrata and Vanderwaltozyma polyspora, that cluster with SSC1 orthologs from their respective species on the phylogenetic tree of the SSC subfamily (supplementary fig. S4, Supplementary Material online). at University of Wisconsin-Madison on June 20, 2014

C. galli), both of which belong to the Yarrowia clade (supple- SSA2 copies differ in their ability to assist in protein transloca- mentary fig. S8, Supplementary Material online). Interestingly, tion from the cytosol into the ER and/or vacuole (Deshaies all but two of these genes (YyakSSA-E, YgalSSA-E) are located et al. 1988; McClellan et al. 1998; Brown et al. 2000), as in the subtelomeres. In other species, including five closely well as in prion propagation (Sharma and Masison 2011). related to S. cerevisiae, the branching order of the SSA sub- Similarly, data for Y. lipolytica indicate functional diversifica- family gene tree recapitulated the topology of the species tree tion among four SSA paralogs within this species (Sharma (figs. 1 and 2), indicating that they indeed evolved according et al. 2009). to a divergent mode of sequence evolution under purifying Dynamic copy number evolution of the SSA subfamily is selection. not only restricted to fungi, but also has been observed in Overall, the pattern of SSA evolution is consistent with the Drosophila and mammals (Bettencourt and Feder 2001). birth-and-death evolutionary scenario for gene family evolu- In contrast to fungi, the SSA genes from Drosophila evolved tion (Nei and Rooney 2005), wherein gene duplicates are in a concerted manner via gene conversion (Bettencourt and functionally equivalent, so that retention and loss of paralogs Feder 2002). The authors hypothesized that multiplication of is a stochastic process. We hypothesized that harboring mul- SSA genes in Drosophila species may be adaptive because the tiple paralogs with partially overlapping functions may provide resulting increased expression provides resistance to a variety increased robustness of this major cytosolic chaperone system of types and severity of stresses. They also note, however, that both under physiological and stress conditions. Consistent other species of Drosophila have achieved high Hsp70 expres- with this hypothesis, functional data from S. cerevisiae indicate sion and stress resistance with a limited number of SSA genes. that members of SSA subfamily form a dynamic functional Thus, there is no simple functional explanation for the multiple network wherein depletion of one paralog leads to compen- copies of SSA harbored by Drosophila genomes. In humans, satory induction of another (Werner-Washburne et al. 1987). six paralogous members of SSA subfamily exhibit complex Yet, despite a high degree of sequence identity relatively to pattern of evolution and functional diversification with typical protein coding genes, members of the SSA subfamily approximately half of them being stress inducible and the also have distinct functional properties. In S. cerevisiae SSA other half expressed constitutively (Daugaard et al. 2007). paralogs differ in patterns of expression, with SSA1/SSA2 Two of the stress inducible genes (HSPA1A/B) are evolving being expressed constitutively while SSA3/SSA4 induced by by concerted evolution, driven by high GC content and stress (Werner-Washburne et al. 1987). Moreover, SSA1/ biased gene conversion (Kudla et al. 2004). In summary, the

2468 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

slow rate of sequence evolution on a global scale that is char- However, SSB evolution was clearly more complicated than acteristic for SSA homologs could be explained by purifying can be accommodated by the simple form of dosage hypoth- selection due to strong functional constraints imposed by the esis. Interestingly, the nucleotide sequences just outside of the multiple roles they perform in the cytosolic/nuclear compart- SSB1/SSB2 coding sequences are highly variable within species ment of all eukaryotic cells. However, understanding the func- and the phylogeny of their 5’- and 3’-IGS exhibit a branching tional variation among SSA lineages also requires comparisons pattern characteristic of divergent evolution (fig. 4A). To of gene-specific evolutionary patterns among closely related examine whether SSB paralogs exhibit evidence of potential species. Further studies combining evolutionary analysis with divergent transcriptional profiles, we searched for putative TF- experimental approaches will likely be key in answering the binding sites in 5’-IGS regions of SSB1/SSB2 genes from question of whether SSA paralogs have diversified functionally five species closely related to S. cerevisiae. We focused on or whether their different expression patterns evolved to allow S. cerevisiae TFs for which functional data are available for differences in regulation of proteins having the same (MacIsaac et al. 2006). Among 16 identified TF-binding sites Downloaded from biochemical function (Harms and Thornton 2013). (supplementary fig. S11, Supplementary Material online), we observed two distinct patterns of presence/absence within Canonical SSB Hsp70s: Stable Copy Number with the5’-IGSsequencesofSSB1/SSB2 paralogs. Six of these TF- Concerted Evolution binding sites are shared between the 5’-IGS regions of both SSB1/SSB2 copies (fig. 4B). The remaining ten exhibit a diver- In contrast to SSAs, there were few evolutionary changes in gent pattern with six TF-binding sites being present predom- http://gbe.oxfordjournals.org/ copy number in the SSB subfamily. A total of 62 orthologous inantly upstream of SSB1, while the other four are present sequences were identified in the 53 surveyed Ascomycota predominantly upstream of SSB2. Moreover, while 14 of the species (fig. 1 and supplementary fig. S2, Supplementary 16 TF binding sites are also present among the 5’-IGS regions Material online). The presence of SSB1/SSB2 paralogs in the of singleton SSB genes from species ancestral to the WGD 8 species of the Saccharomycetaceae lineage is consistent with event (Kominek J, unpublished data), two are present only a single gene duplication event that coincided with the WGD. in post-WGD species. Namely, the Rfx1 binding site is present Two copies of SSB were also found in the recently sequenced in 5’-IGSs of SSB1, while the Gln3-binding site is present in at University of Wisconsin-Madison on June 20, 2014 genome of Schizosaccharomyces japonicus (supplementary 5’-IGS of SSB2 in 4 out of the 5 species, suggesting that fig. S2, Supplementary Material online), whereas a single further functional diversification may have taken place after copy was present in the other 44 genomes surveyed. emergence of SSB duplicates. Similar to SSAs, SSB sequences evolved slowly, as illustrated The earlier discussion focuses on the possible functional by a high pair wise sequence identity (~80%) (table 1)andan identity of the SSB1/SSB2 paralogs. However, a degree of L-shaped distribution of amino acid usage, with a markedly functional specialization may exist despite their high level of lower than 1 (fig. 3). Also indicative of a divergent mode sequence identity. A few of the amino acid substitutions of sequence evolution controlled by purifying selection, the exhibit repeated instances of parallel evolution among para- tree topology for singleton SSB genes was consistent logous copies of SSBs (fig. 4A), implying, as suggested previ- with the species tree (fig. 1; supplementary fig. S2, ously (Takuno and Innan 2009), that these substitutions could Supplementary Material online). However, the amino acid have functional importance and were thus driven by selection. phylogeny of paralogous SSB1/SSB2 sequences from post- However, we note that two species (V. polyspora and C. glab- WGD Saccharomycetaceae species show a complex branching rata)harborcopiesoftheSSB1/SSB2 gene pair having identi- pattern not consistent with a divergent mode of sequence cal amino acid sequences. One possible explanation is that the evolution (supplementary fig. S2A, Supplementary Material identity in these species is due to recent gene conversion online). As this could be due to weak phylogenetic signal re- events. On the other hand, as the species harboring identical sulting from high level of sequence conservation, we turned to copies of SSB1/SSB2 genes branch at the base of the clade nucleotide-based phylogeny (fig. 4 and supplementary S2B, harboring SSB duplicates, it is also possible that functionally Supplementary Material online). The tree topology for para- important sequence divergence of SSB1/SSB2 copies evolved logous SSB1/SSB2 sequences exhibited a branching pattern more recently. characteristic of concerted evolution. The high sequence Overall, SSB subfamily evolution is a remarkable example of similarity between duplicate SSB paralogs via concerted independent gene conversion, where the physical boundaries evolution is consistent with the gene dosage hypothesis of each event correspond precisely with the coding region of (Sugino and Innan 2006), wherein selection acts to maintain each duplicate pair. Noncoding sequences adjacent to both duplicate genes for increased protein concentrations. SSBs, sides of the coding region exhibit clear patterns of divergent which assist folding of polypeptide chains emerging from evolution, but closer inspection of promoter regions shows the ribosome, fit this criterion because their abundance is so evidence of potential subfunctionalization for regulatory ele- high that it equals or surpasses that of ribosomes in the cell ments following gene duplication. Finally, despite concerted (Pfund et al. 1998). evolution within coding sequences, multiple instances of

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2469 Kominek et al. GBE

A Downloaded from http://gbe.oxfordjournals.org/

B at University of Wisconsin-Madison on June 20, 2014

FIG.4.—Sequence evolution and regulatory divergence of SSB subfamily. (A) Bayesian trees of nucleotide sequences of SSB orthologs from post-WGD Saccharomycetaceae species: (left) the 5’-IGS, (middle) open reading frames (SSB ORF), and (right) the 3’-IGS. The SSB ORF tree was rooted using the SSB1 ortholog from Candida albicans and is shown as cladogram for clarity (for the corresponding phylogram, see supplementary fig. S2B, Supplementary Material online). Scale is in expected nucleotide substitutions per site. **Posterior probability 0.95, *posterior probability 0.7. Differences aa/nn indicates number of amino acid (aa) and nucleotide (nn) differences between paralogous SSB1/SSB2 sequences from a given species. Residues listed along branches indicate amino acid substitutions between each SSB1 and SSB2 pair. Substitutions present in more than one species are marked in bold. (B) Schematic representation of predicted TF-binding sites present in the 5’-IGSs of SSB1 and SSB2 orthologs from the indicated post-WGD species. Underlined TF-binding sites were not identified in the pre-WGD species. The diagram is not drawn to scale. For clarity, specific locations of individual TF binding sites are not indicated. parallel amino acid substitutions indicate some degree of func- number in the KAR subfamily (fig. 1; supplementary fig. S3, tional divergence between gene duplicates. Further functional Supplementary Material online). The single, species specific, studies are needed to fully explain the biological importance of KAR duplicate was identified in Chaetomium globosum the complex evolutionary patterns observed for SSB1/SSB2 (fig. 1). The rate of sequence evolution of KAR is low, as indi- gene duplicates cated by high pairwise protein sequence identity (~72%) among KAR orthologs (table 1), the L-shaped amino acid usage distribution with a ¼ 0.74 (fig. 3), and dN/dS value Canonical KAR Hsp70s: Stable Gene Copy Number and (0.018) (table 2). In summary, KAR is the most stable subfamily Divergent Sequence Evolution among the canonical Hsp70s in Ascomycota. In the 53 Ascomycota genomes surveyed, we identified only This stability of KAR is consistent with observations in other 54 copies of KAR orthologs, indicating stable gene copy lineages, as single orthologs have been identified in a wide

2470 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

range of eukaryotes, including human (Daugaard et al. 2007), negative correlation between the rate of sequence evolution Xenopus (Miskovic et al. 1997), several fish species (He et al. and protein expression level (Drummond et al. 2005), the low 2013), Drosophila (Nikolaidis and Nei 2004), silkworm (Xi and level of expression of S. cerevisiae SSQ1 compared with SSC1 Ma 2013), copepod (Xuereb et al. 2012), mollusks (Clark et al. (table 2) provides one possible explanation. In addition, de- 2008), shrimp (Luan et al. 2009), Giardia (Gupta and Singh tailed functional studies of SSQ1 from S. cerevisiae revealed 1994), Trypanosoma cruzi and Leishmania major (Louw et al. that it underwent subfunctionalization, wherein it became 2010). Taxon-specific duplications of KAR genes have been specialized in the maturation of iron–sulfur cluster proteins, identified in Caenorhabditis (Nikolaidis and Nei 2004)andin which had been carried out by the ancestral SSC1 gene Trypanosoma brucei (Louw et al. 2010). However, in flowering (Schilke et al. 2006). During its subfunctionalization, SSQ1 plants, an expansion of the KAR subfamily has been observed, lost functions normally carried out by SSC1, such as protein with copy number ranging between two and five among the import across the inner mitochondrial membrane, protein six species analyzed (Denecke et al. 1991; Marocco et al. folding, and refolding of damaged proteins. These diverse Downloaded from 1991; Cascardo et al. 2000; Lin et al. 2001; Jung et al. functions require SSC1 to interact with an array of client 2013). Thus, even the most stable subfamily of canonical proteins and several J-protein co-chaperones (Craig and Hsp70s occasionally exhibits copy number variation in a line- Marszalek 2011). In contrast, biochemical experiments age specific manner. revealed that SSQ1 interacts with a single client, a scaffold protein on which iron–sulfur clusters are formed, and that this specific interaction requires cooperation with only a http://gbe.oxfordjournals.org/ Canonical SSC Hsp70s: Complex Evolutionary History single specialized J-protein Jac1(Craig and Marszalek 2011). Involving Subfunctionalization As SSQ1 lost the ability to interact with other clients or Our survey of 53 Ascomycota genomes revealed 86 members J-protein co-chaperones, we hypothesize that fewer func- of the SSC subfamily. Although 27 species harbor a single tional constraints acting upon SSQ1 resulted in relaxed puri- copy of this mitochondrial Hsp70, those belonging to fying selection and increased rate of sequence evolution the CTG and Saccharomycetaceae clades bear either 2 or 3 observed. (fig. 1; supplementary fig. S4, Supplementary Material online). To find signatures consistent with functional diversification at University of Wisconsin-Madison on June 20, 2014 Consistent with our previous, more limited, analysis (Schilke of SSQ1,wecomparedSSQ1 with SSC1 from species that pre- et al. 2006), two independent gene duplication events are dated the gene duplication (fig. 5), specifically searching for sufficient to explain this phylogenetic distribution. First, a sites that were variable in SSC1 but invariant in SSQ1 (Gu duplicationinacommonancestoroftheCTGand 2001, 1999). We termed such sites “fixed” (called Type I in Saccharomycetaceae clades could explain the presence of Gu [1999]), which are interpreted to be potentially important the two paralogous SSC genes in members of the CTG for the specialized function of SSQ1. We also searched for clade and those Saccharomycetaceae species that did not switched sites (called Type II in Gu [1999]) that were conserved experience the WGD event. A subsequent duplication is con- within each SSC1 and SSQ1 sequences, but occupied by dif- sistent with the presence of three paralogous SSC genes in ferent amino acids in each. Type II sites are often interpreted as Saccharomycetaceae post-WGD species. More specifically, the associated with functional changes in each paralog. The fixed ancestral duplication of SSC1 gave rise to SSQ1,andthe and switched sites were further classified as radical and con- subsequent WGD gave rise to SSC3 (known also as ECM10). servative, based on BLOSUM62 scores. We classified the site The phylogeny is consistent with this pattern in that SSQ1 is as radical when the substitution from the most prevalent res- derived within SSC1 and SSC3 forms a sister taxon with SSC1 idue in the SSC1 sequences to the most prevalent residue in (supplementary fig. S4, Supplementary Material online). the SSQ1 sequences had the BLOSUM62 score equal to or less Singleton sequences of the SSC subfamily evolved slowly, than zero and conservative when the BLOSUM62 score was as is typical for canonical Hsp70s from Ascomycota (tables 1 strictly greater than zero. and 2; fig. 3). However, sequences orthologous to SSQ1 SSQ1 contains 30 fixed conservative sites that are spatially evolved quite rapidly, relative to other canonical Hsp70s, as distributed evenly across the linear amino acid sequence indicated by: 1) longer branches on the gene tree (fig. 1; sup- (fig. 5). The spatial distribution of fixed radical sites was very plementary fig. S4, Supplementary Material online),2)lower different. Five of seven were within or in close proximity to the pair-wise amino acid identity (~60%; table 1), 3) higher dN/dS client protein-binding cleft localized in the beta sheet region of values (0.045; table 2), and 4) among all canonical Hsp70s, the SBD domain (SBD-ß). All five switched sites, both conser- thehighestvalueofa (0.93; fig. 3)foraminoacidusage. vative and radical, were localized within SBD-ß (fig. 5B; sup- The increased rate of substitution for the derived SSQ1 sug- plementary fig. S12, Supplementary Material online). Three of gests less purifying selection in this subclade (fig. 3). these five sites (P446, P449, and V466) are near or in the cleft Two causative factors are consistent with the higher rate of itself (supplementary fig. S13, Supplementary Material online), SSQ1 sequence evolution: 1) low levels of expression (table 2) and thus likely to affect the specificity of client binding (Zhu and 2) highly specialized function. Given the established et al. 1996), a testable scenario that is consistent with these

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2471 Kominek et al. GBE

A

B Downloaded from http://gbe.oxfordjournals.org/ at University of Wisconsin-Madison on June 20, 2014

FIG.5.—Amino acid sequence divergence of SSQ1 and SSC3 orthologs. (A) Summary of sequence divergence analysis. Group A: SSC1preSSQ1—27 sequences of SSC1 orthologs from species predating the emergence of SSQ1; SSC1preSSC3—6 sequences of SSC1 orthologs from species predating the emergence of SSC3 (supplementary fig. S4, Supplementary Material online, for details). Group B: 26 sequences of SSQ1 orthologs and 4 sequences of SSC3 orthologs. Conserved sites—number of sites with residues identical in group A and B. Relaxed sites—number of sites that are invariant in A but variableinB. Fixed sites—number of sites that are variable in A but invariant in B. Switched sites—number of sites that are invariant within group A and group B but occupied by different amino acids in each group. Radical—number of sites where substitution from the most prevalent residue in group A to the most prevalent residue in group B had the BLOSUM62 score equal to or less than zero. Conservative—number of sites where substitution from the most prevalent residue in group A to the most prevalent residue in group B had the BLOSUM62 score strictly greater than zero. (B) Approximate localization of the radical fixed and radical switched sites within sequences orthologous to SSQ1 and SSC3 as indicated. Fixed sites represented by bars above the sequence and switched sites marked by bars below sequence with radical sites marked by their sequence position. Pale fragments indicate patches of at least 9 consecutive gaps in the alignment of sequences from groups A and B, and thus not subjected to the analysis. SP, signal peptide; L, Linker; SBD-a, substrate-binding domain alpha-helical lid.

changes being important for functional specialization of and branching pattern of the protein tree suggested a con- SSQ1. Altogether, our data indicate that SSQ1 was subjected certed mode of evolution driven most likely by gene conver- to divergent evolution driven by lowered expression levels and sion events. However, in the other species, SSC3 diverged functional diversification, but that critical function was main- significantly from its parental gene. In fact, the dN/dS ra- tained by purifying selection. This scenario is consistent with tio determined for SSC3 (table 2) was the highest among ca- previously observed asymmetric rate of evolution between nonical Hsp70s. All fixed and switched sites are spatially duplicate genes (Kim and Yi 2006), which is usually explained distributed evenly along the linear SSC3 sequence (fig. 5). by differences in expression levels and relaxed purifying selec- Thus, in contrast to SSQ1, no specific region was indicative tion due to the partitioning of ancestral functions between of functional divergence for SSC3. Even though the S. cerevi- duplicates. siae SSC3 protein is associated with mitochondrial DNA As mentioned earlier, the phylogenetic distribution of (Sakasegawa et al. 2003), deletion of SSC3 has no known the third SSC paralog, SSC3, is restricted to post-WGD species phenotypic effects (Baumann et al. 2000; Pareek et al. (fig. 1; supplementary fig. S4, Supplementary Material online). 2011), including any functional relationship to mtDNA main- Evolution of SSC3 varies in a taxon-specific manner. In two tenance and propagation. However, SSC3 is defective in its species (C. glabrata and V. polyspora) high sequence identity ability to interact with client peptides, likely due to its inability

2472 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

to perform ATPase-dependent allosteric interdomain commu- detected for LHS, due to a very low level of sequence conser- nication, which may indicate it is early in the process of pseu- vation. These sites exhibit no clear spatial pattern in their dis- dogenization (Pareek et al. 2011). Yet, given that gene loss is tribution across the sequence. During our analysis of sequence rapid for nonfunctional gene in post-WGD species (Hittinger divergence of atypical Hsp70s, we noted another difference: et al. 2004), the maintenance of SSC3 orthologs in most post- subfamily-specific differences in their length due to specific WGD genomes strongly suggests a functional role. An intrigu- insertions/deletions (fig. 6B). ing possibility is that SSC3 is emerging as a new atypical Hsp70 Altogether, atypical Hsp70s evolve faster and subfamilies (Pareek et al. 2011). However, further experimental studies are more divergent than those of canonical Hsp70s. As fast are needed to reveal whether SSC3 should be classified as a evolving SSE1 and SSZ1 are expressed at high levels, compa- pseudo, a canonical or, possibly, an atypical Hsp70. rable with that of most canonical Hsp70s, low expression is not a likely explanation for rapid evolution (table 2; figs. 3 and 6). The more likely explanation is relaxation of functional Atypical Hsp70s: Low Copy Number Dynamics and High Downloaded from Sequence Divergence constraints. However, the situation is complex, because, while atypical Hsp70s are known to have a more limited ability to Copy number is relatively stable among atypical Hsp70s. In the perform ATPase-dependent client binding and releasing cycle 53 surveyed species, we identified 60 copies of SSE, 54 copies (Shaner and Morano 2007), they are also known to interact of SSZ, and 54 copies of LHS. Paralogous copies of SSE specifically with canonical Hsp70 partners (SSE and LHS) or http://gbe.oxfordjournals.org/ werefoundin7ofthe8Saccharomycetaceae analyzed and with J-protein co-chaperones (SSZ). In addition, recent exper- coincided with the WGD (fig. 1; supplementary fig. S5, imental studies reveals that their repertoire of biochemical Supplementary Material online); the 8th, C. glabrata,experi- activities is larger than expected based on earlier studies enced a lineage-specific loss of one SSE copy. Paralogous (Steel et al. 2004; Shaner and Morano 2007; Rampelt et al. copies of SSZ and LHS were found in a single species 2012; Xu et al. 2012; Mattoo et al. 2013). However, despite (V. polyspora) that branches at the base of recent findings, the limited allosteric communication between Saccharomycetaceae post-WGD clade. This finding coupled their two functional domains (NBD and SBD) (Liu and with the results of synteny analysis implies that the additional Hendrickson 2007; Schuermann et al. 2008) remains the at University of Wisconsin-Madison on June 20, 2014 copies of both LHS and SSZ present in V. polyspora originated most likely explanation for relaxed selection as the cause of during the WGD event. The lack of such SSZ and LHS paralogs higher substitution rates among atypical Hsp70s. Interdomain in most post-WGD Saccharomycetaceae species, and there- allosteric communication engages a large number of co-evolv- fore, likely resulted from multiple lineage- or species-specific ing residues, linking the functional sites of the two domains loss events in other post-WGD species (fig. 1; supplementary across a specific interdomain interface (Smock et al. 2010), figs. S6 and S7, Supplementary Material online). and thus likely explaining lower substitution rates among Unlike the stability of copy number, atypical Hsp70s have canonical Hsp70s. high rates of sequence evolution, as indicated by the follow- ing: 1) long branches on the gene tree (fig. 1B; supplementary figs. S9–S11, Supplementary Material online), 2) relatively low Conclusions and Perspectives mean pairwise amino acid identity (table 1), 3) high a values The major finding of our analysis is that the Hsp70 chaperone for amino acid usage, ranging from 1.21 for SSZ to 2.85 for family is highly conserved with members of the seven Hsp70 LHS (fig. 3), and 4) dN/dS values that are markedly higher than subfamilies, both canonical and atypical, present in all sur- those of most canonical Hsp70s (table 2). Thus, our data are veyed species. We postulate that each of these seven subfa- consistent with atypical Hsp70s evolving under more relaxed milies diversified from their corresponding single ancestral purifying selection than canonical Hsp70s. gene that was present in the common ancestor of all surveyed To further characterize sequence divergence patterns of species and which subsequently underwent both independent atypical Hsp70s, we analyzed the distribution of fixed and and genome-scale duplication events. As a result of indepen- switched sites (fig. 6). Because of rapid sequence evolution dent duplicate gene retentions among subfamilies and taxa, among atypical Hsp70s, we compared sequences for each we observed wide ranging gene family evolutionary dynamics. subfamily of atypical Hsp70s (termed Group B) against canon- Yet, despite these rich evolution dynamics, we found no evi- ical Hsp70s (termed Group A) (fig. 6A). Consistent with high dence for any changes in cellular localization or function rates of sequence evolution, our divergence analysis revealed across subfamilies. Thus, the Hsp70 subfamilies constitute a very limited number of conserved sites across group A and conserved units of function and evolution. This relationship group B alignments (fig. 6). 44 conserved sites were found for between subfamilies and functional units was first proposed SSE, 26 for SSZ and only 19 such sites for LHS. Consistent with based on the study of Hsp70 family in S. cerevisiae (Boorstein their high substitution rates, multiple fixed and switched sites et al. 1994). Here, we demonstrate that this relationship is were distributed evenly across the entire sequences of SSE and robust, supported after much greater taxon sampling and SSZ. Although only 18 fixed and 2 switched sites were analysis of genomic sequence data sets. These findings may

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2473 Kominek et al. GBE

A

B Downloaded from http://gbe.oxfordjournals.org/ at University of Wisconsin-Madison on June 20, 2014

FIG.6.—Amino acid sequence divergence of atypical SSE, SSZ, and LHS subfamilies. (A) Summary of sequence divergence analysis. Group A, canonical Hsp70 homologs (223 sequences) belonging to the SSA, KAR, and SSC subfamilies. Group B, Hsp70 orthologs belonging to the SSE (60 sequences), SSZ (54 sequences), and LHS (52 sequences) subfamilies, as indicated. Conserved sites—number of sites with residues identical in group A and B. Relaxed sites— number of sites that are invariant in A but variable in B. Fixed sites—number of sites that are variable in A but invariant in B. Switched sites—number of sites that are invariant within group A and group B but occupied by different amino acids in each group. Radical—number of sites where substitution from the most prevalent residue in group A to the most prevalent residue in group B had the BLOSUM62 score equal to or less than zero. Conservative—number of sites where substitution from the most prevalent residue in group A to the most prevalent residue in group B had the BLOSUM62 score strictly greater than zero. (B) Approximate localization of the radical fixed and radical switched sites within sequences orthologous to SSE1, SSZ1,andLHS1, as indicated. Fixed sites marked by bars above the sequence and switched sites marked by bars below sequence with radical sites marked by their sequence position. Pale fragments indicate patches of at least nine consecutive gaps in the alignment of sequences from groups A and B, and thus not subjected to the analysis. SP, signal peptide; L, Linker; SBD-a, substrate-binding domain alpha-helical lid.

2474 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

well be applicable to all eukaryotes, as data available for other have resulted in lethality in taxa without the duplicate gene taxonomic groups suggest a similar pattern, both in terms of (Craig and Marszalek 2011). evolutionary conservation of subfamilies and their consistent In the context of evolutionary studies, our results provide a compartmental localization (Hughes 1993; Lin et al. 2001; striking example: even a highly conserved and ubiquitous pro- Nikolaidis and Nei 2004; Wada et al. 2006; Daugaard et al. tein family, whose members are present in all domains of life, 2007; Georg Rde and Gomes 2007; Brocchieri et al. 2008; performing universal and often essential functions is subjected Kabani and Martineau 2008). to dynamic and variable evolutionary forces, which often act The mechanisms of Hsp70 gene family evolution varied in a lineage-specific level. Yet, there is a common theme in among subfamilies and taxa. Some subfamilies evolved ac- these ostensibly chaotic evolutionary dynamics. The presence cording to a simple divergent scenario with few changes in of each Hsp70 subfamily is conserved; thus, despite evolution- copy number dynamics (e.g., KAR, SSZ, and LHS). Others ary complexity, subfamilies evolve in accordance to the protein underwent highly dynamic changes in copy number and family evolution model, wherein orthologous groups consti- Downloaded from evolved according to the birth–death model of gene family tute units of function and evolution preserved across a phy- evolution (SSA and SSC). In a few cases, subfamilies evolved logeny (Gabaldon and Koonin 2013). Finally, the evolutionary under concerted evolution with gene conversion homogeniz- dynamics detected in our study are still difficult to interpret ing amino acid sequences within the species (SSBs and to a in functional terms. For example, why do taxa vary in copy more limited extent, SSA and SSC). Yet, a common theme number for the SSA subfamily? Why have mitochondrial http://gbe.oxfordjournals.org/ emerges when our analysis is considered together with avail- Hsp70s specialized for iron–sulfur biogenesis (SSQ1)evolved able functional data: within each Hsp70 subfamily paralogous only in a limited number of fungal species, given this special- genes diversified functionally regardless of their mode of ization independently arose in the bacterial ancestors of mito- evolution. In contrast to high copy number dynamics, the se- chondria (Schilke et al. 2006)? Why do SSB paralogs quence evolution was rather conservative with purifying se- independently maintain a few amino acid differences despite lection as the major evolutionary force. The rate of sequence repeated homogenization across the rest of the protein evolution was inversely proportional to the expression level, as sequence? How have interactions of Hsp70s with their co- expected, as well as strongly functionally constrained. In gen- chaperones, J-proteins and nucleotide release factors, affected at University of Wisconsin-Madison on June 20, 2014 eral, atypical Hsp70s, which have limited allosteric communi- patterns of evolution? It is clear that sequence analysis alone is cation between functional domains, evolved faster than not sufficient to answers these questions; only a combination canonical Hsp70s. Among canonicals, highly specialized and of evolutionary analyses coupled with experimental studies low abundant SSQ1 evolved faster than multifunctional and will ultimately provide answers (Harms and Thornton 2013). highly abundant SSA, KAR and SSC; however, in all cases the rate of evolution for Hsp70s relative to the genome-wide Supplementary Material average was very low, indicative of purifying selection as the major evolutionary force influencing sequence evolution. Supplementary figures S1–S13 are available at Genome Practically, how do our findings inform future studies? Biology and Evolution online (http://www.gbe.oxfordjour In the context of functional analyses, researchers using nals.org/). model organisms should consider that Hsp70 subfamilies, which constitute conserved functional units, experience com- Acknowledgments plex evolutionary dynamics, often in a taxon-specific manner. This work was supported by the Foundation for Polish Science As a consequence, functions associated with a given Hsp70 TEAM/2009-3/5 project (to J.K. and J.M.); the National subfamily could be partitioned among a number of paralo- Institutes of Health grants GM31107 and GM27870 (to gous genes, whose homology relationships across species E.A.C.); the Michigan State University start-up fund, a grant might be difficult to determine. Moreover, the manners in from the Michigan State University Gene Expression in which these functions are distributed among paralogous Development and Disease Group (GEDD), and the NSF-STC genes may vary from model species to model species, so in- BEACON: Center for the Study of Evolution in Action (to ference of function across species using only sequence homol- B.W.); and the genome sequencing of the species closely re- ogy should be made with caution. Yet, in some circumstances, lated to Yarrowia lipolytica was financially supported by INRA it may be advantageous for experimentalists to exploit this (AIP-Bioressources 2011 YALIP) (to C.N.). situation, as it could allow study of a specific function associ- ated with a specialized Hsp70 without interference from disruption of other activities typically carried out by multifunc- Literature Cited tional Hsp70s. For example, the presence of the specialized Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410. SSQ1 gene in S. cerevisiae, has allowed detailed analysis Ayres DL, et al. 2012. BEAGLE: an application programming interface and of Hsp70s’ function in the mitochondrial biogenesis of iron– high-performance computing library for statistical phylogenetics. Syst sulfur containing proteins, where genetic manipulation would Biol. 61:170–173.

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2475 Kominek et al. GBE

Baumann F, Milisav I, Neupert W, Herrmann JM. 2000. Ecm10, a novel Gabaldon T, Koonin EV. 2013. Functional and evolutionary implications of hsp70 homolog in the mitochondrial matrix of the yeast gene orthology. Nat Rev Genet. 14:360–366. Saccharomyces cerevisiae. FEBS Lett. 487:307–312. Georg Rde C, Gomes SL. 2007. Comparative expression analysis of mem- Bettencourt BR, Feder ME. 2001. Hsp70 duplication in the Drosophila bers of the Hsp70 family in the chytridiomycete Blastocladiella emer- melanogaster species group: how and when did two become five? sonii. Gene 386:24–34. Mol Biol Evol. 18:1272–1282. Germot A, Philippe H. 1999. Critical analysis of eukaryotic phylogeny: a Bettencourt BR, Feder ME. 2002. Rapid concerted evolution via gene con- case study based on the HSP70 family. J Eukaryot Microbiol. 46: version at the Drosophila hsp70 genes. J Mol Evol. 54:569–586. 116–124. Boorstein WR, Ziegelhoffer T, Craig EA. 1994. Molecular evolution of the Gordon JL, Byrne KP, Wolfe KH. 2009. Additions, losses, and rearrange- HSP70 multigene family. J Mol Evol. 38:1–17. ments on the evolutionary route from a reconstructed ancestor to the Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. 2012. modern Saccharomyces cerevisiae genome. PLoS Genet. 5:e1000485. Epistasis as the primary factor in molecular evolution. Nature 490: Gu X. 1999. Statistical methods for testing functional divergence after 535–538. gene duplication. Mol Biol Evol. 16:1664–1674. Broadley SA, Hartl FU. 2009. The role of molecular chaperones in human Gu X. 2001. Maximum-likelihood approach for gene family evolution misfolding diseases. FEBS Lett. 583:2647–2653. under functional divergence. Mol Biol Evol. 18:453–464. Downloaded from Brocchieri L, Conway de Macario E, Macario AJ. 2008. hsp70 genes in the Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to human genome: conservation and differentiation patterns predict a wide estimate large phylogenies by maximum likelihood. Syst Biol. 52: array of overlapping and specialized functions. BMC Evol Biol. 8:19. 696–704. Brown CR, McCann JA, Chiang HL. 2000. The heat shock protein Ssa2p is Gupta RS, Bustard K, Falah M, Singh D. 1997. Sequencing of heat shock required for import of fructose-1, 6-bisphosphatase into Vid vesicles. protein 70 (DnaK) homologs from Deinococcus proteolyticus and J Cell Biol. 150:65–76. Thermomicrobium roseum and their integration in a protein-based http://gbe.oxfordjournals.org/ Byrne KP, Wolfe KH. 2006. Visualizing syntenic relationships among the phylogeny of prokaryotes. J Bacteriol. 179:345–357. hemiascomycetes with the Yeast Gene Order Browser. Nucleic Acids Gupta RS, Singh B. 1994. Phylogenetic analysis of 70 kD heat shock pro- Res. 34:D452–D455. tein sequences suggests a chimeric origin for the eukaryotic cell nu- Cascardo JC, et al. 2000. The phosphorylation state and expression of cleus. Curr Biol. 4:1104–1114. soybean BiP isoforms are differentially regulated following abiotic Harms MJ, Thornton JW. 2013. Evolutionary biochemistry: revealing the stresses. J Biol Chem. 275:14494–14500. historical and physical causes of protein properties. Nat Rev Genet. 14: Castresana J. 2000. Selection of conserved blocks from multiple alignments 559–571. for their use in phylogenetic analysis. Mol Biol Evol. 17:540–552. He Y, et al. 2013. Identification of a testis-enriched heat shock protein and

Clark MS, Fraser KPP, Peck LS. 2008. Antarctic marine molluscs do have an fourteen members of hsp70 family in the swamp eel. PLoS One 8: at University of Wisconsin-Madison on June 20, 2014 HSP70 heat shock response. Cell Stress Chaperones 13:39–49. e65269. Cock PJ, et al. 2009. Biopython: freely available Python tools for compu- Hittinger CT, Rokas A, Carroll SB. 2004. Parallel inactivation of multiple tational molecular biology and bioinformatics. Bioinformatics 25: GAL pathway genes and ecological diversification in yeasts. Proc Nati 1422–1423. Acad Sci U S A. 101:14144–14149. Craig EA, Marszalek J. 2011. Hsp70 chaperones. Encyclopedia of life Huang P, Gautschi M, Walter W, Rospert S, Craig EA. 2005. The Hsp70 sciences (ELS). Chichester (UK): John Wiley & Sons. Ssz1 modulates the function of the ribosome-associated J-protein Darriba D, Taboada GL, Doallo R, Posada D. 2011. ProtTest 3: fast selection of Zuo1. Nat Struct Mol Biol. 12:497–504. best-fit models of protein evolution. Bioinformatics 27:1164–1165. Hughes AL. 1993. Nonlinear relationships among evolutionary rates iden- Daugaard M, Rohde M, Jaattela M. 2007. The heat shock protein 70 tify regions of functional divergence in heat-shock protein 70 genes. family: highly homologous proteins with overlapping and distinct func- Mol Biol Evol. 10:243–255. tions. FEBS Lett. 581:3702–3710. Jung KH, Gho HJ, Nguyen MX, Kim SR, An G. 2013. Genome-wide Denecke J, Goldman MHS, Demolder J, Seurinck J, Botterman J. 1991. The expression analysis of HSP70 family genes in rice and identification tobacco luminal binding-protein is encoded by a multigene family. of a cytosolic HSP70 gene highly induced under heat stress. Funct Plant Cell 3:1025–1035. Integr Genomics. 13:391–402. Deshaies RJ, Koch BD, Werner-Washburne M, Craig EA, Schekman R. Kabani M, Martineau CN. 2008. Multiple hsp70 isoforms in the eukaryotic 1988. A subfamily of stress proteins facilitates translocation of secre- cytosol: mere redundancy or functional specificity? Curr Genomics. 9: tory and mitochondrial precursor polypeptides. Nature 332:800–805. 338–248. Dover G. 1982. Molecular drive: a cohesive mode of species evolution. Kampinga HH, Craig EA. 2010. The HSP70 chaperone machinery: J proteins Nature 299:111–117. as drivers of functional specificity. Nat Rev Mol Cell Biol. 11:579–592. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment soft- highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 102: ware version 7: improvements in performance and usability. Mol Biol 14338–14343. Evol. 30:772–780. Dujon B. 2010. Yeast evolutionary genomics. Nat Rev Genet. 11:512–524. Kim SH, Yi SV. 2006. Correlated asymmetry of sequence and functional Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accu- divergence between duplicate proteins of Saccharomyces cerevisiae. racy and high throughput. Nucleic Acids Res. 32:1792–1797. Mol Biol Evol. 23:1068–1075. Emanuelsson O, Brunak S, von Heijne G, Nielsen H. 2007. Locating pro- Kim YE, Hipp MS, Bracher A, Hayer-Hartl M, Hartl FU. 2013. Molecular teins in the cell using TargetP, SignalP and related tools. Nat Protoc. 2: chaperone functions in protein folding and proteostasis. Annu Rev 953–971. Biochem. 82:323–355. Fitzpatrick DA, Logue ME, Stajich JE, Butler G. 2006. A fungal phylogeny Kudla G, Helwak A, Lipinski L. 2004. Gene conversion and GC-content based on 42 complete genomes derived from supertree and combined evolution in mammalian Hsp70. Mol Biol Evol. 21:1438–1444. gene analysis. BMC Evol Biol. 6:99. Le SQ, Gascuel O. 2008. An improved general amino acid replacement Fitzpatrick DA, O’Gaora P, Byrne KP, Butler G. 2010. Analysis of gene matrix. Mol Biol Evol. 25:1307–1320. evolution and metabolic pathways using the Candida Gene Order Lin BL, et al. 2001. Genomic analysis of the Hsp70 superfamily in Browser. BMC Genomics 11:290. Arabidopsis thaliana. Cell Stress Chaperones 6:201–208.

2476 Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 Complex Evolutionary Dynamics of Hsp70s GBE

Liu Q, Hendrickson WA. 2007. Insights into Hsp70 chaperone activity from Schuermann JP, et al. 2008. Structure of the Hsp110:Hsc70 nucleotide a crystal structure of the yeast Hsp110 Sse1. Cell 131:106–120. exchange machine. Mol Cell. 31:232–243. Louw CA, Ludewig MH, Mayer J, Blatch GL. 2010. The Hsp70 chaperones Shaner L, Morano KA. 2007. All in the family: atypical Hsp70 chaperones of the Tritryps are characterized by unusual features and novel mem- are conserved modulators of Hsp70 activity. Cell Stress Chaperones bers. Parasitol Int. 59:497–505. 12:1–8. Luan W, Li FH, Zhang JQ, Wang B, Xiang JH. 2009. Cloning and expression Sharma D, et al. 2009. Function of SSA subfamily of Hsp70 of glucose regulated protein 78 (GRP78) in Fenneropenaeus chinensis. within and across species varies widely in complementing Mol Biol Rep. 36:289–298. Saccharomyces cerevisiae cell growth and prion propagation. PLoS MacIsaac KD, et al. 2006. An improved map of conserved regulatory sites One 4:e6644. for Saccharomyces cerevisiae. BMC Bioinformatics 7:113. Sharma D, Masison DC. 2011. Single methyl group determines prion prop- Maguire SL, et al. 2013. Comparative genome analysis and gene finding in agation and protein degradation activities of yeast heat shock protein Candida species using CGOB. Mol Biol Evol. 30:1281–1291. (Hsp)-70 chaperones Ssa1p and Ssa2p. Proc Nat Acad Sci U S A. 108: Marocco A, et al. 1991. Three high-lysine mutations control the level of 13665–13670. ATP-binding HSP70-like proteins in the maize endosperm. Plant Cell 3: Sharp PM, Cowe E. 1991. Synonymous codon usage in Saccharomyces 507–515. cerevisiae. Yeast 7:657–678. Downloaded from Mattoo RU, Sharma SK, Priya S, Finka A, Goloubinoff P. 2013. Hsp110 is a Smock RG, et al. 2010. An interdomain sector mediating allostery in Hsp70 bona fide chaperone using ATP to unfold stable misfolded polypep- molecular chaperones. Mol Syst Biol. 6:414. tides and reciprocally collaborate with hsp70 to solubilize protein ag- Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phyloge- gregates. J Biol Chem. 288:21399–21411. netic analyses with thousands of taxa and mixed models. Mayer MP, Bukau B. 2005. Hsp70 chaperones: cellular functions and mo- Bioinformatics 22:2688–2690. lecular mechanism. Cell Mol Life Sci. 62:670–684. Steel GJ, Fullerton DM, Tyson JR, Stirling CJ. 2004. Coordinated activation http://gbe.oxfordjournals.org/ McClellan AJ, et al. 1998. Specific molecular chaperone interactions and of Hsp70 chaperones. Science 303:98–101. an ATP-dependent conformational change are required during post- Sugino RP, Innan H. 2006. Selection for more of the same product as a translational protein translocation into the yeast ER. Mol Biol Cell. 9: force to enhance concerted evolution of duplicated genes. Trends 3533–3545. Genet. 22:642–644. Miskovic D, Salter-Cid L, Ohan N, Flajnik M, Heikkila JJ. 1997. Isolation and Takuno S, Innan H. 2009. Selection to maintain paralogous amino acid characterization of a cDNA encoding a Xenopus immunoglobulin differences under the pressure of gene conversion in the heat-shock binding protein, BiP (grp78). Comp Biochem Physiol B Biochem Mol protein genes in yeast. Mol Biol Evol. 26:2655–2659. Biol. 116:227–234. Taylor JW, Berbee ML. 2006. Dating divergences in the fungal tree of life: Nei M, Rooney AP. 2005. Concerted and birth-and-death evolution of review and new analyses. Mycologia 98:838–849. at University of Wisconsin-Madison on June 20, 2014 multigene families. Annu Rev Genet. 39:121–152. Thomas-Chollier M, et al. 2008. RSAT: regulatory sequence analysis tools. Nikolaidis N, Nei M. 2004. Concerted and nonconcerted evolution of the Nucleic Acids Res. 36:W119–W127. Hsp70 gene superfamily in two sibling species of nematodes. Mol Biol Wada S, Hamada M, Satoh N. 2006. A genomewide analysis of genes for Evol. 21:498–505. the heat shock protein 70 chaperone system in the ascidian Ciona Pareek G, Samaddar M, D’Silva P. 2011. Primary sequence that determines the intestinalis. Cell Stress Chaperones 11:23–33. functional overlap between mitochondrial heat shock protein 70 Ssc1 and Wang M, et al. 2012. PaxDb, a database of protein abundance Ssc3 of Saccharomyces cerevisiae. J Biol Chem. 286:19001–19013. averages across all three domains of life. Mol Cell Proteomics 11: Peden J. 1999. Analysis of codon usage. Nottingham (UK): University of 492–500. Nottingham. Werner-Washburne M, Becker J, Kosic-Smithers J, Craig EA. 1989. Yeast Pfund C, et al. 1998. The molecular chaperone Ssb from Saccharomyces cerevisiae is a component of the ribosome-nascent chain complex. Hsp70 RNA levels vary in response to the physiological status of the EMBO J. 17:3981–3989. cell. J Bacteriol. 171:2680–2688. R Development Core Team. 2013. R: a language and environment for Werner-Washburne M, Stone DE, Craig EA. 1987. Complex interactions statistical computing [Internet] R version 3.0.1. Vienna, Austria: R among members of an essential subfamily of hsp70 genes in Foundation for Statistical Computing. [updated 2013 May 16; cited Saccharomyces cerevisiae. Mol Cell Biol. 7:2568–2577. 2013 Dec 11]. Available from: http://www.R-project.org. Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplica- Rampelt H, et al. 2012. Metazoan Hsp70 machines use Hsp110 to power tion of the entire yeast genome. Nature 387:708–713. protein disaggregation. EMBO J. 31:4221–4235. Wright F. 1990. The ‘effective number of codons’ used in a gene. Gene 87: Raviol H, Bukau B, Mayer MP. 2006. Human and yeast Hsp110 chaperones 23–29. exhibit functional differences. FEBS Lett. 580:168–174. Xi XZ, Ma KS. 2013. Molecular cloning and expression analysis of glucose- Rhind N, et al. 2011. Comparative functional genomics of the fission regulated protein 78 (GRP78) gene in silkworm Bombyx mori. Biologia yeasts. Science 332:930–936. 68:559–564. Ronquist F, et al. 2012. MrBayes 3.2: efficient Bayesian phylogenetic in- Xu X, et al. 2012. Unique peptide substrate binding properties of 110-kDa ference and model choice across a large model space. Syst Biol. 61: heat-shock protein (Hsp110) determine its distinct chaperone activity. 539–542. J Biol Chem. 287:5661–5672. Sakasegawa Y, Hachiya NS, Tsukita S, Kaneko K. 2003. Ecm10p localizes Xuereb B, et al. 2012. Molecular characterization and mRNA expression of in yeast mitochondrial nucleoids and its overexpression induces exten- grp78 and hsp90A in the estuarine copepod Eurytemora affinis.Cell sive mitochondrial DNA aggregations. Biochem Biophys Res Commun. Stress Chaperones 17:457–472. 309:217–221. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Scannell DR, et al. 2011. The awesome power of yeast evolutionary ge- Biol Evol. 24:1586–1591. netics: new genome sequences and strain resources for the Zhu X, et al. 1996. Structural analysis of substrate binding by the molecular Saccharomyces sensu stricto genus. G3 (Bethesda) 1:11–25. chaperone DnaK. Science 272:1606–1614. Schilke B, et al. 2006. Evolution of mitochondrial chaperones utilized in Fe- S cluster biogenesis. Curr Biol. 16:1660–1665. Associate editor: Tal Dagan

Genome Biol. Evol. 5(12):2460–2477. doi:10.1093/gbe/evt192 Advance Access publication November 24, 2013 2477 Fungi Hsp70

Supplementary Fig. S1. Hsp70 orthologs identified in 57 fungal species with the 14 S.cerevisiae Hsp70 proteins used as queries. Taxonomic group, ORF/Gene names, their approximate genomic location, names given in this work (Given Name) and the reciprocal best-blast hits against the S. cerevisiae proteins (rec-BLAST) are shown for each identified Hsp70 ortholog. Specific issue encountered in the identification process are explained in the Notes section.

Group Species / Strain Data Source(s) ORF ID(s) / Gene name Approximate genomic location of the ORF Given Name rec-BLAST Notes Ustilago maydis BROAD Institute UM03791.1 Contig 126: 117491-119428 + UmaySSA SSA1 UM04434+UM04435 Contig 157: 150824 – 149484, 151718 – 151260, 152019 – 151956 UmaySSB1 SSB2 ORFs contains partial sequences, full sequence reconstructed using genomic data and homology with Ccinerea and Cneoformans UM05918.1 Contig 216: 22904-25867 + UmaySSE1 SSE1 UM01209.1 Contig 43: 12990-15309 - UmaySSC1 SSC1 UM05357.1 Contig 192: 77460-79277 - UmaySSZ1 SSZ1 Contig 32: 325 – 1 // Contig 31: 84996 – 83292 UmayKAR2 KAR2 contigs contain partial sequences, because the gene is located at a contig break UM00904.1 Contig 30: 62426-65185 - UmayLHS1 LHS1

Cryptococcus neoformans grubii h99 BROAD Institute CNAG_01750.2 Chromosome 11: 787758-790112 - CneoSSA1 SSA1 CNAG_01727.2 Chromosome 11: 729059-731645 + CneoSSA2 SSA1 CNAG_00334.2 Chromosome 1: 857472-859921 + CneoSSB1 SSB2 CNAG_06208.2 Chromosome 12: 630749-633499 + CneoSSE1 SSE1 CNAG_05199.2 Chromosome 4: 697170-699940 + CneoSSC1 SSC1 CNAG_01404.2 Chromosome 5: 394159-396194 + CneoSSZ1 SSZ1 CNAG_06443.2 Chromosome 13: 499100-501759 - CneoKAR2 KAR2 CNAG_03899.2 Chromosome 2: 1130168-1133013 + CneoLHS1 LHS1

Phanerochaete chrysosporium JGI e_gwh2.1.247.1 scaffold_1:1436212-1438391 PchrSSA SSA4 BROAD Institute e_gwh2.1.301.1 scaffold_1:1820168-1822284 PchrSSB1 SSB2 fgenesh1_pm.C_scaffold_6000005 scaffold_6:138921-141446 PchrSSE1 SSE1

BASIDIOMYCOTA fgenesh1_pg.C_scaffold_12000350 scaffold_12:1072737-1074905 PchrSSC1 SSC1 Gww2.10.123.1 scaffold_10:1261738-1263184 PchrSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed using PCHR_78864, genomic data and homology with Ccinerea e_gww2.9.92.1 scaffold_9:750951-753143 PchrKAR2 KAR2 fgenesh1_pg.C_scaffold_9000190 scaffold_9:620781-623737 PchrLHS1 LHS1

Coprinopsis cinerea okayama BROAD Institute CC1G_13924.2 Supercontig 2: 1783904-1786230 - CcinSSA SSA3 CC1G_13982.2 Supercontig 2: 2701019-2703492 + CcinSSB1 SSB2 CC1G_09266.2 Supercontig 3: 3516070-3518798 + CcinSSE1 SSE1 CC1G_06894.2 Supercontig 3: 2451192-2453422 + CcinSSC1 SSC1 CC1G_05804.2 Supercontig 12: 216005-218031 + CcinSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed from genomic data and homology with Pchrysosporium CC1G_12348.2 Supercontig 10: 1397150-1399622 - CcinKAR2 KAR2 ORF contains partial sequence, full sequence reconstructed from genomic data and homology with Cneoformans, Pchrysosporium and Umaydis CC1G_09108.2 Supercontig 10: 1064743-1067804 - CcinLHS1 LHS1 CC1G_14207.2 Supercontig 3: 1960479-1963567 + CcinKAR2p KAR2 putative pseudogene, not used

Schizosaccharomyces japonicus BROAD Institute SJAG_04055 Chromosome 2: 3183743-3186117 + SjapSSA-I SSA4 SJAG_01186 Chromosome 3: 1132068-1134327 - SjapSSA-II SSA4 SJAG_05779 Chromosome 3: 2523408-2525449 + SjapSSB1 SSB2 SJAG_03237 Chromosome 2: 691851-693863 - SjapSSB2 SSB2 SJAG_04511 Chromosome 3: 714716-717686 + SjapSSE1 SSE1 SJAG_05709 Chromosome 3: 2001458-2003774 + SjapSSC1 SSC1 SJAG_01510 Chromosome 3: 1786066-1788033 + SjapSSZ1 SSZ1 SJAG_00790 Chromosome 1: 2667113-2669515 - SjapKAR2 KAR2 SJAG_02875 Chromosome 1: 1651957-1654797 + SjapLHS1 LHS1

Schizosaccharomyces pombe 972h- BROAD Institute SPAC13G7_02c / ssa1 Supercontig 2: 2297753 – 2295819 SpomSSA-I SSA1 SPCC1739_13 / ssa2 Supercontig 3: 2057243 – 2059186 SpomSSA-II SSA1 SPBC1709_05 Supercontig 2: 1106456 – 1108297 SpomSSB1 SSB2 SPAC110_04c Supercontig 1: 1920160 – 1917998 SpomSSE1 SSE1 SPAC664_11 Supercontig 1: 1726009 – 1728033 SpomSSC1 SSC1 SPAC57A7_12 Supercontig 1: 1515089-1516914 - SpomSSZ1 SSZ1 SPAC22A12_15c Supercontig 1: 1185303 – 1183312 SpomKAR2 KAR2 SPAC1F5_06 Supercontig 1: 281367 – 278599 SpomLHS1 LHS1

Schizosaccharomyces octosporus BROAD Institute SOCG_04298 Chromosome 3: 576782-579147 - SoctSSA-I SSA4 SOCG_02770 Chromosome 1: 320278-322202 - SoctSSA-II SSA2 SOCG_03479 Chromosome 1: 2053617-2055908 - SoctSSB1 SSB2 SOCG_00373 Chromosome 2: 2092129-2094901 + SoctSSE1 SSE1 SOCG_01535 Chromosome 3: 2196974-2199415 - SoctSSC1 SSC1 SCHIZOSACCHAROMYCES SOCG_02356 Chromosome 1: 3427346-3429199 - SoctSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Scryophilus SOCG_02212 Chromosome 1: 3105086-3107714 - SoctKAR2 KAR2 SOCG_00522 Chromosome 2: 2444159-2447187 - SoctLHS1 LHS1

Schizosaccharomyces cryophilus BROAD Institute SPOG_00242 Supercontig 3: 532003-534426 - ScrySSA-I SSA4 SPOG_05342 Supercontig 4: 334784-337065 - ScrySSA-II SSA4 SPOG_04259 Supercontig 9: 444495-446661 - ScrySSB1 SSB2 SPOG_02132 Supercontig 2: 1328159-1330917 - ScrySSE1 SSE1 SPOG_05615 Supercontig 1: 923558-926038 - ScrySSC1 SSC1 SPOG_05669 Supercontig 5: 927867-929974 - ScrySSZ1 SSZ1 SPOG_00613 Supercontig 5: 614924-617571 - ScryKAR2 KAR2 SPOG_01983 Supercontig 2: 974957-977770 + ScryLHS1 LHS1

Magnaporthe grisea 70-15 (MG6) BROAD Institute MGG_06958.6 Supercontig 18: 935235-937726 - MgriSSA SSA1 MGG_11513.6 Supercontig 21: 1724419-1727245 + MgriSSB1 SSB2 MGG_06065.6 Supercontig 24: 545033-548037 - MgriSSE1 SSE2 MGG_04191.6 Supercontig 12: 486481-489423 + MgriSSC1 SSC1 MGG_02842.6 Supercontig 23: 918967-920953 + MgriSSZ1 SSZ1 MGG_02503.6 Supercontig 18: 2064348-2066834 - MgriKAR2 KAR2 MGG_06648.6 Supercontig 18: 2211652-2214877 - MgriLHS1 LHS1

Neurospora crassa OR74A (NC10) BROAD Institute NCU09602.4 / HSP70 Supercontig 2: 79807-82656 - NcraSSA SSA1 NCU02075.4 Supercontig 1: 1450891-1453703 + NcraSSB1 SSB2 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Mgrisea, Panserina and Cglobosum NCU05269.4 Supercontig 4: 5521208-5524832 - NcraSSE1 SSE2 NCU08693.4 Supercontig 4: 4072155-4075127 + NcraSSC1 SSC1 NCU00692.4 Supercontig 1: 7530307-7532510 + NcraSSZ1 SSZ1 NCU03982.4 Supercontig 6: 2302462-2305332 - NcraKAR2 KAR2 NCU09485.4 Supercontig 2: 4430576-4433950 + NcraLHS1 LHS1

Podospora anserina Podospora anserina Genome Project Pa_6_4100 CHR 6: 160198 – 162442 PansSSA SSA1 Pa_1_2670 CHR 1: 962007 – 964244 PansSSB1 SSB2 Pa_3_4490 CHR 3: 926456 – 929160 PansSSE1 SSE1 Pa_6_2570 CHR 6: 1429367 – 1431561 PansSSC1 SSC1 Pa_1_18900 CHR 1: 1696036 – 1697858 PansSSZ1 SSZ1 Pa_4_6420 CHR 4: 1766710 – 1769004 PansKAR2 KAR2 Pa_3_10060 CHR 3: 2861974 – 2865069 PansLHS1 LHS1

Chaetomium globosum BROAD Institute CHGG_08262.1 Supercontig 5: 4397131-4399392 + CgloSSA SSA1 CHGG_01675.1 Supercontig 1: 5157151-5159397 - CgloSSB1 SSB2 CHGG_04834.1 Supercontig 4: 3438199-3440891 - CgloSSE1 SSE2 CHGG_04605.1 Supercontig 4: 2692267-2694395 + CgloSSC1 SSC1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Panserina CHGG_06364.1 Supercontig 3: 3303639-3305447 + CgloSSZ1 SSZ1 CHGG_09206.1 Supercontig 6: 2741846-2744216 - CgloKAR2 KAR2 CHGG_01653.1 Supercontig 1: 5081454-5083492 + CgloKAR2p KAR2 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Panserina, Ncrassa and Mgrisea CHGG_10554.1 Supercontig 7: 3076476-3079557 + CgloLHS1 LHS1 ORF contains partial sequence, full sequence reconstructed using genomic data, trace archive and homology with Panserina, Ncrassa and Mgrisea

Trichoderma reesei JGI e_gw1.2.175.1 scaffold_2:1721232-1723250 TreeSSA SSA1 ORF contains a unique insertion/intron, full sequence reconstructed using using homology with G.zeae and G. moniliformis estExt_fgenesh5_pg.C_130159 scaffold_13:501120-503692 TreeSSB1 SSB2 estExt_GeneWisePlus.C_400011 scaffold_40:20191-23199 TreeSSE1 SSE1 estExt_fgenesh5_pg.C_20133 scaffold_2:459908-462223 TreeSSC1 SSC1 estExt_fgenesh1_pm.C_60103 scaffold_6:874716-876824 TreeSSZ1 SSZ1 estExt_fgenesh5_pg.C_160150 scaffold_16:512482-515161 TreeKAR2 KAR2 gw1.9.395.1 scaffold_9:508545-511499 TreeLHS1 LHS1

Gibberella moniliformis BROAD Institute FVEG_00720.2 Supercontig 1: 2118626-2121082 + GmonSSA SSA1 (Fusarium verticillioides) FVEG_00688.2 Supercontig 1: 2008124-2011068 - GmonSSB1 SSB2 FVEG_07295.2 Supercontig 9: 379858-382935 - GmonSSE1 SSE2 FVEG_06113.2 Supercontig 7: 905127-908495 + GmonSSC1 SSC1 FVEG_02022.2 Supercontig 2: 1460770-1462601 + GmonSSZ1 SSZ1 FVEG_04019.2 Supercontig 4: 1690107-1692628 + GmonKAR2 KAR2 FVEG_11493.2 Supercontig 17: 362696-366046 - GmonLHS1 LHS1

Gibberella zeae BROAD Institute FGSG_00838.2 Supercontig 1: 2689098-2691457 - GzeaSSA SSA1 (Fusarium graminearum) FGSG_00921.2 Supercontig 1: 2987810-2990564 + GzeaSSB1 SSB2 ORF contains partial sequence, nearly full sequence reconstructed using genomic data and homology with Gmoniliformis, but a missing fragment is absent even from trace data, so probably wasn't sequenced at all FGSG_01950.2 Supercontig 1: 6393997-6397123 + GzeaSSE1 SSE2 FGSG_06154.2 Supercontig 3: 4480983-4483396 + GzeaSSC1 SSC1

Strona 1 PEZIZOMYCOTINA Fungi Hsp70

FGSG_08644.2 Supercontig 5: 1135291-1137371 + GzeaSSZ1 SSZ1 FGSG_09471.2 Supercontig 6: 1681356-1683917 + GzeaKAR2 KAR2 FGSG_10819.2 Supercontig 8: 1239854-1243148 + GzeaLHS1 LHS1

Botryotinia fuckeliana BROAD Institute BC1G_06164.1 / BofuT4_P016860.1 Supercontig 29: 217973-220426 - BfucSSA SSA2 ORF contains partial sequence, full sequence reconstructed using BofuT4_P016860.1 Genoscope BC1G_10845.1+BC1G_10846.1 / BofuT4_P147470.1 Supercontig 80: 115316-116955 - / Supercontig 80: 117318-119054 - BfucSSB1 SSB2 ORFs contains partial sequences, full sequence reconstructed using BofuT4_P147470.1, genomic data and homology with Sscleoriorum BC1G_09769.1 Supercontig 66: 142479-144980 + BfucSSE1 SSE2 BC1G_11661.1 Supercontig 91: 152354-154585 + BfucSSC1 SSC1 BC1G_00968.1 Supercontig 3: 72264-74141 - BfucSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Sscleotiorum BC1G_04390.1 Supercontig 18: 306893-309142 - BfucKAR2 KAR2 BC1G_12543.1 Supercontig 103: 133202-136309 + BfucLHS1 LHS1

Sclerotinia sclerotiorum BROAD Institute SS1G_00134.1 Supercontig 1: 360402-362841 + SsclSSA SSA2 SS1G_05645.1 Supercontig 7: 85362-87846 + SsclSSB1 SSB2 SS1G_01888.1 Supercontig 2: 2131071-2133573 - SsclSSE1 SSE2 SS1G_09744.1 Supercontig 14: 356107-358330 + SsclSSC1 SSC1 SS1G_09280.1 Supercontig 13: 380553-382433 + SsclSSZ1 SSZ1 SS1G_02266.1 Supercontig 3: 436178-438409 + SsclKAR2 KAR2 SS1G_10087.1 Supercontig 15: 78997-82107 + SsclLHS1 LHS1 PEZIZOMYCOTINA

Phaeospheria nodorum BROAD Institute SNOG_14320.1+SNOG_14321.1+SNOG_14322.1 Supercontig 31: 75420-79148 + PnodSSA SSA2 ORFs contained partial sequence, full sequence reconstructed using genomic data, trace archive and homology with Ssclerotiorum and Ureesii SNOG_04404.1 Supercontig 6: 669855 - 672136 PnodSSB1 SSB2 ORFs contained partial sequence, full sequence reconstructed using genomic data SNOG_06062.1 Supercontig 8: 1294925-1297418 - PnodSSE1 SSE1 SNOG_13583.1 Supercontig 27: 223556-225572 + PnodSSC1 SSC1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Ssclerotiorum and Ureesii SNOG_00353.1 Supercontig 1: 825701-827510 - PnodSSZ1 SSZ1 SNOG_12476.1 Supercontig 22: 436572 – 432968 - PnodKAR2 KAR2 ORFs contained partial sequence, full sequence reconstructed using genomic data SNOG_02459.1 Supercontig 3: 1389497-1392654 + PnodLHS1 LHS1

Ajellomyces capsulatus G186AR BROAD Institute HCBG_07920.2 Supercontig 14: 787633-790545 - AcapSSA SSA3 HCBG_04781.2 Supercontig 6: 246375-249081 + AcapSSB1 SSB2 HCBG_04755.2 Supercontig 6: 175186-178648 - AcapSSE1 SSE2 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Cimmitis HCBG_08743.2 Supercontig 17: 332729-336038 - AcapSSC1 SSC1 HCBG_02577.2 Supercontig 3: 1241446-1243861 + AcapSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Cimmitis HCBG_05946.2 Supercontig 8: 462625-465850 + AcapKAR2 KAR2 HCBG_03803.2 Supercontig 4: 2258060-2261385 - AcapLHS1 LHS1

Uncinocarpus reesii BROAD Institute UREG_03219.1 Supercontig 2: 732847-734942 - UreeSSA SSA1 UREG_01466.1 Supercontig 1: 3706657-3708820 + UreeSSB1 SSB2 UREG_03621.1 Supercontig 2: 1898773-1901396 + UreeSSE1 SSE2 UREG_01663.1 Supercontig 1: 4304912-4307119 + UreeSSC1 SSC1 UREG_06810.1 Supercontig 4: 2000023-2001905 + UreeSSZ1 SSZ1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Cimmitis UREG_01160.1 Supercontig 1: 2836596-2838920 - UreeKAR2 KAR2 UREG_07090.1 Supercontig 5: 207815-210859 + UreeLHS1 LHS1 ORF contains partial sequence, full sequence reconstructed using genomic data, trace archive and homology with Cimmitis and Acapsulatus

Coccidioides immitis RS BROAD Institute CIMG_02494.3 Supercontig 1: 1675495-1678079 - CimmSSA SSA1 CIMG_02269.3 Supercontig 1: 2256912-2259616 + CimmSSB1 SSB2 CIMG_06861.3 Supercontig 2: 3686703-3689818 - CimmSSE1 SSE2 CIMG_04436.3 Supercontig 4: 521504-524476 + CimmSSC1 SSC1 CIMG_01851.3 Supercontig 1: 3583856-3586135 - CimmSSZ1 SSZ1 CIMG_01229.3 Supercontig 1: 5252843-5255630 + CimmKAR2 KAR2 CIMG_03236.3 Supercontig 6: 809068-812503 + CimmLHS1 LHS1

Aspergillus nidulans FGSC A4 BROAD Institute ANID_05129 Contig 88: 88007-90384 + AnidSSA SSA1 ANID_11227+ANID_10202 Contig 175: 18168-19865 + / Contig 22: 134-1579 + AnidSSB1 SSB2 contigs contain partial sequences, full sequence reconstructed using combined data from both contigs ANID_01047 Contig 16: 8192-11153 + AnidSSE1 SSE2 ANID_06010 Contig 103: 5334-7986 + AnidSSC1 SSC1 ANID_04616 Contig 79: 68926-71101 - AnidSSZ1 SSZ1 ANID_02062 Contig 32: 187084-189402 - AnidKAR2 KAR2 ANID_00847 Contig 13: 191288-194278 - AnidLHS1 LHS1

Aspergillus fumigatus EBI Integr8 Afu1g07440 Chromosome 1: 2102587-2104621 + AfumSSA SSA4 Afu8g03930 Chromosome 8: 839316-841607 + AfumSSB2 SSB2 Afu1g12610 Chromosome 1: 3331407-3333764 + AfumSSE1 SSE2 Afu2g09960 Chromosome 2: 2551872-2554095 - AfumSSC1 SSC1 Afu2g02320 Chromosome 2: 584085-586013 - AfumSSZ1 SSZ1 Afu2g04620 Chromosome 2: 1258486-1260682 - AfumKAR2 KAR2 Afu1g15050 Chromosome 1: 4045354-4048347 - AfumLHS1 LHS1

Aspergillus oryzae BROAD Institute AO090012000995 Supercontig 12: 2578981-2581024 + AorySSA SSA2 AO090103000007 Supercontig 25: 21617-23874 + AorySSB1 SSB2 AO090113000129 Supercontig 17: 1656464-1658958 + AorySSE1 SSE2 AO090011000638 Supercontig 22: 1645256-1647468 + AorySSC1 SSC1 ORF contains partial sequence, full sequence reconstructed using genomic data and homology with Aterreus,Anidulans and Afumigatus AO090011000513 Supercontig 22: 1284917-1286908 - AorySSZ1 SSZ1 AO090003000257 Supercontig 6: 682684-684869 + AoryKAR2 KAR2 AO090005001222 Supercontig 2: 3237941-3240898 + AoryLHS1 LHS1

Aspergillus terreus BROAD Institute ATEG_10178.1 Supercontig 16: 632123-634104 + AterSSA SSA2 ATEG_00003.1 Supercontig 1: 13213-15416 - AterSSB1 SSB2 ATEG_00453.1 Supercontig 1: 1283326-1285800 + AterSSE1 SSE2 ATEG_04432.1 Supercontig 6: 524851-527117 - AterSSC1 SSC1 ATEG_06952.1 Supercontig 10: 305711-307663 + AterSSZ1 SSZ1 ATEG_06040.1 Supercontig 8: 1156564-1158740 + AterKAR2 KAR2 ATEG_00843.1 Supercontig 1: 2352335-2361162 + AterLHS1 LHS1 ORF contains a very long protein, the LHS1 homolog forms the 2nd exon

Candida hispaniensis Y-5580 Cecile Neuveglise YAHI0S3-06700g YAHI0S3_545647..547605 ChisSSA-X1 SSA3 YAHI0S6-03510g YAHI0S6_270881..272824 ChisSSA-A SSA3 - no ORF data published - YAHI_scf4343_434777..436777 ChisKAR2 KAR2 - no ORF data published - YAHI_scf4344_84637..86862 ChisSSC1 SSC1 - no ORF data published - YAHI_scf4340_426683..428518 ChisSSB1 SSB2 - no ORF data published - YAHI_scf4343_680889..682967 ChisSSE1 SSE2 - no ORF data published - YAHI_scf4345_178227..179846 ChisSSZ1 SSZ1 - no ORF data published - YAHI_scf4341_168515..171424 ChisLHS1 LHS1

Candida alimentaria Cecile Neuveglise - no ORF data published - YAAL_scf8585_2882264..2884198 CaliSSA-E SSA3 - no ORF data published - YAAL_scf8609_2533460..2535394 CaliSSA-X2 SSA4 - no ORF data published - YAAL_scf8610_806196..808157 CaliSSA-G SSA3 - no ORF data published - YAAL_scf8611_3130179..3132134 CaliSSA-A SSA3 - no ORF data published - YAAL_scf8613_411018..412976 CaliSSA-C SSA4 - no ORF data published - YAAL_scf8613_155349..155953[pseudo] CaliSSA-X3p SSA3 - no ORF data published - YAAL_scf8611_2204030..2206042 CaliKAR2 KAR2 - no ORF data published - YAAL_scf8585_1668335..1670537 CaliSSC1 SSC1 - no ORF data published - YAAL_scf8605_54251..56086 CaliSSB1 SSB2 - no ORF data published - YAAL_scf8611_1242441..1244495 CaliSSE1 SSE1 - no ORF data published - YAAL_scf8609_1004967..1006577 CaliSSZ1 SSZ1 - no ORF data published - YAAL_scf8609_1683712-1686792. CaliLHS1 LHS1

Candida phangngensis CBS10407 Cecile Neuveglise - no ORF data published - YAPH233-YALI0D08184g complement(1614545..1616491) CphaSSA-C SSA3 - no ORF data published - YAPH233-YALI0D22352g complement(3197089..3199029) CphaSSA-G SSA3 - no ORF data published - YAPH240-YALI0D08184g complement(1417672..1419603) CphaSSA-X4 SSA3 - no ORF data published - YAPH240-YALI0D08184g complement(444452..446380) CphaSSA-X5 SSA3 - no ORF data published - YAPH242-YALI0D08184g complement(1854555..1856501) CphaSSA-A SSA3 - no ORF data published - YAPH239-YALI0E13706g 1641771..1643789 CphaKAR2 KAR2 - no ORF data published - YAPH233-YALI0C17347g join(618522..618560,618773..620680) CphaSSC1 SSC1 - no ORF data published - YAPH242-YALI0A00132g complement(168836..170671) CphaSSB1 SSB2 - no ORF data published - YAPH239-YALI0E13255g complement(1675854..1677911) CphaSSE1 SSE1 - no ORF data published - YAPH241-YALI0B12474g 199154..200764 CphaSSZ1 SSZ1 - no ORF data published - YAPH241-YALI0B08778g complement(1092513..1095374) CphaLHS1 LHS1

Candida yakushimensis CBS10253 Cecile Neuveglise - no ORF data published - YAYA_scf25028_28151..30081 YyakSSA-Fp SSA4 - no ORF data published - YAYA_scf25030_10981..12897 YyakSSA-D SSA3 - no ORF data published - YAYA_scf25030_622567..624501 YyakSSA-E SSA4 - no ORF data published - YAYA_scf25031_2026191..2028137 YyakSSA-A SSA3

YARROWIA - no ORF data published - YAYA_scf25031_2677312..2679276 YyakSSA-B SSA3 - no ORF data published - YAYA_scf25032_1078054..1080003 YyakSSA-C SSA4 - no ORF data published - YAYA_scf25030_2324415..2326427 YyakKAR2 KAR2 - no ORF data published - YAYA_scf25033_736061..738306 YyakSSC1 SSC1 - no ORF data published - YAYA_scf25032_13901..15736 YyakSSB1 SSB2 - no ORF data published - YAYA_scf25030_1569979..1572036 YyakSSE1 SSE1

Strona 2 YARROWIA Fungi Hsp70

- no ORF data published - YAYA_scf25034_2297654..2299264 YyakSSZ1 SSZ1 - no ORF data published - YAYA_scf25034_3835205..3838159 YyakLHS1 LHS1

Candida galli CBS9722 Cecile Neuveglise - no ORF data published - Yaga0C-YALI0D08184g complement(join(1301102..1302967,1302967..1303059)) CgalSSA-A SSA3 - no ORF data published - Yaga0D-YALI0D22352g 1595424..1597382 CgalSSA-B SSA3 - no ORF data published - Yaga0D-YALI0E35046g complement(3125512..3127443) CgalSSA-X6 SSA3 - no ORF data published - Yaga0E-YALI0E35046g 718090..720036 CgalSSA-E SSA3 - no ORF data published - Yaga0E-YALI0E35046g complement(4952556..4954487) CgalSSA-F SSA3 - no ORF data published - Yaga0E-YALI0E35046g3 19404..21335 CgalSSA-D SSA3 - no ORF data published - Yaga0F-YALI0F25289g 5231772..5233733 CgalSSA-C SSA3 - no ORF data published - Yaga0E-YALI0E13706g complement(4384451..4386463) CgalKAR2 KAR2 - no ORF data published - Yaga0B-YALI0C17347g 2599322..2601641 CgalSSC1 SSC1 - no ORF data published - YagaOD-YALI0A00132g 39754..41589 CgalSSB1 SSB2 - no ORF data published - Yaga0E-YALI0E13255g 2867376.. 2869433 CgalSSE1 SSE1 - no ORF data published - Yaga0F-YALI0B12474g complement(2631902..2633512) CgalSSZ1 SSZ1 - no ORF data published - Yaga0E-YALI0B08778g 3129789.. 3132875 CgalLHS1 LHS1

Yarrowia lipolytica Genolevures YALI0D08184g CHR D: 1045550 – 1043604 YlipSSA-A SSA1 YALI0D22352g CHR D: 2843487 – 2845436 YlipSSA-B SSA4 YALI0F25289g CHR F: 3260540 – 3258579 YlipSSA-C SSA4 YALI0E35046g CHR E: 4190297 – 4188366 YlipSSA-D SSA4 YALI0E29491g CHR E: 3493589 – 3493885 YlipSSA-Ep SSA3 YALI0A00132g CHR A: 7045 – 8880 YlipSSB1 SSB2 YALI0E13255g CHR E: 1595877 – 1597934 YlipSSE1 SSE1 YALI0C17347g CHR C: 2440654 – 2440692, 2438437 – 2440341 YlipSSC1 SSC1 YALI0B12474g CHR B: 1664503 – 1666113 YlipSSZ1 SSZ1 YALI0E13706g CHR E: 1658116 – 1656104 YlipKAR2 KAR2 YALI0B08778g CHR B: 1207977 – 1204954 YlipLHS1 LHS1

Candida lusitaniae BROAD Institute CLUG_03234.1 Supercontig 4: 77068-79032 - ClusSSA-a SSA1 CLUG_01400.1 Supercontig 2: 390531-392468 + ClusSSA-b SSA1 CLUG_02820.1 Supercontig 3: 1072350-1074239 - ClusSSB1 SSB2 CLUG_02016.1 Supercontig 2: 1623249-1625318 + ClusSSE1 SSE1 CLUG_04122.1 Supercontig 5: 42250-44184 + ClusSSC1 SSC1 CLUG_03878.1 Supercontig 4: 1315749-1317701 - ClusSSQ1 SSC1 CLUG_05851.1 Supercontig 8: 549826-551862 + ClusKAR2 KAR2 CLUG_05445.1 Supercontig 7: 427265-428886 + ClusSSZ1 SSZ1 ORF contains a frameshifted sequence, full sequence reconstructed using genomic data and trace archive CLUG_04525.1 Supercontig 5: 879033-882343 + ClusLHS1 LHS1 ORF contains partial sequence, nearly full sequence reconstructed using genomic data, trace archive and using homology with Dhansenii

Meyerozyma guillermondii BROAD Institute PGUG_05605.1 Supercontig 7: 760966-762921 + MguiSSA-ap SSA3 (Pichia guillermondii) PGUG_04814.1 Supercontig 6: 360136 – 358199 MguiSSA-b SSA2 PGUG_01956.1 Supercontig 2: 1331754 – 1329913 MguiSSB1 SSB2 PGUG_00585.1+PGUG_00586.1 Supercontig 1:1030838 – 1032948 MguiSSE1 SSE1 ORFs contains partial sequences, full sequence reconstructed using genomic data and trace archive PGUG_04143.1 Supercontig 5: 350855 – 352801 MguiSSC1 SSC1 PGUG_04519.1 Supercontig 5: 1055829 – 1057802 MguiSSQ1 SSC1 PGUG_00144.1 Supercontig 1: 244452-242821 MguiSSZ1 SSZ1 PGUG_04934.1 (partial) Supercontig 6: 596018 – 593982 MguiKAR2 KAR2 ORFs contains partial sequences, full sequence reconstructed using genomic data and trace archive PGUG_02995.1 Supercontig 3: 1231598 – 1228821 MguiLHS1 LHS1

Debaryomyces hansenii BROAD Institute DEHA0D06116g Supercontig 4: 436738-438699 - DhanSSA-a SSA2 Genolevures DEHA0D13838g Supercontig 4: 1059116-1061062 + DhanSSA-b SSA1 DEHA0A11539g Supercontig 1: 937961-939802 + DhanSSB1 SSB2 DEHA0D17215g Supercontig 4: 1324520-1326607 + DhanSSE1 SSE1 DEHA0C08877g Supercontig 3: 727170-729113 + DhanSSC1 SSC1 DEHA0G17688g Supercontig 7: 1354668-1356635 + DhanSSQ1 SSC1 DEHA0A01749g Supercontig 1: 108699-110750 - DhanKAR2 KAR2 DEHA0G04587g Supercontig 7: 349980-352748 + DhanLHS1 LHS1 DEHA2E03410g Supercontig 5/E: 291711 - 293339 DhanSSZ1 SSZ1

Candida parapsilosis BROAD Institute CPAG_04247.0 Contig 136: 73482-75437 - CparSSA-a SSA3 CPAG_04938.0 Contig 140: 211688-213628 + CparSSA-b SSA4 CPAG_04207.0 Contig 135: 255883-257724 - CparSSB1 SSB2 CPAG_00100.0 Contig 35: 3304-5424 + CparSSE1 SSE1 CPAG_03670.0 Contig 130: 154581-156536 - CparSSC1 SSC1 CPAG_04753.0 Contig 139: 220038-221975 - CparSSQ1 SSC1 CPAG_01744.0 Contig 102: 106644-108257 - CparSSZ1 SSZ1 CPAG_04622.0 Contig 138: 269187-271241 + CparKAR2 KAR2 CPAG_01321.0 Contig 93: 11466-14276 - CparLHS1 LHS1

Candida tropicalis BROAD Institute CTRG_03648.3 Supercontig 4: 1466594-1468540 - CtroSSA-a SSA4 CTRG_04728.3 Supercontig 6: 1007504-1009438 + CtroSSA-b SSA1 CTRG_03109.3 Supercontig 4: 164218-166059 - CtroSSB1 SSB2 CTRG_04273.3 Supercontig 5: 1351392-1353467 - CtroSSE1 SSE1 CTRG_01722.3 Supercontig 2: 1395100-1397040 + CtroSSC1 SSC1 CTRG_05157.3 Supercontig 7: 759503-761413 - CtroSSQ1 SSC1 CTRG_03925.3 Supercontig 5: 470629-472251 + CtroSSZ1 SSZ1 CTRG_01299.3 Supercontig 2: 483729-485789 - CtroKAR2 KAR2 CTRG_01349.3 Supercontig 2: 594327-597200 + CtroLHS1 LHS1

Candida dubliniensis Sanger Institute CD36_12490 CHR 1: 2990554 – 2992533 + CdubSSA-a SSA4 CD36_04050 CHR 1: 919866 – 921809 - CdubSSA-b SSA1 CD36_33730 CHR R: 1735413 – 1737254 + CdubSSB1 SSB2 CD36_05770 CHR 1: 1317430 – 1319556 - CdubSSE1 SSE1 CD36_21650 CHR 2: 1509835 – 1511781 + CdubSSC1 SSC1 CD36_73680 CHR 7: 922410 – 924329 + CdubSSQ1 SSC1 CD36_44350 CHR 4: 1045493 – 1047115 + CdubSSZ1 SSZ1 CD36_16010 CHR 2: 188912 – 190969 + CdubKAR1 KAR2 CD36_17610 CHR 2: 579192 - 576394 - CdubLHS1 LHS1

Candida albicans WO-1 BROAD Institute CAWG_00103.1 / HSP70 Supercontig 1: 238582-240513 - CalbSSA-a SSA4 CAWG_00963.1 / SSA2 Supercontig 1: 2302015-2303952 + CalbSSA-b SSA1

CTG GROUP CAWG_02101.1 Supercontig 2: 1758887-1760728 + CalbSSB1 SSB2 CAWG_00797.1 Supercontig 1: 1913171-1915279 + CalbSSE1 SSE1 CAWG_05834.1 Supercontig 9: 139233-141179 + CalbSSC1 SSC1 CAWG_05730.1 Supercontig 8: 876839-878755 + CalbSSQ1 SSC1 CAWG_03345.1 Supercontig 4: 601054-602676 - CalbSSZ1 SSZ1 CAWG_03887.1 Supercontig 5: 180920-182983 + CalbKAR2 KAR2 CAWG_04044.1 Supercontig 5: 549902-552700 + CalbLHS1 LHS1

Scheffersomyces stipitis NCBI PICST_80761 Chr1 (2017568..2019526) SstiSSA-a SSA3 PICST_84662 Chr6 complement(630721..632652) SstiSSA-b SSA4 PICST_79362 Chr7 (460467..462308) SstiSSB1 SSB2 PICST_78689 Chr6 complement(1635139..1637082) SstiSSC1 SSC1 PICST_87680 Chr2 complement(1594863..1596815) SstiSSQ1 SSC1 PICST_69996 Chr2 complement(255860..257905) SstiKAR2 KAR2 PICST_80235 Chr1 complement(975511..977601) SstiSSE1 SSE1 PICST_85151 Chr7 (119178..120815) SstiSSZ1 SSZ1 PICST_42658 Chr2 complement(543302..546091) SstiLHS1 LHS1

Candida orthopsilosis Co 90-125 NCBI CORT0G00500 c7 complement(79787..80419) // contig64 (7490-12309) CortSSA-a SSA3 Co 90-125 ORF contains a partial sequence, full sequence reconstructed using data from the AY2 strain, paralog synteny determined based on sequence similarity Candida orthopsilosis AY2 NCBI CORT0B08150 c2 (1696261..1696893) // contig718 (5506-10334) CortSSA-b SSA4 Co 90-125 ORF contains a partial sequence, full sequence reconstructed using data from the AY2 strain, paralog synteny determined based on sequence similarity CORT0D05790 c4 complement(1161719..1163560) CortSSB1 SSB2 CORT0B06250 c2 (1324900..1326855) CortSSC1 SSC1 CORT0G02560 c7 (512271..514211) CortSSQ1 SSC1 CORT0A13190 c1 complement(2866861..2868912) CortKAR2 KAR2 CORT0B04950 c2 (1077703..1079796) CortSSE1 SSE1 CORT0E04540 c5 complement(1009642..1011255) CortSSZ1 SSZ1 CORT0A11580 c1 complement(2559395..2562205) CortLHS1 LHS1

Candida tenuis NCBI CANTEDRAFT_107645 c21 (2016600..2018534) CtenSSA-a SSA4 CANTEDRAFT_104659 c15 complement(524416..526344) CtenSSA-b SSA4 CANTEDRAFT_100511 c4 (8246..10087) CtenSSB1 SSB2 CANTEDRAFT_100988 c4 complement(664208..666142) CtenSSC1 SSC1 CANTEDRAFT_102936 c8 complement(14877..16799) CtenSSQ1 SSQ1 CANTEDRAFT_102381 c6 complement(590787..592832) CtenKAR2 KAR2 CANTEDRAFT_112794 c6 complement(790531..792627) CtenSSE1 SSE1 CANTEDRAFT_97846 c15 complement(211510..213153) CtenSSZ1 SSZ1 CANTEDRAFT_128824 c4 complement(148956..151730) CtenLHS1 LHS1

Strona 3 CTG GROUP

Fungi Hsp70

Lodderomyces elongisporus BROAD Institute LELG_03645 Supercontig 4: 1593130-1595073 LeloSSA-a SSA3 LELG_01669 Supercontig 2: 873906-875840 LeloSSA-b SSA2 LELG_03097 Supercontig 4: 106976-108817 + LeloSSB1 SSB2 LELG_01847 Supercontig 2: 1348570-1350534 - LeloSSC1 SSC1 LELG_05039 Supercontig 8: 92736-94691 - LeloSSQ1 SSC1 LELG_01164 Supercontig 1: 3106363-3108426 - LeloKAR2 KAR2 LELG_01979 Supercontig 2: 1666184-1668286 - LeloSSE1 SSE1 LELG_05616 Supercontig 11: 22203-23828 - LeloSSZ1 SSZ1 LELG_01083 Supercontig 1: 2866724-2869447 - LeloLHS1 LHS1

Spathaspora passalidarum NCBI SPAPADRAFT_138204 c4 complement(1647522..1649483) SpasSSA-a SSA3 SPAPADRAFT_53368 c1 (809852..811786) SpasSSA-b SSA2 SPAPADRAFT_134244 c2 (523624..525465) SpasSSB1 SSB2 SPAPADRAFT_60315 c3 complement(677675..679606) SpasSSC1 SSC1 SPAPADRAFT_63632 c8 complement(367554..369449) SpasSSQ1 SSC1 SPAPADRAFT_60887 c3 complement(1879782..1881326) SpasKAR2 KAR2 ORF contains partial sequence, full sequence reconstructed using genomic data and similarity with other CTG group homologs SPAPADRAFT_58517 c1 complement(1429380..1431207) SpasSSE1 SSE1 ORF contains partial sequence, full sequence reconstructed using genomic data and similarity with other CTG group homologs SPAPADRAFT_58845 c1 complement(2173195..2174832) SpasSSZ1 SSZ1 SPAPADRAFT_50314 c3 (1602011..1604749) SpasLHS1 LHS1

Lachancea kluyveri Genolevures SAKL0H20020g CHR H: 1754166 – 1752241 LkluSSA2 SSA1 (Saccharomyces kluyveri) SAKL0F12606g CHR F: 981486 – 983453 LkluSSA3 SSA3 SAKL0E13618g CHR E: 1124772 – 1122931 LkluSSB1 SSB2 SAKL0H08822g CHR H: 757260 – 755179 LkluSSE1 SSE1 SAKL0D11638g CHR D: 966535 – 964598 LkluSSC1 SSC1 SAKL0H04004g CHR H: 373807 – 371864 LkluSSQ1 SSQ1 SAKL0G09416g CHR G: 804704 – 806323 LkluSSZ1 SSZ1 SAKL0G11792g CHR G: 1003865 – 1001826 LkluKAR2 KAR2 SAKL0B11990g CHR B: 1034223 – 1031509 LkluLHS1 LHS1

Lachancea thermotolerans Genolevures KLTH0H08448g Klth0H: 736897 - 738819 LtheSSA2 SSA2 KLTH0C06556g Klth0C: 573991 - 575820 LtheSSA3 SSA3 KLTH0B07744g Klth0B: 628227 - 630065 LtheSSB1 SSB2 KLTH0E13662g Klth0E: 1209958 – 1212039 LtheSSE1 SSE1 KLTH0F06996g Klth0F: 612749 – 614689 LtheSSC1 SSC1 KLTH0D13530g Klth0D: 1102731 – 1104581 LtheSSQ1 SSQ1 KLTH0F07150g Klth0F: 624087 – 625703 LtheSSZ1 SSZ1 KLTH0G07810g Klth0G: 629654 – 631633 LtheKAR2 KAR2 KLTH0B04290g Klth0B: 346782 – 349325 LtheLHS1 LHS1

Lachancea waltii BROAD Institute Kwal_2130 Contig 228: 2784 – 862 LwalSSA2 SSA2 Kellis et al, 2004 (sup. materials) Kwal_10817 Contig 33: 80028 – 78076 LwalSSA3 SSA3 Kwal_4548 Contig 4: 15848 – 17689 LwalSSB1 SSB2 Kwal_12262 Contig 22: 32828 – 34909 LwalSSE1 SSE1 Kwal_14307 (partial) Contig 104: 28507 – 30452 LwalSSC1 SSC1 ORF contains a frameshifted sequence, full sequence reconstructed using genomic data and trace archive Kwal_9278 (partial) Contig 70: 67493 – 68860 // contig 71: 5 – 589 LwalSSQ1 SSQ1 contigs contain partial sequences, full sequence reconstructed using combined data from both contigs Kwal_14336 Contig 104: 39418 – 41034 LwalSSZ1 SSZ1 Kwal_2889 Contig 246: 63588 – 61549 LwalKAR2 KAR2 Kwal_14948 Contig 396: 13355 – 16012 LwalLHS1 LHS1

Kluyveromyces lactis Genolevures KLLA0D04224g CHR D: 363732 – 365654 KlacSSA2 SSA2 KLLA0E20527g CHR E: 1822412 – 1824364 KlacSSA3 SSA3 KLLA0D19041g CHR D: 1602756 – 1600915 KlacSSB1 SSB2 KLLA0E24597g CHR E: 2183993 – 2186077 KlacSSE1 SSE1 KLLA0E22309g CHR E: 1990670 – 1992595 KlacSSC1 SSC1 KLLA0B07843g CHR B: 690308 – 688371 KlacSSQ1 SSQ1 KLLA0C14652g CHR C: 1280333 – 1281970 KlacSSZ1 SSZ1 PRE-SACCHAROMYCETACEAE WGD PRE-SACCHAROMYCETACEAE KLLA0D09559g CHR D: 807614 – 805574 KlacKAR2 KAR2 ORF contains a frameshifted sequence, full sequence reconstructed using genomic data and trace archive KLLA0F26455g CHR F: 2450871 – 2453462 KlacLHS1 LHS1

Eremothecium gossypii EBI Integr8 EGOS_Q754F6 / AGOS_AFR114W CHR 6: 641037 – 642962 EgosSSA2 SSA2 EGOS_Q756R7 / AGOS_AER187W CHR 5: 983874 – 985823 EgosSSA-X1 SSA3 EGOS_Q75E44 / AGOS_ABL174C CHR 2: 78053 – 76212 EgosSSB1 SSB2 ORF has CTG as the starting codon, instead of ATG EGOS_Q74ZJ0 / AGOS_AGR212W CHR 7: 1153862 – 1155955 EgosSSE1 SSE1 EGOS_Q753R1 / AGOS_AFR352C CHR 6: 1075912 – 1073984 EgosSSC1 SSC1 EGOS_Q752J2 / AGOS_AFR583W CHR 6: 1485413 – 1487323 EgosSSQ1 SSQ1 EGOS_Q75EX1 / AGOS_AAL043C CHR 1: 266861 – 268474 EgosSSZ1 SSZ1 EGOS_Q75C78 / AGOS_ACR038W CHR 3: 428006 – 430030 EgosKAR2 KAR2 EGOS_Q759Z5 / AGOS_ADR128C CHR 4: 937048 – 934445 EgosLHS1 LHS1

Zygosaccharomyces rouxii Genolevures ZYRO0B00836g Zyro0B: 65436 – 67271 ZrouSSA2 SSA2 ZYRO0B03476g Zyro0B: 284311 – 286137 ZrouSSA3 SSA3 ZYRO0C10450g Zyro0C: 800876 – 802714 ZrouSSB1 SSB2 ZYRO0A10516g Zyro0A: 849544 – 851625 ZrouSSE1 SSE1 ZYRO0F05038g Zyro0F: 420062 – 421960 ZrouSSC1 SSC1 ZYRO0B07524g Zyro0B: 589928 – 591784 ZrouSSQ1 SSQ1 ZYRO0C14190g Zyro0C: 1111922 – 1113544 ZrouSSZ1 SSZ1 ZYRO0A10846g Zyro0A: 872493 – 874520 ZrouKAR2 KAR2 ZYRO0E01474g Zyro0E: 105281 – 107740 ZrouLHS1 LHS1

Vanderwaltozyma polyspora SGD Kpol_1054.19 KpolContig1054a: 45398 – 47320 VpolSSA2 SSA2 Kpol_1041.2 KpolContig1041a: 5803 – 7725 VpolSSA-X2 SSA2 Kpol_1018.97 KpolContig1018c: 256463 – 258388 VpolSSA3 SSA3 Kpol_1045.36 KpolContig1045c: 85701..87641 VpolSSA4 SSA4 Kpol_1054.33 KpolContig1054a: 83386 – 85227 VpolSSB2 SSB2 Kpol_1014.35 KpolContig1014d: 65303 – 67144 VpolSSB1 SSB2 Kpol_1048.35 KpolContig1048b: 7671 – 9689 VpolSSE1 SSE1 Kpol_242.3 KpolContig242: 2796 – 4775 VpolSSE2 SSE1 Kpol_1061.31 KpolContig1061c: 6149 – 8089 VpolSSC1 SSC1 Kpol_472.7 KpolContig472: 17347 – 19218 VpolSSQ1 SSQ1 Kpol_1012.1 KpolContig1012a: 631 – 2532 VpolSSC3 SSC1 Kpol_1005.9 KpolContig1005: 13821..15440 VpolSSZ1 SSZ1 Kpol_339.4 KpolContig339: 6900 – 8510 VpolSSZ2 SSZ1 Kpol_1004.11 KpolContig1004a: 21700 – 23670 VpolKAR2 KAR2 Kpol_1019.2 KpolContig1019a: 4142 – 6757 VpolLHS1 LHS1 Kpol_1073.18 KpolContig1073b: 26498 – 29009 VpolLHS2 LHS1

Naumovozyma castelli YGOB NCAS_0C05760 Chr3 (1183153..1185075) NcasSSA2 SSA2 (Saccharomyces castelli) NCBI NCAS_0A14440 Chr1 (2840790..2842730) NcasSSA4 SSA3 NCAS_0E02580 Chr5 complement(515573..517513) NcasSSA3 SSA3 NCAS_0D00240 Chr4 (27261..29102) NcasSSB1 SSB2 NCAS_0G01560 Chr7 (278345..280186) NcasSSB2 SSB2 NCAS_0C01870 Chr3 complement(344877..346679) NcasSSE1 SSE1 NCAS_0B01860 Chr2 complement(304736..306832) NcasSSE2 SSE1 NCAS_0A08260 Chr1 complement(1631053..1633017) NcasSSC1 SSC1 NCAS_0J01260 Chr10 (217723..219678) NcasSSQ1 SSQ1 NCAS_0H01530 Chr8 (295003..296628) NcasSSZ1 SSZ1 NCAS_0A01220 Chr1 (238768..240819) NcasKAR2 KAR2 NCAS_0H03240 Chr8 (621613..624528) NcasLHS1 LHS1

Candida glabrata Genolevures CAGL0G03795g CHR G: 363008 – 364930 CglaSSA2 SSA2 CAGL0G03289g CHR G: 312871 – 314814 CglaSSA4 SSA3 CAGL0K04741g CHR K: 465701 – 463860 CglaSSB2 SSB2 CAGL0C05379g CHR C: 514118 – 512277 CglaSSB1 SSB2 CAGL0M06083g CHR M: 637047 – 634963 CglaSSE2 SSE1 CAGL0G04917g CHR G: 472650 – 470710 CglaSSQ1 SSQ1 CAGL0I03322g CHR I: 285239 – 283299 CglaSSC1 SSC1 CAGL0I01496g CHR I: 120258 – 122201 CglaSSC3 SSC1 CAGL0L10560g CHR L: 1129742 – 1128132 CglaSSZ1 SSZ1 CAGL0D02948g CHR D: 308191 – 310194 CglaKAR2 KAR2 CAGL0F06369g CHR F: 630723 – 628054 CglaLHS1 LHS1

Saccharomyces uvarum Saccharomyces Sensu Stricto Project Sbay_Contig669.15 Contig 669: 32964 – 34901 SuvaSSA1 SSA1 SSS assembly contained only SSA3 and SSA4, extracted all four SSAs from SGD data (Saccharomyces bayanus CBS 7001) SGD Sbay_10.49 // Sbay_Contig566.14 Sbay_10 (100174-101864) // Contig 566: 31167 – 33086 SuvaSSA2 SSA2 SSS assembly contained only SSA3 and SSA4, extracted all four SSAs from SGD data Sbay_2.44 // Sbay_Contig616.26 Sbay_2 (81093-83647) // Contig c616 48192-50150 SuvaSSA3 SSA3 SSS assembly contained only SSA3 and SSA4, extracted all four SSAs from SGD data Sbay_5.222 // Sbay_Contig677.58 Sbay_5 (338527-340952) // Contig c677 127431-129359 SuvaSSA4 SSA4 SSS assembly contained only SSA3 and SSA4, extracted all four SSAs from SGD data Sbay_4.15 // Sbay_Contig538.4 Sbay_4 (31421-33858) // Contig c538 6878-8719 SuvaSSB1 SSB1 SSS assembly contained partial SSB2, extracted both SSBs from SGD data

Strona 4 SACCHAROMYCETACEAE POST-WGD SACCHAROMYCETACEAE Fungi Hsp70

Sbay_14.134 // Sbay_Contig667.61 Sbay_14 (246668-248568) // Contig c667 103948-105789 SuvaSSB2 SSB2 SSS assembly contained partial SSB2, extracted both SSBs from SGD data Sbay_16.206 Sbay_16 (359654-362331) SuvaSSE1 SSE1 Sbay_4.422 Sbay_4 (743056-745733) SuvaSSE2 SSE2 Sbay_12.128 Sbay_12 (193086-195649) SuvaSSC1 SSC1 Sbay_10.485 Sbay_10 (824352-826918) SuvaSSQ1 SSQ1 Sbay_5.39 Sbay_5 (69830-72354) SuvaSSC3 SSC3 Sbay_15.253 Sbay_15 (426849-429061) SuvaSSZ1 SSZ1 Sbay_12.47 Sbay_12 (63913-66470) SuvaKAR2 KAR2 Sbay_11.150 Sbay_11 (282486-285727) SuvaLHS1 LHS1

Saccharomyces kudriavzevii IFO 1802 Saccharomyces Sensu Stricto Project Skud_1.58 Skud_1 (114404-116823) SkudSSA1 SSA1 Skud_12.44 Skud_12 (86537-88956) SkudSSA2 SSA2 Skud_2.32 Skud_2 (67960-70385) SkudSSA3 SSA3 Skud_5.228 Skud_5 (360315-362740) SkudSSA4 SSA4 Skud_4.25 Skud_4 (40909-43346) SkudSSB1 SSB1 Skud_14.127 Skud_14 (240586-243023) SkudSSB2 SSB1 Skud_16.177 Skud_16 (321332-324009) SkudSSE1 SSE1 Skud_2.297 Skud_2 (541084-543698) SkudSSE2 SSE2 Skud_10.263 Skud_10 (475604-478185) SkudSSC1 SSC1

SACCHAROMYCETACEAE POST-WGD SACCHAROMYCETACEAE Skud_5.61 Skud_5 (92310-94819) SkudSSC3 SSC3 Skud_12.454 Skud_12 (798802-801371) SkudSSQ1 SSQ1 Skud_8.121 Skud_8 (198489-200701) SkudSSZ1 SSZ1 Skud_10.188 Skud_10 (348964-351521) SkudKAR2 KAR2 Skud_11.153 Skud_11 (281438-284679) SkudLHS1 LHS1

Saccharomyces mikatae IFO 1815 Saccharomyces Sensu Stricto Project Smik_1.63 Smik_1 (123006-124621) SmikSSA1 SSA1 Smik_12.40 Smik_12 (81603-83479) SmikSSA2 SSA2 Smik_2.43 Smik_2 (78473-80412) SmikSSA3 SSA3 Smik_5.249 Smik_5 (373523-375348) SmikSSA4 SSA4 Smik_1.64 Smik_1 (123006-124621) SmikSSA1p SSA1 Alternative transcript to Smik_1.63 Smik_4.12 Smik_4 (22025-23862) SmikSSB1 SSB1 Smik_14.123 Smik_14 (231362-233199) SmikSSB2 SSB2 Smik_16.131 Smik_16 (233516-235593) SmikSSE1 SSE1 Smik_2.310 Smik_2 (560159-562173) SmikSSE2 SSE2 Smik_10.320 Smik_10 (505806-507703) SmikSSC1 SSC1 Smik_5.73 / Smik_5.74 Smik_5 (103210-104729) SmikSSC3p SSC3 Smik_12.456 Smik_12 (798229-800180) SmikSSQ1 SSQ1 Smik_8.135 Smik_8 (200825-202437) SmikSSZ1 SSZ1 Smik_10.238 Smik_10 (377587-379544) SmikKAR2 KAR2 Smik_11.165 Smik_11 (282629-285270) SmikLHS1 LHS1

Saccharomyces paradoxus Saccharomyces Sensu Stricto Project Spar_1.73 Spar_1 (114869-116688) SparSSA1 SSA1 Spar_12.48 Spar_12 (82826-84645) SparSSA2 SSA2 Spar_2.48 Spar_2 (79731-81676) SparSSA3 SSA3 Spar_5.276 Spar_5 (375605-377430) SparSSA4 SSA4 Spar_4.37 Spar_4 (49681-51518) SparSSB1 SSB1 Spar_14.132 Spar_14 (236135-237972) SparSSB2 SSB2 Spar_16.222 Spar_16 (375298-377375) SparSSE1 SSE1 Spar_2.330 Spar_2 (546730-548744) SparSSE2 SSE2 Spar_10.310 Spar_10 (495494-497445) SparSSC1 SSC1 Spar_12.506 Spar_12 (814208-816177) SparSSQ1 SSQ1 Spar_5.49 Spar_5 (72542-74469) SparSSC3 SSC3 Spar_8.141 Spar_8 (201371-202983) SparSSZ1 SSZ1 Spar_10.229 Spar_10 (371921-373878) SparKAR2 KAR2 Spar_11.184 Spar_11 (306855-309496) SparLHS1 LHS1

Saccharomyces cerevisiae SGD YAL005C Chr I 141433-139505 ScerSSA1 SSA1 YLL024C Chr XII 97484-95565 ScerSSA2 SSA2 YBL075C Chr II 86446-84497 ScerSSA3 SSA3 YER103W Chr V 364585-366513 ScerSSA4 SSA4 YDL229W CHR IV 44066-45907 ScerSSB1 SSB1 YNL209W CHR XIV 252060-253901 ScerSSB2 SSB2 YPL106C CHR XVI 352272-350191 ScerSSE1 SSE1 YBR169C CHR II: 575991-573910 ScerSSE2 SSE2 YJR045C CHR X 521594-519630 ScerSSC1 SSC1 YLR369W CHR XII: 859551-861524 ScerSSQ1 SSQ1 YEL030W Chr V 94644-96578 ScerSSC3 SSC3 YHR064C CHR 8: 227143-225527 ScerSSZ1 SSZ1 YJL034W Chr X 381321-383369 ScerKAR2 KAR2 YKL073W Chr XI 296074-298719 ScerLHS1 LHS1

Strona 5 Supplementary Figures

Fig. S2. (A) Bayesian tree of amino acid sequences from the SSB subfamily. The tree was rooted using SSB orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95. (B) Bayesian tree of nucleotide sequences of SSB ORFs from post-WGD Saccharomycetaceae species. Scale is in expected nucleotide substitutions per site.

Fig. S3. Bayesian tree of amino acid sequences from the KAR subfamily. The tree was rooted using KAR orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95.

Fig. S4 (A) Bayesian tree of amino acid sequences from the SSC subfamily. The tree was rooted using SSC orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95. SSC1preSSQ1 - SSC1 sequences from species that pre-date the emergence of SSQ1; SSC1preSSC3 - SSC1 sequences from Saccharomycetaceae pre-WGD species that pre-date the emergence of SSC3. (B) Cladogram showing the hypothetical scenario of SSC evolution in the CTG and Saccharomycetaceae clades. Arrows indicate genes evolving under concerted evolution.

Fig. S5. (A) Bayesian tree of amino acid sequences from the SSE subfamily. The tree was rooted using SSE orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95. (B) Cladogram showing the hypothetical scenario of SSE evolution in the Saccharomycetaceae clade.

Fig. S6. (A) Bayesian tree of amino acid sequences from the SSZ subfamily. The tree was rooted using SSZ orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95. (B) Cladogram showing the hypothetical scenario of SSZ evolution in the Saccharomycetaceae clade.

Fig. S7. (A) Bayesian tree of amino acid sequences from the LHS subfamily. The tree was rooted using LHS orthologs from Basidiomycota. Scale is in expected amino acid substitutions per site. (**) - posterior probability ≥ 0.95. (B) Cladogram showing the hypothetical scenario of LHS evolution in the Saccharomycetaceae clade.

Fig. S8. Bayesian tree of amino acid sequences from the SSA subfamily in the Yarrowia clade. The tree was rooted using SSA1 from S. cerevisiae. Scale is in expected amino acid substitutions per site. Posterior probabilities are shown for the individual branches.

Fig. S9. Amino acid usage for Hsp70 subfamilies defined as number of different amino acids observed at a site in the sequence alignment (see Materials and Methods for details).

Fig. S10. P-values of the Kolmogorov-Smirnov tests comparing codon usage and evolutionary rates of the 14 Hsp70 orthologs in 5 Saccharomyces sensu-stricto species: S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii and S. uvarum. Values below 0.05 are marked in bold. (A) Codon usage comparisons: values below the main diagonal correspond to CAI comparisons, values above the main diagonal correspond to ENC comparisons. (B) Evolutionary rate comparisons: values below the main diagonal correspond to dN comparisons, values above the main diagonal correspond to dS comparisons.

Fig. S11. List of transcription factors whose binding sites were analyzed for 5'IGS of SSB1/SSB2 paralogs. In bold TF sites not present in species pre-dating SSB gene duplication.

Fig. S12. List of switched sites from fig. 5 and fig. 6. Group A and group B – Hsp70 subfamilies as in fig. 5 and fig. 6. Positions are indicated as in the S. cerevisiae ortholog. Dominant residue – residue most frequently present at a given site. BLOSUM62 score – substitution score between the dominant residue from Group A and dominant residue from Group B. Radical switched sites in bold.

Fig. S13. Localization of the switched radical sites on the structural model of the SBD of SSQ1 from S. cerevisaie, which was generated by homology modeling using SWISS-MODEL (Arnold et al. 2006) and structure of DnaK (PDB: 2KHO) as template. The SBD-β domain is shown in orange, the SBD-α domain is shown in red. In green are the switched radical sites at indicated positions. In blue are sites homologous to DnaK sites known to participate in substrate binding (Zhu et al. 1996).

Arnold K, Bordoli L, Kopp J, Schwede T. 2006. The SWISS-MODEL workspace: a web- based environment for protein structure homology modelling. Bioinformatics 22: 195-201. Zhu X, Zhao X, Burkholder WF, Gragerov A, Ogata CM, Gottesman ME, Hendrickson WA. 1996. Structural analysis of substrate binding by the molecular chaperone DnaK. Science 272: 1606-1614.

Supplementary Figure S2

UmaySSB1 CneoSSB1 CcinSSB1 PchrSSB1 A SjapSSB1 SjapSSB2 ** SpomSSB1 ScrySSB1 SoctSSB1 TreeSSB1 GmonSSB1 GzeaSSB1 NcraSSB1 CgloSSB1 PansSSB1 ** ** MgriSSB1 BfucSSB1 SsclSSB1 PnodSSB1 AcapSSB1 CimmSSB1 UreeSSB1 AterSSB1 AfumSSB1 AnidSSB1 ** AorySSB1 ChisSSB1 1 CalbSSB1 ** CaliSSB1 YlipSSB1 B 1 YyakSSB1 VpolSSB1 1 CgalSSB1 1 CphaSSB1 VpolSSB2 MguiSSB1 ClusSSB1 1 CglaSSB1 DhanSSB1 1 1 ** ** CtenSSB1 1 SstiSSB1 CglaSSB2 SpasSSB1 1 CtroSSB1 NcasSSB1 CalbSSB1 0,79 1 CdubSSB1 1 NcasSSB2 LeloSSB1 CortSSB1 1 CparSSB1 SkudSSB1 0,8 0,99 KlacSSB1 EgosSSB1 1 SkudSSB2 LkluSSB1 LtheSSB1 1 LwalSSB1 SuvaSSB1 1 1 ZrouSSB1 1 ** CglaSSB1 SuvaSSB2 CglaSSB2 NcasSSB1 1 Major clades: 0,78 SmikSSB1 NcasSSB2 1 Schizosaccharomyces SmikSSB2 1 ScerSSB2 Pezizomycotina SmikSSB2 SparSSB2 1 1 Yarrowia SkudSSB2 SparSSB1 CTG group SuvaSSB1 0,73 Saccharomycetaceae pre-WGD SuvaSSB2 1 ScerSSB1 SparSSB2 Saccharomycetaceae post-WGD 0,89 SmikSSB1 1 WGD-dependent duplication SparSSB1 ScerSSB1 0,73 SkudSSB1 Taxon-specific gene duplication 1 VpolSSB1 ScerSSB2 VpolSSB2

0.1 0.09

Supplementary Figure S3

UmayKAR2 CneoKAR2 CcinKAR2 PchrKAR2 SjapKAR2 ** SpomKAR2 ScryKAR2 SoctKAR2 PnodKAR2 AcapKAR2 CimmKAR2 UreeKAR2 ** AnidKAR2 AoryKAR2 ** AfumKAR2 AterKAR2 BfucKAR2 SsclKAR2 TreeKAR2 GmonKAR2 GzeaKAR2 MgriKAR2 NcraKAR2 CgloKAR2 PansKAR2 ChisKAR2 ** CphaKAR2 CaliKAR2 YyakKAR2 CgalKAR2 YlipKAR2 ClusKAR2 ** ** CtenKAR2 DhanKAR2 MguiKAR2 SstiKAR2 SpasKAR2 CtroKAR2 CalbKAR2 ** CdubKAR2 LeloKAR2 CortKAR2 CparKAR2 EgosKAR2 KlacKAR2 ** LtheKAR2 LwalKAR2 Major clades: LkluKAR2 Schizosaccharomyces ZrouKAR2 Pezizomycotina NcasKAR2 Yarrowia CglaKAR2 CTG group ScerKAR2 Saccharomycetaceae pre-WGD SparKAR2 Saccharomycetaceae post-WGD SkudKAR2 SuvaKAR2 SmikKAR2 VpolKAR2

0.09

Supplementary Figure S4

CneoSSC1 UmaySSC1 CcinSSC1 PchrSSC1 A B CtenSSC1 SjapSSC1 ClusSSC1 ** SpomSSC1 MguiSSC1 ScrySSC1 DhanSSC1 SoctSSC1 SstiSSC1 SpasSSC1 PnodSSC1 LeloSSC1 AcapSSC1 CortSSC1 CimmSSC1 CparSSC1 ** UreeSSC1 CtroSSC1 AorySSC1 CdubSSC1 CalbSSC1 AnidSSC1 EgosSSC1 AfumSSC1 KlacSSC1 AterSSC1 LkluSSC1 BfucSSC1 LwalSSC1 SsclSSC1 SSC1 LtheSSC1 preSSQ1 ZrouSSC1 NcraSSC1 VpolSSC1 CgloSSC1 NcasSSC1 PansSSC1 CglaSSC1 MgriSSC1 SuvaSSC1 SkudSSC1 ** TreeSSC1 SmikSSC1 GmonSSC1 SparSSC1 GzeaSSC1 ScerSSC1 ChisSSC1 VpolSSC3 ** CaliSSC1 NcasSSC3* CglaSSC3 CphaSSC1 SuvaSSC3 YyakSSC1 SkudSSC3 CgalSSC1 SmikSSC3p YlipSSC1 SparSSC3 ClusSSC1 ScerSSC3 CtenSSQ1 SstiSSC1 ClusSSQ1 ** DhanSSC1 MguiSSQ1 MguiSSC1 DhanSSQ1 ** CtenSSC1 SstiSSQ1 SpasSSQ1 SpasSSC1 LeloSSQ1 CtroSSC1 CortSSQ1 CalbSSC1 CparSSQ1 CdubSSC1 CtroSSQ1 ** LeloSSC1 CdubSSQ1 CalbSSQ1 CortSSC1 EgosSSQ1 CparSSC1 KlacSSQ1 EgosSSC1 LkluSSQ1 KlacSSC1 LwalSSQ1 LkluSSC1 LtheSSQ1 SSC1 ZrouSSQ1 LtheSSC1 preSSC3 VpolSSQ1* LwalSSC1 NcasSSQ1 ZrouSSC1 CglaSSQ1 CglaSSC3 SuvaSSQ1 SkudSSQ1 CglaSSC1 SmikSSQ1 VpolSSC3 SparSSQ1 VpolSSC1 ScerSSQ1 NcasSSC1 VpolSSQ1 SkudSSC1 NcasSSQ1* CglaSSQ1* SmikSSC1 SuvaSSQ1* ScerSSC1 SkudSSQ1* SuvaSSC1 SmikSSQ1* SparSSC1 SparSSQ1* SuvaSSC3 ScerSSQ1* ** SkudSSC3 ScerSSC3 SSC3 SparSSC3 ClusSSQ1 DhanSSQ1 CtenSSQ1 ** MguiSSQ1 SstiSSQ1 SpasSSQ1 CtroSSQ1 CalbSSQ1 CdubSSQ1 LeloSSQ1 CortSSQ1 CparSSQ1 ** ZrouSSQ1 LkluSSQ1 SSQ1 KlacSSQ1 EgosSSQ1 Major clades: LtheSSQ1 Schizosaccharomyces LwalSSQ1 VpolSSQ1 Pezizomycotina NcasSSQ1 SparSSQ1 Yarrowia ** ScerSSQ1 SmikSSQ1 CTG group SuvaSSQ1 Saccharomycetaceae pre-WGD SkudSSQ1 CglaSSQ1 Saccharomycetaceae post-WGD WGD-dependent duplication 0.3 Ancestral gene duplication Gene loss

Supplementary Figure S5

UmaySSE1 CneoSSE1 A CcinSSE1 B KlacSSE1 PchrSSE1 EgosSSE1 SjapSSE1 ** SpomSSE1 LkluSSE1 ScrySSE1 LwalSSE1 SoctSSE1 LtheSSE1 PnodSSE1 AcapSSE1 ZrouSSE1 CimmSSE1 VpolSSE1 UreeSSE1 NcasSSE1 ** AnidSSE1 AfumSSE1 CglaSSE1* ** AterSSE1 SuvaSSE1 AorySSE1 SkudSSE1 BfucSSE1 SsclSSE1 SmikSSE1 TreeSSE1 ScerSSE1 GmonSSE1 SparSSE1 GzeaSSE1 VpolSSE2 MgriSSE1 ** CgloSSE1 NcasSSE2 PansSSE1 CglaSSE2 NcraSSE1 SuvaSSE2 ChisSSE1 ** CaliSSE1 SkudSSE2 CphaSSE1 SmikSSE2 YlipSSE1 ScerSSE2 CgalSSE1 YyakSSE1 SparSSE2 SstiSSE1 ClusSSE1 ** DhanSSE1 MguiSSE1 ** CtenSSE1 SpasSSE1 CtroSSE1 CalbSSE1 CdubSSE1 ** LeloSSE1 CortSSE1 CparSSE1 LtheSSE1 LwalSSE1 LkluSSE1 ** EgosSSE1 KlacSSE1 VpolSSE1 Major clades: ZrouSSE1 Schizosaccharomyces NcasSSE1 SuvaSSE1 Pezizomycotina SkudSSE1 Yarrowia SmikSSE1 CTG group ScerSSE1 Saccharomycetaceae pre-WGD SparSSE1 Saccharomycetaceae post-WGD CglaSSE2 NcasSSE2 WGD-dependent duplication VpolSSE2 Gene loss SuvaSSE2 SkudSSE2 SmikSSE2 ScerSSE2 SparSSE2

0.2

Supplementary Figure S6

UmaySSZ1

CneoSSZ1 root CcinSSZ1 A B KlacSSZ1 PchrSSZ1 SjapSSZ1 EgosSSZ1 ** SpomSSZ1 LkluSSZ1 ScrySSZ1 LwalSSZ1 SoctSSZ1 LtheSSZ1 AcapSSZ1 ZrouSSZ1 CimmSSZ1 VpolSSZ1 UreeSSZ1 AnidSSZ1 NcasSSZ1 AfumSSZ1 CglaSSZ1 ** AorySSZ1 SuvaSSZ1 ** AterSSZ1 SkudSSZ1 PnodSSZ1 SmikSSZ1 BfucSSZ1 SsclSSZ1 ScerSSZ1 TreeSSZ1 SparSSZ1 GmonSSZ1 VpolSSZ2 GzeaSSZ1 NcasSSZ1* MgriSSZ1 CglaSSZ1* NcraSSZ1 SuvaSSZ1* ** CgloSSZ1 PansSSZ1 SkudSSZ1* ChisSSZ1 SmikSSZ1* ** CaliSSZ1 ScerSSZ1* CphaSSZ1 SparSSZ1* CgalSSZ1 YlipSSZ1 YyakSSZ1 CtenSSZ1 ** ** MguiSSZ1 DhanSSZ1 SstiSSZ1 ClusSSZ1 SpasSSZ1 CtroSSZ1 ** CalbSSZ1 CdubSSZ1 LeloSSZ1 CortSSZ1 CparSSZ1 KlacSSZ1 ** EgosSSZ1 LkluSSZ1 LtheSSZ1 LwalSSZ1 Major clades: ZrouSSZ1 Schizosaccharomyces CglaSSZ1 Pezizomycotina VpolSSZ1 VpolSSZ2 Yarrowia CTG group NcasSSZ1 Saccharomycetaceae pre-WGD SuvaSSZ1 ScerSSZ1 Saccharomycetaceae post-WGD SkudSSZ1 WGD-dependent duplication SparSSZ1 Gene loss SmikSSZ1

0.2

Supplementary Figure S7

CneoLHS1 UmayLHS1 CcinLHS1 root PchrLHS1 A B KlacLHS1 SjapLHS1 ** SpomLHS1 EgosLHS1 ScryLHS1 LkluLHS1 SoctLHS1 LwalLHS1 PnodLHS1 LtheLHS1 AcapLHS1 ZrouLHS1 CimmLHS1 UreeLHS1 VpolLHS1 AnidLHS1 NcasLHS1 AfumLHS1 CglaLHS1* ** ** AoryLHS1 SuvaLHS1 AterLHS1 SkudLHS1 BfucLHS1 SmikLHS1 SsclLHS1 TreeLHS1 ScerLHS1 GmonLHS1 SparLHS1 GzeaLHS1 VpolLHS2 MgriLHS1 NcasLHS1* PansLHS1 CglaLHS1 CgloLHS1 NcraLHS1 SuvaLHS1* ChisLHS1 SkudLHS1* ** CphaLHS1 SmikLHS1* CaliLHS1 ScerLHS1* YyakLHS1 SparLHS1* CgalLHS1 YlipLHS1 ClusLHS1 ** CtenLHS1 ** DhanLHS1 MguiLHS1 SstiLHS1 SpasLHS1 CtroLHS1 CalbLHS1 CdubLHS1 ** LeloLHS1 CortLHS1 CparLHS1 EgosLHS1 KlacLHS1 LkluLHS1 LtheLHS1 LwalLHS1 Major clades: ZrouLHS1 ** VpolLHS2 Schizosaccharomyces VpolLHS1 Pezizomycotina NcasLHS1 Yarrowia SuvaLHS1 CTG group SkudLHS1 Saccharomycetaceae pre-WGD SmikLHS1 Saccharomycetaceae post-WGD ScerLHS1 SparLHS1 WGD-dependent duplication CglaLHS1 Gene loss 0.3

Supplementary Figure S8

1 0,93 ChisSSA-A 1 ChisSSA-X1 1 CphaSSA-C 0,72 1 1 CphaSSA-X4 1 CphaSSA-X5 1 0,93 CphaSSA-A 1 YyakSSA-A 0,64 1 0,94 YlipSSA-A 0,6 1 CgalSSA-A 1 CaliSSA-A 1 0,67 0,8 CaliSSA-X3p 1 YlipSSA-Ep 1 CaliSSA-C 1 YlipSSA-C 0,69 0,9 0,53 1 0,56 CgalSSA-C 1 0,74 YyakSSA-C 1 CaliSSA-G 0,78 1 CphaSSA-G 0,73 1 YyakSSA-B 0,55 1 1 YlipSSA-B 1 CgalSSA-B 1 0,75 CaliSSA-E 1 CaliSSA-X2 1 0,88 YyakSSA-E 0,98 1 0,61 YyakSSA-Fp 1 0,97 YyakSSA-D 1 YlipSSA-D 0,91 1 CgalSSA-E 0,96 1 CgalSSA-X6 0,991 CgalSSA-D 1 CgalSSA-F

0.05

Supplementary Figure S9

Amino-acid Number of sites usage SSA SSB SSC KAR SSQ SSE SSZ LHS 1 317 286 292 243 196 108 47 31 2 126 153 142 134 97 111 89 31 3 60 74 70 92 78 116 74 44 4 38 38 46 48 60 71 70 64 5 21 25 22 36 52 64 60 48 6 12 20 27 24 37 56 52 51 7 14 6 16 24 32 36 37 57 8 13 5 11 16 23 31 26 50 9 9 1 4 8 13 15 25 71 10 2 0 0 4 10 15 13 78 11 1 0 0 0 3 12 12 57 12 0 0 0 2 3 3 5 41 13 1 0 0 0 0 0 0 38 14 0 0 0 0 0 0 1 17 15 0 0 0 0 0 0 0 7 16 0 0 0 0 0 0 0 7 TOTAL 614 608 630 631 604 638 511 692

Supplementary Figure S10

A

CAI \ ENC SSA1 SSA2 SSA3 SSA4 KAR2 SSB1 SSB2 SSC1 SSC3 SSQ1 SSE1 SSE2 SSZ1 LHS1 SSA1 - 0.008 0.008 0.008 0.008 0.008 0.013 0.008 0.016 0.013 0.357 0.008 0.008 0.013 SSA2 0.013 - 0.008 0.008 0.008 0.873 0.819 0.008 0.016 0.013 0.008 0.008 0.008 0.013 SSA3 0.008 0.013 - 0.873 0.008 0.008 0.013 0.008 0.429 0.013 0.008 0.873 0.008 0.013 SSA4 0.008 0.013 0.079 - 0.008 0.008 0.013 0.008 0.746 0.013 0.008 0.873 0.008 0.013 KAR2 0.008 0.013 0.008 0.008 - 0.008 0.013 0.008 0.016 0.013 0.008 0.008 1.000 0.013 SSB1 0.008 0.819 0.008 0.008 0.008 - 0.329 0.008 0.016 0.013 0.008 0.008 0.008 0.013 SSB2 0.008 0.329 0.008 0.008 0.008 0.873 - 0.013 0.023 0.013 0.013 0.013 0.013 0.013 SSC1 0.008 0.013 0.008 0.008 0.873 0.008 0.008 - 0.016 0.013 0.329 0.008 0.079 0.013 SSC3 0.016 0.023 0.286 0.016 0.016 0.016 0.016 0.016 - 0.023 0.016 0.869 0.016 0.023 SSQ1 0.008 0.013 0.357 0.008 0.008 0.008 0.008 0.008 0.079 - 0.013 0.013 0.013 0.819 SSE1 0.079 0.013 0.008 0.008 0.079 0.008 0.008 0.819 0.016 0.008 - 0.008 0.079 0.013 SSE2 0.008 0.013 0.357 0.008 0.008 0.008 0.008 0.008 0.286 0.873 0.008 - 0.008 0.013 SSZ1 0.008 0.013 0.008 0.008 0.873 0.008 0.008 0.357 0.016 0.008 0.357 0.008 - 0.013 LHS1 0.013 0.013 0.329 0.013 0.013 0.013 0.013 0.013 0.116 0.329 0.013 0.329 0.013 -

B

dN \ dS SSA1 SSA2 SSA3 SSA4 KAR2 SSB1 SSB2 SSC1 SSC3 SSQ1 SSE1 SSE2 SSZ1 LHS1 SSA1 - 0.575 0.575 0.575 0.963 0.575 0.575 0.963 0.737 0.575 0.963 0.575 0.963 0.963 SSA2 0.541 - 0.963 0.963 0.963 0.575 0.963 0.963 0.116 0.575 0.963 0.575 0.575 0.963 SSA3 0.212 0.008 - 0.963 0.575 0.575 0.575 0.575 0.639 0.963 0.963 0.963 0.575 1.000 SSA4 0.212 0.203 0.938 - 0.575 0.212 0.575 0.575 0.737 0.963 0.963 0.963 0.575 1.000 KAR2 0.938 0.203 0.541 0.541 - 0.963 0.963 0.963 0.328 0.575 0.575 0.575 0.963 0.575 SSB1 0.575 0.575 0.008 0.053 0.203 - 0.963 0.963 0.030 0.575 0.575 0.212 0.575 0.575 SSB2 0.575 1.000 0.008 0.053 0.203 0.938 - 0.938 0.030 0.212 0.575 0.212 0.575 0.575 SSC1 0.963 0.575 0.541 0.575 1.000 0.203 0.212 - 0.116 0.575 0.963 0.575 0.963 0.575 SSC3 0.003 0.003 0.030 0.015 0.006 0.003 0.003 0.015 - 0.737 0.116 0.576 0.545 0.737 SSQ1 0.001 0.001 0.212 0.053 0.012 0.001 0.001 0.008 0.030 - 0.963 0.963 0.575 0.963 SSE1 0.212 0.056 0.963 0.963 0.541 0.008 0.008 0.575 0.015 0.053 - 0.575 0.575 0.963 SSE2 0.001 0.001 0.212 0.053 0.002 0.001 0.001 0.012 0.116 0.575 0.203 - 0.575 0.938 SSZ1 0.212 0.008 0.541 0.541 0.203 0.008 0.008 0.575 0.030 0.575 0.963 0.575 - 0.575 LHS1 0.001 0.001 0.001 0.001 0.002 0.001 0.001 0.001 0.545 0.056 0.001 0.212 0.008 - Supplementary Figure S11

TF Presence Description Abf1 SSB1 and SSB2 DNA binding protein with possible chromatin-reorganizing activity involved in transcriptional activation, gene silencing, and DNA replication and repair. Ace2 SSB1 and SSB2 Transcription factor required for septum destruction after cytokinesis. Pho2 SSB1 and SSB2 Homeobox transcription factor; regulatory targets include genes involved in phosphate metabolism. Rme1 SSB1 and SSB2 Zinc finger protein involved in control of meiosis. Spt23 SSB1 and SSB2 ER membrane protein involved in regulation of OLE1 transcription, acts with homolog Mga2p. Ste12 SSB1 and SSB2 Transcription factor that is activated by a MAP kinase signaling cascade, activates genes involved in mating or pseudohyphal/invasive growth pathways. Mac1 SSB1 Copper-sensing transcription factor involved in regulation of genes required for high affinity copper transport. Rap1 SSB1 Essential DNA-binding transcription regulator that binds at many loci. Rfx1 SSB1 Major transcriptional repressor of DNA-damage-regulated genes, recruits repressors Tup1p and Cyc8p to their promoters. Sfp1 SSB1 Regulates transcription of ribosomal protein and biogenesis genes. Yap5 SSB1 Basic leucine zipper (bZIP) iron-sensing transcription factor. Yhp1 SSB1 Homeobox transcriptional repressor. Dal80 SSB2 Negative regulator of genes in multiple nitrogen degradation pathways. Gln3 SSB2 Transcriptional activator of genes regulated by nitrogen catabolite repression (NCR), localization and activity regulated by quality of nitrogen source. Hap3/Hap5 SSB2 Subunit of the heme-activated, glucose-repressed Hap2p/3p/4p/5p CCAAT- binding complex, a transcriptional activator and global regulator of respiratory gene expression. Opi1 SSB2 Transcriptional regulator of a variety of genes; phosphorylation by protein kinase A stimulates Opi1p function in negative regulation of phospholipid biosynthetic genes.

Supplementary Figure S12

Switched Dominant residue BLOSUM62 Group A Group B Domain Position Group A Group B score 446 R P -2 SBD-α 449 N P -2 SBD-α SSC1 SSQ1 454 I V 3 SBD-α 466 A V 0 SBD-α 558 Y N -2 SBD-β 83 A S 1 NBD 85 V I 3 NBD 116 Y F 3 NBD 174 S A 1 NBD 207 E D 2 NBD 325 F L 0 NBD 330 E A -1 NBD 333 V I 3 NBD 358 I L 2 NBD 380 E D 2 NBD 381 P A -1 NBD 393 I L 2 NBD 403 A S 1 NBD SSC1 SSC3 436 T S 1 SBD-α 458 I V 3 SBD-α 469 R K 2 SBD-α 501 D N 1 SBD-α 509 R K 2 SBD-α 517 A S 1 SBD-α 524 S A 1 SBD-α 538 D E 2 SBD-α 581 E D 2 SBD-β 582 A S 1 SBD-β 585 V L 1 SBD-β 603 E D 2 SBD-β 614 T I -1 SBD-β 32 G S 0 NBD 148 Y W 2 NBD 235 E R 0 NBD 406 I W -3 SBD-α Canonical SSE 418 L V 1 SBD-α Hsp70 419 I F 0 SBD-α 435 T R -1 SBD-α 443 I A -1 SBD-α 445 V Y -1 SBD-α 10 L F 0 NBD 12 T N 0 NBD 38 T I -1 NBD 44 F Y 3 NBD 49 R E 0 NBD 50 L Y -1 NBD Canonical SSZ 72 R D -2 NBD Hsp70 170 R Q 1 NBD 203 T R -1 NBD 217 E T -1 NBD 223 G H -2 NBD 232 F L 0 NBD 374 A L -1 NBD Canonical 46 N T 0 NBD LHS Hsp70 520 A C 0 SBD-α

Supplementary Figure S13

A

N558

V466

P446 P449

180° B

N558

V466

P446 P449