
BMC Genomics BioMed Central

Research article (Open Access)

Comparison of protein coding gene contents of the fungal phyla Pezizomycotina and Saccharomycotina

Mikko Arvas*1, Teemu Kivioja2, Alex Mitchell3, Markku Saloheimo1, David Ussery4, Merja Penttila1 and Stephen Oliver5

Address: 1VTT, Tietotie 2, Espoo, P.O. Box 1500, 02044 VTT, Finland, 2Biomedicum, P.O. Box 63 (Haartmaninkatu 8), FI-00014 University of Helsinki, Finland, 3EMBL Outstation – Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK, 4Center for Biological Sequence Analysis, BioCentrum-DTU, The Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark and 5University of Manchester, Michael Smith Building, Oxford Road, Manchester, M13 9PT, UK

* Corresponding author: Mikko Arvas

Published: 17 September 2007 Received: 23 April 2007 Accepted: 17 September 2007 BMC Genomics 2007, 8:325 doi:10.1186/1471-2164-8-325 This article is available from: http://www.biomedcentral.com/1471-2164/8/325 © 2007 Arvas et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Several dozen fungi encompassing traditional model organisms, industrial production organisms and human and plant pathogens have been sequenced recently and their particular genomic features analysed in detail. In addition, comparative genomics has been used to analyse specific subgroups of fungi. Notably, analysis of the phylum Saccharomycotina has revealed major events of evolution such as the recent genome duplication and subsequent gene loss. However, little has been done to gain a comprehensive comparative view of the fungal kingdom. We have carried out a computational genome wide comparison of protein coding gene content of Saccharomycotina and Pezizomycotina, which include industrially important yeasts and filamentous fungi, respectively.
Results: Our analysis shows that, based on genome redundancy, the traditional model organisms Saccharomyces cerevisiae and Neurospora crassa are exceptional among fungi. This can be explained by the recent genome duplication in S. cerevisiae and the repeat induced point mutation mechanism in N. crassa. Interestingly, in Pezizomycotina a subset of protein families related to plant biomass degradation and secondary metabolism are the only ones showing signs of recent expansion. In addition, Pezizomycotina have a wealth of phylum specific, poorly characterised genes with a wide variety of predicted functions. These genes are well conserved in Pezizomycotina, but show no signs of recent expansion. The genes found in all fungi except Saccharomycotina are slightly better characterised and predicted to encode mainly enzymes. The genes specific to Saccharomycotina are enriched in transcription and mitochondrion related functions. Mitochondrial ribosomal proteins in particular seem to have diverged from those of Pezizomycotina. In addition, we highlight several individual gene families with interesting phylogenetic distributions.
Conclusion: Our analysis predicts that all Pezizomycotina, unlike Saccharomycotina, can potentially produce a wide variety of secondary metabolites and secreted enzymes and that the responsible gene families are likely to evolve fast. Both types of fungal products can be of commercial value, or on the other hand cause harm to humans. In addition, a great number of novel predicted and known enzymes are found in all fungi except Saccharomycotina. Therefore further studies and exploitation of fungal metabolism appear very promising.


Background

At least 36 fungal genomes are already available at public databases and several more are being sequenced. Among eukaryotes, Fungi is the kingdom with the most sequenced species. The sequenced genomes cover fungal species broadly and include industrially, medically and agriculturally important species with very diverse genomes. For these reasons fungal genomes form one of the most attractive eukaryotic datasets for comparative genomics research and method development. Moreover, the possibilities for in silico exploration of the diversity of biological functions in these organisms have been revolutionized.

The fungal kingdom is divided into Chytridiomycota, Zygomycota, Glomeromycota, Ascomycota, Basidiomycota and Deuteromycota. Most sequenced fungi belong to Ascomycota, which is in turn divided into Pezizomycotina, Saccharomycotina and Archeascomycota. Pezizomycotina, commonly referred to as filamentous fungi, grow mostly in a filamentous form and degrade biomass to free sugars to be used as source of carbon and energy. In contrast, Saccharomycotina, commonly referred to as yeasts, grow mostly in a unicellular form. They have generally no capability to degrade biomass and accordingly live on free sugars.

Comparative genomics has been applied extensively within the phylum Saccharomycotina (for review see [1]), where a whole genome duplication followed by massive gene loss and specialization has been shown to have occurred [2]. This has been further confirmed by a broader analysis of differential genome evolution in Saccharomycotina species [3] that also demonstrated the usefulness of multi-species protein clustering.

Studies of Pezizomycotina have concentrated on descriptions of individual genomes or comparisons inside a class such as the Eurotiales genomes of Aspergillus fumigatus [4], A. oryzae [5] and A. nidulans [6]. A. fumigatus and A. oryzae had been previously considered as asexual organisms, but comparative analysis showed that their genomes have the same potential for sexual reproduction as A. nidulans. Sequencing of the Neurospora crassa genome [7] revealed the full effect of the mechanism called Repeat Induced Point mutations (RIP). RIP specifically mutates duplications that are longer than 400 bp and more than 80% identical, thus effectively preventing gene duplications and keeping the genome extremely non-redundant (for review see [8]).

Extensive comparative genomic databases for some groups of eukaryotic species are available, such as Ensembl for mammals [9]. Several comparative databases exist also for fungi, such as MIPS [10], CBS Genome Atlases [11] and e-Fungi [12], but none offers a level of detail comparable to Ensembl. However, high quality software to set up such a database is freely available from the Generic Genome Browser (GBrowse) [13] and Ensembl projects. Most importantly, these support interfaces to BioPerl [14], which currently offers the most extensive sequence analysis programming library.

Our goal is to explain differences in phenotypes of fungi through differences in their genotypes and to deduce evolutionary processes that have shaped fungi. To this end we have used TRIBE-MCL [15] protein clustering and InterProScan [16] protein analysis to compare genome open reading frame (ORF) contents of Pezizomycotina and Saccharomycotina species in a large data set of 33 fungal genomes. TRIBE-MCL divides the genomes comprehensively into protein clusters, while InterProScan detects known protein domains. We have also set up a GBrowse based comparative genomics web interface that enables easy browsing of our data. We detected protein clusters specific to Pezizomycotina, to fungi other than Saccharomycotina, and to Saccharomycotina, and characterised them with Funcat gene classifications [17-19] and Protfun protein function predictions [20].

We found that in Pezizomycotina protein clusters related to plant biomass degradation and secondary metabolism are the only ones with multiple proteins in each species. Pezizomycotina have also an interesting expansion of a transcription factor protein family and a wealth of enzymes and subsequent metabolic capabilities not found in Saccharomycotina. In contrast, Saccharomycotina specific protein clusters are enriched in transcription and mitochondrion related functions.

Results

Overview of clustering results

To compare the ORF content of Pezizomycotina and Saccharomycotina we selected a set of 33 species shown in Table 1. 14 of them are Pezizomycotina and 13 Saccharomycotina. In addition one Archeascomycota, four Basidiomycota and one Zygomycota were included as reference species outside the two phyla.

Table 1: Genome data. Phylum, subphylum, abbreviation used in figures, count of ORFs in the genome, data source and internet address, and whether Funcat annotation existed and whether InterPro or Protfun analysis was carried out, shown for each species.

Phylum | Subphylum | Name | Abbreviation | Count of ORFs | Source | WWW
Ascomycota | Pezizomycotina | Aspergillus fumigatus | Afum | 9926 | TIGR, USA | http://www.tigr.org/tdb/e2k1/afu1/
Ascomycota | Pezizomycotina | Aspergillus nidulans | Anid | 9541 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Aspergillus niger | Anig | 11200 | Joint Genome Institute, USA | http://genome.jgi-psf.org/
Ascomycota | Pezizomycotina | Aspergillus oryzae | Aory | 12062 | NITE, Japan | http://www.bio.nite.go.jp/dogan/Top
Ascomycota | Pezizomycotina | Botrytis cinerea | Bcin | 16446 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Chaetomium globosum | Cglo | 11124 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Coccidioides immitis | Cimm | 10457 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Fusarium graminearum | Fgra | 11640 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Magnaporthe grisea | Mgri | 11109 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Nectria haematococca | Nhae | 16237 | Joint Genome Institute, USA | http://genome.jgi-psf.org/
Ascomycota | Pezizomycotina | Neurospora crassa | Ncra | 10082 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Sclerotinia sclerotiorum | Sscl | 14522 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Stagonospora nodorum | Snod | 16597 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Pezizomycotina | Trichoderma reesei | Tree | 9997 | Joint Genome Institute, USA | http://genome.jgi-psf.org/
Ascomycota | Saccharomycotina | Ashbya gossypii | Agos | 4726 | Ashbya Genome Database, Switzerland | http://agd.unibas.ch/
Ascomycota | Saccharomycotina | Candida albicans | Calb | 6165 | Stanford University, USA | http://www-sequence.stanford.edu/group/candida/
Ascomycota | Saccharomycotina | Candida glabrata | Cgla | 5181 | Genolevures Consortium, France | http://cbi.labri.fr/Genolevures/
Ascomycota | Saccharomycotina | Candida guilliermondii | Cgui | 5920 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Saccharomycotina | Candida lusitaniae | Clus | 5941 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Ascomycota | Saccharomycotina | Debaryomyces hansenii | Dhan | 6318 | Genolevures Consortium, France | http://cbi.labri.fr/Genolevures/
Ascomycota | Saccharomycotina | Kluyveromyces lactis | Klac | 5327 | Genolevures Consortium, France | http://cbi.labri.fr/Genolevures/
Ascomycota | Saccharomycotina | Pichia pastoris | Ppas | 5999 | Integrated Genomics, USA | http://ergo.integratedgenomics.com/ERGO/
Ascomycota | Saccharomycotina | Pichia stipitis | Psti | 5839 | Joint Genome Institute, USA | http://genome.jgi-psf.org/
Ascomycota | Saccharomycotina | Saccharomyces castellii | Scas | 4677 | Saccharomyces Genome Database, USA | http://www.yeastgenome.org/
Ascomycota | Saccharomycotina | Saccharomyces cerevisiae | Scer | 5888 | Saccharomyces Genome Database, USA | http://www.yeastgenome.org/
Ascomycota | Saccharomycotina | Saccharomyces kluyveri | Sklu | 2968 | Saccharomyces Genome Database, USA | http://www.yeastgenome.org/
Ascomycota | Saccharomycotina | Yarrowia lipolytica | Ylip | 6521 | Genolevures Consortium, France | http://cbi.labri.fr/Genolevures/
Ascomycota | Archeascomycota | Schizosaccharomyces pombe | Spom | 4984 | Sanger Institute, UK | http://www.sanger.ac.uk/Projects/S_pombe/
Basidiomycota | Hymenomycetes | Coprinus cinereus | Ccin | 13544 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Basidiomycota | - | Cryptococcus neoformans | Cneo | 7302 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Basidiomycota | Hymenomycetes | Phanerochaete chrysosporium | Pchr | 10048 | Joint Genome Institute, USA | http://genome.jgi-psf.org/
Basidiomycota | - | Ustilago maydis | Umay | 6522 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/
Zygomycota | Zygomycetes | Rhizopus oryzae | Rory | 17467 | Broad Institute, USA | http://www.broad.mit.edu/annotation/fungi/fgi/

[Funcat, InterPro and Protfun columns: per-species YES entries not recoverable]

To compare the ORF content systematically, we first divided all the ORFs into groups related by sequence similarity. For this we used the protein clustering software TRIBE-MCL [15]. It uses a graph clustering algorithm to cluster proteins based on the results of an all-against-all BLASTP [21]. The results have been shown to correlate well with manual family classifications [15]. The sizes of TRIBE-MCL clusters are influenced by the parameter inflation value (r), which is limited to values between 1.1 and 4.5. Consequently, it also affects the sensitivity, specificity, total count of clusters and count of orphan clusters (i.e. clusters with only one member), which are shown in Figure 1 for different inflation values. We defined sensitivity as the average percentage of how many of the proteins with the same InterPro domain structure (see below) are contained in a single cluster. In parallel, specificity was defined as the average percentage of how many of the proteins in a cluster have the same InterPro domain structure [22]. We selected for further analysis a clustering where r = 3.1, as a good compromise between sensitivity and specificity.
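The sensitivity and specificity definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that "contained in a single cluster" means the largest share of a domain-structure group that falls into any one cluster, and that a domain structure is simply the set of InterPro entries of a protein. The function names, the input format and the toy proteins are hypothetical.

```python
from collections import Counter, defaultdict


def mean_top_share(groups):
    """Average, over all groups, of the percentage of members sharing the most common label."""
    shares = []
    for labels in groups.values():
        top_count = Counter(labels).most_common(1)[0][1]
        shares.append(100.0 * top_count / len(labels))
    return sum(shares) / len(shares)


def sensitivity_specificity(cluster_of, domains_of):
    """Return (sensitivity, specificity) in percent for one clustering.

    cluster_of: protein id -> cluster id (hypothetical input format)
    domains_of: protein id -> frozenset of InterPro entry ids
    Proteins without a cluster or without any InterPro entry are skipped (assumption).
    """
    clusters_by_domains = defaultdict(list)  # domain structure -> clusters of its proteins
    domains_by_cluster = defaultdict(list)   # cluster -> domain structures of its proteins
    for protein, domains in domains_of.items():
        if protein in cluster_of and domains:
            clusters_by_domains[domains].append(cluster_of[protein])
            domains_by_cluster[cluster_of[protein]].append(domains)
    return mean_top_share(clusters_by_domains), mean_top_share(domains_by_cluster)


# Toy data (invented): five proteins, two clusters, two domain structures.
zn = frozenset({"IPR001138"})
p450 = frozenset({"IPR001128"})
cluster_of = {"p1": "c1", "p2": "c1", "p3": "c2", "p4": "c2", "p5": "c2"}
domains_of = {"p1": zn, "p2": zn, "p3": zn, "p4": p450, "p5": p450}
sens, spec = sensitivity_specificity(cluster_of, domains_of)
print(round(sens, 1), round(spec, 1))
```

Repeating such a calculation for clusterings made with different inflation values would give curves of the kind summarised in Figure 1, from which a compromise value such as r = 3.1 can be read off.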

We used the program InterProScan [23] to detect if proteins from a subset of genomes (Table 1) had a sequence feature corresponding to an InterPro entry. InterPro is a database of known protein sequence features such as domains, families and active sites, i.e. InterPro entries. In contrast to TRIBE-MCL clustering, InterProScan is not comprehensive as it is limited to known entries accepted to the InterPro database and not uniform because an entry might represent a completely different evolutionary or functional level such as superfamily, subfamily or protein sequence feature such as domain or active site. Thus, TRIBE-MCL clustering allows us to compare genomes in a comprehensive and uniform way, while InterProScan analysis allows a comparison of distributions of specific known sequence features.

Figure 2: Unrecognised by InterPro or clustering. Percentage of ORFs in orphan clusters (y-axis) versus the percentage of ORFs unrecognised by InterPro (x-axis) for species which were analysed with InterProScan. Species are coloured by phyla. Names of the species, plotted as abbreviations beside the data points, are explained in Table 1.

In order to elucidate how well our analysis covers the genomes studied we counted, for the species analysed with InterProScan, the percentage of ORFs in orphan clusters and the percentage of ORFs without any InterPro entries (Figure 2). The Ascomycota genomes with particularly high percentage of orphan clusters in relation to their genome size are N. crassa and Yarrowia lipolytica. N. crassa has also a particularly high percentage of InterProScan unrecognised ORFs. In contrast, the data set includes several closely related members of the Eurotiales: A. nidulans, A. fumigatus, A. oryzae and A. niger, which have a particularly low percentage of orphan clusters and InterProScan unrecognised ORFs.
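The two percentages compared in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch only: the input dictionaries, the function names and the toy ORF identifiers are hypothetical, and an orphan cluster is taken to be a cluster with exactly one member ORF, as defined earlier.

```python
from collections import Counter


def percent_by_species(flagged, species_of):
    """Percentage of each species' ORFs for which flagged[orf] is true."""
    total, hits = Counter(), Counter()
    for orf, species in species_of.items():
        total[species] += 1
        hits[species] += bool(flagged.get(orf, False))
    return {s: 100.0 * hits[s] / total[s] for s in total}


def orphan_and_unrecognised(cluster_of, entries_of, species_of):
    """Return two dicts: % of ORFs in orphan clusters and % of ORFs without InterPro entries."""
    cluster_size = Counter(cluster_of.values())
    in_orphan = {orf: cluster_size[c] == 1 for orf, c in cluster_of.items()}
    no_entry = {orf: not entries_of.get(orf) for orf in species_of}
    return (percent_by_species(in_orphan, species_of),
            percent_by_species(no_entry, species_of))


# Toy data (invented): four ORFs from one species and two from another.
species_of = {"n1": "Ncra", "n2": "Ncra", "n3": "Ncra", "n4": "Ncra",
              "a1": "Anid", "a2": "Anid"}
cluster_of = {"n1": "c1", "a1": "c1", "n2": "c2", "n3": "c3", "n4": "c4", "a2": "c4"}
entries_of = {"n1": {"IPR001138"}, "a1": {"IPR001138"}, "a2": {"IPR001128"}}
orphans, unrecognised = orphan_and_unrecognised(cluster_of, entries_of, species_of)
print(orphans, unrecognised)
```

Plotting the first dictionary against the second, one point per species, would give a scatter plot of the kind described for Figure 2.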

Figure 1: Selection of the protein clustering parameter "inflation value" (r). Average sensitivity and specificity percentages (left y-axis) and total count and count of orphan clusters i.e. clusters with only a single member ORF (right y-axis) for clusterings made with different inflation values (x-axis).

To find out how redundant the ORF contents of the genomes studied are, we counted genomic ORF redundancy in each genome (Figure 3). This was defined as the percentage of ORFs in a genome found in clusters with more than one ORF from the same species [3]. Particularly high values of redundancy in relation to their genome sizes are found in the Saccharomycotina species that have undergone a recent genome duplication [2,3] (S. cerevisiae and S. castellii), while Candida glabrata, which also belongs to the genome duplication lineage, has an average value. The N. crassa genome has been shown to have an extremely low level of paralogy due to the RIP mechanism [7], and in accordance it has a particularly low value of redundancy in relation to its genome size. Due to the incomplete state of the S. kluyveri genome sequence [24] its redundancy is likely to be underestimated.

Figure 3: Genomic ORF redundancy. Percentage of ORFs in clusters containing more than one ORF from the species in question i.e. genomic ORF redundancy (y-axis) versus the size of the genome in ORFs (x-axis) for each species. See Figure 2 for further details.

Characterization of clusters

In order to divide the clusters obtained in our analysis into phylogenetically interesting groups, clusters with at least 10 member ORFs were ordered by hierarchical clustering based on how many proteins of each species were included in each cluster (Figure 4 and 5 or an excel version [see additional file 1]). We selected four regions with interesting ORF contents for further analysis, A: "Pezizomycotina abundant", B: "Pezizomycotina specific", C: "Saccharomycotina absent" and D: "Saccharomycotina unique". To characterise the content of the regions, they were analysed for enrichment of Funcat categories, Protfun categories and InterPro entries (Tables 2, 3, 4 and 5). Funcat is a hierarchical protein function classification system based on extensive manual annotation [17]. Protfun integrates results from several other protein feature prediction programmes to predict characteristics and functions of a protein based on its sequence [20]. However, Protfun does not use any sequence similarity information, which is essential in attempting to characterise novel protein families unique to fungi. Each cluster's member ORFs were checked for Funcat, InterPro and Protfun assignments and when at least 75% (ProtFun and InterPro) or all (Funcat) ORFs of those for which data was available had the same assignment, it was given to the cluster. If assignments were more heterogeneous, the cluster was assigned as uncategorised for the particular type of assignment. In addition, we checked which clusters contained an S. cerevisiae ORF found in the metabolic model iMH805 [25]. The model includes mostly enzyme genes involved in the known primary metabolism of S. cerevisiae and also many of their transcriptional regulators. Of the 805 ORFs in iMH805 [25], 737 were successfully mapped to 498 clusters shown in Figure 4.

Overall, 57% of the clusters with at least 10 members were assigned with a Funcat category, but the coverage of Funcat differs greatly between the regions (Figure 4). The respective percentages for the selected regions are, A: "Pezizomycotina abundant" 61%, B: "Pezizomycotina specific" 20%, C: "Saccharomycotina absent" 43% and D: "Saccharomycotina unique" 81%. Also, over 95% of the clusters having members from all species, shown at the upper part of the main heatmap of Figure 4, have a Funcat category. The ORFs of the S. cerevisiae metabolic model iMH805 [25] are distributed even more strikingly. Most clusters having these ORFs have members from all species. In addition 7% of the clusters in D: "Saccharomycotina unique" have an iMH805 [25] ORF (see below).

198 clusters were assigned to the region A: "Pezizomycotina abundant". They have usually more ORFs in Pezizomycotina than in other phyla and none in Saccharomycotina (Figure 4). The categories of clusters significantly enriched in this region compared to other clusters are shown in Table 2. Clusters of the region contain mainly predicted enzymes. 20% of the clusters were predicted to have a secretion pathway signal sequence and 36% of the clusters had heterogeneous signal sequence predictions and were assigned as "TargetP: Uncategorised" (Table 4). Thus, it is likely that more than 20% of the clusters are actually secreted. The enriched Funcat categories

ProteinFigure clustering4 overview Protein clustering overview. See Figure 5 for legend.

Page 6 of 23 (page number not for citation purposes) BMC Genomics 2007, 8:325 http://www.biomedcentral.com/1471-2164/8/325

LegendFigure for5 figure 4 Legend for figure 4. A heatmap of clusters with at least ten ORFs. In the main heatmap colour intensity of a cell shows the number of ORFs shown by clusters (rows) and by species (columns). Both rows and columns are ordered by hierarchical clus- tering to group similar rows or columns together. The dendrogram from hierarchical clustering is shown for columns and the phylum of species is indicated by a column colour bar between the heatmap and the dendrogram. Under the heatmap each spe- cies is specified by an abbreviation explained in Table 1. On the left side of the main heatmap a black and white side heatmap shows the percentage of ORFs in a cluster that have an InterPro entry of all cluster's ORFs analysed with InterProScan ("wIPR"), cluster's stability and cluster's Saccharomycotina to Pezizomycotina ratio in a clustering where inflation value (r) was 1.1 ("S/P r 1.1"). Stability reflects the ratio of cluster size between a clustering where r = 3.1 to that where r = 1.1. As Figure 1 shows, when r is set to its minimum value (1.1), TRIBE-MCL clustering produces a minimum amount of clusters and orphan clusters. In consequence the clusters are on average larger when r = 1.1. The ratio between cluster size r = 3.1 and r = 1.1 is shown as a percentage. "S/P r 1.1" reflects the ratio of count of Saccharomycotina ORFs to Pezizomycotina ORFs in a cluster when r = 1.1. By comparing "S/P r 1.1" to the species distribution of a cluster shown on the main heatmap one can see if a clus- ter retains the Saccharomycotina to Pezizomycotina ratio when r = 1.1. On the right side of the main heatmap, a side heatmap shows various functional classifications for the clusters. Whether or not the cluster has a Funcat classification ("Funcat") or has an ORF found in S. cerevisiae metabolic model iMH805 is shown ("iMH805"). 
Whether the proteins in the cluster are predicted by Protfun to have a signal sequence directing them into either mitochondrion or secretion pathway ("TargetP"), have trans- membrane domains ("TMHMM") or are predicted to be enzymes is shown ("Enz."). Clusters belonging to regions A: "Pezizo- mycotina abundant", B: "Pezizomycotina specific", C: "Saccharomycotina absent" and D: "Saccharomycotina unique" are specified by a vertical bar between main and right heatmap. of the clusters are mainly related to plant biomass degra- membrane regions and 16% of them were predicted not dation and secondary metabolism (Table 2). Enriched to be enzymes. 11% of the clusters were predicted to have InterPro entries of the clusters are also mainly in catego- a secretion pathway signal sequence. The cellular role pre- ries connected to secondary metabolism and plant bio- dictions "Cell envelope", "Regulatory functions", "Central mass degradation. However, proteins with domains intermediary metabolism" "Purines and pyrimidines" related to secondary metabolism might also have other show significant enrichment. The Funcat classification functions. For example, the methylation carried out by category "response to environmental stimuli" is the only "IPR01077: O-methyltransferases family 2" domains can one significantly enriched (Table 3). Enriched InterPro function in transcription regulation or aflatoxin biosyn- entries IPR001138 and IPR007219 (Table 5), are tran- thesis [26]. scription factor domains that commonly occur together (see chapter "PCA of InterPro entry counts" below). Oth- The 1130 clusters of the region B: "Pezizomycotina spe- erwise the enriched entries are functionally diverse. cific" have usually just one ORF in each Pezizomycotina species and none in the other phyla (Figure 4). Most pro- The 1010 clusters of the region C: "Saccharomycotina teins in these clusters were predicted not to have trans- absent" have usually just one ORF in each Pezizomy-

Page 7 of 23 (page number not for citation purposes) BMC Genomics 2007, 8:325 http://www.biomedcentral.com/1471-2164/8/325

Table 2: Enriched cluster categories in region A: "Pezizomycotina abundant"

Software or Database Category Count of clusters % of clusters % of all P-value Author assignment Figure database identifier in region in region clusters 13 with category

ProtFun - Enzyme/nonenzyme: Enzyme 180 91.0 5 1.60E-11 - - ProtFun - Enzyme class: Uncategorised 147 74.0 6 1.60E-17 - - ProtFun - Cellular role: Uncategorised 143 72.0 4 2.10E-02 - - Funcat 01 Metabolism 81 41.0 6 8.00E-06 Metabolism - ProtFun - TargetP: Uncategorised 71 36.0 5 1.40E-04 - - Funcat 01.05 C-compound and 41 21.0 10 1.30E-07 Plant biomass degradation - metabolism ProtFun - TargetP: Secretion 39 20.0 9 1.30E-07 - - ProtFun - Cellular role: Cell envelope 30 15.0 11 1.20E-08 - - Funcat 01.05.01 C-compound and carbohydrate 29 15.0 10 1.30E-05 Plant biomass degradation - utilization Funcat 32 Cell rescue, defence and virulence 24 12.0 7 6.50E-03 Virulence/Defence - Funcat 01.05.01.01 C-compound, carbohydrate 22 11.0 22 4.40E-11 Plant biomass degradation - catabolism ProtFun - TMHMM: Uncategorised 20 10.0 7 3.70E-03 - - Funcat 01.20 Secondary metabolism 19 10.0 22 1.40E-09 2ary metabolism - Funcat 32.05 Disease, virulence and defence 14 7.0 23 1.00E-07 Virulence/Defence - Funcat 01.05.01.01.02 Polysaccharide degradation 13 7.0 27 3.50E-08 Plant biomass degradation - Funcat 32.05.05 Virulence, disease factors 9 5.0 45 3.20E-08 Virulence/Defence - Interpro IPR001128 Cytochrome P450 8 4.0 42 1.30E-07 2ary metabolism YES Interpro IPR002347 Glucose/ribitol dehydrogenase 8 4.0 21 4.60E-05 2ary metabolism YES Funcat 32.07 Detoxification 7 4.0 10 2.20E-02 2ary metabolism - Interpro IPR002198 Short-chain dehydrogenase/ 7 3.5 21 1.60E-04 2ary metabolism YES reductase SDR Interpro IPR006094 FAD linked oxidase, N-terminal 5 2.5 56 6.30E-06 2ary metabolism YES Interpro IPR003042 Aromatic-ring hydroxylase 5 2.5 36 8.60E-05 2ary metabolism YES Funcat 01.25 Extracellular metabolism 4 2.0 36 7.30E-04 Plant biomass degradation - Funcat 02.16 4 2.0 20 8.00E-03 Metabolism - Funcat 01.05.01.01.01 Sugar, glucoside, polyol and 4 2.0 19 9.50E-03 Plant biomass degradation - carboxylate catabolism Funcat 30.05 Transmembrane signal transduction 4 2.0 12 4.50E-02 Signalling - Interpro 
IPR011050 Pectin lyase fold/virulence factor 4 2.0 27 1.60E-03 Plant biomass degradation NO Interpro IPR000873 AMP-dependent synthetase and 4 2.0 25 2.10E-03 2ary metabolism YES ligase Interpro IPR008985 Concanavalin A-like lectin/glucanase 4 2.0 24 2.60E-03 Plant biomass degradation NO Funcat 32.10.07 Degradation of foreign (exogenous) 3 2.0 50 1.20E-03 Plant biomass degradation - polysaccharides Funcat 01.01.09.05 Metabolism of tyrosine 3 2.0 43 2.10E-03 2ary metabolism - Funcat 01.05.01.01.09 Aerobic aromate catabolism 3 2.0 43 2.10E-03 Metabolism - Funcat 32.10 Degradation of foreign (exogenous) 3 2.0 38 3.30E-03 Plant biomass degradation - compounds Funcat 02.25 Oxidation of fatty acids 3 2.0 27 8.80E-03 2ary metabolism - Interpro IPR010730 Heterokaryon incompatibility 3 1.5 75 1.80E-04 Mating YES Interpro IPR003661 Histidine kinase A, N-terminal 3 1.5 43 1.40E-03 Signalling NO Interpro IPR006163 Phosphopantetheine-binding 3 1.5 43 1.40E-03 2ary metabolism YES Interpro IPR005467 Histidine kinase 3 1.5 38 2.20E-03 Signalling NO Interpro IPR001789 Response regulator receiver 3 1.5 33 3.20E-03 Signalling NO Funcat 30.05.01.10 Two-component signal transduction 2 1.0 67 4.90E-03 Signalling - system Funcat 32.05.05.01 Toxins 2 1.0 50 9.50E-03 2ary metabolism - Funcat 01.01.09.05.02 Degradation of tyrosine 2 1.0 40 1.50E-02 2ary metabolism - Funcat 01.20.37 Biosynthesis of peptide derived 2 1.0 40 1.50E-02 2ary metabolism - compounds Funcat 34.11.11 Rhythm (e.g. 
circadian, ultradian) 2 1.0 40 1.50E-02 Other - Funcat 01.01.09.04 Metabolism of phenylalanine 2 1.0 33 2.30E-02 2ary metabolism - Funcat 01.25.01 Extracellular polysaccharide 2 1.0 33 2.30E-02 Plant biomass degradation - degradation Funcat 16.05 Polysaccharide binding 2 1.0 33 2.30E-02 Plant biomass degradation - Funcat 16.21.05 FAD/FMN binding 2 1.0 33 2.30E-02 2ary metabolism - Funcat 01.01.11.03.02 Degradation of valine 2 1.0 29 3.10E-02 2ary metabolism - Funcat 30.05.01 Receptor enzyme mediated 2 1.0 25 4.00E-02 Signalling - signalling Funcat 01.20.35.01 Biosynthesis of phenylpropanoids 2 1.0 22 5.00E-02 2ary metabolism - Interpro IPR000675 Cutinase 2 1.0 100 1.30E-03 Plant biomass degradation NO Interpro IPR001077 O-methyltransferase, family 2 2 1.0 100 1.30E-03 Regulation NO Interpro IPR002227 Tyrosinase 2 1.0 100 1.30E-03 2ary metabolism NO Interpro IPR008922 Di-copper centre-containing 2 1.0 100 1.30E-03 Dubious NO Interpro IPR012951 Berberine/berberine-like 2 1.0 100 1.30E-03 2ary metabolism NO

Page 8 of 23 (page number not for citation purposes) BMC Genomics 2007, 8:325 http://www.biomedcentral.com/1471-2164/8/325

Table 2: Enriched cluster categories in region A: "Pezizomycotina abundant" (Continued)

Interpro IPR000743 Glycoside hydrolase, family 28 2 1.0 67 3.70E-03 Plant biomass degradation NO Funcat 01.05.01.01.11 Anaerobic aromate catabolism 1 1.0 100 4.10E-02 Metabolism - Funcat 01.20.01.09 Biosynthesis of aminoglycoside 1 1.0 100 4.10E-02 2ary metabolism - antibiotics Funcat 01.20.23 Biosynthesis of secondary products 1 1.0 100 4.10E-02 2ary metabolism - derived from L-methionine Funcat 01.20.36 Non-ribosomal peptide synthesis 1 1.0 100 4.10E-02 2ary metabolism - Funcat 01.20.37.05 Biosynthesis of beta-lactams 1 1.0 100 4.10E-02 2ary metabolism - Funcat 02.16.03.03 Heterofermentative pathway and 1 1.0 100 4.10E-02 Metabolism - fermentation of other saccharides Funcat 32.05.05.03 Bacteriocins 1 1.0 100 4.10E-02 2ary metabolism - Funcat 32.07.09 Detoxification by degradation 1 1.0 100 4.10E-02 2ary metabolism - Funcat 34.11.01.01 Light environment response 1 1.0 100 4.10E-02 Signalling - Funcat 36.20.18 Plant hormonal regulation 1 1.0 100 4.10E-02 Signalling -

Software or database that the categories were derived from, category database identifier, category name, count of clusters with the category, percentage of all clusters in this region, percentage of all clusters with the category and significance of enrichment are shown for categories which are enriched in the region (Figure 4) with a p < 0.01. Column "Author assignment" shows an assignment to general themes, based on the respective database, that summarise the InterPro and Funcat categories. While the other assignments are directly based on the databases, "Secondary metabolism" covers entries which are known to participate also in secondary metabolism ([61, 62], Funcat and InterPro). For InterPro entries, the column "Figure 9" specifies whether the entry is found in Figure 9. InterPro entries assigned to "Dubious" are entries that InterPro itself considers unreliable.

cotina species and none in Saccharomycotina (Figure 4). They were predicted mainly not to have transmembrane regions and 74% were predicted to be enzymes. No cellular role could be predicted for 72% of the clusters (Table 4). Several Funcat amino acid degradation categories are enriched in these clusters (Table 4). However, all these cases correspond to only five clusters that have different overlapping categories. The Enzyme Commission (EC) numbers found in Funcat for the proteins in these clusters all map to the KEGG [27] pathway "Valine, leucine and isoleucine degradation" (Figure 6). According to SGD and KEGG, S. cerevisiae has only a few of the enzymes in this pathway, while M. grisea has most of them. InterPro entries enriched in the clusters of the region are mainly related to metabolism (Table 4). The two Acyl-CoA dehydrogenase entries are in the same clusters and correspond to proteins with EC: 1.3.99.3, shown in Figure 6.

The 466 clusters of the region D: "Saccharomycotina unique" have usually just one ORF in each Saccharomycotina species and none in the other phyla (Figure 4). These clusters were mainly predicted not to have transmembrane regions, and 66% were predicted not to have any signal sequence. 20% of the clusters were predicted not to contain enzymes, while 15% were predicted to be ligases. Several predicted cellular roles were also enriched (Table 4). Two of the four InterPro entries enriched in the clusters of this region are related to nucleotide binding (Table 5).

The Funcat categories enriched are either related to mitochondrion, transcription or translation (Table 3). Categories "Biogenesis of cellular components: mitochondrion" and "Ribosomal proteins" overlap. Of the 34 clusters in "Biogenesis of cellular components: mitochondrion", 28 contain an S. cerevisiae ORF described by Saccharomyces Genome Database (SGD) [28] as mitochondrial ribosomal component of the small or large subunit. Of these ORFs, 26 are found in clusters with the category "Ribosomal proteins".

In contrast, five clusters with the category "Subcellular localisation: mitochondrion" contain S. cerevisiae proteins not related to ribosomes that have been localised to mitochondrion in a genome wide analysis [29]. For the other clusters with "Subcellular localisation: mitochondrion" Funcat proposes, based on sequence similarity, that they could be localised to mitochondrion. The function of the ORFs found in the clusters with this enriched category is not known. However, YJL082w (IML2) found in the clusters has been shown to have a physical interaction with YLR014c (PPR1) [30], a positive transcriptional regulator of the pyrimidine biosynthesis pathway [31]. Also, according to InterProScan YHF198c (FMP22) found in the clusters is a member of the "Purine/pyrimidine phosphoribosyl transferase" family (IPR002375), like the YML106W (URA5) and YMR271C (URA10) genes involved in the actual pyrimidine biosynthesis pathway. This evidence suggests that Saccharomycotina could have a novel mitochondrial pyrimidine related pathway not found in Pezizomycotina.

In addition to enriched cluster categories, the region D: "Saccharomycotina unique" is characterised by 34 clusters, which have a gene found in the S. cerevisiae metabolic model iMH805 [25]. Interestingly, instead of actual metabolic enzymes, 13 of these are transcription factors. Besides "01 Metabolism", the Funcat category most enriched among these clusters in comparison to all other clusters is "01.07.01 biosynthesis of vitamins, cofactors and prosthetic groups" (p < 2,2*10e-6, 7 clusters).

PCA of InterPro entry counts
The analysis of the TRIBE-MCL clusters presented above dealt only with clusters of at least ten member ORFs. However, the clustering produced 77.278 clusters with


Table 3: Enriched cluster categories in region B: "Pezizomycotina specific"

| Software or database | Database identifier | Category | Count of clusters in region | % of clusters in region | % of all clusters with category | P-value | Author assignment | Figure 9 |
|---|---|---|---|---|---|---|---|---|
| ProtFun | - | TMHMM: No transmembrane | 913 | 81.0 | 21 | 5.60E-03 | - | - |
| ProtFun | - | Enzyme/nonenzyme: Nonenzyme | 175 | 16.0 | 25 | 1.30E-03 | - | - |
| ProtFun | - | TargetP: Secretion | 128 | 11.0 | 28 | 1.90E-05 | - | - |
| ProtFun | - | Cellular role: Cell envelope | 93 | 8.0 | 35 | 9.50E-09 | - | - |
| ProtFun | - | Cellular role: Central intermediary metabolism | 84 | 7.0 | 26 | 1.20E-02 | - | - |
| Interpro | IPR001138 | Fungal transcriptional regulatory protein, N-terminal | 41 | 3.6 | 51 | 7.10E-10 | Transcription regulation | YES |
| ProtFun | - | Cellular role: Regulatory functions | 37 | 3.0 | 39 | 1.80E-05 | - | - |
| Interpro | IPR007219 | Fungal specific transcription factor | 25 | 2.2 | 37 | 1.30E-03 | Transcription regulation | YES |
| ProtFun | - | Cellular role: Purines and pyrimidines | 27 | 2.0 | 28 | 4.30E-02 | - | - |
| Interpro | IPR001810 | Cyclin-like F-box | 13 | 1.2 | 62 | 4.00E-05 | Macromolecule interaction | YES |
| Interpro | IPR000637 | HMG-I and HMG-Y, DNA-binding | 7 | 0.6 | 88 | 9.70E-05 | Chromatin | NO |
| Interpro | IPR001087 | Lipolytic enzyme, G-D-S-L | 6 | 0.5 | 75 | 1.40E-03 | Metabolism | NO |
| Interpro | IPR006710 | Glycoside hydrolase, family 43 | 6 | 0.5 | 75 | 1.40E-03 | Plant biomass degradation | NO |
| Interpro | IPR005302 | MOCO sulphurase C-terminal | 3 | 0.3 | 100 | 8.50E-03 | Metabolism | NO |
| Interpro | IPR006209 | EGF-like | 3 | 0.3 | 100 | 8.50E-03 | Extracellular | NO |
| Funcat | 36.20.35 | Response to environmental stimuli | 3 | 0.0 | 75 | 4.20E-02 | Signalling | - |

See Table 2 for legend

only a single member and 11.727 of these had InterPro entries (Figure 2). In order to find differences in counts of ORFs with InterPro entries or InterPro entry structures between the species, independently of TRIBE-MCL clusters, we carried out a Principal Component Analysis (PCA). InterPro entry structures were constructed for each ORF by reducing overlapping and redundant InterPro entries into a single most specific one and concatenating these in the order they appear in the protein sequence. This enabled us to differentiate InterPro entries appearing alone in an ORF from those that commonly appear together in an ORF.

PCA finds principal components (PC), i.e. major sources of variation between the species and positions of the original samples and the loadings of individual InterPro entries or InterPro entry structures on the PCs. The PCs are named so that the PC explaining the highest amount of


Table 4: Enriched cluster categories in region C: "Saccharomycotina absent"

| Software or database | Database identifier | Category | Count of clusters in region | % of clusters in region | % of all clusters with category | P-value | Author assignment | Figure 9 |
|---|---|---|---|---|---|---|---|---|
| ProtFun | - | TMHMM: No transmembrane | 825 | 82.0 | 19 | 8.60E-04 | - | - |
| ProtFun | - | Enzyme/nonenzyme: Enzyme | 747 | 74.0 | 19 | 4.50E-02 | - | - |
| ProtFun | - | Cellular role: Uncategorised | 722 | 72.0 | 20 | 2.20E-06 | - | - |
| ProtFun | - | TargetP: Mitochondrion | 81 | 8.0 | 22 | 4.80E-02 | - | - |
| ProtFun | - | Cellular role: Replication and transcription | 9 | 1.0 | 38 | 2.10E-02 | - | - |
| Funcat | 01.01.11.04.02 | Degradation of leucine | 5 | 1.0 | 63 | 1.30E-02 | 2ary metabolism | - |
| Funcat | 01.01.11.04 | metabolism of leucine | 5 | 1.0 | 50 | 3.90E-02 | 2ary metabolism | - |
| Interpro | IPR001623 | Heat shock protein DnaJ, N-terminal | 9 | 0.9 | 43 | 7.90E-03 | Macromolecule interaction | NO |
| Interpro | IPR000195 | RabGAP/TBC | 6 | 0.6 | 55 | 7.30E-03 | Secretion | NO |
| Interpro | IPR000683 | Oxidoreductase, N-terminal | 6 | 0.6 | 55 | 7.30E-03 | Metabolism | NO |
| Interpro | IPR006091 | Acyl-CoA dehydrogenase, central region | 4 | 0.4 | 100 | 1.10E-03 | Metabolism | NO |
| Interpro | IPR006092 | Acyl-CoA dehydrogenase, N-terminal | 4 | 0.4 | 100 | 1.10E-03 | Metabolism | NO |
| Funcat | 01.01.11.02.02 | Degradation of isoleucine | 4 | 0.0 | 67 | 2.00E-02 | 2ary metabolism | - |
| Funcat | 01.01.11.03.02 | Degradation of valine | 4 | 0.0 | 57 | 3.90E-02 | 2ary metabolism | - |
| Funcat | 01.01.03.01.02 | Degradation of glutamine | 2 | 0.0 | 100 | 4.40E-02 | 2ary metabolism | - |

See Table 2 for legend

variation is called PC1, the PC explaining the next highest amount of variation PC2 etc. A loading represents the contribution of an InterPro entry or an InterPro entry structure in forming a PC. For example, an InterPro entry with the lowest loading on PC1 is the largest contributor to positions low on PC1.

We found that in a PCA of counts of ORFs with InterPro entries, PC1 explains 40% of all the variation in the data while PC2 explains 22%. The positions of the species on these two PCs are shown in Figure 7. Pezizomycotina are separated from Saccharomycotina on PC1. Between these phyla N. crassa and Y. lipolytica have the least difference and Hymenomycetes are similarly in the middle of PC1. This means that in counts of ORFs with InterPro entries the difference between Pezizomycotina and Saccharomycotina is the largest source of variation and InterPro entries with lowest loadings on PC1 are the most abundant in Pezizomycotina in comparison to Saccharomycotina and vice-versa for highest loadings. Hymenomycetes are separated from other fungi on PC2. An almost identical result was achieved by using counts of ORFs with an InterPro entry structure instead (data not shown).

To understand which individual InterPro entries (Figure 8a) or InterPro entry structures (Figure 8b) contribute most to the difference between the species we studied InterPro entry and InterPro entry structure loadings on PC1 and PC2 of their respective PCAs. We selected the 100 InterPro entries or InterPro entry structures with most extreme loadings (TOP 100) and visualised the counts of ORFs of the TOP 100 InterPro entries from Figure 8a as a heatmap (Figure 9 and 10, or an excel version [see additional file 2]).

Based on the distributions of the loadings, fewer than one hundred InterPro entries or InterPro entry structures actually notably contribute to the positions of the species on PC1 and PC2 (Figure 8a and 8b). The TOP100 InterPro entries

Table 5: Enriched cluster categories in region D: "Saccharomycotina unique"

[Table 5 rows are not recoverable from the extracted text.]

See Table 2 for legend


Figure 6: The KEGG metabolic pathway "Valine, leucine and isoleucine degradation". Enzymes found in S. cerevisiae according to SGD and KEGG are filled with green and in M. grisea according to KEGG circled in orange. Enzymes found in clusters corresponding to Funcat categories 01.01.11.02 - 01.01.11.04 enriched in the region C: "Saccharomycotina absent" in Table 3 are filled with pink.


Figure 7: PCA of counts of ORFs with an InterPro entry. Positions of species analysed with InterProScan on the two PCs that explain the largest amount of variation in the counts of ORFs with an InterPro entry. Species abbreviations are explained in Table 1 and data points are coloured by phyla.

are related to a wide variety of biological functions, but the five most common, with their counts, are: "Macromolecule interaction, 19", "Secondary metabolism, 18", "Dubious, 9", "Protein modification, 8" and "Transporter, 7". InterPro entries assigned to "Dubious" are entries that InterPro itself considers unreliable.
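The PCA on counts of ORFs per InterPro entry can be illustrated with a minimal sketch. It is not the authors' R code: the toy count matrix, the centring without scaling, and the power-iteration solver are all assumptions made here for illustration. Rows stand for species, columns for InterPro entries, and the leading eigenvector of the covariance matrix gives the PC1 loadings.

```python
import math

def center_columns(rows):
    """Subtract each column (InterPro entry) mean across species."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between the columns of a centred matrix."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def first_component(cov, iterations=1000):
    """Power iteration: (eigenvalue, loadings) of the leading PC."""
    vec = [float(i + 1) for i in range(len(cov))]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(v * sum(c * w for c, w in zip(row, vec))
                for v, row in zip(vec, cov))
    return value, vec

# Hypothetical counts: two Pezizomycotina-like and two
# Saccharomycotina-like species, two InterPro entries.
counts = [[10, 1], [10, 1], [1, 1], [1, 1]]
centered = center_columns(counts)
cov = covariance(centered)
value, loadings = first_component(cov)
explained = value / sum(cov[i][i] for i in range(len(cov)))
# Species positions on PC1 are projections onto the loadings.
scores = [sum(v * w for v, w in zip(row, loadings)) for row in centered]
```

The sign of a principal component is arbitrary, so only the relative positions of species and entries along PC1 carry meaning. Entries with the largest absolute loadings are the ones that would enter a TOP 100 style selection.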

Figure 8: PCA loadings of InterPro entries and InterPro entry structures. PCA loadings of InterPro entries (a) or InterPro entry structures (b) of the two PCs that explain the largest amount of variation in the counts of ORFs with an InterPro entry. The PCA for InterPro entries (a) is shown in picture 11, for InterPro entry structures (b) data not shown. The 100 InterPro entries or InterPro entry structures having the most extreme PCA loadings on the two PCs shown are coloured orange (TOP 100), while the remaining 4373 InterPro entries and 16319 InterPro entry structures are in blue. InterPro entry identifiers are shown for the 20 most extreme PCA loadings.

The InterPro entry structures contributing most to PC1 and PC2 and having several InterPro entries that appear together are "IPR001138: Fungal transcriptional regulatory protein, N-terminal, IPR007219: Fungal specific transcription factor", "IPR000719: Protein kinase, IPR008271: Serine/threonine protein kinase, active site", "IPR011046: WD40-like, IPR001680: WD40 repeat" and "IPR002290: Serine/threonine protein kinase, IPR008271: Serine/threonine protein kinase, active site" (Figure 8b). These seven InterPro entries belong also to the TOP100 individual entries shown on Figure 8a and 8.

IPR001138 is a DNA binding N-terminal zinc binuclear cluster (Zn2Cys6) domain and IPR007219 a C-terminal domain commonly found in proteins with IPR001138. Both are found in many fungal transcription factors such as S. cerevisiae GAL4 [32], A. niger xlnR [33] and Aspergillus flavus aflR [34]. Of proteins with IPR001138, 39% also have an IPR007219. On average a Pezizomycotina has three times more ORFs with IPR001138 than a Saccharomycotina.
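An InterPro entry structure, as described in the text, collapses overlapping and redundant hits to the most specific entry and concatenates the rest in sequence order. The sketch below shows one way this could be done. The function name, the hit tuple layout, the overlap rule and the coordinates are hypothetical. Only the parent-child pair IPR001236/IPR010097 and the IPR001138/IPR007219 pairing come from the article.

```python
def entry_structure(hits, parent_of):
    """Build an entry structure string from InterProScan-like hits.

    hits: list of (entry_id, start, end) tuples (hypothetical layout).
    parent_of: dict mapping a child entry to its parent entry.
    """
    def ancestors(entry):
        seen = set()
        while entry in parent_of:
            entry = parent_of[entry]
            if entry in seen:  # guard against a malformed hierarchy
                break
            seen.add(entry)
        return seen

    def overlaps(a, b):
        return a[1] <= b[2] and b[1] <= a[2]

    kept = []
    for hit in sorted(hits, key=lambda h: (h[1], h[2], h[0])):
        # Redundant: an overlapping hit is a more specific (child) entry.
        redundant = any(other is not hit and overlaps(hit, other)
                        and hit[0] in ancestors(other[0]) for other in hits)
        # Repeat: the same entry is already kept at an overlapping position.
        repeat = any(k[0] == hit[0] and overlaps(k, hit) for k in kept)
        if not redundant and not repeat:
            kept.append(hit)
    return ", ".join(h[0] for h in kept)

# Parent-child relation quoted in the Methods; coordinates are invented.
parent_of = {"IPR010097": "IPR001236"}
mdh_hits = [("IPR001236", 1, 320), ("IPR010097", 3, 318)]
tf_hits = [("IPR007219", 200, 450), ("IPR001138", 20, 60)]

mdh_structure = entry_structure(mdh_hits, parent_of)
tf_structure = entry_structure(tf_hits, parent_of)
```

With these toy inputs the malate dehydrogenase example reduces to its most specific entry, and the transcription factor example yields the two-entry structure discussed above, ordered by position in the protein.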


Figure 9: TOP 100 InterPro entries. See Figure 10 for legend.


Figure 10: Legend for figure 9. A heatmap of counts of ORFs with an InterPro entry for the TOP 100 entries from Figure 8a. In the main heatmap colour intensity of a cell shows the number of ORFs with an InterPro entry shown by entries (rows) and by species (columns). Both rows and columns are ordered by hierarchical clustering to group similar rows or columns together. Columns were clustered with counts of ORFs while rows were clustered with the entry PCA loadings (Left side heatmap and Figure 8a). The dendrogram from hierarchical clustering is shown for columns and the phylum of species is indicated by a column colour bar between the heatmap and the dendrogram. Under the heatmap each species is specified by an abbreviation explained in Table 1. Left side heatmap shows the loading of the entry as in Figure 8a. Interpro entry identifier ("IPR id.", "IPR0" removed from beginning), name ("IPR name") and "Author assignment" are shown for each entry. The "Author assignment" is an assignment to general themes that summarise the individual categories based on the InterPro database. While the other assignments are directly based on the InterPro, "Secondary metabolism" covers entries which are known to participate also in secondary metabolism ([61, 62] and InterPro). InterPro entries assigned to "Dubious" are entries that InterPro itself considers unreliable.

Browsable fungal comparative genomics database
Automatic classifications of translated ORFs to families based on similarity to known families or de novo classifications by translated ORF clustering are not perfect. We have used InterProScan [16] to find out if an ORF is a member of a known family or has some known sequence feature like a domain or an active site i.e. has an InterPro entry. We also used TRIBE-MCL [15] to cluster ORFs based purely on sequence similarity. Our database aims to integrate these two classifications in the context of fungal comparative genomics and to allow easy browser access to the data.

Figure 11 shows a schematic of browser views and the links between them and example screenshots. "Protein view" shows an individual ORF with its InterPro entries. It is mainly used to access "Protein cluster view" that shows all the member ORFs of a cluster with InterPro entries found in them. "InterPro entry view" is identical to "Protein cluster view", except that it shows all ORFs with a given InterPro entry. "Protein cluster overview" and "InterPro entry overview" show a heatmap where clusters or InterPro entries are rows, and the columns show the count of ORFs in each species. The two overviews are used to find clusters or InterPro entries with similar phylogenetic profiles. "InterPro cluster overlap view" shows for a given InterPro entry a table of all clusters whose members have the entry.

The sizes of the TRIBE-MCL [15] clusters are influenced by the parameter inflation value. For an overall comparison of genomes we selected the inflation value 3.1, because it gives a good compromise between clustering sensitivity and specificity (Figure 1). However, in individual cases this might not be the best value. Users can specify different inflation values while browsing the database.

Discussion
Overview of clustering results
We have analysed the ORF content of Pezizomycotina and Saccharomycotina species with sequence similarity based protein clustering. The main clustering parameter, inflation value (r), was selected as a compromise between sensitivity and specificity of clustering. Consequently, the number of ORFs in clusters with only a single member (orphan clusters) is twice as large with the selected (r = 3.1) as with the lowest (r = 1.1) inflation value (Figure 1). The clusters were then divided into phylum specific groups for further analysis. The column "S/P r 1.1" in Figure 4 shows that the ratio of Saccharomycotina to Pezizomycotina ORF counts, i.e. the phylum specificity, is largely maintained in the clusters even in r = 1.1 clustering. This holds especially well for A: "Pezizomycotina abundant" and D: "Saccharomycotina unique" clusters, while other regions have more variation.

As expected, N. crassa, due to its extreme RIP mechanism, has the lowest genomic ORF redundancy among Pezizomycotina (Figure 3) [7]. Variable RIP activity could also explain the separation of species (T. reesei, F. graminearum, C. globosum, M. grisea and N. crassa) into two groups in regard to counts of unrecognised ORFs (Figure 2). It has been proposed that Aspergillus species and M. grisea could have a milder RIP like mechanism [35,36]. Our analysis does not exclude this possibility, but it is clear that the N. crassa genome is particular among the fungi. N. crassa and S. cerevisiae are probably the most commonly used fungal model organisms and thus it is notable that with regard to genomic ORF redundancy they have the most exceptional genomes within the analysed large fungal genome set.


Figure 11: Schema and screenshots of the browsable fungal comparative genomics database. The schema on the upper right corner shows links between different browser views to the database. Additionally two example screenshots are shown. See text for further details.

Characterisation of cluster regions
From the protein clustering results presented in Figure 4 we selected four interesting regions containing clusters with particular phylogenetic distributions. The clusters in these regions were then searched for enrichment of Protfun and Funcat category and InterPro entry annotations.

Clusters in the region A: "Pezizomycotina abundant" have usually more ORFs in Pezizomycotina than in other phyla and none in Saccharomycotina (Figure 6). The results of all three annotation analyses of these clusters agree well (Table 2). This region contains relatively well characterised enzymes involved either in secondary metabolism or in extracellular plant biomass degradation machinery, but Funcat analysis finds far more clusters involved in plant biomass degradation than InterPro analysis. However, InterPro provides much less detail on glycosyl hydrolases involved in lignocellulose degradation, than for example the specialised Carbohydrate-active enzymes (CAZY) database [37].

Many of the enzymes in clusters of this region, such as cytochrome P450s, glycoside hydrolases and tyrosinases are at the edge of the metabolic network involved in synthesis of the final secreted metabolites or degradation of the external sources of carbon. Consequently, the high sequence homology and the resulting larger clusters of A: "Pezizomycotina abundant" proteins could be due to recent protein family expansions to give more possibilities for the fungi to interact with the environment.

Clusters in the region B: "Pezizomycotina specific" have usually just one ORF in each Pezizomycotina species and none in the other phyla (Figure 4). In comparison to A: "Pezizomycotina abundant" clusters, far fewer member ORFs have any InterPro entries (Figure 4, column "wIPR") and very few InterPro or Funcat annotations show enrichment (Tables 3). The significant enrichment of the Protfun "Cellular role: Regulatory functions" and InterPro entries "IPR001138: Fungal transcriptional regulatory


protein, N-terminal" and "IPR007219: Fungal specific transcription factor" coincide well.

Clusters in the region C: "Saccharomycotina absent" have usually just one ORF in each Pezizomycotina species and none in Saccharomycotina (Figure 4). According to Protfun predictions these are mostly enzymes. In comparison to B: "Pezizomycotina specific" clusters, more member ORFs have some InterPro entries (Figure 4, column "wIPR"), but still neither InterPro entry nor Funcat category enrichment provide much detail, except for proteins found in the KEGG pathway "Valine, leucine and isoleucine degradation" (Figure 6). The "EC: 1.3.99.3" acyl-CoA dehydrogenase, detected by InterPro and Funcat enrichment analysis, is a key enzyme for fatty acid β-oxidation in mitochondria. The capability for fatty acid β-oxidation in mitochondria has been shown experimentally for A. nidulans [38] and proposed by genome analysis for several other Pezizomycotina species [6,38]. However, Saccharomycotina can apparently only carry out fatty acid β-oxidation in peroxisomes [39-42]. The proteins detected by Funcat enrichment analysis and found in M. grisea are according to KEGG linked to polyketide synthesis (Figure 6). A reason for the other fungi than Saccharomycotina to have this capability while the Saccharomycotina can do without, could be the ability to provide precursors from the fatty acid and amino acid metabolism to the polyketide synthesis.

B: "Pezizomycotina specific" and C: "Saccharomycotina absent" clusters contain proteins of low homology that subsequently divide into small equally sized clusters. They seem to contain proteins more connected in cellular metabolism and regulation networks than the clusters in the region A: "Pezizomycotina abundant" and thus be under more pressure not to expand to keep the topology of the networks conserved. This would allow the sequences to diverge in time without simultaneous expansion of the families and result into small clusters broadly conserved over Pezizomycotina.

In addition, based on the distribution of the clusters with ORFs found in S. cerevisiae metabolic model iMH805 [25], it is clear that over 90% of these ORFs are well conserved in all fungal species. These form the conserved core of the fungal metabolic network. Metabolism related ORFs in B: "Pezizomycotina specific" and C: "Saccharomycotina absent" could then represent a mid layer around the core such as the ORFs shown in Figure 6, that link primary metabolism to secondary metabolism and biomass degradation machinery. However, Protfun predicts that most B: "Pezizomycotina specific" ORFs are not related to metabolism (Table 4), rather they could be involved for example in Pezizomycotina specific morphology. Actual biomass degradation and secondary metabolism machineries, i.e. A: "Pezizomycotina abundant" ORFs, could then form the edges of the network. Being on the outer rim of the metabolic network they could be subjected to minimal evolutionary constraints for conserving network topology and could be directly subjected to various evolutionary pressures by the changing environments and thus evolve fast.

Proteins involved in secondary metabolism, glycoside hydrolases and predicted secreted uncharacterised ORFs are found in for example subtelomeric regions of Magnaporthe oryzae [43]. Subtelomeric regions appear to have a potential for faster evolution than many other chromosomal regions [43-45]. Consequently, subtelomeric positioning could provide a mechanism to explain expansion of A: "Pezizomycotina abundant" like protein families.

Clusters in the region D: "Saccharomycotina unique" have usually just one ORF in each Saccharomycotina species and none in the other phyla (Figure 4). Various Protfun cellular role predictions are enriched in these clusters, among them translation, transcription and mitochondrion related, found also in Funcat enrichment analysis. Strikingly, clusters with the Funcat category "42.16: Biogenesis of cellular components: Mitochondrion" correspond to at least 28 different components of the mitochondrial ribosomal protein complexes. However, homologous Pezizomycotina ORFs are found in the region "B: Pezizomycotina specific". The protein sequences in different families of mitochondrial ribosomal proteins have diverged so much between Pezizomycotina and Saccharomycotina that TRIBE-MCL splits the families in different clusters. This implies that these complexes have diverged significantly between Pezizomycotina and Saccharomycotina.

Comparison of clustering and PCA results
In addition to clustering, a PCA of counts of ORFs with an InterPro entry was done. We searched for known protein sequence features (domains, families, and active sites etc., i.e. InterPro entries) with the program InterProScan from a subset of genomes in our data set (Table 1), counted ORFs having an entry and analysed the counts with PCA. The function of the 100 InterPro entries with most differences between the species analysed (TOP 100) was studied in detail.

TRIBE-MCL clustering and InterPro analyses complemented each other well. Proteins that are conserved in several species could be studied comprehensively regardless of how well they have previously been described in InterPro by clustering (Figure 4). However, the TOP 100 InterPro entries contain interesting details not detected by clustering, such as the expansion of major facilitator superfamilies (MFS) and macromolecule interaction


related protein families in Pezizomycotina and the distribution of transposon related proteins among Pezizomycotina (Figure 9). MFS transporters lie in clusters with ORFs from all species and due to this their expansion was not detected clearly by the clustering analysis (Figure 4). In contrast, enrichment of "IPR001138: Fungal transcriptional regulatory protein, N-terminal" was detected. However, the full importance of this expansion was only revealed by PCA, because there are 614 clusters with less than 10 members that have an IPR001138. As InterPro domain analysis uses searches optimised for individual protein families or domains, it can detect well both families with high internal sequence similarity such as MFS transporters and with low internal sequence similarity such as IPR001138, if the family has been previously well characterised.

The InterPro entries among the TOP 100 with the lowest values on PC1, thus most abundant in Pezizomycotina in comparison to Saccharomycotina (Figures 8 and 9), overlap with cluster InterPro entry enrichment of A: "Pezizomycotina abundant" and B: "Pezizomycotina specific" clusters (Tables 2 and 3). The major difference between clustering and PCA results is that in clustering transcription factors belong to B: "Pezizomycotina specific", and secondary metabolism related clusters fall into A: "Pezizomycotina abundant", while PCA makes no such distinction. MFS transporters have also very negative values on PC1 (Figure 9). They are transporters found for example in fungal secondary metabolite clusters for gliotoxin [46] or trichothecene [47] synthesis. As secondary metabolism seems to be the major function connecting InterPro entries with most negative values on PC1, it could be expected that "IPR001138: Fungal transcriptional regulatory protein, N-terminal" family has expanded to control secondary metabolism and that MFS transporters would have expanded to export the produced metabolites. However, likely targets are also the many plant biomass degradation related clusters found in A: "Pezizomycotina abundant", for example IPR001138 transcription factor A. niger xlnR [33] is known to regulate cellulase and hemicellulase genes and A. flavus aflR [34] regulates a secondary metabolism pathway. Furthermore, Pezizomycotina specific features like formation or other yet incompletely described processes could involve these transcription factors.

Hierarchical clustering of species' cluster ORF (Figure 4) or InterPro entry (Figure 9) counts produce different species similarity dendrograms (shown on top of the heatmap in both figures). In Figure 4, the tree constructed from clusters with at least ten ORFs, resembles closely the one presented by NCBI database [48]. However, Y. lipolytica, expected to be the first species to diverge from the Saccharomycotina lineage (for review see [1]), is separated from other yeasts. This could imply that Y. lipolytica ORFs and ORF content of conserved cellular functions resemble more those of other fungi than Saccharomycotina. In the tree constructed from the TOP 100 InterPro entries Saccharomycotina are grouped together while Pezizomycotina are not divided as expected (Figure 9). For example, the Eurotiales are grouped together in Figure 4, although they are separated in Figure 9. This could mean that even though the ORF contents among Eurotiales are very similar at large, in regard to protein families contributing most to the differences between the fungi studied, individual Eurotiales can resemble more other less related species. For example the human pathogen A. fumigatus is more similar to the plant pathogen M. grisea than to other Eurotiales.

Conclusion
In conclusion we have carried out a detailed comparison of the ORF content of Pezizomycotina and Saccharomycotina species and found previously well described and novel interesting differences. Our analysis includes fundamentally different but complementary databases and methods. Due to this we are able to support our conclusions with results from independent analysis. The use of a large multi genome dataset further emphasises the credibility of the results.

We highlight promising targets for future experimental studies such as the expanded transcription factor family and potential novel secreted protein families in Pezizomycotina. In general the proteins specific to and well conserved in Pezizomycotina should be studied further as they are largely unknown. We propose an evolutionary explanation for the discovered Pezizomycotina genome features. However, confirmation of this requires detailed comparisons of the chromosomal localisations of the genes in question and of the fungal metabolic and regulatory network.

Methods
Sequence and cluster analyses and annotation, database construction and interactions and usage of various bioinformatics software were done with custom Perl scripts using when possible the BioPerl [14] libraries. Final data analysis and visualisation of results were done with custom R scripts using the Bioconductor libraries [49] when possible.

Retrieval and analysis of sequence data
Protein sequences of predicted open reading frames (ORF), in fasta format, were retrieved from various sequencing centres as indicated in Table 1. They have been published in the following articles: S. pombe [50], A. fumigatus [4], A. nidulans [6], A. oryzae [5], M. grisea [36], N. crassa [7], A. gossypii [51], C. albicans [52], C. glabrata [3],

Page 19 of 23 (page number not for citation purposes) BMC Genomics 2007, 8:325 http://www.biomedcentral.com/1471-2164/8/325

D. hansenii [3], K. lactis [3], S. castellii [53], S. cerevisiae all cluster's sensitivities and specificities in a clustering. [54], S. kluyveri [53], Y. lipolytica [3], P. chrysosporium [55], Clusters and their members are provided [see additional U. maydis [56], in addition P. pastoris was purchased from file 6]. ERGO [57]. Sequences were inserted to Oracle database (Oracle) according to the BioPerlDB schema. The Addi- We defined two indexes to describe the difference tional file 3 maps each sequence identifier to its species between clustering with r = 1.1 and r = 3.1, Stability and enabling the use of other Additional files [see additional "S/P r 1.1" (Figure 4). The Stability of a cluster is the ratio file 3]. of cluster sizes between r = 3.1 and r = 1.1. The cluster size in r = 1.1 is the average of the sizes of clusters to which the To find functional and pathway annotations for proteins members of the r = 3.1 cluster belong to. "S/P r 1.1" of a Funcat [17-19] annotations were downloaded from MIPS cluster is the ratio between counts of ORFs from Saccharo- [58] for genomes indicated in Table 1. mycotina and Pezizomycotina species. Similarly as Stabil- ity, it was counted for an r = 3.1 cluster by taking the Proteins from a subset of genomes (Table 1) were ana- average of Saccharomycotina – Pezizomycotina ratios of lysed with InterProScan [16] to find features from their all the clusters to which the members of an r = 3.1 cluster sequence, i.e. InterPro [59] entries. InterPro entries can belong to in an r = 1.1 clustering. have hierarchical parent-child relationship. For example IPR001236 "Lactate/malate dehydrogenase " is a parent of As well as an InterPro entry structure, we assigned individ- IPR010097 "Malate dehydrogenase, NAD-dependent". ual InterPro entries and Funcat and Protfun categories for For each protein an InterPro entry structure, i.e. a vector of clusters to enable analysis of sets of clusters. 
In addition InterPro entry identifiers in the order they appear on the clusters were checked for presence of proteins found in the sequence, was constructed. From non-identical entries S. cerevisiae metabolic model iMH805 [25]. overlapping along protein sequence only the most specific was selected in case a parent-child relationship existed Browsing of protein clusters between them. Next, any overlapping or sequential dupli- To view and browse sequences, their features and protein cates of the same entry were removed and the longest clusters, a Generic Genome Browser (GBrowse) [13] was entry in amino acids was used to define the entries loca- set up. Normally GBrowse is used to view individual scaf- tion. InterPro entries found in the ORFs are provided [see folds with features i.e. genes. We modified it to show pro- additional file 4]. tein sequences with features i.e. domains etc. detected with InterProScan [16] and to display a cluster view where Proteins from a subset of genomes (Table 1) were ana- all proteins of a cluster are shown along with their fea- lysed also with Protfun [60] to predict their function and tures. localisation. Protfun categories of proteins are provided [see additional file 5]. Data analysis and visualisation Significance of enrichment of a category in a set of clusters Protein clustering and cluster annotation was tested using the hypergeometric distribution with R Proteins were clustered with the graph clustering software function "phyper" with default settings i.e. by counting TRIBE-MCL [15]. Briefly, TRIBE-MCL identifies protein the probability of the number of successes (number of family like protein clusters using a Markov Clustering pro- clusters with a certain category) in a sequence of n draws cedure operating on a matrix of expectation values com- from a finite population (total number of clusters, uni- puted from an all-versus-all BLASTP [21] search of protein verse) without replacement. Because of the different sequences. 
TRIBE-MCL requires one major user defined nature of the cluster annotation data, different universes parameter, the inflation value (r), which influences the were used. For Funcat categories all clusters which had any size of the clusters. To define a suitable r the InterPro entry assigned category or entry were used as the universe. For structures were used. Each cluster's InterPro entry struc- Protfun and Interpro data all clusters having at least 10 ture was defined as the most common InterPro entry members (Figure 4), were used as the universe. structure found in its member proteins. Subsequently, for each tested r specificity and sensitivity of the clustering Principal Component Analysis (PCA) was used to find (Figure 1) were counted as proposed in [22]. A cluster's major sources of variation in counts of ORFs having either specificity was defined as the percentage of proteins in a individual InterPro entries or InterPro entry structures. cluster having the same structure as the cluster's structure. Counts of ORFs with InterPro entries or InterPro entry A cluster's sensitivity was defined as the percentage of pro- structures were normalised with the count of ORFs in a teins with the same structure as the cluster in question of genome. PCA was done with R function "prcomp" with all proteins with this structure. Sensitivity and specificity option "retx" set to true. of a clustering were defined as the respective averages of
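The enrichment test described under Data analysis and visualisation can be illustrated with a small sketch. This is not the authors' R code: it is a plain Python re-statement of the hypergeometric distribution, and the function names and the toy numbers are hypothetical. R's "phyper" with default settings returns the lower tail P(X <= q); the upper tail P(X >= q) is also shown because it is a common choice of enrichment p-value, although the text does not state which tail was ultimately reported.

```python
from math import comb


def hypergeom_pmf(k, successes, failures, draws):
    """P(X = k) for `draws` items taken without replacement from a universe
    holding `successes` annotated and `failures` unannotated clusters."""
    if k < 0 or k > draws or k > successes or draws - k > failures:
        return 0.0
    total = successes + failures
    return comb(successes, k) * comb(failures, draws - k) / comb(total, draws)


def lower_tail(q, successes, failures, draws):
    """P(X <= q); same argument order as R's phyper(q, m, n, k)."""
    return sum(hypergeom_pmf(i, successes, failures, draws) for i in range(q + 1))


def upper_tail(q, successes, failures, draws):
    """P(X >= q); an assumed enrichment p-value, not confirmed by the text."""
    return sum(hypergeom_pmf(i, successes, failures, draws)
               for i in range(q, draws + 1))


# Hypothetical toy universe: 20 clusters, 5 carry the category.
# A set of 4 clusters is drawn and 3 of them carry the category.
p_enriched = upper_tail(3, 5, 15, 4)  # roughly 0.032
```

The universe size enters only through the `successes` and `failures` arguments, so switching between the Funcat universe and the "at least 10 members" universe changes those two counts and nothing else.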

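The clustering quality measures defined under Protein clustering and cluster annotation (a cluster's structure, its specificity and its sensitivity, averaged over all clusters) can be sketched as follows. This is a minimal illustration under stated assumptions: the input dictionaries, the function name and the example identifiers are hypothetical, values are returned as fractions rather than percentages, and ties for the most common structure are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def clustering_quality(clusters, structure_of):
    """Return (mean specificity, mean sensitivity) of a clustering.

    clusters:     dict mapping cluster id -> list of protein ids
    structure_of: dict mapping protein id -> tuple of InterPro entry ids
    """
    # How many proteins in the whole data set carry each structure.
    totals = Counter(structure_of[p]
                     for members in clusters.values() for p in members)
    specificities, sensitivities = [], []
    for members in clusters.values():
        counts = Counter(structure_of[p] for p in members)
        # The cluster's structure is the most common member structure.
        structure, hits = counts.most_common(1)[0]
        specificities.append(hits / len(members))      # share of the cluster
        sensitivities.append(hits / totals[structure])  # share of the structure
    n = len(clusters)
    return sum(specificities) / n, sum(sensitivities) / n


# Hypothetical example with two clusters and two structures.
example_clusters = {"c1": ["p1", "p2", "p3"], "c2": ["p4", "p5"]}
example_structures = {
    "p1": ("IPR001138",), "p2": ("IPR001138",), "p3": ("IPR001236",),
    "p4": ("IPR001236",), "p5": ("IPR001236",),
}
spec, sens = clustering_quality(example_clusters, example_structures)
```

Running the same function on clusterings obtained with different inflation values r would give one (specificity, sensitivity) pair per r, which is the kind of comparison the text describes for choosing r.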

Heatmaps of ORF counts in clusters or of ORFs having certain InterPro entries were plotted with a modified version of "heatmap.2" from the R library "gregmisc". Complete linkage hierarchical clustering with Euclidean distance was used to order rows and columns separately. Prior to clustering, one was added to each count, the natural logarithm was taken, and the data were mean centred and standard deviation scaled, row wise for the row clustering and column wise for the column clustering.

Additional file 4
Sequence identifier to InterPro identifier mappings. A text file with the InterPro entry identifiers found in each protein sequence. The first column contains the ORF identifier as provided by the sequencing centre and the second column the identifier for the InterPro domain found in the ORF.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S4.zip]
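As a rough illustration of the heatmap preprocessing described above (add one, take the natural logarithm, centre and scale, then order by complete linkage clustering on Euclidean distance), the following Python sketch handles the row-wise case. It is not the authors' modified "heatmap.2": the function names are hypothetical, the sample standard deviation is an assumption chosen to match R's "scale", and the naive clustering loop is only suitable for small matrices.

```python
import math
from statistics import mean, stdev


def transform_rows(counts):
    """log(count + 1), then centre each row to mean 0 and scale to sd 1."""
    transformed = []
    for row in counts:
        logged = [math.log(c + 1) for c in row]
        m, s = mean(logged), stdev(logged)
        # Assumption: a constant row (sd 0) is left as all zeros.
        transformed.append([(v - m) / s if s > 0 else 0.0 for v in logged])
    return transformed


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage_order(rows):
    """Naive agglomerative clustering; returns a row order in which the
    members of every merged cluster stay next to each other."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance of the farthest pair of members.
                d = max(euclidean(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters[0]


# Hypothetical ORF counts: rows are clusters, columns are species.
toy_counts = [[0, 1, 10], [0, 2, 9], [50, 3, 0]]
scaled = transform_rows(toy_counts)
row_order = complete_linkage_order(scaled)
```

For the column ordering the same two functions would be applied to the transposed count matrix, so that centring and scaling are done column wise as the text describes.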

Authors' contributions
MA carried out data analysis and drafted the manuscript. TK and MA designed and set up the database and the database browser interface, and TK participated in the design of the study. AM carried out the InterProScan analysis. MS participated in the design of the study. DU participated in the design of the study. MP and SO conceived the study and participated in the design of the study. All authors have participated in drafting the manuscript and approved the final manuscript.

Additional files
No data is included for yet unpublished genomes, nor for the P. pastoris purchased from ERGO. To find the actual sequences, search the corresponding sequencing centre's internet site with the protein sequence identifier.

Additional file 5
Sequence identifier to ProtFun category mappings. A text file with the Protfun categories found for each protein sequence. The first column contains the ORF identifier as provided by the sequencing centre and the following columns the ProtFun categories found for the ORF.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S5.zip]

Additional file 6
Sequence identifier to cluster identifier mappings. A text file with clusters and their members. The first column contains the ORF identifier as provided by the sequencing centre and the second column the respective protein cluster.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S6.zip]

Additional material

Additional file 1
Figure 4 as table. An Excel version of Figure 4. The first sheet contains column title explanations, the second sheet the numbers for Figure 4 with cluster identifiers, InterPro structures and Protfun assignments. The last sheet contains Funcat annotations for the clusters in Figure 4.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S1.xls]

Additional file 2
Figure 9 as table. An Excel version of Figure 9. The first sheet contains column title explanations, the second sheet the numbers for Figure 9 with InterPro identifiers, InterPro names and "Author assingments".
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S2.xls]

Additional file 3
Sequence identifier to species mappings. A text file with the species for each protein sequence identifier. The first column contains the ORF identifier as provided by the sequencing centre and the second column the species.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-325-S3.zip]

Acknowledgements
We thank Marko Sysi-Aho for help in data analysis, Kristoffer Kiil for help with Protfun, Luke Hakes, Thomas Grotkjaer and Balázs Papp for a lot of useful discussions in bioinformatics and Esko Ukkonen and Liisa Holm for giving great ideas.

This work was supported by the European Union 6th Framework Collaboration Action Systems Biology Network LSHG-CT-2005-018942 and by the European Union 5th Framework Marie Curie Training Site for Bioinformatics Education and Research at the Faculty of Life Sciences, University of Manchester HPMT-CT-2001-00285. This work was also part of the research programme "VTT Industrial Biotechnology" (Academy of Finland; Finnish Centre of Excellence programme 2000-2005, project no. 64330).

References
1. Dujon B: Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet 2006, 22:375-387.
2. Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428:617-624.
3. Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuveglise C, Talla E, Goffard N, Frangeul L, Aigle M, Anthouard V, Babour A, Barbe V, Barnay S, Blanchin S, Beckerich JM, Beyne E, Bleykasten C, Boisrame A, Boyer J, Cattolico L, Confanioleri F, De Daruvar A, Despons L, Fabre E, Fairhead C, Ferry-Dumazet H, Groppi A, Hantraye F, Hennequin C, Jauniaux N, Joyet P, Kachouri R, Kerrest A, Koszul R, Lemaire M, Lesur I, Ma L, Muller H, Nicaud JM, Nikolski M, Oztas S, Ozier-Kalogeropoulos O, Pellenz S, Potier S, Richard GF, Straub ML, Suleau A, Swennen D, Tekaia F, Wesolowski-Louvel M, Westhof E, Wirth B, Zeniou-Meyer M, Zivanovic I, Bolotin-Fukuhara M, Thierry A, Bouchier C, Caudron B, Scarpelli C, Gaillardin C, Weissenbach J, Wincker P, Souciet JL: Genome evolution in yeasts. Nature 2004, 430:35-44.
4. Nierman WC, Pain A, Anderson MJ, Wortman JR, Kim HS, Arroyo J, Berriman M, Abe K, Archer DB, Bermejo C, Bennett J, Bowyer P, Chen D, Collins M, Coulsen R, Davies R, Dyer PS, Farman M, Fedorova N, Fedorova N, Feldblyum TV, Fischer R, Fosker N, Fraser A, Garcia JL, Garcia MJ, Goble A, Goldman GH, Gomi K, Griffith-Jones S, Gwilliam R, Haas B, Haas H, Harris D, Horiuchi H, Huang J, Humphray S, Jimenez J, Keller N, Khouri H, Kitamoto K, Kobayashi T, Konzack S, Kulkarni R, Kumagai T, Lafon A, Latge JP, Li W, Lord A, Lu C, Majoros WH, May GS, Miller BL, Mohamoud Y, Molina M, Monod M, Mouyna I, Mulligan S, Murphy L, O'Neil S, Paulsen I, Penalva MA, Pertea M, Price C, Pritchard BL, Quail MA, Rabbinowitsch E, Rawlins N, Rajandream MA, Reichard U, Renauld H, Robson GD, Rodriguez de Cordoba S, Rodriguez-Pena JM, Ronning CM, Rutter S, Salzberg SL, Sanchez M, Sanchez-Ferrero JC, Saunders D, Seeger K, Squares R, Squares S, Takeuchi M, Tekaia F, Turner G, Vazquez de Aldana CR, Weidman J, White O, Woodward J, Yu JH, Fraser C, Galagan JE, Asai K, Machida M, Hall N, Barrell B, Denning DW: Genomic sequence of the pathogenic and allergenic filamentous Aspergillus fumigatus. Nature 2005, 438:1151-1156.
5. Machida M, Asai K, Sano M, Tanaka T, Kumagai T, Terai G, Kusumoto K, Arima T, Akita O, Kashiwagi Y, Abe K, Gomi K, Horiuchi H, Kitamoto K, Kobayashi T, Takeuchi M, Denning DW, Galagan JE, Nierman WC, Yu J, Archer DB, Bennett JW, Bhatnagar D, Cleveland TE, Fedorova ND, Gotoh O, Horikawa H, Hosoyama A, Ichinomiya M, Igarashi R, Iwashita K, Juvvadi PR, Kato M, Kato Y, Kin T, Kokubun A, Maeda H, Maeyama N, Maruyama J, Nagasaki H, Nakajima T, Oda K, Okada K, Paulsen I, Sakamoto K, Sawano T, Takahashi M, Takase K, Terabayashi Y, Wortman JR, Yamada O, Yamagata Y, Anazawa H, Hata Y, Koide Y, Komori T, Koyama Y, Minetoki T, Suharnan S, Tanaka A, Isono K, Kuhara S, Ogasawara N, Kikuchi H: Genome sequencing and analysis of Aspergillus oryzae. Nature 2005, 438:1157-1161.
6. Galagan JE, Calvo SE, Cuomo C, Ma LJ, Wortman JR, Batzoglou S, Lee SI, Basturkmen M, Spevak CC, Clutterbuck J, Kapitonov V, Jurka J, Scazzocchio C, Farman M, Butler J, Purcell S, Harris S, Braus GH, Draht O, Busch S, D'Enfert C, Bouchier C, Goldman GH, Bell-Pedersen D, Griffiths-Jones S, Doonan JH, Yu J, Vienken K, Pain A, Freitag M, Selker EU, Archer DB, Penalva MA, Oakley BR, Momany M, Tanaka T, Kumagai T, Asai K, Machida M, Nierman WC, Denning DW, Caddick M, Hynes M, Paoletti M, Fischer R, Miller B, Dyer P, Sachs MS, Osmani SA, Birren BW: Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature 2005, 438:1105-1115.
7. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma LJ, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen CB, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson MA, Werner-Washburne M, Selitrennikoff CP, Kinsey JA, Braun EL, Zelter A, Schulte U, Kothe GO, Jedd G, Mewes W, Staben C, Marcotte E, Greenberg D, Roy A, Foley K, Naylor J, Stange-Thomann N, Barrett R, Gnerre S, Kamal M, Kamvysselis M, Mauceli E, Bielke C, Rudd S, Frishman D, Krystofova S, Rasmussen C, Metzenberg RL, Perkins DD, Kroken S, Cogoni C, Macino G, Catcheside D, Li W, Pratt RJ, Osmani SA, DeSouza CP, Glass L, Orbach MJ, Berglund JA, Voelker R, Yarden O, Plamann M, Seiler S, Dunlap J, Radford A, Aramayo R, Natvig DO, Alex LA, Mannhaupt G, Ebbole DJ, Freitag M, Paulsen I, Sachs MS, Lander ES, Nusbaum C, Birren B: The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003, 422:859-868.
8. Galagan JE, Selker EU: RIP: The evolutionary cost of genome defense. Trends Genet 2004, 20:417-423.
9. Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Graf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard TJ: Ensembl 2006. Nucleic Acids Res 2006, 34:D556-61.
10. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A: MIPS: Analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2004:D41-4.
11. Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW: A DNA structural atlas for Escherichia coli. J Mol Biol 2000, 299:907-930.
12. E-fungi [http://www.e-fungi.org.uk/index.html]
13. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: A building block for a model organism system database. Genome Res 2002, 12:1599-1610.
14. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12:1611-1618.
15. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30:1575-1584.
16. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: Protein domains identifier. Nucleic Acids Res 2005, 33:W116-20.
17. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004, 32:5539-5545.
18. Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martinez J, Perez-Ortin JE, Michael H, Kaps A, Talla E, Dujon B, Andre B, Souciet JL, De Montigny J, Bon E, Gaillardin C, Mewes HW: CYGD: The comprehensive yeast genome database. Nucleic Acids Res 2005, 33:D364-8.
19. Guldener U, Mannhaupt G, Munsterkotter M, Haase D, Oesterheld M, Stumpflen V, Mewes HW, Adam G: FGDB: A comprehensive fungal genome resource on the plant pathogen Fusarium graminearum. Nucleic Acids Res 2006, 34:D456-8.
20. Jensen LJ, Ussery DW, Brunak S: Functionality of system components: Conservation of protein function in protein feature space. Genome Res 2003, 13:2444-2449.
21. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
22. Veeramachaneni V, Makalowski W: Visualizing sequence similarity of protein families. Genome Res 2004, 14:1160-1169.
23. Zdobnov EM, Apweiler R: InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17:847-848.
24. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 2003, 301:71-76.
25. S. cerevisiae iMH805/775 [http://gcrg.ucsd.edu/organisms/yeast.html]
26. Yu J, Chang PK, Ehrlich KC, Cary JW, Bhatnagar D, Cleveland TE, Payne GA, Linz JE, Woloshuk CP, Bennett JW: Clustered pathway genes in aflatoxin biosynthesis. Appl Environ Microbiol 2004, 70:1253-1262.
27. Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28:27-30.
28. Saccharomyces genome database [http://www.yeastgenome.org/]
29. Reinders J, Zahedi RP, Pfanner N, Meisinger C, Sickmann A: Toward the complete yeast mitochondrial proteome: Multidimensional separation techniques for mitochondrial proteomics. J Proteome Res 2006, 5:1543-1554.
30. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98:4569-4574.
31. Losson R, Lacroute F: Cloning of a eukaryotic regulatory gene. Mol Gen Genet 1981, 184:394-399.
32. Hashimoto H, Kikuchi Y, Nogi Y, Fukasawa T: Regulation of expression of the galactose gene cluster in Saccharomyces cerevisiae. Isolation and characterization of the regulatory gene GAL4. Mol Gen Genet 1983, 191:31-38.
33. van Peij NN, Visser J, de Graaff LH: Isolation and analysis of xlnR, encoding a transcriptional activator co-ordinating xylanolytic expression in Aspergillus niger. Mol Microbiol 1998, 27:131-142.
34. Woloshuk CP, Foutz KR, Brewer JF, Bhatnagar D, Cleveland TE, Payne GA: Molecular characterization of aflR, a regulatory locus for aflatoxin biosynthesis. Appl Environ Microbiol 1994, 60:2408-2414.
35. Montiel MD, Lee HA, Archer DB: Evidence of RIP (repeat-induced point mutation) in transposase sequences of Aspergillus oryzae. Fungal Genet Biol 2006, 43:439-445.


36. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H, Read ND, Lee YH, Carbone I, Brown D, Oh YY, Donofrio N, Jeong JS, Soanes DM, Djonovic S, Kolomiets E, Rehmeyer C, Li W, Harding M, Kim S, Lebrun MH, Bohnert H, Coughlan S, Butler J, Calvo S, Ma LJ, Nicol R, Purcell S, Nusbaum C, Galagan JE, Birren BW: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434:980-986.
37. Coutinho PM, Henrissat B: Carbohydrate-active enzymes: An integrated database approach. In Recent Advances in Carbohydrate Bioengineering. Edited by: Gilbert HJ, Davies G, Henrissat B, Svensson B. Cambridge: The Royal Society of Chemistry; 1999:3-12.
38. Maggio-Hall LA, Keller NP: Mitochondrial beta-oxidation in Aspergillus nidulans. Mol Microbiol 2004, 54:1173-1185.
39. Kunau WH, Buhne S, de la Garza M, Kionka C, Mateblowski M, Schultz-Borchard U, Thieringer R: Comparative enzymology of beta-oxidation. Biochem Soc Trans 1988, 16:418-420.
40. Hiltunen JK, Wenzel B, Beyer A, Erdmann R, Fossa A, Kunau WH: Peroxisomal multifunctional beta-oxidation protein of Saccharomyces cerevisiae. Molecular analysis of the fox2 gene and gene product. J Biol Chem 1992, 267:6646-6653.
41. Kurihara T, Ueda M, Okada H, Kamasawa N, Naito N, Osumi M, Tanaka A: Beta-oxidation of butyrate, the short-chain-length fatty acid, occurs in peroxisomes in the yeast Candida tropicalis. J Biochem (Tokyo) 1992, 111:783-787.
42. Smith JJ, Brown TW, Eitzen GA, Rachubinski RA: Regulation of peroxisome size and number by fatty acid beta-oxidation in the yeast Yarrowia lipolytica. J Biol Chem 2000, 275:20168-20178.
43. Rehmeyer C, Li W, Kusaba M, Kim YS, Brown D, Staben C, Dean R, Farman M: Organization of chromosome ends in the rice blast fungus, Magnaporthe oryzae. Nucleic Acids Res 2006, 34:4685-4701.
44. Freitas-Junior LH, Bottius E, Pirrit LA, Deitsch KW, Scheidig C, Guinet F, Nehrbass U, Wellems TE, Scherf A: Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum. Nature 2000, 407:1018-1022.
45. Naumov GI, Naumova ES, Sancho ED, Korhola MP: Polymeric SUC genes in natural populations of Saccharomyces cerevisiae. FEMS Microbiol Lett 1996, 135:31-35.
46. Gardiner DM, Waring P, Howlett BJ: The epipolythiodioxopiperazine (ETP) class of fungal toxins: Distribution, mode of action, functions and biosynthesis. Microbiology 2005, 151:1021-1032.
47. Alexander NJ, McCormick SP, Hohn TM: TRI12, a trichothecene efflux pump from Fusarium sporotrichioides: Gene isolation and expression in yeast. Mol Gen Genet 1999, 261:977-984.
48. Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the national center for biotechnology information. Nucleic Acids Res 2000, 28:10-14.
49. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80.
50. Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, Sgouros J, Peat N, Hayles J, Baker S, Basham D, Bowman S, Brooks K, Brown D, Brown S, Chillingworth T, Churcher C, Collins M, Connor R, Cronin A, Davis P, Feltwell T, Fraser A, Gentles S, Goble A, Hamlin N, Harris D, Hidalgo J, Hodgson G, Holroyd S, Hornsby T, Howarth S, Huckle EJ, Hunt S, Jagels K, James K, Jones L, Jones M, Leather S, McDonald S, McLean J, Mooney P, Moule S, Mungall K, Murphy L, Niblett D, Odell C, Oliver K, O'Neil S, Pearson D, Quail MA, Rabbinowitsch E, Rutherford K, Rutter S, Saunders D, Seeger K, Sharp S, Skelton J, Simmonds M, Squares R, Squares S, Stevens K, Taylor K, Taylor RG, Tivey A, Walsh S, Warren T, Whitehead S, Woodward J, Volckaert G, Aert R, Robben J, Grymonprez B, Weltjens I, Vanstreels E, Rieger M, Schafer M, Muller-Auer S, Gabel C, Fuchs M, Dusterhoft A, Fritzc C, Holzer E, Moestl D, Hilbert H, Borzym K, Langer I, Beck A, Lehrach H, Reinhardt R, Pohl TM, Eger P, Zimmermann W, Wedler H, Wambutt R, Purnelle B, Goffeau A, Cadieu E, Dreano S, Gloux S, et al.: The genome sequence of Schizosaccharomyces pombe. Nature 2002, 415:871-880.
51. Dietrich FS, Voegeli S, Brachat S, Lerch A, Gates K, Steiner S, Mohr C, Pohlmann R, Luedi P, Choi S, Wing RA, Flavier A, Gaffney TD, Philippsen P: The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science 2004, 304:304-307.
52. Jones T, Federspiel NA, Chibana H, Dungan J, Kalman S, Magee BB, Newport G, Thorstenson YR, Agabian N, Magee PT, Davis RW, Scherer S: The diploid genome sequence of Candida albicans. Proc Natl Acad Sci USA 2004, 101:7329-7334.
53. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 2003, 301:71-76.
54. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG: Life with 6000 genes. Science 1996, 274:563-7.
55. Martinez D, Larrondo LF, Putnam N, Gelpke MD, Huang K, Chapman J, Helfenbein KG, Ramaiya P, Detter JC, Larimer F, Coutinho PM, Henrissat B, Berka R, Cullen D, Rokhsar D: Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nat Biotechnol 2004, 22:695-700.
56. Kamper J, Kahmann R, Bolker M, Ma LJ, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Muller O, Perlin MH, Wosten HA, de Vries R, Ruiz-Herrera J, Reynaga-Pena CG, Snetselaar K, McCann M, Perez-Martin J, Feldbrugge M, Basse CW, Steinberg G, Ibeas JI, Holloman W, Guzman P, Farman M, Stajich JE, Sentandreu R, Gonzalez-Prieto JM, Kennell JC, Molina L, Schirawski J, Mendoza-Mendoza A, Greilinger D, Munch K, Rossel N, Scherer M, Vranes M, Ladendorf O, Vincon V, Fuchs U, Sandrock B, Meng S, Ho EC, Cahill MJ, Boyce KJ, Klose J, Klosterman SJ, Deelstra HJ, Ortiz-Castellanos L, Li W, Sanchez-Alonso P, Schreier PH, Hauser-Hahn I, Vaupel M, Koopmann E, Friedrich G, Voss H, Schluter T, Margolis J, Platt D, Swimmer C, Gnirke A, Chen F, Vysotskaia V, Mannhaupt G, Guldener U, Munsterkotter M, Haase D, Oesterheld M, Mewes HW, Mauceli EW, DeCaprio D, Wade CM, Butler J, Young S, Jaffe DB, Calvo S, Nusbaum C, Galagan J, Birren BW: Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature 2006, 444:97-101.
57. Overbeek R, Larsen N, Walunas T, D'Souza M, Pusch G, Selkov E Jr, Liolios K, Joukov V, Kaznadzey D, Anderson I, Bhattacharyya A, Burd H, Gardner W, Hanke P, Kapatral V, Mikhailova N, Vasieva O, Osterman A, Vonstein V, Fonstein M, Ivanova N, Kyrpides N: The ERGO genome analysis and discovery system. Nucleic Acids Res 2003, 31:164-171.
58. Mewes HW, Albermann K, Heumann K, Liebl S, Pfeiffer F: MIPS: A database for protein sequences, homology data and yeast genome information. Nucleic Acids Res 1997, 25:28-30.
59. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, Orchard S, Pagni M, Ponting CP, Quevillon E, Selengut J, Sigrist CJ, Silventoinen V, Studholme DJ, Vaughan R, Wu CH: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33:D201-5.
60. Jensen LJ, Gupta R, Staerfeldt HH, Brunak S: Prediction of human protein function according to gene ontology categories. Bioinformatics 2003, 19:635-642.
61. Mannhaupt G, Montrone C, Haase D, Mewes HW, Aign V, Hoheisel JD, Fartmann B, Nyakatura G, Kempken F, Maier J, Schulte U: What's in the genome of a filamentous fungus? Analysis of the Neurospora genome sequence. Nucleic Acids Res 2003, 31:1944-1954.
62. Keller NP, Hohn TM: Metabolic pathway gene clusters in filamentous fungi. Fungal Genet Biol 1997, 21:17-29.
