Genetics: Published Articles Ahead of Print, published on June 3, 2005 as 10.1534/genetics.104.034652

Sequence Finishing and Mapping for Candida albicans Chromosome 7, and Syntenic Analysis against Saccharomyces cerevisiae Genome

Hiroji Chibana,1* Nao Oka,1 Hironobu Nakayama,2 Toshiriro Aoyama,2 B.B. Magee,3 P.T. Magee,3 Yuzuru Mikami1

1Research Center for Pathogenic Fungi and Microbial Toxicoses, Chiba University, Chiba 260-8673, Japan.

2Suzuka National College of Technology, Suzuka, 510-0294, Japan.

3Department of Genetics, Cell Biology, and Development, University of Minnesota, St. Paul, MN 55108, USA.

The whole chromosome 7 sequence has been deposited at DDBJ/EMBL/GenBank under the project accession AP006852.

Running title: Finishing and annotation of C. albicans chr. 7

Keywords: pathogenic fungus, synteny, annotation, gene finding, genomic sequence, genome project

*Corresponding author: Hiroji Chibana, Research Center for Pathogenic Fungi and Microbial Toxicoses, Chiba University, 1-8-1, Inohana, Chuo-ku, Chiba (260-8673), Japan, Tel: 81(43)-226-2792, Fax: 81(43)-226-2486, E-mail: [email protected]


ABSTRACT

The genome of the opportunistic fungus Candida albicans is 15.6 Mb in size. Whole-genome shotgun sequencing was carried out at Stanford University, and the sequences were assembled into 412 contigs. C. albicans is diploid, and analysis of the sequence is complicated by repeated sequences and by sequence polymorphism between homologous chromosomes. Chromosome 7 is 1 Mb in size and is the best characterized of the 8 chromosomes in C. albicans. We assigned 16 of the super-contigs, ranging in length from 7,309 bp to 267,590 bp, to chromosome 7, and determined the sequences of 16 regions. These regions included four gaps, a misassembled sequence, and two Major Repeat Sequences (MRSs) of greater than 16 kb. The continuous sequence obtained was 949,626 bp long and provided complete coverage of chromosome 7 except for the telomeric regions. Sequence analysis predicted 407 genes on chromosome 7. A 7-kb indel, which might have been caused by a retrotransposon, was identified as the largest difference between the homologous chromosomes. Synteny analysis revealed that the degree of synteny between C. albicans and Saccharomyces cerevisiae is too weak to be useful for completing the genomic sequence of C. albicans.

Candida albicans is an important pathogenic fungus, causing diseases ranging from superficial thrush and vaginal infections in overtly healthy humans to systemic infections of immunocompromised hosts, such as patients who have undergone organ transplants or those undergoing intensive chemotherapy. In this diploid organism, mating is regulated by a mating-type-like (MTL) locus (Hull et al. 2000; Magee and Magee 2000; Tzung et al. 2001), but a perfect meiotic or haploid phase has not been identified. Reproduction is thought to occur primarily by clonal propagation (Graser et al. 1996; Lott et al. 1999; Pujol et al. 1993; Xu et al. 1999), which has led to sequence polymorphisms between homologous chromosomes (Lott et al. 1999). Because meiotic analysis is not available in this fungus, the whole genome sequence is necessary for the development of research. However, in order to attain it, we had to overcome several obstacles. Major Repeat Sequences (MRSs), which are at least 16 kbp long, are located on all chromosomes except chromosome 3 (Chibana et al. 1998; Chindamporn et al. 1998), and partial MRS sequences are distributed throughout the genome, making it impossible to join the contigs that flank them. Additionally, sequence polymorphisms were identified between homologous chromosomes (Chibana et al. 2000). For these reasons, it appeared that it would be difficult to complete the sequence of the entire C. albicans genome after whole-genome shotgun sequencing.

As part of characterizing the genome, a physical mapping project for C. albicans strain 1161 is underway at the University of Minnesota (Chibana et al. 1998). A macro-restriction map was constructed using SfiI, revealing that the chromosome number is eight and that the size of the genome is 16 Mb (Chu et al. 1993). Chromosome 7 was shown to be composed of four SfiI fragments named 7C, 7A, 7F and 7G (Chu et al. 1993). A more accurate physical map of chromosome 7 was constructed using contiguous fosmid clones and random breakage mapping (Chibana et al. 1998). In parallel, whole-genome shotgun sequencing and assembly of the C. albicans strain SC5314 genome was carried out at the Stanford Genome Technology Center. The most recent assembly of the sequences, Assembly 19, is composed of 412 super-contigs, including 146 homologous pairs (http://www-sequence.stanford.edu/group/candida/index.html). For the present work, one from each pair of the homologous contigs was arbitrarily discarded, and the remaining contigs were used to create a reference haploid genome consisting of 266 super-contigs (Jones et al. 2004). We used this haploid set of super-contigs as our starting point, and closed up the polymorphisms between the homologous pairs of the super-contigs.

Although the strain used for the mapping project is different from the strain used for the sequence project, major differences between these strains at the chromosome level have not been identified. In this work we demonstrate how the physical maps help with completion of a chromosomal sequence, using the completion of the whole sequence of chromosome 7 as an example. In another aspect of this work, we examine whether there are obvious syntenic regions between the Saccharomyces cerevisiae and C. albicans genomes. The degree of synteny we identified did not provide gene linkages useful for gap closing. This work is a pilot study for completion of the whole genomic sequence of C. albicans.

MATERIALS AND METHODS

DNA amplification for gap closing: PCR with each primer pair listed in the supplementary data was carried out with Ready-To-Go PCR beads (Amersham Biosciences) using genomic DNA of C. albicans SC5314 as the template. PCR was carried out using a hot start of 3 min at 94 °C followed by 35 cycles of 94 °C for 10 sec, 50 °C for 10 sec, and 68 °C for 1 min, and concluding with 68 °C for 10 min. Long PCR was carried out with the LA PCR kit ver. 2.1 (TAKARA, Japan). The conditions used were a hot start of 3 min at 94 °C followed by 35 cycles of 98 °C for 10 sec and 68 °C for 20 min, concluding with a final extension at 72 °C for 10 min. Genomic DNA from C. albicans strain SC5314 (Fonzi and Irwin 1993) was used for all sequence analysis in this work.

Determination of DNA sequence: DNA products amplified by long PCR were randomly fragmented into 1 to 1.5 kb fragments with HydroShear (Genomachines). The fragmented DNA was blunted and ligated with pNot I linker (TAKARA, Japan), using a DNA Blunting Kit (TAKARA, Japan). The ligated product was digested with Not I and ligated with pBlueScriptSK+. Sequences were determined with the following primers:

pBlF; GAGCGGATAACAATTTCACACAGGAAACAG,

pBlRN; CCCTCGAGGTCGACGGTATC.

DNA was labeled with BigDye terminator cycle sequencing kits ver. 1.1 (ABI), and the sequence was read using an ABI3100 (ABI) sequencer.

Test analyses for gene prediction: All the genome sequences and coding sequence data for S. cerevisiae currently available from EMBL were used to evaluate the gene-finding programs Glimmer2.10 and Critica version 1.05. We selected the set of ORFs that had the start codon ATG, a termination codon of TAA, TAG, or TGA, and a length of 303 bases or greater (100 aa or greater), in any of the six reading frames.
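The ORF selection criteria above can be illustrated with a minimal sketch. This is not the procedure actually used for the analysis; the function names are hypothetical, and two points are assumptions: the 303-base minimum is taken to include the stop codon, and each ORF is taken to begin at the first ATG that follows the previous in-frame stop.

```python
# Illustrative six-frame ORF scan; all names here are hypothetical.
STOPS = {"TAA", "TAG", "TGA"}
MIN_LEN = 303  # bases; assumed to include the stop codon (100 aa + stop)

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq, min_len=MIN_LEN):
    """Yield (start, end) of ATG..stop ORFs on one strand, 0-based, end exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                     # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield start, i + 3
                start = None

def find_orfs(seq, min_len=MIN_LEN):
    """Scan all six reading frames; coordinates refer to the forward strand."""
    seq = seq.upper()
    n = len(seq)
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_len)]
    hits += [("-", n - e, n - s)
             for s, e in orfs_in_strand(reverse_complement(seq), min_len)]
    return hits
```

Under these assumptions, an ORF of 100 sense codons plus a stop codon is the shortest one retained; anything shorter is dropped.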

Glimmer2.10 (Delcher et al. 1999) can be trained on a known set of ORFs to set parameters for a given species, which allows better predictions for related species. For training purposes, we used the S. cerevisiae dataset of ORFs of 303 or greater bases (100 aa or greater). For the Critica version 1.05 (http://www-structure.llnl.gov/cvs/Brian/critica.html) analysis, the microbial and fungal sequence data of RefSeq release 1, excluding the sequence data of S. cerevisiae, were used as reference sequences for BLASTN. A domain prediction was judged to be correct when the termination position predicted by the gene domain prediction tools corresponded to the termination position of the S. cerevisiae gene domain as registered in EMBL.
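The correctness criterion (agreement of the termination position) can be sketched as a simple set comparison. The function name and the choice of key, a (chromosome, strand, stop position) tuple, are assumptions made for illustration; specificity is computed here as correct predictions divided by all predictions, which is consistent with the values reported in Table 1.

```python
def score_predictions(predicted, annotated):
    """Count predictions whose stop position matches an annotated gene.

    Both arguments are iterables of hypothetical (chromosome, strand, stop)
    tuples. Returns (number correct, specificity as a fraction).
    """
    predicted = set(predicted)
    annotated = set(annotated)
    correct = predicted & annotated
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return len(correct), specificity
```

Matching on the stop codon alone tolerates disagreement about which ATG is the true start, which is the harder call for ab initio gene finders.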

Syntenic analysis: BLASTX on NCBI with default values was used to compare the whole sequence of chromosome 7 against the S. cerevisiae genome. Sequences of open reading frames and of intergenic spaces were both surveyed. A score of 50 or more was taken to indicate that the corresponding sequence was an orthologous gene. If at least two such genes located within 20 kb of each other on C. albicans chromosome 7 were also located within 20 kb of each other in the genome of S. cerevisiae, then that region was defined as a synteny block.
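The synteny block definition can be sketched as follows. The input format and all names are hypothetical, and two details are assumptions not stated in the text: the two S. cerevisiae genes must lie on the same chromosome, and pairwise links are merged transitively into blocks (single linkage).

```python
from itertools import combinations

MAX_DIST = 20_000  # bp, the 20-kb window from the text
MIN_SCORE = 50     # BLASTX score threshold from the text

def synteny_blocks(hits):
    """Group ortholog hits into synteny blocks.

    hits: list of dicts with hypothetical keys name, ca_pos, sc_chrom,
    sc_pos and score. Returns a list of blocks, each a list of gene names.
    """
    hits = [h for h in hits if h["score"] >= MIN_SCORE]
    parent = list(range(len(hits)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(hits)), 2):
        a, b = hits[i], hits[j]
        if (abs(a["ca_pos"] - b["ca_pos"]) <= MAX_DIST
                and a["sc_chrom"] == b["sc_chrom"]
                and abs(a["sc_pos"] - b["sc_pos"]) <= MAX_DIST):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(hits)):
        groups.setdefault(find(i), []).append(hits[i]["name"])
    return [g for g in groups.values() if len(g) >= 2]
```

Note that this definition ignores gene order and orientation within a block, so a block can still contain inversions and indels.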

RESULTS

Identification and mapping of super-contigs on chromosome 7: In previous work, 39 DNA probes were sequenced and mapped to chromosome 7, and the chromosomal location of 11 of these probes was determined accurately using random breakage mapping (Chibana et al. 1998). The sequences of these probes were used to perform BLASTN searches against the Stanford Candida genome web site (http://www-sequence.stanford.edu:8080/bncontigs19super.html) and 15 contigs were assigned to chromosome 7 (Fig. 1). Of the super-contigs, 19-10248 and 19-20248, 19-10187 and 19-20187, 19-10219 and 19-20219, 19-10110 and 19-20110, 19-10253 and 19-20253 were identified as homologous pairs, and 19-10262, 19-2485, 19-2305, 19-2175 and 19-2506 were identified as haploid contigs. To fill the gaps between the super-contigs, a haploid set of the super-contigs (Jones et al. 2004), comprising 19-10248, 19-10187, 19-10219, 19-10110, and 19-10253 was derived from the diploid set of super-contigs. Recognition sites for SfiI were identified on 19-10262, 19-2485, 19-10187 and 19-10219. The super-contigs were mapped precisely on chromosome 7 using the SfiI map (Chu et al. 1993) and an accurate physical map derived from fosmid contig and random breakage maps for chromosome 7 (Chibana et al. 1998).

Conserved orientation of SfiI fragment 7F: Two MRSs exist on chromosome 7, one between 7A and 7F, and another between 7F and 7G. Since the sequences of MRSs are highly conserved across the chromosomes, the process of assembling the contigs across the MRSs would be error-prone. Indeed, there are inconsistencies in the sequences across the MRSs among the super-contigs. The orientation of fragment 7F was determined for strain 1161 (Chibana et al. 2000; Chibana et al. 1998). However, since 7F is flanked by two inverted MRSs, it is possible that the entire 7F region could be inverted in SC5314 on one or both homologues due to homologous recombination between the MRSs. To resolve the inconsistencies and confirm the orientation of 7F on chromosome 7, long PCR amplification was performed with two pairs of PCR primers based on unique sequences flanking each of the MRSs. The sequences of the PCR products, which were about 16 kb long, were determined. When the partners of the primer pairs were swapped, no PCR band appeared. Thus, the results indicate that the orientation of 7F in strain SC5314 is the same as in strain 1161 and in the other strains for which the orientation has been determined (Chibana et al. 2000; Chibana et al. 1998).

PCR amplification and sequencing to fill the gaps between the super-contigs and to elucidate ambiguous sequences in the super-contigs: To obtain the sequence of the gaps between the super-contigs, appropriate PCR primers were designed near the ends of each super-contig. PCR was carried out using those primers, and the sequence of the PCR products was determined. Since super-contigs 19-10248, 19-2305 and 19-10187 overlapped each other, they were assumed to contain assembly errors. About 10 kb extending from the end of 19-10248 to the middle of 19-10187 was amplified by PCR and the sequence determined. Another assembly error, caused by repetitive sequences located at three points, was found on 19-10253. The assembly errors were corrected, and the gaps between 19-10110 and 19-10253 and between 19-10253 and 19-2335 were closed. Although super-contig 19-2335 was not identified with the BLASTN search, it was assigned between 19-10253 and 19-2506 because it shares homologous sequences with 19-10253.

Undetermined sequences, which were depicted in the Stanford Assembly 19 using the letter n, ranged in length from a single nucleotide to 360 nucleotides, and had a total length of 659 bp distributed across seven regions on the super-contigs. The unread regions were recovered from the GenBank database, once the sequences had been integrated with Assembly 19 (Jones et al. 2004). These regions were amplified with flanking primers, and their sequences were determined in this work.
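Locating the undetermined stretches in a super-contig is a simple pattern search. The following sketch, with a hypothetical function name, reports each run of the letter n so that flanking primers could be placed around it; it is an illustration, not the tool used in this work.

```python
import re

def undetermined_runs(seq):
    """Return (start, length) for every run of n/N in seq (0-based start)."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[nN]+", seq)]

def total_undetermined(seq):
    """Total number of undetermined bases in seq."""
    return sum(length for _, length in undetermined_runs(seq))
```

Applied to the chromosome 7 super-contigs, such a scan would be expected to report seven runs totaling 659 bp, as stated above.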

Gene prediction of chromosome 7 in C. albicans: The gene-finding tools Glimmer 2.10 (Salzberg et al. 1998) and Critica 1.05 work with high reliability for finding genes within sequences, provided the sequences do not include introns. Thus, these tools have been used for sequence analyses of prokaryotic genomes (Aggarwal and Ramaswamy 2002; McHardy et al. 2004). Unlike in some other fungal species, only a small fraction of genes in the genome of C. albicans carry introns. The Candida intron structure is generally similar to that of S. cerevisiae (Jones et al. 2004). For these reasons, Glimmer2 and Critica were employed to begin gene finding. Before these programs can be applied to gene finding in fungal genomes, their reliability has to be evaluated. The genomic sequence of S. cerevisiae was therefore surveyed using these tools. In this test evaluation, ORFs were classified into three groups according to the robustness of the results. Class 1 ORFs were identified by both Glimmer2 and Critica, and Class 2 ORFs were identified by only one of the programs (Table 1). The third group contained ORFs that were predicted by neither program.
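The three-way grouping used in the test evaluation amounts to set operations on the two programs' outputs. A minimal sketch, assuming each ORF carries a hashable identifier (the function name and identifiers are hypothetical):

```python
def classify_orfs(all_orfs, glimmer_hits, critica_hits):
    """Split ORFs by agreement between two gene finders.

    Returns (class1, class2, neither): ORFs predicted by both programs,
    by exactly one program, and by neither program.
    """
    all_orfs = set(all_orfs)
    g = set(glimmer_hits) & all_orfs
    c = set(critica_hits) & all_orfs
    class1 = g & c
    class2 = g ^ c            # symmetric difference: exactly one program
    neither = all_orfs - (g | c)
    return class1, class2, neither
```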

On chromosome 7, 516 open reading frames encoding more than 100 amino acids were identified. According to our method of ORF classification, 373 Class 1 and 18 Class 2 ORFs were identified. For these ORFs, the specificity is higher than 95%. The remaining 125 ORFs were not suggested by either program. Out of this remainder, 107 ORFs were not counted as coding sequences, since the complementary strand encoded an overlapping gene predicted with higher probability. It was not possible to confirm that the other 18 ORFs encoded polypeptides. BLASTX analysis showed that two ORFs shorter than 100 amino acids had a score greater than 50 against S. cerevisiae or S. pombe. A total of 20 ORFs were classified as Class 3 genes. There were 16 sites at which a coding region was possibly divided by an intron or by a sequence error on the super-contigs. For these sites, sequence determination and intron identification were performed, and one intron was suggested in each of 9 CDSs, and two introns in each of two other CDSs. In the remaining 5 sites, sequencing errors were identified and corrected. A gene encoding a Leu-tRNA was also identified. A total of 407 CDSs were predicted as genes (Table 2). The gene maps, to which BLAST and Pfam information are attached, are available at http://www.pf.chiba-u.ac.jp/HCCA/chr7/pageAll_new.html (supplementary data).

Synteny between C. albicans chromosome 7 and the genome of S. cerevisiae: In previous analyses of synteny between C. albicans chromosome 7 and the S. cerevisiae genome, few syntenic regions were identified because of the low resolution of the sequence information for the chromosome. The gene map was based on the fosmid contig map, which is composed of a tiling set of fosmid clones. The average length of the fosmid DNA insert was 40 kbp, and only 39 probes were mapped with their sequence information (Chibana et al. 1998). The continuous DNA sequence covering chromosome 7 allowed us to perform syntenic analysis at a higher resolution. The gene arrangement of the 282 C. albicans ORFs identified by sequence similarity to the genome of S. cerevisiae was compared in the two fungi. A group of ORFs with close linkage in both fungi was called a synteny block. A total of 32 synteny blocks were identified between chromosome 7 of C. albicans and the genome of S. cerevisiae. The number of ORFs in a synteny block ranged from 2 to 8, with an average of 2.68 ORFs per block. A synteny block composed of eight ORFs is the largest area of synteny between chromosome 7 and the S. cerevisiae genome. However, the order and direction of the ORFs were jumbled, with at least three indels and four inversions being found in this region (Fig. 3).

DISCUSSION

In previous work, probes G2E10 and R2B9 were assigned to the ends of chromosome 7 (Chibana et al. 1998). G2E10 was placed 35 kbp from one end of chromosome 7. The sequence of G2E10 was found 29 kb from the end of contig 19-10262. This indicates that the end of contig 19-10262 is 6 kb from the end of chromosome 7. Using similar reasoning, we suggest that the end of contig 19-2506 reaches to within 20 kb of the other end of chromosome 7. The total length of the continuous sequence composed of contigs and filled gaps comes to 950 kb. The remaining sequences, including telomeres and subtelomeres, amount to 26 kb. The telomere sequence in C. albicans is composed of 23-bp repeated sequences (Sadhu et al. 1991). Long PCR was carried out using primers based on the sequences of the telomere and the super-contigs to amplify the telomeres and subtelomeres, but the expected product was not detected (data not shown). The same problem is likely to arise in gap closing for the other chromosomes. Other approaches, e.g. cloning into a cosmid vector, are needed to determine the sequence of these regions.
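The estimate of the unsequenced chromosome ends follows from simple subtraction, written out here for clarity. The variable names are ours; the 20-kb figure is the upper bound stated in the text for the 19-2506 end.

```python
# Distances in kb, taken from the paragraph above.
g2e10_to_chromosome_end = 35  # from the physical map (Chibana et al. 1998)
g2e10_to_contig_end = 29      # position of G2E10 relative to the end of 19-10262
other_end_missing = 20        # stated bound for the end beyond contig 19-2506

missing_at_g2e10_end = g2e10_to_chromosome_end - g2e10_to_contig_end
total_missing = missing_at_g2e10_end + other_end_missing
print(missing_at_g2e10_end, total_missing)
```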

The super-contigs released by Jones et al. (Jones et al. 2004) were assigned to chromosome 7, and the gap closing and sequence correction of chromosome 7 in C. albicans was carried out in this work. The length of the continuous sequence is 949,626 bp and covers the entire chromosome 7 except for telomeric and subtelomeric regions. The total length of the missing sequences was only 289 bp in the gaps, and 659 bp in continuous sequence of the super-contigs. Thus, the coverage of the super-contigs in Assembly 19 was 99.9% of the continuous sequences of chromosome 7.
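The coverage figure can be checked directly from the numbers given. The sketch assumes that coverage is the fraction of the 949,626-bp continuous sequence that was already present in the Assembly 19 super-contigs, that is, excluding the gap bases and the undetermined (n) bases.

```python
continuous_bp = 949_626  # finished continuous sequence of chromosome 7
missing_in_gaps_bp = 289 # bases lying in gaps between super-contigs
missing_as_n_bp = 659    # undetermined bases within super-contigs

covered_bp = continuous_bp - missing_in_gaps_bp - missing_as_n_bp
coverage_pct = 100 * covered_bp / continuous_bp
print(f"{coverage_pct:.1f}%")
```

In other words, fewer than 1 kb of the finished sequence was absent from Assembly 19; the main contribution of finishing was ordering, joining and correcting the super-contigs.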

The differences in length between the homologous pairs of the super-contigs of chromosome 7 are less than 1 kb, except between 19-10248 and 19-20248. A region causing the difference was identified, and is depicted in Figure 4. The difference in length is 7,249 bp, due to a section in 19-10248 which contains three defective ORFs flanked by long terminal repeats (LTRs) of the class described by Goodwin et al. as LTR η (Goodwin and Poulter 2000). Because there were no LTR sequences in the corresponding region of 19-20248, this region has been inserted on only one homologue. The three ORFs included sequences homologous to pol-like and gag-like elements and to STA1, respectively. STA1 is associated with initiation of the developmental programs of pseudohyphal formation and the invasive growth response in S. cerevisiae (Vivier et al. 1997). Although the similarity between the ORF on chromosome 7 and STA1 in S. cerevisiae is not high, this retrotransposon-like element might contribute to cell morphology polymorphism and pathogenesis in C. albicans. We suggest that this region should be investigated further.

The largest synteny block includes LEU2 and NFS1 (Fig. 3). Interestingly, that region has received attention from other groups. Close linkage of these genes has been reported to be conserved in Ashbya gossypii, C. albicans, C. maltosa, C. rugosa, Pichia anomala, S. cerevisiae, S. servazzii, Yamadazyma ohmeri, and Zygosaccharomyces rouxii (De la Rosa et al. 2001; Keogh et al. 1998). We surveyed the linkage of both genes in other fungi. The linkage of the genes was conserved neither in S. pombe (Wood et al. 2002) (http://www.sanger.ac.uk/Projects/S_pombe/) nor in N. crassa (Galagan et al. 2003) (http://www-genome.wi.mit.edu/annotation/fungi/neurospora/index.html). The linkage of these genes is thus likely conserved only in related Ascomycetes species. In this work, it was revealed that the conserved region extends beyond LEU2 and NFS1, and comprises the longest syntenic block shared between chromosome 7 of C. albicans and the S. cerevisiae genome. The reason for the high conservation of this region is unknown, although genes located within the region are likely related. YCL002, which is at the right end of the map (Fig. 3), is only 3 kb from CEN3 in S. cerevisiae. It turns out that the centromere of chromosome 7 lies in a completely different region (unpublished data). This implies that a centromere existed near this domain in the common ancestor of the fungi of the Saccharomycetales group, including C. albicans.

It is clear that the degree of synteny between C. albicans and S. cerevisiae is too weak for linkage information from the S. cerevisiae genome to be useful for completing the genomic sequence of C. albicans.

This work was supported by Grant-in-Aid for Scientific Research for Priority Areas ‘Infection and Host Response’ and for Priority Areas ‘Genome Biology’, and for ‘Frontier Studies in Pathogenic Fungi and Actinomycetes’ through the Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology of Japan, and by ‘The Sumitomo Foundation’ and the ‘Hokuto Foundation for Bioscience’. The sequences of all the contigs were provided by the Stanford Genome Technology Center at http://www-sequence.stanford.edu/group/candida. The sequencing of C. albicans at the Stanford Genome Technology Center was accomplished with the support of the NIDR and the Burroughs Wellcome Fund. The gene finding analyses using Glimmer2 and Critica were carried out by Kouzuki et al. (Xanagen, Japan).


LITERATURE CITED

Aggarwal, G., and R. Ramaswamy, 2002 Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER. J Biosci 27: 7-14.

Chibana, H., J. L. Beckerman and P. T. Magee, 2000 Fine-resolution physical mapping of genomic diversity in Candida albicans. Genome Res 10: 1865-1877.

Chibana, H., B. B. Magee, S. Grindle, Y. Ran, S. Scherer et al., 1998 A physical map of chromosome 7 of Candida albicans. Genetics 149: 1739-1752.

Chindamporn, A., Y. Nakagawa, I. Mizuguchi, H. Chibana, M. Doi et al., 1998 Repetitive sequences (RPSs) in the chromosomes of Candida albicans are sandwiched between two novel stretches, HOK and RB2, common to each chromosome. Microbiology 144 (Pt 4): 849-857.

Chu, W. S., B. B. Magee and P. T. Magee, 1993 Construction of an SfiI macrorestriction map of the Candida albicans genome. J Bacteriol 175: 6637-6651.

De la Rosa, J. M., J. A. Perez, F. Gutierrez, J. M. Gonzalez, T. Ruiz et al., 2001 Cloning and sequence analysis of the LEU2 homologue gene from Pichia anomala. Yeast 18: 1441-1448.

Delcher, A. L., D. Harmon, S. Kasif, O. White and S. L. Salzberg, 1999 Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27: 4636-4641.

Fonzi, W. A., and M. Y. Irwin, 1993 Isogenic strain construction and gene mapping in Candida albicans. Genetics 134: 717-728.

Galagan, J. E., S. E. Calvo, K. A. Borkovich, E. U. Selker, N. D. Read et al., 2003 The genome sequence of the filamentous fungus Neurospora crassa. Nature 422: 859-868.

Goodwin, T. J., and R. T. Poulter, 2000 Multiple LTR-retrotransposon families in the asexual yeast Candida albicans. Genome Res 10: 174-191.

Graser, Y., M. Volovsek, J. Arrington, G. Schonian, W. Presber et al., 1996 Molecular markers reveal that population structure of the human pathogen Candida albicans exhibits both clonality and recombination. Proc Natl Acad Sci U S A 93: 12473-12477.

Hull, C. M., R. M. Raisner and A. D. Johnson, 2000 Evidence for mating of the "asexual" yeast Candida albicans in a mammalian host. Science 289: 307-310.

Jones, T., N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman et al., 2004 The diploid genome sequence of Candida albicans. Proc Natl Acad Sci U S A.

Keogh, R. S., C. Seoighe and K. H. Wolfe, 1998 Evolution of gene order and chromosome number in Saccharomyces, Kluyveromyces and related fungi. Yeast 14: 443-457.

Lott, T. J., B. P. Holloway, D. A. Logan, R. Fundyga and J. Arnold, 1999 Towards understanding the evolution of the human commensal yeast Candida albicans. Microbiology 145 (Pt 5): 1137-1143.

Magee, B. B., and P. T. Magee, 2000 Induction of mating in Candida albicans by construction of MTLa and MTLalpha strains. Science 289: 310-313.

McHardy, A. C., A. Goesmann, A. Puhler and F. Meyer, 2004 Development of joint application strategies for two microbial gene finders. Bioinformatics.

Pujol, C., J. Reynes, F. Renaud, M. Raymond, M. Tibayrenc et al., 1993 The yeast Candida albicans has a clonal mode of reproduction in a population of infected human immunodeficiency virus-positive patients. Proc Natl Acad Sci U S A 90: 9456-9459.

Sadhu, C., M. J. McEachern, E. P. Rustchenko-Bulgac, J. Schmid, D. R. Soll et al., 1991 Telomeric and dispersed repeat sequences in Candida yeasts and their use in strain identification. J Bacteriol 173: 842-850.

Salzberg, S. L., A. L. Delcher, S. Kasif and O. White, 1998 Microbial gene identification using interpolated Markov models. Nucleic Acids Res 26: 544-548.

Tait, E., M. C. Simon, S. King, A. J. Brown, N. A. Gow et al., 1997 A Candida albicans genome project: cosmid contigs, physical mapping, and gene isolation. Fungal Genet Biol 21: 308-314.

Tzung, K. W., R. M. Williams, S. Scherer, N. Federspiel, T. Jones et al., 2001 Genomic evidence for a complete sexual cycle in Candida albicans. Proc Natl Acad Sci U S A 98: 3249-3253.

Vivier, M. A., M. G. Lambrechts and I. S. Pretorius, 1997 Coregulation of starch degradation and dimorphism in the yeast Saccharomyces cerevisiae. Crit Rev Biochem Mol Biol 32: 405-435.

Wood, V., R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne et al., 2002 The genome sequence of Schizosaccharomyces pombe. Nature 415: 871-880.

Xu, J., T. G. Mitchell and R. Vilgalys, 1999 PCR-restriction fragment length polymorphism (RFLP) analyses reveal both extensive clonality and local genetic differences in Candida albicans. Mol Ecol 8: 59-73.


Table 1 Evaluation of the gene prediction programs using genomic data of Saccharomyces cerevisiae

                                      ORF greater      Class 1:              Class 2: (Glimmer2 or Critica)    Predicted by neither
                                      than 100 aa      Glimmer2 & Critica    but not (Glimmer2 & Critica)      Glimmer2 nor Critica

Number of predictions (Sensitivity)   7,467 (100%)     4,702 (77.1%)         759 (12.4%)                       2,006 (10.5%)

Correct predictions (Specificity)     5,831 (78.1%)    4,498 (95.7%)         721 (95.0%)                       612 (30.5%)
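The percentages in Table 1 can be reproduced from the counts. The following check assumes that specificity is the number of correct predictions divided by the number of predictions in each column, and that the sensitivity percentages are each class's share of the 5,831 correct predictions; both readings are inferred from the numbers, not stated in the table.

```python
predictions = {"all": 7467, "class1": 4702, "class2": 759, "neither": 2006}
correct = {"all": 5831, "class1": 4498, "class2": 721, "neither": 612}

# Specificity: correct predictions / all predictions, per column (percent).
specificity = {k: round(100 * correct[k] / predictions[k], 1)
               for k in predictions}

# Sensitivity as tabulated: share of all correct predictions, per class.
sensitivity = {k: round(100 * correct[k] / correct["all"], 1)
               for k in ("class1", "class2", "neither")}

print(specificity)
print(sensitivity)
```

Under this reading, ORFs supported by at least one program are correct about 95% of the time, while ORFs supported by neither program are correct in fewer than a third of cases.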

Table 2 Gene prediction on chromosome 7

                         ORF greater                            Class 3         Class 3
                         than 100 aa    Class 1    Class 2      ORF >100 aa     ORF <100 aa*    tRNA

Number of predictions    516            373        18           18              2               1

*) BLASTX analysis showed that two ORFs shorter than 100 amino acids had a score greater than 50.


FIGURE LEGENDS

Figure 1 Overview of gap closing

The physical map previously published by Chibana et al. (1998) is shown as the gray bar. Probes assigned to chromosome 7 are located on the map. Open boxes on the gray bar mark the sites of probes that were mapped by random breakage mapping. SfiI sites are shown as vertical white bars on the gray bar. The location of ODP2 was corrected by the authors after publication. The two arrows on the gray bar represent the MRSs (Major Repeat Sequences). Although the MRSs are represented with a single RPS in this map, the RPS is a tandemly repeated sequence in some other MRSs. The total length of chromosome 7 is 995 kbp in this map. Super-contigs assigned to chromosome 7 are shown as closed arrows headed by a closed bold bar. Probes with known DNA sequence that were mapped on the super-contigs are shown as open boxes. R2B9, SfiI, RBP1, ARS3, ARG4 and G2E10 were used as landmarks to assign the super-contigs to the physical map. In order to adjust the position of the landmarks, super-contig 19-10262 was split with a spacer. A sequence deletion was identified on super-contig 19-20248 and is indicated with a spacer. Gaps between the super-contigs and regions including ambiguous sequences were amplified by PCR and are indicated by arrows. Ca35A5 represents the whole sequence of the cosmid insert DNA # AL033396. The insert DNA of the cosmid was isolated from strain 1161 and the sequence was determined using the methods described by Tait et al. (Tait et al. 1997).

Figure 2 Gene maps for C. albicans chromosome 7.

Class 1 and Class 2 ORFs are shown as rectangles with heavy and light outlines, respectively. Class 3 ORFs are shown as rectangles with no outline. Each ORF is colored according to its best-hit species: red, blue, green, and yellow for S. cerevisiae, S. pombe, N. crassa and H. sapiens, respectively. Closed arrowheads pointing to coding sequences indicate the positions of introns. CDSs that were reorganized in this work are marked with asterisks. ORF numbers from Assembly 19 are given in the C. albicans column. Results of BLASTX, Pfam, and more information are available at http://www.pf.chiba-u.ac.jp/HCCA/chr7/pageAll_new.html

Figure 3 The longest syntenic region between the genome of S. cerevisiae and chromosome 7 of C. albicans.

The region between DCC1 and PGS1 includes 11 ORFs encoding polypeptides of more than 100 aa. Of these, 8 ORFs were identified as sharing amino acid sequence similarity with genes in S. cerevisiae, with a score higher than 50 in the BLASTX results. Although linkage of these genes is conserved in C. albicans and S. cerevisiae, at least four inversions and three indels were identified.

Figure 4 Sequence indel between 19-10248 and 19-20248

The length of this indel is 7,249 bp. LTR η (Goodwin and Poulter 2000) flanks this region. The best hits of the three ORFs are: YLT1: hypothetical YLT1 (TrEMBL# Q9COUO), suggested to encode a fungal retroelement pol polyprotein in Yarrowia lipolytica (1e-44); YIL8: putative protein (TrEMBL# G94LP8), suggested to encode a Gag or capsid-like protein from LTR retrotransposons in Oryza sativa (1e-25); and STA1: putative protein (TrEMBL# PO8640), suggested to encode Glucoamylase S1/S2 precursor (EC 3.2.1.3) (Glucan 1,4-alpha-glucosidase) (1,4-alpha-D-glucan ) in S. cerevisiae (9e-06).

[Graphic for Figure 1: map of chromosome 7 on a scale from 0 kbp to 1,000 kbp, drawn in two rows (0 to 500 kbp and 500 to 1,000 kbp). Recoverable labels: numbered regions (1) to (16); contigs 19-2485, 19-20248, 19-10248, 19-10262, 19-2305, 19-2175, 19-20187, 19-10187, 19-20219, 19-10219, 19-20110, 19-10110, 19-20253', 19-10253', 19-2335 and 19-2506; Ca35A5; MRS; and segment labels 7C, 7A, 7F and 7G. Marker names are not recoverable from the extracted text.]

Figure 1

[Graphic for Figure 2-1: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 0 to 160 kb. The row alignment of the individual entries is not recoverable from the extracted text.]

Figure 2-1

[Graphic for Figure 2-2: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 160 to 320 kb. An MRS label appears in this panel. The row alignment of the individual entries is not recoverable from the extracted text.]

Figure 2-2

[Graphic for Figure 2-3: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 320 to 480 kb. The row alignment of the individual entries is not recoverable from the extracted text.]

Figure 2-3

[Graphic for Figure 2-4: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 480 to 640 kb. An MRS label appears in this panel. The row alignment of the individual entries is not recoverable from the extracted text.]

Figure 2-4

[Graphic for Figure 2-5: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 640 to 800 kb. The row alignment of the individual entries is not recoverable from the extracted text.]

Figure 2-5

[Graphic for Figure 2-6: columns H.sap, N.cra, S.pom, S.cer and C.alb listing protein and ORF identifiers against a position axis in kb, covering 800 to 960 kb. A Leu-tRNA label appears in this panel. The row alignment of the individual entries is not recoverable from the extracted text.]

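Figure 2-6

[Graphic for Figure 3: gene labels on the two chromosomes, listed in the order in which they were extracted.]

C. albicans chromosome 7: YCL002, DCC1, NFS1, LEU2, BUD3, GBP2, SGF29, PGS1

S. cerevisiae chromosome 3: LEU2, NFS1, DCC1, BUD3, GBP2, SGF29, PGS1, YCL002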

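Figure 3

[Graphic for Figure 4: ORF labels on the two contigs, listed in the order in which they were extracted. Two LTRη labels appear between the two rows.]

Contig 19-20248: CPR1, AHP1, YJR72

Contig 19-10248: CPR1, AHP1, YLT1, YIL8, STA1, YJR72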

Figure 4