Abstract of thesis entitled

Understanding the Pathogenic Fungus Penicillium marneffei: A Computational Perspective

by James J. Cai

for the degree of Doctor of Philosophy at The University of Hong Kong in May 2006

Penicillium marneffei, a thermally dimorphic fungus that alternates between a filamentous and a yeast growth form in response to changes in its environmental temperature, has become an emerging fungal pathogen endemic in Southeast Asia. Defining the genomics of P. marneffei will provide a better understanding of the fungus.

This thesis reports the draft sequence of the P. marneffei genome assembled from 6.6× coverage of the genome obtained through whole-genome shotgun sequencing. The 31 Mb genome obtained from the assembly contains 10,060 protein-coding genes. The complete mitochondrial genome is 35 kb long, and its gene content and gene order are very similar to those of Aspergillus. An annotation system and a P. marneffei genome database (PMGD) were developed to allow a preliminary annotation of the sequences and to provide an intuitive graphical interface that gives curators and users ready access to the annotation and the underlying evidence. A Matlab-based software package, MBEToolbox, was also developed for data analysis in phylogenetics and comparative genomics. A well-designed and structured annotation system and powerful sequence analysis software are essential requirements for the success of large-scale genome analysis projects.

Analysis of the gene set of P. marneffei provided insights into the adaptations required by a fungus to cause disease. The genome encodes a diverse set of putative virulence genes, such as those encoding proteinase, phospholipase, metacaspase and agglutinin, which may enable the fungus to adhere to, colonise and invade the host, adapt to the tissue environment, and avoid the host’s humoral and cellular defences of the innate and adaptive immune responses. A gene cluster involved in the biosynthesis of melanin, a known virulence factor in some other pathogenic fungi, was also identified in the genome, indicating that P. marneffei may produce melanin or melanin-like immunosuppressive compounds that protect the fungus against immune effector cells. More interestingly, the P. marneffei genome contains more intragenic tandem repeats (IntraTRs) than those of other fungi. These IntraTRs, which encode repeat domains or motifs, may create quantitative variation in surface proteins, allowing the fungus to ‘disguise’ itself and slip past the vigilant defences of the host immune system. The genome sequence of P. marneffei also revealed a number of genes associated with mating processes and sexual development, suggesting an unidentified sexual cycle in the fungus.

The extent and evolutionary patterns of duplicate genes in P. marneffei and other ascomycetes were compared. All ascomycetes show a certain degree of redundancy (though its extent can vary considerably), which may provide the foundation for the specialisation of fungal genes and form the basis for fungal diversification. An inverse relationship between the lineage specificity of a gene and its evolutionary rate was also discovered, implying that an accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes. The genome sequence of P. marneffei has provided our first glimpse into the genomic basis of the physiology of this dimorphic filamentous fungus.

Understanding the Pathogenic Fungus Penicillium marneffei: A Computational Genomics Perspective

BY

James J. Cai

M.D., Henan Medical University, 1996

M.S., University of New South Wales, 2001

THESIS

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at The University of Hong Kong

May 2006

To Yan

“Any living cell carries with it the experiences of a billion years of experimentation by its ancestors.”

Max Delbrück (1949)

DECLARATION

I declare that this thesis represents my own work, except where due acknowledgement is made, and that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.

Signature:

Date:

ACKNOWLEDGEMENTS

First of all, special thanks go to my principal supervisor, Professor Kwok-yung Yuen, for his enthusiasm and support during the course of my study. My heartfelt thanks to Dr. David K. Smith and Dr. Xuhua Xia, who introduced me to the fascinating world of and molecular evolution. Thanks to my friends and colleagues for their moral support and technical assistance over the past four years, especially Dr. Patrick Woo, Dr. Sussana Lau, and Jade, Huang Yi, Ken, Haw, Candy, Rachel ... I am also grateful to my external mentor Dr. Gavin Huttley and fellow colleagues Peter, Ray, Helen and Brett at the Australian National University. Finally, I am very grateful to my wife and my parents. Without their support, this work would not have been possible.

TABLE OF CONTENTS

Declaration

Acknowledgements

List of Figures

List of Tables

Abbreviations

Glossary

Introduction

Chapter 1: The draft genome sequence of Penicillium marneffei
    1.1 Introduction
    1.2 Literature Review
        1.2.1 General fungal biology
        1.2.2 P. marneffei, as an important fungal pathogen
        1.2.3 Penicilliosis marneffei
        1.2.4 Fungal genome projects
    1.3 Materials and Methods
        1.3.1 Strain and DNA preparation
        1.3.2 Library construction, shotgun sequencing
        1.3.3 Sequence assembly
        1.3.4 Data release
    1.4 Results
        1.4.1 Assembly and general characteristic
        1.4.2 Genome architecture and co-linearity
        1.4.3 Gene duplications (multigene families) and comparisons
        1.4.4 Interspecies proteome comparison
        1.4.5 Lineage-specific genes
        1.4.6 Cell signalling and morphogenesis
        1.4.7 Potential mating ability
        1.4.8 Putative virulence genes
        1.4.9 Cell wall antigens and biosynthetic genes
    1.5 Discussion

Chapter 2: Penicillium marneffei genome database and annotation pipeline
    2.1 Introduction
    2.2 Literature Review
        2.2.1 Methods for predicting protein function
        2.2.2 Software/database systems for protein function prediction
        2.2.3 The art of gene finding
    2.3 Implementation
        2.3.1 Annotation pipeline
        2.3.2 Assembly process
        2.3.3 Gene finding
        2.3.4 Database and databank to store results
        2.3.5 Perl source code collection
        2.3.6 Genome browser configuration
        2.3.7 Synteny identification
    2.4 Results
        2.4.1 Statistics of assembly
        2.4.2 Genome size estimation
        2.4.3 Accuracy of gene finding
        2.4.4 Combination of gene finding
        2.4.5 Database and databank to store results
    2.5 Discussion

Chapter 3: Mitochondrial genome of Penicillium marneffei
    3.1 Introduction
    3.2 Materials and Methods
        3.2.1 Library construction and sequence assembly
        3.2.2 Mitochondrial DNA sequence annotation
        3.2.3 Phylogenetic analysis
        3.2.4 Mitochondrial DNA sequences in nuclear genome
    3.3 Results and Discussion
        3.3.1 Gene content and genome organisation
        3.3.2 Protein coding genes
        3.3.3 Genetic code and codon usage
        3.3.4 tRNA genes
        3.3.5 Other RNA genes
        3.3.6 Group I introns
        3.3.7 Mitochondrial DNA sequences in nuclear genome

Chapter 4: Genomic evidence for the presence of melanin biosynthesis gene cluster in Penicillium marneffei
    4.1 Introduction
    4.2 Literature Review
        4.2.1 Potential virulence factors
        4.2.2 Genomic approaches in identification of virulence factors
    4.3 Materials and Methods
        4.3.1 Identification of melanin biosynthesis genes in P. marneffei
        4.3.2 Multiple alignments and phylogenetic analyses
    4.4 Results and Discussion
        4.4.1 Melanin gene cluster present in P. marneffei
        4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei
        4.4.3 Absence of penicillin biosynthesis genes in P. marneffei

Chapter 5: Mating abilities in Penicillium marneffei
    5.1 Introduction
    5.2 Literature Review
        5.2.1 Mating in hemiascomycete yeasts
        5.2.2 Mating in filamentous ascomycetes
    5.3 Materials and Methods
    5.4 Results and Discussion
        5.4.1 Homologs of known sexual genes
        5.4.2 Mating type genes
        5.4.3 Mating pheromone precursor genes
        5.4.4 Mating pheromone processing genes
        5.4.5 Mating pheromone receptor and other GPCRs

Chapter 6: Exploring the genetic components associated with the dimorphism of Penicillium marneffei
    6.1 Introduction
    6.2 Materials and Methods
        6.2.1 Sequence similarity
        6.2.2 Phylogenetic Analysis
    6.3 Results and Discussion
        6.3.1 Perception of external stimuli by cellular sensors
        6.3.2 Transduction of biochemical signal
        6.3.3 Alteration of the genomic expression
        6.3.4 Structural reorganization towards the morphological change

Chapter 7: Intragenic tandem repeats in Penicillium marneffei and other ascomycetes
    7.1 Introduction
    7.2 Materials and Methods
        7.2.1 Identification of coding tandem repeats
        7.2.2 Sequence analysis
    7.3 Results and Discussion

Chapter 8: Extent and evolutionary pattern of duplicate genes in Penicillium marneffei and other ascomycetes
    8.1 Introduction
    8.2 Literature Review
    8.3 Materials and Methods
        8.3.1 Sequences and gene families
        8.3.2 Estimation of substitution rate
        8.3.3 Relative rate test
    8.4 Results
        8.4.1 Extent of gene duplication in ascomycetes
        8.4.2 Age distribution of duplicate genes
        8.4.3 Selective constraint between paralogs
        8.4.4 Ka/Ks between paralogs and orthologs
        8.4.5 Relative evolutionary rate between paralogs
    8.5 Discussion
        8.5.1 Gene duplication in ascomycetes is highly diverse
        8.5.2 Different selective constraints in yeasts and filamentous ascomycetes
        8.5.3 Majority of paralogous genes evolve symmetrically

Chapter 9: Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes
    9.1 Introduction
    9.2 Literature Review
    9.3 Materials and Methods
        9.3.1 Sequences and data sets
        9.3.2 Identification of orthologs
        9.3.3 Classification of genes into LS groups
        9.3.4 Divergence Times
        9.3.5 Estimation of substitution rates and statistical analyses
        9.3.6 Detection of rate variability across species - Relative Divergence Score (RDS)
    9.4 Results
        9.4.1 Evolutionary rate differences among LS groups
        9.4.2 Evolutionary rate-related factors of genes belonging to different LS groups
        9.4.3 Linear regression of divergence time and relative divergence score (RDS)
    9.5 Discussion

Chapter 10: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution
    10.1 Introduction
    10.2 Literature Review
        10.2.1 Probabilistic DNA substitution models
        10.2.2 Maximum likelihood estimation
        10.2.3 Elements of phylogenetic theory
        10.2.4 Programs used for phylogenetic analyses
    10.3 Implementation
        10.3.1 Input data and formats
        10.3.2 Sequence Manipulation and Statistics
        10.3.3 Evolutionary Distances
        10.3.4 Phylogeny Inference
        10.3.5 Combination of functions
        10.3.6 Graphics and GUI
    10.4 Results and Discussion
        10.4.1 Vectorisation simplifies programming
        10.4.2 Extensibility
        10.4.3 Comparison with other toolboxes
        10.4.4 A novel enhanced window analysis
        10.4.5 Limitations

Chapter 11: Concluding remarks

Bibliography

LIST OF FIGURES

1.1 P. marneffei mould and yeast culture
1.2 Dimorphic switching of P. marneffei
1.3 Phylogenetic tree showing the relationships of P. marneffei to other fungi
1.4 Microsyntenies containing pheromone precursor loci from four fungi
1.5 Triple proteome comparison between P. marneffei, S. cerevisiae and A. fumigatus
1.6 Putative MAPK signalling pathway in P. marneffei

2.1 Flowchart of annotation pipeline for P. marneffei genome
2.2 PMGD genome browser
2.3 Database schema of PMGD

3.1 Fungal respiratory pathways
3.2 Physical map of P. marneffei mitochondrial DNA
3.3 Comparison of gene order between mitochondrial DNAs
3.4 Phylogenetic distribution of group I and group II introns
3.5 28 tRNAs encoded in the mitochondrial genome of P. marneffei
3.6 Secondary structures of two representative group I introns

4.1 P. marneffei abr1 gene Cu-oxidase domain homologues
4.2 Melanin gene cluster in P. marneffei and A. fumigatus

5.1 Comparison of the mating-type loci in P. marneffei and other fungi
5.2 Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes
5.3 Gene organisation around the MAT locus
5.4 P. marneffei biogenesis of the a-factor pheromones

6.1 Phylogenetic tree of fungal GPCR family genes
6.2 P. marneffei genes in cAMP pathway

7.1 Amino acid composition in intragenic tandem repeats

8.1 Frequency distribution of Ks
8.2 Log-log plots of Ka vs. Ks for duplicate gene pairs

9.1 LS classification based on phylogenetic profiles of genes
9.2 Divergence of nonsynonymous substitution rate in LS groups
9.3 Dependence of log gene expression level and substitution rate
9.4 Linear regression analysis of divergence time and RDS

10.1 Relationship of GTR class DNA substitution models
10.2 Log-likelihood of evolutionary distance
10.3 MBEToolbox GUI
10.4 Comparison between sliding window and enhanced sliding window methods

LIST OF TABLES

1.1 General features of the P. marneffei genome
1.2 Comparison of genome statistics of several fungi
1.3 Putative virulence genes
1.4 Cell wall antigens and biosynthetic genes predicted in P. marneffei

2.1 Commonly used domain databases
2.2 Summary of assembly statistics

3.1 Gene content of P. marneffei mitochondrial genome
3.2 Codon usage in protein-coding genes of P. marneffei mitochondrial genome
3.3 Presence of mitochondrial DNA fragments in nuclear genomes
3.4 P. marneffei mitochondrial DNA sequences present in nuclear genome

4.1 Major dimorphic fungal pathogens
4.2 Putative gene products related to melanin biosynthesis in P. marneffei

5.1 Mating strategies adopted by ascomycetous fungi
5.2 Pheromone-processing enzymes encoded by the putative P. marneffei genes

6.1 GPCR family in P. marneffei and A. nidulans
6.2 Homologous genes related to signal transduction in filamentous growth

7.1 P. marneffei genes containing intragenic tandem repeats
7.2 Comparison of genome size and base in repeats

8.1 Distribution of multigene families in fungi
8.2 Large multigene families in fungi
8.3 Ka/Ks ratio for recently diverged paralogs
8.4 Amino-acid substitution rates versus Ka/Ks ratios in two copies of duplicate genes

9.1 Genomic sequence sources
9.2 Average Ka, Ks and Ka/Ks among LS classes
9.3 Correlation and partial correlation between LS and other factors
9.4 Regression analyses on predicted S. cerevisiae-S. mikatae orthologs

ABBREVIATIONS AND SYMBOLS

aa Amino acid

AIDS Acquired Immunodeficiency Syndrome

ADHoRe Automatic Detection of Homologous Regions

BLAST Basic Local Alignment Search Tool

BLOSUM BLOcks SUbstitution Matrix

bp Base pairs

CDS Nucleotide coding sequence

DBMS Database management system

DDC Duplication-degeneration-complementation (model)

EST Expressed Sequence Tag

FASTA Fast-All (pronounced fast-aye), a program for pairwise sequence alignment

FGI Fungal Genome Initiative

GFF ‘Gene-Finding Format’ or ‘General Feature Format’

GO Gene Ontology

GOLD Genomes OnLine Database

GPCR G Protein-Coupled Receptor

GTR General Time Reversible model

GUI Graphical User Interface

HAART Highly Active Anti-Retroviral Therapy

HMM Hidden Markov Model

HKU CC Computer Centre, University of Hong Kong

ITR Intragenic Tandem Repeat

Ka Nonsynonymous substitution rate

Ks Synonymous substitution rate

LS Lineage specificity

MAPK Mitogen-activated protein kinase

Mb Megabases

MBEToolbox Molecular biology and evolution toolbox

MCMC Markov-chain Monte Carlo

MDD Maximal dependence decomposition

MFS Major facilitator superfamily

MIPS Munich Information Center for Protein Sequences

TF Transcription Factor

TNF Tumor Necrosis Factor

MIT Massachusetts Institute of Technology

MLMT Multilocus microsatellite typing system

NCBI National Center for Biotechnology Information

RDS Relative Divergence Score

ORF Open Reading Frame

PAUP* Phylogenetic Analysis Using Parsimony, *and other methods (pronounced pop star)

PFGE Pulsed-field gel electrophoresis

PHYLIP PHYLogenetic Inference Package

PMGD P. marneffei genome database

REV General reversible process model

RIP Repeat-induced point

SAGE Serial Analysis of Gene Expression

SGD Saccharomyces Genome Database in Stanford Genomic Resources

Swiss-Prot a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.

TIGR The Institute for Genomic Research

TrEMBL a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

UML Unified Modelling Language

UCSC University of California, Santa Cruz

URF unidentified reading frame

UTR Untranslated transcriptional region

WGS Whole-genome shotgun

HMG high mobility group motif

GLOSSARY

ADDITIVE TREE: A phylogenetic tree in which the distance between any two terminal nodes is equal to the sum of the branch lengths connecting them.

BOOTSTRAP: A statistical technique using resampling with replacement.

BRANCH: The graphical representation of an evolutionary relationship in a phylogenetic tree.

CODON: A triplet of adjacent nucleotides in mRNA that either codes for an amino acid carried by a specific tRNA or specifies the termination of the translation process.

CODON USAGE: The frequency with which members of a codon family are used in protein-coding genes.

COMPLEMENTARY DNA (cDNA): DNA synthesised from an RNA template by the enzyme reverse transcriptase.

CONCERTED EVOLUTION: Maintenance of homogeneity of nucleotide sequences among members of a gene family in a species, although the nucleotide sequences change over time.

CONSENSUS SEQUENCE: A sequence that represents the most prevalent nucleotide or amino acid at each site in a number of homologous sequences.

CONSERVATIVE SUBSTITUTION: The substitution of an amino acid by another with similar chemical properties.

CONSTANT SITE OR CONSTANT REGION: A site or region within the DNA that is occupied by the same nucleotide in all homologous sequences under comparison.

CONVERGENCE: The independent evolution of similar genetic or phenotypic traits.

CONVERGENT SUBSTITUTION: The substitution of two different nucleotides by the same nucleotide at the same nucleotide site in two homologous sequences.

DETERMINISTIC PROCESS: A process, the outcome of which can be predicted exactly from knowledge of initial conditions.

DIRECTIONAL SELECTION: A selective regime that changes the frequency of an allele in a specific direction, either toward fixation or toward elimination.

DIVERGENCE: The differences between two homologous sequences due to the independent accumulation of genetic changes in each lineage.

DOMAIN: A well-defined region within a protein that can perform a specific function. May not consist of a continuous stretch of amino acids, although it almost always consists of amino acids that are adjacent to each other as far as the tertiary structure of the protein is concerned.

DUPLICATION: The presence or the creation of two copies of a DNA segment in the genome.

EUKARYOTE: An organism having a true nucleus and membranous organelles. One of the three primary lines of descent in the living world.

EXON: A DNA segment of a gene, the transcript of which appears in the mature RNA molecule.

FIXATION PROBABILITY: The probability that a particular allele will become fixed in a population.

FIXATION TIME: The time it takes for a mutant allele to become fixed in a population.

FLANKING SEQUENCE: Untranscribed sequences at the 5’ or 3’ terminal of transcribed genes.

FOURFOLD DEGENERATE SITE: A nucleotide site within a codon at which all possible substitutions are synonymous. For example, in the codon CCT, the third site is fourfold degenerate because CCT, CCC, CCA and CCG are all codons for proline.

FUNCTIONAL CONSTRAINT (SELECTIVE CONSTRAINT): The degree of intolerance characteristic of a site or a locus toward nucleotide substitutions.

GENE CONVERSION: A nonreciprocal recombination process resulting in a sequence becoming identical with another.

GENE DIVERSITY: A measure of genetic variability in a population. The mean expected heterozygosity per locus in a population.

GENE DUPLICATION: Generally, the production of two copies of a DNA sequence. Specifically, the duplication of an entire gene sequence.

GENETIC DISTANCE: Broadly, any of several measures of the degree of genetic difference between individuals, populations, or species. In reference to molecular evolution, a measure of the number of nucleotide substitutions per nucleotide site between two homologous DNA sequences that have accumulated since the divergence between the sequences.

INFERRED TREE: A phylogenetic tree based on empirical data pertaining to extant taxa.

INFORMATIVE SITE (DIAGNOSTIC POSITION): A site that is used to choose the most-parsimonious tree from among all the possible phylogenetic trees. In molecular evolution, a site where there are at least two different kinds of nucleotides or amino acids, and each of them is represented in at least two sequences.

LIKELIHOOD RATIO TEST: A statistical test of the goodness-of-fit between two models. A relatively more complex model is compared to a simpler model to see if it fits a particular dataset significantly better.

LINEAGE: A linear evolutionary sequence from an ancestral species through all intermediate species to a particular descendant species.

MAXIMUM LIKELIHOOD: A statistical procedure of finding the value of one or more parameters for a given statistic which makes the known likelihood distribution a maximum.

ORTHOLOGOUS LOCUS: A gene that has evolved directly from an ancestral locus. Homologous genes: genes that share a common evolutionary ancestor.

PARALOGOUS LOCUS: A gene that originated by duplication and then diverged from the parent copy by mutation and selection or drift.

PATTERN OF SUBSTITUTION (SUBSTITUTION SCHEME): The relative frequency with which a nucleotide or an amino acid changes into another during evolution.

POSITIVE SELECTION: Selection for an advantageous mutant allele.

POSTERIOR PROBABILITY: The probability of a parameter value inferred from an analysis.

RELATIVE-RATE TEST: A calibration-free test for checking the constancy of the rate of nucleotide substitutions in different lineages during their evolution, thus determining whether or not the molecular clock operates at the same rate among different lineages.

ROOTED TREE: A phylogenetic tree that specifies ancestral and descendant species, thus indicating the direction of the evolutionary path.

SENSE CODON: A codon specifying an amino acid.

SEQUENCE DIVERGENCE (DIVERGENCE): The differences between two homologous sequences due to the independent accumulation of genetic changes in each lineage.

STOCHASTIC PROCESS: A process, the outcome of which cannot be predicted exactly from knowledge of initial conditions. However, given the initial conditions, each of the possible outcomes of the process can be assigned a certain probability.

SYNTENY: A pair of genomes in which at least some of the genes are located at similar map positions.

TANDEM DUPLICATION: A duplication, the products of which reside in close proximity to each other on the chromosome.

TRANSITION: The substitution of a purine for a purine or a pyrimidine for a pyrimidine.

TRANSVERSION: The substitution of a purine for a pyrimidine or vice versa.
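The transition and transversion definitions can be made concrete with a short sketch. The Python below is illustrative only: the function names are hypothetical and are not taken from MBEToolbox or any other package mentioned in this thesis, and it assumes two already-aligned DNA sequences, skipping gaps and ambiguity codes.

```python
# Illustrative sketch; names are hypothetical, not part of MBEToolbox.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for two DNA bases."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError(f"not an unambiguous DNA base pair: {a!r}, {b!r}")
    if a == b:
        return "identical"
    # A change within the same chemical class (purine to purine, or
    # pyrimidine to pyrimidine) is a transition; otherwise a transversion.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and ambiguity codes
        kind = classify_substitution(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv
```

For example, `count_ts_tv("ACGT", "GCTT")` counts one transition (A to G) and one transversion (G to T).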


INTRODUCTION

Penicillium marneffei is a dimorphic fungus that intracellularly infects the reticuloendothelial system of humans and bamboo rats. Endemic in Southeast Asia, it infects 10% of AIDS patients in this region [365, 201, 182, 50, 348, 350]. Complete genome sequencing of various organisms has accelerated rapidly in recent years, offering another path to gene discovery. This thesis presents the sequence of the P. marneffei genome, as well as related studies from the perspectives of comparative and evolutionary genomics. These studies will throw light on the molecular mechanisms of virulence of this important pathogenic fungus.

Chapter 1 gives an overview of the P. marneffei genome, including sequence statistics, gene content and prediction of gene function. Chapter 2 describes the organisation and implementation of the genome database of the P. marneffei genome project. The complete mitochondrial genome of P. marneffei is reported in Chapter 3. The gene content and gene order of the P. marneffei mitochondrial genome are highly similar to those of Aspergillus, further confirming their close phylogenetic relationship. This provides the basis for comparative genomic studies between P. marneffei and Aspergillus species.

This is followed by Chapter 4, which reports the presence of an important virulence gene cluster, the melanin biosynthesis gene cluster, in the P. marneffei genome. Since melanin is a highly toxic natural product produced by some species of Aspergillus which are phylogenetically close to P. marneffei, this finding is also valuable in revealing the evolutionary origin of this gene cluster.

Mating of P. marneffei has not yet been observed in nature or under defined laboratory conditions. The lack of a sexual stage impairs the utility of experimental fungal genetics. By using genome sequence information, however, we found evidence of the potential mating ability of P. marneffei (Chapter 5). This suggests that P. marneffei, like other pathogenic fungi, may limit access to the sexual cycle to generate a population structure that is in part clonal but which retains the ability to undergo a sexual cycle in response to challenging conditions in the environment or in the host. Chapter 6 offers a systematic exploration of the genetic components in the P. marneffei genome that may be responsible for morphogenetic processes, mainly through sequence analysis in the context of comparative genomics. Chapter 7 reports an interesting phenomenon: tandemly repeated DNA sequences occur frequently in the genome of P. marneffei, not only in noncoding regions but also in protein-coding, i.e. intragenic, regions. These highly dynamic genomic components provide a clue as to how the pathogenic fungus adapts to the host immune system.

Chapter 8 introduces a systematic test of the extent of duplicate genes in major ascomycetes. We observed significant variation within ascomycetes in the extent of gene duplication. The age distribution of gene duplications tentatively suggests that the P. marneffei genome has experienced large-scale duplication twice. We argue that the different extents and evolutionary patterns of duplicate genes in ascomycetes might be associated with the great genotypic and phenotypic differences among ascomycetes. Chapter 9 tackles the question of the origin of species-specific genes. A statistically significant correlation between accelerated evolutionary rate and the degree of lineage specificity is confirmed. This correlation is independent of many confounding factors, such as gene essentiality and expression level. This finding helps to explain the origin of P. marneffei-specific genes, which make up about one third of all P. marneffei genes. Finally, Chapter 10 introduces the software package, developed in a high-performance scientific computing language, for sequence data manipulation and analysis, which performed very successfully throughout the whole genome project.

Publications arising from this thesis are:

1. Cai JJ, Liu B, Woo PC, Lau SKP, Wong SS, Zhen H, Yuen KY (In preparation) Genomic evidence for the presence of melanin biosynthesis gene cluster in the thermal dimorphic fungus Penicillium marneffei

2. Cai JJ, Woo PCY, Lau SKP, Smith DK and Yuen KY (2006) Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in Ascomycota. Journal of Molecular Evolution, in press

3. Cai JJ, Smith DK, Xia X and Yuen KY (2005) MBEToolbox: a MATLAB™ toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6:64

4. Woo PC, Zhen H, Cai JJ, Yu J, Lau SKP, Wang J, Teng JLL, Wong SS, Tse RH, Chen R, Yang H, Liu B and Yuen KY (2003) The mitochondrial genome of the thermal dimorphic fungus Penicillium marneffei is more closely related to those of molds than yeasts. FEBS Letters, 555 (3): 469-77

5. Yuen KY, Pascal G, Wong SS, Glaser P, Woo PC, Kunst F, Cai JJ, Cheung EY, Medigue C, Danchin A (2003) Exploring the Penicillium marneffei genome. Archives of Microbiology, 179 (5): 339-53

I have tried to explicitly acknowledge where the other authors’ ideas have contributed significantly to the present work.

Chapter 1

THE DRAFT GENOME SEQUENCE OF PENICILLIUM MARNEFFEI

This chapter describes the basic features of the genome of Penicillium marneffei, such as genome assembly, gene content and some comparative results, and attempts to give an overall impression of the genome. More detailed and complete analyses of some topics may be found in the corresponding chapters.

1.1 Introduction

Although fungi pose little threat to people with healthy immune systems, they can cause fatal infections in immunocompromised individuals. Penicillium marneffei is the most important thermally dimorphic fungus causing respiratory, skin and systemic mycosis in Southeast Asia [365, 201, 182, 50, 348, 350]. The fungus was discovered in 1956 in hepatic abscesses of the Chinese bamboo rat Rhizomys sinensis, and only 18 cases of human disease were reported (in HIV-negative patients) until 1985 [66]. The HIV pandemic, especially in Southeast Asian countries, saw the emergence of the infection as an important opportunistic mycosis in this group of immunocompromised patients. About 10% of AIDS patients in Hong Kong are infected with P. marneffei [346]. In northern Thailand, penicilliosis is the third most common indicator disease of AIDS, following tuberculosis and cryptococcosis [300].

Genome sequencing of P. marneffei will improve our understanding of the molecular biology and biochemical mechanisms underlying the pathogenicity of this fungus. Despite its medical importance and its unusual thermal dimorphism, our understanding of gene organisation in P. marneffei has been limited. To my knowledge, only one cell wall mannoprotein gene has been characterised and successfully used in serodiagnosis and prevention of this infection [38, 37, 347]. As a ‘pilot study’ for this genome project, a random analysis of 2,303 random sequence tags was performed [364], which laid the foundation for the complete genomic sequencing of this fungus. In 2002, the complete genome sequencing project of P. marneffei was initiated, and we now have approximately 6.6-fold coverage of the genome, including a contig that contains the complete sequence of the mitochondrial genome. The sequencing of its genome paves the way for the development of novel methods for detecting, preventing and treating this infection.
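As a rough illustration of what a coverage of about 6.6 implies, the sketch below applies the idealised Lander-Waterman (Poisson) model, in which reads are assumed to fall uniformly at random on the genome. The genome size of about 31 Mb is the assembly size quoted in the abstract; the function names are illustrative only, and real shotgun data deviate from this model because of repeats and cloning bias.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def expected_uncovered_bases(genome_size_bp: int, coverage: float) -> float:
    """Expected number of bases not covered by any read."""
    return genome_size_bp * math.exp(-coverage)


# Assumed inputs: ~31 Mb assembly size (from the abstract), 6.6-fold coverage.
GENOME_SIZE_BP = 31_000_000
COVERAGE = 6.6

covered = fraction_covered(COVERAGE)
uncovered = expected_uncovered_bases(GENOME_SIZE_BP, COVERAGE)

print(f"Expected fraction of genome covered: {covered:.4f}")
print(f"Expected uncovered bases: {uncovered:,.0f}")
```

Under these idealised assumptions, more than 99.8% of the genome is expected to be sampled at least once, which is why a draft assembly at this depth can capture most of the gene content while still leaving gaps.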

1.2 Literature Review

In this section I will first recap some basic concepts and terminology in fungal biology, and then review some clinical aspects, including the diagnosis and management of P. marneffei infection. Finally, I will give a survey of the recent advances in fungal genome projects.

1.2.1 General fungal biology

Fungi are a large and diverse group of eukaryotes characterised by their absorptive mode of nutrition, i.e., digesting food outside their bodies. Modern taxonomists place fungi in their own kingdom, on an equal footing with plants and animals, sometimes called “The Fifth Kingdom”. They include moulds, yeasts, and mushrooms. Most fungi are multicellular, but some, the yeasts, are simple unicellular organisms. Fungi are plastic, having a diversity of forms that influence the manner of function, and a range of dispersal mechanisms that enable various approaches to survival over time. Nevertheless, diverse fungi share some basic structures.

A fungal organism consists of a mass of threadlike filaments called hyphae, which combine to make up the fungal mycelium. Each hypha is composed of a chain of fungal cells, a continuous cytoplasm with many nuclei. The hypha is surrounded by a plasma membrane and a cell wall of the polysaccharide chitin. The hyphae in a fungus branch off from one another to form the mycelium, and are all ultimately connected to the original hypha. Septa are barriers across the filament. They form either adventitiously, as in all filamentous fungi, or at regular intervals along the hypha, as in most members of the Ascomycota and Basidiomycota. Different types of fungus have adopted different methods of reproduction. For example, yeasts reproduce mitotically, while moulds have much more complex life cycles involving distinct phases, including diploid and haploid phases.

Fungi are often directly involved in our lives. Some fungi are parasitic and cause devastating plant infections. Parasitic fungi such as the rusts and the smuts are serious agricultural pests that can ruin entire crops, especially cereals such as wheat and corn. Only about 50 species are known to harm animals. Many medical applications of fungi have been discovered, of which antibiotic production is the most important. The first of these antibiotics was penicillin, possibly the most important non-genetic medical breakthrough of the last century. Approximately 75% of all described fungi belong to the Ascomycota. Among them are some famous ones, such as Saccharomyces cerevisiae, the baker’s yeast; Penicillium chrysogenum, the producer of penicillin; Neurospora crassa, the “one-gene-one-enzyme” organism; Aspergillus flavus, the producer of aflatoxin; and Candida albicans, the cause of thrush.

Figure 1.1: P. marneffei mould (A) and yeast (B) culture. Courtesy of Prof. KY Yuen, Microbiology, HKU

1.2.2 P. marneffei, as an important fungal pathogen

Mycology

The fungus grows well on Sabouraud dextrose agar. When grown at 25°C, the fresh culture appears similar to other Penicillium species, with rapidly growing greenish-silver mycelial colonies. The reverse side is usually of a beige colour. One of the most characteristic features is the production of a soluble red pigment that diffuses into the medium. Of all the Penicillium species, only P. marneffei, P. citrinum, P. janthinellum, P. purpurogenum, and P. rubrum produce diffusible red pigments. The other Penicillium species are generally not associated with human infections, nor do they display dimorphism. In contrast to a room temperature culture, the fungus assumes a yeast form at 37°C, whether in culture or in vivo. Colonies at 37°C are glabrous and beige-coloured and do not produce any red pigment (Fig. 1.1). The dimorphic growth of the fungus, as a yeast at 37°C and as a mould in culture at temperatures below 30°C, is illustrated in Fig. 1.2.

Microscopically, the mycelial form resembles other Penicillium species, with conidiophores bearing biverticillate penicilli, each penicillus being composed of four to five metulae with smooth-walled conidia. The yeast forms are ovoid or elongated, measuring 2–3 µm × 2–6.5 µm. Similar forms are also observed in tissue samples obtained from patients, in which they may be seen within macrophages or extracellularly. In contrast to other yeasts, the yeast cells of P. marneffei divide not by budding but by fission, with the result that a transverse septum is often seen in the dividing cell. This helps to differentiate P. marneffei from other dimorphic fungi in histological sections, especially Histoplasma capsulatum.

Figure 1.2: Dimorphic switching of P. marneffei. The diagram was obtained from the website of the Department of Genetics, University of Melbourne.

Ecology and epidemiology

P. marneffei is geographically restricted to Southeast Asia. Cases have been reported mostly from northern Thailand, southwestern China (e.g., around the Guangxi Province), Hong Kong, Taiwan, Singapore, Malaysia, and the Philippines. The ecology and possible environmental reservoirs of P. marneffei were first investigated in 1986 by Deng et al. [67]. In the Guangxi Province of the People’s Republic of China, P. marneffei was isolated from the internal organs of 18 out of 19 bamboo rats belonging to the species Rhizomys pruinosus. The findings of Deng et al. were confirmed by a subsequent study by Li et al. [195] of Rhizomys pruinosus senex bamboo rats in the Guangxi Province, in which 93.1% of the wild bamboo rats carried P. marneffei in the internal organs. The fungus was most commonly isolated from the lungs (87.5%), followed by the liver (56.3%), spleen (56.3%) and mesenteric lymph nodes (50%).

The association between P. marneffei and bamboo rats has also been noted in Thailand, another country in which the infection is endemic. In two studies by Ajello et al. [3] and Chariyalertsak et al. [47], P. marneffei was recovered from various species of bamboo rats, including Cannomys badius, Rhizomys pruinosus, and R. sumatrensis. The distribution of the fungus in the internal organs was similar to that in previous studies, with the highest prevalence in the lungs followed by the liver.

The consistency of these findings suggests that inhalation of the (presumably) infective conidia could be an important mode of transmission. The occurrence of the fungus in the liver could be a result of the propensity of the fungus to invade the reticuloendothelial system. It has been suggested that bamboo rats, like human victims, probably acquire the infection from a common environmental source. The possible link to environmental factors is supported by two studies from northern Thailand, which showed a significant clustering of cases of penicilliosis marneffei during the rainy season [45, 46]. A recent history of occupational or other forms of exposure to soil is also a significant risk factor. Importantly, exposure to or consumption of bamboo rats was not a risk factor for infection. The exact mode of transmission of the fungus and its natural habitat remain unsettled.

Although P. marneffei is a naturally occurring sylvatic infection in a high proportion of bamboo rat species [67], it is not known whether bamboo rats are (1) an obligate stage in P. marneffei’s life cycle or (2) a zoonotic focus for human infection. Furthermore, it is not known whether all lineages of P. marneffei are equally infectious to bamboo rats and humans or rather represent a subset of a wider, more genetically diverse population. To address these questions, four groups of investigators have reported the use of various molecular typing techniques to differentiate P. marneffei strains. Vanittanakom et al. [323] first reported, in 1996, the use of restriction endonuclease analysis for epidemiological typing of strains isolated in Thailand. Hsueh et al. noted an increase in the incidence of P. marneffei infection in Taiwan in the 1990s [134]. Antifungal susceptibility, chromosomal DNA restriction fragment-length polymorphism types, and randomly amplified polymorphic DNA patterns distinguished 8 strain types among 20 isolates. Trewatcharegon et al., on the other hand, used pulsed-field gel electrophoresis (PFGE) with NotI digestion for strain differentiation [316]. Fisher et al. [88] used a multilocus microsatellite typing (MLMT) system, an accurate and reproducible method of characterising the genetic diversity of eukaryotic pathogens that have low levels of genetic variation. They observed high genetic diversity and extensive spatial structure among clinical isolates [88]. In a further study, again based on MLMT typing, Fisher et al. [89] showed that different clones of the fungus are found in different environments, and that all the samples from any given location were genetically very similar. This led them to conclude that the fungus becomes highly adapted to its local environment, which makes it highly successful there but hinders its spread to other areas. This may explain why P. marneffei is endemic only to a relatively small area of Southeast Asia.

Immunobiology

As with most other pathogens, the availability of iron is crucial to the survival of P. marneffei in the human host. Studies by Taramelli et al. have shown that the antifungal activity of macrophages is markedly suppressed in the presence of iron overload and that iron chelators inhibit the extracellular growth of P. marneffei [306].

The route of transmission and infection of P. marneffei is currently unknown. However, it is generally believed that inhalation of the conidia is a likely route, in line with the mode of infection for other moulds. The attachment of P. marneffei conidia to host cells and tissues is the first step in the establishment of an infection. The conidia-host interaction may occur via adhesion to the extracellular matrix proteins laminin and fibronectin via a sialic acid-dependent process. Using immunofluorescence microscopy, Hamilton et al. demonstrated that fibronectin binds to the conidial surface and to phialides, but not to hyphae [122]. The investigators suggested that there could be a common receptor for the binding of fibronectin and laminin on the surface of P. marneffei [123, 122].

The interaction between human leukocytes and heat-killed yeast-phase P. marneffei has been studied by Rongrungruang et al. [269]. Their data suggested that monocyte-derived macrophages phagocytose P. marneffei even in the absence of opsonisation and that the major receptor(s) recognising P. marneffei could be a glycoprotein with N-acetyl-beta-D-glucosaminyl groups. P. marneffei stimulates the respiratory burst of macrophages regardless of whether opsonins are present, but tumour necrosis factor-α (TNF-α) secretion is stimulated only in the presence of opsonins. The authors thus speculated that the ability of unopsonised fungal cells to infect mononuclear phagocytes in the absence of TNF-α production is a possible virulence mechanism.

Although P. marneffei is capable of infecting and replicating inside macrophages, it is also evident that macrophages possess antifungal activity. The fungicidal activity of macrophages is likely to involve the generation of reactive nitrogen intermediates, as described by Kudeken et al. [180]. In addition to macrophages, neutrophils also exhibit antifungal properties. The fungicidal activity of neutrophils is significantly increased in the presence of proinflammatory cytokines, especially GM-CSF, G-CSF and IFN-γ. Other cytokines, such as TNF-α and IL-8, are also capable of enhancing the neutrophil’s inhibitory effects on the germination of P. marneffei conidia. The strongest effect was observed with GM-CSF [179]. Conidia are, however, generally not susceptible to killing by phagocytes. The fungicidal activity exhibited by neutrophils is believed to be independent of superoxide anion and to act through exocytosis of granular enzymes [181]. Recently, Koguchi et al. demonstrated that osteopontin (secreted by monocytes) could be involved in IL-12 production by peripheral blood mononuclear cells during infection by P. marneffei, and that the production of osteopontin is also regulated by GM-CSF [171]. It is also likely that the mannose receptor acts as a signal-transducing receptor that triggers the secretion of osteopontin by P. marneffei-stimulated peripheral blood mononuclear cells.

Molecular biology

The mechanism of thermal dimorphism and morphogenesis in P. marneffei is not fully understood. However, studies by Borneman et al. have started to provide important information in this area [18, 19]. It was shown that the homologue of the Aspergillus nidulans abaA gene is involved in the regulation of the cell cycle and morphogenesis in P. marneffei [18]. An STE12 homologue of P. marneffei (the stlA gene) was subsequently shown to be able to complement the sexual defect of an A. nidulans steA mutant [19]. A hitherto unknown sexual stage of P. marneffei is therefore postulated to exist. Other genes involved in the growth and development of P. marneffei have been described recently. A CDC42 homologue (the cflA gene) was shown to be required for polarisation and determination of correct cell shape during yeast-like growth, and for the separation of yeast cells [22]. Deletion of the homologue of the A. nidulans stuA gene in P. marneffei showed that the gene is required for metula and phialide formation during conidiation but is not required for dimorphic growth [20].

No vaccine is currently available for P. marneffei, but some recent studies have shown that vaccine development is potentially feasible. The P. marneffei mannoprotein Mp1p (encoded by the MP1 gene) has been tested in a mouse model as a potential vaccine candidate [347]. The relative efficacies of an intramuscular MP1 DNA vaccine, an oral mucosal MP1 DNA vaccine using a live-attenuated Salmonella typhimurium carrier, and an intraperitoneal recombinant Mp1p protein vaccine were compared. The intramuscular MP1 DNA vaccine appeared to give the best protection against P. marneffei.

1.2.3 Penicilliosis marneffei

Clinical features

Penicilliosis marneffei manifests clinically as a progressive systemic febrile illness resulting from infiltration and inflammation of the reticuloendothelial system by the yeast stage of P. marneffei. Common clinical features include systemic symptoms of fever, weight loss and anaemia, and those due to local organ involvement, such as pulmonary syndrome, chest radiographic infiltrate, lymphadenopathy, hepatosplenomegaly, molluscum-contagiosum-like skin lesions, osteolytic bone lesions, arthritis, subcutaneous abscesses and even endophthalmitis. Almost all organs can be involved in severe disseminated disease.

In immunocompetent hosts, the tissue damage is mainly associated with granulomatous inflammation with multinucleated giant cells, lymphocytes, and neutrophils. A suppurative inflammation dominated by neutrophils and resulting in abscess formation can be present. In immunosuppressed hosts, an anergic and necrotising reaction is found, with diffuse infiltration of macrophages engorged with yeast cells.

Underlying immunosuppression can be found in 80% of penicilliosis patients. The commonest underlying disease is AIDS. P. marneffei is second only to Cryptococcus neoformans as the commonest opportunistic fungal pathogen in AIDS patients in Southeast Asian countries such as Thailand.

Infections in non-HIV-infected patients have also been described, primarily among immunocompromised patients and less frequently in patients without any known underlying diseases. Reported cases of non-HIV-associated penicilliosis marneffei have occurred in patients with alcoholism, tuberculosis or systemic lupus erythematosus, in patients receiving corticosteroids or other forms of immunosuppressive therapy, and even in patients without any apparent underlying disease. Manifestations of the infection included lymphadenopathy, osteomyelitis and septic arthritis, pulmonary infection, and disseminated infection with multi-organ involvement.

A comparison of the clinical manifestations of penicilliosis in HIV-positive and HIV-negative patients has been published recently [349]. Of the 15 patients who had culture-documented P. marneffei infection, 8 (53.3%) were HIV positive and 7 (46.7%) were HIV negative. The HIV-infected patients had a higher incidence of fungaemia than the non-HIV-infected patients (50% vs. 28.6%), while the latter group frequently required tissue biopsies for confirmation of the infection. There was a significant delay in establishing the diagnosis in non-HIV-infected patients compared with HIV-infected patients (median delay of 5.5 weeks vs. 1 week, P < 0.01). Most of the non-HIV patients (85.7%) had underlying immunocompromising conditions, including haematological malignancies and autoimmune diseases requiring the use of corticosteroids or cytotoxic chemotherapy, as well as diabetes mellitus. In both categories, pulmonary involvement was the commonest manifestation on initial presentation, followed by pyrexia of unknown origin and cutaneous manifestations.

Diagnosis

Fungal culture. The infection itself is relatively amenable to antifungal therapy and a cure is potentially possible. Early recognition of the infection is therefore essential for timely initiation of effective therapy. Conventional fungal culture remains the diagnostic test of choice in most settings. In most cases the fungus can be cultivated from appropriate clinical specimens, such as blood cultures, skin lesions, and respiratory tract specimens. In AIDS patients with high levels of fungaemia, it has occasionally been reported that a direct smear of the peripheral blood may reveal the fungus. In HIV-positive patients, fungaemia could be detected in at least 55% of the patients in previous reports. Unfortunately, fungal culture suffers from a long turnaround time, and invasive tissue biopsies are sometimes necessary to obtain a satisfactory specimen. In a series of HIV-infected patients from Hong Kong, 50% had documented fungaemia [349].

The yeast form of P. marneffei may be stained by the methenamine silver or periodic acid-Schiff stains in tissue sections. When the central septation of the yeast cell is seen in the histopathological section, this offers a clue to the diagnosis of penicilliosis. Piérard et al. reported that the monoclonal antibody EB-A1 against the galactomannan of Aspergillus species may also be used to detect P. marneffei in formalin-fixed, paraffin-embedded tissues [249].

Serology. A number of studies have aimed at detecting fungal antibodies and/or antigens in the serum and body fluids of infected patients. In earlier studies, culture filtrates or whole cell extracts were used as antigens. P. marneffei was cultured in liquid media, and the culture filtrate was concentrated and used to immunise rabbits. The culture filtrate and the anti-P. marneffei rabbit sera were incorporated in an immunodiffusion test to detect antibody or antigens, respectively [277, 333, 144].

In 1994, an indirect immunofluorescent antibody test for serodiagnosis of P. marneffei infection was reported, using the yeast-hyphae (representing the tissue multiplication phase) or the germinating conidia (representing the initial tissue invasion phase) as antigens [365]. None of the eight sera from culture-documented patients tested at 1:10 dilution gave a positive result for IgM. High IgG titres (geometric means of 1:905 and 1:1280 for the respective phases) were found in all eight penicilliosis marneffei patients, in contrast to those obtained from 78 healthy controls (respective geometric means of 1:1.34 and 1:2.14). Sera from patients with cryptococcosis (n = 2) or candidaemia (n = 2) did not show cross-reactivity (IgG titre < 1:40, which is similar to that of the healthy controls). Overall, the IgG titre was higher than the IgA titre for the cases, but there was little difference between using the germinating conidia and the yeast-hyphae form as the testing antigen. Moreover, IgA could not be detected in two out of eight positive cases. Three HIV patients with culture-documented penicilliosis marneffei tested positive (IgG titres 1:80 to 1:160). An IgG titre > 1:80 is suggestive of penicilliosis marneffei.
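Geometric mean titres of the kind quoted above are conventionally computed on the reciprocal dilutions, by averaging their logarithms. A minimal sketch follows; the titre values in it are made-up illustrative numbers, not data from the cited study, and the function name is hypothetical.

```python
import math


def geometric_mean_titre(reciprocal_titres):
    """Geometric mean of reciprocal titres (e.g. 80 for a titre of 1:80)."""
    if not reciprocal_titres:
        raise ValueError("at least one titre is required")
    if any(t <= 0 for t in reciprocal_titres):
        raise ValueError("titres must be positive")
    # Average on the log scale, then transform back.
    mean_log = sum(math.log(t) for t in reciprocal_titres) / len(reciprocal_titres)
    return math.exp(mean_log)


# Hypothetical reciprocal titres, for illustration only (not study data).
example_titres = [80, 160, 320]
gmt = geometric_mean_titre(example_titres)
print(f"Geometric mean titre: 1:{gmt:.0f}")
```

Averaging on the log scale suits titres because they come from serial dilutions, so a single very high titre does not dominate the summary as it would with an arithmetic mean.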

In 1996, Kaufman et al. developed a latex agglutination test to detect antigenaemia, in which polystyrene beads were coated with rabbit anti-P. marneffei globulin obtained from rabbits immunised with yeast culture filtrate [160]. Of the 17 P. marneffei culture-positive HIV patients, 77% tested positive.

Desakorn et al. later used purified hyperimmune IgG, from rabbits immunised with yeast cells, in an enzyme-linked immunosorbent assay (ELISA) to quantitate P. marneffei yeast antigens in urine samples [69]. All urine samples from 33 P. marneffei culture-positive HIV patients tested positive, with a median titre of 1:20.

Jeavons et al. characterised and purified three cytoplasmic yeast antigens of 50, 54 and 61 kDa, which were found in 48%, 71% and 85%, respectively, of serum samples from 21 P. marneffei culture-positive patients [146]. Chongtrakool et al. isolated a 38-kDa antigen, partially purified from yeast culture filtrate; 45% of P. marneffei culture-positive HIV patients (n = 51), 17% of HIV-positive asymptomatic patients (n = 262) and 25% of other fungal culture-positive HIV patients (n = 67) had developed antibodies against this antigen [54].

PCR. The detection of P. marneffei genomic DNA in clinical specimens has also been reported. LoBuglio and Taylor used primers PM2 and PM4 to amplify a 347 bp fragment of the internal transcribed spacer region between 18S rDNA and 5.8S rDNA [202]. Vanittanakom et al., on the other hand, used a PCR-Southern hybridisation format, in which primers RRF1 and RRH1 were used to amplify a 631 bp fragment of the 18S rDNA, followed by hybridisation with a P. marneffei-specific 15-oligonucleotide probe [324]. Recently, Vanittanakom et al. described a nested PCR assay that might prove useful in the detection of P. marneffei and the identification of young fungal cultures [325].

Mp1p. The first gene cloned from P. marneffei was the MP1 gene [37]. Serum from guinea pigs immunised with P. marneffei yeast cells was used to screen a cDNA library of P. marneffei, and the MP1 gene, which encodes an abundant antigenic cell wall mannoprotein, was subsequently cloned. MP1 is a unique gene without homologues in sequence databases. It codes for a protein, Mp1p, of 462 amino acid residues, with a few sequence features that are present in several cell wall proteins of Saccharomyces cerevisiae and Candida albicans. It contains two putative N-glycosylation sites, a serine- and threonine-rich region for O-glycosylation, a signal peptide, and a putative glycosylphosphatidylinositol attachment signal sequence. Specific anti-Mp1p antibody was generated with purified recombinant Mp1p protein to allow further characterisation of Mp1p. Western blot analysis with anti-Mp1p antibody revealed that Mp1p produces dominant bands with molecular masses of 58 and 90 kDa and that it belongs to a group of cell wall proteins that can be readily removed from yeast cell surfaces by glucanase digestion. In addition, Mp1p is an abundant yeast glycoprotein and has high affinity for concanavalin A, a characteristic indicative of a mannoprotein. Furthermore, ultrastructural analysis with immunogold staining indicated that Mp1p is present in the cell walls of the yeast, hyphae, and conidia of P. marneffei. Finally, it was observed that infected patients develop a specific antibody response against Mp1p, suggesting that this protein represents a good cell surface target for host humoral immunity.

The antibody response of penicilliosis patients to Mp1p was studied in two subsequent studies [38, 39]. An ELISA-based antibody test with purified Mp1p was produced. Evaluation of the test with guinea pig sera against P. marneffei and other pathogenic fungi indicated that this assay was specific for P. marneffei. Clinical evaluation revealed that high levels of specific antibody were detected in two immunocompetent penicilliosis patients. Furthermore, approximately 80% (14 of 17) of the documented penicilliosis patients with human immunodeficiency virus tested positive for the specific antibody. No false-positive results were found for serum samples from 90 healthy blood donors, 20 patients with typhoid fever, and 55 patients with tuberculosis, indicating a high specificity of the test. Thus, this ELISA-based test for the detection of anti-Mp1p antibody can be of significant value as a diagnostic for penicilliosis.

In vitro, Mp1p is secreted into the cell culture supernatant at a level that can be detected by Western blotting. A sensitive ELISA developed with antibodies against Mp1p was capable of detecting this protein in the cell culture supernatant of P. marneffei at 10⁴ cells/mL. The anti-Mp1p antibody is specific, since it fails to react with any proteins from lysates of Candida albicans, Histoplasma capsulatum, or Cryptococcus neoformans by Western blotting. In addition, this Mp1p antigen-based ELISA is also specific for P. marneffei, since the cell culture supernatants of the other three fungi gave negative results. Finally, a clinical evaluation of sera from penicilliosis patients indicated that 17 of 26 (65%) patients were Mp1p antigen test positive. An Mp1p antibody test was also performed with these serum specimens. The combined antibody and antigen tests for P. marneffei carry a sensitivity of 88% (23 of 26), with a positive predictive value of 100% and a negative predictive value of 96%. The specificities of the tests are high, since none of the 85 control sera was positive by either test.
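The figures quoted above for the combined test can be checked from a 2 × 2 table. The sketch below assumes that the 26 culture-documented patients and the 85 control sera form the whole evaluation set, so that the counts are 23 true positives, 3 false negatives, 0 false positives and 85 true negatives; the function name is illustrative.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard accuracy measures from the four cells of a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased patients who test positive
        "specificity": tn / (tn + fp),  # controls who test negative
        "ppv": tp / (tp + fp),          # positive results that are true
        "npv": tn / (tn + fn),          # negative results that are true
    }


# Counts assumed from the text: 23 of 26 patients positive by either test,
# none of the 85 control sera positive by either test.
metrics = diagnostic_metrics(tp=23, fn=3, fp=0, tn=85)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this assumption the sensitivity comes to about 0.885 and the negative predictive value to about 0.966, in line with the percentages reported above. Predictive values depend on the ratio of patients to controls in the evaluation set, so they would differ in a population with a different prevalence.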

The value of antigen (Mp1p) and antibody (anti-Mp1p) detection in the diagnosis of penicilliosis marneffei is best evaluated by comparing the results in patients with or without underlying HIV infection. In a study involving eight HIV positive and seven non-HIV penicilliosis marneffei patients, the HIV positive patients tended to have a higher antigen titre and a lower antibody titre, while the converse is true in the HIV negative patients. This presumably is due to impaired antibody production as a result of the underlying immune defects associated with HIV infection and a higher fungal load in this group of patients. Concomitant testing of the serum antigen and antibody levels could therefore improve the diagnostic yield of serology in immunocompromised patients.

When serial serum samples were available for the HIV-positive patients, it was found that the serum antigen and antibody titres against P. marneffei were elevated as early as 30 days before the day of positive cultures. The titres of both serum antigen and antibody dropped with the initiation of amphotericin B therapy and itraconazole prophylaxis. On subsequent follow-up, there was no clinical or mycological evidence of relapse, and this was associated with persistently negative serum antigen and antibody ELISA results.

Treatment

In vitro, P. marneffei is susceptible to itraconazole and amphotericin B, while its susceptibility to fluconazole and 5-fluorocytosine is less uniform [301]. The recommended antifungal regimen to date consists of two weeks of intravenous amphotericin B (0.6 mg/kg/d) followed by ten weeks of oral itraconazole (400 mg/d), which resulted in clinical and microbiological cure in 97.3% of patients. Long-term secondary prophylaxis has also been suggested to reduce the relapse rate [290, 302]. Highly active antiretroviral therapy (HAART) has been shown to reduce the incidence of many opportunistic infections in AIDS patients, including invasive fungal infections, and with its wider use for HIV infection it has been suggested that long-term antifungal prophylaxis may not be necessary. There is, however, currently no specific CD4 cell count cut-off value that can be used to guide the use of secondary antifungal prophylaxis [140]. One recent interesting observation is that several 4-aminoquinoline agents, including chloroquine, were found to inhibit the growth of P. marneffei inside macrophages. The activity of chloroquine against P. marneffei is postulated to be due to an increase in the intravacuolar pH and a disruption of pH-dependent metabolic processes. This finding could be of value in the chemotherapy or chemoprophylaxis of penicilliosis marneffei [307].

1.2.4 Fungal genome projects

Genomics has only just started to have an impact on biological and medical research, although modern molecular genetics has been at the centre of the biomedical revolution since the 1980s. The ability to study whole genome sequences provides a new tool for biomedical research. At the time of writing, there are about 317 completed and published genome sequence projects, and 549 eukaryotic and 802 prokaryotic ongoing projects (data from the Genomes OnLine Database (GOLD) at http://www.genomesonline.org/). Current estimates suggest that there are at least 2 million fungal species, of which only some 50,000 to 70,000 have been documented, and the genomes of only a few have been completed.

S. cerevisiae was the first eukaryote to have its genome fully sequenced. The work, carried out by many different laboratories and organisations, was completed in 1996. Its genome contains ≈6,000 genes on 16 chromosomes. At the time the genome sequence was published, only 43.3% of the yeast genes were classified as ‘functionally characterised’, i.e., having experimentally well-investigated properties, being members of well-defined protein families, or displaying strong homology to proteins with known biochemical functions. Despite this limitation, it is the best-studied fungus and serves as the most important model organism for fungal genetics. An all-against-all matching of the yeast genome has been accomplished, and duplication patterns within the genome have been investigated in a systematic way. Such a view of the genome’s architecture, based on an exhaustive intra-genomic sequence comparison, revealed that whole genome duplication seems to have had an important influence on the evolutionary development of S. cerevisiae [220].

The S. pombe genome [354] contains the smallest number of protein-coding genes yet recorded for a eukaryote: 4,824. Centromere structure has been well studied in S. pombe: the centromeres are between 35 and 110 kb long and contain related repeats, including a highly conserved 1.8-kb element. More introns (of which there are 4,730) are found than in S. cerevisiae. Some 43% of the genes contain introns. Some homologs of human disease genes, such as cancer-related genes, have been identified. Comparative study identified highly conserved genes important for eukaryotic cell organisation, including those required for the cytoskeleton, compartmentation, cell-cycle control, proteolysis, protein phosphorylation and RNA splicing, which may have originated with the appearance of eukaryotic life. In contrast, few similarly conserved genes that are important for multicellular organisation were identified. The lesson from studying the S. pombe genome is that the transition from prokaryotes to eukaryotes appears to have required more new genes than did the transition from unicellular to multicellular organisation.

The N. crassa genome has been reported recently [101]. It was assembled from more than 20-fold sequence coverage of the genome. It has the largest genome size (39.9 Mb) and gene number (10,082 protein-coding genes) among all fungal genomes published so far. On average, the gene density is one gene per 3.7 kilobases (kb), with 1.7 short introns (134 bp on average) per gene. The Neurospora genome contains a small number of repetitive elements, a low degree of segmental duplication and very few paralogous genes. Neurospora genes are highly divergent: 41% of the predicted proteins have no significant matches to known proteins. Many genes have predicted products that are likely to be involved in determining hyphal growth and multicellular developmental structures, or in catabolism, chemical detoxification and stress-defence mechanisms. It has also been noted that for some Neurospora genes the only known homologs are found in prokaryotes [216], suggesting that occupation of similar ecological niches has resulted in conservation of genes for substrate degradation and secondary metabolism.

Magnaporthe grisea, one of the most devastating agricultural pathogens in the world, has been sequenced [64]. The fungus causes blast disease in rice, a scourge that each year destroys enough rice to feed 60 million people. The pathogen's remarkable ability to overcome plant defences has stymied efforts to fight the disease. Analysis of its predicted gene set provides an insight into the adaptations required by a fungus to cause disease. The M. grisea genome encodes a large and diverse set of secreted proteins, including those defined by unusual carbohydrate-binding domains. This fungus also possesses an expanded family of G-protein-coupled receptors, several new virulence-associated genes and large suites of enzymes involved in secondary metabolism. Together with the draft rice genome sequences published earlier, the new information will help researchers develop better and cheaper methods of protecting plants than the currently available fungicides.

Recently, the C. albicans and C. neoformans genomes were reported [148, 203], enabling a comparison between these divergent fungi. Moreover, high-quality draft sequences of A. nidulans and A. fumigatus are already in the public domain, and others, such as Ustilago maydis, are likely to be available soon. Genome sequencing projects for other pathogenic fungi (for instance, Pneumocystis carinii) are also under way or will soon be started.

1.3 Materials and Methods

Strain and DNA preparation for the P. marneffei genome project were performed by colleagues in the Department of Microbiology, University of Hong Kong. Library construction and shotgun sequencing were carried out by the Beijing Genomics Institute (BGI).

1.3.1 Strain and DNA preparation

P. marneffei strain PM1 was isolated from an HIV-negative patient suffering from culture-documented penicilliosis in Hong Kong. The arthroconidia (“yeast form”) of PM1 were used throughout the DNA sequencing experiments. Genomic DNA, including mitochondrial DNA, was prepared from the arthroconidia purified at 37 °C. A single colony of the fungus grown on Sabouraud dextrose agar at 37 °C was inoculated into yeast peptone broth and incubated in a shaker at 30 °C for 3 days. Cells were cooled on ice for 10 min, harvested by centrifugation at 2,000 g for 10 min, washed twice and re-suspended in ice-cold 50 mmol/l EDTA buffer (pH 7.5). Novozym (20 mg/ml) was added and the suspension was incubated at 37 °C for one hour, followed by digestion in a mixture of 1 mg/ml proteinase K, 1% N-lauroylsarcosine and 0.5 mol/l EDTA (pH 9.5) at 50 °C for 2 hours. Genomic DNA was then extracted with phenol and phenol-chloroform, and finally precipitated and washed in ethanol. After digestion with RNase A, the DNA was precipitated with ethanol a second time, washed with 70% ethanol, air-dried and dissolved in 500 µl of TE (pH 8.0).

1.3.2 Library construction, shotgun sequencing

Two genomic DNA libraries were made in pUC18, carrying inserts of 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. DNA inserts were prepared by physical shearing using sonication. The genome sequence was assembled from deep whole-genome shotgun (WGS) coverage obtained by paired-end sequencing of clones from both libraries, i.e., all inserts were sequenced from both ends to generate paired reads. A total of about 190.3 Mb of sequence data, equivalent to approximately 6.6× coverage of the genome, was generated by random shotgun sequencing.
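The fold coverage quoted above is simply the total amount of sequence divided by the genome size. The short sketch below reproduces this arithmetic; the function name is illustrative, and the two genome sizes are the assembly size and the estimated genome size reported later in this chapter (Table 1.1).

```python
def fold_coverage(total_sequence_mb: float, genome_size_mb: float) -> float:
    """Average number of times each base is expected to be sequenced."""
    return total_sequence_mb / genome_size_mb

TOTAL_SEQUENCE_MB = 190.3   # total shotgun sequence reported in the text

# Coverage relative to the assembled length and to the estimated genome size.
coverage_vs_assembly = fold_coverage(TOTAL_SEQUENCE_MB, 28.98)
coverage_vs_estimate = fold_coverage(TOTAL_SEQUENCE_MB, 31.0)

print(f"coverage vs. assembly size:    {coverage_vs_assembly:.2f}x")
print(f"coverage vs. estimated genome: {coverage_vs_estimate:.2f}x")
```

Under these assumptions the quoted 6.6× figure corresponds to the assembled length; measured against the ∼31 Mb estimate, the coverage is slightly lower.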

1.3.3 Sequence assembly

The Phred/Phrap/Consed package was used for base calling, contig assembly and quality assessment [83, 84, 112]. Contigs were ordered into scaffolds by the scaffold-building program Bambus [255]. Refer to Chapter 2 for more detailed descriptions of the annotation procedure and genome database construction.

1.3.4 Data release

Sequence data generated by the project were released continuously and were available for searching through the on-site BLAST server and for downloading by FTP, with access restrictions. The annotated sequences are available for browsing and downloading from the web interface of the P. marneffei Genome Database (PMGD), http://www.pmarneffei.hku.hk. At present, PMGD contains 10,060 protein-coding genes.

Table 1.1: General features of the P. marneffei genome.

Feature                          Value
Assembly size (excluding gaps)   28.98 Mb
Estimated genome size            ∼31 Mb
GC content (overall)             47%
GC content (coding)              50%
Protein-coding genes             10,060
tRNAs                            110
% coding                         62%
Average gene size                1,753 bp
Average intergenic distance      1,051 bp
Average intron size              111 bp
Average exon size                380 bp

1.4 Results

1.4.1 Assembly and general characteristics

Using a pure whole-genome shotgun approach, we sequenced the P. marneffei genome to 6.6× coverage. The net length of the assembled contigs totalled 28.98 Mbp. Genome statistics are presented in Table 1.1.

Genome sequence

The P. marneffei genome size was estimated at ∼31 Mb (see Section 2.4.2), which is similar to that of Magnaporthe (∼30 Mbp), larger than those of S. cerevisiae and S. pombe (both about 12 Mbp), but smaller than that of Neurospora (greater than 40 Mbp). The resulting assembly consists of 2,911 sequence contigs with a total length of 28,977,603 bp. Contigs were ordered into 273 supercontigs (i.e., scaffolds) with a total length of 28.42 Mbp (excluding gaps between contigs). Most of the assembly (98.35%) is contained in the contigs. Given the high sequencing coverage, the assembly represents the vast majority (> 95%) of the genome, as theoretically assessed by the Lander-Waterman model [186]. The mitochondrial genome (35 kb, circular) has been completely sequenced and assembled (see Chapter 3 for details).
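The theoretical assessment mentioned above can be illustrated with the simplest form of the Lander-Waterman model, in which reads start at random positions and the probability that a given base is left unsequenced at coverage c is e^(-c). The sketch below is a minimal illustration under that assumption (it ignores the minimum overlap needed to join reads and any cloning bias); the function name is ours.

```python
import math

def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    assuming reads start uniformly at random (Poisson approximation)."""
    return 1.0 - math.exp(-coverage)

for c in (1.0, 3.0, 6.6):
    print(f"coverage {c:>4}x -> {expected_fraction_covered(c):.4%} of bases sequenced")
```

At 6.6× coverage this idealised model predicts that well over 99% of bases are sampled at least once, which is consistent with the more conservative > 95% figure quoted above.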

Genes

A total of 10,060 protein-coding genes were predicted, of which 9,257 (92%) are longer than 100 amino acids. This number is, again, similar to that of Magnaporthe and smaller than that of Neurospora; it is nearly twice as many genes as in S. cerevisiae (about 6,300) and S. pombe (about 4,800), and nearly as many as in D. melanogaster (about 14,300). The average gene density is one gene per 2.8 kb. The average gene length of 1.75 kb is slightly longer than the 1.67 kb average for Magnaporthe and the 1.40 kb for both S. cerevisiae and S. pombe. Protein-coding sequence is predicted to occupy 62.1% (51.2% excluding introns) of the sequenced portion of the P. marneffei genome, compared with 71% in S. cerevisiae (70.5% excluding introns) and 60.2% in S. pombe (57% excluding introns) (Table 1.2). An estimated total of 28,180 introns are distributed among 91% of P. marneffei genes, with 34 being the largest number of introns found within a single gene. Introns vary from 15 to 1,617 nucleotides in length, with a mean of 111 nucleotides. The telomere tandem repeat identified is TTAGGG. Several predicted genes that encode conserved telomere and centromere proteins, such as telomere-associated helicases, were identified, but telomere and centromere sequences have remained elusive. Note that although the complete genomes of A. fumigatus and A. nidulans have not been published, high-quality drafts of both are available. Preliminary analyses reveal that most of the above statistics on gene number and gene density in P. marneffei are similar to those of Aspergillus. This result is consistent with our understanding of the phylogenetic relationship between them, as inferred from small-subunit ribosomal RNA sequences (Section 1.4.1) and mitochondrial genome comparison (Chapter 3).

Table 1.2: Comparison of P. marneffei genome statistics to those of other fungi. PM - P. marneffei, AN - A. nidulans, MG - M. grisea, NC - N. crassa, SC - S. cerevisiae, and SP - S. pombe.

                                    PM       AN      MG       NC       SC      SP
Genome size (Mb)                    31       31      30       43       12      12
Gene number                         10,060   9,457   11,108   10,620   6,300   4,800
Gene coverage                       62.1%    59.2%   48.2%    44.5%    71.0%   60.2%
Gene coverage (excluding introns)   51.2%    50.6%   40.5%    37.6%    70.5%   57.0%
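The gene density of roughly one gene per 2.8 kb can be cross-checked from the reported figures in two ways: by dividing the assembled length by the number of genes, and by adding the average gene size to the average intergenic distance. The following sketch performs both calculations; the variable names are illustrative, and the values are copied from Table 1.1 and the assembly statistics above.

```python
# Values copied from the chapter; variable names are illustrative.
assembly_size_bp = 28_977_603   # total length of assembled contigs
gene_count = 10_060             # predicted protein-coding genes
avg_gene_bp = 1_753             # average gene size (Table 1.1)
avg_intergenic_bp = 1_051       # average intergenic distance (Table 1.1)

# Estimate 1: assembled sequence available per gene.
kb_per_gene_from_assembly = assembly_size_bp / gene_count / 1000

# Estimate 2: one gene plus one intergenic spacer.
kb_per_gene_from_spacing = (avg_gene_bp + avg_intergenic_bp) / 1000

print(f"{kb_per_gene_from_assembly:.2f} kb per gene (assembly length / gene count)")
print(f"{kb_per_gene_from_spacing:.2f} kb per gene (gene + intergenic distance)")
```

Both estimates fall between 2.8 and 2.9 kb per gene, in agreement with the density stated in the text.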

Ribosomal RNA and tRNA

Copies of the large rRNA tandem repeat containing the 18S, 5.8S and 25S rRNA genes are present in the P. marneffei genome. Ribosomal RNA sequences from P. marneffei and other fungi were used to reconstruct a phylogeny. 18S rRNA sequences from 43 species of Ascomycetes were obtained from the Ribosomal Database Project II Release 8.1 (http://rdp.cme.msu.edu/html/). The resulting phylogeny is presented in Fig. 1.3. The tree was reconstructed with the neighbour-joining method as implemented in MBEToolbox (Chapter 10). Alignment replicates for bootstrapping were generated using Phylip [86]. The result suggests that P. marneffei is likely to be an anamorph of a Talaromyces species. This substantiates the observation that the spacer regions of the rRNA loci are highly similar to those found in Talaromyces species [158, 330]. Indeed, the sequence is almost identical to those of T. flavus and T. bacillisporus (Fig. 1.3). It is also very similar to that of Chromocleista cinnabarina, a soil fungus that, like P. marneffei, produces a red pigment.

A total of 110 tRNA genes were identified, including 69 (63%) with introns.

[Figure 1.3: tree image not reproduced; bootstrap values are shown at the nodes and the scale bar is 0.01. Taxa in the order displayed, with GenBank accession numbers: Ascosphaera apis [M83264], Eremascus albus [M83258], Coccidioides immitis [M55627], Paracoccidioides brasiliensis [AF227151], Blastomyces dermatitidis [M55624], Histoplasma capsulatum [Z75306], Penicillium allii [AF218787], Penicillium expansum [AF218786], Penicillium commune [AF236103], Penicillium chrysogenum [AF548086], Penicillium notatum [M55628], Eupenicillium javanicum [U21298], Monascus purpureus [M83260], Aspergillus flavus [D63696], Aspergillus fumigatus [M55626], Eurotium rubrum [U00970], Byssochlamys nivea [M83256], Chromocleista cinnabarina [AB003952], Talaromyces bacillisporus [D14409], Talaromyces flavus [M83262], Penicillium marneffei, Penicillium verruculosum [AF510496], Thermoascus crustaceus [M83263], Pleospora rudis [U00975], Aureobasidium pullulans [M55639], Leucostoma persoonii [M83259], Ophiostoma ulmi [M83261], Pseudallescheria boydii [U43913], Microascus cirrosus [M89994], Podospora anserina [X54864], Neurospora crassa [X04971], Chaetomium elatum [M83257], Taphrina wiesneri [D12531], Taphrina deformans [U00971], Taphrina populina [D14165], Protomyces inouyei [D11377], Saitoella complicata [D12530], Schizosaccharomyces pombe [X58056], Torulaspora delbrueckii [X53496], Saccharomyces cerevisiae [Z75578], Zygosaccharomyces rouxii [X58057], Candida tropicalis [M55527], Pichia anomala [D86914], Clavispora lusitaniae [M55526].]

Figure 1.3: Phylogenetic tree showing the relationships of P. marneffei to other Penicillium and Talaromyces species. The tree was inferred from 18S rRNA data by the neighbour-joining method, and bootstrap values were calculated from 1000 trees. The scale bar indicates the estimated number of substitutions per 100 bases using the Jukes-Cantor correction. Names and accession numbers are given as cited in the GenBank database.
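The Jukes-Cantor correction mentioned in the caption converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below shows this calculation for a pair of aligned sequences. It is an illustration only, not the MBEToolbox implementation; the function names and the toy sequences are hypothetical.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.
    Columns containing a gap ('-') in either sequence are ignored."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in columns) / len(columns)

def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance (substitutions per site)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy example: one difference in ten aligned sites.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
d = jukes_cantor(p)
print(f"p = {p:.3f}, Jukes-Cantor d = {d:.4f}")
```

A matrix of such pairwise distances is the input expected by a neighbour-joining procedure; the corrected distance always exceeds p because it accounts for multiple substitutions at the same site.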

1.4.2 Genome architecture and co-linearity

Identification of syntenies conserved between species is valuable for tracing the evolutionary events that affect genomes; however, little is known about synteny among chromosome segments (or contigs) in filamentous ascomycetes. Analysis of orthologous genes among P. marneffei, A. nidulans and A. fumigatus revealed extensive regions of conserved synteny, as well as considerable genome reorganisation within this phylum. A total of 1,340 regions containing four or more genes were found to be co-linear between P. marneffei and A. nidulans; these regions contain 3,188 P. marneffei genes. Between P. marneffei and A. fumigatus there are 1,273 such regions, containing 3,716 P. marneffei genes. The largest syntenic cluster, found between P. marneffei and A. nidulans, contains 27 gene pairs.
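One simple way to find co-linear regions of the kind counted above is to scan ortholog pairs in gene order and collect runs in which neighbouring genes in one genome correspond to neighbouring genes, in a consistent orientation, in the other. The sketch below illustrates this idea under strong simplifying assumptions: it requires strict adjacency in both genomes (no intervening genes), and the data layout, function name and example identifiers are hypothetical. It is not the procedure used to produce the figures in this section.

```python
def collinear_blocks(pairs, min_genes=4):
    """Find runs of ortholog pairs that are adjacent in both genomes.

    pairs: iterable of (a_contig, a_index, b_contig, b_index), where the
    index is the ordinal position of the gene along its contig.
    Returns a list of blocks, each a list of pairs, with >= min_genes pairs.
    """
    blocks, run, direction = [], [], 0
    for p in sorted(pairs):
        if run:
            prev = run[-1]
            step = p[3] - prev[3]
            extends = (
                p[0] == prev[0] and p[2] == prev[2]  # same contigs in both genomes
                and p[1] - prev[1] == 1              # next gene in genome A
                and step in (1, -1)                  # neighbouring gene in genome B
                and direction in (0, step)           # same orientation as the run so far
            )
            if extends:
                run.append(p)
                direction = step
                continue
            if len(run) >= min_genes:
                blocks.append(run)
        run, direction = [p], 0
    if len(run) >= min_genes:
        blocks.append(run)
    return blocks

# Toy data: five genes in inverted order on one contig, then a short run of two.
example = [("c1", i, "s1", 10 - i) for i in range(5)]
example += [("c1", 7, "s2", 3), ("c1", 8, "s2", 4)]
blocks = collinear_blocks(example)
print(len(blocks), [len(b) for b in blocks])
```

In practice a tolerance for small insertions, deletions and missed gene predictions would be needed, since requiring perfect adjacency fragments otherwise well-conserved regions.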

Melanin-biosynthesis gene cluster

One interesting example of the syntenic segments conserved between P. marneffei and Aspergillus spp. is the melanin biosynthesis gene cluster. This six-gene cluster, which spans ∼19 kb and participates in DHN-melanin biosynthesis [24, 187, 317, 318], is found in P. marneffei and is syntenic with that of A. fumigatus (Chapter 4).

Pheromone precursor gene loss

Syntenic regions reveal evolutionary events, such as gene loss, that are difficult to identify by other methods. One example is the loss of known mating pheromone precursor genes. Figure 1.4 shows the microsyntenies among pheromone precursor loci from P. marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone precursor gene has been identified in all of these species (highlighted in green) except P. marneffei. The hypothetical locations of the P. marneffei pheromone precursor genes before their loss are indicated by triangles in the figure.

Figure 1.4: Microsyntenies containing pheromone precursor loci from P. marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone precursor genes are highlighted in green. The hypothetical locations of P. marneffei pheromone precursor genes before gene loss are indicated by triangles.

1.4.3 Gene duplications (multigene families) and comparisons

Among all predicted P. marneffei genes (10,060 in total, 9,541 of them longer than 100 bp), 1,335 belong to 428 multigene families, i.e., families that contain more than one homologous member. The largest gene family consists of 34 genes. The most expanded gene families include MFS multidrug transporters, dehydrogenases/reductases and hexose transporters, as well as pepsin-type proteases (see Table 8.2 on page 165). Comparisons of contig/supercontig sequences and searches for tracts of conserved gene order reveal little evidence for large-scale duplications in P. marneffei, although the incomplete genome sequence and the unordered contigs obviously impair detection. Notably, this result is inconsistent with another line of evidence, presented in Chapter 8, in which the histogram of synonymous substitution rates between P. marneffei duplicate gene pairs suggests that two large-scale gene duplications probably occurred. Compared with S. cerevisiae, which has undergone whole-genome duplication (i.e., the largest possible gene duplication), P. marneffei has a relatively small number of recently duplicated gene pairs. However, the age distribution of duplicate genes in P. marneffei at the first peak (see Chapter 8 for details) shows a pattern similar to that in S. cerevisiae, which might suggest that duplicate genes in P. marneffei originated through one or two episodic, large-scale gene duplications.

1.4.4 Interspecies proteome comparison

Comparison of the genomic sequences of two or more species can show how evolution shapes genome structure and content, and can reveal specific sequences that have been conserved as well as those that have been invented during evolution. I conducted such a comparative analysis of the proteome sequences of P. marneffei, A. fumigatus and S. cerevisiae. The analysis started by defining orthologous and paralogous pairs among the proteomes. Two genes are said to be paralogous if they are derived from a duplication event, and orthologous if they are derived from a speciation event. Determining orthologs is an important step in assessing the relationship between genomes. This was done with BLAST: BLASTP was used to compare the sequences of proteins encoded by the genes of one genome against those of the other genomes. Protein sequences, rather than nucleotide sequences, were compared because protein sequences remain conserved much longer on an evolutionary time scale and can therefore reveal much older relationships. The lower the E-value, the greater the chance that two proteins are homologous, that is, derived from a common ancestral protein, and therefore likely to share the same function. E-values have been shown to be an accurate indication of the ratio of false positives to true positives among homologous relationships. Genes g and h were considered orthologs if h is the best BLASTP hit for g and vice versa, with an E-value less than or equal to 1e-10.

The translated ORF sequences of S. cerevisiae were obtained from the Saccharomyces Genome Database (SGD) at http://www.yeastgenome.org/. The predicted peptides of A. fumigatus were downloaded from the FTP service of the A. fumigatus genome project at the Sanger Institute (http://www.sanger.ac.uk/). The result of the proteome comparison is given in Fig. 1.5.
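The reciprocal best-hit criterion described above can be sketched in a few lines. The example assumes that the BLASTP results of each direction have already been parsed into (query, subject, E-value) tuples; the function names, the gene identifiers and the choice to rank hits by E-value alone (rather than by bit score) are illustrative assumptions, not a description of the scripts actually used.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its lowest-E-value subject, keeping only hits
    with E-value <= max_evalue. Ties keep the first hit encountered."""
    best = {}
    for query, subject, evalue in hits:
        if evalue <= max_evalue and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return (g, h) pairs where h is the best hit of g and g is the best hit of h."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((g, h) for g, h in forward.items() if reverse.get(h) == g)

# Hypothetical parsed BLASTP results (query, subject, E-value).
pm_vs_af = [("pm_g1", "af_h1", 1e-50), ("pm_g1", "af_h2", 1e-20),
            ("pm_g2", "af_h2", 1e-30), ("pm_g3", "af_h3", 1e-5)]
af_vs_pm = [("af_h1", "pm_g1", 1e-48), ("af_h2", "pm_g1", 1e-18),
            ("af_h2", "pm_g2", 1e-12), ("af_h3", "pm_g3", 1e-6)]

orthologs = reciprocal_best_hits(pm_vs_af, af_vs_pm)
print(orthologs)
```

In this toy data set only pm_g1 and af_h1 are each other's best hit; pm_g2 is rejected because its best hit, af_h2, prefers pm_g1, and pm_g3 is rejected because its only hit fails the E-value threshold.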

Figure 1.5: Graphical representation of a three-way proteome comparison among P. marneffei, S. cerevisiae and A. fumigatus.

1.4.5 Lineage-specific genes

We identified many genes that are present only in P. marneffei or its closely related fungal species, namely lineage-specific genes. At the most extreme, some genes are present exclusively in P. marneffei. These genes are of particular interest because they may determine characteristic features of the fungus. A total of 1,447 genes were found whose proteins lack significant matches to known proteins in public databases (TBLASTN cutoff 1e-10). This reflects the fact that genome projects for Penicillium and its close relatives are still at an early stage and that the diversity of fungal genes remains to be explored. Furthermore, 2,506 proteins do not have significant matches to genes in either the sequenced yeasts or A. nidulans. A novel theory about the emergence of lineage- or species-specific genes is given in Chapter 9. Briefly, an accelerated evolutionary rate, one of the best-characterised properties of lineage-specific genes, may be responsible for their emergence.

In addition to the lineage-specific genes, many fungal-specific domains have been identified. These include the cell wall antigen MP1 domain, which was first described in the cell wall antigen Mp1p of P. marneffei [347]. Mp1p contains two self-conserved regions, CR1 and CR2, which form a new conserved domain family that has not been described in conserved domain databases such as Pfam and ProDom. The genome sequence reveals more than 12 P. marneffei genes containing at least one MP1 domain; that is, the genes encoding MP1-containing proteins have expanded in the P. marneffei genome. The expansion is less extensive in A. fumigatus and A. nidulans, although at least two MP1-containing proteins, afmp1 and afmp2 (GenBank Acc.: AAG09624 and AAR22399), have been discovered in the A. fumigatus genome.

Figure 1.6: Putative MAPK signalling pathway in P. marneffei. Overview of major intracellular signalling pathways in P. marneffei. Genes common to S. cerevisiae and P. marneffei are marked with asterisks. Names of S. cerevisiae genes are shown, with the corresponding P. marneffei genes in parentheses. Created using GenMAPP v2.0, a free program for visualising genes on biological pathways.

1.4.6 Cell signalling and morphogenesis

The sequences encoding proteins that act in well-studied signalling pathways, including mitogen-activated protein kinases (MAPK) and cyclic AMP-dependent protein kinase, as well as small GTPases of the Ras family, are readily recognised in the P. marneffei genome. Figure 1.6 compares the MAPK signalling pathways of S. cerevisiae and P. marneffei.

1.4.7 Potential mating ability

Traditionally, P. marneffei has been considered an asexual (anamorphic) ascomycete that lacks an apparent sexual (teleomorph) stage in its life cycle and seems to reproduce only mitotically [44, 104]. Recent genetic studies, however, suggest that it may have an unidentified sexual cycle. With the exception of the pheromone precursor gene, the whole set of sex-related genes was identified in the P. marneffei genome, which indicates the potential mating ability of this important thermally dimorphic fungus (Chapter 5).

1.4.8 Putative virulence genes

What makes a fungus a pathogen is an old question. The P. marneffei genome sequence has revealed many proteins and systems with functions that have previously been found to be important in pathogenic fungi. For example, proteins such as phospholipases and proteinases are involved in direct host cell damage and lysis. Fungal virulence factors are reviewed in Section 4.2. A few of the putative virulence factors identified are presented in Table 1.3.

1.4.9 Cell wall antigens and biosynthetic genes

The cell wall of a fungus maintains the structural integrity of the cell, protects the fungus against the defence mechanisms of the host and harbours most of the fungal antigens. It consists of a polymer of α- and β(1,3)-glucans, chitin, galactomannan and β(1,3)(1,4)-glucan, in which protein antigens including the adhesins are embedded. The cell wall is synthesised and continuously remodelled by enzymes including synthases, transglycosidases and glycosyl hydrolases. All of these are absent from human cells and are thus ideal targets for antifungal agents and immunisation. Previous studies have shown that a specific monoclonal antibody against the galactofuran side chain of the galactomannan antigen of A. fumigatus reacts with the cell wall of P. marneffei and can be used to detect antigenaemia or antigenuria in patients suffering from penicilliosis marneffei [363]. An ortholog of one of the known P. marneffei cell wall antigen genes, MP1, is present in A. fumigatus. Within P. marneffei, homologs of a number of Aspergillus genes encoding similar biosynthetic enzymes and cell wall antigens have been identified (Table 1.4).

Table 1.3: Putative virulence genes

Gene       Acc. No.  BLAST hit                                                      E value

Proteinase
Pm47.49    P87184    Intracellular vacuolar serine proteinase precursor             0
Pm61.35    Q96WN2    Lon proteinase                                                 0
Pm109.24   P25375    Saccharolysin (EC 3.4.24.37) (Protease D) (Proteinase yscD)    1e-159
Pm61.50    Q6FX66    YCL057w PRD1 proteinase yscD                                   1e-158
Pm88.30    Q64HW0    Aspartyl proteinase                                            1e-122
Pm66.31    P32379    Proteasome component PUP2 (EC 3.4.25.1)                        3e-98
Pm13.58    Q871P4    Related to ubiquitin-specific proteinase UBP1                  6e-97

Phospholipase
Pm1.261    Q769K2    N-acyl-phosphatidylethanolamine-hydrolysing phospholipase D    6e-61
Pm103.31   Q874F2    Phospholipase D                                                1e-156
Pm16.57    Q6U820    Lysophospholipase (EC 3.1.1.5)                                 0
Pm167.18   Q877A5    Phospholipase (Fragment)                                       2e-51
Pm182.7    Q76H92    Phospholipase A2                                               3e-27
Pm22.27    Q9P866    Candida albicans Phosphatidylinositol phospholipase C          4e-44

Metacaspase
Pm112.34   Q8J140    Metacaspase                                                    1e-91
Pm205.1    Q8J140    Metacaspase                                                    3e-58

Agglutinin
Pm113.29   Q9P5P9    Related to A-agglutinin core protein AGA1                      1e-24
Pm10.4     P11219    Lectin precursor (Agglutinin)                                  5e-09
Pm2.195    Q8CMU7    Streptococcal hemagglutinin protein                            3e-07
Pm28.53    Q7N911    Similar to hemagglutinin/hemolysin-related protein             0.00005

Toxin
Pm21.30    A45086    HC-toxin synthetase - fungus (Cochliobolus carbonum)           0
Pm21.31    Q9UVN5    AM-toxin synthetase                                            0
Pm71.10    Q9UVN5    AM-toxin synthetase                                            0
Pm71.39    Q9UVN5    AM-toxin synthetase                                            0
Pm137.4    Q9UVN5    AM-toxin synthetase                                            0
Pm151.1    A45086    HC-toxin synthetase - fungus (Cochliobolus carbonum)           0
Pm112.24   Q96WL1    Aflatoxin efflux pump Aflt                                     1e-141

1.5 Discussion

This is the initial analysis of the genome of a thermally dimorphic fungus. Although P. marneffei has not been studied intensively, analysis of the genome sequence has provided many new insights into a variety of gene functions and cellular processes, including cell wall components, signalling pathways, secondary metabolism and mating ability. Comparisons of the genome of P. marneffei with those of other pathogenic and nonpathogenic fungi have also uncovered surprising similarities and differences, providing a new perspective on the molecular underpinnings of these lifestyles. The analysis of P. marneffei-specific genes might allow researchers to begin to gain insights into the transition from mould to yeast growth. Furthermore, the genome sequence has revealed different patterns of gene duplication in P. marneffei and other ascomycetes, which might be linked with their divergent biological characteristics. The apparent lack of a pheromone precursor locus in P. marneffei may provide an explanation for its asexual lifestyle. However, the fungus may in fact undergo an as yet undetected sexual cycle, a possibility supported by the finding of homologs of many mating genes. Finally, one of the most interesting findings is the abundance of intragenic tandem repeats in the coding regions of the genome. This finding provides a possible mechanism to explain how the fungus can change its surface coat and thereby evade detection by the host's natural defences (see Chapter 7).

The draft genome sequence of P. marneffei presented in this chapter provides a first attempt to understand the genetic basis of the physiology of this unusual Penicillium species. This first glimpse will be expanded as more fungal genomes are generated by ongoing and planned sequencing projects. This new era in fungal biology promises to yield insights into this important group of organisms, as well as to provide a deeper understanding of the fundamental cellular processes common to all eukaryotes.

Table 1.4: Cell wall antigens and biosynthetic genes predicted in P. marneffei.

                                  Aspergillus gene   Acc. No.     Pm gene    E value

CHSs
  Class I                         CHSA               AAB33397     Pm14.101   e-107
  Class II                        CHSB               AAB33398     Pm132.15   5e-097
  Class III                       CHSG               AAB07678     Pm110.5    0
  Class IV                        CHSF               AAB33402     Pm87.22    6e-064
  Class V                         CHSE               CAA70736     Pm38.37    0
  Class VI                        CHSD               AAB33400     Pm223.4    e-051
β(1,3)-glucan synthase            FKS1               AAB58492     Pm120.1    0
                                  RHO1               AAG12155     Pm203.6    5e-099
α(1,3)-glucan synthase            AGS1               AAL28129     Pm162.3    0
                                  AGS2               AAL18964     Pm66.50    0
β(1,3)-glucanosyl transferases    GEL1               AAC35942     Pm221.6    e-154
                                  GEL2               AAF40139     Pm94.24    e-123
                                  GEL3               AAF40140     Pm119.10   e-124
Mannosyl transferases             MNN9               Afu2g01450   Pm207.2    5e-097
                                  PIG-M              Afu7g01300   Pm90.41    2e-063
Chitinases
Endo-β(1,3)-glucanases            Engl1              AAF13033     Pm5.32     0

Chapter 2

PENICILLIUM MARNEFFEI GENOME DATABASE AND ANNOTATION PIPELINE

The draft genome of Penicillium marneffei has been obtained (Chapter 1). This large volume of sequence data must be analysed efficiently in order to extract valuable information, which requires a computer-based analysis system tailored to the genome. A sequence data management system with a number of peripheral applications has been developed for this purpose.

2.1 Introduction

The ever-growing amount of genome information for P. marneffei needs to be adequately processed, annotated and interpreted. Computational annotation systems that provide tools and algorithms can facilitate this process and advance our understanding of the genome sequences. For such systems to be developed and refined, data must be easily accessible and amenable to analysis. The analysed data must be fed back into the loop so that they can be re-analysed, refined and verified, and so that new hypotheses can be built. This is the issue of data management, and good data management practices are essential to users of genomic data.

This chapter is concerned with two aspects: (1) construction of the PMGD (P. marneffei genome database) system, and (2) issues relevant to the development of the annotation pipeline. Many steps are involved in these two aspects. Among them, prediction of protein function is one of the most critical in genome information processing, and it therefore forms the central part of the annotation pipeline. Since P. marneffei genetics is not well established, most of the proteins derived from its genome are entirely unknown to biologists, and more than ten thousand uncharacterised proteins must undergo function prediction. Different methods of protein function prediction have been developed (see Literature Review). Briefly, these methods can be categorised into two major groups: homology-based methods and non-homology-based methods. The former depend on detectable homology between the unknown protein and characterised proteins in databases. The latter are based on various kinds of contextual information about a protein, which are collected and integrated in order to assign a putative function to the protein indirectly [218]. However, none of these methods guarantees a ‘one-stop’ solution that is particularly successful for P. marneffei gene function prediction. Hence, the newly developed annotation pipeline integrates several currently used methods, but it is by no means a mere collection of methodologies: each method has been tailored before integration so as to give maximum predictive power with respect to the features of fungal proteins.

In the next section, I first review the principles underlying the methodologies used for predicting the function of unknown proteins. I then examine a few protein function prediction systems implemented by several research groups, before pointing out some additional considerations regarding the further development of similar systems. Note that protein function prediction is a broad topic that could be broken down into subtopics in many different ways. I have tried to organise them so that they flow from theory to application as smoothly as possible. Even so, the content of the sections may jump around slightly, and some key concepts, such as sequence alignment algorithms, may be mentioned more than once in different sections.

2.2 Literature Review

In this literature review I first examine the most widely used methods in protein function prediction. I then survey the software and database systems currently available, highlighting their strengths and shortcomings. Possible directions for further research are addressed at the end of the review.

2.2.1 Methods for predicting protein function

Based on their underlying principles, methods of protein function prediction can be categorised into two major groups: homology-based methods and nonhomology-based methods [17, 217, 142].

Homology-based methods

Homology-based annotation relies on sequence similarity between the query protein and a well-characterised protein. If two proteins are highly similar in sequence, they possibly share the same function. The rationale behind this extrapolation is that similarity in sequence is a sufficient indicator of similarity in function. This is reasonable, but counter-examples are not rare. For instance, when domains are shared by numerous proteins [74], choosing the first or the best hit may not be optimal, and the multi-domain organisation of proteins can lead to incorrectly annotated database entries. Despite such criticisms, homology-based methods are by far the most widely used. To calculate similarities or distances to the sequences of known proteins, pairwise similarities are computed using the rigorous dynamic programming algorithm [292] or heuristic algorithms such as FASTA [245] and BLAST [6].

Besides whole-protein similarity comparison, detecting motifs or domains shared among proteins gives additional information about function. A motif is a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). Some, but not all, motifs are associated with a specific biological function. A domain is the fundamental unit of structure folding and evolution. It may combine several secondary structure elements and motifs, which are not necessarily contiguous. A domain can fold independently into a stable 3D structure and has a specific function. A variety of mathematical representations of protein motifs and domains have been developed and used for detecting and storing them, such as regular expressions, position-specific scoring matrices [97], hidden Markov models [57], probabilistic suffix trees [15] and sparse Markov transducers [81].

Nonhomology-based methods

Although homology-based annotation has been widely successful in extending knowledge from the small set of experimentally characterised proteins to the tens of thousands of proteins found in genome sequencing projects, it has a serious limitation: a well-characterised reference protein must be found on the basis of sequence similarity; otherwise, no putative function can be assigned to the unknown protein. According to our current data, 30-40% of proteins have no clear sequence homolog in today’s most up-to-date protein databases. Another recently finished fungal genome sequencing project has the same problem [101].

Nonhomology-based methods, also called context-based function prediction, are complementary to homology-based function prediction. Phylogenetic profiles, domain fusion and gene neighbouring are examples of these methods. Pellegrini et al. [248] presented the phylogenetic profile method, based on the assumption that proteins that function together in a pathway or structural complex are likely to evolve in a correlated fashion. If proteins A and B tend to be either preserved or eliminated together in a new species, we can expect them to be functionally linked. In this case, if we know the function of protein A, we can predict the function of protein B through this functional linkage. Phylogenetic profiling could be useful in predicting the function of uncharacterised proteins in P. marneffei, especially as more and more fungal species are sequenced. For the time being, however, the method has to be performed manually because no free software is available to automate the analysis.
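The idea of comparing phylogenetic profiles can be illustrated with a small sketch. It is only a toy example under stated assumptions: the protein names, the presence/absence vectors and the mismatch threshold `max_mismatch` are all invented for illustration and are not taken from the P. marneffei data or from the method of Pellegrini et al. as published.

```python
# Toy phylogenetic profiles: 1 = a homolog is present in that genome, 0 = absent.
# Protein names and presence/absence values are hypothetical.
profiles = {
    "protA": (1, 0, 1, 1, 0, 1),
    "protB": (1, 0, 1, 1, 0, 1),
    "protC": (0, 1, 1, 0, 1, 0),
    "protD": (1, 0, 1, 0, 0, 1),
}


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_partners(query, profiles, max_mismatch=1):
    """Proteins whose profile differs from the query's in at most max_mismatch genomes."""
    ref = profiles[query]
    hits = [(hamming(ref, prof), name)
            for name, prof in profiles.items() if name != query]
    # Closest profiles first; ties are broken alphabetically by name.
    return [name for dist, name in sorted(hits) if dist <= max_mismatch]


partners = linked_partners("protA", profiles)
print(partners)  # candidate functional partners of protA
```

If protA were a characterised protein, the proteins returned here would be candidates for sharing its pathway or complex. A real analysis would need many more genomes and a statistical measure of profile similarity rather than a fixed mismatch threshold.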

2.2.2 Software/database systems for protein function prediction

Over the decades, through close cooperation between biologists and software engineers, a wide range of software and database systems has been developed. As described below, some of them rely mainly on one of the methods mentioned above as their predictive tool, while others integrate more than one method in order to give a more comprehensive annotation of unknown proteins.

Systems for automatic function assignment

A group of software systems, such as PEDANT, GeneQuiz and Bio-Dictionary, attempts to accelerate the work of human experts by providing detailed and exhaustive information for function assignment.

PEDANT (http://pedant.gsf.de) is a software system for completely automatic and exhaustive analysis of protein sequence sets, from individual sequences to complete genomes [96]. Launched in 1996, it is one of the earliest such systems, and it has been used extensively at MIPS, a Europe-based bioinformatics institute. Its developers state that it is fully integrated with a sequence database system and provides access to a broad range of biological information through a hierarchically organised, controlled vocabulary. Like some other similar systems, the whole system has since been commercialised, which limits its popularity.

The GeneQuiz analysis server is open to public use and accepts anonymous protein sequences through GQserve [7]. It is composed of several major modules. GQupdate keeps the target databases current, maintaining integrated, up-to-date, non-redundant protein and nucleotide sequence databases as well as databases of protein structures and motifs. GQsearch performs database searches with the query, applies a variety of sequence analysis tools to the query sequence, and parses and stores the results in a common format. GQbrowse allows browsing and querying of the results. These modules are general engineering achievements that do not differ in principle from other database systems. It is the GQreason module that embodies the critical know-how of the whole system. This module analyses the results and makes intelligent guesses, assigning to the query a specific function, a general functional class and a reliability estimate.

Bio-Dictionary [264] employs a weighted, position-specific scoring scheme and uses a complete collection of amino acid patterns (referred to as seqlets) to determine, in a single pass, all local and global similarities between the query and any protein already present in a public database. Its most distinctive feature is the use of seqlets that completely cover the natural sequence space of proteins in the currently available public databases. According to its developers, the seqlets in this collection capture both functional and structural signals that have been reused during evolution, within as well as across families of related proteins. With this capacity, seqlets are well suited for use in protein annotation.

Classification system

An unknown protein cannot always be readily assigned a definite functional description. In such cases, protein classification can help to elucidate the function of the new protein. Comparing a protein sequence with a database of protein families is more effective than a standard database search. Generally, conserved proteins are classified according to their homologous relationships. Each protein group is built from a set of “seed” proteins, which is represented as a multiple alignment, a regular expression profile or an HMM. Protein classification is useful in structure and function prediction, and is especially important in large-scale annotation efforts.

As of 2001, Clusters of Orthologous Groups of proteins (COGs) had been delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages [308]. The collection has since been updated to include more complete genomes representing broader lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. One problem with the COGs system is that it is not fully open to the public: batch application of COGnitor, the key component used to fit new proteins into the COGs, can only be accessed inside the NCBI. Another issue to take into account is that COGs does not discriminate paralogs (genes from the same genome that are related by duplication) from orthologs (genes in different species that evolved from the same ancestral gene). Orthologs typically have the same function, allowing functional information to be transferred from one member to an entire COG. In contrast, paralogs are proteins whose genes duplicated after speciation and are often functionally diverse, although high sequence similarity is normally preserved among them. A system like COGs can therefore only be used as a classification system that automatically yields a number of functional predictions for poorly characterised genomes. The COGs system is of limited usefulness in the P. marneffei genome project because its current version contains few fungal genomes. Other database systems, such as Systers [177], iProClass [135] and ProtoMap [362], have the same shortcoming as COGs. They are better treated as protein information storage and retrieval systems than as active protein function prediction systems.

Protein domain databases

Commonly used protein domain databases are listed in Table 2.1. Two of them, Pfam and InterPro, have been used in PMGD.

Pfam (http://www.sanger.ac.uk/Software/Pfam) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families [13]. For each protein family, Pfam allows users to inspect multiple alignments, view protein domain architectures, examine species distribution, and so on. Pfam is built from fixed releases of Swiss-Prot and TrEMBL. In the current version, 18.0 (2005), 75% of protein sequences in Swiss-Prot and TrEMBL have at least one match to Pfam.

InterPro (http://www.ebi.ac.uk/interpro) is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. It provides an integrated view of commonly used signature databases such as PROSITE, PRINTS, SMART, Pfam and ProDom, and has an intuitive interface for text- and sequence-based searches. The latest release, 11.0, contains 12,294 entries and covers 77.5% of UniProt proteins. InterProScan is a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource, with look-up of the corresponding InterPro and GO annotation.

2.2.3 The art of gene finding

The last 20 years have witnessed significant development of computational methodology for finding genes and other functional sites in genomic DNA. Two major classes of computational approaches are commonly used to detect genes in genomic sequences: homology-based approaches and ab initio gene-finding algorithms. The former are relatively straightforward, focusing on the search for homologous relationships with the content and structure of known genes. If a region of sequence is similar to the sequence of an identified gene, this is highly suggestive, though not necessarily conclusive, of a gene. The most common program for such comparisons is probably BLAST.

Table 2.1: Commonly used domain databases.

Database   Method        Data type   URL
Prosite    Semi-Manual   Motif       www.expasy.ch/prosite/
Pfam       Semi-Auto     Domain      www.sanger.ac.uk/Software/Pfam/
Blocks     Full-Auto     Motif       www.blocks.fhcrc.org/
ProDom     Full-Auto     Domain      prodes.toulouse.inra.fr/prodom
Prints     N/A           Motif       www.bioinf.man.ac.uk/PRINTS/
Domo       Full-Auto     Domain      www.infobiogen.fr/services/domo/
InterPro   N/A           Motif       www.ebi.ac.uk/interpro/
Smart      Semi-Auto     Domain      smart.embl-heidelberg.de/
eMotif     Full-Auto     Motif       dna.stanford.edu/identify

Next I will review some issues related to ab initio gene-finding algorithms. Generalised hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations, including GenScan [30] and FGENESH (Softberry). At the time of writing, neither GenScan nor FGENESH is open source, and no detailed information about their underlying algorithms and implementations is available. According to the general description of the algorithm, GenScan uses a training set to estimate the HMM parameters and then returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (the Viterbi algorithm). The generalised HMM that GenScan uses consists of a number of states modelling the various parts of a gene. These states include the 5’ splice site, 3’ splice site, internal coding exon, start exon and terminal exon. The final gene structure predicted by GenScan is the maximum probability path through the HMM. FGENESH is also HMM-based, with an algorithm similar to that of GenScan [30]. It differs in that, in its model of gene structure, a signal term (such as a splice site or start site score) has some advantage over a content term (such as coding potential), reflecting the biological significance of the signals.

No matter what algorithm a gene-finding program implements, several basic types of signal must be detected. The signals (or functional sites in genomic DNA) that researchers have sought to recognise include splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites [108]. From the point of view of information science, two basic types of information are used: (1) “signals” in the sequence, such as splice sites; and (2) “content” statistics, such as codon bias. Among signal measures, the splice junctions (the donor and acceptor sites) are the most important features to identify. The most common methods for this have been those based on the “weight matrix”. Other methods, such as consensus, maximal dependence decomposition (MDD) and neural network-based methods, are also used. Other signals, such as start and stop codons, TATA boxes, transcription factor (TF) binding sites and CpG islands, are also useful in predicting protein-coding regions. Content measures, such as codon bias and the periodicities and asymmetries of coding regions, help to distinguish coding from noncoding regions. Fairly long exons are easy to identify, whereas short ones remain difficult. Neural networks have also been used to distinguish coding from noncoding sequences.
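To make the “weight matrix” idea concrete, the following minimal sketch builds a log-odds matrix from a handful of aligned donor-site sequences and uses it to locate the best-scoring window in a longer sequence. Everything in it is an assumption made for illustration: the example sites, the uniform background frequencies, the pseudocount and the test sequence are hypothetical, and the sketch does not reproduce the scoring used by GenScan or FGENESH.

```python
import math

# Hypothetical aligned donor-site examples (3 exon bases followed by 6 intron bases).
# A real matrix would be trained on many annotated sites from the target genome.
SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTATGT", "CAGGTGAGC"]
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed uniform background
PSEUDOCOUNT = 1.0                             # avoids log(0) for unseen bases


def build_weight_matrix(sites):
    """Return one {base: log-odds score} dictionary per alignment column."""
    total = len(sites) + 4 * PSEUDOCOUNT
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + PSEUDOCOUNT) / total)
                            / BACKGROUND[base])
            for base in "ACGT"
        })
    return matrix


def score(matrix, window):
    """Sum of per-position log-odds scores for a window of the matrix width."""
    return sum(col[base] for col, base in zip(matrix, window))


def best_window(matrix, seq):
    """Start index of the highest-scoring window in seq."""
    width = len(matrix)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(matrix, seq[i:i + width]))


wm = build_weight_matrix(SITES)
seq = "TTTCCCAGGTAAGTCCCTTT"  # hypothetical sequence containing one donor-like site
start = best_window(wm, seq)
print(start, seq[start:start + len(wm)])
```

A weight matrix treats positions as independent, which is the simplification that methods such as MDD were designed to relax.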

Recently, homology-based approaches have been incorporated into ab initio gene-finding algorithms. GenomeScan, for example, combines two sources of information: probabilistic models of exons and introns, and sequence similarity information [361]. It is an extension of the GenScan program and predicts gene structures that have at least one exon with supporting evidence from an existing protein sequence. The major disadvantage of this method is its requirement for a close homolog. Homologs are often unknown or remote, in which case this system would be inappropriate.

Although programs for gene structure prediction have greatly improved in the last decade, even the best cannot autonomously detect all genes and genomic elements, and their predictions have to be supported by experimental analysis. The programs still produce a considerable proportion of incorrect and missed exons, and they concentrate only on the detection of coding exons, while 5’ and 3’ UTRs, promoter elements and polyA sites often remain undetected. Complex genome organisation, such as nested and overlapping genes or alternative splicing, has not yet been addressed by any of the programs [267].

2.3 Implementation

The overall objective of PMGD is to design and implement a distributed information framework that provides services, tools and infrastructure for high-quality analysis and annotation of large amounts of diverse genomic data. The system starts with the assembly of sequences and ends with the web interface that presents all processed information. Because the requirements of an update depend on the genomic data sources being updated, PMGD was designed to be modular and configurable, so that adding new sequence data is as straightforward as possible.

2.3.1 Annotation pipeline

The general strategy applied to the analysis of all contigs is diagrammed in Fig. 2.1. It uses standard published procedures for sequence comparison as well as sh/bash shell scripts and Perl scripts specifically developed for this work (see Section 2.3.5). The procedure involves the following major steps:

[Flowchart: Contigs (2911) > Consed/BAMBUS > Scaffolds (273) > nucleotide analyses (GenScan, FGENESH, HmmGene, Tandem Repeat Finder, BLASTX search, others) > predicted genes and best gene prediction (10,060) > protein analyses (domain identification, BLASTP search, others) > relational database storing gene structure and functional annotation, sequence data files, and the PMGD website and interface.]

Figure 2.1: Flowchart of annotation pipeline for P. marneffei genome.

Step 1: contig assembly

Contigs were assembled from the sequence electropherograms using Phred/Phrap with default options except where otherwise indicated (for details, see Section 2.3.2).

Step 2: comparisons of contigs to sequence databases

All contigs were compared with fungal DNA sequences using BLASTN (default parameters) to search for rDNA, plasmid or mitochondrial DNA sequences. The contigs were also compared to all known proteins in GenBank (release 131) using ungapped BLASTX, with significant hits indicating potential exons. The searches were made using the seg filter and the PAM250 substitution matrix. The searches against mitochondrial sequences were made using the filamentous fungal mitochondrial genetic code. To facilitate visual inspection of the alignments, I developed the blast2html script, which converts regular BLAST output to HTML format. A graph is inserted above the descriptive lines, showing alignments coloured according to their similarity score with the contig or protein query. Note that BLASTX hits often indicate the approximate location of coding exons, but they do not capture every exon and do not accurately delineate exon boundaries, so the BLASTX search in this step provides only preliminary coding information.

Step 3: identification of genetic elements

This step identifies protein-coding genes and other genetic elements. Different gene-finding programs were evaluated, and the best one was then used as the primary gene-finding program (for details, see Section 2.3.3). In addition to the protein-coding genes, tRNAs were identified using the tRNAscan-SE program [207] (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/).

Step 4: BLAST comparisons to protein sequences

After the predicted proteins were obtained, they were compared with the non-redundant NCBI protein database using BLASTP version 2.0.10 with the seg filter and the PAM250 substitution matrix. All predicted genes were searched against the Pfam set of hidden Markov models using the HMMER program, and against InterPro using a modified InterProScan running locally on the bioinfo server.

Step 5: Data storing and PMGD web interface

Before the annotation data were loaded into the database system, information from the various software programs was integrated and the results were converted into either GenBank or GFF format (see below). A manual validation step was introduced at this stage. The data storing procedure is described in Section 2.3.4.

2.3.2 Assembly process

The Phred/Phrap/Consed package (version 0.99.03.19) is one of the most frequently used software sets for trace file base calling, contig assembly and contig editing [83, 84, 112].

Base calling

The purpose of base calling is to determine the nucleotide sequence on the basis of the multi-colour peaks in the sequence trace. Because traces (and regions within a trace) are of variable quality, the fidelity of the “called” nucleotides is also variable. The accuracy of each called base is measured by its base quality value. The Phred base-calling program takes a trace file as input and provides these base quality values to help evaluate sequence accuracy realistically. It computes the probability p of an error in the base call at each position and converts this to a quality value q using the transformation $q = -10 \log_{10}(p)$. Thus a quality of 30 corresponds to an error probability of 1/1000, a quality of 40 to an error probability of 1/10000, and so on.
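The relationship between error probability and quality value can be checked with a few lines of code. This is only a sketch of the transformation stated above; the function names are mine and are not part of the Phred package.

```python
import math


def error_prob_to_quality(p):
    """Quality value q = -10 * log10(p) for an error probability p."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse transformation: error probability p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30, 40):
    print(q, quality_to_error_prob(q))
```

Each increase of 10 in the quality value corresponds to a tenfold reduction in the probability that the base call is wrong.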

Vector clipping

Use the cross_match alignment program to compare each read in the fasta-format file generated by base calling to a fasta database of cloning and sequencing vectors, vector.fasta. The sequence of the cloning vector used (the pUC18 plasmid sequence in our case) was added to the vector sequence database. On the bioinfo server, the vector sequence database is located at /db/univec/UNIVEC/UniVec or /pgm1/phrap/vector.seq. An example command line for clipping CLONE.fasta is:

% cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /db/univec/UNIVEC/UniVec

The -screen option tells cross_match to produce another fasta file, CLONE.fasta.screen, nearly identical to CLONE.fasta except that recognised vector sequences are replaced by X (or x, according to the original capitalisation).

Sequence assembly

Assemble the vector-clipped reads to reconstruct the clone sequence using the Phrap sequence assembler. The program takes as input a fasta-format file of sequence fragments and a companion base quality file, and constructs the contig sequence as a mosaic of the highest-quality parts of the reads. Run the assembly program using the command line:

% phrap -new_ace CLONE.fasta.screen > phrap.out

As a result, Phrap creates a number of files. The most important ones are CLONE.fasta.screen.contigs (the assembly consensus sequence in fasta format), CLONE.fasta.screen.contigs.qual (the assembly consensus base quality values assigned by Phrap) and CLONE.fasta.screen.ace (a complicated-looking file that enables one to view the result of the assembly in the Consed assembly viewer/editor program).

In the file CLONE.fasta.screen.contigs.qual, Phrap provides quality information about the assembly (i.e., quality values for the contig sequence) by generating its own quality measures based on read-read confirmation. This process appears to be rule-based, and few references describe it. For example, if all input quality values (given by Phred) are relatively small (less than 15), Phrap assumes that they do not correspond to error probabilities and attempts to rescale them so that the largest quality value is approximately 30; in contrast, if the input quality values are relatively high (≥ 40), Phrap may give a base in the contig (the consensus of more than one read base) a higher quality value, such as 90. After contig assembly, for a contig of length n, the average quality value is given by:

\[ \bar{q} = \frac{\sum_{i=1}^{n} q_i}{n}, \]

where $q_i$ is the quality value of the i-th base in the contig and n is the number of bases in the contig.
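As an illustration, the average quality of each contig can be computed from quality data laid out like a fasta file, with a header line per contig followed by whitespace-separated integer quality values. The sketch below assumes that layout; the contig names and values are invented, and the function names are hypothetical helpers rather than part of the Phrap package.

```python
def read_qual(text):
    """Parse fasta-like quality text into {contig_name: [quality values]}."""
    contigs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first word after '>'.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            contigs[name].extend(int(value) for value in line.split())
    return contigs


def average_quality(values):
    """Sum of the base quality values divided by the number of bases."""
    return sum(values) / len(values)


# Hypothetical quality data for two short contigs.
example = """>Contig1
30 40 90 90
45 35
>Contig2
20 20 50
"""

averages = {name: average_quality(q) for name, q in read_qual(example).items()}
print(averages)
```

In practice the text would be read from the CLONE.fasta.screen.contigs.qual file instead of an inline string.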

2.3.3 Gene finding

One of the main aims of the annotation pipeline is to aid the identification of protein-coding genes. This can be done by using a gene-finding program to predict gene models (ab initio gene finding), or by predicting possible genes based on the similarity of the sequence to other sequences, particularly previously identified ones. I used both approaches, as follows. Ab initio gene predictions were performed using FGENESH (SoftBerry). The automated gene prediction pipeline was hosted on the bioinfo server at the Computer Center, HKU. The original predictions were manually refined with assistance from GenomeScan, another gene prediction program, which combines sequence similarity and exon-intron composition (i.e., the two distinct types of evidence used by these classes of methods) into one integrated algorithm.

Evaluation of gene recognition accuracy

The predictive accuracy of a gene-finding program is evaluated by comparing the exons predicted by the program with the actual coding exons at the nucleotide level and the exon level [31]. For nucleotide-level accuracy, define the values TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives) as follows: TP = the number of coding nucleotides predicted as coding; TN = the number of noncoding nucleotides predicted as noncoding; FP = the number of noncoding nucleotides predicted as coding; FN = the number of coding nucleotides predicted as noncoding. Sensitivity is then defined as the proportion of coding nucleotides that are correctly predicted as coding:

\[ Sn = \frac{TP}{TP + FN}, \]

and specificity as the proportion of nucleotides predicted as coding that are actually coding:

\[ Sp = \frac{TP}{TP + FP}. \]

For exon-level accuracy, the formulas for exon-level sensitivity (ESn) and specificity (ESp) are:

\[ ESn = \frac{TE}{AE}, \qquad ESp = \frac{TE}{PE}, \]

where TE (true exons) is the number of exactly predicted exons, and AE and PE are the numbers of annotated and predicted exons, respectively.
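These accuracy measures can be computed as in the following sketch. The input representation is an assumption made for illustration: nucleotide labels are given as equal-length strings of 'C' (coding) and 'N' (noncoding), and exons as (start, end) coordinate pairs. The example annotations and predictions are invented.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp) from per-nucleotide labels, 'C' = coding, 'N' = noncoding."""
    pairs = list(zip(annotated, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)  # coding predicted as coding
    fp = sum(a == "N" and p == "C" for a, p in pairs)  # noncoding predicted as coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)  # coding predicted as noncoding
    return tp / (tp + fn), tp / (tp + fp)


def exon_accuracy(annotated_exons, predicted_exons):
    """Return (ESn, ESp); an exon counts as true only if both boundaries match."""
    annotated, predicted = set(annotated_exons), set(predicted_exons)
    te = len(annotated & predicted)
    return te / len(annotated), te / len(predicted)


# Hypothetical annotation and prediction for a 15-nucleotide region.
sn, sp = nucleotide_accuracy("NNCCCCNNNNCCCNN", "NNCCCNNNNCCCCNN")

# Hypothetical exon coordinates: two of the three annotated exons are exactly predicted.
esn, esp = exon_accuracy(
    [(100, 200), (300, 400), (500, 650)],
    [(100, 200), (310, 400), (500, 650), (700, 750)],
)
print(sn, sp, esn, esp)
```

The second predicted exon misses the annotated start by 10 bp and therefore counts as both a missed annotated exon and a wrong predicted exon, which is why exon-level measures are stricter than nucleotide-level ones.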

Combining predictions from two gene-finding programs

Gene-finding programs are still unable to provide automatic gene discovery with the desired correctness. The benefits of combining predictions from more than one existing gene prediction program have been explored [268]. Therefore, methods for combining predictions from two programs, GenScan and HMMgene, were used in the prediction of P. marneffei genes, in an attempt to improve the exon-level accuracy of gene finding by identifying more probable exon boundaries and by eliminating false positive exon predictions. The scripts implementing these methods were obtained from http://www.cs.ubc.ca/labs/beta/genefinding/. Note that at the time this combined prediction study was conducted, the gene-finding program FGENESH was not yet available. A later retrospective test was nevertheless conducted, combining FGENESH with either GenScan or HMMgene.

2.3.4 Database and databank to store results

The first step in database design is to decide what the database will be used for and how users will interact with it. Once these are defined, the data to be stored, and how these data are associated with one another, are defined using a conceptual data model. The model is independent of how the information will be stored in the final, physical implementation on the computer. Entities such as gene, contig and gene product are defined to represent, informally, concepts from the real world. The relationships between these concepts were also defined: for example, a contig contains more than one gene, and one gene generally produces one gene product. A formal language, Unified Modelling Language (UML), was used for specifying both use cases and conceptual data models.

The next step is the physical implementation of the data model, for which a database management system (DBMS) has to be selected. Here I used Microsoft Access, a relational database manager running on the Windows operating system. It is available in our departmental facilities and is powerful and efficient enough for medium-sized database management. It has straightforward web-publication capabilities and intuitive capabilities for building graphical user interfaces. Administrators of the database work through the application interface, while users interact with the database through a web interface. The physical implementation of the conceptual data model was mediated by the database schema (Fig. 2.3).

Large-scale data that are to be made accessible to the community should be well curated, annotated and documented, and appropriately formatted for publication. At present, no universally accepted standard data format exists for genomics data. Here, I adopted GFF (http://www.sanger.ac.uk/Software/formats/GFF) and GenBank format to transfer information to and from public databases and applications. The database was populated using Perl scripts written with ActiveState Perl version 5.6 for Windows (downloaded from http://www.activestate.com) and the BioPerl modules (obtained from http://www.bioperl.org).
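For readers unfamiliar with GFF, the sketch below parses one tab-delimited GFF record into named fields before it would be loaded into a database. It assumes the nine-column layout of the format (sequence name, source, feature, start, end, score, strand, frame, attributes). The example record, including the scaffold and gene identifiers, is invented, and the sketch is written in Python for brevity rather than reproducing the Perl scripts used in this work.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Split one tab-delimited GFF record into typed, named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF record needs at least 8 tab-separated fields")
    fields += [""] * (9 - len(fields))  # the attributes column is optional
    seqname, source, feature, start, end, score, strand, frame, attributes = fields[:9]
    return GffFeature(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),  # '.' marks a missing score
        strand, frame, attributes,
    )


# Hypothetical record describing one predicted exon.
line = "scaffold_1\tFGENESH\texon\t1201\t1350\t.\t+\t0\tgene_id pm_0001"
feat = parse_gff_line(line)
print(feat.feature, feat.start, feat.end, feat.end - feat.start + 1)
```

Because coordinates in GFF are 1-based and inclusive, the length of a feature is end minus start plus one.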

2.3.5 Perl source code collection

In the annotation pipeline, a sequence of analysis steps, each using a different tool, must be carried out one after the other. The challenge was the absence of defined standards for the input and output of the different tools. Because there is no explicit ‘contract’ between the tools as to what input and output formats each will support, any tool in the pipeline may at any time change the format of its input or output, breaking the system. To connect multiple tools together ‘smoothly’ and ‘robustly’, special ‘glue code’ was written, mostly in Perl. The collection of Perl scripts, organised into several modules, is available at the PMGD website.

2.3.6 Genome browser configuration

Visualisation of genomic information is not merely a matter of aesthetics. It is of practical use because it conveys more meaning to people than reading ‘cipher texts’. Three of the most prominent genome browsers are the Ensembl Genome Browser (http://www.ensembl.org/) by the European Bioinformatics Institute and the Sanger Institute, the Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/) by the National Center for Biotechnology Information, and the UCSC Genome Browser (http://genome.ucsc.edu/) by the University of California Santa Cruz Genome Bioinformatics Group. They are highly specialised for their particular data types and information. Most genome browsers can work either online or offline. They are usually developed in Perl, Java or other high-level languages.

PMGD incorporates two free but powerful genome browsers, Argo (a Java applet; Fig. 2.2) and GBrowse (Generic Genome Browser), in order to organise and annotate genomic data. GBrowse (http://www.gmod.org/) combines a database and interactive web pages for manipulating and displaying annotations on genomes. Setting it up requires three steps: installation, configuration and customisation. Installation is an easy walk-through following the instructions. Configuration and customisation are both done through a configuration file. The host machine is equipped with a Pentium III processor at a clock speed of 800 MHz and 128 MB of main memory. ActivePerl, BioPerl and the Apache web server must be installed. There is an advanced option for choosing between the ‘in-memory’ database and the relational database MySQL for storing the sequence and annotation information. For a genome the size of that of P. marneffei, the ‘in-memory’ architecture is sufficient. Sequence files (in FASTA format) and annotation files (in GFF format) are stored under ‘$HTDOCS/gbrowse/databases’ in the directory of the Apache web server. The configuration file (.conf) defining the settings is stored in ‘$CONF/gbrowse.conf’. GBrowse is highly customisable. For example, administrators can use different colours or shapes to represent exons, introns and other genetic elements. More sophisticated functions, such as the display of different reading frames, transcription profiles, ESTs and alignments, are also provided. Administrators can freely customise the browser by switching these functions on or off and altering the default settings, so that it better fits the purposes of a particular database.

2.3.7 Synteny identification

To perform synteny analyses, amino acid identity between P. marneffei and A. nidulans (or other fungi) was first determined by comparing the predicted proteins from each fungus using BLASTP. Putative ortholog pairs were predicted using the INPARANOID program [261]. The putative ortholog pairs were aligned using ClustalW, and the amino acid per cent identity for each pair was calculated. If the alignment spanned 60% of both genes and the alignment score was within 80% of the top score for either gene of the pair, the pair was accepted. Using these putative ortholog pairs, supercontigs were compared with the ADHoRe program [322] (r² cutoff = 0.8, maximum gap size = 35 genes, minimum number of pairs = 3). Results were filtered such that the maximum probability of a segment being generated by chance was < 0.01.

Figure 2.2: PMGD genome browser.

2.4 Results

2.4.1 Statistics of assembly

As mentioned in Section 1.3.2, all inserts were sequenced from both ends to generate paired reads. These paired sequence fragments were assembled using the Phrap package of assembly tools [84], yielding a draft assembly. In total, 98.35% of the reads were assembled into 273 supercontigs (2,911 contigs). The longest contig is 178,730 bp and the longest supercontig is 729,276 bp. The fidelity of the assembly is supported by the high proportion (80.50%) of plasmid-end pairs preserved in contigs and scaffolds. The net length of the assembled contigs totalled 28.98 Mbp, including the mitochondrial genome of ∼35 kbp (Table 2.2).

Table 2.2: Summary of assembly statistics.

Features                                   Value
Read
  Total Number of Reads Sequenced          315,580
  Number of Bases in Total Reads           173,664,505 bp
  Average Read Length                      550.20
  Number of Confirmed Reads (by Phrap)     310,365
  Fraction of Reads Assembled              98.35%
  Fraction of Reads Paired in Assembly     80.50%
  Number of Bases Used in Assembly         170,951,774 bp
  Average Shotgun Coverage                 6.6 fold (Phrap report)
Contig
  Total Number of Contigs                  2,911
  Number of Bases in Contigs               28,977,603 bp
  Longest Contigs                          178,730 bp
  Average Length of Contigs                9,955 bp
Supercontig (scaffold)
  Total Number of Supercontigs             273
  Number of Bases in Supercontigs          28,421,390 bp
  Longest Supercontigs                     729,276 bp
  Average Length of Supercontigs           104,110 bp

2.4.2 Genome size estimation

The genome size was approximated from the draft assembly by estimating the size of the gaps between contigs and scaffolds. As shown in Table 2.2, the total number of bases is 28.42 Mb in supercontigs and 28.98 Mb in contigs. These figures do not include gaps. Within a supercontig, the gaps between its constituent contigs are called within-supercontig gaps. As mentioned in Section 1.3.2, two sequencing clone libraries were constructed, carrying insert sizes of 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. For each pair of reads flanking a gap, the library of origin was identified, so the size of the gap between adjacent contigs in a supercontig could be derived from the size of the clones spanning it. When estimated gap sizes are included, the total physical length of all scaffolds is estimated to be 29.8 – 30.5 Mb. The gaps between supercontigs are called between-supercontig gaps. Their size is hard to estimate since no spanning clones are available. In addition, these gaps include difficult-to-sequence regions of the genome, such as the ribosomal DNA (rDNA) repeats, centromeres and telomeres. Taking these considerations into account, the genome size is estimated to be ∼31 Mb.

When sequencing is still at a stage of relatively low coverage, there is a ‘dynamic’ way to estimate genome size by applying the Lander-Waterman mathematical model. Assuming there is no cloning bias, the DNA fragments generated in the shotgun sequencing process are distributed along the chromosome according to a Poisson distribution [92]. The unsequenced fraction of a genome (double-strand) is:

$$p = e^{-nw/L}$$

where n is the number of reads, w is the average read length and L is the length of the genome. For a 20 Mb genome, about 120,000 reads of 500 bp would be required to produce theoretically about 95% (p = 0.05) coverage.

The number of unsequenced regions on both strands equals the number of contigs, N, which can be calculated as:

$$N = n e^{-nw/L}$$

For the sequence data obtained at that stage (about 60 Mb of reads), there were 119,744 reads in total with a mean length of 511 bp. Assembly with Phrap generated 13,861 contigs. Therefore, n = 119,744, w = 511 and N = 13,861, and the genome size can be calculated as follows:

$$L = -\frac{nw}{\ln(N/n)} \approx 28{,}377{,}000 \text{ bp}$$

In practice, the number of contigs is higher than the theoretical expectation, because Phrap needs a minimum overlap of nucleotides to link two reads together when assembling fragments. These overlap regions do not contribute to the actual coverage but are counted in the calculation as if they did. Another factor is the bias due to cloning difficulties [186].
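The calculations in this section can be reproduced with a few lines of Python. The sketch below simply evaluates the expressions for p, N and L given above; the function names are illustrative and no correction is made for minimum read overlap or cloning bias.

```python
import math


def unsequenced_fraction(n_reads: int, read_len: float, genome_len: float) -> float:
    """p = exp(-n*w/L): expected fraction of the genome covered by no read."""
    return math.exp(-n_reads * read_len / genome_len)


def expected_contigs(n_reads: int, read_len: float, genome_len: float) -> float:
    """N = n * exp(-n*w/L): expected number of contigs."""
    return n_reads * unsequenced_fraction(n_reads, read_len, genome_len)


def genome_size_from_contigs(n_reads: int, read_len: float, n_contigs: int) -> float:
    """Solve N = n * exp(-n*w/L) for L, giving L = -n*w / ln(N/n)."""
    return -n_reads * read_len / math.log(n_contigs / n_reads)


if __name__ == "__main__":
    # 20 Mb genome sampled with 120,000 reads of 500 bp
    print(unsequenced_fraction(120_000, 500, 20_000_000))
    # Genome size from the low-coverage assembly (n, w and N as in the text)
    print(genome_size_from_contigs(119_744, 511, 13_861))
```

With the values quoted in the text, the first call gives an unsequenced fraction of about 0.05 and the second a genome size of about 28.4 Mb, in agreement with the figures above.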

2.4.3 Accuracy of gene finding

The purpose of evaluating gene recognition accuracy was to select the best gene-finding program. A test data set composed of 103 Penicillium protein-coding genes that contain multiple exons was built. Our results show that FGENESH gives the most accurate prediction overall. With it, we can identify ∼90% of coding nucleotides with 12% false positives. It provides sensitivity (Sn) = 96% and specificity (Sp) = 89% at the base level, Sn = 92% and Sp = 84% at the exon level, and Sn = 85% and Sp = 67% at the gene level.
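As an illustration of how base-level accuracy can be measured, the following Python sketch compares annotated and predicted coding exons for one sequence. It assumes the convention common in gene-finding evaluations, in which Sn = TP/(TP+FN) and Sp = TP/(TP+FP); the exact definitions used for the figures above are not restated here, and the interval representation and function names are illustrative.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end) of a coding exon


def coding_positions(exons: Iterable[Interval]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def base_level_accuracy(annotated: Iterable[Interval],
                        predicted: Iterable[Interval]) -> Tuple[float, float]:
    """Return (Sn, Sp) at the nucleotide level for one sequence."""
    true_pos = coding_positions(annotated)
    pred_pos = coding_positions(predicted)
    tp = len(true_pos & pred_pos)   # coding bases predicted as coding
    fn = len(true_pos - pred_pos)   # coding bases that were missed
    fp = len(pred_pos - true_pos)   # non-coding bases predicted as coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

Exon-level and gene-level measures follow the same pattern, except that a prediction counts as correct only when both boundaries of the exon, or all exons of the gene, match the annotation exactly.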

2.4.4 Combination of gene finding

Gene recognition accuracy may be improved by combining predictions from two gene-finding programs. Rogic et al. [268] implemented a series of algorithms combining gene predictions from two existing gene-finding systems, GenScan and HMMgene. The combined algorithms were tested on the HMR195 sequence dataset and gave improved accuracy at both the nucleotide and exon levels, where the average improvement was 7.9% compared with the best result obtained by GenScan or HMMgene alone. In order to identify the most accurate gene prediction system for P. marneffei, I conducted an evaluation study to compare GenScan, HMMgene and the combined gene prediction system based on them. The improved accuracy obtained with the combined algorithm in Rogic’s study was not observed in our study, in which we used a dataset of 103 sequences with known genes from Penicillium species. Our results show that GenScan tends to give significantly better predictions than either of the other systems. At the nucleotide level, the sensitivity was 95% for GenScan, 89% for HMMgene and 92% for the combined algorithm.

Two considerations arise from the discouraging result obtained when the combined algorithm was applied to the dataset from Penicillium species. Firstly, the different performance of the combined algorithm in our study and Rogic’s is most likely caused by the difference in organisms. The HMR195 dataset used in Rogic’s study is composed of 195 human, mouse and rat sequences. Secondly, if two systems generate consistent predictions (whether good or bad), then combining them will not give better results. For the human and rodent dataset, GenScan and HMMgene performed differently, but neither of them was always superior to the other. When GenScan and HMMgene were applied to our dataset of sequences from Penicillium species, however, we found that GenScan always generated significantly better results than HMMgene. It does not help to combine gene-finding systems if one system is always superior.

As mentioned, FGENESH was not available at the time when the gene combination test was conducted. A later retrospective test indicated that no improvement could be obtained by combining FGENESH with either GenScan or HMMgene (data not shown). Consequently, we decided to use FGENESH alone to perform the gene prediction for this project.

2.4.5 Database and databank to store results

The physical deployment of the P. marneffei genome database differs from that of the annotation pipeline, which is hosted on a SUN Solaris server at the Computer Center, HKU. PMGD is located on a Windows 2000-based system at the Department of Microbiology, HKU, which is accessible as a workstation for administrators and as a web service system for general users.

2.5 Discussion

High-throughput DNA sequencing now offers a rapid and cost-effective approach to obtaining the most important and relevant of all genetic information: the complete DNA sequence of an organism. As the quantity of data increases for a genome project such as the P. marneffei genome project, researchers have to become more sophisticated about data management issues. This study developed such a system for the P. marneffei genome project. The system performs semi-automatic tasks of assembly analysis, gene prediction and analysis, and extragenic region analysis. In order to be compatible with the computer systems available at the Department of Microbiology, HKU, the system was designed to span multiple working environments and to integrate several public-domain and newly developed software programs capable of dealing with several types of databases. Our PMGD solution demonstrates a feasible way to handle the information and to manage large quantities of data internally or for public use.

The genome sequence was searched against the public protein databases using BLAST. Genes were predicted using FGENESH and adjusted manually with reference to GenomeScan. FGENESH was selected as the best predictor from a number of gene-calling programs validated against a test set of 103 previously characterised Penicillium protein-coding genes.

Ab initio gene finding is challenging in P. marneffei, for two reasons: 1) the lack of a training dataset. Training a gene-finding program normally requires more than 300 genes in order to reach statistical power.
However, for P. marneffei we do not have enough characterised genes; 2) the lack of cDNA, which is very useful for confirming initial gene predictions. Identifying genes that lack an available cDNA sequence requires other methods, such as interspecies homolog searches. We do have a small number of RST sequences available [364] but, owing to their poor sequence quality, they are of little help. Our solution to this problem was to apply a pre-existing gene-finding program, namely FGENESH. Generally speaking, if one uses a pre-existing gene-finding program on a newly sequenced organism, one expects inaccurate predictions. However, our evaluation shows that FGENESH trained with the A. nidulans dataset produced satisfactory results when applied to P. marneffei. This is due to the close phylogenetic relationship between the two species. We also tried to combine predictions made by more than one gene prediction system, an approach that has been proposed to improve gene prediction accuracy significantly. Unfortunately, because FGENESH is consistently better than any other gene-finding program available, we did not observe such an improvement after combination.

Figure 2.3: Database schema of PMGD.

Further directions can be envisaged on the basis of the current stage of the system. Firstly, one of the striking characteristics of the genomes of eukaryotic organisms is the existence of multigene families. This confounds the identification of orthologous relationships among genes in interspecies comparisons. In order to discriminate between orthologs and paralogs, more sophisticated algorithms are required. These algorithms should take phylogenetic information into account and integrate it into the protein prediction system. Secondly, when assigning a function to a protein, a controlled vocabulary applicable to all organisms should be used. The recently developed Gene Ontology [9] project has produced a dynamic controlled vocabulary that can cope with the ever accumulating and changing knowledge of gene and protein functions. Thirdly, the more function prediction systems develop, the more important the evaluation of their accuracy becomes. Iliopoulos (2002) has established a scoring scheme to measure the performance of prediction systems [143]. Despite this, considerable concerns are still raised regarding the accuracy of assignment and the reproducibility of methodologies. An evaluation of the performance of these systems is still missing at this stage.

In summary, modern biology has created an information explosion. The areas of whole-genome sequencing and functional genomics have produced a prodigious amount of data. This is the case in the P. marneffei genome project. This study provided a solution by offering an annotation pipeline that links various biological software programs in a systematic way, as well as a state-of-the-art database management system for storing and retrieving biological sequence data. It has been successfully applied to the day-to-day work of annotation for the most important thermally dimorphic fungus.

Chapter 3

MITOCHONDRIAL GENOME OF PENICILLIUM MARNEFFEI

The work described in this chapter is very closely based on a paper I have published with colleagues [353].

3.1 Introduction

Mitochondria are the power centres of the cell. They are generally the major sites of aerobic respiration and the energy production centre in fungi, providing the energy a cell needs to move, divide, produce secretory products and contract. They are small, oval-shaped, membrane-bound organelles, about the size of a bacterium, surrounded by highly specialised double membranes. The outer membrane is fairly smooth, but the inner membrane, where oxidative phosphorylation takes place, is highly convoluted, forming two compartments, the intermembrane space and the matrix. The reactions of the citric acid cycle and fatty acid oxidation occur in the matrix.

Mitochondria maintain their own genomes, and a number of mitochondrial genome sequences have become available. At present, the NCBI organelle genome resource maintains a collection of 350 completed mitochondrial genomes from different organisms, including 256 metazoans, 15 fungi, 9 plants and 22 others. The number is subject to change as sequencing efforts advance. The gene content of mitochondrial genomes is generally well conserved. In metazoans, for example, the mitochondrial genomes are generally circular, about 16 kb long, and encode three primary transcript types (13 proteins used for energy production, two rRNAs and 22 tRNAs). The homologous genes existing in the mitochondria of plants, protists, fungi and animals, and in the genomes of prokaryotes, make it possible to undertake inter-species gene comparisons. Next, I review the major components of the respiratory pathway of fungal mitochondria.

The common and invariant feature of the respiratory pathways of mitochondria is the production of ATP coupled to electron transport. The respiratory chain begins with electrons being transferred from NADH to complex I (NADH:ubiquinone oxidoreductase) or from the tricarboxylic acid cycle intermediate succinate to complex II (succinate:ubiquinone oxidoreductase). Electrons are then transferred via ubiquinone, complex III (ubiquinol:cytochrome c oxidoreductase), cytochrome c and complex IV (cytochrome c oxidase), and finally to molecular oxygen to give water (Fig. 3.1).

Complex I is composed of peptides encoded by both nuclear and mitochondrial genes (more than 25 nuclear genes and seven mitochondrially encoded genes, nad1, 2, 3, 4, 4L, 5 and 6), forming a large multisubunit complex that spans the inner mitochondrial membrane. Note that a few fungi, such as Saccharomyces cerevisiae and Schizosaccharomyces pombe, lack complex I, and many fungi have additional components, such as alternative NADH dehydrogenases and/or an alternative terminal oxidase (see review [152]). Complex III contains nine subunits, of which only apocytochrome b is encoded in the mitochondrion. Between complexes III and IV, cytochrome c resides in the intermembrane space and passes electrons. Cytochrome c is encoded by the nuclear cyc-1 gene. Complex IV contains 7-8 polypeptides, of which three (cox1, cox2 and cox3) are encoded in the mitochondrion. It is the terminal oxidase of the standard respiratory pathway. Complex V is the mitochondrial ATP synthase, with subunits encoded by two ATP synthase subunit genes, atp6 and atp8.

Figure 3.1: Fungal respiratory pathways. The diagram was downloaded from http://pages.slu.edu/faculty/kennellj

Since several mitochondrial complexes have subunits encoded in both the mitochondrial and nuclear genomes, the coordinated expression of genes encoded in the nucleus and mitochondrion is critical for mitochondrial function. These mitochondrial complexes include not only the large respiratory complexes mentioned above, but also the translational machinery, which involves nuclear-encoded polypeptides and mitochondrially encoded rRNAs and tRNAs, and so on [240]. Therefore, the nuclear and mitochondrial genomes both contribute essential subunit polypeptides to important mitochondrial proteins, and they collaborate in the synthesis and assembly of these proteins (for review, see [256]).

In this chapter I report the complete sequence of the mitochondrial genome of Penicillium marneffei, the first complete mitochondrial DNA sequence of a thermally dimorphic fungus. This 35 kb mitochondrial genome contains the genes encoding ATP synthase subunits 6, 8 and 9 (atp6, atp8 and atp9), cytochrome oxidase subunits I, II and III (cox1, cox2 and cox3), apocytochrome b (cob), reduced nicotinamide adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4, nad4L, nad5 and nad6), the ribosomal protein of the small ribosomal subunit (rps), 28 tRNAs, and the small and large ribosomal RNAs. Analysis of gene content, gene order and gene sequences revealed that the mitochondrial genome of P. marneffei is more closely related to those of moulds than to those of yeasts.

3.2 Materials and Methods

3.2.1 Library construction and sequence assembly

The P. marneffei mitochondrial genome was sequenced as part of the P. marneffei whole genome sequencing project, as described in Chapters 1 and 2. A genomic DNA (including mitochondrial DNA) library was made in pUC18 carrying insert sizes from 2.0 to 8.0 kb. DNA inserts were prepared by physical shearing using the sonication method. This work was done by my colleagues in the Department of Microbiology, HKU, and the Beijing Genomics Institute. I used the Phred/Phrap/Consed software package for base calling, contig assembly and assembly quality assessment [83, 84, 112]. The complete mitochondrial genome was generated from the assembly of 467 successful sequence reads (100 bp at Phred value Q20 [112, 243]), which corresponded to an overall mitochondrial genome coverage of about 7×.

3.2.2 Mitochondrial DNA sequence annotation

The putative ORFs in P. marneffei mitochondrial DNA were identified using Artemis, a free sequence viewer and annotation tool, with the mould mitochondrial genetic code. The genes in which the putative ORFs were located were functionally assigned through BLASTP searches against fungal mitochondrion-encoded proteins available in the GenBank database. Introns and rRNAs were mainly identified by BLASTN pairwise comparison of P. marneffei mitochondrial DNA with the mitochondrial DNAs of Aspergillus nidulans, Neurospora crassa, Saccharomyces cerevisiae (Acc. NC_001224), Schizosaccharomyces pombe (Acc. NC_001326), Podospora anserina (Acc. NC_001329), Allomyces macrogynus (Acc. NC_001715), Pichia canadensis (Acc. NC_001762), Candida albicans (Acc. NC_002653), Yarrowia lipolytica (Acc. NC_002659) and Candida glabrata (Acc. NC_004691) [29, 91, 101, 354, 175, 262]. The BLASTN results were viewed through ACT, a DNA sequence comparison viewer based on Artemis [40], and exon and intron boundaries were adjusted manually. The tRNAs were predicted by tRNAscan-SE 1.21 [207]. The core structures of the group I introns were inferred by the program CITRON [200].

3.2.3 Phylogenetic analysis

Phylogenetic analysis was performed using MBEToolbox, as described in Chapter 10. The 11 genes that encode subunits of respiratory chain complexes (cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6) and the three that encode ATPase subunits (atp6, atp8 and atp9) in the P. marneffei mitochondrial genome, together with the corresponding genes in 24 other fungi with completed mitochondrial genomes, were used to determine the phylogenetic relationships of P. marneffei to the other fungi. Phylogenetic trees were constructed from unambiguously aligned portions of the concatenated amino acid sequences of these 14 protein-coding genes by the maximum likelihood method in the Phylip package [86]. The corresponding nad genes are not present in Schizosaccharomyces japonicus, Schizosaccharomyces octosporus, S. pombe, C. glabrata, Saccharomyces castellii, Saccharomyces servazzii and S. cerevisiae, and the maximum likelihood method is not as sensitive to a lack of sequence information as the distance methods. A total of 3,462 amino acid positions were included in the analysis.
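The concatenation step can be sketched in Python as follows. This is an illustration rather than the MBEToolbox procedure: the input layout is hypothetical, and padding a species that lacks a gene (for example the nad genes in several yeasts) with gap characters is an assumption about how absent genes are represented in the concatenated alignment.

```python
from typing import Dict, List

# The 14 protein-coding genes used for the tree
GENES = ["atp6", "atp8", "atp9", "cob", "cox1", "cox2", "cox3",
         "nad1", "nad2", "nad3", "nad4", "nad4L", "nad5", "nad6"]


def concatenate_alignments(per_gene: Dict[str, Dict[str, str]],
                           taxa: List[str]) -> Dict[str, str]:
    """Join per-gene alignments into one alignment per taxon.

    per_gene maps gene name -> {taxon: aligned amino acid sequence}.
    A taxon missing from a gene alignment is padded with '-' (assumption).
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for gene in GENES:
        alignment = per_gene.get(gene, {})
        if not alignment:
            continue  # gene not supplied at all
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences in {gene} alignment differ in length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

Every taxon ends up with a sequence of the same length, so the result can be written out in the input format of a maximum likelihood program.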

3.2.4 Mitochondrial DNA sequences in nuclear genome

Fragments of mitochondrial DNA sequences were searched for in the corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae and S. pombe. For each fungus, the mitochondrial DNA sequence was used as the query to search against its own nuclear genome, using a method published for S. cerevisiae [262]. The mitochondrial and genomic DNA sequences of A. nidulans and N. crassa were downloaded from the A. nidulans Database (http://www-genome.wi.mit.edu/annotation/fungi/aspergillus/) and the N. crassa Database (http://www-genome.wi.mit.edu/annotation/fungi/neurospora/), respectively, and those of S. cerevisiae and S. pombe were obtained from GenBank. For P. marneffei, the 6.6× coverage of genomic DNA sequences was generated by our own whole genome sequencing project.

3.3 Results and Discussion

3.3.1 Gene content and genome organisation

The mitochondrial DNA of P. marneffei is a circular DNA molecule of 35,438 bp (Fig. 3.2). The overall G+C content is 25%, and 24% in protein-coding genes. The genome encodes 28 tRNAs, the small and the large subunit rRNAs, the ribosomal protein of the small ribosomal subunit, 11 genes encoding subunits of respiratory chain complexes, and the three ATPase subunits (Table 3.1). All genes are encoded by the same DNA strand. 63.6% of the genome is occupied by structural genes (40.5% corresponds to protein coding exons, 5.9% to the 28 tRNA genes, and 17.3% to the rRNA subunits), 8.8% by intergenic spacers that are 14-372 bp in size, and 32.4% by the 11 introns.
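Summary statistics of this kind are straightforward to compute from the sequence and its annotation. The Python sketch below shows one way to obtain the G+C content and the fraction of the genome occupied by a set of features; the function names are illustrative, and the feature intervals are assumed to be non-overlapping, 1-based and inclusive.

```python
from typing import Iterable, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def feature_fraction(features: Iterable[Tuple[int, int]], genome_len: int) -> float:
    """Fraction of the genome covered by non-overlapping (start, end) features."""
    covered = sum(end - start + 1 for start, end in features)
    return covered / genome_len
```

For example, `feature_fraction([(12341, 13721)], 35438)` gives the share of the genome occupied by a single feature spanning positions 12341 to 13721.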

3.3.2 Protein coding genes

The P. marneffei mitochondrial genome contains 15 protein-coding genes. These include genes encoding ATP synthase subunits 6, 8 and 9 (atp6, atp8 and atp9), the cytochrome oxidase subunits I, II and III (cox1, cox2 and cox3), apocytochrome b (cob), the reduced nicotinamide adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4, nad4L, nad5 and nad6), and the ribosomal protein of the small ribosomal subunit (rps). This set of protein-coding genes is exactly the same as that in the A. nidulans mitochondrial genome. Furthermore, the order of the protein-coding genes is the same as that in the A. nidulans mitochondrial genome, except for the atp9 gene, which is located between the cox1 and nad3 genes in the A. nidulans mitochondrial genome but between the nad2 and cob genes in the P. marneffei mitochondrial genome (Fig. 3.3).

Figure 3.2: Physical map of P. marneffei mitochondrial DNA. The map is based on an annotation of the reverse complement of Assembly 3 of the P. marneffei mitochondrial sequence determined by the P. marneffei Sequencing Project at the University of Hong Kong in collaboration with Beijing Genomics Institute of Chinese Academy of Sciences. Numbers in the inner circle are in kb. The sequence is numbered from the unique restriction enzyme ClaI site (AT|CGAT) (0/35.4), which is located just upstream of the nad4L gene and downstream of the cox2 gene. Exons are shown in black, introns in white, and intronic ORFs in gray.

Table 3.1: Gene content of P. marneffei mitochondrial genome. * Exact start codon could not be determined merely through sequence comparison.

Genetic element  Localisation (nt)                  Size (bp)  Size (aa)  Start  Stop
nad4L            26-295                             270        89         ATG    TAA
nad5             295-2271                           1977       658        ATG    TAA
nad2             2289-4028                          1740       579        TTA    TAA
atp9             4216-4440                          225        74         ATG    TAA
cob              Join: (4706-5098, 6270-7037)       2332       386        ATG    TAA
cob-i1-ORF       5099-5965                          867        288        TTG*   TAA
nad1             Join: (7532-8179, 8650-9081)       1550       359        ATA    TAA
nad4             9253-10716                         1464       487        ATG    TAA
atp8             10945-11091                        147        48         ATG    TAG
atp6             11158-11928                        771        256        ATG    TAA
rns              12341-13721                        1381
nad6             14053-14637                        585        194        ATG    TAA
URF1             14722-15177                        456        151        ATG    TAA
cox3             15352-16161                        810        269        ATG    TAA
rnl              Join: (17165-19688, 21361-21902)   4738
rps              19987-21252                        1266       421        ATG    TAA
cox1             Join: (23339-23718, 24994-25099,   9821       561        ATT    TAA
                        26298-26641, 27740-27875,
                        29012-29201, 30504-30553,
                        31652-31806, 32835-33159)
cox1-i1-ORF      23720-24622                        903        300        AAA*   TAA
cox1-i2-ORF      25100-26200                        1101       366        AAA*   TAA
cox1-i3-ORF      26643-27647                        1005       334        AAA*   TAA
cox1-i4-ORF      27876-28928                        1053       350        TGA*   TAA
cox1-i5-ORF      29204-30043                        840        279        TTA*   TAA
cox1-i6-ORF      30554-31384                        831        276        ACA*   TAA
cox1-i7-ORF      31808-32629                        821        273        AGA*   TAG
URF2             33223-33660                        438        145        ATT    TAA
nad3             33955-34362                        408        135        ATG    TAA
cox2             34591-35346                        756        251        ATG    TAA
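For the split genes in Table 3.1, the relationship between the coordinates and the reported sizes can be checked with a short Python sketch. The reading used here is inferred from the numbers rather than stated in the table: the size in bp appears to be the full span from the first to the last exon position (introns included), while the size in aa corresponds to the summed exon length divided by three, minus the stop codon. The helper names are illustrative.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based inclusive (start, end), as in Table 3.1


def span_length(segments: List[Segment]) -> int:
    """Length from the first to the last position, introns included."""
    return max(end for _, end in segments) - min(start for start, _ in segments) + 1


def exon_length(segments: List[Segment]) -> int:
    """Summed length of the joined exon segments."""
    return sum(end - start + 1 for start, end in segments)


def protein_length(segments: List[Segment]) -> int:
    """Number of amino acids encoded, excluding the stop codon."""
    coding = exon_length(segments)
    if coding % 3 != 0:
        raise ValueError("joined exon length is not a multiple of three")
    return coding // 3 - 1


# Coordinates copied from Table 3.1
NAD1 = [(7532, 8179), (8650, 9081)]
COB = [(4706, 5098), (6270, 7037)]
COX1 = [(23339, 23718), (24994, 25099), (26298, 26641), (27740, 27875),
        (29012, 29201), (30504, 30553), (31652, 31806), (32835, 33159)]
```

Under this reading, nad1 has a span of 1550 bp and encodes 359 aa, cob spans 2332 bp and encodes 386 aa, and cox1 spans 9821 bp and encodes 561 aa, matching the table entries.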

Concatenated amino acid sequences of the 14 protein-coding genes in the mitochondrial genomes of P. marneffei and 24 other fungi were used for phylogenetic tree construction. The closest relatives of P. marneffei were A. nidulans and other moulds, such as P. anserina, N. crassa, Hypocrea jecorina and Verticillium lecanii (Fig. 3.4). On the other hand, the yeasts, such as the Saccharomyces species, Schizosaccharomyces species, Candida species and P. canadensis, were more distantly related to P. marneffei. This implies that, phylogenetically, the mitochondrial genome of P. marneffei is more closely related to those of moulds than to those of yeasts. This is in line with our previous observation and with results published by others: when the chromosomal 18S rRNA genes, or the internal transcribed spacers and 5.8S rRNA genes (ITS1-5.8S-ITS2) and mitochondrial small subunit rRNA genes, were used for phylogenetic tree construction, the closest neighbours of P. marneffei, besides the other Penicillium species, were the Aspergillus species as well as other moulds [202, 364]. Furthermore, the same gene content and almost the same gene order in the mitochondrial genomes of P. marneffei and A. nidulans also imply that the mitochondrial genome is probably not related to the unique characteristic of thermal dimorphism of P. marneffei. Interestingly, MP1, the gene that encodes an abundant and highly immunogenic protein in P. marneffei, has known homologues only in A. nidulans, A. fumigatus and A. flavus, and not in other fungi [37, 39, 38, 363, 43, 351, 352].

Figure 3.3: Gene content and order comparison between P. marneffei mitochondrial DNA and A. nidulans mitochondrial DNA. The only exonic gene that has undergone gene rearrangement is atp9, which is highlighted with a black background.

Figure 3.4: Phylogenetic relationships of P. marneffei to other fungi and distribution of group I and group II introns in the corresponding fungi. Maximum likelihood tree showing phylogenetic relationships of P. marneffei to other fungi and distribution of group I and group II introns in the corresponding fungi. The tree was constructed using unambiguously aligned portions of concatenated amino acid sequences of the 14 protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6). A total of 3462 amino acid positions were used for the inference with ProML [86]. Sequences were obtained from GenBank: Allomyces macrogynus (NC_001715), Aspergillus nidulans (CAA32799, CAA33481, AAA99207, AAA31737, CAA25707, AAA31736, CAA23994, P15956, CAA23995, CAA33116), Candida albicans (NC_002653), Candida glabrata (NC_004691), Cryptococcus neoformans var. grubii (NC_004336), Harpochytrium sp. JEL105 (NC_004623), Harpochytrium sp. JEL94 (NC_004760), Hyaloraphidium curvatum (NC_003048), Hypocrea jecorina (NC_003388), Monoblepharella sp. JEL15 (NC_004624), Neurospora crassa (CAA24041, CAA32799, AAA31961, CAA27029, CAA27418, AAA66053, AAA31959), P. marneffei (present study), Pichia canadensis (NC_001762), Podospora anserina (NC_001329), Rhizophydium sp. 136 (NC_003053), Saccharomyces castellii (NC_003920), Saccharomyces cerevisiae (NC_001224), Saccharomyces servazzii (NC_004918), Schizophyllum commune (NC_003049), Schizosaccharomyces japonicus (NC_004332), Schizosaccharomyces octosporus (NC_004312), Schizosaccharomyces pombe (NC_001326), Spizellomyces punctatus (NC_003052, NC_003061 and NC_003060), Verticillium lecanii (NC_004514), Yarrowia lipolytica (NC_002659). Some sequences of A. nidulans were downloaded from the Fungal Mitochondrial Genome Project (http://megasun.bch.umontreal.ca/People/lang/FMGP/FMGP.html), and some sequences of N. crassa were downloaded from http://pages.slu.edu/faculty/kennellj/genbank.html. The intron distribution panel distinguishes group I introns with an intronic ORF, group I introns without an intronic ORF, and group II introns; genes that are not present are crossed out. The scale bar indicates the branch lengths that were scaled in terms of expected numbers of amino acid substitutions.

3.3.3 Genetic code and codon usage

Since the mitochondrial genome of P. marneffei is phylogenetically closely related to those of moulds and its gene content is the same as that of A. nidulans, the genetic code of the mitochondrial genome of P. marneffei is assumed to be the same as that of A. nidulans.

There is a strong codon usage bias in exonic ORFs in the mitochondrial genome of P. marneffei towards codons ending in A or T. In fact, eight codons (CTC, CTG, ACG, TGC, TGG, CGC, CGG and GGC) were not used at all, five codons (GTC, TCC, TCG, ACC and AGG) were used only once, and nine codons (ATC, CCG, GCC, GCG, CAC, CAG, AAG, GAC and GGG) were used 2 to 10 times in exonic ORFs. Moreover, this bias is also evident in the use of stop codons: TAA is used as the stop codon in 14 genes, whereas TAG is used in only one gene.
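Codon counts of this kind can be tabulated with a few lines of Python. The sketch below is illustrative only: it counts codons in a single in-frame ORF and reports the fraction ending in A or T, and the function names are not taken from any existing package.

```python
from collections import Counter


def count_codons(orf: str) -> Counter:
    """Count codons in an in-frame ORF (length must be a multiple of three)."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return Counter(orf[i:i + 3] for i in range(0, len(orf), 3))


def at_ending_fraction(counts: Counter) -> float:
    """Fraction of codons whose third position is A or T."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    at_ending = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at_ending / total


def total_usage(orfs) -> Counter:
    """Sum codon counts over several ORFs, e.g. all exonic ORFs of a genome."""
    usage: Counter = Counter()
    for orf in orfs:
        usage.update(count_codons(orf))
    return usage
```

Summing the counts over all exonic ORFs with `total_usage` yields a usage table of the same form as the one reported for the protein-coding genes, from which unused and rarely used codons can be read off directly.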

3.3.4 tRNA genes

Twenty-eight tRNA genes were identified in the P. marneffei mitochondrial genome (Fig. 3.5). These are all located on the same DNA strand as the other genes. The set of mitochondrial tRNAs in P. marneffei is similar in type to that in A. nidulans. Furthermore, the sequences of the mitochondrial tRNA genes of P. marneffei are fairly well conserved relative to those of A. nidulans, especially within the two tRNA gene clusters of the two species (Fig. 3.3).

3.3.5 Other RNA genes

The genes that encode the 23S and 16S ribosomal RNAs of the large and small subunits of the ribosome (rnl and rns) were identified. Furthermore, a gene (rps), located within the intron of rnl (Table 3.1 and Fig.

Table 3.2: Codon usage in protein-coding genes of the P. marneffei mitochondrial genome. Numbers indicate the total numbers of codons in either identified protein-coding genes (Genes) or ORFs (including free-standing URFs, intronic ORFs and RPS).

Codon  AA  Genes  ORFs    Codon  AA  Genes  ORFs
TTT    F     307   143    TCT    S     160    93
TTC    F      66    13    TCC    S       1     5
TTA    L     572   250    TCA    S     105    45
TTG    L      26    33    TCG    S       1    13

CTT    L      49    42    CCT    P     119    35
CTC    L       0     6    CCC    P       4     2
CTA    L      20    24    CCA    P      25    20
CTG    L       0     4    CCG    P       4     3

ATT    I     182   134    ACT    T     121    78
ATC    I      10    12    ACC    T       1     7
ATA    I     326   162    ACA    T     105    45
ATG    M     112    38    ACG    T       0     4

GTT    V     132    74    GCT    A     144    49
GTC    V       1     3    GCC    A       4     7
GTA    V     131    70    GCA    A      81    35
GTG    V      18     5    GCG    A       7     3

TAT    Y     191   180    TGT    C      24    21
TAC    Y      32    27    TGC    C       0     4
TAA    *      14     9    TGA    W      56    37
TAG    *       1     1    TGG    W       0     5

CAT    H      76    47    CGT    R      10    24
CAC    H       8     7    CGC    R       0     1
CAA    Q      83    75    CGA    R       0     1
CAG    Q       5     7    CGG    R       0     2

AAT    N     196   277    AGT    S     123    90
AAC    N      11    30    AGC    S      15     8
AAA    K     101   347    AGA    R      78    94
AAG    K       6    18    AGG    R       1     9

GAT    D      97   112    GGT    G     188    94
GAC    D       3    11    GGC    G       0     1
GAA    E      89   133    GGA    G      92    32
GAG    E      21    21    GGG    G       6    13

[Figure 3.5 image not reproduced: predicted clover-leaf structures of tRNAs 1 to 28. Amino acid labels in figure order: Asn, Cys, Arg, Asn, Tyr, Arg, Lys, Gly, Gly, Asp, Ser, Trp, Ile, Ser, Pro, Thr, Glu, Val, Met, Met, Leu, Ala, Phe, Leu, Gln, Met, His, Pro.]

Figure 3.5: 28 tRNAs encoded in the mitochondrial genome of P. marneffei. Predicted clover-leaf structures of the 28 tRNAs encoded in the mitochondrial genome of P. marneffei. Anticodons are underlined and the corresponding amino acids are indicated. tRNAs are listed according to the order of their positions in the map in Fig. 3.2.

3.6), that encodes the ribosomal protein of the small ribosomal subunit, which is also present in the A. nidulans mitochondrial genome, was also identified.

[Figure 3.6 image not reproduced: secondary-structure diagrams of two group I introns, labelled Pm Lsu.1 and Pm Cox1.3, showing paired regions P3 to P8, the sizes (bp) of omitted segments, the RPS5 ORF, and mitochondrial genome coordinates.]

Figure 3.6: Predicted secondary structures of two representative group I introns. Group I introns PmRnl.1 and PmCox1.3, of the rnl and cox1 genes respectively, in P. marneffei. The exon/intron boundaries are represented by dotted lines. Base pairs are depicted by bars. The sizes of nucleotide stretches not shown are indicated in bp. The RPS5 gene is depicted by a square box. The numbers correspond to the coordinates in the mitochondrial genome.

3.3.6 Group I introns

In P. marneffei, the cox1 gene contains seven introns (PmCox1.1, PmCox1.2, PmCox1.3, PmCox1.4, PmCox1.5, PmCox1.6, and PmCox1.7), while the cob, nad1, and rnl genes contain one intron each (PmCob1.1, PmNad1.1, and PmRnl1.1 respectively). Each intron in the cox1, nad1, and rnl genes contains an ORF. The ORF in the rnl gene

Table 3.3: Presence of mitochondrial DNA fragments in nuclear genomes. ‘Nuc no.’, number of mtDNA fragments in the nuclear genome; ‘Mt size’, size of the mitochondrial genome (kb); ‘Nuc size’, size of the nuclear genome (Mb); ‘Ratio’, ratio of mitochondrial to nuclear genome size (kb/Mb).

Fungus          Nuc no.   Mt size   Nuc size   Ratio
P. marneffei    10        35.4      ∼29.5      ∼1.20
A. nidulans     17        ∼33.2     ∼31.0      ∼1.07
N. crassa       21        ∼64.8     ∼43.0      ∼1.51
S. cerevisiae   34        85.7      12.1       7.08
S. pombe        21        19.4      13.8       1.41

encodes the rps gene. The predicted secondary structures of two representative group I introns are depicted in Fig. 3.6. In both introns, the upstream exon ends with a T and the intron ends with a G, as is typical of most group I introns. A comparison of the distribution of group I and group II introns in the 14 protein-coding genes and the rnl gene of the P. marneffei mitochondrial genome with that in the corresponding genes of the other 24 fungi is shown in Fig. 3.4. As a whole, the distribution of these introns in the P. marneffei mitochondrial genome agrees with that in the other fungi. The cox1 gene, which contains the largest number of self-splicing introns in other mitochondrial genomes, also contains the largest number in the P. marneffei mitochondrial genome. The cob and nad1 genes, which also contain significant numbers of self-splicing introns in other fungi, each possess one group I intron in the P. marneffei mitochondrial genome.

3.3.7 Mitochondrial DNA sequences in nuclear genome

The presence of mitochondrial DNA sequence fragments in the corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae, and S. pombe was compared (Table 3.3). By using the same method of sequence similarity comparison used for S. cerevisiae [262],

Table 3.4: P. marneffei mitochondrial DNA sequences present in nuclear genome.

No.   Coordinates     Size (bp)   Location       E-value
1     9031..9069      39          nad1           9e-08
2     10182..10201    20          nad4           1e-03
3     11622..11697    76          atp6           2e-15
4     13445..13465    21          rrs            2e-04
5     15158..15177    20          nad6 – cox3    1e-03
6     18757..18776    20          rnl            1e-03
7     25168..25187    20          cox1           1e-03
8     31197..31216    20          cox1           1e-03
9     32560..32580    21          cox1           2e-04
10    34510..34529    20          nad3 – cox2    1e-03

only 10 mitochondrial DNA sequence fragments were detected in the 4× coverage nuclear genome sequence of P. marneffei, which represents 95% of the genome (Table 3.4). This number of mitochondrial DNA sequence fragments in the nuclear genome, as well as the ratio of mitochondrial to nuclear genome size, was comparable to those found in A. nidulans, N. crassa, and S. pombe (Table 3.3). In contrast, 34 mitochondrial DNA sequence fragments were found in the nuclear genome of S. cerevisiae, about twice as many as in the other fungi. Although the relatively high ratio of mitochondrial to nuclear genome size in S. cerevisiae may partly explain this difference, further studies are necessary to elucidate the significance of these mitochondrial DNA fragments in the nuclear genomes of the different fungi.
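The ratio column of Table 3.3 is simply the mitochondrial genome size (kb) divided by the nuclear genome size (Mb). As a minimal sketch, assuming the values printed in that table (several of which are approximate), the column can be recomputed as follows; the names used are illustrative only.

```python
# Rows transcribed from Table 3.3:
# species -> (mtDNA fragments in nuclear genome, mt size in kb,
#             nuclear size in Mb, ratio as printed in the table)
TABLE_3_3 = {
    "P. marneffei": (10, 35.4, 29.5, 1.20),
    "A. nidulans": (17, 33.2, 31.0, 1.07),
    "N. crassa": (21, 64.8, 43.0, 1.51),
    "S. cerevisiae": (34, 85.7, 12.1, 7.08),
    "S. pombe": (21, 19.4, 13.8, 1.41),
}


def size_ratio(mt_kb, nuc_mb):
    """Ratio of mitochondrial (kb) to nuclear (Mb) genome size."""
    return mt_kb / nuc_mb


def recomputed_ratios(table):
    """Recompute the 'Ratio' column, rounded to two decimals as in the table."""
    return {
        species: round(size_ratio(mt_kb, nuc_mb), 2)
        for species, (_, mt_kb, nuc_mb, _) in table.items()
    }


for species, ratio in recomputed_ratios(TABLE_3_3).items():
    printed = TABLE_3_3[species][3]
    print(f"{species:14s} recomputed {ratio:.2f}  printed {printed:.2f}")
```

The recomputed values agree with the printed column, which makes the contrast between S. cerevisiae and the other four fungi easy to reproduce.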

In conclusion, among the known mitochondrial genomes of fungi, the P. marneffei mitochondrial genome has an intermediate size. The replication origin of the P. marneffei mitochondrial genome is unknown. Despite the distinct biological property of thermal dimorphism in P. marneffei, its mitochondrial genome is much more closely related to those of moulds, especially that of A. nidulans, than to those of yeasts. The set of protein-coding genes in the P. marneffei mitochondrial genome is exactly the same as that in the A. nidulans mitochondrial genome. Except for the atp9 gene, the order of the protein-coding genes is also the same as that in the A. nidulans mitochondrial genome. Furthermore, when concatenated amino acid sequences of the 14 protein-coding genes in the mitochondrial genomes of P. marneffei and 24 other fungi were used for phylogenetic tree construction, the closest relatives of P. marneffei were A. nidulans and other moulds, whereas the yeasts were more distantly related.

Chapter 4

GENOMIC EVIDENCE FOR THE PRESENCE OF MELANIN BIOSYNTHESIS GENE CLUSTER IN PENICILLIUM MARNEFFEI

In this chapter, I first review fungal virulence factors and their identification by genomic approaches, and then present genomic evidence for the presence of melanin biosynthesis genes in Penicillium marneffei.

4.1 Introduction

In Chapter 3, the comparison of the mitochondrial genome of P. marneffei with those of other fungi showed that it is much more closely related to those of moulds, especially that of Aspergillus nidulans, than to those of yeasts. The set of protein-coding genes in the P. marneffei mitochondrial genome is exactly the same as that in the A. nidulans mitochondrial genome. Except for the atp9 gene, the order of the protein-coding genes is also the same as that in the A. nidulans mitochondrial genome. Furthermore, the amino acid sequence identity between the mitochondrial genes of P. marneffei and those of A. nidulans is significantly higher than that between the mitochondrial genes of P. marneffei and those of Neurospora crassa, Candida albicans, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. This evidence of a close relationship between P. marneffei and Aspergillus species prompted a further search for previously undiscovered characteristics of P. marneffei based on our knowledge of the various Aspergillus species.

Melanins are negatively charged pigments of high molecular weight with hydrophobic surfaces. They are formed by the oxidative polymerisation of phenolic and/or indolic compounds [341]. They are carcinogens that are widespread in agricultural products and food. They are mainly produced by various Aspergillus species, such as A. parasiticus and A. flavus, and less frequently by A. nomius, A. pseudotamarii, and A. bombycis [170]. Since melanin is made by these important pathogenic fungi and has been implicated in the pathogenesis of a number of fungal infections, it is of interest to investigate whether P. marneffei can synthesise melanin or melanin-like compounds. Here, after the literature review, I report progress in identifying a gene cluster in P. marneffei that contains homologues of six genes. In A. fumigatus, all six genes lie in a cluster spanning 19 kb and have been shown to be involved in DHN-melanin biosynthesis [24, 187, 317, 318]. These genes are alb1, arp1, arp2, abr1 and abr2, which encode a polyketide synthase, a scytalone dehydratase, a hydroxynaphthalene reductase, a putative protein possessing two signatures of multicopper oxidases, and a laccase, respectively, as well as ayg1, whose function is unknown. The gene order differs slightly between the clusters of the two fungi. These findings indicate that P. marneffei can potentially produce melanin or melanin-like compounds. Since melanin is an important virulence factor in other pathogenic fungi, this pigment may play a similar role in the pathogenesis of penicilliosis.

4.2 Literature Review

Most fungi cannot survive in the environment provided by human tissue and are therefore not pathogenic. Of the more than 100,000 fungal species that have been described, only a handful are pathogens. Pathogenic fungi are divided into two classes, primary pathogens and opportunistic pathogens. Primary pathogenic fungi, e.g., Coccidioides immitis and Histoplasma capsulatum, are “professional” pathogens that are adapted to live inside healthy mammalian and human tissue, causing disease not only in immunocompromised patients but also in healthy people. Opportunistic fungi may have an environmental reservoir or exist as commensals in a healthy host. Examples include Candida species, C. neoformans and A. fumigatus. These fungi are able to grow in and invade host tissue only when the host is immunocompromised. However, the incidence of life-threatening mycoses caused by opportunistic fungal pathogens has increased dramatically in recent years, and they are now the major cause of fungal infections. The infections caused by pathogenic fungi can be superficial, subcutaneous or systemic. Superficial infection is localised to the skin, the hair, and the nails; subcutaneous infection is confined to the dermis, subcutaneous tissue or adjacent structures; systemic infection refers to deep infection of the internal organs.

4.2.1 Potential virulence factors

A virulence factor in a fungus literally refers to any factor that the fungus possesses that increases its virulence in the host. For instance, a gene or protein that is essential for growth in vivo, but whose deletion does not affect mycelial growth in vitro, is considered a virulence factor [189]. The concept of a virulence factor differs between primary pathogens and opportunistic pathogens, and it is relatively difficult to define for the latter, as pointed out in [128]. For most fungal pathogens, few virulence factors that contribute to pathogenicity have been reported. Although the mechanisms of fungal pathogenicity remain poorly understood, the development of a fungal infection must satisfy several requirements. The fungus must first be able to adhere to the host tissues. It must then colonise the host and invade the host tissue. Once it has invaded the host tissue, it must be able to adapt to the tissue environment. Probably most importantly, the fungus must be able to avoid the host’s cellular defences.

Adherence to host tissues

Adherence factors are essential for fungal pathogens to attach to host tissue and to resist physical clearance of the infectious agent. For example, C. immitis, Aspergillus species, H. capsulatum and Cryptococcus neoformans all infect via the bronchial route and must have specific adaptations to avoid effective clearance from the host’s lungs. Adherence depends on a variety of factors, including surface glycoproteins, fungal cell surface hydrophobicity, pH, temperature, and, of course, the phenotype of the organism. Adhesins are biomolecules that promote the adherence of fungi to host cells or to host-cell ligands, binding several extracellular matrix proteins of mammalian cells, such as fibronectin, laminin, fibrinogen and collagen types I and IV. Among the many studies that have shown an association between adherence and fungal pathogenesis, those on adhesion in C. albicans are the most extensive. Candida species express several cell surface proteins, termed adhesins, which actively promote binding to host cells. These include a lectin-like protein that recognises sugar residues of epithelial cell surface glycoproteins, and a complement receptor-like protein, CR3, which may play a role in adherence to endothelial cells. Several adherence-promoting molecules, or adhesins, of C. albicans regulate attachment, invasion, and dissemination of the fungus [36, 157].

Als1p (agglutinin-like sequence) of C. albicans is a member of a family of seven glycosylated proteins with similarity to the S. cerevisiae α-agglutinin protein that is required for cell-cell recognition during mating. Als1p is essential for virulence in a hematogenously disseminated murine model [98].

HWP1 is a hyphal- and germ-tube-specific outer surface mannoprotein that binds C. albicans hyphae to human buccal epithelial cells [319]. The null mutant was less virulent than parental or single-gene-deleted strains in a hematogenously disseminated murine model. The yeast germinated less readily in the kidneys of infected mice and caused less endothelial cell damage [319].

C. albicans binds to several extracellular matrix (ECM) ligands, including fibronectin (FN), laminin and collagens I and IV. C. albicans expresses an integrin-like protein, INT1, which is 25% identical to a non-repeat region of the fibrinogen-binding protein, ClfA, of Staphylococcus aureus. Strains of C. albicans deleted in INT1 were less virulent and adhered less readily to an epithelial cell line [102]. Strains of C. albicans deleted in the α-1,2-mannosyltransferase gene (MNT1) are less able to adhere in vitro and are avirulent. Mnt1p is a type II membrane protein that is required for both O- and N-mannosylation in fungi and was found to be required for adherence to an epithelial cell line [34].

Adhesins of other medically important fungi, such as Blastomyces dermatitidis (a dimorphic fungal pathogen that infects the host through inhalation of conidia [276]), have also been characterised. One of these is WI-1, a 120-kDa surface protein adhesin of B. dermatitidis that binds CD18 and CD14 receptors on human macrophages [232]. Hogan et al. [133] cloned the WI-1 adhesin gene and found a total of 30 highly conserved repeats of a 24-amino acid sequence. The repeat sequence is similar to invasin, an adhesion-promoting protein of Yersiniae [169].

Invasion

Invasion is required for the development of deep mycoses in the internal tissues of the body. The process is probably aided by hydrolytic enzymes, such as proteinases and lipases and, in the case of dermatophytes, keratinases. Secretion of extracellular enzymes, such as phospholipase, has been proposed as one of the virulence mechanisms used by bacteria, parasites, and pathogenic fungi to overcome host defence mechanisms. The role of extracellular phospholipase as a potential virulence factor in pathogenic fungi, including C. albicans, C. neoformans, and A. fumigatus, has been reported. Of the four candidal phospholipases (PLA, PLB, PLC and PLD), only phospholipase B, encoded by PLB1, has been shown to contribute to virulence: C. albicans null mutants that failed to secrete it, constructed by targeted gene disruption, showed attenuated virulence when tested in two clinically relevant murine models of candidiasis. Initial data suggest that direct host cell damage and lysis are the main virulence mechanisms.

The secretion of lytic and degradative enzymes is also of obvious importance to the invasion of host tissues. These enzymes secreted by fungi can break down structural barriers and play an important role in mediating host tissue invasion. The most extensively studied example is the SAP gene family in C. albicans [294]. At least nine proteins make up the family of secreted aspartyl proteinases. In guinea pig and murine models of invasive disease, deletions in SAP1 to SAP6 attenuated virulence. The SAP genes have been shown to be differentially expressed according to the growth phase and phenotype of the organism: SAP2 mRNA was the dominant transcript in the yeast phase, whereas SAP4, SAP5 and SAP6 transcripts were observed only at neutral pH during the serum-induced yeast-to-hyphal transition. The order of expression, SAP1 and SAP2 followed sequentially by SAP8, SAP6 and SAP3, correlated with tissue invasion, i.e., early invasion (SAP1, SAP2), extensive penetration (SAP8) and extensive hyphal growth (SAP6). These data indicate that members of the SAP gene family may have distinct roles in the colonisation and invasion of the host [63].

Growth at elevated temperature/Thermotolerance

Thermotolerance is one of the most obvious factors leading to pathogenesis. The ability to grow at the body temperature of 37 °C and within the fever range of 38–42 °C is important for systemic infection. The majority of fungi have an optimum growth temperature of 25 to 30 °C, and may grow only weakly or not at all at 37 °C. The first genome-wide analysis of the temperature-regulated transcriptome of C. neoformans was carried out by Steen et al. [296], who identified sets of genes with higher transcript levels at 25 °C or 37 °C, respectively.

Morphology/Morphogenesis

There is a growing body of evidence linking morphogenesis and virulence. Changes in morphology are advantageous for fungal pathogens. It has been demonstrated that fungal hyphae can exert significant tip pressure for penetration [224]. Many fungi adopt such morphological changes as part of their virulence. Filamentous fungi (such as Aspergillus species) tend to form branched hyphae in the lung. C. neoformans, a unique encapsulated yeast, is coated with a polysaccharide capsule. The capsule is a potent inhibitor of macrophage phagocytosis, which is an important factor in resistance to C. neoformans infection.

The most remarkable ability shared among the dimorphic fungi, such as B. dermatitidis, C. immitis, H. capsulatum, Paracoccidioides brasiliensis and Sporothrix schenckii, is to switch between two distinct forms, yeast and mould. The dimorphic fungi normally exist as non-pathogenic forms (usually filamentous mycelia) in the environment and convert into pathogenic forms (yeasts) in the tissues of a host. This process is reversible, although the trigger for the conversion is unknown and differs among fungi. The importance of the yeast cell as an invasive morphology for dimorphic fungi has been reviewed by Gow et al. [113, 114]. As shown in Table 4.1, most dimorphic mycelial pathogens invade the tissues of a host as yeast cells. Yeast cells are regarded as better adapted for dissemination within the host circulatory system and for avoidance of immune capture. Note that although the opportunistic pathogens C. albicans and Candida tropicalis show dimorphic growth, these Candida species

Table 4.1: Major dimorphic fungal pathogens and their characteristic morphologies in infectious disease. Taken from [114].

Fungal species                   Form in diseased tissue
Blastomyces dermatitidis         Budding yeasts
Candida albicans                 (Pseudo)hyphae, budding yeasts
Candida tropicalis               Yeast and pseudohyphae
Coccidioides immitis             Endosporulating spherules
Cryptococcus neoformans          Budding capsulate yeasts
Histoplasma capsulatum           Budding yeasts
Paracoccidioides brasiliensis    Budding yeasts
Penicillium marneffei            Yeasts undergoing binary fission
Sporothrix schenckii             Budding yeasts
Wangiella dermatitidis           Budding yeasts

mainly form pseudohyphae and are therefore not regarded as true dimorphic fungi. Nevertheless, conversion to pseudohyphae has long been regarded as essential for tissue invasion by Candida species.

4.2.2 Genomic approaches in identification of virulence factors

In practice, combining several of the following techniques has great potential to elucidate biological systems in detail.

Mining whole genome sequences and fishing for virulence factors

The sequencing of the genome of the budding yeast, S. cerevisiae, was a landmark in genomics. Since then, progress has been made in sequencing whole fungal genomes. The second complete sequence of a fungal genome, that of S. pombe, was published in 2002 [354]. The genome sequences of the filamentous fungi A. nidulans, A. fumigatus, N. crassa and Ashbya gossypii are nearing completion (see also Section 1.2.4). In 2002, at an early stage, the Fungal Genome Initiative (FGI), a genome sequencing program of the National Human Genome Research Institute, USA, proposed to sequence 15 fungi selected on the basis of medical, scientific and commercial criteria. The FGI will apply deep-shotgun sequencing approaches (sequencing coverage > 10×) in order to finish all sequencing work quickly. If fully funded, it will produce a large amount of valuable information for detailed comparative genomic analysis across the fungal taxa.

Genome sequences have an immediate impact on conventional fungal genetics by eliminating the years of effort previously associated with gene discovery. Traditional genetic and biochemical approaches to gene discovery in fungi suffered from many limitations, such as poor transfer efficiencies, the lack of stable extrachromosomal elements, and poor growth in the laboratory. With the genomic sequence in hand, one can bypass these limitations by using genomic approaches, which permit rapid identification of novel genes. Therefore, obtaining genome sequences from pathogenic fungi is one of the most efficient steps in identifying potential targets for therapeutic intervention and vaccination.

Other genomic approaches

Current genomic approaches can be categorised into three groups: mutagenic-based, nucleotide-based and protein-based [206]. The mutagenic-based techniques include signature-tagged mutagenesis and the construction of mutant libraries. Microarray analysis and serial analysis of gene expression (SAGE), for example, belong to the nucleotide-based techniques. The two-hybrid system, protein arrays and 2D-PAGE expression analysis are examples of protein-based techniques.

4.3 Materials and Methods

4.3.1 Identification of melanin biosynthesis genes in P. marneffei

To identify melanin biosynthesis genes in the P. marneffei genome, protein sequences of melanin biosynthesis genes of Aspergillus were downloaded from GenBank. The downloaded protein sequences were used as queries against the P. marneffei genome. The comparison was conducted using the NCBI TBLASTN program version 2.0 with the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign homologues was 1 × 10⁻²⁰. The contigs in the P. marneffei genome that contained homologues were extracted and annotated manually. Predicted peptides were compared to the amino acid sequences of their corresponding query proteins using NCBI BLAST2SEQ (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The statistics of the “expect value” were calculated based on the size of the NCBI non-redundant protein database. Conserved domains/motifs were identified using InterPro release 5.1 [367].
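The E-value cutoff described above amounts to a simple filter over the search hits. The sketch below illustrates this step under stated assumptions: the three-column tab-separated layout (query, contig, E-value), the function name and the example rows are all hypothetical and do not reproduce the output format or results of the TBLASTN runs in this study. Hits whose E-value equals the cutoff are kept, which is also an assumption.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-20  # cutoff stated in the text


def significant_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (query, contig, e_value) tuples with e_value <= cutoff.

    Assumes three tab-separated columns per line: query name, contig name
    and E-value. This layout is a simplification chosen for the sketch.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, contig, e_value = row[0], row[1], float(row[2])
        if e_value <= cutoff:
            hits.append((query, contig, e_value))
    return hits


# Made-up example rows, not results from this study.
EXAMPLE = (
    "queryA\tcontig1\t3e-45\n"
    "queryA\tcontig2\t0.002\n"
    "queryB\tcontig1\t0.0\n"
)

for query, contig, e_value in significant_hits(EXAMPLE):
    print(query, contig, e_value)
```

Contigs that appear in the filtered list would then be the ones carried forward to manual annotation.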

4.3.2 Multiple alignments and phylogenetic analyses

Multiple alignments of amino acid sequences were performed using the program ClustalX 1.81 [311]. Initial pairwise alignments were performed using the Blosum62 protein weight matrix, and the alignments were then adjusted manually. Graphic presentations of the alignments and consensus sequences were produced using the program BOXSHADE 3.21 (http://www.ch.embnet.org/software/BOX_form.html). Regions of ambiguous alignment were removed using the GeneDoc program (http://www.psc.edu/biomed/genedoc). Phylogenetic trees were inferred by the neighbour-joining method [273]. Bootstrap resampling with 1000 pseudoreplicates was carried out to assess support for each individual branch.
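Bootstrap resampling of an alignment draws columns at random, with replacement, until a pseudoreplicate of the original length is obtained; a tree is then inferred from each pseudoreplicate. The sketch below shows only the resampling step, using a toy alignment and illustrative function names. It is not the software used in this study, and tree inference by neighbour-joining is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by sampling columns with replacement.

    `alignment` maps sequence names to aligned sequences of equal length.
    The same sampled columns are used for every sequence.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Return a list of pseudoreplicate alignments (seeded for repeatability)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment for illustration only.
TOY = {"seqA": "MKTAYIAK", "seqB": "MKSAYLAK", "seqC": "MRTAFIAK"}
replicates = bootstrap_replicates(TOY, n_replicates=1000)
print(len(replicates), "pseudoreplicates of length", len(replicates[0]["seqA"]))
```

Support for a branch is then the proportion of pseudoreplicate trees in which that branch is recovered.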

4.4 Results and Discussion

4.4.1 Melanin gene cluster present in P. marneffei

Secondary metabolism, the production of compounds not essential for growth in culture, is thought to be integrally intertwined with development in fungi. These events, usually induced by nutrient availability, by biosynthesis or addition of an inducer, and/or by a decrease in growth rate, generate signals that effect a cascade of regulatory events resulting in chemical differentiation (secondary metabolism) and morphological differentiation (morphogenesis). Microbial secondary metabolites have a major effect on the health, nutrition and economics of our society. They include antibiotics, pigments, toxins, effectors of ecological competition and symbiosis, pheromones, enzyme inhibitors, immunomodulating agents, receptor antagonists and agonists, pesticides, antitumor agents and growth promoters of animals and plants. Among them, fungal secondary metabolites are of intense interest because of their pharmaceutical (antibiotics) and/or toxic (mycotoxins) properties. Unlike those of primary metabolism, the pathways of secondary metabolism are still poorly understood and thus provide opportunities for basic investigations of enzymology, control and differentiation. Recently, tremendous progress has been made in understanding the genes that are associated with the production of various fungal secondary metabolites. For example, work with Aspergillus species has revealed a link between asexual reproduction and the production of toxic secondary metabolites. One of the best-studied fungal secondary metabolic processes is the biosynthesis of melanin.

Based on the principle of similarity search, we took advantage of the whole genome sequence to test for the presence of this important genetic capacity in P. marneffei. Six genes are known to be involved in DHN-melanin biosynthesis in A. fumigatus: abr2, abr1, ayg1, arp2, arp1, and alb1 [318]. The functions or gene products of these genes are given in Table 4.2; the function of ayg1 is unknown. All of these genes are available from GenBank, and their order was determined by a previous genetic study [318] and further confirmed by the A. fumigatus genome project. The gene order is abr2-abr1-ayg1-arp2-arp1-alb1 (Fig. 4.2).

Table 4.2: Putative gene products related to melanin biosynthesis in P. marneffei.

Af protein   Pm protein   Function
alb1         pm-alb1      polyketide synthase
abr2         pm-abr2      brown 2
arp1         pm-arp1      scytalone dehydratase
arp2         pm-arp2      1,3,6,8-tetrahydroxynaphthalene reductase
ayg1         pm-ayg1      yellowish-green 1
abr1         pm-abr1      brown 1

[The table was printed rotated; the values of its remaining columns (accession no. of the Af protein, length (aa) Af/Pm, E-value, overlap length (aa), and identity/positive (%)) cannot be assigned to rows reliably from the extracted text.]

When the amino acid sequences of the proteins encoded by these 6 genes were used as queries against the P. marneffei genome, significant hits were obtained for all 6 proteins. When the predicted peptides of the corresponding contigs were compared to the amino acid sequences of the corresponding query proteins, the E-values of the 6 comparisons ranged from 5E-13 to 0 (Table 4.2), indicating high levels of similarity between the P. marneffei proteins and the A. fumigatus proteins. In A. fumigatus, abr1 encodes a multicopper oxidase and abr2 encodes a laccase. We detected weak sequence similarity (60% alignable overlap with 30% amino-acid positive similarity) between the two genes at the amino-acid level. This weak similarity suggests that the two genes are paralogs that originated from a gene duplication. In addition, we collected abr1 or abr2 homologs from some other fungal species and built a multiple alignment of the gene family (Fig. 4.1), which gives information about how the gene family has diverged.

Figure 4.1: P. marneffei abr1 gene Cu-oxidase domain homologues. Alignment of partial amino acid sequences of the Cu-oxidase domains of ascomycetes.

More importantly, the synthases of secondary metabolism are often encoded by genes clustered on chromosomal DNA. It has been suggested that such an organisation of genes may allow coordinated regulation of the pathway [337]. The 6 melanin biosynthesis genes are located in a gene cluster in P. marneffei (Fig. 4.2), and the gene order is largely conserved when compared to that of A. fumigatus. In P. marneffei, abr1-ayg1-arp2-arp1 are located on one contig, and abr2 and alb1 on two other contigs. Scaffolding suggests that these 3 contigs belong to a single scaffold, within which they are ordered one after another, i.e. uninterrupted by other contigs. The gene order in P. marneffei can therefore be inferred as abr1-ayg1-arp2-arp1-abr2-alb1. This placement is supported by 5 and 6 pairs of forward-reverse paired reads spanning the 2 gaps between the 3 contigs, respectively, so the 6 genes are likely to be ordered correctly and the length of the gene cluster can be closely approximated. As shown in Fig. 4.2, the 6 genes span over 35 kb of the P. marneffei genome, which is about twice the length of the cluster in A. fumigatus (19 kb). Most of this difference is due to a gene-free region of > 15 kb between abr2 and alb1 (Fig. 4.2). Comparing the gene order in the two fungi, the only change is that abr2 has moved from the beginning of the cluster (as in A. fumigatus) to the position after arp1 in P. marneffei. In addition, the direction of alb1 is reversed.

The tendency of genes for enzymes of certain metabolic pathways to be clustered in filamentous fungi has been noted previously [161]. Generally, these gene clusters encode optional pathways for nutrient utilisation (e.g., the optional carbon source, quinate) [107] or for the synthesis of secondary metabolites (e.g., the mycotoxin, sterigmatocystin) [28]. Unlike genes clustered as operons in prokaryotes, clustered genes in fungi are not cotranscribed, nor has any vital regulatory function for clustering been established [161]. Thus the reason for the existence of gene clusters in filamentous fungi has not been resolved.
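The gene-order comparison between the two melanin clusters can be illustrated with a short Python sketch. This is not the procedure used in this work; it only shows one way to find the smallest set of genes whose position differs between two gene orders, by keeping a longest common subsequence and reporting the rest. The function names are hypothetical, and gene orientation (such as the reversed alb1) is not modelled.

```python
from typing import List


def longest_common_subsequence(a: List[str], b: List[str]) -> List[str]:
    """Return one longest common subsequence of two gene-order lists."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one subsequence.
    out: List[str] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def relocated_genes(reference: List[str], query: List[str]) -> List[str]:
    """Genes of `query` that are not part of the conserved order."""
    conserved = set(longest_common_subsequence(reference, query))
    return [gene for gene in query if gene not in conserved]


# Gene orders as stated in the text (Fig. 4.2).
AF_ORDER = ["abr2", "abr1", "ayg1", "arp2", "arp1", "alb1"]
PM_ORDER = ["abr1", "ayg1", "arp2", "arp1", "abr2", "alb1"]

print(relocated_genes(AF_ORDER, PM_ORDER))
```

With the two orders given in the text, the conserved backbone is abr1-ayg1-arp2-arp1-alb1, so only abr2 is reported as relocated.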

Figure 4.2: Comparison of the melanin gene clusters of P. marneffei and A. fumigatus. Gene order in A. fumigatus: abr2, abr1, ayg1, arp2, arp1, alb1. Gene order in P. marneffei: abr1, ayg1, arp2, arp1, abr2, alb1. Scale bar: 5 kb.

4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei

With the possible exception of the penicillin metabolic cluster, the most thoroughly examined fungal secondary metabolite gene clusters are those involved in mycotoxin biosynthesis, particularly the aflatoxin (AF) and sterigmatocystin (ST) biosynthetic clusters found in several Aspergillus species [28]. These clusters contain a total of 23 genes involved in aflatoxin biosynthesis and other related functions in Aspergillus species (20 genes that encode enzymes, two genes that encode regulatory proteins, and one gene that encodes an efflux transport protein). No sequence information for cypA, norB, and ordB was available from GenBank at the time of analysis. The sequences of the remaining 20 genes, comprising 17 genes that encode enzymes (hexA, hexB, pksA, nor-1, avnA, adhA, norA, avfA, cypX, estA, vbs, ver1, moxY, verB, omtB, omtA, and ordA), two regulatory genes (aflR and aflJ) and one transport gene (aflT), were downloaded. When the amino acid sequences of these proteins were used as queries to search against the P. marneffei genome, significant hits (TBLASTN E-value cutoff 1.0e-10) were obtained for all 20 proteins. When the predicted peptides of the corresponding contigs were compared to the amino acid sequences of the corresponding query proteins, the BLASTP E-values of these comparisons ranged from 5.0e-13 to 0 (data not shown), indicating high levels of similarity between the P. marneffei proteins and the Aspergillus proteins. Notably, the putative P. marneffei homologues of omtA and ordA, whose products are responsible for the last step in the conversion of ST to AF, show high similarity to the corresponding genes in A. parasiticus.

Although putative homologues of the Aspergillus genes of the aflatoxin biosynthesis pathway are present in the P. marneffei genome, these genes do not form a cluster as they do in Aspergillus. This contradicts the general trend that genes involved in fungal secondary metabolism usually appear as a cluster, as in the A. flavus and A. parasiticus genomes. Because almost all of these genes lie on different contigs in the P. marneffei genome, the homologs we identified might be involved in the production of other, unknown secondary metabolites instead of aflatoxin. Alternatively, the genes of the aflatoxin biosynthesis cluster may have undergone major movement in P. marneffei during evolution, which might affect the ability to produce aflatoxins and the amount produced.
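The reasoning above (apply an E-value cutoff, then ask whether the hits share a contig) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the hit layout (query gene, contig ID, E-value), the function names and the toy contig identifiers are all hypothetical, and co-location on a contig is a conservative proxy because separate contigs may still belong to one scaffold.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed hit layout: (query gene, contig ID, E-value).
Hit = Tuple[str, str, float]


def best_contig_per_query(hits: Iterable[Hit],
                          evalue_cutoff: float = 1.0e-10) -> Dict[str, str]:
    """Keep, for each query, the contig of its lowest-E-value hit under the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, contig, evalue in hits:
        if evalue > evalue_cutoff:
            continue  # hit is not significant at this cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (contig, evalue)
    return {query: contig for query, (contig, _) in best.items()}


def genes_by_contig(assignment: Dict[str, str]) -> Dict[str, List[str]]:
    """Group query genes by the contig that carries their best hit."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for query, contig in assignment.items():
        grouped[contig].append(query)
    return {contig: sorted(genes) for contig, genes in grouped.items()}


def largest_colocated_fraction(grouped: Dict[str, List[str]]) -> float:
    """Fraction of genes found on the single most populated contig."""
    total = sum(len(genes) for genes in grouped.values())
    if total == 0:
        return 0.0
    return max(len(genes) for genes in grouped.values()) / total


# Toy data for illustration only; contig IDs and E-values are invented.
toy_hits = [
    ("pksA", "contig_1", 0.0),
    ("nor-1", "contig_2", 1.0e-50),
    ("aflR", "contig_3", 1.0e-30),
    ("aflR", "contig_9", 1.0e-5),  # discarded by the cutoff
]
grouped = genes_by_contig(best_contig_per_query(toy_hits))
print(grouped, largest_colocated_fraction(grouped))
```

A fraction close to 1 would indicate a cluster on one contig, whereas a value close to 1 divided by the number of genes indicates dispersed homologs, which is the pattern described for the aflatoxin genes.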

4.4.3 Absence of penicillin biosynthesis genes in P. marneffei

Genomic sequence provides evidence for the presence of genetic components, such as the melanin biosynthesis gene cluster. It can also provide evidence for the absence of important genetic components, which is equally valuable. The beta-lactam antibiotic penicillin, one of the most commonly used antibiotics for the therapy of infectious diseases, is produced as an end product by some filamentous fungi, such as Penicillium chrysogenum. Penicillin biosynthesis is catalysed by three enzymes encoded by the genes acvA (pcbAB), ipnA (pcbC) and aatA (penDE), which are organised in a gene cluster. Although the production of secondary metabolites such as penicillin is not essential for the direct survival of the producing organism, several studies have indicated that the penicillin biosynthesis genes are controlled by a complex regulatory network that responds to, e.g., the ambient pH, carbon source, amino acids and nitrogen. Most notably, this gene cluster is present in A. nidulans, which is a penicillin producer.

In conclusion, the coding capacity for a set of proteins that could be involved in melanin biosynthesis has been identified and reported here. The presence of these homologues suggests the potential ability of P. marneffei to synthesise melanin or melanin-like substances. Since melanin is a well-defined fungal virulence factor, it is reasonable to infer that it is also a virulence factor in P. marneffei, although experimental confirmation is required. In addition, although putative homologues of the Aspergillus genes of the aflatoxin biosynthesis pathway are present in the P. marneffei genome, these genes do not form a cluster as they do in Aspergillus. They might be involved in the production of other, unknown secondary metabolites.

Chapter 5

MATING ABILITIES IN PENICILLIUM MARNEFFEI

Penicillium marneffei was believed to be asexual, but the genome sequence analysis suggests that the fungus maintains the genetic capability for sexual reproduction. If confirmed, this raises the potential for developing powerful genetic tools for the organism, with far-reaching implications for its genetic study and disease control.

5.1 Introduction

The most distinctive feature of Penicillium marneffei is its temperature-dependent dimorphic switch. At 25°C P. marneffei exhibits true filamentous growth, while at 37°C it undergoes a dimorphic transition to produce uninucleate yeast cells that divide by fission. The control of this “dramatic” developmental process is of interest because it is required for pathogenicity and may therefore provide a target for controlling infection. Fungal dimorphic growth and mating are regulated by common signal transduction pathways, such as the mitogen-activated protein kinase pathway and the nutrient-sensing cAMP pathway. Studies of development in many fungi have converged to define these conserved pathways, which are organised in different ways to regulate filamentation, mating and virulence in different fungi as they adapt to unique environmental challenges [192]. Given such a common regulatory mechanism, it is not surprising to find an association between the mating process and virulence in some fungi. For example, MAT α strains of Cryptococcus neoformans are 30-fold more prevalent in the environment and 40-fold more prevalent in infections than MAT a strains [183, 193]. Candida albicans utilises a number of the same genes for both mating and pathogenesis. The mating pheromone of C. albicans elicits over-expression of a set of virulence genes in recipient cells [16]. Proteins encoded by these genes were previously shown to be required for virulence in a mouse model of disseminated candidiasis. It is therefore of particular interest to understand the P. marneffei mating system, which may parallel the dimorphic development and pathogenesis of this medically important fungus.

Traditionally, P. marneffei has been considered an asexual (anamorph) ascomycete that lacks an apparent sexual (teleomorph) stage in its life cycle and seems to reproduce only mitotically [44, 104]. Recent genetic studies, however, suggest that it may have an unidentified sexual cycle [20, 19]. Two homologs of the Aspergillus nidulans steA and stuA genes, stlA and stuA, have been cloned from P. marneffei [20, 19]. Both steA and stuA are involved in controlling mating in the sexual homothallic A. nidulans. The stlA gene plays no role in vegetative growth, asexual development, or dimorphic switching in P. marneffei, and it is able to complement the sexual defect of an steA mutant of A. nidulans [19]. The P. marneffei stuA gene encodes a basic helix-loop-helix (bHLH) protein of the APSES family and is thought to regulate both dimorphic growth and mating or asexual sporulation. Loss of stuA from P. marneffei had no obvious effect on dimorphic growth, and P. marneffei stuA is able to complement the conidiation defect of an A. nidulans stuA mutant [20]. Moreover, the P. marneffei tupA gene, a homolog of rcoA, is able to complement both the asexual and sexual development phenotypes of an A. nidulans rcoA deletion mutant [315]. This indicates that the sexual function of tupA has been retained in P. marneffei. Although the presence of these highly conserved P. marneffei homologs of A. nidulans genes suggests an undiscovered mating system in P. marneffei, the mating process requires a comprehensive network of genes functioning in coordination. The finding of a complete mating gene repository in P. marneffei would therefore be stronger evidence for the presence of a sexual stage in the fungus.

The genome sequence has now enabled us to search for mating-related genes in the P. marneffei genome in order to reveal the potential mating system of this important dimorphic fungal pathogen. Similar studies have been carried out in C. albicans, which was thought to be constitutively diploid and to reproduce only asexually [138]. The complete genome predicted that a mating system existed in C. albicans after numerous highly conserved homologs of S. cerevisiae mating genes were identified [190, 259, 272]. Eventually, two research groups demonstrated that C. albicans can be induced to mate under certain conditions [139, 213].

The sexual cycle provides valuable genetic tools for fungal study. If a fungus has a sexual cycle, we can screen for mutants arising from recombination events during meiosis and gamete formation, followed by zygote formation. In the case of P. marneffei, the absence of a sexual stage has handicapped biological studies of this fungus. The genome sequence analysis reported in this chapter, however, provides encouraging information: many homologs of sexual cycle-related genes have been identified in the P. marneffei genome, suggesting a potential mating ability in this important pathogenic fungus, although a sexual state has not been reported. Practically, this discovery might open the door to simple and efficient procedures for obtaining sexual recombinants of P. marneffei that will be useful for genetic analyses of pathogenicity and other traits.

5.2 Literature Review

Studies on mating type in fungi have been helpful for the understanding of many eukaryotic regulatory pathways, including cell cycle regulation, cellular and nuclear identity, and signal transduction. Most ascomycetes have only two different mating types; their MAT locus encodes transcription factors that regulate mating-type-specific genes involved in pheromone production, pheromone sensing, and signal transduction [94]. Some ascomycetes are asexual, while many others have adopted different reproductive strategies: heterothallic, homothallic, and, less frequently, pseudohomothallic (Table 5.1). In homothallic species, homokaryotic haploid strains are self-fertile and complete the sexual cycle without seeking a mate. This diversity is so extensive that even species within the same genus, such as Neurospora, adopt either homothallic or heterothallic modes. More strikingly, a recent study found that heterothallic C. neoformans α cells can reproduce sexually via fruiting, without fusing with a partner of the opposite mating type.

5.2.1 Mating in hemiascomycete yeasts

The mating-type locus has been well studied in the ascomycete S. cerevisiae. The two haploid cell types of S. cerevisiae are determined by their MAT loci, designated α and a. A pheromone-mediated fusion process creates a diploid cell (a/α), which, under starvation conditions, can undergo meiosis to form four haploid cells, two of which are a and two α. The α and a mating-type loci each contain two divergently transcribed genes: α1 and α2, and a1 and a2, respectively. The a1 and α2 proteins are transcriptional repressors (when both are present) and both contain a homeodomain DNA-binding motif [284]. The α1 protein has been shown to be a transcription activator [278], but its DNA-binding domain (the α-box) has yet to be characterised in detail. The function of a2 is unknown. The a1 and α1 proteins are encoded by totally dissimilar sequences of 642 and 747 bp, respectively, while the a2 and α2 sequences are partially similar [227, 299]. S. cerevisiae is basically heterothallic; however, a homothallic breeding system can be achieved through mating-type switching, in which an S. cerevisiae α haploid cell switches to the opposite mating type a, or vice versa [132]. This is caused by gene conversion between the MAT locus and two MAT-like loci during the cellular division of haploid cells [120]. The molecular basis of the gene conversion is the presence of two MAT-like cassettes, HMR and HML. Normally they are transcriptionally repressed through silencing by the formation of a specialised compacted chromatin structure. Both are surrounded by “silencers,” short specific sequences that are binding sites for DNA-binding proteins and are also involved in transcriptional activation and DNA replication (for recent reviews, see [105, 117]). Moreover, haploid-specific gene products, such as the HO endonuclease, are involved in repression of meiosis and mating-type switching [120].

5.2.2 Mating in filamentous ascomycetes

These mating systems include many conserved components, such as gene regulatory polypeptides and pheromone/receptor signal transduction cascades, as well as conserved processes, like self-nonself recognition and controlled nuclear migration. The mating systems of filamentous ascomycetes share similar components and processes with those of yeasts, but they exhibit many unique properties. First, the sequence dissimilarity between the two alternate mating-type alleles is more pronounced in filamentous ascomycetes; usually they consist of unrelated and unique sequences. Second, the mating-type switching mechanism of filamentous ascomycetes is unknown but differs from that of yeast. Filamentous ascomycetes exhibit great stability of the mating type, which might be due to the lack of additional copies of mating-type sequences outside the mating-type locus. The additional copies of the mating-type locus in yeasts are usually silent copies that facilitate mating-type switching through gene conversion.

Among filamentous ascomycetes, the structure of the components and the genetic arrangement of the mating-type loci vary greatly. Neurospora crassa and Podospora anserina are two representative ascomycetes whose mating systems have been well characterised at the molecular level. In N. crassa, mat a-1 and mat A-1 are the two genes responsible for a and A mating specificity, respectively. Two additional genes, mat A-2 and mat A-3, are present in opposite orientations in the region adjacent to mat A-1. In P. anserina, FPR1 is the only gene present in the mat+ idiomorph and is sufficient to induce fertilisation; in contrast, FMR1 and two additional genes, SMR1 and SMR2, are required for the mat- strain to develop perithecia to maturity.

Table 5.1: Mating strategies adopted by ascomycetous fungi, the presence of mating-type genes and the ability to switch between mating types.

Species                 Mating strategy                            Mating type gene   Switching
S. cerevisiae           Homothallic                                Y                  Y
C. glabrata             Asexual?                                   Y                  NA
Kluyveromyces lactis    Heterothallic, some homothallic strains    Y                  Y
Kluyveromyces waltii    Homothallic                                Y                  Y
Ashbya gossypii         Asexual?                                   Y                  Y
Debaryomyces hansenii   Homothallic                                Y                  Y
Yarrowia lipolytica     Heterothallic                              Y                  Y
Neurospora crassa       Homothallic                                Y                  NA
Podospora anserina      Pseudohomothallic                          Y                  NA
Bipolaris sacchari      Asexual                                    Y                  NA
Neurospora intermedia   Heterothallic                              Y                  NA
S. almonella            Heterothallic                              Y                  Y
C. neoformans           Heterothallic                              Y                  N

Heterothallic species require a partner for mating, whereas homothallic species are able to self-mate. The difference between heterothallic and homothallic species is not due to the presence or absence of mating-type genes. Sequences similar to mating-type genes have been identified and functionally characterised in all the species tested, whether heterothallic or homothallic. Mating-type genes are even present in asexual species; for example, the asexual Bipolaris sacchari has a homolog of the MAT-2 gene of the related species C. heterostrophus. The process of sexual development is identical in homothallic and heterothallic species. Homothallic filamentous ascomycetes, even though their individual nuclei contain the information for both mating types, could be functionally heterothallic through a proposed mechanism allowing alternate expression of either mating type.

Mating may serve as a model for the study of developmental genetics and could help in elucidating regulatory mechanisms of multicellularity and sexual dimorphism. Mating systems are divergent in ascomycetes. The presence of mating-type genes does not determine the mode of sexual reproduction. Because changes in the mode of sexual reproduction are frequent and disruption of sexual function is tolerated in ascomycetous fungi, the presence or absence of particular genetic components involved in the mating system is not necessarily a good indicator of which reproductive mode a fungus has adopted.

5.3 Materials and Methods

Protein sequences of fungal sex-related genes downloaded from GenBank were used as queries against the P. marneffei genome sequences. The comparison was conducted using the NCBI TBLASTN program 2.0 with the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign homologues was 1.0e-20. The contigs of the P. marneffei genome that contained homologues were extracted and annotated manually. Each annotated gene was given a sequential locus number of the form Pm## that identifies the gene uniquely and unambiguously. Each gene also has a version attribute (so loci are in fact displayed as Pm##.version). Predicted peptides were compared to the amino acid sequences of their corresponding query proteins using NCBI BLAST2SEQ (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The statistics of the expect value were calculated based on the size of the NCBI non-redundant protein database. Conserved domains/motifs were identified using InterPro release 5.1 [367]. Multiple alignments of amino acid sequences were performed using the program ClustalX 1.81 [311], and the alignments were adjusted manually. Graphic presentations of the alignments and consensus sequences were produced using the program BOXSHADE 3.21 (http://www.ch.embnet.org/software/BOX_form.html).

In addition to the degree of sequence similarity, several lines of supplementary information were used to further support gene homology. These include: (i) conserved positions of intron(s) between homologs, which argue for a common ancestor of the genes studied; (ii) phylogenetic trees constructed from aligned genes, so that the closest homolog can be identified when paralogous genes are present; (iii) identified features characteristic of the family to which a gene belongs.

Figure 5.1: Comparison of the mating-type loci in P. marneffei and other fungi. Boxes interrupted by gaps represent the coding sequences of the genes and the introns, respectively. Arrows indicate the directions of genes. Dashed lines indicate that the linked genes are present in the genome of the same isolate. Symbols: dark-gray bar, conserved HMG-box domain; light-gray bar, conserved alpha-box motif. Labels in the figure: A. fumigatus, AfMAT-2 (Af59.m09249); A. nidulans, AnMAT-2 (AF508279/AN4734.2*) and AnMAT-1 (AY339600/AN2755.2), Chromosome 3 and Chromosome 6; P. marneffei, PmMAT-1 (Pm1.126); N. crassa, mat a-1 (M54787), mat A-3, mat A-2, mat A-1; S. cerevisiae, MATalpha2, MATalpha1, MATa1; S. pombe, mat1-P, mat2-P, mat3-M and mat1-M, mat2-P, mat3-M, with spacings of 15 kb and 11 kb.

Phylogenetic trees were inferred by the neighbour-joining method [273]. Genetic distances between protein sequences were estimated using the WAG amino-acid substitution model [342] implemented in MBEToolbox (Chapter 10).
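For readers unfamiliar with the neighbour-joining method, its joining steps can be sketched in Python as below. This is a didactic sketch, not the MBEToolbox implementation: it takes any precomputed symmetric distance matrix (the WAG distance estimation is not reproduced), the function name and return format are hypothetical, and ties between equally good pairs are broken by input order.

```python
from typing import Dict, List, Tuple

Join = Tuple[str, str, float, float]  # (node_a, node_b, branch_a, branch_b)


def neighbor_joining(names: List[str], dist: List[List[float]]) -> List[Join]:
    """Return the successive joins of the neighbour-joining algorithm."""
    nodes = list(names)
    d: Dict[Tuple[str, str], float] = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            d[(a, b)] = float(dist[i][j])
    joins: List[Join] = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        total = {a: sum(d[(a, b)] for b in nodes) for a in nodes}
        # Choose the pair that minimises Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[(a, b)] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        branch_a = 0.5 * d[(a, b)] + (total[a] - total[b]) / (2 * (n - 2))
        branch_b = d[(a, b)] - branch_a
        parent = f"({a},{b})"
        joins.append((a, b, branch_a, branch_b))
        # Distances from the new parent to every remaining node.
        for c in nodes:
            if c not in (a, b):
                value = 0.5 * (d[(a, c)] + d[(b, c)] - d[(a, b)])
                d[(parent, c)] = d[(c, parent)] = value
        d[(parent, parent)] = 0.0
        nodes = [c for c in nodes if c not in (a, b)] + [parent]
    # The last two nodes are connected by a single branch.
    a, b = nodes
    joins.append((a, b, d[(a, b)], 0.0))
    return joins


# Small additive example matrix for illustration (not data from this work).
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for join in neighbor_joining(example_names, example_dist):
    print(join)
```

Each returned tuple names the two nodes joined and the branch lengths leading to their parent; internal nodes are labelled by the nested pair they contain, so the final label reads like a Newick topology without branch lengths.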

5.4 Results and Discussion

The close relationship between the genera Penicillium and Aspergillus has been well established on the basis of various sources of evidence. It is further supported by our recent comparative study of the mitochondrial genome of P. marneffei and those of other fungi (Chapter 3). This relationship prompted the search for previously undiscovered characteristics of P. marneffei based on our knowledge of the various Aspergillus species.

5.4.1 Homologs of known sexual genes

With respect to the potential mating system of P. marneffei, A. nidulans is of particular interest, as this model species has two distinct reproductive developmental processes: sexual and asexual development. We used a set of empirically selected A. nidulans genes involved in sexual development as queries to identify their homologs in P. marneffei. These genes are veA, medA, tubB, phoA and nsdD. The veA gene was first shown to mediate the light response as early as 1965 [156]. It was later found to be required for cleistothecium and ascospore formation as well [159]. The veA1 mutant is unable to develop sexual structures, and its asexual sporulation is promoted and increased [164], implying that the veA gene plays a key role in activating sexual development and/or inhibiting asexual development. A. nidulans medA (GenBank Acc.: AAC31205) encodes a transcriptional regulator of sexual and asexual reproduction. tubB, one of two genes encoding alpha-tubulin, is involved in the processes of karyogamy and meiosis I [167, 168], but it is not required for vegetative growth or asexual reproduction, nor for the initiation or early stages of sexual differentiation. The gene nsdD encodes a GATA-type transcription factor that functions in activating sexual development [124]. The gene phoA [33], like stuA [222], is involved in the biosynthesis of tryptophan and has been identified as being involved in sexual development [77, 314, 355].

As in A. nidulans veA, the predicted P. marneffei veA contains one intron with conserved boundaries. The predicted P. marneffei MedA (741 aa) shows 49% amino acid identity to A. nidulans MedA (600 aa) within an alignable region of 555 aa. The predicted P. marneffei tubB and phoA are highly conserved, sharing 83 and 80% identical amino acid residues with A. nidulans tubB and phoA, respectively. The predicted P. marneffei NsdD consists of 385 amino acid residues and, like A. nidulans NsdD, is rich in proline (13.8 and 11.3%) and serine (13.8 and 13.4%). Both have a type IVb C-X2-C-X18-C-X2-C zinc finger DNA-binding domain at their C-termini.
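The two sequence features mentioned for NsdD, residue composition and the C-X2-C-X18-C-X2-C spacing pattern, can be checked with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the toy sequence is invented and is not the NsdD sequence, and the pattern match tests cysteine spacing alone rather than domain membership as determined by InterPro.

```python
import re
from typing import List, Tuple

# C-X2-C-X18-C-X2-C: four cysteines separated by 2, 18 and 2 arbitrary residues.
ZINC_FINGER = re.compile(r"C.{2}C.{18}C.{2}C")


def find_zinc_fingers(protein: str) -> List[Tuple[int, str]]:
    """Return (1-based start position, matched segment) for each non-overlapping match."""
    sequence = protein.upper()
    return [(m.start() + 1, m.group()) for m in ZINC_FINGER.finditer(sequence)]


def residue_percent(protein: str, residue: str) -> float:
    """Percentage of positions in the sequence occupied by the given residue."""
    sequence = protein.upper()
    if not sequence:
        return 0.0
    return 100.0 * sequence.count(residue.upper()) / len(sequence)


# Invented toy sequence containing one motif instance.
toy = "MPSPS" + "CAAC" + "G" * 18 + "CAAC" + "SP"
print(find_zinc_fingers(toy))
print(round(residue_percent(toy, "P"), 1), round(residue_percent(toy, "S"), 1))
```

The sequence is expected as a single string without line breaks, since the `.` wildcard does not match newline characters.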

We also identified homologs of two inhibitors of sexual processes, lsdA and rosA, in P. marneffei. LsdA is expressed abundantly at the late sexual developmental stage of A. nidulans. Disruption of lsdA causes the preferential formation of sexual structures even under conditions, such as a high salt concentration, in which sexual development is inhibited in the wild type [191]. Hence, the lsdA gene inhibits sexual development in the presence of sex-inhibiting environmental signals. A. nidulans RosA is likewise a repressor of the initiation of sexual development, under low-carbon conditions and in submersed culture [331]. The predicted P. marneffei lsdA encodes a 350 amino acid polypeptide which, compared to the 356 amino acid A. nidulans LsdA, shares 43% identical and 60% similar amino acid residues. The predicted P. marneffei RosA exhibits 57% amino acid identity to A. nidulans RosA. The position of the larger intron of P. marneffei rosA is the same as in its orthologs in A. nidulans, Sordaria brevicollis and N. crassa. At the N terminus of P. marneffei RosA, a highly conserved Zn(II)2Cys6 motif, a putative bipartite nuclear localisation signal and a DNA-binding domain are predicted.

In summary, although studies of the molecular mechanisms controlling sexual development in filamentous fungi are very limited, several sexual genes that have been identified, isolated and characterised in A. nidulans enabled us to find their homologs in P. marneffei. This finding is in line with the two other genes mentioned above, stuA [222] and steA [19], which have been experimentally characterised in both A. nidulans and P. marneffei, revealing the functional exchangeability of the corresponding homologs. The presence of these faithful homologs suggests that sexual development is potentially possible in P. marneffei. However, the evidence is less conclusive when the following fact is taken into account: many sexual genes may function not only in sexual development but also in other processes, such as secondary metabolism. Hence, homologous sexual genes in P. marneffei might be responsible for other processes that are not related to sexual development. Further evidence is therefore needed to draw a conclusion.

Figure 5.2: Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes. The amino acid sequences aligned are as follows: putative P. marneffei MAT-1 (Pm1.126); putative A. nidulans MAT-1 (AN2755.2); N. crassa mat A-1; Paecilomyces tenuipes MAT1-1-1; Gibberella fujikuroi MAT-1-1; Alternaria alternata MAT-1; Pyrenopeziza brassicae alpha-1 domain protein (CAA06844.1); Gibberella zeae MAT1-1-1; Fusarium oxysporum MAT-1; Cochliobolus ellisii MAT-1; Podospora anserina FMR1. The arrow indicates the conserved position of introns.

5.4.2 Mating type genes

Fungi are capable of sexual reproduction using either heterothallic (self-sterile) or homothallic (self-fertile) mating strategies. In most ascomycetes, mating ability is controlled by a single mating-type locus, MAT, with two alternate forms (MAT-1 and MAT-2) called idiomorphs. MAT-1 and/or MAT-2 mediate not only mating but also several other key processes, including the secretion of and response to pheromones and vegetative incompatibility. In heterothallic ascomycetes, these alternate idiomorphs reside in different nuclei. In contrast, most homothallic ascomycetes carry both MAT-1 and MAT-2 in a single nucleus, usually closely linked.

A. nidulans is a homothallic ascomycete. A. nidulans MAT-2 (AnMAT-2) has previously been characterised using ‘classic’ molecular biological techniques [76], while A. nidulans MAT-1 (AnMAT-1, GenBank Acc. BK001307) was found by similarity searching [76]. In the MIT A. nidulans genome database, the two annotated genes AN2755.2 and AN4734.2, on different contigs, are AnMAT-1 and AnMAT-2, respectively. Note that AN4734.2 differs slightly from AnMAT-2 (GenBank Acc. AF508279), simply because the sequences derive from different isolates of A. nidulans. In contrast to A. nidulans, only MAT-2 has been identified by genome analyses of A. fumigatus [253, 326]. AfMAT-2 encodes a regulatory protein with a high mobility group (HMG) DNA-binding domain [320], which is the characteristic feature of MAT-2 genes. No homologue of the MAT-1 gene sequence of any of the tested fungi was found in the TIGR A. fumigatus genomic database. This suggests that A. fumigatus is perhaps a heterothallic rather than a homothallic ascomycete (as all homothallic euascomycetes so far analysed contain either only MAT-1 or both MAT-1 and MAT-2 [252]), and that the genome sequence was obtained from a MAT-2 strain.

Figure 5.3: Gene organisation around the MAT loci of A. nidulans and the putative MAT loci of P. marneffei and A. fumigatus. AnMAT-1 and AnMAT-2 are A. nidulans MAT-1 and MAT-2, located on contigs 47 and 27 of the unfinished A. nidulans genome, respectively. Connecting lines mark two relationships: "is neighbor" and "is homolog". Genes shown: A. nidulans contig 27, AN4732.2, AN4733.2, AnMAT-2 (AN4734.2), AN4735.2, AN4736.2, AN4737.2; A. fumigatus, Af59.m09500, Af59.m09250, AfMAT-2 (Af59.m09249), Af59.m09248, Af59.m09247, Af59.m09246; P. marneffei, Pm1.124, Pm1.125, PmMAT-1 (Pm1.126), Pm1.127, Pm1.128, Pm1.129; A. nidulans contig 47, AN2756.2, AnMAT-1 (AN2755.2), AN2754.2, AN2753.2. Annotated gene products: cytoskeleton assembly control protein; DNA lyase.

Using this pair of Aspergillus species closely related to P. marneffei, the homothallic A. nidulans and the possibly heterothallic A. fumigatus, as models, we undertook a series of MAT searches to determine whether P. marneffei has a hypothetical MAT locus and, if so, whether it carries both MAT1-1 and MAT1-2 genes. Through BLAST searches, we identified a putative mating-type (PmMAT) locus in P. marneffei containing a conserved homolog of A. nidulans MAT-1 (AnMAT-1), which is denoted PmMAT-1 hereafter. The PmMAT-1 gene encodes a putative 348 amino acid polypeptide that shares 38% similarity with AnMAT-1 (361 aa) over its full length, and exhibits 59, 60, 61 and 60% similarity to the alpha-box domains of AnMAT-1, P. brassicae MAT-1, G. fujikuroi MAT-1 and P. anserina MAT-1, respectively. More importantly, the intron boundaries are conserved between the putative PmMAT-1 and other fungal MAT-1 genes (Fig. 5.2).

Despite extensive searches of the genome sequence, we could not identify a MAT-2-like gene in P. marneffei. Having a single mating-type gene resembles the situation in A. fumigatus, where, conversely, MAT-1 cannot be found. The other mating-type gene, P. marneffei MAT-2 or A. fumigatus MAT-1, might be present in other isolates, as observed in the asexual species Fusarium culmorum [163]. Alternatively, the other putative mating-type gene could have become extinct, as observed in C. neoformans populations and Ophiostoma novo-ulmi [356].

The former explanation seems more plausible after we identified putative mating-type loci in P. marneffei and A. fumigatus, which show similarity to A. nidulans MAT-2 and MAT-1 regions, respectively. We compared the flanking genes of the two mating-type loci with each other, as well as with the corresponding A. nidulans MAT-2 or MAT-1 regions (Fig. 5.3). Striking patterns were observed in the organisation of the flanking genes, where several syntenies were identified. Comparing P. marneffei with A. fumigatus, PmMAT-1 (Pm1.126) and AfMAT-2 (Af59.m09249) are oriented differently, each upstream of a hypothetical gene (Pm1.127 and Af59.m09250, respectively). The mating-type gene and the gene following it occupy a unique region of ∼5 kb in both P. marneffei and A. fumigatus. No significant similarity at the amino-acid or nucleotide level can be detected between the two regions. Three pairs of homologous genes flank the two regions: the first pair encodes a homologue of the S. cerevisiae SLA2-like cytoskeleton assembly control protein, and the other two encode a putative DNA lyase and a protein of the cytochrome c oxidase subunit VIa family. It therefore seems likely that the non-homologous regions in P. marneffei and A. fumigatus are the mating-type loci of their respective idiomorphs. The locus of the other idiomorph might be found in other isolates. This suggests that P. marneffei and A. fumigatus are heterothallic fungi.
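As an illustration of the flanking-gene comparison described above, the following minimal Python sketch lists which homolog families are shared between two loci and which are locus-specific. The family labels and their order are hypothetical shorthand for the genes named in the text, not the curated annotation behind Fig. 5.3, and the function name is illustrative only.

```python
def compare_loci(locus_a, locus_b):
    """Split two ordered lists of homolog-family labels into shared and locus-specific families."""
    set_a, set_b = set(locus_a), set(locus_b)
    shared = [fam for fam in locus_a if fam in set_b]
    only_a = [fam for fam in locus_a if fam not in set_b]
    only_b = [fam for fam in locus_b if fam not in set_a]
    return shared, only_a, only_b


# Hypothetical labels and gene order, for illustration only.
pm_locus = ["SLA2-like", "MAT-1", "hypothetical-Pm", "DNA-lyase", "COX6A"]
af_locus = ["SLA2-like", "MAT-2", "hypothetical-Af", "DNA-lyase", "COX6A"]

shared, pm_only, af_only = compare_loci(pm_locus, af_locus)
print("shared flanking families:", shared)
print("P. marneffei-specific:", pm_only)
print("A. fumigatus-specific:", af_only)
```

In this toy input the three flanking families are shared while the mating-type gene and its neighbour are specific to each locus, which is the pattern expected for idiomorphs.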

Taken together with N. crassa, we now have the schematic organisation of mating-type loci from four filamentous fungi whose genome sequences are completed or almost completed (Fig. 5.1). Comparing them with those from yeasts, we note that the mating-type DNA regions of filamentous fungi are generally larger than those in S. cerevisiae [10] or S. pombe [162]. In the fission yeast S. pombe, the mating-type region comprises three linked loci, mat1, mat2 and mat3, which occupy about 30 kb of DNA on chromosome II [14]. The mat1 locus determines the cell type, depending on whether it carries P (for plus) or M (for minus) information. The mat2-P and mat3-M loci are transcriptionally silent and act as donors of information for switching mat1 DNA by gene conversion. There is no similar arrangement of mating-type regions in P. marneffei; however, it is noteworthy that other genes, such as Pm6.88 in P. marneffei and AN1962.2 in A. nidulans, show similarity to the HMG mating-type genes. They are not ‘true’ MAT-2 family mating-type genes because they lack the intron at the conserved position and some other conserved motifs that are seen only in MAT-2 genes. Nor are they located at the MAT locus, unlike in other filamentous fungi such as N. crassa, which may have an additional HMG gene at the MAT-1 idiomorph involved in fertility. These extra HMG genes are unlikely to be silent copies of MAT genes of the kind seen in the yeasts. However, they may in theory have some role in fertility, which will need experimental investigation [Dr Paul S. Dyer, personal communication].

Finally, the detection of mating-type genes, which play roles in sexual signalling between compatible heterothallic isolates, in a ‘selfing’ fungus like A. nidulans is itself noteworthy. As suggested by Dyer [76], this observation can be interpreted as either the evolution of heterothallic species towards a homothallic form or vice versa. Taking our observations from the P. marneffei genome into account, we consider the former interpretation more plausible, i.e., the homothallic A. nidulans originated from a heterothallic common ancestor of Penicillium and Aspergillus.

5.4.3 Mating pheromone precursor genes

The nucleotide and deduced amino acid sequences of pheromone precursor genes from several fungi were used to search the P. marneffei genome. Despite intensive searches, no significant similarity was found (BLAST E-value cutoff = 10). As mentioned in a previous section (Section 1.4.2), syntenic comparisons suggest that the original mating pheromone precursor loci may have been lost in P. marneffei. However, we cannot exclude the possibility that the P. marneffei mating pheromone precursor genes are so highly specific that they are too divergent to be detected by similarity searches.

[Figure 5.4 diagram: steps of a-factor biogenesis with the S. cerevisiae component and predicted P. marneffei homologue for each step.
C-terminal CAAX modification: farnesylation, Ram1p (Pm6.49) and Ram2p (Pm60.30); AAX proteolysis, Ste24p (Pm60.4) and Rce1p (Pm96.20); carboxylmethylation, Ste14p (Pm92.26).
N-terminal processing: P1→P2 proteolysis, Ste24p (Pm60.4); P2→M proteolysis, Axl1p (no match) and Ste23p (Pm134.14).
Export: Ste6p (Pm125.22).]

Figure 5.4: Predicted P. marneffei homologues of the genes involved in the biogenesis of the a-factor pheromones in S. cerevisiae. The a-factor biosynthetic intermediates and the components of the a-factor biogenesis machinery are shown (see the text for more information). Several of the a-factor intermediates can be directly visualised by SDS-PAGE and are designated P0, P1, P2, and M [49]. The a-factor precursor contains an N-terminal extension, a mature portion, and a C-terminal CAAX motif, as indicated at top. During a-factor biogenesis, the unmodified a-factor precursor (P0) undergoes C-terminal modification (prenylation, proteolytic cleavage of AAX, and carboxylmethylation) to yield the fully C-terminally modified species P1. Next, N-terminal proteolytic processing occurs in two distinct steps, the first (P1→P2) cleavage removing seven residues from the N-terminal extension to yield the P2 species, and the second (P2→M) cleavage generating mature a-factor, which is exported from the cell. The corresponding components predicted from P. marneffei are given. Among them, AXL1 has not been identified.

Table 5.2: Pheromone-processing enzymes encoded by the putative P. marneffei genes, as shown by a BLAST search of the P. marneffei genome.

Sc protein (aa)  Pm protein (aa)   E-value  Identity in overlap  Similarity in overlap  Function
Kex1p (729)      Pm76.8 (672)      4e-057   124/350 (35%)        183/350 (52%)          Carboxypeptidase; α-factor processing
Kex2p (814)      Pm6.3 (813)       1e-154   302/774 (39%)        428/774 (55%)          Endoprotease; α-factor processing
Ste13p (931)     Pm10.77 (899)     1e-128   263/787 (33%)        399/787 (50%)          Dipeptidyl aminopeptidase; α-factor processing
Ram2p (316)      Pm60.30 (350)     5e-051   124/354 (35%)        177/354 (50%)          CaaX farnesyltransferase α subunit; a-factor modification
Ram1p (431)      Pm6.49 (635)      6e-050   114/329 (34%)        157/329 (47%)          CaaX farnesyltransferase β subunit; a-factor modification
Rce1p (315)      Pm96.20 (333)     3e-025   79/263 (30%)         132/263 (50%)          CaaX protease; a-factor C-terminal processing
Ste14p (239)     Pm92.26 (259)     1e-034   61/134 (45%)         87/134 (64%)           Prenylcysteine carboxyl methyltransferase
Ste24p (453)     Pm60.4 (456)      1e-115   202/446 (45%)        274/446 (61%)          CaaX prenyl protease; N- and C-terminal a-factor processing
Ste23p (988)     Pm134.44 (1012)   0.0      369/947 (38%)        562/947 (59%)          Metalloprotease involved, with homolog Axl1p, in N-terminal processing of pro-a-factor to the mature form
Ste6p (1290)     Pm125.22 (1262)   1e-127   335/1280 (26%)       580/1280 (45%)         ATP-dependent multidrug efflux pump of a-factor

5.4.4 Mating pheromone processing genes

The production of pheromones has provided important insights into pro-protein processing in eukaryotic cells. The system has been well characterised in S. cerevisiae (for a review, see [62]). A budding yeast cell produces either a-factor or α-factor according to its mating type. Each factor is synthesised as a precursor that undergoes multiple maturation steps to generate its mature form. A number of S. cerevisiae pheromone-processing genes have been cloned and characterised [32]. We used the protein sequences of all these genes in BLAST searches to identify pheromone-processing genes encoding putative homologous proteins in P. marneffei. For all the S. cerevisiae query proteins except Axl1p, corresponding P. marneffei homologs with high levels of amino-acid similarity were identified (Table 5.2). Hence, P. marneffei appears capable of synthesising and processing mating pheromones, although no pheromone precursor gene has been identified by searching with known pheromone precursor genes.
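The identity and similarity percentages reported alongside BLAST hits follow directly from the match counts and the overlap length. The sketch below assumes that percentages are truncated to whole numbers, which is consistent with the values in Table 5.2 (for example, 87/134 is reported as 64%); the function name is illustrative.

```python
def percent_of_overlap(matches, overlap):
    """Whole-number percentage of aligned positions (truncated, not rounded)."""
    if overlap <= 0:
        raise ValueError("overlap length must be positive")
    return (100 * matches) // overlap


# Counts reported for Pm6.3 in Table 5.2: 302 identical and 428 similar
# positions within a 774-residue overlap.
identity = percent_of_overlap(302, 774)
similarity = percent_of_overlap(428, 774)
print(f"identity {identity}%, similarity {similarity}%")
```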

Different genes are involved in the processing of α-factor and a-factor. For α-factor, maturation requires signal cleavage, glycosylation and proteolytic processing by three peptidases encoded by KEX2, KEX1 and STE13. The S. cerevisiae KEX2 gene encodes kexin, a member of the prohormone convertase family, which has been identified in many species. S. cerevisiae Kex2p is membrane-bound and cleaves peptide substrates at both Lys-Arg and Arg-Arg sites [26, 100]. A previous study has shown that mutant Kex2p molecules lacking as many as 200 C-terminal residues still retain protease activity. Although not essential for enzymatic activity, the C-terminal cytoplasmic tail contains a localisation signal that directs Kex2p to a late compartment of the Golgi complex. The predicted P. marneffei Kex2p shows high similarity (55%) to S. cerevisiae Kex2p overall, with slightly lower similarity at the C-terminal residues; hence, the predicted P. marneffei Kex2p possibly has protease activity but may be localised differently. S. cerevisiae KEX1 encodes a carboxypeptidase that removes the Lys-Arg residues exposed at the C-terminus of the α-factor precursor after digestion by kexin [60, 70, 188]. As with Kex2p, the C-terminal residues of S. cerevisiae Kex1p are not highly conserved in P. marneffei, which also suggests a difference in protein localisation between the species. P. marneffei is also predicted to have a homolog of S. cerevisiae Ste13p, a type IV dipeptidyl aminopeptidase that trims N-terminal X-Ala dipeptides from the α-factor precursors [154].
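The Kex2p specificity described above (recognition of Lys-Arg and Arg-Arg pairs) can be sketched as a simple scan for dibasic sites. The peptide below is an invented toy sequence, not a real pheromone precursor, and the function name is illustrative; the scan only reports candidate sites and says nothing about whether a site is actually cleaved.

```python
import re

# Zero-width lookahead so that overlapping pairs (e.g. "KRR") are all reported.
DIBASIC = re.compile(r"(?=(KR|RR))")


def dibasic_sites(seq):
    """Return (1-based position of the first residue, motif) for each KR or RR pair."""
    return [(m.start() + 1, m.group(1)) for m in DIBASIC.finditer(seq.upper())]


toy_peptide = "MAKREAEARRLQ"  # hypothetical sequence for illustration
sites = dibasic_sites(toy_peptide)
print(sites)
```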

a-factor undergoes three major maturation stages: C-terminal modification, N-terminal modification, and export [49], which involve the genes RAM2, RAM1, RCE1, STE14, STE24/AFC1, STE23, AXL1 and STE6 (Fig. 5.4). The S. cerevisiae RAM2 and RAM1 genes encode the α and β subunits of farnesyltransferase (FTase), respectively [129]. FTase catalyses the addition of 15-carbon (farnesyl) groups to a-factor destined for cell membranes [260]. RAM2 and RAM1 are conserved genes that have mammalian counterparts. RAM2 is essential for the viability of C. albicans, while RAM1 is essential in C. neoformans, indicating that protein prenylation is an indispensable cellular process in these opportunistic yeast pathogens. The predicted P. marneffei Ram1p shows high levels of similarity to S. cerevisiae Ram1p (51%) and to mammalian protein farnesyltransferase β subunits (e.g. 55% similarity to rat fntb). The predicted P. marneffei Ram2p shows 50% similarity to S. cerevisiae Ram2p, and both contain at least three PPTA (Pfam acc. PF01239) domains at their N-termini. The S. cerevisiae RCE1 gene encodes an AAX prenyl protease [21]. The RCE1 sequence contains three potential transmembrane domains but has no other defining features and no significant similarity to other proteins; hence it may belong to a novel superfamily [247]. The predicted P. marneffei Rce1p, which is 50% similar, also contains multiple potential transmembrane domains. More importantly, the three putative zinc-binding residues (E156A, H184A, H248A) and Cys (C251) are all conserved. Mutating each of these residues inactivates the protease [72]. The S. cerevisiae STE14 gene encodes a carboxyl methyltransferase that methylates a-factor. The predicted P. marneffei Ste14p, which contains multiple predicted transmembrane spans, shares 64% similarity with S. cerevisiae Ste14p.

S. cerevisiae Ste24p, a membrane-associated metalloprotease, is required for the first step of N-terminal processing of a-factor [99]. The predicted P. marneffei Ste24p shows 60% similarity to its counterpart. Like S. cerevisiae Ste24p, P. marneffei Ste24p has a Zn-dependent metalloprotease motif (HEXXH) [304], located at positions 299 to 303. It also matches the larger consensus sequence characteristic of neutral Zn metalloproteases, and contains multiple predicted transmembrane regions. Unlike in S. cerevisiae Ste24p, however, the C-terminal di-lysine motif KKXX (K is Lys) is replaced with KXXX in P. marneffei Ste24p. Our analysis reveals that the predicted Ste24p homologs in A. fumigatus (AF58.m07859) and N. crassa (NCU03637.2) also carry this replacement of the di-lysine motif. Since the di-lysine motif at the C-terminus of many proteins facilitates their retrieval from the Golgi complex to the ER [310], this could suggest that Ste24p is localised to the ER in S. cerevisiae, but not in P. marneffei or the other two filamentous fungi.

The S. cerevisiae metalloprotease Ste23p, a member of the insulin-degrading enzyme family, is involved in N-terminal processing of pro-a-factor to the mature form. Axl1p is a paralog of Ste23p. In S. cerevisiae, Ste23p and Axl1p show 22% identity and 39% similarity over their entire length, and Ste23p performs a role at least partially redundant with that of Axl1p in a-factor processing [1]. In P. marneffei, I identified a putative homolog of Ste23p but not of Axl1p. P. marneffei Ste23p is highly conserved, showing 59% similarity to S. cerevisiae Ste23p. We argue that, since STE23 genes are present in both S. cerevisiae and P. marneffei while AXL1 is present in S. cerevisiae only, it is possible that AXL1 was created by duplication of STE23 after the separation of the two species. Moreover, S. cerevisiae STE23 and AXL1 may be an example of duplicate genes that have undergone subfunctionalisation, through which Axl1p gained a new role in controlling the axial budding pattern of haploid cells while retaining partial STE23 functions in processing a-factor. Finally, unlike α-factor, which is exported from MATα cells via the classical secretion pathway, a-factor is pumped out of the cell by the MATa cell-specific protein Ste6p. A homolog of Ste6p was identified in P. marneffei, with multiple transmembrane domains and two ATP-binding domains.
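The two sequence features used above to compare Ste24p homologs, the HEXXH metalloprotease motif and the C-terminal KKXX di-lysine signal, can be checked with a few lines of Python. This is a minimal sketch under simple assumptions: the patterns are the bare consensus strings quoted in the text, the example sequences are invented toy peptides rather than the real Ste24p proteins, and the function names are illustrative.

```python
import re


def find_hexxh(seq):
    """Return 1-based start positions of every HEXXH motif (X = any residue)."""
    return [m.start() + 1 for m in re.finditer(r"(?=HE..H)", seq)]


def has_cterm_dilysine(seq):
    """True if the last four residues match KKXX."""
    return len(seq) >= 4 and seq[-4:-2] == "KK"


# Hypothetical toy sequences for illustration only.
with_signal = "MSTAHELGHWAKKTL"     # ends in KKXX
without_signal = "MSTAHELGHWAKTTL"  # ends in KXXX

print(find_hexxh(with_signal), has_cterm_dilysine(with_signal))
print(find_hexxh(without_signal), has_cterm_dilysine(without_signal))
```

A positive motif match of this kind is only suggestive; localisation would still need to be confirmed experimentally.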

5.4.5 Mating pheromone receptor and other GPCRs

In S. cerevisiae, a- or α-factor binds to cell-type-specific receptors encoded by STE2 or STE3. STE2 is expressed in a cells and its product recognises α-factor, whereas STE3 is expressed in α cells and its product recognises a-factor. This binding is essential for signalling the mating process between haploid cells. In A. nidulans, Han et al. [125] identified nine genes, gprA∼gprI, belonging to the GPCR family. Among them, gprA and gprB are putative orthologs of STE2 and STE3. gprD is similar to the yeast glucose-sensing Gpr1p [176] and plays a key role in coordinating hyphal growth and sexual development. Using these A. nidulans GPCRs as query genes, I identified seven closely related P. marneffei GPCRs. A phylogeny reconstructed from a collection of fungal GPCRs indicates several distinct families, and the seven P. marneffei GPCRs are distributed across all of these subdivisions. They all contain multiple predicted transmembrane domains, one of the characteristic features of GPCRs. Han et al. [125] also reported that seven putative GPCRs are present in the A. fumigatus genome. It would be interesting to re-analyse this gene family when gene sequences from the genomes of all three closely related species become available.

Our results indicate that P. marneffei might have a recent evolutionary history of sexual recombination and might have the potential for sexual reproduction. The possible presence of a sexual cycle is highly significant for the population biology and disease management of the species.

Chapter 6

EXPLORING THE GENETIC COMPONENTS ASSOCIATED WITH THE DIMORPHISM OF PENICILLIUM MARNEFFEI

Penicillium marneffei accommodates both complex asexual development and dimorphic switching programs, and has therefore become a valuable system for the study of morphogenesis and pathogenicity. The study of its morphogenetic programs has recently been greatly facilitated by the development of molecular genetic techniques, but we are only beginning to uncover the determinants that control these events, and the comprehensive picture remains blurred. This chapter contributes to the thesis by offering a systematic exploration of the genetic components in the P. marneffei genome that may be responsible for these morphogenetic processes, mainly through sequence analysis in the context of comparative genomics. This will provide insights into the biology of P. marneffei and its pathogenic capacity.

6.1 Introduction

Dimorphism, the ability to switch between a cellular yeast form and a filamentous form, is a common morphogenetic feature of many fungi, despite their enormous diversity in size and shape. The change of growth form is believed to be effected by an altered programme of gene expression, which is induced by a wide range of metabolic and environmental factors. In Saccharomyces, the trigger is nitrogen starvation; in Candida, it is serum (among other things); in Ustilago, it is a putative molecular signal from the host plant; and in P. marneffei, it is apparently temperature. Note that environmentally conditioned dimorphism is reversible.

The yeast form is characterised by round or ovoid unicellular organisms that divide mitotically, either by budding or by fission, to form two independent daughter cells. Filamentous or mould forms are more complex multicellular structures. The filaments are long, thin, parallel-walled tubes that grow by apical extension, with occasional branching at an angle from the original direction of growth. In contrast to yeast cells, filamentous cells do not separate after nuclear division; rather, they form septa between cellular units that remain physically associated with the mother cell.

There is a growing body of evidence suggesting that morphogenesis is a crucial determinant of fungal pathogenicity in both plants and animals. In Magnaporthe grisea, for example, MAPK and cAMP signalling promote the formation of a highly specialized infection structure, the appressorium, which is essential for invasion of the host [223]. Most dimorphic fungal pathogens, including P. marneffei, Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum and Paracoccidioides brasiliensis, typically enter the body via the lungs as spores or, possibly, mycelial fragments, and grow as yeast forms in the body. Pathogenic Cryptococcus neoformans has been shown to form self-fertilising, diploid strains that are thermally dimorphic [286]. Aspergillus fumigatus spores establish invasive disease in lung tissue exclusively by hyphal development.

Because of the prevalence of dimorphism among human pathogenic fungi, it is of interest and importance to identify the molecules necessary for the morphological switch. However, the mechanism of thermal dimorphism in P. marneffei remains unknown. Nevertheless, since fungal dimorphism has been seen by many investigators as a useful model of differentiation in eukaryotic systems, significant progress has been achieved in the study of morphogenesis in other fungi. The approach of this chapter is to review this progress (especially experimental developments) achieved in recent years in the field of fungal genetics. These developments have suggested models and hypotheses for understanding the regulation of the molecular mechanisms involved in fungal differentiation. Comparative sequence analysis is then adopted to explore the genetic components that may be involved in the morphogenesis of P. marneffei. Specifically, we would like to know whether P. marneffei possesses specific (probably temperature-sensitive) cellular sensors to detect external stimuli, unique signal transduction pathways that translate the external stimuli into biochemical messages that alter genomic expression levels, or an enhanced capacity for the structural reorganization that results in the morphological change. It is noteworthy that the comparative genomics approach adopted in this chapter is impaired by the lack of genome sequence information from true dimorphic fungi. Moreover, even if the genome sequences of Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum or Paracoccidioides brasiliensis had been available, the approach might still be handicapped by the large genetic distance between P. marneffei and these divergent species. The following analysis is therefore mainly limited to comparisons between P. marneffei and Aspergillus species.

6.2 Materials and Methods

6.2.1 Sequence similarity

To identify homologous genes in the P. marneffei genome, protein sequences derived from target genes were used as queries against the P. marneffei genome. Sequence similarity searches were performed using BLASTP or PSI-BLAST against selected fungal genomes downloaded from GenBank. The searches were also performed against an in-house database composed of whole-genome sequences of several fungal species from finished and ongoing sequencing projects. The comparisons were conducted using the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign homologues was 1e-5, unless otherwise stated. Conserved domains/motifs were identified using InterPro release 5.1 [367].
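As an illustration of how an E-value cutoff of this kind can be applied, the sketch below keeps, for each query, the best hit at or below the threshold. It assumes that hits have been exported in the standard 12-column tabular BLAST format, in which the first two columns are the query and subject identifiers and the 11th column is the E-value. The query names, gene names and scores in the example are invented, and the function name is illustrative; this is not the pipeline used in the thesis.

```python
import io


def best_hits(tabular_lines, max_evalue=1e-5):
    """Return {query: (subject, evalue)} for the lowest-E-value hit passing the cutoff."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # 11th column of the tabular format
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits for illustration only.
example = (
    "queryA\tgeneX\t45.0\t134\t70\t2\t1\t134\t5\t138\t1e-34\t140\n"
    "queryA\tgeneY\t30.0\t60\t40\t1\t10\t69\t3\t62\t0.5\t30\n"
    "queryB\tgeneZ\t28.0\t50\t35\t1\t1\t50\t1\t50\t3.2\t25\n"
)
hits = best_hits(io.StringIO(example))
print(hits)
```

In this toy input only queryA retains a hit; queryB has no hit below the cutoff and would be reported as having no homologue at this threshold.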

6.2.2 Phylogenetic Analysis

Protein sequences were aligned using PROBCONS [71] and columns of low conservation were removed manually. Phylogenetic trees were inferred by the neighbour-joining method [273]. The alignments were also used to infer maximum-likelihood trees. The maximum-likelihood trees were constructed using the PHYLIP package [86], applying the JTT substitution model with a gamma distribution (alpha = 0.5) of rates over four categories of variable sites. In general, the maximum-likelihood and neighbour-joining trees were congruent.
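For readers unfamiliar with neighbour-joining, the sketch below shows only its pair-selection step: for n taxa, the pair (i, j) minimising Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) is joined first. The distance matrix is a small toy example, not distances derived from the alignments in this chapter, and the trees reported here were built with existing software rather than with this code.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Return the pair minimising the neighbour-joining Q criterion and its Q value.

    `dist` maps frozenset({a, b}) to the distance between taxa a and b.
    """
    n = len(names)
    # Sum of distances from each taxon to all others.
    totals = {a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names}
    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair, best_q


# Toy distances for five hypothetical taxa.
taxa = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
distances = {frozenset(pair): d for pair, d in raw.items()}

pair, q_value = nj_first_join(taxa, distances)
print(pair, q_value)
```

Note that the pair joined first (a, b) is not the pair with the smallest raw distance (d, e); the Q criterion corrects each distance for how far the two taxa are from everything else.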

6.3 Results and Discussion

It has long been assumed that morphogenesis and virulence are associated in dimorphic fungi, as one morphotype exists in the environment or during commensalism, and another within the host during the invasive process. For instance, P. marneffei lives outside the host as an environmental saprophytic mould. Its primary infectious form may be conidia or mycelial fragments aerosolised from disturbed soil or animal excreta. After entering the host via the respiratory route upon inhalation, the cells rapidly convert to the yeast form. The same is true of the other dimorphic fungi, such as B. dermatitidis, C. immitis, H. capsulatum and P. brasiliensis. From the perspective of the fungal cell, the phenomenon of dimorphic switching can be divided into four interwoven events [275]: (i) perception of external stimuli by cellular sensors; (ii) transduction of the biochemical signal; (iii) alteration of genomic expression; and (iv) structural reorganization towards the morphological change.

6.3.1 Perception of external stimuli by cellular sensors

Table 6.1: GPCR family in P. marneffei and A. nidulans. ortholog relationship supported by synteny; „ when knocked out, no phenotypic changes. Abbreviations: Pm - P. marneffei, An - A. nidulans, Sc - S. cerevisiae, Af - A. fumigatus, and Sp - S. pombe.

Family  An gene            Pm gene    Sc/Af homolog  Sp homolog
1       gprA (AN2520.2)    Pm198.6    Ste2
2       gprB (AN7743.2)    Pm20.41    Ste3           Map3
        gprC (AN3765.2)„
3       gprD (AN3387.2)    Pm14.37    Gpr1           Git3
        gprE (AN9199.2)„
        gprF (AN5721.2)    Pm105.27   AF54.m07020    Stm1
4       gprG (AN5720.2)„   Pm34.71
        gprH (AN8262.2)    Pm58.4     AF53.m04209
5       gprI (AN8348.2)    Pm31.53

Limited information is available for ascomycetes about cellular sensors that detect external stimuli (especially temperature). Among known receptors, G protein-coupled receptors (GPCRs) are key components of heterotrimeric G protein-mediated signalling pathways. The receptors detect environmental signals and confer rapid cellular responses. The GPCR family has expanded in the genome of Aspergillus nidulans, as shown by recent analyses of that genome: nine genes (gprA∼gprI) predicted to encode seven-transmembrane-spanning GPCRs have been identified [125]. Among them, gprD was found to play a central role in coordinating hyphal growth and sexual development. Deletion of gprD causes extremely restricted hyphal growth, delayed conidial germination and uncontrolled activation of sexual development, resulting in a small colony covered by sexual fruiting bodies. We identified seven P. marneffei GPCRs closely related to the A. nidulans GPCRs (Table 6.1). The phylogenetic tree of fungal GPCR family genes (Fig. 6.1) helps to assign these putative P. marneffei GPCRs to their corresponding sub-families.
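Because GPCRs are recognised largely by their multiple membrane-spanning segments, a simple hydropathy scan gives a rough first check on a candidate sequence. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a mean-hydropathy threshold of 1.6, a commonly used setting for transmembrane segments. It is an assumption-laden illustration: the text does not state which predictor was used for the P. marneffei proteins, the toy sequence is invented, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_window_starts(seq, window=19, threshold=1.6):
    """Return 1-based starts of windows whose mean hydropathy exceeds the threshold."""
    scores = [KD[aa] for aa in seq]
    starts = []
    for i in range(len(scores) - window + 1):
        if sum(scores[i:i + window]) / window > threshold:
            starts.append(i + 1)
    return starts


def count_segments(starts):
    """Count runs of consecutive window starts; each run is one candidate segment."""
    segments, previous = 0, None
    for s in starts:
        if previous is None or s != previous + 1:
            segments += 1
        previous = s
    return segments


# Hypothetical toy sequence: two hydrophobic stretches separated by charged residues.
toy = "DDDDDKKKKK" + "L" * 20 + "DDDDDKKKKK" + "I" * 20 + "EEEEE"
n_segments = count_segments(hydrophobic_window_starts(toy))
print(n_segments)
```

A dedicated transmembrane predictor would be needed for real proteins; this scan only flags strongly hydrophobic stretches.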

Figure 6.1: Phylogenetic tree of fungal GPCR family genes. Classification of fungal GPCR families was carried out by analyses of P. marneffei Pm198.6, Pm20.41, Pm14.37, Pm105.27, Pm34.71, Pm58.4 and Pm31.53, A. nidulans GprA∼GprI, A. fumigatus Af54.m07020, Af53.m04209, Saccharomyces cerevisiae Ste2p, Ste3p, Gpr1p, Schizosaccharomyces pombe Mam2p, Map3p, Git3p, Stm1p, Dictyostelium discoideum cAR1p and crlAp (GenBank Acc.: AAO62367) using PROBCONS [71]. Algorithm parameters: Gaps/Missing data - Pairwise Deletion; Distance method - Amino Gamma Model [Pairwise distances]; Tree making method - Neighbour-joining.

6.3.2 Transduction of biochemical signal

Studies combining the powerful genetic and genomic tools available in fungi (mainly in Saccharomyces) have revealed three pathways that couple afferent signals to the dimorphic switch. Although many different signals can induce filamentous development, the strategies for connecting the external signal to the change in cell differentiation are broadly conserved among the fungi. For example, studies show that distantly related fungi (Saccharomyces, an ascomycete, and Cryptococcus, a basidiomycete) use common STE12 family members to form filamentous structures in response to nitrogen starvation, sharing a high degree of conservation in the regulatory pathways that control filamentous growth.

Studies on signalling filamentous growth in S. cerevisiae have revealed that four genes of the MAPK pathway that signals the mating pheromone response are also required for filamentous growth of diploid cells and the invasive growth of haploid cells (Fig. 1.6). These four genes are STE20, STE11 and STE7, which encode three protein kinases that act in sequence, and STE12, which acts as a transcription factor at the terminus of both pathways. In Fig. 1.6 all four genes are marked with asterisks, indicating that the orthologs of the S. cerevisiae genes have been identified in P. marneffei (see also Table 6.2). The STE20 homolog from P. marneffei, pakA (GenBank Acc. AY621630; Pm80.15), is known to be essential during yeast but not hyphal growth (Boyce KJ et al., personal communications). The STE12 homolog, stlA, has been cloned [19]. The P. marneffei stlA gene, together with the A. nidulans steA and C. neoformans STE12alpha genes, forms a distinct subclass of STE12 homologs that have a C2H2 zinc-finger motif in addition to the homeobox domain that defines STE12 genes. The stlA gene had no detectable function in vegetative growth, asexual development, or dimorphic switching in P. marneffei. However, stlA complements the sexual defect of an A. nidulans steA mutant [19]. These data suggest that although members of the STE12 family of regulators are involved in controlling both mating and yeast-hyphal transitions in a number of fungi, stlA in P. marneffei may play a role only in controlling mating processes (see also Chapter 5) and not in dimorphic switching. There may be as yet undetected compensatory genes or pathways responsible for dimorphic switching.

[Figure 6.2 diagram. Components shown: Ras2p (Pm85.8) and Gpa2p (Pm51.59); Cyr1p (Pm7.24), ATP → cAMP; Pde2p (Pm146.17), cAMP → AMP; PKA regulatory subunit (r), Bcy1p (Pm33.83); PKA catalytic subunits (c), Tpk1p, 2p, 3p (Pm18.86, Pm47.4, Pm19.3).]

Figure 6.2: P. marneffei genes in cAMP pathway.

Another pathway controlling filamentation in Saccharomyces is the cAMP pathway (Fig. 6.2). Ras2p and Gpa2p are regulators of cAMP levels, acting upstream of the adenylate cyclase Cyr1p, which in turn regulates the formation of cAMP. This process inactivates the cAMP-dependent protein kinase (protein kinase A, PKA), leading to enhanced filamentous growth in Saccharomyces. Homologs of all the genes in this pathway have been identified in P. marneffei (Fig. 6.2 and Table 6.2).

Another regulator implicated in Saccharomyces filamentation is the zinc-finger transcription factor Rim1p. It is activated by a proteolytic cleavage that depends on several other RIM genes (RIM8, RIM9, RIM13). Its homolog in Aspergillus nidulans, PacC, is also regulated by such a proteolytic mechanism. Again, homologs of all these RIM genes were identified in P. marneffei, suggesting the existence of this regulatory pathway.

Because signal transduction pathways have been well elucidated in Saccharomyces, the yeast has been used as a reference library for the analysis of conserved signalling pathways. However, even the most detailed analyses in S. cerevisiae can only provide stepping stones towards explaining key morphological features of more complex, multicellular filamentous fungi. These mould-specific features may include polarized hyphal growth, septation, establishment of multinucleate cellular compartments, cell type-specific gene expression, and subcellular localization of proteins. Furthermore, the protein networks of other fungi may differ even in their regulation of similar morphological tasks. Hence, further studies towards an understanding of these differences at the molecular level will remain an important task in functional analyses, particularly of organisms, like P. marneffei, whose genomes will be completely sequenced in the near future.

6.3.3 Alteration of the genomic expression

Elevated temperature is apparent by the major environmental stimulus to P. marneffei resulting in the fungus undergoing a mycelium-to-yeast transformation. However, the influence of elevated temperature on the overall gene expression of P. marneffei has not been studied. Neverthe- less, since surviving at the elevated temperatures, i.e. thermotolerance, is a trait critical to the ability of many fungal pathogens to thrive in host infections, a number of studies have been conduced in other fungi. For example, two genes have been implicated during growth at elevated tem- peratures in C. neoformans. Gene RAS1 (encoding a small GTP-binding protein) regulates filamentation, mating and growth at high tempera- ture [5]. Gene CNA1 (encoding calcineurin) is required for C. neofor- mans virulence and may define signal transduction elements required for fungal pathogenesis [236]. Homologs of both genes can be identified 137

Table 6.2: Homologous genes related to signal transduction in filamentous growth.

| Pathway | Sc gene | Pm gene | Function/product |
|---|---|---|---|
| MAPK pathway | STE20 (CST20) | Pm80.15 | Signal transducing kinase of the PAK family, involved in pheromone response and pseudohyphal/invasive growth pathways |
| MAPK pathway | STE11 | Pm129.8 | MAP kinase kinase kinase in the filamentous growth pathway |
| MAPK pathway | STE7 (HST7) | Pm161.15 | Serine/threonine/tyrosine protein kinase of MAP kinase kinase family |
| MAPK pathway | STE12 (CPH1) | Pm201.2 (stlA) | Ortholog to AN2290.2 (SteA). Members of the STE12 family of regulators are involved in controlling mating and yeast-hyphal transitions in a number of fungi |
| MAPK pathway | TEC1 | Pm109.16 (abaA) | Transcription factor participates in two developmental programmes: conidiation and dimorphic growth |
| MAPK pathway | PSS1 | Pm41.61 | MAP kinase dedicated to filamentation pathway |
| MAPK pathway | FUS3 | Pm8.42 | MAP kinase dedicated to pheromone response pathway |
| cAMP pathway | PDE2 | Pm146.17 | cAMP phosphodiesterase, component of the cAMP-dependent protein kinase signaling system |
| cAMP pathway | RAS2 | Pm85.8 | Regulator of cAMP levels |
| cAMP pathway | GPA2 | Pm51.59 | G protein alpha subunit homologue |
| cAMP pathway | CYR1 | Pm7.24 | Adenylate cyclase, required for cAMP production and cAMP-dependent protein kinase signalling |
| cAMP pathway | BCY1 | Pm33.83 | Regulatory subunit of the cyclic AMP-dependent protein kinase (PKA) |
| cAMP pathway | TPK1, 2, 3 | Pm18.86, Pm47.4, Pm19.3 | Subunit of cytoplasmic cAMP-dependent protein kinase; promotes vegetative growth in response to nutrients; inhibits filamentous growth |
| RIM1 related | RIM1 | Pm20.42 | Rim1p is homologous to the Aspergillus nidulans transcription factor PacC, which is also regulated by proteolysis |
| RIM1 related | RIM8 | Pm148.7 | Protein of unknown function, involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalF |
| RIM1 related | RIM9 | Pm26.50 | Involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalI |
| RIM1 related | RIM13 | Pm146.2 | Calpain-like protease involved in proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans palB |

within the P. marneffei genome. The P. marneffei homolog of C. neoformans RAS1, Pm85.8, is a known P. marneffei gene (rasA, GenBank Acc. AY232652). It has been confirmed experimentally to act upstream of CflA (Cdc42) to regulate germination of spores and polarized growth of both hyphal and yeast cells, while also exhibiting CflA-independent activities [23]. For CNA1, the putative homologue, Pm119.15, encodes a highly conserved calcineurin peptide sequence (557 aa long; 74% aa identity within an alignable region of 485 aa).

In addition to these analyses of individual genes’ functions, Steen et al. have initiated a genome-wide analysis of the response of C. neoformans to host temperature [296]. This analysis revealed differences in the levels of responsiveness of serotype A and D strains to growth at 25°C versus 37°C, with changes in transcript levels for histone genes, stress-related genes, and genes encoding translation components. Nunes et al. [234] used a Paracoccidioides brasiliensis biochip to monitor gene expression at several time points of the mycelium-to-yeast morphological shift. Their results revealed a total of 2,583 genes that displayed statistically significant modulation in at least one experimental time point. Among the identified genes, some encoded enzymes involved in

amino acid catabolism, signal transduction, protein synthesis, cell wall metabolism, genome structure, oxidative stress response, growth control, and development. In particular, the gene 4-HPPD, encoding 4-hydroxyl-phenyl pyruvate dioxygenase, is highly overexpressed during mycelium-to-yeast differentiation, and inhibition of this enzyme has been shown to inhibit growth and differentiation of the pathogenic yeast phase of the fungus in vitro [234]. Two copies of 4-HPPD, Pm48.10 and Pm14.48, were identified in the P. marneffei genome.

Neither C. neoformans nor P. brasiliensis is phylogenetically closely related to P. marneffei. Comparison of gene expression patterns with the much more closely related Aspergillus species may be more meaningful. Information about A. fumigatus gene expression during metabolic adaptation to higher temperatures became available recently [233]. Nierman et al. examined gene expression throughout a time course following a shift of growth temperature from 30°C to 37°C and to 48°C [233]. A total of 1,926 temperature shift-responsive genes were identified. Comparative data also indicate that high-temperature responses in A. fumigatus differ from the general stress response in yeast. We compared these genes against the P. marneffei genome in order to identify their homologs. Of the 1,926 genes, 1,032 have homologs in P. marneffei, i.e., a majority of A. fumigatus temperature shift-responsive genes are present in P. marneffei. Here the set of homologs was defined by identifying unique pairwise reciprocal best hits with at least 40% similarity in protein sequence and less than 20% difference in length. This result suggests that the genetic components of the high-temperature response in P. marneffei may not differ much from those in A. fumigatus.
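The homolog definition used above (reciprocal best hits, at least 40% similarity, less than 20% length difference) can be sketched in a few lines. This is a minimal illustration and not the pipeline used in this work: the `Hit` record, its field names and the toy gene identifiers are hypothetical, the best hit per query is assumed to have been extracted from a similarity search in each direction beforehand, and the length difference is assumed to be measured relative to the longer of the two proteins.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Best hit of one query protein (hypothetical record layout)."""
    query: str
    subject: str
    similarity: float  # percent similarity over the protein alignment
    query_len: int     # query length in amino acids
    subject_len: int   # subject length in amino acids


def reciprocal_best_hits(
    a_vs_b: Dict[str, Hit],
    b_vs_a: Dict[str, Hit],
    min_similarity: float = 40.0,
    max_len_diff: float = 0.20,
) -> List[Tuple[str, str]]:
    """Return (gene_a, gene_b) pairs that are each other's best hit
    and pass the similarity and length-difference filters."""
    pairs = []
    for gene_a, hit in a_vs_b.items():
        back = b_vs_a.get(hit.subject)
        # Reciprocity: the best hit of gene_b must point back to gene_a.
        if back is None or back.subject != gene_a:
            continue
        if hit.similarity < min_similarity:
            continue
        # Assumption: length difference is relative to the longer protein.
        longer = max(hit.query_len, hit.subject_len)
        if abs(hit.query_len - hit.subject_len) / longer >= max_len_diff:
            continue
        pairs.append((gene_a, hit.subject))
    return pairs
```

Because each query contributes at most one best hit and reciprocity is required, the resulting pairs are one-to-one, which is what "unique pairwise" is taken to mean here.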

The experiments mentioned above identified the temperature shift-responsive genes that may play a role in the structural or metabolic changes that take place during morphogenesis or may be necessary for colonisation and survival in the host. However, a direct interpretation

of the association between P. marneffei homologs of temperature shift-responsive genes in other fungi may not be reliable. Moreover, very few genetic determinants have been shown to be directly involved in phase transition and/or pathogenicity. Further studies of gene expression in P. marneffei are necessary to resolve these questions.

In addition to revealing the overall gene expression pattern, understanding the transcriptional mechanisms that control the dimorphic program is also important. Some of the transcription factors within known pathways have been mentioned above. Here I mention further studies that identified several other transcription factors controlling conidiation and dimorphic switching in P. marneffei. The P. marneffei abaA gene (Pm109.16), encoding an ATTS/TEA DNA-binding domain transcriptional regulator, regulates cell cycle events and morphogenesis in both filamentous and yeast growth [18]. The stuA gene (Pm107.14), encoding a basic helix-loop-helix transcription factor, may control processes that require budding but not those that require fission, as in dimorphic growth in P. marneffei [20]. TATA-binding protein (TBP) is a general transcription factor required for initiation of transcription in eukaryotes. The TBP-encoding gene, Tbp (Pm19.17), has been cloned and characterized in P. marneffei [254]. Tbp is essential for P. marneffei filamentous growth, but plays a less significant role in growth and development during the yeast phase. Furthermore, it has been shown that transcriptional regulation in S. cerevisiae appears to be mechanistically bipolar, i.e., TATA box-containing genes are predominantly involved in responses to stress, whereas TATA-less genes are mainly associated with constitutive housekeeping functions [12]. Only 20% of yeast genes contain a TATA box [12]. It is therefore of interest to see whether TATA-less promoters are also present in P. marneffei, which would suggest a need to balance inducible stress-related responses with constitutive housekeeping functions, or reflect the difference in the regulatory basis for growth and development of the

two morphological forms [254].

6.3.4 Structural reorganization towards the morphological change

It is reasonable to speculate that the mycelium-to-yeast transformation of P. marneffei is an active process triggered by a shift in temperature. The fungus undergoes a ‘drastic’ structural reorganisation associated with this active process. We assume that this process may be linked with a number of phenotypic changes like those characteristic of apoptosis, or programmed cell death. Indeed, programmed cell death has been observed in both A. fumigatus [225] and A. nidulans [313]. The metazoan upstream apoptotic machinery is absent in fungi, whereas the downstream effectors and regulators, both caspase-dependent and caspase-independent, seem to be present in A. fumigatus [225]. As in animal apoptotic cells, caspase activities are involved in self-activated proteolysis of the fungal mycelium. Searches of the P. marneffei genome revealed three genes (Pm105.4, Pm112.34 and Pm205.1) encoding metacaspase proteins that could be responsible for the caspase-like activities. Only two copies of these proteins were identified in the A. nidulans genome. The searches also found a single gene (Pm93.8) encoding a poly(ADP-ribose) polymerase (PARP) protein, a homologue of the key participant in caspase-independent apoptosis in mammals. PARP is one of the known target proteins inactivated by caspase degradation in animal cells. PARP activity was demonstrated previously in A. nidulans during sporulation-induced apoptosis. PARP is absent in S. cerevisiae but present in Aspergillus. The presence of these proteins in P. marneffei and Aspergillus is indicative of a PARP-dependent programmed cell death pathway. In addition, homologs of the mammalian apoptotic protein AMID are found in P. marneffei and A. fumigatus, but not in unicellular yeasts such as S. cerevisiae, further suggesting that mechanisms of cell death are more complex in filamentous fungi.

Analysis of the cell wall of P. marneffei is fundamental to understanding its

morphological transformation. In the mould form, the hyphal cell wall is essential for P. marneffei to penetrate solid nutrient substrates. In the yeast form, a transformed cell wall is essential to resist host cell defence reactions. The cell wall protects P. marneffei against the aggressive human defence reactions, harbours most of the fungal antigens and represents a potential drug target. Therefore, an understanding of cell wall biosynthesis pathways is important. We speculate that, as in many other filamentous fungi, the cell wall of P. marneffei is built from polysaccharide constituents comprising alpha and beta(1,3)-glucans, chitin, galactomannan, and beta(1,3),(1,4)-glucan. The corresponding structural genes, and genes encoding a number of enzymes responsible for their biosynthesis and remodelling, including synthases, transglycosidases, and glycosyl hydrolases, were identified in the P. marneffei genome (provided on the PMGD website: www.pmarneffei.hku.hk). One of the known differences between the yeast cell wall and the mycelium cell wall is that β1,6-glucan and peptidomannan, present in yeast cell walls, are missing in A. fumigatus [233]. β1,6-Glucan is a key component of the yeast cell wall, interconnecting cell wall proteins, β1,3-glucan, and chitin. The yeast genes KRE5, KRE6 and SKN1 are predicted to encode paralogous proteins that participate in assembly of the β1,6-glucan. Homologs of these three genes, Pm76.37, Pm104.21 and Pm34.5, were identified in the P. marneffei genome, as well as in the A. fumigatus genome. It seems likely that the specific inventory of cell wall biosynthetic genes in the P. marneffei genome determines the specific polymer organization of its cell wall, but further analysis is needed for confirmation.

As a general feature of development in eukaryotes, only a small proportion of the genome is associated with any particular morphogenetic process. In yeast for example, only 21-75 of the estimated 6,000 genes were assumed to be specific to meiosis and ascospore formation. This is also the case in P. marneffei. Therefore, the study of morphogenesis

should place its emphasis on the regulation of morphogenetic genes through differential expression or activity, rather than on large-scale replacement of one set of gene products by another. Gene expression studies in P. marneffei are still lacking. Nevertheless, the findings in this chapter offer new interpretive clues to the mechanisms of fungal virulence and dimorphism. First, the signalling systems that control dimorphism may be conserved between P. marneffei and related fungi. That is to say, many fungal species contain orthologous genes specifying the same pathways. Presumably, only subtle quantitative differences in the inputs and outputs of each pathway generate the different morphologies and behavioural characteristics. Second, dimorphism in P. marneffei may be controlled by multiple signalling pathways. In Saccharomyces, for comparison, at least three parallel pathways control the switch to filamentous growth. How the fungus integrates the information from different pathways to effect a change in cell type is not known.

In summary, morphogenesis is an essential developmental event, promoting host invasion and evasion by dimorphic fungi. Prevention of this event may hold the key to control of infections by these fungi. Understanding the molecular mechanisms of the morphological switch could lead to new drug or vaccine targets that block the earliest events in colonization or infection.

Chapter 7

INTRAGENIC TANDEM REPEATS IN PENICILLIUM MARNEFFEI AND OTHER ASCOMYCETES

Tandemly repeated DNA sequences occur frequently in the genomes of organisms. Although their function and origin are not truly understood, these highly dynamic genomic components may provide the most insights into how a pathogenic fungus adapts to the host immune system.

7.1 Introduction

A tandem repeat (TR) is defined as two or more adjacent copies of the same sequence of nucleotides and may result from tandem duplication event(s). Over time, individual copies within a TR may undergo additional, uncoordinated mutations, so that typically only approximate tandem copies are present. The number of adjacent copies in a TR can be variable. Lengths of TRs range from a few tens of base pairs (micro- and mini-satellites) to megabases (larger satellite repeats). Genomes, particularly those of eukaryotes, contain a large number of TRs. For example, 10% or more of the human genome is composed of TRs. Simple sequence repeats are fairly abundant in plant genomes, occurring approximately once in every 6 kb [258].

TRs are of biological importance for many reasons. First, they cause human diseases, including fragile-X mental retardation, Huntington’s disease and myotonic dystrophy [288], which result from a dramatic expansion in the number of copies of a trinucleotide pattern. Second, they play a variety of regulatory and evolutionary roles. The repeats may interact with transcription factors, alter the structure of the chromatin or act as protein binding sites [121, 208]. Third, they are important laboratory and analytic tools. They have been applied in linkage analysis and DNA fingerprinting [78,340], since the number of copies of a specific TR is often polymorphic in the population. Last but not least, TRs apparently play a role in the development of immune system cells in humans. Du et al. [75] showed that breakpoints of immunoglobulin switch recombination, which occur between pairs of switch regions located upstream of the constant heavy chain genes, cluster to a defined subregion in three TRs.

The most interesting feature of TRs is their association with the functional variability of a gene product. Most TRs are in intergenic regions, but some are in coding sequences or pseudogenes. Verstrepen et al. [328] showed that in the genome of Saccharomyces cerevisiae, most genes containing intragenic TRs (IntraTRs) encode cell-wall proteins. The presence of IntraTRs facilitates recombination within the gene or between the gene and a pseudogene. The result of this increased frequency of recombination events is an expansion or contraction of the gene size. More importantly, this size variation creates quantitative alterations in phenotypes (e.g., adhesion, flocculation or biofilm formation). The variation of the fungal cell surface allows fungal microbes to ‘disguise’ themselves in order to evade the host immune system’s defences.

Inspired by the findings of Verstrepen et al. [328], this chapter aims to characterise the composition of IntraTRs in the genome of Penicillium marneffei and in those of related species. Using computer programs, we searched for both long and short repeated sequences within protein-coding regions in P. marneffei and related Ascomycetes. Comparison of observed frequencies with expected values reveals that repeats are enriched in the P. marneffei genome.

7.2 Materials and Methods

7.2.1 Identification of coding tandem repeats

The previously described methodology [328] was applied to find IntraTRs in the P. marneffei genome and other fungal genomes, using the EMBOSS ETANDEM software [263] to screen the sequences. The ETANDEM threshold score was set to 20. All known and predicted genes were scanned for long (≥ 40 nucleotide (nt)) or short (3-39 nt) repeats. Here a sequence was considered to be an intragenic repeat if it met two conditions: (i) repeat conservation was at least 85%; and (ii) the number of repeats was at least 20 for trinucleotide repeats, 16 for repeats between 4 and 10 nt, 10 for repeats between 11 and 39 nt and 3 for repeats of at least 40 nt.
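The two acceptance conditions above can be written down as a small predicate. This is only an illustrative sketch of the stated thresholds, not the ETANDEM-based screen itself; the function names and the choice to pass conservation as a percentage are assumptions made for the example.

```python
from typing import Optional


def min_copies(unit_len: int) -> Optional[int]:
    """Minimum number of copies required for a repeat unit of this length (nt).
    Returns None for units shorter than 3 nt, which are not considered."""
    if unit_len == 3:
        return 20
    if 4 <= unit_len <= 10:
        return 16
    if 11 <= unit_len <= 39:
        return 10
    if unit_len >= 40:
        return 3
    return None


def is_intragenic_repeat(unit_len: int, copies: int, conservation: float) -> bool:
    """Apply conditions (i) and (ii): conservation of at least 85% and a
    length-dependent minimum copy number."""
    needed = min_copies(unit_len)
    return needed is not None and conservation >= 85.0 and copies >= needed


def is_long_repeat(unit_len: int) -> bool:
    """Long repeats have units of at least 40 nt; short repeats span 3-39 nt."""
    return unit_len >= 40
```

For example, a trinucleotide unit repeated 24 times at 90% conservation passes, whereas the same unit repeated 19 times does not.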

7.2.2 Sequence analysis

Position-specific iterated BLAST (PSI-BLAST) [6] was used to search publicly available microbial genome sequences, GenBank, or EMBL. GenBank and EMBL were accessed through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and the Oxford University Bioinformatics Centre, respectively. Protein domains were determined using the NCBI Conserved Domain Search. The MBEToolbox package (Chapter 10) was used for nucleotide and amino acid sequence analysis and alignments.

7.3 Results and Discussion

One of the ultimate goals of sequence analysis is to accurately identify candidate virulence genes that confer pathogenicity on P. marneffei. General comparative analyses, such as ortholog prediction and species-specific gene detection, are valuable but not very specific; that is to say, these methods yield too many candidate genes. To narrow these candidate

Table 7.1: P. marneffei genes containing intragenic tandem repeats. Column “Size” is the length of the repeat unit and “Count” is the number of occurrences of the repeat unit; the total length of the repeat region therefore equals Size × Count. Sequence identity (%) of repeat units is greater than 80%. Consensus sequences of the repeat unit for each gene are available in PMGD. * indicates that the gene contains more than one type of repeat. Genes are ordered by the size of the repeat unit. The last 12 genes contain short repeats; the rest contain long repeats.

| Pm gene | Size | Count | Putative function |
|---|---|---|---|
| Pm6.47 | 228 | 3 | Polyubiquitin, similar to S. cerevisiae UBI4 (YLL039C) |
| Pm27.95 | 171 | 5 | Unknown function |
| Pm78.37* | 165 | 3 | Unknown function |
| Pm54.4 | 147 | 3 | Streptococcal protective antigen (Q8NZA4) |
| Pm71.41 | 144 | 5 | Unknown function |
| Pm133.2 | 141 | 3 | Unknown function |
| Pm1.199 | 126 | 12 | Homologous to AN7363.2, AN3547.2 and AN8457.2 |
| Pm12.139 | 126 | 9 | Putative ATP/GTP binding protein |
| Pm14.111 | 126 | 4 | O-acetylhomoserine (Thiol)-lyase (CYSD_EMENI) |
| Pm30.75 | 126 | 7 | Beta transducin-like protein HET-E2C*4 (Q8X1P4) |
| Pm35.44 | 126 | 11 | Beta transducin-like protein HET-E2C (Q8X1P5) |
| Pm94.31 | 126 | 8 | Putative ATP/GTP binding protein (Q6TMU6) |
| Pm210.2 | 126 | 9 | Beta transducin-like protein HET-D2Y (Q8X1P2) |
| Pm39.56 | 120 | 3 | Unknown function |
| Pm183.10 | 117 | 3 | Casein kinase I homolog hhp1 (HHP1_SCHPO) |
| Pm54.56* | 108 | 6 | Pedal peptide precursor protein (O01387) |
| Pm12.114 | 102 | 3 | Unknown function |
| Pm161.1 | 102 | 3 | Phosphorylase (Q8TK58) |
| Pm77.10 | 99 | 5 | KIAA1223 protein (Q8TB46) |
| Pm226.4* | 99 | 7 | Ankyrin 2 (Q9NCP8) |
| Pm209.2 | 96 | 3 | Beta transducin-like protein HET-E4S (Q8X1P6) |
| Pm44.53 | 81 | 3 | Related to transport protein USO1 (Q873K7) |
| Pm163.5 | 78 | 9 | Erythrocyte binding protein 3 [Plasmodium falciparum] (Q7K5Q6) |
| Pm42.29 | 72 | 3 | Phenol 2-monooxygenase (Q8X0B1) |
| Pm117.16* | 72 | 5 | Unknown function |
| Pm31.1 | 66 | 5 | Unknown function |
| Pm34.34 | 66 | 5 | Chitinase (Q873Y0) |
| Pm54.65 | 66 | 4 | Extensin class I (cell wall hydroxyproline-rich glycoprotein) [Plasmodium falciparum] (Q09082) |
| Pm78.42 | 66 | 3 | Chitinase 4 (Q7ZA41) |
| Pm118.4 | 63 | 5 | Unknown function |
| Pm40.30 | 60 | 3 | PAAA motif protein, similar to microfilament and actin filament cross-linker protein [Pan troglodytes] |
| Pm64.14 | 60 | 8 | Zonadhesin [Mouse]; PT repeat protein family (EAL93999) [Aspergillus fumigatus] |
| Pm95.32 | 60 | 3 | Related to mannosyltransferase ALG2 (Q8X0H8) |
| Pm194.2 | 60 | 3 | Retrovirus-related Pol polyprotein from transposon TNT 1-94 (POLX_TOBAC) |
| Pm41.72 | 54 | 5 | Unknown function |
| Pm166.6 | 54 | 3 | Unknown function |
| Pm48.11 | 48 | 5 | Similar to S. cerevisiae YJR054W (Q6CXI0) |
| Pm78.3 | 48 | 4 | Telomere-linked helicase 1 (Q8J216) |
| Pm173.14 | 48 | 4 | Telomere-linked helicase 1 (Q8J216) |
| Pm194.1 | 48 | 4 | Telomere-associated recQ-like helicase (O13400) |
| Pm194.5 | 48 | 5 | Polymerase (Q9C435) |
| Pm224.1 | 48 | 3 | Telomere-linked helicase 1 (Q8J216) |
| Pm224.2 | 48 | 5 | Telomere-linked helicase 1 (Q8J216) |
| Pm230.1 | 48 | 5 | Telomere-linked helicase 1 (Q8J216) |
| Pm234.1 | 48 | 5 | Telomere-linked helicase 1 (Q8J216) |
| Pm236.2 | 48 | 4 | DWIQ motif containing hypothetical protein (NP_702011) PF14_0123 [Plasmodium falciparum] |
| Pm236.3 | 48 | 7 | Telomere-linked helicase 1 (Q8J216) |
| Pm247.2 | 48 | 5 | Telomere-linked helicase 1 (Q8J216) |
| Pm108.33 | 45 | 4 | Unknown function |
| Pm8.109 | 42 | 3 | ATPase, AAA family |
| Pm40.29 | 42 | 3 | Unknown function |
| Pm40.31 | 42 | 4 | H7H motif in multiple proteins of Plasmodium |
| Pm52.29 | 42 | 3 | Mitochondrial chaperone BCS1 (BCS1_XENLA) |
| Pm210.1 | 42 | 4 | Unknown function |
| Pm173.16 | 24 | 10 | Unknown function |
| Pm36.21 | 12 | 11 | Unknown function |
| Pm1.35 | 6 | 25 | Transcription initiation factor TFIID subunit 12 (TAF12_YEAST) |
| Pm1.28 | 3 | 24 | Unknown function |
| Pm3.168 | 3 | 28 | Putative cell wall protein FLO11p (Q7Z884) |
| Pm5.75 | 3 | 25 | Dynamin binding protein, TUBA; DNMBP_MOUSE (Q6TXD4) |
| Pm14.75 | 3 | 29 | Unknown function |
| Pm22.8 | 3 | 22 | Unknown function |
| Pm67.24 | 3 | 22 | Related to heat shock transcription factor HSF21 (Q9P554) |
| Pm76.36 | 3 | 21 | Unknown function |
| Pm85.21 | 3 | 30 | Unknown function |
| Pm138.7 | 3 | 24 | Oxygenase-like protein (Q93M01) |

genes down to a manageable number, genes that contain IntraTRs were carefully investigated. This is because IntraTRs have been suggested to generate functional variability in S. cerevisiae, and variation in IntraTR number provides the functional diversity of cell surface antigens that, in fungi and other pathogens, allows rapid adaptation to the environment and evasion of the host immune system [328]. In S. cerevisiae, a total of 44 such genes with known functions have been identified. These genes show unexpected functional similarities: 62% of those with conserved long repeats encode cell-wall proteins [328].

A total of 66 P. marneffei genes that contain IntraTR(s) were identified (Table 7.1). Nearly one third of these genes are of unknown function, i.e., neither putative homologs have been detected by the extensive

PSI-BLAST search against GenPept databases, nor putative conserved domains have been detected. These genes may be P. marneffei-specific. The remaining two thirds, for which putative homologs can be found, are genes with assigned functions. Nine of these genes, namely Pm78.3, Pm173.14, Pm224.1, Pm224.2, Pm230.1, Pm234.1, Pm236.3, Pm247.2 and Pm194.1, are homologs of the Magnaporthe grisea telomere-linked helicase 1 (TLH1) gene. Genetic mapping showed that most members of the TLH gene family are tightly linked to the telomeres and located within 10 kb of the telomeric repeat. Similar helicase gene families are also present at the chromosome ends of Saccharomyces cerevisiae and Ustilago maydis, which suggests that the initial association of helicase genes with fungal telomeres might date back to the very early stages of fungal evolution [103]. Four genes, Pm210.2, Pm30.75, Pm35.44 and Pm209.2, are homologs of beta transducin-like protein genes, most similar to Podospora anserina het-d2y, het-e2c*4, het-e2c and het-e4s, respectively (Table 7.1). These genes are involved in vegetative incompatibility, which prevents a viable heterokaryotic cell from being formed by the fusion of filaments from two different wild-type strains. In P. anserina, such incompatibility is always the consequence of at least one genetic difference in het genes, specifically het-e and het-d. These loci control heterokaryon viability through genetic interactions with alleles of the unlinked het-c locus [82]. Other interesting homologs include streptococcal protective antigen, chitinase, extensin, zonadhesin and erythrocyte binding protein (Table 7.1).

For further experimental studies, such as DNA typing, only those genes that are most likely to be responsible for P. marneffei’s pathogenic adaptation should be selected. The selection process involves multi-step filtering. The underlying rationale is that a candidate virulence gene has to be (1) P. marneffei-specific (without orthologs, or with orthologs containing no similar IntraTR), and (2) functionally known to be related to intracellular

adaptation or otherwise completely functionally unknown. Moreover, in order to conduct a PCR-based IntraTR length polymorphism study, the constraint on the length of target DNA in PCR reactions has to be taken into account. After the multi-step filtering and an examination of the lengths of the IntraTRs and introns of these genes, two genes, Pm40.30 (745 bp) and Pm40.31 (733 bp), were selected for further polymorphism study. The lengths of IntraTRs plus introns of the two genes are 234 and 277 bp, respectively. What makes these two genes special is their BLAST results. The top PSI-BLAST hit of Pm40.30 against the NCBI NRProt database is a hypothetical chimpanzee protein containing multiple PAAA motifs, while the top hit of Pm40.31 is a hypothetical protein from Plasmodium falciparum containing a histidine-rich motif. Although the function of this hypothetical protein is unknown, it is noteworthy that another histidine-rich protein, PfHRP2, encoded by the P. falciparum gene HRP-2, is indeed responsible for intracellular adaptation of this parasite [11]. PfHRP2 binds heme, playing a role in hemoglobin proteolysis, which is the primary nutrient source of the erythrocytic growth stage of P. falciparum [52].

The relative abundances of IntraTRs in different fungi were compared. Table 7.2 shows the genome size, G, the number of bases in repeat regions, B, and the number of genes containing repeats, N, for several fungi. When all diploid and haploid species are considered together, the two diploid fungi, S. cerevisiae and C. albicans, show higher B/G ratios. It appears that genomes of diploid species may accommodate more bases in IntraTR regions, as much as 3 times more. Among haploid fungi, P. marneffei shows the highest B/G ratio, i.e., the fraction of its bases belonging to repeat regions is higher than in any other haploid fungus examined. We argue that the relatively more abundant IntraTRs in P. marneffei might contribute to its immune-escape mechanism, which enables the fungal pathogen to survive within its host. Finally, note that B/N ratios remain largely constant across different species, i.e., the average number of repeat bases per repeat-containing gene is similar.

Table 7.2: Comparison of genome size and bases in repeats. Abbreviations: Pm, P. marneffei; Af, Aspergillus fumigatus; An, Aspergillus nidulans; Sc, Saccharomyces cerevisiae; Ca, Candida albicans; Mg, Magnaporthe grisea; Nc, Neurospora crassa.

| | Pm | Af | An | Sc | Ca | Mg | Nc |
|---|---|---|---|---|---|---|---|
| Diploid | No | No | No | Yes | Yes | No | No |
| Genome size (Mb), G | 30 | 28 | 30 | 12 | 16 | 39 | 40 |
| Bases in repeat regions (bp), B | 23,814 | 12,687 | 16,820 | 29,664 | 34,662 | 16,933 | 22,101 |
| No. of genes containing repeats, N | 66 | 33 | 31 | 69 | 82 | 62 | 121 |
| B/G ratio | 794 | 453 | 561 | 2,472 | 2,166 | 434 | 553 |
| B/N ratio | 361 | 384 | 543 | 430 | 423 | 273 | 183 |
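The two ratios in Table 7.2 follow directly from G, B and N. The short sketch below recomputes them from the values transcribed from the table; the dictionary and function names are introduced here purely for illustration, and B/G is expressed as repeat bases (bp) per Mb of genome.

```python
# Values transcribed from Table 7.2:
# genome size G (Mb), bases in repeat regions B (bp), repeat-containing genes N.
TABLE_7_2 = {
    "Pm": (30, 23814, 66),
    "Af": (28, 12687, 33),
    "An": (30, 16820, 31),
    "Sc": (12, 29664, 69),
    "Ca": (16, 34662, 82),
    "Mg": (39, 16933, 62),
    "Nc": (40, 22101, 121),
}


def repeat_ratios(genome_mb, repeat_bp, n_genes):
    """Return (B/G in bp per Mb, B/N in bp per repeat-containing gene)."""
    return repeat_bp / genome_mb, repeat_bp / n_genes


RATIOS = {species: repeat_ratios(*values) for species, values in TABLE_7_2.items()}

if __name__ == "__main__":
    for species, (b_over_g, b_over_n) in RATIOS.items():
        print(f"{species}: B/G = {b_over_g:.0f} bp/Mb, B/N = {b_over_n:.0f} bp/gene")
```

For P. marneffei this gives B/G of about 794 and B/N of about 361, matching the tabulated values, and the S. cerevisiae B/G ratio comes out roughly three times that of P. marneffei.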

The amino acid composition of a protein is the mole percent of the different amino acids in its sequence. It is usually conserved among the same proteins of different species. Here we performed a cross-species comparison of the amino acid composition of IntraTRs (Fig. 7.1). The two yeasts show a visibly different pattern from that of the moulds. S. cerevisiae and C. albicans use far more threonine and/or serine residues than any other amino acid, whereas the patterns among the moulds contrast more strongly with one another. Serine is used most in P. marneffei and A. fumigatus, alanine in A. nidulans, glycine in N. crassa and isoleucine in M. grisea. Phenylalanine, valine and tryptophan are consistently used less in all species. The overall patterns of P. marneffei, A. nidulans and A. fumigatus are similar to each other. This result suggests that differences in amino acid composition are associated with the phylogenetic distances among species, which in turn suggests that the amino acid composition of IntraTRs is not shaped by neutral mutation alone but is under a certain level of selective constraint.

The cell surfaces of microorganisms show distinctive properties which

[Figure 7.1: seven horizontal bar-chart panels (Af, An, Nc, Ca, Mg, Sc, Pm); x axis: amino acid occurrence; y axis: amino acids A to Y.]

Figure 7.1: Amino acid composition in intragenic tandem repeats. Fungal species are: Af, A. fumigatus; Pm, P. marneffei; An, A. nidulans; Sc, S. cerevisiae; Ca, C. albicans; Nc, N. crassa; Mg, M. grisea. For each subplot, the x axis is the occurrence of each amino acid and the y axis lists the amino acids, from top to bottom: A - Alanine, C - Cysteine, D - Aspartic Acid, E - Glutamic Acid, F - Phenylalanine, G - Glycine, H - Histidine, I - Isoleucine, K - Lysine, L - Leucine, M - Methionine, N - Asparagine, P - Proline, Q - Glutamine, R - Arginine, S - Serine, T - Threonine, V - Valine, W - Tryptophan, and Y - Tyrosine.

can be recognised by the host immune system. Many microorganisms have the ability to switch their cell-surface molecules, a tactic that permits them to elude the immune system and adhere to diverse materials and cells (for review, see [329]). The human immune system poses challenges to P. marneffei, which might have characteristic cell-surface molecules that are recognized by dedicated phagocytic cells. Recent studies linked the diversity of cell surface molecules to the variation in IntraTR number. The persistence of a large number of IntraTRs in the P. marneffei genome suggests that there is a compensating benefit. We therefore propose that variation in IntraTR number provides the functional diversity of cell surface antigens in P. marneffei, allowing rapid adaptation to the environment and evasion of the host immune system.

Chapter 8

EXTENT AND EVOLUTIONARY PATTERN OF DUPLICATE GENES IN PENICILLIUM MARNEFFEI AND OTHER ASCOMYCETES

Gene duplication and subsequent divergence have long been believed to be of importance for the functional novelty and complexity of organisms. The extent and evolutionary patterns of duplicate genes (paralogs) have long been studied in higher eukaryotes, but not in lower eukaryotes such as fungi. In this chapter, gene-coding sequences in the genome of Penicillium marneffei, together with those of other ascomycetes, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Candida albicans, Aspergillus nidulans and Neurospora crassa, are used to identify multigene families. The number of synonymous substitutions per synonymous site, Ks, and the number of nonsynonymous substitutions per nonsynonymous site, Ka, are calculated to measure the time (or relative frequency) of duplication as well as the selective constraint on gene pairs. The evolutionary rates of duplicate gene pairs are measured by applying the codon substitution model, which is more sensitive than traditional models [111]. A large variation in the extent of gene duplication in these species was found (the percentage of genes in multigene families ranged from 23.6% in S. cerevisiae to 8.0% in N. crassa). The age distribution of the gene duplications tentatively suggests that the P. marneffei genome may have experienced two rounds of large-scale duplication. It is also found that paralogs in filamentous ascomycetes (but not paralogs in yeast ascomycetes) are under weaker functional constraint than orthologs. Analysis of the divergence of evolutionary rates in S. cerevisiae and C. albicans revealed that 17.8% of gene pairs show an asymmetric divergence pattern in amino-acid substitutions. However, there is no evidence that this asymmetry is associated with positive selection. I speculate that the different extent and evolutionary pattern of duplicate genes in these ascomycetes might be associated with their genotypic and phenotypic differences.

8.1 Introduction

In the early 1970s, Ohno proposed in his book that gene duplication is a major evolutionary source of gene innovation [237]. By this he meant that the creation of a paralog of a gene through duplication (by any of several possible means) results in one of the duplicates being functionally redundant. This redundant copy may mutate more freely without affecting the overall fitness of the organism, and thus is more likely to become a gene with a novel function. Biologists now generally accept the view that, by creating sets of gene paralogs, gene duplication plays an important role in the adaptation of organisms to their environment and in the origin of the phenotypic diversity of organismal evolution [210]. With the completion of several eukaryotic genome projects, it is now well known that one of the characteristics of eukaryotic genomes is the presence of duplicate genes, forming numerous gene families [287]. More than a third of a typical eukaryotic genome consists of gene families [115,287,345]. Whole genome duplication(s) during the earlier evolution of the vertebrate lineage have been proposed to account for the presence of extensive gene duplications in most vertebrate genomes [209, 221,287].

The extent of gene families in an organism is determined firstly by the frequency and magnitude of gene duplication events, and secondly by the subsequent evolutionary fates of gene pairs following the duplication events. This may be better understood through comparative studies of sequence divergence in duplicate genes in different genomes.

However, until recently few such studies had been conducted, and only in the limited number of representative organisms available [68, 174, 210], because inter-genomic comparisons of this kind rely on the availability of complete genome sequences from multiple organisms.

In this study, I compare the extent and evolutionary pattern of duplicate genes in the phylum Ascomycota, using the complete sets of protein-coding genes in the fungi Saccharomyces cerevisiae [110], Schizosaccharomyces pombe [354], Candida albicans, Aspergillus nidulans, Penicillium marneffei and Neurospora crassa [101]. These fungal species display different life styles and phenotypic characteristics. The brewer’s yeast, S. cerevisiae, and the fission yeast, S. pombe, have a life cycle characterized by a unicellular thallus that reproduces by budding and fission, respectively. The filamentous ascomycetes N. crassa and A. nidulans grow hyphae apically and branch laterally. P. marneffei shows dimorphic switching between mould and yeast forms of growth at different temperatures. It is of interest to know how gene duplication has shaped their gene repertoires, leading to novel genes that confer novel adaptive functions in these fungi.

In practice, I used nucleotide alignments of duplicate genes to calculate two key parameters of molecular evolution: the number of synonymous (silent) substitutions per synonymous site, Ks, and the number of nonsynonymous (amino-acid replacement) substitutions per nonsynonymous site, Ka. Ks provides a crude measure of the time since duplication for each gene pair, if we assume that Ks increases approximately linearly with time.

The ratio Ka/Ks provides a measure of the selection pressure to which a gene pair is subjected. Generally speaking, a Ka/Ks ratio of 1 means that the duplicate genes are under few or no selective constraints (i.e., amino-acid replacement substitutions occur at the same rate as synonymous substitutions). A Ka/Ks ratio > 1, which is strong evidence for positive selection, indicates that replacement substitutions occur at a rate higher than that expected by chance, so advantageous mutations have occurred during sequence divergence. In contrast, a Ka/Ks ratio < 1 is consistent with ‘purifying selection’; that is to say, some amino-acid replacement substitutions have been purged by natural selection because of their deleterious effects [48].

Another evolutionary pattern that has attracted great interest is the asymmetry of evolutionary rates between the two copies of a duplicated gene pair, i.e., one copy evolves faster than the other. Intensive studies of this pattern in different organisms have produced a wide range of estimates of the proportion of duplicate gene pairs that show asymmetric evolution [59, 68, 137, 174, 265, 321, 370, 371].

Since the completion of the whole genome sequence of S. cerevisiae [110], a number of studies have identified multigene families in this model eukaryotic genome. The number of multigene families in S. cerevisiae reported by Rubin et al. [270] is higher than that reported by Friedman and Hughes [95] (1858 compared to 1440). This is because the former study used a simple criterion, a BLAST E-value of 10^-6, while the latter used a much stricter search with E = 10^-50. However, using a single statistical score (such as the E-value given by a BLAST search or a related score) without specifying the proportion of alignable regions may put two non-homologous proteins into the same family because of domain sharing [118]. Hence, in this study, in order to obtain a reasonable estimate, I adopted a relatively stringent definition in which the lengths of the encoded proteins are taken into account, instead of relying on E-values only.

8.2 Literature Review

The ability to adapt to changing environments and to exploit new niches has a great influence on the success of an organism [210]. This ability is associated with new genes or genes with new functions [219]. Gene duplications are traditionally considered to be a major evolutionary source of new protein functions. After duplication, the fate of the resulting copy of a gene is of great interest. At least three hypotheses have been proposed, as follows:

Nonfunctionalisation The classical view pioneered by Susumu Ohno [237] holds that gene duplication produces two functionally redundant, paralogous genes and thereby frees one of them from selective constraints. The duplicate gene may be degraded to a pseudogene by mutational inactivation and could finally be removed from the genome by deletion [237, 238]. This is the most likely fate of duplicate genes [237, 68].

Neofunctionalisation The duplicate gene may avoid redundancy by assuming a novel function, i.e., the redundant copy may be modified and in time assume a new role [237, 166, 336, 334, 298, 212]. Since this unconstrained paralog is free to accumulate neutral mutations, there is the possibility of fixation of mutations that may lead to a new function. This prediction was supported by studies on isozyme spectra of polyploidy in a number of organisms (reviewed in [196]). Of course, mutational time is a deciding factor, since copies need sufficient modifications to assume roles different from their parents, assuming that they are initially of neutral fitness. Thus, the deletion rate is of great importance to gene innovation: it must be sufficiently slow to give copies time to diverge.

Both of the hypotheses above assume that one copy of a duplicate gene pair is free to evolve, while the other remains under selective pressure. This has been challenged in work by Kondrashov et al. [174] and Lynch and Conery [212], who show that paralogs do not seem to have experienced any extensive period of neutral evolution. Kondrashov et al. [174] proposed that paralogs avoid neutrality through gene amplification, followed by a period of either relaxed or positive selection. They also observed that paralogs evolve faster than their corresponding orthologs. Again, this could be due to relaxed or positive selection. Furthermore, a study of 17 pairs of duplicate genes in the tetraploid frog Xenopus laevis has shown that both copies were subject to purifying selection, contrary to the notion of neutrality of one of the copies [137]. The failure of empirical research to support Ohno’s model has led to the proposal of an alternative hypothesis: subfunctionalisation.

Subfunctionalisation The third hypothesis, ‘subfunctionalisation’ or the duplication-degeneration-complementation (DDC) model [90], proposes that duplicate genes come under selective pressure and are retained by losing separate subfunctions from a multifunctional ancestral gene. Redundant material is discarded through degradation [90]. It also states that duplicate genes are initially redundant in function and, accordingly, a duplication event is selectively neutral. But it differs from the classical hypothesis in that successfully retained subdomains can be reused for a subset of the original functions or even for other new or related purposes [90]. As a result, the two genes can be said to belong to a family, being related by sequence similarity, if not by function. Naturally, this relationship will decrease with time until no discernible similarity can be observed in regions of low conservation. A large number of observations support this model, although mostly in diploid or polyploid eukaryotes.

8.3 Materials and Methods

8.3.1 Sequences and gene families

For each organism other than P. marneffei, the complete sets of available putative amino-acid sequences and coding DNA sequences were downloaded from genomic databases as follows: for S. cerevisiae, http://genome-www.stanford.edu/Saccharomyces; for S. pombe, http://www.genedb.org/genedb/pombe (Schizosaccharomyces pombe GeneDB); for C. albicans, http://genolist.pasteur.fr/CandidaDB/ (CandidaDB Data Release R1 Dec 17, 2001), a genome database created by the EU-funded consortium Galar Fungail by performing independent annotation of assembly 19 sequence data obtained from the Stanford Genome Technology Centre (http://www-sequence.stanford.edu/group/candida); for A. nidulans, http://www.broad.mit.edu/annotation/fungi/aspergillus/ (Aspergillus nidulans Database); and for N. crassa, http://www-genome.wi.mit.edu/annotation/fungi/neurospora (Neurospora crassa Database release 3: 02.12.2002). All protein sequences that were annotated as known or suspected pseudogenes and those proteins encoded by mitochondrial genomes were removed. Gene families in each genome were identified by using BLASTCLUST (at least 30% identical residues, aligned over at least 80% of their lengths). BLASTCLUST applies the single-linkage algorithm. For documentation on its use, see ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl. The clusters were used to identify and count duplication events (although not all pairs of genes in a cluster are homologous to each other). Throughout the analysis, the same criteria were applied in searching for orthologs of genes from all other species; that is to say, orthologs were predicted by BLASTP search for interspecies genes with > 30% identical residues and an alignable region covering at least 80% of their lengths.
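The single-linkage grouping described above can be illustrated with a minimal sketch. It is not BLASTCLUST itself: the function name, the input format (precomputed pairwise hits) and the choice to require coverage of the longer protein are assumptions made for illustration only.

```python
from collections import defaultdict


def single_linkage_families(genes, hits, min_identity=30.0, min_coverage=0.8):
    """Group genes into families by single linkage.

    genes: dict mapping gene id -> protein length (hypothetical input format)
    hits:  iterable of (gene_a, gene_b, percent_identity, aligned_length)
    """
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, aligned_len in hits:
        if a == b:
            continue
        # Assumption: the alignment must cover >= 80% of the longer protein,
        # which implies >= 80% coverage of both.
        covered = aligned_len >= min_coverage * max(genes[a], genes[b])
        if identity >= min_identity and covered:
            parent[find(a)] = find(b)

    families = defaultdict(list)
    for g in genes:
        families[find(g)].append(g)
    # Largest families first; ties broken by member names.
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))


# Toy data: g1-g2 and g2-g3 pass both thresholds; g3-g4 fails on coverage.
genes = {"g1": 100, "g2": 105, "g3": 98, "g4": 300}
hits = [("g1", "g2", 45.0, 95), ("g2", "g3", 35.0, 90), ("g3", "g4", 60.0, 90)]
families = single_linkage_families(genes, hits)
print(families)
```

In the toy data g1 and g3 end up in the same family although they were never compared directly, which is the property of single linkage noted in the text: not every pair of genes in a cluster need satisfy the similarity criteria itself.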

8.3.2 Estimation of substitution rate

Gene families with sequences similar to known transposable elements were removed at this point and excluded from the rest of the analysis. Paralogous protein sequences were aligned using ClustalW version 1.82 with the default parameters (PAM matrix; gap opening penalty = 10.0; gap extension penalty = 0.2). The corresponding nucleotide-sequence alignments were derived from the protein alignments by substituting the respective coding sequences for the aligned protein sequences, using MBEToolbox (Chapter 10). Ks and Ka were calculated by the maximum-likelihood method implemented in the CODEML program of the PAML package version 3.13d [359].
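The step of deriving a nucleotide alignment from a protein alignment can be sketched as follows. This is an illustrative Python sketch, not the MBEToolbox routine used in the study; the function name and the handling of a trailing stop codon are assumptions.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread a coding sequence onto one gapped row of a protein alignment.

    aligned_protein: protein sequence with '-' for gaps
    cds:             the ungapped coding sequence of the same gene
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Assumption: a trailing stop codon without a matching residue is dropped.
    if len(codons) == n_residues + 1:
        codons = codons[:-1]
    if len(codons) != n_residues or any(len(c) != 3 for c in codons):
        raise ValueError("coding sequence does not match protein length")

    out = []
    k = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")      # one amino-acid gap becomes a codon gap
        else:
            out.append(codons[k])  # next codon for the next residue
            k += 1
    return "".join(out)


aligned = protein_to_codon_alignment("M-KV", "ATGAAAGTT")
print(aligned)
```

Because gaps are inserted in whole codons, the reading frame is preserved, which codon-based estimation of Ks and Ka requires.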

Following the procedure described in Zhang et al. [371], the pair of duplicate genes with the smallest value of Ks was picked within each family. This process was repeated for the remaining genes within the family until no more gene pairs could be picked. The process was implemented by ad hoc scripts in Perl.
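One reading of this selection procedure is a greedy loop in which both genes of a picked pair are removed before the next pair is chosen. The sketch below follows that reading; it is written in Python for illustration, is not the Perl script used in the study, and its function name and input format are hypothetical.

```python
def pick_independent_pairs(ks_by_pair):
    """Greedily pick non-overlapping gene pairs in order of increasing Ks.

    ks_by_pair: dict mapping frozenset({gene_a, gene_b}) -> Ks, for one family
    """
    remaining = dict(ks_by_pair)
    picked = []
    while remaining:
        # Smallest Ks first; gene names break ties deterministically.
        pair = min(remaining, key=lambda p: (remaining[p], sorted(p)))
        picked.append((tuple(sorted(pair)), remaining[pair]))
        # Assumption: each gene is used at most once, so drop every pair
        # that shares a gene with the pair just picked.
        remaining = {p: ks for p, ks in remaining.items() if not (p & pair)}
    return picked


family = {
    frozenset({"a", "b"}): 0.1,
    frozenset({"a", "c"}): 0.3,
    frozenset({"a", "d"}): 0.9,
    frozenset({"b", "c"}): 0.4,
    frozenset({"b", "d"}): 1.2,
    frozenset({"c", "d"}): 0.6,
}
pairs = pick_independent_pairs(family)
print(pairs)
```

Under this reading a family of n genes contributes at most n/2 (rounded down) pairs, so no gene is counted in more than one comparison.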

To plot Ka versus Ks, pairs with Ks > 5.0 or Ka > 5.0 were eliminated because such high sequence divergence is often associated with problems such as difficulty in alignment and different codon usage biases or nucleotide compositions in the different sequences. Ks is known to be strongly distorted by codon usage bias [283], and the codon adaptation index (CAI) [282] was used as a measure of codon bias. I therefore calculated the average CAI for each gene pair and excluded pairs with average CAI > 0.5 from the analysis.
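The two exclusion rules can be written as a small filter. The sketch below is illustrative only: the record type, field names and example values are hypothetical, and the CAI values are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class GenePair:
    name: str
    ka: float
    ks: float
    cai_a: float  # CAI of the first copy (precomputed)
    cai_b: float  # CAI of the second copy (precomputed)


def passes_filters(pair, max_rate=5.0, max_mean_cai=0.5):
    """Keep a pair unless Ks > 5.0, Ka > 5.0 or its average CAI > 0.5."""
    mean_cai = (pair.cai_a + pair.cai_b) / 2.0
    return (pair.ka <= max_rate
            and pair.ks <= max_rate
            and mean_cai <= max_mean_cai)


# Invented example values, for illustration only.
pairs = [
    GenePair("pair1", ka=0.05, ks=0.40, cai_a=0.20, cai_b=0.25),  # kept
    GenePair("pair2", ka=0.90, ks=6.10, cai_a=0.15, cai_b=0.18),  # Ks too high
    GenePair("pair3", ka=0.02, ks=0.30, cai_a=0.70, cai_b=0.65),  # codon bias
]
kept = [p.name for p in pairs if passes_filters(p)]
print(kept)
```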

8.3.3 Relative rate test

The relative evolutionary rate test aims to compare the substitution rates of two sequences or two groups of sequences. Here it was applied to compare the evolutionary rates of the two copies of a duplicate gene pair. In the test I only used recently duplicated genes (i.e., duplicate genes with Ks < 0.5). These ‘young’ duplicates have fewer multiple substitutions, and their substitution rates can therefore be estimated more accurately than those of older ones. In addition, very young duplicates (Ks < 0.05) were excluded because they have too few substitutions for a statistical test to reach significance [199]. In order to apply the relative rate test, I obtained outgroup sequences for these young gene pairs. Each relative rate test was based on one gene pair and its outgroup, forming a triplet. Selection of outgroups was done using the method described in Conant and Wagner [59]. When more than one outgroup sequence was available, either from the same genome or from other genomes, the triplet of genes closest to each other in synonymous divergence, Ks, was chosen.

I used two likelihood ratio (LR) tests to test for asymmetric divergence at both the amino-acid and the codon level. The codon substitution rate was estimated using the codon substitution model described by Goldman and Yang [111]. To do the LR test, two models were applied to the data: model 0 constrains the amino-acid or codon substitution rates to be equal in the two sequences, and model 1 treats the rates as free parameters (hence they could be unequal in the two sequences). The maximum likelihood values ML0 and ML1 obtained under model 0 and model 1, respectively, were collected and the likelihood ratio statistic was calculated as LR = 2(ln(ML1) − ln(ML0)). LR was then compared against the χ² distribution with one degree of freedom, as detailed by Yang [358].

Table 8.1: Distribution of multigene families in fungi. Abbreviations: SC - S. cerevisiae; SP - S. pombe; CA - C. albicans; AN - A. nidulans; PM - P. marneffei; NC - N. crassa.

Family size                                       SC     SP     CA     AN     PM     NC
1                                               4500   4104   5276   7887   8725   9274
2                                                390    229    188    320    291    198
3                                                 54     34     41     84     64     43
4                                                 23     18     29     38     26     22
5                                                 11      4      8     17     10      5
6-10                                              17     18     24     29     29     15
11-20                                              7      4      2      9      3      3
>20                                                2      0      2      5      5      1
Number of multigene families (size >= 2)         504    307    294    502    428    287
Total genes used in the analysis                5889   4939   6165   9541  10060  10082
Number of genes in families                     1389    835   1189   1654   1335    808
Number of young duplicate gene pairs (Ks < 0.5)  165     51     50     43     52     10
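As a sketch of the likelihood ratio test described in this section, the statistic and its p-value can be computed from the two log-likelihoods as follows. The function name and the example log-likelihood values are hypothetical; the p-value uses the identity that, for one degree of freedom, the upper tail of the χ² distribution at x equals erfc(sqrt(x/2)).

```python
import math


def likelihood_ratio_test(lnl_equal, lnl_free):
    """LR test of equal rates (model 0) against free rates (model 1).

    lnl_equal: log-likelihood under the equal-rates model
    lnl_free:  log-likelihood under the free-rates model
    Returns (LR, p-value) with one degree of freedom.
    """
    lr = 2.0 * (lnl_free - lnl_equal)
    lr = max(lr, 0.0)  # guard against tiny negative values from optimisation
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Invented log-likelihoods, for illustration only.
lr, p = likelihood_ratio_test(lnl_equal=-1234.56, lnl_free=-1231.40)
print(f"LR = {lr:.2f}, p = {p:.4f}")
```

A pair would be called asymmetric at the 5% level when LR exceeds about 3.84, the 95th percentile of the χ² distribution with one degree of freedom.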

8.4 Results

8.4.1 Extent of gene duplication in ascomycetes

As shown in Table 8.1, 1,389 (23.6%) of 5,889 genes in S. cerevisiae belong to multigene families (families including at least two genes), compared with 16.9% in S. pombe, 19.3% in C. albicans, 17.3% in A. nidulans, 13.3% in P. marneffei, and only 8.0% in N. crassa. Comparing the numbers of young duplicates (Ks < 0.5), I found that 23.8% of the genes in multigene families belong to young duplicate pairs in S. cerevisiae, 12.2% in S. pombe, 8.4% in C. albicans, 5.2% in A. nidulans, 7.8% in P. marneffei, and only 2.5% in N. crassa (Table 8.1).

Apparently, S. cerevisiae contains more multigene families and more recently duplicated genes than any other fungus in this analysis. This is in concordance with an earlier study [345]. Whole-genome duplication approximately 10^8 years ago was proposed as an explanation for the presence of many duplicate genes [279]. S. pombe, C. albicans, A. nidulans and P. marneffei contain moderate numbers of duplicated genes, to roughly the same extent as each other. Very few duplicated genes are present in the N. crassa genome. This low number of duplicate genes is consistent with results reported previously [101, 231].

Table 8.2 lists the largest multigene families in each fungus. S. cerevisiae contains a large number of transposable elements, which play an important role in creating duplications in the yeast genome [366]. The top multigene families of S. cerevisiae include a group of proteins, the seripauperins, whose function(s) remain poorly understood [332]. Comparable numbers of predicted sugar transporters are found in N. crassa and S. cerevisiae. Transporter and reductase gene families are expanded in filamentous fungi. Interestingly, P. marneffei has a large gene family of 24 putative pepsin-like proteases, which is not so substantial in the other fungi studied here.
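The percentages quoted at the start of this section can be re-derived from the last rows of Table 8.1. The sketch below assumes that the young-duplicate percentage is twice the number of young pairs divided by the number of genes in families; this is an inferred reading, chosen because it reproduces the quoted values.

```python
# (total genes, genes in families, young duplicate pairs), from Table 8.1.
TABLE_8_1 = {
    "S. cerevisiae": (5889, 1389, 165),
    "S. pombe": (4939, 835, 51),
    "C. albicans": (6165, 1189, 50),
    "A. nidulans": (9541, 1654, 43),
    "P. marneffei": (10060, 1335, 52),
    "N. crassa": (10082, 808, 10),
}


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def summarise(table):
    summary = {}
    for species, (total, in_families, young_pairs) in table.items():
        summary[species] = (
            percent(in_families, total),            # genes in families
            percent(2 * young_pairs, in_families),  # assumed: 2 genes per pair
        )
    return summary


summary = summarise(TABLE_8_1)
for species, (pct_in_families, pct_young) in summary.items():
    print(f"{species:14s} {pct_in_families:5.1f} {pct_young:5.1f}")
```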

8.4.2 Age distribution of duplicate genes

In general, we assume that Ks increases approximately linearly with time, because synonymous substitutions do not alter the amino-acid sequence and are therefore under weaker constraint from natural selection [212].

Table 8.2: Large multigene families in fungi.

Fungi           Size of family   Function/Product
S. cerevisiae   20               Hexose transporter
                20               Seripauperins
                17               Amino acid permease
                15               GTP-binding protein
                13               Helicase
S. pombe        20               Multidrug resistance protein
                17               GTP-binding protein
                12               Amino acid permease
                11               Retrotransposable element
                10               Protein kinase
C. albicans     23               Unknown proteins
                21               Amino acid permease
                13               GTP-binding protein
                11               Ferric reductase transmembrane component
                9                Unknown proteins
A. nidulans     61               Hexose transporter
                42               Putative transporter
                36               Oxidoreductase
                28               Multidrug resistance protein
                21               Aldehyde dehydrogenase
P. marneffei    34               MFS multidrug transporter
                31               Short chain dehydrogenase/reductase family
                27               Hexose transporter protein
                24               Pepsin-type protease
                23               Major facilitator superfamily
N. crassa       21               Oxidoreductase
                17               Phosphoethanolamine N-methyltransferase
                16               Hexose transporter
                11               Aldehyde dehydrogenase
                10               Endoglucanase

[Figure 8.1: histograms of Ks (x-axis, 0 to 5.0), one panel per species: A. nidulans (N = 142, mean = 2.47, SD = 1.82); S. cerevisiae (N = 313, mean = 1.39, SD = 1.65); P. marneffei (N = 174, mean = 2.16, SD = 1.72); S. pombe (N = 123, mean = 1.16, SD = 1.30); N. crassa (N = 48, mean = 2.47, SD = 1.49); C. albicans (N = 198, mean = 2.09, SD = 1.63).]

Figure 8.1: Frequency distribution of Ks. Frequency distribution of duplicate gene pairs as a function of the number of synonymous substitutions per synonymous site (Ks). Arrow indicates the second peak in P. marneffei.

[Figure 8.2: six log-log panels (S. cerevisiae, S. pombe, C. albicans, A. nidulans, P. marneffei, N. crassa); x-axis Ks and y-axis Ka, both from 0.01 to 1.]

Figure 8.2: Log-log plots of Ka vs. Ks for duplicate gene pairs. Log-log plots of the number of nonsynonymous substitutions per nonsynonymous site (Ka) vs. the number of synonymous substitutions per synonymous site (Ks) for duplicate gene pairs. Each point represents a single pair of duplicate genes. Points below the diagonal (Ka < Ks) imply that the genes have been subjected to purifying selection against amino-acid changes. Open points denote orthologous gene pairs.

If this assumption largely holds, the distribution of Ks can be used as an indicator of the distribution of duplication events along a time scale. I plotted the frequency distribution of pairs of duplicate genes as a function of Ks in Fig. 8.1. An obvious pattern found in all species is that most gene duplicates are young and the density of duplicates drops off with increasing Ks. The distribution for C. albicans shows a flat pattern, in which the gene pairs are evenly distributed over Ks, with a peak around Ks = 0.2. This may indicate that small-scale gene duplications happened persistently during the course of evolution. For P. marneffei, there are two peaks in the plot: the first is a high peak centred around Ks = 0.1, indicating that there are a large number of gene pairs of a similar recent age; the second is a low, broad peak spanning Ks = 2.0 to 4.5. I speculate that the second peak is a trace of relatively large-scale ancient gene duplication events. This proposed ancient duplication would have created many duplicate gene pairs. After such a long evolutionary time, most of these gene pairs would be expected to have mutated and diverged. Only some pairs retain some degree of similarity, which gives rise to the second peak. This dual-peak pattern is not readily observed in the other fungal species, except for N. crassa, whose second peak might result from gene duplication prior to the development of repeat-induced point mutation (see below).

8.4.3 Selective constraint between paralogs

As mentioned in the Introduction, Ka/Ks is used as a measure of selective constraint between two copies of duplicate genes. The larger the Ka/Ks value, the weaker the selective constraint between the two copies. Table 8.3 gives the estimated Ka/Ks values in different fungi.

Comparison of Ka/Ks values for different fungi revealed that the strength of selection is generally similar among yeasts (i.e., S. cerevisiae, S. pombe and C. albicans), and among moulds (i.e., A. nidulans, P. marneffei and N. crassa). There is a substantial difference in Ka/Ks between yeasts and moulds. The strongest purifying selection is found among the S. cerevisiae paralogs and the weakest in A. nidulans. Mould paralogs show significantly weaker functional constraints, indicated by larger values of Ka/Ks, than those in yeasts (Student’s t-tests for pairwise comparisons).

Table 8.3: Ratio of nonsynonymous to synonymous substitution rates (Ka/Ks) for recently diverged paralogs (0.05 < Ks < 0.5).

Fungi           No. of gene pairs   Ka/Ks (mean ± SD)
S. cerevisiae   89                  0.134 ± 0.166
S. pombe        22                  0.148 ± 0.234
C. albicans     34                  0.245 ± 0.224
A. nidulans     12                  0.491 ± 0.214
P. marneffei    29                  0.456 ± 0.231
N. crassa       9                   0.359 ± 0.276
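To illustrate the kind of pairwise comparison reported above, a pooled-variance Student's t statistic can be computed directly from the summary values in Table 8.3. This is a sketch only: the tests in the study were presumably run on the individual Ka/Ks values, the function name is hypothetical, and no p-value is computed here.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic (equal variances assumed) from summary statistics.

    Returns (t, degrees of freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df


# S. cerevisiae versus A. nidulans, values taken from Table 8.3.
t, df = pooled_t(0.134, 0.166, 89, 0.491, 0.214, 12)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

A negative t here reflects the lower mean Ka/Ks in S. cerevisiae, i.e., stronger purifying selection on its paralogs than on those of A. nidulans.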

8.4.4 Ka/Ks between paralogs and orthologs

Ka/Ks is also used to estimate the selective constraints acting on orthologs. I therefore also characterised rates of synonymous and nonsynonymous substitution of orthologs for each genome. By plotting Ka as a function of Ks and superimposing data from paralogs onto those from orthologs, we can get an overall view of how natural selection acts on the two groups of comparisons (Fig. 8.2).

In all species, overall Ka values are much smaller than Ks values, which implies that the vast majority of duplicate genes are subject to purifying selection. In C. albicans, A. nidulans and P. marneffei, gene pairs with smaller Ks tend to gather around the diagonal line (Ka/Ks = 1) and gene pairs with larger Ks tend to lie further from the line. It seems that, in these three species, recent duplicates tolerate more amino-acid replacement substitutions than older duplicates. In mould species, purifying selection acting on paralogs is weaker than that acting on orthologs with the same level of sequence divergence. As shown in Fig. 8.2, at the same level of Ks, most of the open points are below the clusters of closed points; that is to say, Ka in paralogs is generally larger than that in orthologs in A. nidulans, P. marneffei and N. crassa. On the other hand, there is no difference in overall Ka/Ks between paralogs and orthologs in the yeasts S. cerevisiae, S. pombe and C. albicans.

8.4.5 Relative evolutionary rate between paralogs

The two copies of a paralog pair may evolve at different rates. If most paralog pairs evolve in such an asymmetric way, it may indicate that Ohno’s neofunctionalisation theory is plausible. Therefore, as mentioned, many studies on the relative evolutionary rates between paralogs have been conducted. However, these studies have led to different conclusions. Two critical aspects responsible for the success of such analyses are the sensitivity of the methods and the appropriateness of the outgroup used.

Here I used a method that incorporates a codon-based model. Generally speaking, methods relying on codon-based models (for example, [111, 226]) are more sensitive than nucleotide-based and amino-acid-based tests because, in the latter two, one cannot distinguish between silent substitutions and amino-acid replacement substitutions [59]. A codon-based model, however, takes into account the ratio between the rates of nonsynonymous and synonymous substitutions, which gives a more direct measure of the strength of selection or functional constraints on the gene.

The major issue in choosing an outgroup is that the potential outgroup cannot be too distant evolutionarily from the paralogs being studied; otherwise, saturation at synonymous sites for many genes will interfere with the power of the statistical test. To avoid this influence, Kondrashov et al. [174] used a within-genome approach, since their study included four highly diverged eukaryotic organisms, S. cerevisiae, A. thaliana, C. elegans and D. melanogaster. By using the within-genome approach, they identified outgroups of S. cerevisiae paralogs within the S. cerevisiae genome itself. In addition, they required that the two paralogs be closer in amino-acid sequence to each other than to the outgroup. This extra condition, which has probably led to an underestimate of asymmetric divergence, was criticised by Conant and Wagner [59], who adopted a similar within-genome approach in multiple eukaryotes.

In the selection of gene duplicates and their outgroups, I adopted a method similar to that of Conant and Wagner [59]. The only modification was that all fungal genomes were searched for outgroups, instead of using the within-genome approach.

I identified a total of 163 triplets (each composed of two paralogs and one corresponding outgroup), which included 101 triplets based on paralogs from S. cerevisiae, 6 from S. pombe, 50 from C. albicans, 2 from A. nidulans, 3 from P. marneffei, and 1 from N. crassa.

Because the majority of triplets are from S. cerevisiae and C. albicans, the following analysis has no power to distinguish differences among species. Instead, it can only be considered as a comprehensive analysis dealing with the ascomycetes as a whole.

I adopted the model of Goldman and Yang [111] (see Methods) in the comparison of the relative rates of amino-acid substitution between the two paralogs of each pair. The result shows that, of a total of 163 analysed gene pairs from the ascomycetes, 29 (17.8%) evolve at a significantly (p < 0.05) different rate (Table 8.4). This figure includes 12 (11.9%) of 101 triplets in S. cerevisiae and 17 (32.7%) of 52 in C. albicans. In the majority of cases, both paralogs evolved at approximately the same rate, under a similar level of purifying selection.

In order to examine whether the Ka/Ks ratio is the factor causing asymmetry in evolutionary rates between paralogs, I estimated the asymmetry of Ka/Ks ratios between two paralogs. A 2 × 2 χ² test failed to reject the null hypothesis that the number of pairs with different Ka/Ks ratios is independent of the number of pairs with different amino-acid substitution rates (Table 8.4). That is to say, there is no correlation between different Ka/Ks ratios and different amino-acid substitution rates.

Table 8.4: Amino-acid substitution rates versus Ka/Ks ratios in two copies of duplicate genes. Columns show gene pairs with different or equal amino-acid substitution rates between two paralogs; rows show gene pairs with different or equal Ka/Ks ratios between two paralogs.

                        Different Ka   Equal Ka   Total
Different Ka/Ks ratio   3              10         13
Equal Ka/Ks ratio       26             124        150
Total                   29             134        163
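The 2 × 2 test of independence can be re-run from the counts in Table 8.4. The sketch below uses the Pearson χ² statistic without continuity correction, which is an assumption, since the text does not say whether a correction was applied; the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    No continuity correction is applied. Returns (chi2, p-value) with 1 df.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts from Table 8.4.
chi2, p = chi_square_2x2(3, 10, 26, 124)
print(f"chi2 = {chi2:.3f}, p = {p:.2f}")
```

The resulting p-value is well above 0.05, in line with the failure to reject independence reported above. Since one expected count is small (13 × 29 / 163 is about 2.3), an exact test would be a reasonable alternative.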

8.5 Discussion

This study took advantage of the availability of genome sequences of P. marneffei and five other ascomycetes, S. cerevisiae, S. pombe, C. albicans, A. nidulans and N. crassa. It also relied on the recent development of methods to analyse selective constraints on duplicate genes in each genome. Given the considerable phenotypic variation between the two distinct groups of ascomycetes, yeasts and moulds, I speculated that gene duplication may play an evolutionary role at different levels and that the selection patterns of duplicate genes may differ. To my knowledge, no similar analysis has been conducted in fungi, despite several genome-level studies on gene duplications using S. cerevisiae as one of their model eukaryotic organisms [95].

8.5.1 Gene duplication in ascomycetes is highly diverse

Most genomes show a certain degree of redundancy caused by single-gene duplication, chromosomal segment duplication or complete genome duplication (through polyploidisation). So do the ascomycetes I studied.

S. cerevisiae S. cerevisiae has the largest amount of gene redundancy among all the ascomycetes I analysed. Previous studies have revealed that its genome contains approximately 55 large duplicated chromosomal regions [345]. It has been widely accepted that the duplicated regions found in the modern Saccharomyces species are probably the result of a whole-genome duplication (tetraploidisation) approximately 10^8 years ago [95, 250, 279, 280, 345]. This proposed genome duplication might coincide with the origin of the ability to grow under anaerobic conditions, one of the most striking physiological differences between S. cerevisiae and other yeasts.

S. pombe S. pombe and S. cerevisiae have been separated for as long as 420 million years [289]. Of the two yeasts, S. pombe has fewer gene duplications than S. cerevisiae, which may account in part for its smaller genome size. Transposable elements exist in the S. pombe genome, but their proportion is low compared to S. cerevisiae. Using phylogenetic analysis, Hughes and Friedman [136] suggested that parallel gene duplication appears to have played a role in the independent origin of similar adaptations in the two unicellular fungi, S. pombe and S. cerevisiae. That is to say, gene duplications have occurred independently in the same gene families in S. pombe and S. cerevisiae; S. pombe has adapted to a similar unicellular lifestyle without polyploidisation.

C. albicans The age distribution of relatively young duplicate genes (Ks < 5) in C. albicans (Fig. 8.1) suggests that duplication events are likely to have occurred continuously during the course of evolution in this yeast. In neither S. pombe nor C. albicans has any evidence of polyploidisation, such as duplicated genomic blocks, so far been found. Hence, genome duplication of the kind that happened in S. cerevisiae, which may represent an extreme adaptive strategy for providing genetic raw material for the functional divergence of novel genes, has not occurred in C. albicans.

A. nidulans A. nidulans contains a relatively large number of recently duplicated gene pairs: 43 in total with Ks < 0.5. The age distribution of duplicate genes (Ks < 5) in A. nidulans displays a high peak at Ks = 0.1 to 0.2 and shows a pattern similar to that in S. cerevisiae (Fig. 8.1). However, S. cerevisiae has undergone genome duplication, and the extensive duplicated blocks in its genome are traces of the proposed ancient tetraploidy that remain detectable after widespread deletion of superfluous duplicate genes and sequence divergence. Most of the gene pairs in these duplicated regions are believed to have been produced simultaneously or within a narrow time frame [95]. Based on the similar patterns of age distribution of gene pairs in A. nidulans and S. cerevisiae, I propose that duplicate genes in A. nidulans probably originated through one or more episodic, large-scale gene duplications in a relatively short period of time. What is uncertain is whether such a peak of gene duplication over the course of evolution implies a polyploidisation event in A. nidulans. As noted by Friedman and Hughes [95], a peak of gene duplication need not imply a polyploidisation event. Therefore, it would be interesting to know how many duplicated blocks are present within and between A. nidulans chromosomes when the genome sequencing of A. nidulans is completely finished.

P. marneffei Slightly fewer genes in P. marneffei belong to multigene families than in A. nidulans. However, 52 pairs are young duplicate genes, compared with 43 in A. nidulans. There is no difference in the overall extent of duplicate genes between these two closely related species. The pattern of the Ks histogram is broadly similar to those of A. nidulans and S. cerevisiae. One difference is the dual-peak pattern, which seemingly implies that, besides the modern duplications, there was an ancient large-scale duplication. The modern peak is at a similar location, Ks = 0.1, to that of A. nidulans and other fungi, but is on a smaller scale (fewer than 25% of genes belong to this peak) than that of A. nidulans. In contrast, the second peak at Ks = 2.0 to 4.0 is more apparent than in other fungi except N. crassa. More evidence is needed, though, before any solid conclusion can be reached.

N. crassa N. crassa exhibits much greater morphological and developmental complexity. Its genome is approximately three times the size of the S. cerevisiae genome, and accordingly it has a protein count much larger than those of yeasts. However, the paucity of duplicate genes in N. crassa is obvious: (1) the number of multigene families in N. crassa is much smaller than that in yeast, and (2) the number of gene pairs with a small Ks (0.05 < Ks < 0.5) in N. crassa is much smaller than those in unicellular yeasts (Table 8.1). An extraordinary feature of N. crassa, repeat-induced point (RIP) mutation [219], has been suggested to play a major role in preventing gene innovation through gene duplication and to be responsible for this paucity. RIP, acting as a defence against mobile DNA [219], can detect and mutate both copies of a sequence duplication. In fact, RIP is so efficient that all gene duplications remaining in the N. crassa genome have been proposed to have arisen and been fixed before the emergence of the RIP mechanism. Examples of the remaining multigene families that may have ‘survived’ RIP include hexose transporters and cellulases (Table 8.2). N. crassa may have other mechanisms of gene innovation, since gene duplication has rarely occurred in its genome.

Ascomycetes display a wide variation in the number of gene duplication events. This may have provided the foundation for the specialisation of a number of genes and their corresponding proteins, and formed the basis for diversification. Amplification of their genetic material might increase their fitness in adapting to the environment. Examples include genes for the yeast hexose transporters, which increase fitness in low-glucose conditions; genes for N. crassa cellulases, which allow growth on decaying plant material; and genes for cytochrome P450 and efflux systems involved in detoxification.

8.5.2 Different selective constraints in yeasts and filamentous ascomycetes

There are different models, such as the classical model and the duplication-degeneration-complementation (DDC) model, to explain the creation of novel genes by gene duplication. The classical model emphasises that one copy is neutral and free to evolve while the other remains under selective pressure. The DDC model [90] explains sub-functional divergence after a gene has been duplicated. According to the DDC model, the two gene copies acquire complementary loss-of-function mutations in independent sub-functions. Thus both genes are required to produce the full complement of functions of the single ancestral gene. Both the classical model and the DDC model predict a period immediately following duplication when the genome should be able to tolerate a high degree of nonsynonymous substitutions in one member of a duplicate pair because the other member is still functioning at full strength.

Comparing Ka with Ks in each genome, I found a common pattern in all fungi which is in partial agreement with these theoretical expectations. First, in both filamentous fungi and yeasts, purifying selection against amino acid changes was dominant in paralogous genes. This confirms the earlier observation that paralogs evolve under purifying selection [211], which challenges the classical model but supports the DDC model. Second, recent duplicates with smaller Ks appear to tolerate more replacement amino-acid substitutions than older duplicates, which is compatible with both models.

I also found two patterns exclusive to filamentous fungi. The first finding is that there are significantly (p < 0.01) higher values of the Ka/Ks ratio in paralogs in moulds than in those in yeasts with a similar level of divergence (Table 8.3). Filamentous fungi show greater morphological and developmental complexity than do yeasts, and their genomes are normally larger. As gene duplication is a source of novel protein functions, the larger genome size may partially result from frequent gene duplications, which provide a basis for divergence and result in an increase in novel genes through neofunctionalisation, or an increase in gene number through subfunctionalisation. Therefore, the higher value of the Ka/Ks ratio in paralogs in moulds may imply that, at a similar stage after duplication, gene pairs in filamentous fungi have faster evolutionary rates than those in yeasts. Either positive selection or relaxed functional constraint can cause a higher value of the Ka/Ks ratio. Few gene pairs in moulds are actually found to be under positive selection when Ka/Ks > 1 is used as the indicator of positive selection. Thus, a slightly elevated Ka relative to Ks accounts for the larger value of Ka/Ks given by gene pairs in moulds.
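The use of Ka/Ks > 1 as an indicator of positive selection can be illustrated with a minimal Python sketch. The function name and the example values are hypothetical, and skipping pairs with Ks = 0 (where the ratio is undefined) is an assumption made here; this is not the analysis code used in this chapter.

```python
def fraction_under_positive_selection(pairs):
    """Return the fraction of (Ka, Ks) pairs with Ka/Ks > 1.

    Pairs with Ks == 0 are skipped because the ratio is undefined
    (an assumption of this sketch).
    """
    ratios = [ka / ks for ka, ks in pairs if ks > 0]
    if not ratios:
        return 0.0
    return sum(1 for r in ratios if r > 1.0) / len(ratios)


# Hypothetical (Ka, Ks) values for four paralogous pairs.
example_pairs = [(0.10, 0.50), (0.30, 0.90), (0.60, 0.40), (0.05, 0.00)]
positive_fraction = fraction_under_positive_selection(example_pairs)
```

In this toy input only one of the three usable pairs exceeds the threshold, which mirrors the situation described above: a ratio that is elevated but still below 1 points to relaxed constraint rather than positive selection.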

Another interesting finding is that paralogs in A. nidulans, P. marneffei and N. crassa appear to be under weaker functional constraint than orthologs of the same age. In other words, orthologs in moulds experience stronger functional constraints than paralogs. Natural selection seems to allow paralogs in these three filamentous fungi to mutate with less constraint, which may lead to more advantageous mutations. This phenomenon was first observed in eukaryotes [174] but it has not been reported in fungi. Note that this trend is not observed in the unicellular yeasts, S. cerevisiae, S. pombe, and C. albicans. Therefore, it is suggested that elevated functional constraint in orthologs or weaker functional constraint in paralogs is a more common feature in the evolutionary pattern of multicellular eukaryotes.

8.5.3 Majority of paralogous genes evolve symmetrically

Estimation of asymmetric evolutionary rates was conducted mainly on paralogs from S. cerevisiae and C. albicans, so the result should not be applied to other species. Of a total of 163 analysed gene pairs, 29 (17.8%) evolve at significantly (p < 0.05) different rates (Table 8.4). Therefore, in the majority of cases, at least in S. cerevisiae and C. albicans, both paralogs evolved at approximately the same rate, under similar levels of purifying selection.

Several similar studies have been done in S. cerevisiae and in several other eukaryotes. Some concluded that both copies of a duplicate gene typically evolved at the same rate [137, 174, 265], whereas others suggested that asymmetric divergence between two paralogs is not uncommon. Because different organisms were used in those studies and different methods with varying sensitivities were applied, it is hard to compare the data in this study with others directly. For instance, Kondrashov et al. [174] selected 15 S. cerevisiae triplet genes and, using a distance-based method, found no paralogs showing different rates. In another study, Conant and Wagner [59] identified six of 22 (27%) gene triplets in S. cerevisiae, and three (21%) of 14 in S. pombe, that showed asymmetry in Ka, using a codon-based model following Muse and Gaut [226].

An asymmetric evolutionary rate is not always associated with an asymmetric evolutionary constraint, as indicated by Ka/Ks. Moreover, no simple dependence between evolutionary rate and gene function is observed (data not shown). This finding is inconsistent with Zhang’s finding in young paralogs of human genes [371], that genes with different Ka/Ks ratios tend to evolve at different rates, suggesting that different functional constraints might be largely responsible for the unequal evolutionary rates. The incongruence may again be due to the difference in species used in the studies.

In conclusion, this chapter reports the variation in the extent of gene duplications in ascomycetes. The age distribution of gene duplications tentatively suggests that the P. marneffei genome has experienced a recent as well as an ancient large-scale duplication. Analysis of the divergence of evolutionary rates in S. cerevisiae and C. albicans revealed that less than 20% of gene pairs in these two yeasts show asymmetric divergence patterns in amino-acid substitutions. I speculate that the different extent and evolutionary pattern of duplicate genes in ascomycetes might be associated with their genotypical and phenotypical differences.

Chapter 9

ACCELERATED EVOLUTIONARY RATE MAY BE RESPONSIBLE FOR THE EMERGENCE OF LINEAGE-SPECIFIC GENES

Once the genome of Penicillium marneffei became available, genes could be predicted and annotated. Hundreds of these predicted genes lack homology to any known gene. They are species-specific, or “orphan”, genes. Where do these genes come from? This is still a mystery. One suggestion has been that most orphan genes evolve so rapidly that similarity to other genes cannot be traced beyond a certain evolutionary distance. This can be tested by examining the divergence rates of genes with different degrees of lineage specificity. Here the lineage specificity (LS) of a gene describes the phylogenetic distribution of that gene’s orthologs in related species. Highly lineage-specific genes are distributed in fewer species in a phylogeny.

In this chapter, I used the complete genomes of seven ascomycetes and two animals to define several levels of LS: Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific. The rates of gene evolution in groups of higher LS are compared with those in groups of lower LS. Molecular evolutionary analyses indicate a significant increase in nonsynonymous nucleotide substitution rates in genes with higher LS. Multiple regression analyses suggest that LS is significantly correlated with the evolutionary rate of the gene. This correlation is stronger than those of a number of other factors that have been proposed as predictors of a gene’s evolutionary rate, including the expression level of genes, gene essentiality or dispensability and the number of protein-protein interactions. The significantly accelerated evolutionary rates of genes with higher LS may reflect the influence of selection and adaptive divergence during the emergence of orphan genes. These analyses suggest that accelerated rates of gene evolution may be responsible for the origin of apparently orphan genes.

This chapter is very closely based on a paper I have published with colleagues [in press]. The original draft of the manuscript was revised by Dr. David K. Smith of the Department of Biochemistry, HKU. A preliminary version of this work was presented at the SMBE conference on 17th June 2004.

9.1 Introduction

During annotation of genome sequences, a substantial fraction of the putative genes are found to lack sequence similarity to any of the genes in public databases. These genes or protein-coding regions have been referred to as “orphan” genes. Some may have crucial organism-specific functions; however, the origin and evolution of orphan genes remain poorly understood. A proposed explanation has been that some genes evolve so rapidly that their homologs cannot be discovered over larger evolutionary distances. Although this has been supported by recent findings in Drosophila that orphan genes evolve, on average, more than three times faster than non-orphan genes [73], the influence of other factors on the evolutionary rate of genes should be taken into account. These factors include the expression level of genes [127, 241], a gene’s dispensability (the organism’s fitness after deletion of the gene) [178], gene essentiality [343], gene duplication [150, 357], and the number of protein-protein interactions involving the gene’s product [93, 335]. Due to the inherently stochastic property of evolutionary rates, the influence of many of these factors has proved difficult to confirm and their relative importance also needs further elaboration.

In order to systematically examine the relationship between a gene’s evolutionary rate and the origin of orphan genes, as well as to assess the influence of other factors, we have devised a study based on the following rationale. Orthologs of a gene usually have a particular phyletic distribution in several related species, thus giving each gene a certain lineage specificity (LS). Orphan genes represent the extreme of LS because they are only present in one node of a phylogeny. In contrast, highly conserved genes have a low degree of LS and are widely distributed, while a range of different degrees of LS can be defined for other gene groups. If an elevated evolutionary rate is the major cause of the origin of orphan genes, one should find a correlation between evolutionary rate and LS. More slowly evolving genes should tend to be less lineage specific.

Studying the relationship between the evolutionary rate of genes and LS may reveal the dynamic processes that lead to the origin of species-specific, or orphan, genes. It can also be tested whether the evolutionary rate leading to the emergence of orphan genes is relatively constant or highly variable. If genes become lineage-specific gradually, one might expect a simple relationship (e.g., a linear relationship, perhaps after data transformation) between divergence time and genetic distance; otherwise, a more complex relationship would be expected.

To investigate these matters, the complete sets of predicted protein-coding genes from Aspergillus fumigatus (http://www.sanger.ac.uk/Projects/A_fumigatus/) and Saccharomyces cerevisiae [110] were extracted. Orthologs of these genes from five other ascomycotan fungi, Aspergillus nidulans (http://www.broad.mit.edu/annotation/fungi/aspergillus/), Schizosaccharomyces pombe [354], Candida albicans [65], Neurospora crassa [101], and Saccharomyces mikatae, and two metazoans, Caenorhabditis elegans [79] and Drosophila melanogaster [2], were also obtained.

The fungi studied here represent three major Ascomycetes classes: Euascomycetes, Hemiascomycetes and Archaeascomycetes. The Euascomycetes, which contain well over 90% of Ascomycota, include Aspergillus and Neurospora. The Hemiascomycetes include the Saccharomyces yeasts and Candida. The fission yeast, S. pombe, belongs to the class Archaeascomycetes, whose members are distantly related to each other and are possibly remnants of an early radiation of Ascomycota [289]. These fungi also represent two major fungal morphological subdivisions, yeasts and moulds. Yeasts, such as S. cerevisiae, S. mikatae and C. albicans, as well as S. pombe, have life cycles characterised by unicellular (occasionally dimorphic) growth. In contrast, the filamentous Ascomycota, A. nidulans, A. fumigatus and N. crassa, predominantly grow as hyphal filaments. Despite such morphological divergence, all of them share a relatively recent common ancestor with respect to the rest of the eukaryotes. The phylogeny of these Ascomycota is clear and generally accepted, except for the position of the ancient Schizosaccharomyces lineage, S. pombe [289].

Genes from S. cerevisiae and A. fumigatus were classified, according to their phylogenetic profiles, into several LS groups as follows: Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific. Average nonsynonymous substitution rates, Ka, of genes among LS groups were compared, and correlations between LS and several other factors, for example, gene expression level, gene dispensability and gene redundancy, were explored. The relative importance of LS and other factors, in terms of the prediction of a protein’s evolutionary rate, was evaluated, and whether the divergence rate is relatively constant over genes with similar degrees of LS was tested.

9.2 Literature Review

From a gene-centric standpoint, our understanding of evolutionary novelty is limited to the consequences of new gene creation. This phenomenon has recently received attention at the genome level, yet the mechanism remains a mystery. Some insights have been obtained, especially by studying newly created (i.e., young) genes [210, 257, 204]. A number of mechanisms that may be responsible for new gene origination have been proposed. These include gene duplication, exon shuffling, retroposition, lateral gene transfer, and transposable element assimilation (for review, see [204]). Gene duplication has been reviewed in Chapter 8. Here I focus only on the origination of exons, the basic units of genes.

Once exons exist, exon shuffling (the recombination or exclusion of exons) is widely recognised as important in the generation of new genes [109, 244, 155]. The creation of new exons has been proposed to occur through three possible processes: (1) exaptation of transposable elements [27, 215, 230, 293], (2) exon duplication [172, 194], and (3) exonisation of intronic sequences [173].

Exaptation of transposable elements is a process in which a retroelement takes on a new function in a genome. It was first exemplified by the integration of an Alu element into the coding portion of the human decay-accelerating factor (DAF) gene [215]. In another example, an L1 retrotransposon insertion that provides a premature stop codon and polyadenylation sites is responsible for the generation of the secreted form of the human transmembrane protein attractin [305]. Recently, as many as about 4% of human genes were found to contain transposable elements in their coding regions [230]. For exon duplication, searches of the human, fly and worm genomes have reported that about 10% of all genes contain tandemly duplicated exons. These are likely to be involved in mutually exclusive alternative splicing events, which might confer further evolutionary potential [194]. Exonisation of intronic sequences is the most easily conceived mechanism, but few examples of such a process have been reported [173]. Wang et al. [339] identified newly evolved exons by comparing ESTs against an outgroup to learn how new exons originate and evolve, and how often new exons appear. They claim that the new exon origination rate is about $2.71 \times 10^{-3}$ per gene per million years and that a much higher proportion of new exons than of old exons have Ka/Ks ratios > 1.

It is noteworthy that the gene origination processes mentioned above do not necessarily create new genes with novel functions; instead they may yield new variants of genes [369]. Moreover, newly evolved genes often show an elevated evolutionary rate driven by positive selection [205, 235, 147, 338, 369].

9.3 Materials and Methods

9.3.1 Sequences and data sets

Table 9.1: Genomic sequence sources.

Species            Web source for the sequence data
A. nidulans        www-genome.wi.mit.edu/annotation/fungi/aspergillus/
A. fumigatus       www.sanger.ac.uk/Projects/A_fumigatus
N. crassa          www-genome.wi.mit.edu/annotation/fungi/neurospora/
S. cerevisiae      genome-www.stanford.edu/Saccharomyces
S. mikatae         ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/fungal_genomes/S_mikatae
C. albicans        genolist.pasteur.fr/CandidaDB
S. pombe           www.genedb.org/genedb/pombe/index.jsp
C. elegans         www.sanger.ac.uk/Projects/C_elegans/wormpep
D. melanogaster    www.fruitfly.org

For each Ascomycotan, the complete set of available amino acid sequences and coding DNA sequences was downloaded from the repositories given in Table 9.1. All known or suspected pseudogenes and genes in mitochondrial genomes were removed. The S. mikatae dataset is derived from the ORF predictions of Cliften et al. [56].

[Figure 9.1: phylogenetic tree of A. fumigatus, A. nidulans, N. crassa, S. cerevisiae, S. mikatae, C. albicans, S. pombe and animals, with divergence times of 1,458, 1,144, 1,085, 841, 670 and ~10 Mya, and the presence/absence profiles defining the Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific groups.]

Figure 9.1: LS classification based on phylogenetic profiles of genes. Divergence times were adopted from Hedges and Kumar [131]. The divergence times between S. cerevisiae and S. mikatae and between A. fumigatus and A. nidulans are based on the estimates by Cliften et al. [56] and [87], respectively. A solid square (■) means the gene is present in the corresponding species; an open square (□) means it is absent.

Yeast gene expression data came from Cho et al. [51], who characterised all mRNA transcript levels during the cell cycle of S. cerevisiae. mRNA levels were measured at 17 time points at 10 min intervals, covering nearly two full cell cycles. The mean of these 17 numbers was taken to produce one general time-averaged expression level for each protein.

Protein dispensability was assessed by the fitness effect of a single-gene deletion, as measured by the average growth rate of the knockout strain in several types of media. The results of assays of a nearly complete set of single-gene deletions in S. cerevisiae [297] were obtained, and the data were processed following the method of Gu et al. [119]. Briefly, the fitness value $f_i$ is defined as $r_i / r_{pool}$, where $r_i$ is the growth rate of the strain with gene $i$ deleted and $r_{pool}$ is the pooled average growth rate of different strains.

Essential genes were taken from the dataset of the Saccharomyces Genome Deletion Project, which contains 1,106 essential genes (http://www-sequence.stanford.edu/group/yeast_deletion_project/). Although gene dispensability and gene essentiality are highly associated, they were treated as two separate variables in order to compare the results for each variable with previous studies.

A list of protein-protein interactions among S. cerevisiae proteins was obtained from two integrated interaction databases, YEAST GRID [25] and the yeast subset of DIP [274], and from a number of major high-throughput studies published to date [106]. The final non-redundant set contains 252,011 interactions involving 5,698 proteins.

9.3.2 Identification of orthologs

Orthologs of the genes from S. cerevisiae and A. fumigatus in each other and in the other genomes studied here were identified by the automatic clustering method INPARANOID [261]. In this method, orthologs between the genomes of two species are derived from mutual best pairwise BLASTP hits. A further reciprocal test was applied by requiring the longest region of local sequence similarity between putative orthologs to cover ≥ 80% of each sequence and to have ≥ 30% sequence identity in this region. The 113 pairs that did not pass this test were excluded. A gene was considered to be absent from another genome if no sequence similarity could be detected between the gene and the genes in that genome. To define the level at which sequence similarity was not detectable, a TBLASTN expectation (E) value of $1 \times 10^{-2}$ with respect to a fixed effective search space (set to the size of the N. crassa genome) was used as a cut-off.

Orthologs of fast-evolving genes may not be detected in more distantly related genomes by the TBLASTN search used above. To address this, ancestral sequences were constructed (Collins et al. [58]), based on the detected orthologs, using the maximum likelihood method implemented in the PAML phylogenetic analysis package version 3.13d [359]. Ancestral sequences are expected to be less divergent from their possible orthologs in the more distant genomes, and their reconstructions were used to search, as above, for orthologs in the more distantly related genomes. If potential orthologs were identified, the gene was excluded from further analysis to avoid ambiguity in the assignment of genes to LS groups.
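The mutual-best-hit rule and the coverage and identity test described at the start of this subsection can be sketched in Python as follows. This is a simplified illustration, not INPARANOID itself (which also clusters in-paralogs): the `Hit` record, its field names and the toy values are hypothetical, and only the 80% and 30% thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str            # gene in the query genome
    subject: str          # gene in the subject genome
    score: float          # alignment score, e.g. a BLASTP bit score
    query_len: int        # full length of the query protein
    subject_len: int      # full length of the subject protein
    query_aligned: int    # query residues in the longest local alignment
    subject_aligned: int  # subject residues in the longest local alignment
    identity: float       # fraction of identical residues in that alignment


def best_hits(hits):
    """Map each query to its highest-scoring hit."""
    best = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


def mutual_best_pairs(a_vs_b, b_vs_a):
    """Return the A-to-B hits whose two genes are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return [
        h for h in best_ab.values()
        if h.subject in best_ba and best_ba[h.subject].subject == h.query
    ]


def passes_coverage_identity(hit, min_cov=0.80, min_identity=0.30):
    """Require >= 80% coverage of both sequences and >= 30% identity."""
    cov_q = hit.query_aligned / hit.query_len
    cov_s = hit.subject_aligned / hit.subject_len
    return cov_q >= min_cov and cov_s >= min_cov and hit.identity >= min_identity


# Toy, hypothetical search results between genomes A and B.
a_vs_b = [
    Hit("a1", "b1", 200.0, 100, 100, 90, 88, 0.55),
    Hit("a1", "b2", 120.0, 100, 150, 60, 60, 0.40),
    Hit("a2", "b2", 150.0, 200, 150, 100, 100, 0.35),
]
b_vs_a = [
    Hit("b1", "a1", 198.0, 100, 100, 88, 90, 0.55),
    Hit("b2", "a2", 149.0, 150, 200, 100, 100, 0.35),
]
orthologs = [
    (h.query, h.subject)
    for h in mutual_best_pairs(a_vs_b, b_vs_a)
    if passes_coverage_identity(h)
]
```

In the toy data, a2 and b2 are mutual best hits but the alignment covers only half of a2, so the pair is rejected by the coverage test.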

9.3.3 Classification of genes into LS groups

Phylogenetic profiles (a gene table giving 1 if a gene is present in a genome and 0 if it is absent) were constructed for the genes from S. cerevisiae and A. fumigatus, based on the detected orthologs in the genomes studied. The genes were then classified into the different LS groups: Eukaryotes-core (present in all genomes studied), Ascomycota-core (present in all fungal genomes), Hemiascomycetes-specific, Euascomycetes-specific, Saccharomyces-specific and Aspergillus-specific (Fig. 9.1). The phylogenetic tree relating the species was derived from [131].
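A minimal Python sketch of classifying a gene from its phylogenetic profile is given below. The two core groups follow the definitions above; the exact presence patterns assumed for the four lineage-specific groups (present in every member of the clade and absent elsewhere), the exact-match rule and all names in the code are assumptions made for illustration.

```python
FUNGI = (
    "A. fumigatus", "A. nidulans", "N. crassa",
    "S. cerevisiae", "S. mikatae", "C. albicans", "S. pombe",
)
ANIMALS = ("C. elegans", "D. melanogaster")

# Presence pattern assumed for each LS group (exact-match rule).
LS_GROUPS = {
    "Eukaryotes-core": frozenset(FUNGI + ANIMALS),
    "Ascomycota-core": frozenset(FUNGI),
    "Euascomycetes-specific": frozenset(("A. fumigatus", "A. nidulans", "N. crassa")),
    "Hemiascomycetes-specific": frozenset(("S. cerevisiae", "S. mikatae", "C. albicans")),
    "Aspergillus-specific": frozenset(("A. fumigatus", "A. nidulans")),
    "Saccharomyces-specific": frozenset(("S. cerevisiae", "S. mikatae")),
}


def make_profile(present_species):
    """Build a 1/0 profile over all nine genomes from a list of species."""
    return {sp: int(sp in present_species) for sp in FUNGI + ANIMALS}


def classify_ls(profile):
    """Return the LS group matching the profile, or None if none matches.

    `profile` maps a species name to 1 (present) or 0 (absent).
    """
    present = frozenset(sp for sp, flag in profile.items() if flag)
    for group, members in LS_GROUPS.items():
        if present == members:
            return group
    return None


aspergillus_gene = classify_ls(make_profile(("A. fumigatus", "A. nidulans")))
patchy_gene = classify_ls(make_profile(("A. fumigatus", "S. pombe")))
```

Genes with patchy profiles that match none of the patterns return `None` here and would simply be left unclassified.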

9.3.4 Divergence Times

Lineage divergence times are somewhat controversial [285]. In this work divergence times were taken from [130] and [131]. These give the following divergence times (Fig. 9.1): Animals vs Fungi, 1576 Mya; Fungi vs Ascomycetes, 1144 Mya; Saccharomyces and Candida vs Aspergillus, 1085 Mya; Candida vs Saccharomyces, 841 Mya; Neurospora vs Aspergillus, 670 Mya. Divergence times for S. cerevisiae vs S. mikatae and A. fumigatus vs A. nidulans were taken as ∼10 Mya. To convert LS into numeric form to calculate correlations with other properties, the ratio of the time of the animal-fungi divergence to that of the divergence of a lineage from its last common ancestor was used. For example, the Eukaryotes-core value is 1 (1458/1458) while that of Ascomycota-core is 1.27 (1458/1144). The final results were not sensitive to changes in the divergence time estimates used for this category-to-numeric conversion.
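The category-to-numeric conversion can be written as a short Python sketch. It uses the 1458 Mya value from the worked example; only the Eukaryotes-core and Ascomycota-core values are stated in the text, so the times assigned to the other four groups are assumptions, and the function and variable names are hypothetical.

```python
ROOT_TIME_MYA = 1458.0  # animal-fungi divergence used in the worked example


def ls_value(divergence_time_mya, root_time_mya=ROOT_TIME_MYA):
    """Numeric LS: root divergence time divided by the group's divergence time."""
    return root_time_mya / divergence_time_mya


# Divergence time (Mya) taken to define each LS group. The first two follow
# the worked example in the text; the remaining four are assumed assignments.
GROUP_TIMES_MYA = {
    "Eukaryotes-core": 1458.0,
    "Ascomycota-core": 1144.0,
    "Hemiascomycetes-specific": 841.0,  # assumption: Candida vs Saccharomyces
    "Euascomycetes-specific": 670.0,    # assumption: Neurospora vs Aspergillus
    "Saccharomyces-specific": 10.0,     # assumption: S. cerevisiae vs S. mikatae
    "Aspergillus-specific": 10.0,       # assumption: A. fumigatus vs A. nidulans
}

ls_numeric = {group: ls_value(t) for group, t in GROUP_TIMES_MYA.items()}
```

Because the value is a ratio, more recent (more lineage-specific) groups receive larger numbers.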

9.3.5 Estimation of substitution rates and statistical analyses

The number of synonymous substitutions per synonymous site, Ks, and the number of nonsynonymous substitutions per nonsynonymous site, Ka, were estimated between A. fumigatus-A. nidulans ortholog pairs and S. cerevisiae-S. mikatae ortholog pairs in the Euascomycetes and Hemiascomycetes lineages, respectively. For each ortholog pair, the orthologous protein sequences were aligned using ClustalW version 1.82 with the default parameters. The corresponding nucleotide-sequence alignments were derived by substituting the respective coding sequences for the protein sequences, using MBEToolbox (Chapter 10 [35]). Ks and Ka were then estimated by the maximum-likelihood method implemented in the CODEML program of PAML [359].
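The step of threading coding sequences back onto a protein alignment can be illustrated with a short Python sketch. This is not the MBEToolbox code; the function name is hypothetical, and dropping a single terminal stop codon is an assumption made here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def back_translate_alignment(aligned_protein, cds):
    """Thread a coding sequence onto an aligned protein sequence.

    Each residue is replaced by its codon and each gap '-' by '---'.
    One terminal stop codon in `cds` is dropped if present (assumption).
    """
    cds = cds.upper()
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) == 3 * n_residues + 3 and cds[-3:] in STOP_CODONS:
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein sequence")
    codons = []
    pos = 0
    for c in aligned_protein:
        if c == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example: two aligned proteins and their coding sequences.
codon_seq_1 = back_translate_alignment("MK-L", "ATGAAACTGTAA")
codon_seq_2 = back_translate_alignment("M-RL", "ATGCGTTTG")
```

The two outputs have the same length and keep gaps in frame, which is what codon-based estimators such as CODEML require.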

High apparent sequence divergence, as shown by high Ks or Ka values, is often associated with problems such as difficulty in alignment, or differences in codon usage bias or nucleotide composition in the sequences. Ortholog pairs with Ks < 0.05 may include too few substitutions to provide a statistically significant measure of change [371]. To accurately measure the intensity of selective forces acting on a protein, only ortholog pairs with Ka ≤ 2 and 0.05 ≤ Ks ≤ 2 were used. Similar results were obtained when more relaxed cutoffs for Ka and Ks (≤ 5) were used (data not shown). All known ribosomal protein genes were excluded from the data set as their high level of conservation gives them substantially lower average values of Ka, Ks and Ka/Ks than those for the rest of the genes.

Statistical regression analyses were performed following the procedure described by Rocha and Danchin. Since the linear regression model works better with normally distributed variables, the scatter plots of Ka against other variables were examined to determine whether linear models are reasonable for these variables. It was necessary to transform the values of Ka, expression level and fitness of gene deletion into their logarithmic forms to give distributions closer to a normal distribution. For the same reason, log(Ka) values were used in the correlation and partial correlation analyses.
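The Ka and Ks cut-offs and the log transformation described above can be sketched in Python as follows. The thresholds are those given in the text; the base-10 logarithm, the requirement that Ka be positive before taking the log, and all names and toy values are assumptions of this sketch.

```python
import math


def filter_and_transform(pairs, ka_max=2.0, ks_min=0.05, ks_max=2.0):
    """Keep pairs with Ka <= 2 and 0.05 <= Ks <= 2, and log-transform Ka.

    `pairs` is an iterable of (gene_id, Ka, Ks). Pairs with Ka == 0 are
    dropped because log(Ka) is undefined (an assumption of this sketch).
    Returns a list of (gene_id, log10(Ka), Ks).
    """
    kept = []
    for gene_id, ka, ks in pairs:
        if 0 < ka <= ka_max and ks_min <= ks <= ks_max:
            kept.append((gene_id, math.log10(ka), ks))
    return kept


# Hypothetical ortholog pairs: (gene_id, Ka, Ks).
toy_pairs = [
    ("g1", 0.10, 0.80),   # kept
    ("g2", 0.02, 0.03),   # dropped: Ks < 0.05
    ("g3", 0.50, 2.60),   # dropped: Ks > 2
    ("g4", 2.40, 1.90),   # dropped: Ka > 2
]
filtered = filter_and_transform(toy_pairs)
```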

9.3.6 Detection of rate variability across species - Relative Divergence Score (RDS)

To measure the degree of divergence of genes in a species away from orthologs in other species, TBLASTN comparisons for all proteins in the A. fumigatus or S. cerevisiae genomes were run against all DNA sequences in the 9 genomes studied here. The relative divergence score (RDS) was defined as $D_{A,B} = -\ln(S_{A,B}/S_{A,A})$, where $S_{A,B}$ is the TBLASTN bit score for the query protein from genome A against subject genome B, and $S_{A,A}$ is the score of the same query against its own genome A. Such scores range from 0 (identical proteins found in the subject genome) to infinity (no significant hit found). For genes belonging to each LS group, and to the relevant species at each divergence time point, 10,000 bootstrapped medians of random samples were taken from the RDS values of the genes. The mean of the bootstrapped medians was used as the estimated RDS of the LS group.
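The RDS definition and the bootstrap summary can be sketched in Python using only the standard library. The function names, the fixed random seed, the choice of resampling with replacement at the original sample size, and the toy bit scores are assumptions; only the formula and the 10,000 replicates come from the text.

```python
import math
import random
import statistics


def relative_divergence_score(s_ab, s_aa):
    """RDS = -ln(S_AB / S_AA); infinity when no significant hit was found."""
    if s_ab is None or s_ab <= 0:
        return math.inf
    if s_aa <= 0:
        raise ValueError("self-score S_AA must be positive")
    return -math.log(s_ab / s_aa)


def mean_bootstrap_median(values, n_boot=10000, seed=0):
    """Mean of the medians of `n_boot` bootstrap resamples of `values`."""
    rng = random.Random(seed)
    n = len(values)
    medians = [
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    ]
    return sum(medians) / n_boot


# Hypothetical (S_AB, S_AA) bit scores for five genes of one LS group.
scores = [(180.0, 200.0), (150.0, 210.0), (90.0, 190.0), (60.0, 205.0), (120.0, 198.0)]
rds_values = [relative_divergence_score(s_ab, s_aa) for s_ab, s_aa in scores]
group_rds = mean_bootstrap_median(rds_values, n_boot=1000)
```

Using the median inside each replicate keeps the estimate robust to the occasional very large score that arises when a hit is weak.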

9.4 Results

9.4.1 Evolutionary rate differences among LS groups

The Ascomycotan fungi used in this study represent two distinct fungal groups, Euascomycetes (A. nidulans, A. fumigatus and N. crassa) and Hemiascomycetes (S. cerevisiae, S. mikatae and C. albicans), together with the more distantly related fission yeast, S. pombe. Data from the two groups, Euascomycetes and Hemiascomycetes, were processed separately. For the Euascomycetes sequences, we predicted 6,432 A. fumigatus-A. nidulans orthologs and calculated the nonsynonymous substitution rate, Ka, and the synonymous substitution rate, Ks, for each gene pair. We then classified the predicted orthologs into the following groups: (1) Eukaryotes-core, (2) Ascomycota-core, (3) Euascomycetes-specific and (4) Aspergillus-specific, according to the phylogenetic profiles of A. fumigatus genes. The Hemiascomycetes sequences gave 3,707 pairs of S. cerevisiae-S. mikatae orthologs, which were processed similarly and classified into four groups: (1) Eukaryotes-core, (2) Ascomycota-core, (3) Hemiascomycetes-specific and (4) Saccharomyces-specific. Thus, LS groups from (1) to (4) represent increasingly more recent times of origin.

Filtering steps of (1) removing ortholog pairs with Ks or Ka > 2 or Ks < 0.05, (2) excluding ribosomal proteins, and (3) eliminating genes where possible similarity to a reconstructed ancestral sequence was found, were applied to the data set. Step 3 removed only 3 gene pairs, 2 in the Hemiascomycetes lineage and 1 in the Euascomycetes lineage, which may be due to either the limits of the ancestral reconstruction method or the relatively conservative criteria adopted in defining orthologs. Final sets of 183 A. fumigatus-A. nidulans ortholog pairs and of 359 S. cerevisiae-S. mikatae ortholog pairs were obtained. The mean Ka, Ks and Ka/Ks of the ortholog pairs in each LS group are given in Table 9.2.

[Figure 9.2: box plots of Ka by LS group. (A) x-axis: Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Aspergillus-specific; N = 113, 27, 22, 21. (B) x-axis: Eukaryotes-core, Ascomycota-core, Hemiascomycetes-specific, Saccharomyces-specific; N = 17, 23, 22, 297.]

Figure 9.2: Divergence of nonsynonymous substitution rate in LS groups. The edges of the boxes indicate the upper and lower quartiles. The line at the centre of the box indicates the median and the edges of the whiskers represent the limits of 1.5 times the upper or lower inter-quartile ranges. The circle (°) indicates cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box. The number of gene pairs (N) is given. (A) A. fumigatus-A. nidulans orthologs. (B) S. cerevisiae-S. mikatae orthologs.

Genes that are distributed in the more specific lineages tend to have higher Ka values than more widely distributed genes. Box plots of the distribution of the Ka values for the Aspergillus and Saccharomyces genes are shown in Fig. 9.2 (A and B, respectively). In both the Aspergillus and Saccharomyces gene sets, average Ka increases with the degree of LS, with significant among-group variation as measured by a Kruskal-Wallis test (Aspergillus, P < 0.001; Saccharomyces, P < 0.001). Moreover, as expected, Ka is consistently smaller than Ks within all LS groups, which suggests the operation of purifying (negative) selection or functional constraints.

The ratio Ka/Ks (i.e., the rate of nonsynonymous substitutions corrected for neutral rates) showed a trend similar to Ka, namely, the values of Ka/Ks for genes of high LS (e.g., Aspergillus-specific or Euascomycetes-specific genes) are significantly higher than those for genes of low LS (e.g., Eukaryotes-core or Ascomycota-core genes). The differences among the rates of sequence divergence for different LS groups are more pronounced for Ka than for Ks, which suggests that the acceleration of a gene’s divergence rate may be mainly caused by more relaxed purifying selection against amino acid replacement. Functions of representative genes in different LS groups were also examined. Largely, the functions of highly lineage-specific genes are poorly characterised or simply unknown.

[Figure 9.3: scatter plots of (A) Log(Ka) and (B) Log(Ks) against Log(EXP). Legend entries: Saccharomyces-specific, Hemiascomycetes-specific, Ascomycota-core, Eukaryotes-core, All genes.]

Figure 9.3: Dependence of log gene expression level, Log(EXP), and substitution rate. (A) log non-synonymous substitution rate, log(Ka). (B) log synonymous substitution rate, log(Ks).

[Figure 9.4: plots of -ln(D), where D is the relative dissimilarity score, against divergence time (Myr). (A) Euascomycetes-specific, Ascomycota-core and Eukaryotes-core genes; R2 = 0.9518 and R2 = 0.9429. (B) Hemiascomycetes-specific, Ascomycota-core and Eukaryotes-core genes; R2 = 0.9544 and R2 = 0.939.]

Figure 9.4: Linear regression analysis of divergence time and RDS. (A) LS of A. fumigatus-A. nidulans genes. (B) LS of S. cerevisiae-S. mikatae genes.

9.4.2 Evolutionary rate-related factors of genes belonging to different LS groups

The correlation between Ka and LS may be confounded by other factors. For S. cerevisiae-S. mikatae orthologs, bivariate correlations were used to compute the pairwise associations between Ka and LS and potentially confounding factors. These factors include the expression level of genes, the dispensability or essentiality of a gene, gene duplication and the number of protein-protein interactions of the gene product. The results are summarised in the upper diagonal of Table 9.3. The coefficient for correlation between log(Ka) and LS is 0.584 (Pearson’s R, P < 0.01, Table 9.4), which is higher than that between log(Ka) and any other factor or that between any two other factors.

Log gene expression level correlates negatively with log Ka (R = -0.382, P < 0.01, Table 9.3, Fig. 9.3). This is consistent with previous studies which showed a correlation between Ka and gene expression level [127, 241]. A correlation between Ka and gene essentiality has long been proposed [343] but remains controversial [141,149]. The correlation between log(Ka) and gene essentiality was found to be weak, albeit significant (R = -0.163, P < 0.01), and essential genes have a lower mean Ka (0.081, median 0.081) compared to that for non-essential genes (mean 0.136; median 0.110) (Mann-Whitney U test, P = 0.004).

Our data show a weak correlation between log(Ka) and gene dispensability (R = 0.186, P < 0.001, Table 9.3), which is of a similar magnitude to that of gene essentiality. This result is consistent with that recently reported by Hirsh and Fraser. This correlation remains significant after controlling for gene expression levels (partial R = 0.240, P < 0.01), suggesting the independent nature of gene dispensability as a factor.

Gene duplication has been shown to play a role in influencing gene divergence rates [119,150,357]. Genes were classified as duplicate genes if they belonged to any multigene family and as singletons otherwise. The mean Ka of 0.097 (median 0.049) for duplicate genes was significantly smaller than the mean of 0.138 (median = 0.114) for singleton genes (Mann-Whitney U test, P < 0.001). The same pattern was observed between different LS groups with the exception of the Ascomycota-core group.

Table 9.2: Average Ka, Ks and Ka/Ks of genes in different LS groups in both the Euascomycetes branch and the Hemiascomycetes branch. * A Kruskal-Wallis test reveals significant rate heterogeneity of average Ka or average Ka/Ks among LS classes, P < 0.001. § A Kruskal-Wallis test reveals no significant rate heterogeneity of average Ks of genes in different LS groups in both the Euascomycetes branch and the Hemiascomycetes branch, P > 0.01.

[Table 9.2 layout: columns are LS class, number of gene pairs, Ka* mean (SD), Ks§ mean (SD) and Ka/Ks* mean (SD). Rows for A. fumigatus – A. nidulans (Euascomycetes branch): Eukaryotes-core, Ascomycota-core, Euascomycetes-specific and Aspergillus-specific. Rows for S. cerevisiae – S. mikatae (Hemiascomycetes branch): Eukaryotes-core, Ascomycota-core, Hemiascomycetes-specific and Saccharomyces-specific. Recoverable row: Aspergillus-specific, 21 gene pairs, Ka 0.293 (0.136), Ks 1.263 (0.567), Ka/Ks 0.261 (0.127).]

Table 9.3: Correlation (Pearson’s R) (upper triangle) and partial correlation after controlling for log(Ks) (lower triangle). Abbreviations: Ka: nonsynonymous substitution rate; LS: lineage specificity; EXP: expression level; FIT: fitness effect (gene dispensability); ESS: gene essentiality; DUP: duplicated (or not) gene; INT: number of interactions. Among them, Ka, Ks, EXP and FIT are in their log forms.

        Ka       LS       EXP      FIT      ESS      DUP      INT      Ks
Ka      –        0.584    -0.382   0.186    -0.163   0.257    -0.308   0.429
LS      0.582    –        -0.271   0.195    -0.263   0.324    -0.428   0.185
EXP     -0.294   -0.161   –        -0.037   0.076    -0.113   0.197    -0.165
FIT     0.240    0.192    -0.049   –        0.032    -0.116   -0.159   -0.048
ESS     -0.018   -0.146   -0.091   0.033    –        0.020    0.243    -0.087
DUP     0.215    0.312    -0.065   -0.106   0.028    –        -0.163   0.160
INT     -0.253   -0.379   0.123    -0.175   -0.007   -0.111   –        -0.128

Table 9.4: Results of the regression analyses on 359 predicted S. cerevisiae-S. mikatae orthologs. * Order of variables entered into the model at each step. ¶ R2 is the proportion of variation in the dependent variable explained by the regression model constructed from the individual variable. The values indicate the independent contribution of each variable to explain the global variance of log(Ka). ** t statistics can indicate the relative importance of each variable in the model.

Variable      Entry order*   Indep. contribution (R2)¶   Unstd. coeff. (B) ± 1 SE   Std. coeff. (β)   t**       P
Included variables
(Constant)    –              –                           -1.149 ± 0.113             –                 -10.148   < 0.0001
LS            1              0.341                       0.048 ± 0.004              0.562             11.676    < 0.0001
log(EXP)      2              0.164                       -0.197 ± 0.038             -0.247            -5.124    < 0.0001
Excluded variables
log(FIT)      3              0.035                       –                          0.087             1.836     > 0.1
DUP           4              0.066                       –                          0.070             1.399     > 0.1
ESS           5              0.027                       –                          0.038             0.787     > 0.1
INT           6              0.095                       –                          -0.028            -0.546    > 0.1

Ka has been shown to be positively correlated with Ks in several species [116, 214, 239, 344]. Such a correlation, which may confound correlations between log(Ka) and LS or with other factors, was observed here for log(Ka) and log(Ks) (R = 0.429, P < 0.01, Table 9.3). To examine the influence of the correlation of Ka with Ks on other factors, partial correlation coefficients between log(Ka) and other variables were calculated while holding the value of log(Ks) constant. The results are given in the lower triangle of Table 9.3 and indicate that, after controlling for log(Ks), log(Ka) remains significantly correlated with LS. There is little change in the value of the coefficient with or without controlling for log(Ks) (partial $R_{\log(K_a),LS \mid \log(K_s)} = 0.582$ compared with $R_{\log(K_a),LS} = 0.584$). Thus, Ka is correlated with LS independently of Ks.

A decrease in the absolute value of the correlation coefficient was observed between log(Ka) and expression level when controlling for log(Ks) ($|R_{\log(K_a),\log(EXP) \mid \log(K_s)}| = 0.294$ and $|R_{\log(K_a),\log(EXP)}| = 0.382$).
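The idea of holding log(Ks) constant can be illustrated with the standard first-order partial correlation formula. The sketch below is only an illustration: the function name `partial_corr` is hypothetical, and the inputs are the rounded Pearson coefficients quoted in Table 9.3. Applied to log(Ka), LS and log(Ks) it gives roughly 0.57, close to the reported 0.582 but not identical; the thesis does not say how the partial coefficients were computed, and rounding or a different subset of genes could account for the difference.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, holding z constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Rounded pairwise coefficients from the upper triangle of Table 9.3
r_ka_ls = 0.584   # log(Ka) vs LS
r_ka_ks = 0.429   # log(Ka) vs log(Ks)
r_ls_ks = 0.185   # LS vs log(Ks)

r_partial = partial_corr(r_ka_ls, r_ka_ks, r_ls_ks)
print(round(r_partial, 3))
```

Because the control variable is only weakly correlated with LS, the partial coefficient stays close to the raw one, which is the pattern the text describes.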

This suggests Ks might be a confounding factor for gene expression level in determining Ka. Figure 9.3 plots the relationship of log expression level with log(Ka) (Fig. 9.3A) and with log(Ks) (Fig. 9.3B), showing the values for the Saccharomyces gene lineage groups. The more consistent relationship of log expression value with log(Ks) among the genes can be seen.

Linear multiple regression was used to further examine the effect of multiple factors on log(Ka). Gene essentiality and gene redundancy were recoded as quantitative variables by using two binary variables (essential = 1 and non-essential = 0; duplicated gene = 1 and singleton gene = 0). A forward stepwise regression model was used to examine the contribution of each independent variable to the regression. The regression model defines log(Ka) as a function of LS ($X_{LS}$), log expression level ($\log(X_{exp})$), log fitness effect of gene deletion ($\log(X_{fit})$), essentiality ($X_{ess}$), gene duplication ($X_{dup}$), and the number of protein interactions ($X_{int}$):

$$\log(K_a) = \beta_0 + \beta_{LS} X_{LS} + \beta_{exp} \log(X_{exp}) + \beta_{fit} \log(X_{fit}) + \beta_{ess} X_{ess} + \beta_{dup} X_{dup} + \beta_{int} X_{int}$$

Table 9.4 gives the results of the modelling procedure. The final model gives a global R2 of 0.436 (P < 0.001). That is, nearly one half of the variation in log(Ka) is explained by this model. During the construction of the final model, the predictors most highly correlated with log(Ka), LS and the expression level, were kept. The remaining variables, which have minor roles in overall regression with log(Ka), were excluded from the final model (Table 9.4). The standardised coefficients were examined to determine the relative importance of the significant predictors. LS contributes more to the model than does the expression level, as shown by its larger absolute standardised coefficient of 0.562 and t statistic of 11.676, when compared with absolute values of 0.247 and 5.124, respectively, for expression level. This analysis suggests that LS is the most relevant predictor of the rate of protein divergence.

9.4.3 Linear regression of divergence time and relative divergence score (RDS)

To relate the group divergence times and RDS, a linear regression for each LS group was performed (Fig. 9.4). An increasing linear trend of RDS with divergence time was observed in each LS group, suggesting that genes diverge from other species at an approximately constant rate. Groups with higher LS have greater slopes than those with lower LS, indicating that genes with higher LS evolve faster than those with lower LS. This trend would still be apparent if different divergence time estimates were used.
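The kind of fit shown in Fig. 9.4 can be sketched as an ordinary least-squares regression of -ln(D) on divergence time. This is a minimal illustration only: the helper `fit_line` is hypothetical, and the time points and D values are invented for the example and are not read from the figure.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical values for illustration only (not taken from Fig. 9.4)
times_myr = [200.0, 500.0, 1000.0, 1500.0]
d_scores = [0.80, 0.60, 0.35, 0.20]
neg_log_d = [-math.log(d) for d in d_scores]

intercept, slope, r_squared = fit_line(times_myr, neg_log_d)
print(slope, r_squared)
```

Under this reading, comparing LS groups amounts to comparing the fitted slopes, with a larger slope meaning faster divergence per unit time.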

9.5 Discussion

The phylogenetic distribution of a gene has been suggested to be of biological importance. For example, genes with the same phylogenetic distribution may have linked functions [8, 218]. Lineage specificity (LS) is a form of phylogenetic distribution whereby genes are found only in a group of species that diverge from a certain point in a species tree. Orphan genes, those identified from only one species, are the extremes of lineage specificity.

How these orphan and lineage specific genes arose is still an open question. Three possibilities are generally proposed [73]. One is that genes in a lineage originate from a lineage ancestral gene formed by the recombination of exons from other genes or from random ORFs. These genes might show similarity to the original exons and so not necessarily be considered orphans or lineage specific. In the case of formation from random ORFs it is unlikely that such a protein would be functional. A second option is gene loss [8, 178]. However, it is relatively unlikely that a gene would be lost in all but one lineage [73] and this may not explain most orphan or lineage specific genes. The third option, which is examined here, is that some genes evolve at a rapid rate and so can no longer be recognised as orthologs of the genes they diverged from after a certain time span.

If accelerated rates of evolution lead to the creation of orphan or lineage specific genes, then it follows that genes with a high degree of LS should show higher rates of evolution than genes with lower degrees of LS. This hypothesis has been tested here with respect to the Ascomycotan fungi. If LS arose through widespread gene loss or from creation of new genes from recombination of exons or ORFs there is no reason to expect accelerated evolutionary rates or a trend in evolutionary rate with respect to the degree of LS.

The evolutionary rates of genes in Ascomycotan fungi that have different degrees of LS were compared, revealing a significant, strong correlation between LS and the evolutionary rates of the genes. Genes with narrow phylogenetic distributions (high LS) tend to have elevated evolutionary rates when compared with more ubiquitous genes (low LS). This is consistent with the hypothesis that acceleration of the evolutionary rate is largely responsible for the formation of lineage specific genes.

However, the rate of gene evolution is one of the most important parameters in molecular evolution. Correlations between the rate of gene evolution and many properties of genes, including their phylogenetic distribution, have been explored by several studies. As noted in the Introduction, the evolutionary rate has been associated with expression level [127,241], gene dispensability [178], essentiality [343] or morbidity, gene duplication, gene loss [178] and protein-protein interactions [93,335]. Not all these studies have been in agreement, e.g., [93,151]. These factors may influence the apparent correlation of LS with evolutionary rate.

All pair-wise correlations of these factors with LS, Ka and Ks were examined to investigate the influence of these factors on the relationship between LS and Ka. The strongest correlation observed was that of LS with log(Ka); however, log(Ka) also correlated highly with log(Ks). Correlations of log(Ka) with LS and the other factors were then calculated after controlling for log(Ks). Again the correlation of LS with log(Ka) was the strongest and similar to that without controlling for log(Ks).

With one exception, both LS and log(Ka) showed significant but low correlations to all other factors. As log(Ka) showed the strongest correlation with LS in both cases it seems clear that the evolutionary rate has a considerable, though not unique, influence on the origin of LS.

Further examination of this was undertaken with a stepwise regression analysis of the factors likely to influence Ka. In the final regression model, which explained close to half the variation in log(Ka), only the parameters LS and log expression level were kept, with LS making the larger individual contribution. The other parameters investigated did not make significant contributions to the regression model. This again indicated the role of evolutionary rate on LS.

Another approach used the relative divergence score (RDS), which measures the divergence of a gene from its orthologs in other genomes as a ratio of the TBLASTN score with its orthologs to the maximal (or self-self) score. This provides another view of the degree of divergence within a lineage and, when matched to divergence times, allows an examination of the evolutionary rate as the degree of LS increases. Within each LS group a reasonably constant rate of evolution was seen since the appearance of the LS group. Groups with low LS show lower RDS values and evolutionary rates than groups with higher LS, consistent with the evolutionary rate being a major determinant of LS. Allowing for errors in the determination of divergence times this trend will still hold.

Genes with a certain degree of LS may have arisen from duplication followed by acquisition of a lineage specific function [73] or simply have diverged from a common ancestor to the extent that they cannot be recognised as orthologs across lineages. Our findings support the idea that genes destined to have high levels of LS will have higher evolutionary rates. It should be noted that Ka is a measurement of the average nonsynonymous substitution rate along the whole length of a gene. Although highly lineage-specific genes had higher average Ka, the extent to which region-specific or site-specific contributions to Ka affect this was not examined. Further research could be directed to evaluate such region- or site-specific effects on the rate of protein divergence, especially, for instance, for genes that have high LS but low evolutionary rates or vice versa.

For ascomycotan fungi, our findings show that the degree of LS correlates with the evolutionary rate and indicate that an elevated evolutionary rate may be a major cause of the development of lineage specific genes.

Chapter 10

MBETOOLBOX: A MATLAB TOOLBOX FOR SEQUENCE DATA ANALYSIS IN MOLECULAR BIOLOGY AND EVOLUTION

This chapter is very closely based on a paper I have published [35]. The original draft of the manuscript was revised by Dr. David K. Smith of the Department of Biochemistry, HKU.

10.1 Introduction

Matlab is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics and computation, algorithm development, data acquisition, modelling, simulation, and scientific and engineering graphics. However, few functions are freely available in Matlab to perform sequence data analysis specifically for molecular biology and evolution. I have developed a Matlab toolbox, called MBEToolbox, which aims to fill this gap by offering efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences, calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible functional framework for more specialized needs in exploring and analysing aligned nucleotide or protein sequences from the evolutionary perspective. All functions in the toolbox are accessible through the command line for seasoned Matlab users, and a graphical user interface is also provided, which may be especially useful for non-specialist end users. Through application of this software during the Penicillium marneffei genome project, MBEToolbox has proved to be a useful tool that can aid in the exploration, interpretation and visualization of data in molecular biology and evolution. The software is publicly available at http://web.hku.hk/~jamescai/mbetoolbox/.

10.2 Literature Review

10.2.1 Probabilistic DNA substitution models

In this section I will discuss probability models, more specifically, Markov models. (Other types of models, e.g., deterministic models, also exist.) Markov models can be discrete or continuous with regard to time. The discrete time models are called Markov chains, whereas continuous time models are usually called Markov processes. The mathematical notation used in this section is: R - intrinsic rate matrix; Q - (instantaneous) transition rate matrix; P - transition probability matrix; X - divergence matrix; Π - matrix of base frequencies; and t - time or evolutionary distance.

Molecular evolution of sequences is generally modelled under a hypothesis of phylogeny, i.e., by modelling sequence evolution along a branch of a phylogenetic tree. This is done using a continuous time Markov process, more specifically a finite, aperiodic, irreducible one (referred to here simply as a Markov process). A Markov process has a defined state space, e.g., {A, C, G, T}, and the (instantaneous) transition rate between states is given by an n × n transition rate matrix, Q, where $Q_{ij} > 0$ for all $i \neq j$ and $Q_{ii} = -\sum_{j \neq i} Q_{ij}$. Amino acid models have n = 20, while nucleotide models have n = 4, e.g.:

$$Q = \begin{pmatrix} -1.218 & 0.504 & 0.336 & 0.378 \\ 0.126 & -0.882 & 0.252 & 0.504 \\ 0.168 & 0.504 & -1.050 & 0.378 \\ 0.126 & 0.672 & 0.252 & -1.050 \end{pmatrix}$$

$Q_{ij}$ indicates the rate for going from state i to state j. Since the total instantaneous rate is zero, each row should sum to zero. For a specified time interval, t, we can calculate the transition probability matrix from $P(t) = e^{Qt}$, e.g.:

$$P(t) = \begin{pmatrix} 0.6883 & 0.1308 & 0.0828 & 0.0981 \\ 0.0327 & 0.7783 & 0.0654 & 0.1236 \\ 0.0414 & 0.1308 & 0.7297 & 0.0981 \\ 0.0327 & 0.1647 & 0.0654 & 0.7372 \end{pmatrix}$$

Here t = 0.33 and the exponential is the matrix exponential. In Matlab, this is computed using a scaling and squaring algorithm with a Padé approximation. In P, each row sums to one, since the total probability over the time interval is one. If the Markov process is run for a sufficiently long time, the probabilities P(t) converge on a stationary distribution such that, for all pairs (i, k) of starting states, $P_{ij}(t) = P_{kj}(t)$. That is, the probability of the end state is independent of the starting state. Here we will limit our discussion to cases where the overall rate of change from state i to state j is the same as the rate from j to i, a constraint satisfied by models that are said to be time-reversible. The models used in phylogenetic inference to date are almost exclusively subsets of this class.
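The two properties just described, rows of P(t) summing to one and convergence to a stationary distribution, can be checked numerically. The sketch below uses plain Python rather than Matlab, and it replaces the Padé approximation with a truncated Taylor series inside the same scaling and squaring scheme, so it is not the routine Matlab uses. The function names `mat_mul`, `mat_exp` and `transition_probabilities` are hypothetical. Q is the example matrix given above; the printed entries depend on how Q and t are scaled and are not guaranteed to reproduce the four-decimal matrix quoted above.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20, squarings=10):
    """exp(m) by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[m[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds small**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(q, t):
    """P(t) = exp(Q t) for a rate matrix q and time (or distance) t."""
    n = len(q)
    return mat_exp([[q[i][j] * t for j in range(n)] for i in range(n)])


Q = [[-1.218, 0.504, 0.336, 0.378],
     [0.126, -0.882, 0.252, 0.504],
     [0.168, 0.504, -1.050, 0.378],
     [0.126, 0.672, 0.252, -1.050]]

P_short = transition_probabilities(Q, 0.33)
P_long = transition_probabilities(Q, 100.0)

for row in P_short:
    print([round(v, 4) for v in row], "row sum:", round(sum(row), 6))
print([round(v, 4) for v in P_long[0]])
```

For large t every row of `P_long` should be the same vector, which is the stationary distribution of the process.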

The transition rate matrix, Q, can be decomposed into an intrinsic rate matrix, R, and Π, such that:

$$Q = R\Pi$$

Here R is symmetric, Q is constructed as indicated above, and Π holds the equilibrium base frequencies. The rates in R at which each state is replaced with each alternative state, and the methods for calculating or estimating Π, are set differently in different situations. Hence, different DNA substitution models exist. I will start with the most general model of nucleotide substitution, the general time reversible model (GTR), also called REV. The instantaneous rate matrix for the REV model is:

$$R = \begin{pmatrix} - & \mu a & \mu b & \mu c \\ \mu a & - & \mu d & \mu e \\ \mu b & \mu d & - & \mu f \\ \mu c & \mu e & \mu f & - \end{pmatrix}$$

In this matrix, the rows (and columns) correspond to the bases A, C, G, and T respectively. The factor µ represents the mean instantaneous rate. This rate is modified by the relative rate parameters a, b, ..., f, which correspond to each possible transformation between two bases. To construct Q, all we need to do is compute RΠ, where Π = diag(πA, πC, πG, πT) holds the frequency parameters that correspond to the frequencies of the four bases. The diagonal elements of Q are always chosen so that each row sums to zero.

Many other models, which still belong to the GTR class, have been proposed. They are usually designated by the initial letters of the authors’ last names and the year of publication. Their relationship is illustrated in Fig. 10.1. The κ parameter represents the ratio of the instantaneous rate of transition-type substitutions to that of transversion-type substitutions. It assumes the value 1.0 for models in which all substitutions are taken to occur at the same rate (i.e., the JC and F81 models). In the K2P and HKY models, the rate of transversion is β, with the rate of transitions being determined as α = κβ.

[Figure 10.1 diagram: JC (πA = πC = πG = πT, α = β) leads to K2P (πA = πC = πG = πT, α ≠ β) by allowing a transition/transversion bias, and to F81 (πA ≠ πC ≠ πG ≠ πT, α = β) by allowing base frequencies to vary. K2P leads to HKY85 (πA ≠ πC ≠ πG ≠ πT, α ≠ β) by allowing base frequencies to vary, and F81 leads to HKY85 by allowing a transition/transversion bias. HKY85 generalises to GTR/REV (πA ≠ πC ≠ πG ≠ πT; rates a, b, c, d, e, f).]

Figure 10.1: Relationship of GTR class DNA substitution models

JC model The JC model was described by Jukes & Cantor in 1969 [153] and is the most restrictive model. It assumes that the base frequencies are all equal and the instantaneous rate of substitution is the same for all possible changes. When this model is selected, the base frequencies (πA, πC, πG, πT) are all set to 0.25 and the relative rate parameters a, b, ..., f are all set to 1.0. The only free parameter that can be adjusted under this model is the µt parameter.

F81 model The F81 model was described by Felsenstein (1981) [85]. It is like the JC model in assuming that all possible changes occur at the same rate, but allows the base frequencies to be unequal. If the base frequencies are all set to 0.25, this model is equivalent to the JC model. When this model is selected, you will be free to vary the base frequency parameters, but the κ parameter will not be changed as it is set to 1.0 under this model.

K2P model The K2P model was described by Kimura in 1980 [165]. It is like the JC model in assuming equal base frequencies, but allows the rate of transition-type substitutions to differ from the rate of transversion-type substitutions. The ratio of these two instantaneous rates is κ. Two parameters, κ and µt, are free to vary when using this model. If κ is set to 1.0, the K2P model is identical to the JC model. The base frequency parameters are forced to be equal.

HKY model The Hasegawa, Kishino and Yano (HKY) model [126] allows for different rates of transitions and transversions as well as unequal base frequencies. The parameters required by this model are the transition to transversion ratio κ and the base frequencies. If the base frequencies are uniform, the HKY model reduces to the K2P model.
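The nesting of the four models can be made concrete with a small sketch that builds Q from κ, the base frequencies and a transversion rate β. It is an illustration rather than MBEToolbox code: the function name `hky_rate_matrix` and its argument names are hypothetical. The parameter values in the last call (κ = 4/3, β = 1.26, π = (0.1, 0.4, 0.2, 0.3)) are not stated in the text; they were obtained here by dividing out the entries of the example matrix Q in Section 10.2.1, which turns out to fit this parameterisation.

```python
BASES = "ACGT"
# Transitions are purine-purine (A<->G) and pyrimidine-pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(kappa, freqs, beta=1.0):
    """Rate matrix with Q[i][j] = beta * pi_j for transversions and
    kappa * beta * pi_j for transitions; each row sums to zero."""
    n = len(BASES)
    q = [[0.0] * n for _ in range(n)]
    for i, source in enumerate(BASES):
        for j, target in enumerate(BASES):
            if i != j:
                rate = kappa * beta if (source, target) in TRANSITIONS else beta
                q[i][j] = rate * freqs[j]
        q[i][i] = -sum(q[i])   # diagonal is still 0.0 when the sum is taken
    return q


equal = [0.25, 0.25, 0.25, 0.25]
unequal = [0.1, 0.4, 0.2, 0.3]

jc = hky_rate_matrix(1.0, equal)        # JC: equal frequencies, kappa = 1
k2p = hky_rate_matrix(2.0, equal)       # K2P: equal frequencies, kappa free
f81 = hky_rate_matrix(1.0, unequal)     # F81: unequal frequencies, kappa = 1
hky = hky_rate_matrix(2.0, unequal)     # HKY: both free

# Parameter values inferred from the example Q of Section 10.2.1 (an assumption)
q_example = hky_rate_matrix(4.0 / 3.0, unequal, beta=1.26)
for row in q_example:
    print([round(v, 3) for v in row])
```

Each special case is obtained by fixing an argument, which mirrors the arrows of Fig. 10.1: fixing κ = 1 removes the transition/transversion bias, and passing equal frequencies removes the composition bias.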

10.2.2 Maximum likelihood estimation

Maximum likelihood estimation (MLE) is a popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set. Given a set of observations, the method of maximum likelihood finds the parameters of a model that are most consistent with these observations. Here I use a simple and general example to explain the philosophy of MLE.

Example: n data points, X1, X2, ..., Xn, are drawn from a given discrete probability distribution D with known probability mass function fD and distributional parameter θ. The probability associated with each observed data point may be computed as:

$$P(x) = f_D(x \mid \theta)$$

where x ∈ {x1, x2, ..., xn}. At this point, although we know that our data come from the distribution D, we may not know the value of the parameter θ. This is usually the case when we run an experiment to sample data points so that we can estimate some parameter, such as θ, of a distribution. The question is how we should estimate θ.

MLE provides a general technique for seeking an estimate of the value of θ from the sample. We maximise the likelihood of the observed data set over all possible values of θ, i.e., we seek the most likely value of the parameter θ.

We define the likelihood mathematically as:

$$\mathrm{lik}(\theta) = \prod_{i=1}^{n} f_D(x_i \mid \theta)$$

MLE seeks the value $\hat{\theta}$ which maximises this likelihood function over all possible θ. MLE methods are versatile and apply to most models and to different types of data.
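As a small worked instance of this principle, the sketch below estimates the evolutionary distance d between two aligned sequences under the JC model by maximising the log-likelihood numerically, and compares the result with the well-known closed-form JC distance. The site counts are hypothetical, the function names are illustrative, and a simple ternary search stands in for whatever optimiser a real package would use. Constant factors of the likelihood that do not depend on d are omitted because they do not move the maximum.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood (up to a constant) of distance d under the JC model,
    given n_diff mismatches among n_sites aligned sites."""
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))   # probability that a site differs
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)


def maximise(f, lo, hi, tol=1e-9):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


n_sites, n_diff = 200, 30     # hypothetical alignment summary
d_hat = maximise(lambda d: jc_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0)
d_closed = -0.75 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))
print(d_hat, d_closed)
```

For this one-parameter case the numerical and closed-form estimates agree, but richer models generally have no closed form, which is where numerical optimisation becomes necessary.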

The general principle of MLE has been applied in many aspects of phylogenetics, such as phylogenetic parameter estimation and optimal tree searching [41, 85]. Generally, the likelihood of observing a given set of data is maximised for each topology, and the topology that gives the highest maximum likelihood is chosen as the final tree. In this case, however, the parameters to be considered are not the topologies but the branch lengths for each topology, and the likelihood is maximised to estimate branch lengths rather than the topology. The problem with phylogenetic inference based on the optimisation principle is that it is very time-consuming, because the number of possible topologies is very large for a sizable number of nucleotide sequences (> 15) and an enormous amount of computational time is required to find the optimal tree. Calculating MLEs in phylogeny often requires specialised software, because the likelihood equations are complex and non-linear and usually have to be solved by numerical optimisation.

10.2.3 Elements of phylogenetic theory

The purpose of the remainder of this section is to explain how phylogenetic trees may be constructed from analysis of nucleotide and protein sequences. Such analyses enable the evolutionary relationships among species or genes to be deduced. I will review basic concepts of phylogenetic theory, such as the phylogenetic tree and the likelihood calculation of a phylogeny, given a substitution model. Then I will introduce some of the most commonly used software packages for phylogenetic analyses, with their advantages and shortcomings.

Phylogenetic trees

We usually describe evolution, of either genes or species, by using a sketch of a tree-like structure, which represents the hierarchical relationships among species/genes arising through evolution. Such a tree-like structure is a phylogenetic tree. In the case of rooted trees the root is the common ancestor of all the nodes. In an evolutionary tree of species, the ancestral species is located at the root of the tree and contemporary species are the leaves. In this sense, the tree is rooted. The topology of the tree, i.e., its branching pattern, defines the phylogenetic relationships among the nodes. When the data for the ancestors are missing, the phylogenetic trees produced are unrooted, and are only schematic trees comprising a set of nodes linked together by branches. The location of the common ancestor of all the species/genes under study cannot be identified in an unrooted tree.

The string representation of a tree, following the Newick standard, is usually used. It uses the recursive definition of a tree to represent phylogenies in a computer readable form with nested parentheses. For example, a tree can be written:

(outgroup, neurospora, (penicillium, aspergillus));

However one must be aware that this representation is not unique, the following one works as well:

(penicillium,(outgroup,neurospora),aspergillus);

Sometimes, when an outgroup is provided, the rooted representation is:

(outgroup,(neurospora,(penicillium,aspergillus)));

In addition to the branch topology, the branch lengths in phylogeny are also important to specify a particular tree. The lengths of branches represent the evolutionary distances between two consecutive nodes.
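That different Newick strings can describe the same unrooted topology can be checked by comparing the bipartitions (splits) of the leaf set that their internal branches imply. The sketch below is a minimal illustration: the parser handles names and parentheses only (no branch lengths, labels on internal nodes or quoting), and the function names `parse_newick`, `leaves` and `splits` are hypothetical.

```python
def parse_newick(text):
    """Parse a Newick string (names only, no branch lengths) into nested tuples."""
    text = "".join(text.split())          # drop all whitespace
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1                      # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos
        while peek() not in ("", ",", "(", ")"):
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected characters after the tree")
    return tree


def leaves(node):
    """Set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions of the leaf set implied by internal branches."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


t1 = parse_newick("(outgroup, neurospora, (penicillium, aspergillus));")
t2 = parse_newick("(penicillium,(outgroup,neurospora),aspergillus);")
t3 = parse_newick("(outgroup,(neurospora,(penicillium,aspergillus)));")
print(splits(t1) == splits(t2) == splits(t3))
```

All three strings imply the single split that separates penicillium and aspergillus from outgroup and neurospora, so they describe the same unrooted topology and differ only in where the tree is drawn as rooted.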

Phylogeny reconstruction

Data required for phylogeny reconstruction are not limited to nucleotide and amino acid sequences; in fact, protein structures or exon-intron structures can also be used for this purpose. However, I will limit the following discussion to nucleotide and amino acid sequences. It is important to note that most phylogeny-building methods require a multiple alignment of sequences. Sequence alignment is one of the most important problems in bioinformatics. Much effort has been put into improving its efficiency and accuracy, and the area is still actively developing. Once a multiple alignment has been obtained, three different methods are usually used to construct a phylogeny: the distance matrix method, the maximum parsimony method and the maximum likelihood method. A good review of all these methods can be found in [199].

Maximum parsimony infers a phylogenetic tree by minimising the total number of evolutionary steps required to explain a given set of data, or in other words by minimising the total tree length. It is a character-based method: the input data are in the form of “characters” for a range of taxa. Besides a protein or nucleotide residue, a character could be a binary value for the presence or absence of a feature (such as the presence of a tail). Maximum parsimony is a very simple approach, and is popular for this reason. However, it is not always very accurate.

Maximum likelihood evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesised history would give rise to the observed data set. Central to likelihood-based methods is the likelihood function (for a general description, see Section 10.2.2):

$$\text{Likelihood} = f(\text{Data} \mid T, l, \theta)$$

where T is the topology, l is the branch lengths of the given tree and θ denotes the remaining model parameters. The topology with the highest maximum probability (likelihood) is chosen. The advantages of maximum likelihood methods over other methods are that they may have lower variance (being least affected by sampling error), tend to be robust to violations of the assumptions in the evolutionary model, are statistically well founded, can statistically evaluate different tree topologies and use all of the sequence information. There are also some disadvantages: they are very computationally intensive (slow) and the result depends on the model of evolution.
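The notion of counting evolutionary steps in maximum parsimony can be made concrete for a single character on a fixed tree. The sketch below uses Fitch's algorithm, a standard method that the text does not name, and assumes a rooted binary tree written as nested tuples. The function name `fitch_length`, the two candidate trees and the character states are hypothetical and serve only to illustrate the comparison.

```python
def fitch_length(tree, states):
    """Minimum number of state changes for one character on a rooted
    binary tree given as nested tuples of leaf names (Fitch's algorithm)."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common          # children agree on at least one state
        changes += 1               # a change is needed on one of the two branches
        return left | right

    visit(tree)
    return changes


# Hypothetical character (one alignment column) for four taxa
site = {"penicillium": "A", "aspergillus": "A", "neurospora": "G", "outgroup": "G"}

tree_a = (("penicillium", "aspergillus"), ("neurospora", "outgroup"))
tree_b = (("penicillium", "neurospora"), ("aspergillus", "outgroup"))

print(fitch_length(tree_a, site), fitch_length(tree_b, site))
```

Summing this count over all alignment columns gives the tree length, and the parsimony criterion prefers the topology with the smaller total; here the first tree explains the character with fewer changes than the second.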

Computation of likelihood of phylogeny

Substitution models describe the way sequences evolve in time by nucleotide replacements. The most commonly used Markov models of DNA substitution have been reviewed in Section 10.2.1.

10.2.4 Programs used for phylogenetic analyses

A few selected programs are introduced below; they are representative of the most commonly used ones in phylogenetic analyses.

PAUP* - http://paup.csit.fsu.edu/ is an integrated and user-friendly package. Many distinct models of nucleotide substitution are available (all possible submodels of the GTR + Γ + inv sites model). It does not allow analyses of protein sequences using parametric approaches.

Tree-Puzzle - http://www.tree-puzzle.de/ reconstructs phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. It also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees.

Mesquite - http://mesquiteproject.org/mesquite/mesquite.html is an extensible and modular program for a variety of evolutionary analy- ses. It is written in Java, therefore, is plantform-independent. At this point Mesquite is of limited usefulness because it is a modular set of programs to which specific applications must be added. But it does im- plement one- and two-parameter models of evolution for ancestral state reconstruction.

MrBayes - http://morphbank.ebc.uu.se/mrbayes/ is a program for Markov chain Monte Carlo analysis of phylogeny. It implements a limited set of submodels of the GTR + Γ + inv sites model. The current version allows the use of mixed models (e.g., distinct GTR + Γ + inv sites submodels for 1st, 2nd, and 3rd codon positions or for different genes). A number of protein models, using parameters estimated from large-scale analyses of protein databases, are also available. It is the only known package implementing the covarion model.

PAML - http://abacus.gene.ucl.ac.uk/software/paml.html is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It contains a modular set of programs that support a flexible range of likelihood analyses (submodels of the GTR + Γ + inv sites model, amino acid models, codon-based models). It is not designed for tree searches, but its flexibility makes it ideal for analyses of the evolutionary process and estimation of evolutionary parameters. PAML has a simulator module called “evolver” that is also quite flexible.

PHYLIP - http://evolution.genetics.washington.edu/phylip.html is a modular set of programs for various types of phylogenetic analyses (including likelihood analyses of DNA and proteins). It implements a heuristic tree space search algorithm, which is faster than PAML, but does not search as rapidly or as extensively as PAUP*.

10.3 Implementation

MBEToolbox is written in the Matlab language and has been tested on the Windows platform with Matlab version 6.1.0. The main functions implemented are: sequence manipulation, computation of evolutionary distances derived from nucleotide-, amino acid- or codon-based substitution models, phylogenetic tree construction, sequence statistics and graphics functions to visualize the results of analyses. Although it implements only a small fraction of the multiplicity of existing methods used in molecular evolutionary analyses, interested users can easily extend the toolbox.

10.3.1 Input data and formats

MBEToolbox requires a single ASCII file containing the nucleotide or amino acid sequence alignment in either Phylip [86], ClustalW [312] or Fasta format. The toolbox provides a built-in ClustalW [312] interface for cases where an unaligned sequence file is provided. Protein-coding DNA sequences can be automatically aligned based on the corresponding protein alignment with the command alignseqfile.

After input, in common with the MathWorks bioinformatics toolbox, MBEToolbox represents the alignment as a numeric matrix with every element standing for a nucleic or amino acid character. Nucleotides A, C, G and T are converted to integers 1 to 4, and the 20 amino acids are converted to integers 1 to 20. A header, containing information about the names and type of the sequences as well as the relevant genetic code for protein-coding nucleotides, is attached to the alignment matrix to form a Matlab structure. An example alignment structure, aln, in Matlab code follows:

aln =
    seqtype: 2
    geneticcode: 1
    seqnames: {1xn cell}
    seq: [nxm double]

where n is the number of sequences and m is the length of the aligned sequences. The type of sequence is denoted by 1, 2 or 3 for sequences of non-coding nucleotides, protein coding nucleotides and amino acids, respectively.
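As an illustration of this encoding, the following minimal sketch builds a structure with the same four fields by hand, using only base Matlab. It is not the toolbox's own file reader: the toy sequences and names are made up, and the reading of geneticcode = 1 as the standard genetic code is an assumption.

```matlab
% Sketch only: encode a toy alignment in the numeric form described above.
% This mimics the layout of aln; it is not MBEToolbox's own parser.
names = {'seq1', 'seq2', 'seq3'};            % hypothetical sequence names
chars = ['ATGGCC'; 'ATGGCT'; 'ATGACT'];      % n-by-m character matrix

% Map A, C, G, T to 1, 2, 3, 4 (ismember returns 0 for anything else).
[tf, codes] = ismember(chars, 'ACGT');
assert(all(tf(:)), 'only unambiguous A, C, G, T are handled in this sketch');

aln.seqtype     = 2;       % 2 = protein-coding nucleotides
aln.geneticcode = 1;       % assumed here to mean the standard genetic code
aln.seqnames    = names;   % 1-by-n cell array
aln.seq         = codes;   % n-by-m double matrix

disp(aln)
```

Holding the alignment as a plain numeric matrix is what makes the indexing and counting operations of the next section one-liners.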

10.3.2 Sequence Manipulation and Statistics

The alignment structure, aln, can be manipulated using the Matlab language. For example, aln.seq(x,:) will extract the xth sequence from the alignment, while aln.seq(:,[i:j]) will extract columns i to j from the alignment. Users may easily extract more specific positions by using functions developed in the toolbox, such as extractpos(aln,3) or extractdegeneratesites to obtain the third codon positions or fourfold degenerate sites, respectively. For each sequence, some basic statistics, such as the nucleotide composition (ntcomposition) and GC content, can be reported. Other functions include the calculation of the relative synonymous codon usage (RSCU) and the codon adaptation index (CAI), counts of segregating sites, taking the reverse complement or translating a sequence, and determining the sequence complexity.
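The indexing operations above can be tried on a toy alignment matrix. The sketch below uses plain Matlab indexing only and does not call the toolbox; the third codon positions and GC content are computed directly rather than through extractpos or ntcomposition, and the data are made up.

```matlab
% Toy alignment in the numeric encoding (A=1, C=2, G=3, T=4).
aln.seq = [1 4 3 3 2 2; ...
           1 4 3 3 2 4; ...
           1 4 3 1 2 4];

x = 2;
seq_x = aln.seq(x, :);          % the x-th sequence
cols  = aln.seq(:, 2:4);        % columns 2 to 4 of the alignment
third = aln.seq(:, 3:3:end);    % third codon positions, by plain indexing

% GC content of each sequence: fraction of sites coded 2 (C) or 3 (G).
gc = mean(aln.seq == 2 | aln.seq == 3, 2);
```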

10.3.3 Evolutionary Distances

The evolutionary distance is one of the important measures in molecular evolutionary studies. It is required to measure the diversity among sequences and to infer distance-based phylogenies. MBEToolbox contains a number of functions to calculate evolutionary distances based on the observed number of differences. The formulae used in these functions are analytical solutions of a variety of Markov substitution models, such as JC69 [153], K2P [165], F84 [86], HKY [126] (see [229] for detail). Given the stationarity condition, the most general form of Markov substitution models is the General Time Reversible (GTR or REV) model [185, 309, 266, 358]. There is no analytical formula to calculate the GTR distance directly. A general method, described by Rodriguez et al. [266], has been implemented here. In this method a matrix F is formed, where $F_{ij}$ denotes the proportion of sites for which sequence 1 (s1) has an i and sequence 2 (s2) has a j. The GTR distance between s1 and s2 is then given by

$$\hat{d} = -\mathrm{tr}\left(\Pi \log\left(\Pi^{-1} F\right)\right)$$

where Π denotes the diagonal matrix with values of nucleotide equilibrium frequencies on the diagonal, and tr(A) denotes the trace of matrix A. The above formula can be expressed in Matlab syntax directly as:

>> d=-trace(PI*logm(inv(PI)*F))
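A small worked example may make the formula concrete. The sketch below is not the toolbox's implementation: it assumes that F is symmetrised by averaging the two directions of change and that the equilibrium frequencies are taken from the row sums of F, and it uses a made-up pair of sequences differing by four transitions in 40 sites.

```matlab
% Toy pair of aligned sequences in the numeric encoding (A=1, C=2, G=3, T=4).
s1 = repmat([1 2 3 4], 1, 10);          % 40 sites
s2 = s1;
s2([1 6 11 16]) = [3 4 1 2];            % A->G, C->T, G->A, T->C

% Count site patterns, then symmetrise and normalise to proportions.
N  = accumarray([s1(:) s2(:)], 1, [4 4]);
F  = (N + N') / (2 * numel(s1));

% Equilibrium frequencies estimated from the data (an assumption of this sketch).
PI = diag(sum(F, 2));

d = -trace(PI * logm(inv(PI) * F));     % GTR distance, as in the formula above
p = mean(s1 ~= s2);                     % raw proportion of differing sites
```

For these data the corrected distance d comes out slightly larger than the raw proportion p, as expected for a distance that allows for multiple substitutions at the same site.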

MBEToolbox also calculates the gamma distribution distance and the LogDet distance [295] (i.e., Lake’s paralinear distance [184]).

For alignments of codons, the toolbox provides calculation or estimation of the synonymous (Ks) and non-synonymous (Ka) substitution rates by the counting method of Nei and Gojobori [228], the degenerate methods of Li, Wu and Luo [198] and the method of Li or Pamilo and Bianchi [197, 242], as well as the maximum likelihood method through PAML [360]. All these methods for calculating Ks and Ka require that the input sequences are aligned in the appropriate reading frame, which can be performed by the function alignseqfile. Unresolved codon sites will be removed automatically. In addition, several quantities, including the number of substitutions per site at only synonymous sites, at only non-synonymous sites, at only four-fold-degenerate sites, or at only zero-fold-degenerate sites can be calculated. The outputs of these calculations are distance matrices which can be exported into text or Excel files, or used directly in further operations.

10.3.4 Phylogeny Inference

Two distance-based tree creation algorithms, Unweighted Pair Group Method with Arithmetic mean (UPGMA) and neighbour-joining (NJ) [273], are provided, and trees from these methods can be displayed or exported. Maximum parsimony and maximum likelihood algorithms can be applied to nucleotide or amino acid alignments through an interface to the Phylip package [86]. As properly implemented maximum likelihood methods are the best vehicles for statistical inference of evolutionary relationships among species from sequence data, several maximum likelihood functions have been explicitly implemented in MBEToolbox. These functions allow users to incorporate various evolutionary models, estimate parameters and compare different evolutionary trees.

The simplest case of estimation of the evolutionary distance between two sequences, s1 and s2, can be considered as the estimation of the branch length (the number of substitutions along a branch) separating ancestor and descendant nodes. Branch lengths, relative to a calibrated molecular clock, can reveal the time interval for this separation. A continuous time Markov process is generally used to model evolution along the branch from s1 to s2. A transition rate matrix, Q, is used to indicate the rate of changing from one state to another. For a specified time interval or distance, t, the transition probability matrix is calculated from $P(t) = e^{Qt}$. If there are N sites, the full likelihood is

$$L = \prod_{i=1}^{N} \pi_{s_i^1} P(s_i^1 \to s_i^2, t)$$

In this equation, $s_i^1$ and $s_i^2$ are the ith bases of sequences 1 and 2, respectively; $\pi_{s_i^1}$ is the expected frequency of base $s_i^1$.

In MBEToolbox, to calculate the likelihood, L, at a given time interval (or distance) t, we have to specify a substitution model by using an appropriate model defining function, such as modeljc, modelk2p or modelgtr for non-coding nucleotides, modeljtt or modeldayhoff for amino acids, or modelgy94 for codons. These functions return a model structure composed of an instantaneous rate matrix, R, and an equilibrium frequency vector, pi, which together give Q (Q=R*diag(pi)). Once the model is specified, the function likelidist(t,model,s1,s2) can calculate the log likelihood of the alignment of the two sequences, s1 and s2, with respect to the time or distance, t, under the substitution model, model.
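The likelihood calculation described here can be sketched in a few lines of base Matlab. The code below is an illustration under stated assumptions, not the body of likelidist: it builds a K2P rate matrix by hand with an arbitrary kappa of 2, completes the diagonal so that rows sum to zero, rescales Q so that t is measured in substitutions per site, and evaluates the log likelihood of a made-up pair of sequences at three values of t. The name freq is used for the equilibrium frequencies to avoid shadowing Matlab's built-in pi.

```matlab
% Toy pair of aligned sequences in the numeric encoding (A=1, C=2, G=3, T=4).
s1 = repmat([1 2 3 4], 1, 25);               % 100 sites
s2 = s1;
s2(1:12) = [3 4 1 2 3 4 1 2 2 1 4 3];        % 8 transitions, 4 transversions

kappa = 2;                                   % assumed transition/transversion bias
freq  = [0.25 0.25 0.25 0.25];               % equal frequencies under K2P
R = [0 1 kappa 1; 1 0 1 kappa; kappa 1 0 1; 1 kappa 1 0];

Q = R * diag(freq);                          % off-diagonal rates, Q = R*diag(pi)
Q = Q - diag(sum(Q, 2));                     % set the diagonal so rows sum to zero
Q = Q / (-freq * diag(Q));                   % one expected substitution per unit t

% Log likelihood of the pair as a function of the distance t.
idx    = sub2ind([4 4], s1, s2);             % positions of (s1_i, s2_i) in P(t)
pick   = @(P) P(idx);
loglik = @(t) sum(log(freq(s1) .* pick(expm(Q * t))));

t   = [0.01 0.13 0.5];
lnL = arrayfun(loglik, t);                   % log likelihood at each t
```

With these data the log likelihood is higher at the intermediate distance than at either extreme, which is the single-peaked behaviour that the optimisation step relies on.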

In most cases we wish to estimate t instead of calculating L as a function of t, so the function optimlikelidist(model,s1,s2) will search for the t that maximises the likelihood by using the Nelder-Mead simplex (direct search) method, while holding the other parameters in the model at fixed values. This constraint can be relaxed by allowing every parameter in the model to be estimated by functions, such as optimlikelidistk2p, that can estimate both t and the model’s parameters. Figure 10.2(a and b) illustrates the estimation of the evolutionary distance between two ribonuclease genes through the fixed- and free-parameter K2P models, respectively. When the K2P model’s parameter, kappa, is fixed, the result and trace of the optimisation process is illustrated by the graph of L and t (Fig. 10.2a). When kappa is a free parameter, a surface shows the result and trace of the optimisation process (Fig. 10.2b).
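The two optimisation modes can be imitated with Matlab's fminsearch, which implements the Nelder-Mead simplex method. The sketch below is an assumption-laden illustration rather than the code of optimlikelidist or optimlikelidistk2p: the data are made up, the parameters are optimised on a log scale so that t and kappa stay positive, and the fixed-parameter case uses kappa = 1, for which this parametrisation of K2P reduces to JC69.

```matlab
% Toy pair of aligned sequences in the numeric encoding (A=1, C=2, G=3, T=4).
s1 = repmat([1 2 3 4], 1, 25);               % 100 sites
s2 = s1;
s2(1:12) = [3 4 1 2 3 4 1 2 2 1 4 3];        % 8 transitions, 4 transversions
idx  = sub2ind([4 4], s1, s2);
pick = @(P) P(idx);

% K2P rate matrix for a given kappa, scaled to one substitution per unit t.
k2pQ = @(k) ([0 1 k 1; 1 0 1 k; k 1 0 1; 1 k 1 0] - (k + 2) * eye(4)) / (k + 2);

% Negative log likelihood (fminsearch minimises).
negloglik = @(t, k) -sum(log(0.25 * pick(expm(k2pQ(k) * t))));

% (a) kappa held fixed at 1, only t is estimated.
t_fixed = exp(fminsearch(@(x) negloglik(exp(x), 1), log(0.1)));

% (b) both t and kappa are estimated.
x_free     = fminsearch(@(x) negloglik(exp(x(1)), exp(x(2))), log([0.1 2]));
t_free     = exp(x_free(1));
kappa_free = exp(x_free(2));
```

For pairwise data the numerical estimates can be checked against the analytical JC69 and K2P distance formulae, which give the same maxima; this makes a convenient sanity test for any hand-written likelihood function.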

Figure 10.2: Log-likelihood of evolutionary distance. (a) Likelihood as a function of K2P distance. Distance is estimated by maximising the likelihood of the alignment when the bias of transition and transversion, kappa, is fixed. (b) Likelihood as a function of distance and kappa. Both distance and kappa are numerically optimised simultaneously to give maximum likelihood. The maximum likelihood peaks are marked with *. The two sequences used are coding regions of two mammalian ribonuclease genes, enc, of 474 bp.

When calculating the likelihood of a phylogenetic tree, where s1 and s2 are two (descendant) nodes in a tree joined to an internal (ancestor) node, sa, we must sum over all possible assignments of nucleotides to sa to get the likelihood of the distance between s1 and s2. Consequently, the number of possible combinations of nucleotides becomes too large to be enumerated for even moderately sized trees. The pruning algorithm introduced by Felsenstein [85] takes advantage of the tree topology to evaluate the summation in a computationally efficient (but mathematically equivalent) manner. This and a simple and elegant mapping from a ‘parentheses’ encoding of a tree to the matrix equation for calculating the likelihood of a tree, developed in the Matlab software, PhylLab [271], have been adopted in likelitree.
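The pruning idea can be illustrated on the smallest non-trivial case, the rooted tree ((s1,s2),s3). The sketch below is not the code of likelitree: it assumes the JC69 model, made-up branch lengths and toy sequences, and it hard-codes the tree shape instead of parsing a parentheses string. Each node carries a 4-by-N matrix of conditional likelihoods, so the sum over ancestral states becomes a matrix product.

```matlab
% Toy aligned sequences at the three tips (A=1, C=2, G=3, T=4).
s1 = [1 2 3 4 1 1 2 3];
s2 = [1 2 3 4 3 1 2 4];
s3 = [1 2 1 4 3 2 2 4];

% JC69 rate matrix scaled to one substitution per unit of branch length.
Q    = (ones(4) - 4 * eye(4)) / 3;
freq = [0.25 0.25 0.25 0.25];
P    = @(t) expm(Q * t);

% Assumed branch lengths: tips s1 and s2 to sa, sa to root, tip s3 to root.
P1 = P(0.1);  P2 = P(0.2);  Pa = P(0.05);  P3 = P(0.3);

% Conditional likelihoods at the tips: one indicator column per site.
E  = eye(4);
L1 = E(:, s1);  L2 = E(:, s2);  L3 = E(:, s3);

% Pruning: combine the children of each internal node, working towards the root.
La = (P1 * L1) .* (P2 * L2);        % internal node sa
Lr = (Pa * La) .* (P3 * L3);        % root

siteL = freq * Lr;                  % likelihood of each site
logL  = sum(log(siteL));            % log likelihood of the tree
```

Each product such as P1 * L1 performs the sum over the four possible states of a child node for all sites at once, which is the step that avoids enumerating every combination of ancestral nucleotides.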

10.3.5 Combination of functions

Basic operations can be combined to give more complicated functions. A simple combination of the function to extract the fourfold degenerate sites with the function to calculate GC content produces a new function (countgc4) that determines the GC content at 4-fold degenerate sites (GC4). A subfunction for calculating synonymous and nonsynonymous differences between two codons, getsynnonsyndiff, can be converted into a program for calculating codon volatility [251] with trivial effort. Similarly, karlinsig, which returns Karlin’s genomic signature (the dinucleotide relative abundance or bias) for a given sequence, can be easily re-formulated to estimate relative di-codon frequencies, which may be a new index of biological signals in a coding sequence. In addition, the menu-driven user interface, MBEGUI, is also a good example illustrating the power of combination of basic MBEToolbox functions.
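To show how little code such a statistic needs once a sequence is held as a numeric vector, the sketch below computes a dinucleotide relative abundance, that is, the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. It is a simplified, single-strand illustration on a made-up sequence and is not the code of karlinsig; treating the signature without pooling the reverse complement strand is an assumption of this sketch.

```matlab
% Toy sequence in the numeric encoding (A=1, C=2, G=3, T=4).
s = [1 2 3 4 1 2 3 4 1 1 2 2 3 3 4 4];

% Mononucleotide frequencies (4-by-1).
f1 = accumarray(s(:), 1, [4 1]) / numel(s);

% Dinucleotide frequencies from overlapping neighbours (4-by-4).
pairs = [s(1:end-1)' s(2:end)'];
f2 = accumarray(pairs, 1, [4 4]) / (numel(s) - 1);

% Relative abundance: rho(x,y) = f(xy) / (f(x) * f(y)).
rho = f2 ./ (f1 * f1');
```

Replacing the single-nucleotide codes by codon indices in the same two accumarray calls would give relative di-codon frequencies, which is the kind of re-formulation described above.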

10.3.6 Graphics and GUI

Good visualisation is essential for successful numerical model building. Leveraging the rich graphics functionality of Matlab, MBEToolbox provides a number of functions that can be used to create graphic output, such as scatterplots of Ks vs Ka, plots of the number of transitions and transversions against genetic distance, sliding window analyses on a nucleotide sequence and the Z-curve (a 3-dimensional curve representation of a DNA sequence [372]). A simple menu-driven graphical user interface (GUI) has been developed by using GUIDE (Graphical User Interface Development Environment) in Matlab. The top menu includes File, Sequences, Distances, Phylogeny, Graph, Polymorphism and Help submenus (Fig. 10.3). It aids the usage of the most frequently required functions so that users do not have to run any scripts or functions from the Matlab command line in most cases.

10.4 Results and Discussion

Only a few Matlab toolboxes or functions are freely available for data analysis, exploration, and visualisation of nucleotide and protein sequences. MBEToolbox, presented here, fulfils the most obvious needs in sequence manipulation, genetic distance estimation and phylogeny inference under the Matlab environment. Moreover, it is an extensible functional framework to formulate and solve problems in evolutionary data analysis; it facilitates the rapid construction of both general applications and special-purpose tools for computational biologists in a fraction of the time it would take to write a program in a scalar noninteractive language such as C or FORTRAN.

10.4.1 Vectorisation simplifies programming

Matlab is a matrix language, which means it is designed for vector and matrix operations. Programming can be simplified and made more efficient by using algorithms that take advantage of vectorisation (converting for and while loops to the equivalent vector or matrix operations). The Matlab compiler in version 7.0 will automatically recognise and vectorise loops without recursion.

Figure 10.3: MBEToolbox GUI. (a) Distances submenu; (b) Phylogeny submenu; and (c) Graph submenu.

An example of vectorisation is the calculation of Z-scores [246] for Smith-Waterman alignments [291], which gives a measure of the significance of an alignment score against a background of scores from randomly generated sequences with the same composition and length. Z-scores are designed to overcome the bias due to the composition of the alignment and are usually calculated by comparing an actual alignment score with the scores obtained on a set of random sequences generated by a Monte-Carlo process. The Z-score is defined as:

$$Z(A, B) = \frac{S(A, B) - \text{mean}}{\text{standard deviation}}$$

where S(A, B) is the Smith-Waterman (S-W) score between two sequences A and B. The mean and standard deviation are taken from realignments of the permuted sequences. The algorithm is implemented as follows in Matlab with as few as 15 lines of code:

function [z,z_raw]=zscores(s1,s2,nboot)

m1=length(s1);
m2=length(s2);

% Initialise two vectors holding the S-W scores of
% s1_rep and s2_rep, i.e., replicate samples
% of sequences s1 and s2.
v_z1=zeros(1,nboot);
v_z2=zeros(1,nboot);

% z_raw is the S-W score of the original pair.
z_raw=smithwaterman(s1,s2);
for (k=1:nboot),
    s1_rep=s1(:,randperm(m1));
    v_z1(1,k)=smithwaterman(s1_rep, s2);
    s2_rep=s2(:,randperm(m2));
    v_z2(1,k)=smithwaterman(s1, s2_rep);
end

z1=(z_raw-mean(v_z1))./std(v_z1);
z2=(z_raw-mean(v_z2))./std(v_z2);
z=min(z1,z2);

where randperm(n) is a vector function returning a random permutation of the integers from 1 to n and smithwaterman performs local alignment by the standard dynamic programming technique.

10.4.2 Extensibility

An important distinction between compiled languages with subroutine libraries and interactive environments like Matlab is the ease with which problems can be specified and solved in the latter. Moreover, Matlab toolboxes are traditionally organised in a less object-oriented mode and, consequently, functions are more independent of each other and easier to combine and extend. Several examples were given in the Implementation section.

10.4.3 Comparison with other toolboxes

Some other toolboxes have been developed in Matlab for bioinformatics related analyses. These include PhylLab [271] and MatArray [327] as well as the bioinformatics toolbox developed by MathWorks. Other examples can be found at the link and file exchange maintained at Matlab Central [42]. PhylLab is a molecular phylogeny toolbox which also provides some functions for sequence and tree input and manipulation. Its main focus is on creating a maximum likelihood tree based on Bayesian principles using a Markov chain Monte Carlo method to compute posterior parameter distributions. MatArray is focussed on the analysis of gene expression data from microarrays and provides normalisation and clustering functions but does not address molecular evolution. The bioinformatics toolbox from MathWorks provides a range of bioinformatics functions, including some related to molecular evolution.

MBEToolbox provides a much broader range of molecular evolution related functions and phylogenetic methods than either the more specialised PhylLab project or the more general bioinformatics toolbox from MathWorks. These extra functions include IO in Phylip format, statistical and sequence manipulation functions relevant to molecular evolution (e.g. count segregating sites), evolutionary distance calculation for nucleic and amino acid sequences, phylogeny inference functions and graphic plots relevant to molecular evolution (e.g. Ka vs Ks). As such it makes an important contribution to the bioinformatics analyses that can be performed in the Matlab environment.

10.4.4 A novel enhanced window analysis

To test for the selective pressures in the different lineages of a phylogenetic tree, the nonsynonymous to synonymous rate ratio (Ka/Ks) is normally estimated [281, 4, 61]. Values of Ka/Ks = 1, > 1, or < 1 indicate neutrality, positive selection, or purifying selection, respectively. However, Ks and Ka are measurements of average synonymous and nonsynonymous substitutions per site along the whole length of the sequences.

Average Ks and Ka values give neither the pattern of intragenic fluctuation of selective constraints, nor region- or site-specific information. A sliding window method is usually adopted to examine the intragenic pattern of the substitution rates and to test for the occurrence of significant clusters of variant regions [55, 145, 80, 53]. Significant heterogeneity in Ks would indicate that the neutral substitution rate varies across the gene, whereas heterogeneity in Ka may indicate that selective constraints vary along the gene. The results and accuracy of sliding window methods, either overlapping or non-overlapping, depend on both the size of the window and the moving distance adopted. Large window lengths may obliterate the details of patterns in Ks or Ka, whereas small window lengths usually result in larger statistical fluctuations. Hence, the resolution of a sliding window is usually limited.

Figure 10.4: Comparison between sliding window and enhanced sliding window methods. Sliding window analysis of Ks and Ka for the concatenated coding regions of two hepatitis C virus strains, HCV-JS and HCV-JT. The numbers of codons for the C, E1, E2, NS2, NS3, NS4, NS5A, and NS5B genes are 191, 192, 426, 217, 631, 315, 447, and 591, respectively. The different coding regions are separated by vertical lines. (a) illustrates the result of a normal sliding window analysis (substitution number per site against codon site); (b) illustrates the result of the enhanced sliding window analysis (transformed substitution number per site against codon site). Beginnings and ends of regions poor in synonymous substitutions (slope < 0) are indicated by the arrows a and b (genes C and E1) and e and f (gene NS5B). A region rich in synonymous substitutions (slope > 0) in gene NS3 is indicated by arrows c and d.

A mathematical formalism, similar to the Z’-curve [368], is introduced here to solve this problem. Consider a subsequence based analysis of Ks or Ka. In the n-th step, count the cumulative numbers of Ks or Ka occurring from the first to the n-th nucleotide position in the gene sequences being inspected. Let K denote either Ks or Ka and $K^{(n)}$ denote the cumulative K at the n-th sequence position. $K^{(n)}$ is usually an approximately monotonically increasing linear function of n. The points $(K^{(n)}, n)$, $n = 1, 2, \cdots, N$, are fitted by the least squares method to a linear function, $f(K^{(n)}) = \beta n$, to give a straight line with β being its slope. We define

$$K'^{(n)} = K^{(n)} - \beta n$$

The two-dimensional curve of $(K'^{(n)} \sim n)$ gives an alternative representation of the normal sliding window curve.
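The transformation can be sketched in base Matlab as follows. This is an illustration under stated assumptions and not the code of plotSlidingKaKs: the per-position substitution counts k are made up (a substitution-poor region followed by a substitution-rich one), and the line is fitted through the origin by ordinary least squares, matching the form βn used above.

```matlab
% Hypothetical per-position substitution counts: none in the first half,
% one per position in the second half.
k = [zeros(1, 50), ones(1, 50)];
n = 1:numel(k);

K    = cumsum(k);              % cumulative count K^(n)
beta = n(:) \ K(:);            % least-squares slope of a line through the origin
Kp   = K - beta * n;           % detrended curve K'^(n)

% A falling curve marks a region poorer in substitutions than average,
% a rising curve a region richer than average.
slope_first  = diff(Kp(1:50));
slope_second = diff(Kp(51:100));
```

Because the trend βn is removed, local departures from the average rate appear as changes in the direction of the curve, without the need to choose a window size.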

To compare these two curve representations, the example dataset of Suzuki and Gojobori [303], which contains the coding regions of two hepatitis C virus strains (HCV-JS - Genbank Acc.: D85516 and HCV-JT - Genbank Acc.: D11168), was used. The entire coding sequence is divided into eight regions (C, E1, E2, NS2, NS3, NS4, NS5A, NS5B). Some of the coding regions have been combined as these short ORFs are unlikely to yield meaningful Ks and Ka values. The reduction of Ks in the C, E1 and NS5B regions, as well as its elevation in NS3, which have been shown in previous studies [303], are not clear in a standard sliding window representation (Fig. 10.4a). In contrast, a sharp increase in the $(K'^{(n)} \sim n)$ curve (Fig. 10.4b) indicates an increase in K, while a drop in the curve indicates a decrease in K. This new method has been implemented in the function plotSlidingKaKs. Since it is derived from the sliding window method, it is called the enhanced sliding window method.

10.4.5 Limitations

The current version of this toolbox lacks novel algorithms, although it implements a variety of existing ones. There are some limitations in the practical use of MBEToolbox. First, though the toolbox provides many methods for sequence and evolutionary analyses, the full range of these features can only be accessed through the Matlab command line interface, as in the majority of Matlab packages. Second, some of the functions cannot handle ambiguous nucleotide or amino acid codes in the sequences. The future development of MBEToolbox will overcome these present limitations.

In summary, the MBEToolbox project is an ongoing effort to provide an easy-to-use and yet powerful analysis environment for molecular biology and evolution. Currently, it offers a solid set of frequently used functions to manipulate sequences, calculate genetic distances, infer phylogenetic trees and perform related analyses. MBEToolbox is a useful tool that should inspire evolutionary biologists to take advantage of Matlab. Moreover, it has been widely applied in data analysis in the Penicillium marneffei genome project, as mentioned on pages 73, 113, 146, 161 and 190.

Chapter 11

CONCLUDING REMARKS

In this last chapter I summarise the conclusions of the preceding chapters and give recommendations for future research. Chapter 1 presented the draft genome of the important thermally dimorphic fungus Penicillium marneffei. A number of features of the pathogenic fungus have been uncovered.

The similarity between the mitochondrial genome of P. marneffei and those of nonpathogenic Aspergillus species (Chapter 3) suggests that P. marneffei is more closely related to moulds than to yeasts, which is consistent with the established classification. No direct association between mitochondrially encoded genetic components and pathogenicity could be observed. Moreover, the in silico evidence for the capability of melanin biosynthesis in P. marneffei (Chapter 4) will inspire further research towards the experimental elucidation of melanin’s role in fungal virulence. Based on this computational finding, gene knockout and in vivo animal survival analyses are being undertaken in our department. The possible presence of a sexual cycle in P. marneffei reported in Chapter 5 is highly significant as it affects genetic study of the fungus, since the sexual cycle could be a useful genetic tool allowing us to study the way in which the fungus causes disease. On the other hand, if the fungus does reproduce sexually as part of its life cycle, it might evolve more rapidly to become resistant to anti-fungal drugs, because sex might create new strains with increased ability to cause disease and infect humans. Chapter 6 explored our current knowledge of the genetic components related to fungal morphogenesis, with emphasis on the molecular mechanism of dimorphic switching. More research is nevertheless required to accomplish this task, which is still far from achieved, in the following directions: (i) perception of external stimuli by cellular sensors; (ii) transduction of the biochemical signal; (iii) alteration of genomic expression; and (iv) structural reorganization towards the morphological change. The presence of over-abundant intragenic tandem repeats (IntraTRs) in the P. marneffei genome is a striking finding (Chapter 7). The IntraTRs may create quantitative alterations in phenotypes (e.g., adhesion, flocculation or biofilm formation). The variation resulting from the quantitative alterations of the fungal cell surface may have allowed the fungus to ‘disguise’ itself in order to slip past the host immune system’s vigilant defences. Many P. marneffei proteins contain tandemly repeated domains or motifs with some degree of homology to the Plasmodium erythrocyte-binding protein domain.

The area of gene and genome duplication and its evolutionary significance has attracted significant attention from researchers in recent years. Chapter 8 represents a novel contribution to the field by presenting a description of gene duplication in five ascomycetes. We have calculated the rates of synonymous and non-synonymous substitution using the codon substitution model and reported large variation in the proportion of genes in multigene families across these fungi. We also suggest that paralogs of filamentous fungi are under less selective constraint than orthologs (but that this does not hold for yeasts), that there is a lack of evidence for an association between asymmetry in rates of evolution and positive selection, and finally that different extents and consequences of gene duplication may explain some of the phenotypic variation of the ascomycetes. One new conclusion, that P. marneffei may have undergone a whole-genome duplication, is not solidly supported by the evidence presented so far; analysis of gene order information will be necessary to support the claim once the P. marneffei genome sequence approaches completion. Moreover, at the time when the analysis was performed, the Aspergillus genomes remained unpublished; the underlying data might change, and results from a premature analysis might be hard to reproduce or become obsolete. Therefore, no Aspergillus genome was included in the comparison; further analysis of this sort should overcome this limitation.

In addition, in Chapter 9 we analysed genes with various degrees of conservation among species, as measured by the lineage-specificity of genes (LS). We examined the correlations between evolutionary rate and LS, as well as several other related factors, such as expression, essentiality, and protein-protein interactions. We found that in seven ascomycete genomes, the more lineage specific a gene, the higher its evolutionary rate. This is taken as evidence for the hypothesis that orphan genes arise as a result of a higher rate of evolution. This general rule also applies to explaining the origin of P. marneffei-specific genes.

Finally, two software products, the P. marneffei genome database and MBEToolbox for sequence data analysis, have been developed (Chapters 2 and 10). Together they cover two major aspects of bioinformatics, i.e., biological database management systems and algorithm development. They have been successfully applied throughout the whole genome project, and proved to be efficient and sufficient.

In conclusion, the boom in fungal genome sequence data over the past few years came with high expectations for new insights into fungal biology and pathogen control strategies. In the case of P. marneffei, it became evident that computational approaches can be used in deciphering the genome so as to derive biological meaning or infer evolutionary processes. This work paves the way for a systematic experimental study of the pathogenic fungus.

BIBLIOGRAPHY

[1] N. Adames, K. Blundell, M. N. Ashby, and C. Boone. Role of yeast insulin- degrading enzyme homologs in propheromone processing and bud site selection. Science, 270(5235):464–7, 1995. [2] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Ama- natides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wort- man, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews- Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bol- shakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes, S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, et al. The genome sequence of drosophila melanogaster. Science, 287(5461):2185–95, 2000. [3] L. Ajello, A. A. Padhye, S. Sukroongreung, C. H. Nilakul, and S. Tantimavanic. Occurrence of penicillium marneffei infections among wild bamboo rats in thai- land. Mycopathologia, 131(1):1–8, 1995. [4] H. Akashi. Within- and between-species dna sequence variation and the ‘foot- print’ of natural selection. Gene, 238:39–51, 1999. [5] J. A. Alspaugh, L. M. Cavallo, J. R. Perfect, and J. Heitman. 
Ras1 regulates fila- mentation, mating and growth at high temperature of cryptococcus neoformans. Mol Microbiol, 36(2):352–65, 2000. [6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–402, 1997. [7] M. A. Andrade, N. P. Brown, C. Leroy, S. Hoersch, A. de Daruvar, C. Reich, A. Franchini, J. Tamames, A. Valencia, C. Ouzounis, and C. Sander. Automated genome sequence analysis and annotation. Bioinformatics, 15(5):391–412, 1999. [8] L. Aravind, H. Watanabe, D. J. Lipman, and E. V. Koonin. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci U S A, 97(21):11319–24, 2000. [9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel- Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1):25–9, 2000. [10] C. R. Astell, L. Ahlstrom-Jonasson, M. Smith, K. Tatchell, K. A. Nasmyth, and B. D. Hall. The sequence of the dnas coding for the mating-type loci of saccharomyces cerevisiae. Cell, 27(1 Pt 2):15–23, 1981. 235

[11] J. Baker, J. McCarthy, M. Gatton, D. E. Kyle, V. Belizario, J. Luchavez, D. Bell, and Q. Cheng. Genetic diversity of plasmodium falciparum histidine-rich protein 2 (pfhrp2) and its effect on the performance of pfhrp2-based rapid diagnostic tests. J Infect Dis, 192(5):870–7, 2005.
[12] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. Identification and distinct regulation of yeast tata box-containing genes. Cell, 116(5):699–709, 2004.
[13] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. The pfam protein families database. Nucleic Acids Res, 32(Database issue):D138–41, 2004.
[14] D. H. Beach and A. J. Klar. Rearrangements of the transposable mating-type cassettes of fission yeast. Embo J, 3(3):603–10, 1984.
[15] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.
[16] R. J. Bennett and S. C. West. Ruvc protein resolves holliday junctions via cleavage of the continuous (noncrossover) strands. Proc Natl Acad Sci U S A, 92(12):5635–9, 1995.
[17] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. J Mol Biol, 283(4):707–25, 1998.
[18] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. The abaa homologue of penicillium marneffei participates in two developmental programmes: conidiation and dimorphic growth. Mol Microbiol, 38(5):1034–47, 2000.
[19] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. An ste12 homolog from the asexual, dimorphic fungus penicillium marneffei complements the defect in sexual development of an aspergillus nidulans stea mutant. Genetics, 157(3):1003–14, 2001.
[20] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. A basic helix-loop-helix protein with similarity to the fungal morphological regulators, phd1p, efg1p and stua, controls conidiation but not dimorphic growth in penicillium marneffei. Mol Microbiol, 44(3):621–31, 2002.
[21] V. L. Boyartchuk, M. N. Ashby, and J. Rine. Modulation of ras and a-factor function by carboxyl-terminal proteolysis. Science, 275(5307):1796–800, 1997.
[22] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The cdc42 homolog of the dimorphic fungus penicillium marneffei is required for correct cell polarization during growth but not development. J Bacteriol, 183(11):3447–57, 2001.
[23] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The ras and rho gtpases genetically interact to co-ordinately regulate cell polarity during development in penicillium marneffei. Mol Microbiol, 55(5):1487–501, 2005.
[24] A. A. Brakhage, K. Langfelder, G. Wanner, A. Schmidt, and B. Jahn. Pigment biosynthesis and virulence. Contrib Microbiol, 2:205–15, 1999.
[25] B. J. Breitkreutz, C. Stark, and M. Tyers. The grid: the general repository for interaction datasets. Genome Biol, 4(3):R23, 2003.
[26] C. Brenner and R. S. Fuller. Structural and enzymatic characterization of a purified prohormone-processing enzyme: secreted, soluble kex2 protease. Proc Natl Acad Sci U S A, 89(3):922–6, 1992.
[27] J. Brosius and S. J. Gould. On “genomenclature”: a comprehensive (and respectful) taxonomy for pseudogenes and other “junk dna”. Proc Natl Acad Sci U S A, 89(22):10706–10, 1992.

[28] D. W. Brown, J. H. Yu, H. S. Kelkar, M. Fernandes, T. C. Nesbitt, N. P. Keller, T. H. Adams, and T. J. Leonard. Twenty-five coregulated transcripts define a sterigmatocystin gene cluster in aspergillus nidulans. Proc Natl Acad Sci U S A, 93(4):1418–22, 1996.
[29] T. A. Brown, R. B. Waring, C. Scazzocchio, and R. W. Davies. The aspergillus nidulans mitochondrial genome. Curr Genet, 9(2):113–7, 1985.
[30] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic dna. J Mol Biol, 268(1):78–94, 1997.
[31] M. Burset and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3):353–67, 1996.
[32] H. Bussey. Proteases and the processing of precursors to secreted proteins in yeast. Yeast, 4(1):17–26, 1988.
[33] H. J. Bussink and S. A. Osmani. A cyclin-dependent kinase family member (phoa) is required to link developmental fate to environmental conditions in aspergillus nidulans. Embo J, 17(14):3990–4003, 1998.
[34] E. T. Buurman, C. Westwater, B. Hube, A. J. Brown, F. C. Odds, and N. A. Gow. Molecular analysis of camnt1p, a mannosyl transferase important for adhesion and virulence of candida albicans. Proc Natl Acad Sci U S A, 95(13):7670–5, 1998.
[35] J. J. Cai, D. K. Smith, X. Xia, and K. Y. Yuen. Mbetoolbox: a matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6(1):64, 2005.
[36] R. Calderone. Molecular pathogenesis of fungal infections. Trends Microbiol, 2(12):461–3, 1994.
[37] L. Cao, C. M. Chan, C. Lee, S. S. Wong, and K. Y. Yuen. Mp1 encodes an abundant and highly antigenic cell wall mannoprotein in the pathogenic fungus penicillium marneffei. Infect Immun, 66(3):966–73, 1998.
[38] L. Cao, K. M. Chan, D. Chen, N. Vanittanakom, C. Lee, C. M. Chan, T. Sirisanthana, D. N. Tsang, and K. Y. Yuen. Detection of cell wall mannoprotein mp1p in culture supernatants of penicillium marneffei and in sera of penicilliosis patients. J Clin Microbiol, 37(4):981–6, 1999.
[39] L. Cao, D. L. Chen, C. Lee, C. M. Chan, K. M. Chan, N. Vanittanakom, D. N. Tsang, and K. Y. Yuen. Detection of specific antibodies to an antigenic mannoprotein for diagnosis of penicillium marneffei penicilliosis. J Clin Microbiol, 36(10):3028–31, 1998.
[40] T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell, and J. Parkhill. Act: the artemis comparison tool. Bioinformatics, 21(16):3422–3, 2005.
[41] L. L. Cavalli-Sforza and A. W. Edwards. Phylogenetic analysis. models and estimation procedures. Am J Hum Genet, 19(3):Suppl 19:233+, 1967.
[42] MATLAB Central. Matlab central, 2005.
[43] C. M. Chan, P. C. Woo, A. S. Leung, S. K. Lau, X. Y. Che, L. Cao, and K. Y. Yuen. Detection of antibodies specific to an antigenic cell wall galactomannoprotein for serodiagnosis of aspergillus fumigatus aspergillosis. J Clin Microbiol, 40(6):2041–5, 2002.
[44] Y. F. Chan and T. C. Chow. Ultrastructural observations on penicillium marneffei in natural human infection. Ultrastruct Pathol, 14(5):439–52, 1990.
[45] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, and K. E. Nelson. Seasonal variation of disseminated penicillium marneffei infections in northern thailand: a clue to the reservoir? J Infect Dis, 173(6):1490–3, 1996.

[46] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, J. Praparattanapan, and K. E. Nelson. Case-control study of risk factors for penicillium marneffei infection in human immunodeficiency virus-infected patients in northern thailand. Clin Infect Dis, 24(6):1080–6, 1997.
[47] S. Chariyalertsak, P. Vanittanakom, K. E. Nelson, T. Sirisanthana, and N. Vanittanakom. Rhizomys sumatrensis and cannomys badius, new natural animal hosts of penicillium marneffei. J Med Vet Mycol, 34(2):105–10, 1996.
[48] D. Charlesworth, B. Charlesworth, and G. A. McVean. Genome sequences and evolutionary biology, a two-way interaction. Trends Ecol Evol, 16(5):235–242, 2001.
[49] P. Chen, S. K. Sapperstein, J. D. Choi, and S. Michaelis. Biogenesis of the saccharomyces cerevisiae mating pheromone a-factor. J Cell Biol, 136(2):251–69, 1997.
[50] C. S. Chim, C. Y. Fong, S. K. Ma, S. S. Wong, and K. Y. Yuen. Reactive hemophagocytic syndrome associated with penicillium marneffei infection. Am J Med, 104(2):196–7, 1998.
[51] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1):65–73, 1998.
[52] C. Y. Choi, E. L. Schneider, J. M. Kim, I. Y. Gluzman, D. E. Goldberg, J. A. Ellman, and M. A. Marletta. Interference with heme binding to histidine-rich protein-2 as an antimalarial strategy. Chem Biol, 9(8):881–9, 2002.
[53] S. S. Choi and B. T. Lahn. Adaptive evolution of mrg, a neuron-specific gene family implicated in nociception. Genome Res, 13:2252–2259, 2003.
[54] P. Chongtrakool, S. C. Chaiyaroj, V. Vithayasai, S. Trawatcharegon, R. Teanpaisan, S. Kalnawakul, and S. Sirisinha. Immunoreactivity of a 38-kilodalton penicillium marneffei antigen with human immunodeficiency virus-positive sera. J Clin Microbiol, 35(9):2220–3, 1997.
[55] A. G. Clark and T. Kao. Excess nonsynonymous substitution at shared polymorphic sites among self-incompatibility alleles of solanaceae. Proc Natl Acad Sci U S A, 88:9823–9827, 1991.
[56] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. Finding functional features in saccharomyces genomes by phylogenetic footprinting. Science, 301(5629):71–6, 2003.
[57] L. Coin, A. Bateman, and R. Durbin. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci U S A, 100(8):4516–20, 2003.
[58] L. J. Collins, A. M. Poole, and D. Penny. Using ancestral sequences to uncover potential gene homologues. Appl Bioinformatics, 2(3 Suppl):S85–95, 2003.
[59] G. C. Conant and A. Wagner. Asymmetric sequence divergence of duplicate genes. Genome Res, 13(9):2052–8, 2003.
[60] A. Cooper and H. Bussey. Characterization of the yeast kex1 gene product: a carboxypeptidase involved in processing secreted precursor proteins. Mol Cell Biol, 9(6):2706–14, 1989.
[61] K. A. Crandall, C. R. Kelsey, H. Imamichi, H. C. Lane, and N. P. Salzman. Parallel evolution of drug resistance in hiv: failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol, 16:372–382, 1999.
[62] J. Davey, K. Davis, M. Hughes, G. Ladds, and D. Powner. The processing of yeast pheromones. Semin Cell Dev Biol, 9(1):19–30, 1998.

[63] F. De Bernardis, S. Arancia, L. Morelli, B. Hube, D. Sanglard, W. Schafer, and A. Cassone. Evidence that members of the secretory aspartyl proteinase gene family, in particular sap2, are virulence factors for candida vaginitis. J Infect Dis, 179(1):201–8, 1999.
[64] R. A. Dean, N. J. Talbot, D. J. Ebbole, M. L. Farman, T. K. Mitchell, M. J. Orbach, M. Thon, R. Kulkarni, J. R. Xu, H. Pan, N. D. Read, Y. H. Lee, I. Carbone, D. Brown, Y. Y. Oh, N. Donofrio, J. S. Jeong, D. M. Soanes, S. Djonovic, E. Kolomiets, C. Rehmeyer, W. Li, M. Harding, S. Kim, M. H. Lebrun, H. Bohnert, S. Coughlan, J. Butler, S. Calvo, L. J. Ma, R. Nicol, S. Purcell, C. Nusbaum, J. E. Galagan, and B. W. Birren. The genome sequence of the rice blast fungus magnaporthe grisea. Nature, 434(7036):980–6, 2005.
[65] C. d’Enfert, S. Goyard, S. Rodriguez-Arnaveilhe, L. Frangeul, L. Jones, F. Tekaia, O. Bader, A. Albrecht, L. Castillo, A. Dominguez, J. F. Ernst, C. Fradin, C. Gaillardin, S. Garcia-Sanchez, P. de Groot, B. Hube, F. M. Klis, S. Krishnamurthy, D. Kunze, M. C. Lopez, A. Mavor, N. Martin, I. Moszer, D. Onesime, J. Perez Martin, R. Sentandreu, E. Valentin, and A. J. Brown. Candidadb: a genome database for candida albicans pathogenomics. Nucleic Acids Res, 33(Database issue):D353–7, 2005.
[66] Z. L. Deng and D. H. Connor. Progressive disseminated penicilliosis caused by penicillium marneffei. report of eight cases and differentiation of the causative organism from histoplasma capsulatum. Am J Clin Pathol, 84(3):323–7, 1985.
[67] Z. L. Deng, M. Yun, and L. Ajello. Human penicilliosis marneffei and its relation to the bamboo rat (rhizomys pruinosus). J Med Vet Mycol, 24(5):383–9, 1986.
[68] E. T. Dermitzakis and A. G. Clark. Differential selection after duplication in mammalian developmental genes. Mol Biol Evol, 18(4):557–62, 2001.
[69] V. Desakorn, M. D. Smith, A. L. Walsh, A. J. Simpson, D. Sahassananda, A. Rajanuwong, V. Wuthiekanun, P. Howe, B. J. Angus, P. Suntharasamai, and N. J. White. Diagnosis of penicillium marneffei infection by quantitation of urinary antigen by using an enzyme immunoassay. J Clin Microbiol, 37(1):117–21, 1999.
[70] A. Dmochowska, D. Dignard, D. Henning, D. Y. Thomas, and H. Bussey. Yeast kex1 gene encodes a putative protease with a carboxypeptidase b-like function involved in killer toxin and alpha-factor precursor processing. Cell, 50(4):573–84, 1987.
[71] C. B. Do, M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–40, 2005.
[72] J. M. Dolence, L. E. Steward, E. K. Dolence, D. H. Wong, and C. D. Poulter. Studies with recombinant saccharomyces cerevisiae caax prenyl protease rce1p. Biochemistry, 39(14):4096–104, 2000.
[73] T. Domazet-Loso and D. Tautz. An evolutionary analysis of orphan genes in drosophila. Genome Res, 13(10):2213–9, 2003.
[74] R. F. Doolittle. The multiplicity of domains in proteins. Annu Rev Biochem, 64:287–314, 1995.
[75] J. Du, Y. Zhu, A. Shanmugam, and A. L. Kenter. Analysis of immunoglobulin sgamma3 recombination breakpoints by pcr: implications for the mechanism of isotype switching. Nucleic Acids Res, 25(15):3066–73, 1997.
[76] P. S. Dyer, M. Paoletti, and D. B. Archer. Genomics reveals sexual secrets of aspergillus. Microbiology, 149(Pt 9):2301–3, 2003.
[77] S. E. Eckert, B. Hoffmann, C. Wanke, and G. H. Braus. Sexual development of aspergillus nidulans in tryptophan auxotrophic strains. Arch Microbiol, 172(3):157–66, 1999.

[78] A. Edwards, H. A. Hammond, L. Jin, C. T. Caskey, and R. Chakraborty. Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12(2):241–53, 1992.
[79] C. elegans Sequencing Consortium. Genome sequence of the nematode c. elegans: a platform for investigating biology. Science, 282(5396):2012–8, 1998.
[80] T. Endo, K. Ikeo, and T. Gojobori. Large-scale search for genes on which positive selection may operate. Mol Biol Evol, 13:685–690, 1996.
[81] E. Eskin, W. N. Grundy, and Y. Singer. Protein family classification using sparse markov transducers. Proc Int Conf Intell Syst Mol Biol, 8:134–45, 2000.
[82] E. Espagne, P. Balhadere, M. L. Penin, C. Barreau, and B. Turcq. Het-e and het-d belong to a new subfamily of wd40 proteins involved in vegetative incompatibility specificity in the fungus podospora anserina. Genetics, 161(1):71–81, 2002.
[83] B. Ewing and P. Green. Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome Res, 8(3):186–94, 1998.
[84] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Base-calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res, 8(3):175–85, 1998.
[85] J. Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood approach. J Mol Evol, 17:368–376, 1981.
[86] J. Felsenstein. Phylip – phylogeny inference package (version 3.2). Cladistics, 5:164–166, 1989.
[87] Fungal Research Community FGI. Fungal genome initiative (http://www.broad.mit.edu/annotation/fungi/fgi/), 2002.
[88] M. C. Fisher, D. Aanensen, S. de Hoog, and N. Vanittanakom. Multilocus microsatellite typing system for penicillium marneffei reveals spatially structured populations. J Clin Microbiol, 42(11):5065–9, 2004.
[89] M. C. Fisher, W. P. Hanage, S. de Hoog, E. Johnson, M. D. Smith, N. J. White, and N. Vanittanakom. Low effective dispersal of asexual genotypes in heterogeneous landscapes by the endemic pathogen penicillium marneffei. PLoS Pathog, 1(2):e20, 2005.
[90] A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–45, 1999.
[91] F. Foury, T. Roganti, N. Lecrenier, and B. Purnelle. The complete sequence of the mitochondrial genome of saccharomyces cerevisiae. FEBS Lett, 440(3):325–31, 1998.
[92] C. M. Fraser and R. D. Fleischmann. Strategies for whole microbial genome sequencing and analysis. Electrophoresis, 18(8):1207–16, 1997.
[93] H. B. Fraser, D. P. Wall, and A. E. Hirsh. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol Biol, 3(1):11, 2003.
[94] J. A. Fraser and J. Heitman. Evolution of fungal sex chromosomes. Mol Microbiol, 51(2):299–306, 2004.
[95] R. Friedman and A. L. Hughes. Gene duplication and the structure of eukaryotic genomes. Genome Res, 11(3):373–81, 2001.
[96] D. Frishman, M. Mokrejs, D. Kosykh, G. Kastenmuller, G. Kolesov, I. Zubrzycki, C. Gruber, B. Geier, A. Kaps, K. Albermann, A. Volz, C. Wagner, M. Fellenberg, K. Heumann, and H. W. Mewes. The pedant genome database. Nucleic Acids Res, 31(1):207–11, 2003.

[97] M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res, 30(14):3214–24, 2002.
[98] Y. Fu, G. Rieg, W. A. Fonzi, P. H. Belanger, J. E. Edwards, Jr., and S. G. Filler. Expression of the candida albicans gene als1 in saccharomyces cerevisiae induces adherence to endothelial and epithelial cells. Infect Immun, 66(4):1783–6, 1998.
[99] K. Fujimura-Kamada, F. J. Nouvet, and S. Michaelis. A novel membrane-associated metalloprotease, ste24p, is required for the first step of nh2-terminal processing of the yeast a-factor precursor. J Cell Biol, 136(2):271–85, 1997.
[100] R. S. Fuller, A. Brake, and J. Thorner. Yeast prohormone processing enzyme (kex2 gene product) is a ca2+-dependent serine protease. Proc Natl Acad Sci U S A, 86(5):1434–8, 1989.
[101] J. E. Galagan, S. E. Calvo, K. A. Borkovich, E. U. Selker, N. D. Read, D. Jaffe, W. FitzHugh, L. J. Ma, S. Smirnov, S. Purcell, B. Rehman, T. Elkins, R. Engels, S. Wang, C. B. Nielsen, J. Butler, M. Endrizzi, D. Qui, P. Ianakiev, D. Bell-Pedersen, M. A. Nelson, M. Werner-Washburne, C. P. Selitrennikoff, J. A. Kinsey, E. L. Braun, A. Zelter, U. Schulte, G. O. Kothe, G. Jedd, W. Mewes, C. Staben, E. Marcotte, D. Greenberg, A. Roy, K. Foley, J. Naylor, N. Stange-Thomann, R. Barrett, S. Gnerre, M. Kamal, M. Kamvysselis, E. Mauceli, C. Bielke, S. Rudd, D. Frishman, S. Krystofova, C. Rasmussen, R. L. Metzenberg, D. D. Perkins, S. Kroken, C. Cogoni, G. Macino, D. Catcheside, W. Li, R. J. Pratt, S. A. Osmani, C. P. DeSouza, L. Glass, M. J. Orbach, J. A. Berglund, R. Voelker, O. Yarden, M. Plamann, S. Seiler, J. Dunlap, A. Radford, R. Aramayo, D. O. Natvig, L. A. Alex, G. Mannhaupt, D. J. Ebbole, M. Freitag, I. Paulsen, M. S. Sachs, E. S. Lander, C. Nusbaum, and B. Birren. The genome sequence of the filamentous fungus neurospora crassa. Nature, 422(6934):859–68, 2003.
[102] C. A. Gale, C. M. Bendel, M. McClellan, M. Hauser, J. M. Becker, J. Berman, and M. K. Hostetter. Linkage of adhesion, filamentous growth, and virulence in candida albicans to a single gene, int1. Science, 279(5355):1355–8, 1998.
[103] W. Gao, C. H. Khang, S. Y. Park, Y. H. Lee, and S. Kang. Evolution and organization of a highly dynamic, subtelomeric helicase gene family in the rice blast fungus magnaporthe grisea. Genetics, 162(1):103–12, 2002.
[104] R. G. Garrison and K. S. Boyd. Dimorphism of penicillium marneffei as observed by electron microscopy. Can J Microbiol, 19(10):1305–9, 1973.
[105] S. M. Gasser and M. M. Cockell. The molecular biology of the sir proteins. Gene, 279(1):1–16, 2001.
[106] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–7, 2002.
[107] R. F. Geever, L. Huiet, J. A. Baum, B. M. Tyler, V. B. Patel, B. J. Rutledge, M. E. Case, and N. H. Giles. Dna sequence, organization and regulation of the qa gene cluster of neurospora crassa. J Mol Biol, 207(1):15–34, 1989.
[108] M. S. Gelfand. Prediction of function in dna sequence analysis. J Comput Biol, 2(1):87–115, 1995.
[109] W. Gilbert, S. J. de Souza, and M. Long. Origin of genes. Proc Natl Acad Sci U S A, 94(15):7698–703, 1997.

[110] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes. Science, 274(5287):546, 563–7, 1996.
[111] N. Goldman and Z. Yang. A codon-based model of nucleotide substitution for protein-coding dna sequences. Mol Biol Evol, 11(5):725–36, 1994.
[112] D. Gordon, C. Abajian, and P. Green. Consed: a graphical tool for sequence finishing. Genome Res, 8(3):195–202, 1998.
[113] N. A. Gow. Candida albicans switches mates. Mol Cell, 10(2):217–8, 2002.
[114] N. A. Gow, A. J. Brown, and F. C. Odds. Fungal morphogenesis and host invasion. Curr Opin Microbiol, 5(4):366–71, 2002.
[115] D. Grant, P. Cregan, and R. C. Shoemaker. Genome organization in dicots: genome duplication in arabidopsis and synteny between soybean and arabidopsis. Proc Natl Acad Sci U S A, 97(8):4168–73, 2000.
[116] D. Graur. Amino acid composition and the evolutionary rates of protein-coding genes. J Mol Evol, 22(1):53–62, 1985.
[117] S. I. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene expression. Science, 301(5634):798–802, 2003.
[118] Z. Gu, A. Cavalcanti, F. C. Chen, P. Bouman, and W. H. Li. Extent of gene duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol, 19(3):256–62, 2002.
[119] Z. Gu, L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li. Role of duplicate genes in genetic robustness against null mutations. Nature, 421(6918):63–6, 2003.
[120] J. E. Haber. Mating-type gene switching in saccharomyces cerevisiae. Annu Rev Genet, 32:561–99, 1998.
[121] H. Hamada, M. Seidman, B. H. Howard, and C. M. Gorman. Enhanced gene expression by the poly(dt-dg).poly(dc-da) sequence. Mol Cell Biol, 4(12):2622–30, 1984.
[122] A. J. Hamilton, L. Jeavons, S. Youngchim, and N. Vanittanakom. Recognition of fibronectin by penicillium marneffei conidia via a sialic acid-dependent process and its relationship to the interaction between conidia and laminin. Infect Immun, 67(10):5200–5, 1999.
[123] A. J. Hamilton, L. Jeavons, S. Youngchim, N. Vanittanakom, and R. J. Hay. Sialic acid-dependent recognition of laminin by penicillium marneffei conidia. Infect Immun, 66(12):6024–6, 1998.
[124] K. H. Han, K. Y. Han, J. H. Yu, K. S. Chae, K. Y. Jahng, and D. M. Han. The nsdd gene encodes a putative gata-type transcription factor necessary for sexual development of aspergillus nidulans. Mol Microbiol, 41(2):299–309, 2001.
[125] K. H. Han, J. A. Seo, and J. H. Yu. A putative g protein-coupled receptor negatively controls sexual development in aspergillus nidulans. Mol Microbiol, 51(5):1333–45, 2004.
[126] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial dna. J Mol Evol, 22:160–174, 1985.
[127] K. E. Hastings. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin i gene family and other vertebrate gene families. J Mol Evol, 42(6):631–40, 1996.
[128] K. Haynes. Virulence in candida species. Trends Microbiol, 9(12):591–6, 2001.

[129] B. He, P. Chen, S. Y. Chen, K. L. Vancura, S. Michaelis, and S. Powers. Ram2, an essential gene of yeast, and ram1 encode the two polypeptide components of the farnesyltransferase that prenylates a-factor and ras proteins. Proc Natl Acad Sci U S A, 88(24):11373–7, 1991.
[130] D. S. Heckman, D. M. Geiser, B. R. Eidell, R. L. Stauffer, N. L. Kardos, and S. B. Hedges. Molecular evidence for the early colonization of land by fungi and plants. Science, 293(5532):1129–33, 2001.
[131] S. B. Hedges and S. Kumar. Genomic clocks and evolutionary timescales. Trends Genet, 19(4):200–6, 2003.
[132] I. Herskowitz. Fungal physiology. yeast branches out. Nature, 357(6375):190–1, 1992.
[133] L. H. Hogan, S. Josvai, and B. S. Klein. Genomic cloning, characterization, and functional analysis of the major surface adhesin wi-1 on blastomyces dermatitidis yeasts. J Biol Chem, 270(51):30725–32, 1995.
[134] P. R. Hsueh, L. J. Teng, C. C. Hung, J. H. Hsu, P. C. Yang, S. W. Ho, and K. T. Luh. Molecular evidence for strain dissemination of penicillium marneffei: an emerging pathogen in taiwan. J Infect Dis, 181(5):1706–12, 2000.
[135] H. Huang, W. C. Barker, Y. Chen, and C. H. Wu. iproclass: an integrated database of protein family, function and structure information. Nucleic Acids Res, 31(1):390–2, 2003.
[136] A. L. Hughes and R. Friedman. Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Res, 13(6A):1259–64, 2003.
[137] M. K. Hughes and A. L. Hughes. Evolution of duplicate genes in a tetraploid animal, xenopus laevis. Mol Biol Evol, 10(6):1360–9, 1993.
[138] C. M. Hull and A. D. Johnson. Identification of a mating type-like locus in the asexual pathogenic yeast candida albicans. Science, 285(5431):1271–5, 1999.
[139] C. M. Hull, R. M. Raisner, and A. D. Johnson. Evidence for mating of the “asexual” yeast candida albicans in a mammalian host. Science, 289(5477):307–10, 2000.
[140] C. C. Hung, M. Y. Chen, S. M. Hsieh, W. H. Sheng, C. F. Hsiao, and S. C. Chang. Discontinuation of secondary prophylaxis for penicilliosis marneffei in aids patients responding to highly active antiretroviral therapy. Aids, 16(4):672–3, 2002.
[141] L. D. Hurst and N. G. Smith. Do essential genes evolve slowly? Curr Biol, 9(14):747–50, 1999.
[142] M. Huynen, B. Snel, W. Lathe, 3rd, and P. Bork. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res, 10(8):1204–10, 2000.
[143] I. Iliopoulos, S. Tsoka, M. A. Andrade, A. J. Enright, M. Carroll, P. Poullet, V. Promponas, T. Liakopoulos, G. Palaios, C. Pasquier, S. Hamodrakas, J. Tamames, A. T. Yagnik, A. Tramontano, D. Devos, C. Blaschke, A. Valencia, D. Brett, D. Martin, C. Leroy, I. Rigoutsos, C. Sander, and C. A. Ouzounis. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003.
[144] P. Imwidthaya, A. S. Sekhon, T. D. Mastro, A. K. Garg, and E. Ambrosie. Usefulness of a microimmunodiffusion test for the detection of penicillium marneffei antigenemia, antibodies, and exoantigens. Mycopathologia, 138(2):51–5, 1997.
[145] Y. Ina. Oden: a program package for molecular evolutionary analysis and database search of dna and amino acid sequences. Comput Appl Biosci, 10:11–12, 1994.

[146] L. Jeavons, A. J. Hamilton, N. Vanittanakom, R. Ungpakorn, E. G. Evans, T. Sirisanthana, and R. J. Hay. Identification and purification of specific penicillium marneffei antigens and their recognition by human immune sera. J Clin Microbiol, 36(4):949–54, 1998.
[147] M. E. Johnson, L. Viggiano, J. A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Rocchi, and E. E. Eichler. Positive selection of a gene family during the emergence of humans and african apes. Nature, 413(6855):514–9, 2001.
[148] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee, G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, and S. Scherer. The diploid genome sequence of candida albicans. Proc Natl Acad Sci U S A, 101(19):7329–34, 2004.
[149] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res, 12(6):962–8, 2002.
[150] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Microevolutionary genomics of bacteria. Theor Popul Biol, 61(4):435–47, 2002.
[151] I. K. Jordan, Y. I. Wolf, and E. V. Koonin. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol, 3(1):1, 2003.
[152] T. Joseph-Horne, D. W. Hollomon, and P. M. Wood. Fungal respiration: a fusion of standard and alternative components. Biochim Biophys Acta, 1504(2-3):179–95, 2001.
[153] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian Protein Metabolism, pages 21–132. Academic Press, New York, 1969.
[154] D. Julius, L. Blair, A. Brake, G. Sprague, and J. Thorner. Yeast alpha factor is processed from a larger precursor polypeptide: the essential role of a membrane-bound dipeptidyl aminopeptidase. Cell, 32(3):839–52, 1983.
[155] H. Kaessmann, S. Zollner, A. Nekrutenko, and W. H. Li. Signatures of domain shuffling in the human genome. Genome Res, 12(11):1642–50, 2002.
[156] E. Kafer. Origins of translocations in aspergillus nidulans. Genetics, 52(1):217–32, 1965.
[157] T. Kanbe and J. E. Cutler. Minimum chemical requirements for adhesin activity of the acid-stable part of candida albicans cell wall phosphomannoprotein complex. Infect Immun, 66(12):5812–8, 1998.
[158] R. Kappe, C. Fauser, C. N. Okeke, and M. Maiwald. Universal fungus-specific primer systems and group-specific hybridization oligonucleotides for 18s rdna. Mycoses, 39(1-2):25–30, 1996.
[159] N. Kato, W. Brooks, and A. M. Calvo. The expression of sterigmatocystin and penicillin genes in aspergillus nidulans is controlled by vea, a gene required for sexual development. Eukaryot Cell, 2(6):1178–86, 2003.
[160] L. Kaufman, P. G. Standard, M. Jalbert, P. Kantipong, K. Limpakarnjanarat, and T. D. Mastro. Diagnostic antigenemia tests for penicilliosis marneffei. J Clin Microbiol, 34(10):2503–5, 1996.
[161] N. P. Keller and T. M. Hohn. Metabolic pathway gene clusters in filamentous fungi. Fungal Genet Biol, 21(1):17–29, 1997.
[162] M. Kelly, J. Burke, M. Smith, A. Klar, and D. Beach. Four mating-type genes control sexual differentiation in the fission yeast. Embo J, 7(5):1537–47, 1988.
[163] Z. Kerenyi and L. Hornok. Structure and function of mating-type genes in fusarium species. Acta Microbiol Immunol Hung, 49(2-3):313–4, 2002.

[164] H. Kim, K. Han, K. Kim, D. Han, K. Jahng, and K. Chae. The vea gene activates sexual development in aspergillus nidulans. Fungal Genet Biol, 37(1):72–80, 2002.
[165] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol, 16:111–120, 1980.
[166] M. Kimura and J. L. King. Fixation of a deleterious allele at one of two “duplicate” loci by mutation pressure and random drift. Proc Natl Acad Sci U S A, 76(6):2858–61, 1979.
[167] K. E. Kirk and N. R. Morris. The tubb alpha-tubulin gene is essential for sexual development in aspergillus nidulans. Genes Dev, 5(11):2014–23, 1991.
[168] K. E. Kirk and N. R. Morris. Either alpha-tubulin isogene product is sufficient for microtubule function during all stages of growth and differentiation in aspergillus nidulans. Mol Cell Biol, 13(8):4465–76, 1993.
[169] B. S. Klein, L. H. Hogan, and J. M. Jones. Immunologic recognition of a 25-amino acid repeat arrayed in tandem on a major antigen of blastomyces dermatitidis. J Clin Invest, 92(1):330–7, 1993.
[170] M. A. Klich, E. J. Mullaney, C. B. Daly, and J. W. Cary. Molecular and physiological aspects of aflatoxin and sterigmatocystin biosynthesis by aspergillus tamarii and a. ochraceoroseus. Appl Microbiol Biotechnol, 53(5):605–9, 2000.
[171] Y. Koguchi, K. Kawakami, S. Kon, T. Segawa, M. Maeda, T. Uede, and A. Saito. Penicillium marneffei causes osteopontin-mediated production of interleukin-12 by peripheral blood mononuclear cells. Infect Immun, 70(3):1042–8, 2002.
[172] F. A. Kondrashov and E. V. Koonin. Origin of alternative splicing by tandem exon duplication. Hum Mol Genet, 10(23):2661–9, 2001.
[173] F. A. Kondrashov and E. V. Koonin. Evolution of alternative splicing: deletions, insertions and origin of functional parts of proteins from intron sequences. Trends Genet, 19(3):115–9, 2003.
[174] F. A. Kondrashov, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Selection in the evolution of gene duplications. Genome Biol, 3(2):RESEARCH0008, 2002.
[175] R. Koszul, A. Malpertuy, L. Frangeul, C. Bouchier, P. Wincker, A. Thierry, S. Duthoy, S. Ferris, C. Hennequin, and B. Dujon. The complete mitochondrial genome sequence of the pathogenic yeast candida (torulopsis) glabrata. FEBS Lett, 534(1-3):39–48, 2003.
[176] L. Kraakman, K. Lemaire, P. Ma, A. W. Teunissen, M. C. Donaton, P. Van Dijck, J. Winderickx, J. H. de Winde, and J. M. Thevelein. A saccharomyces cerevisiae g-protein coupled receptor, gpr1, is specifically required for glucose activation of the camp pathway during the transition to growth on glucose. Mol Microbiol, 32(5):1002–12, 1999.
[177] A. Krause, J. Stoye, and M. Vingron. The systers protein sequence cluster set. Nucleic Acids Res, 28(1):270–2, 2000.
[178] D. M. Krylov, Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res, 13(10):2229–35, 2003.
[179] N. Kudeken, K. Kawakami, and A. Saito. Cytokine-induced fungicidal activity of human polymorphonuclear leukocytes against penicillium marneffei. FEMS Immunol Med Microbiol, 26(2):115–24, 1999.
[180] N. Kudeken, K. Kawakami, and A. Saito. Role of superoxide anion in the fungicidal activity of murine peritoneal exudate macrophages against penicillium marneffei. Microbiol Immunol, 43(4):323–30, 1999.

[181] N. Kudeken, K. Kawakami, and A. Saito. Mechanisms of the in vitro fungicidal effects of human neutrophils against penicillium marneffei induced by granulocyte-macrophage colony-stimulating factor (gm-csf). Clin Exp Immunol, 119(3):472–8, 2000.
[182] E. Y. Kwan, Y. L. Lau, K. Y. Yuen, B. M. Jones, and L. C. Low. Penicillium marneffei infection in a non-hiv infected child. J Paediatr Child Health, 33(3):267–71, 1997.
[183] K. J. Kwon-Chung and J. E. Bennett. Distribution of alpha and a mating types of cryptococcus neoformans among natural and clinical isolates. Am J Epidemiol, 108(4):337–40, 1978.
[184] J. A. Lake. Reconstructing evolutionary trees from dna and protein sequences: paralinear distances. Proc Natl Acad Sci USA, 91:1455–1459, 1994.
[185] C. Lanave, G. Preparata, C. Saccone, and G. Serio. A new method for calculating evolutionary substitution rates. J Mol Evol, 20:86–93, 1984.
[186] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–9, 1988.
[187] K. Langfelder, B. Jahn, H. Gehringer, A. Schmidt, G. Wanner, and A. A. Brakhage. Identification of a polyketide synthase gene (pksp) of aspergillus fumigatus involved in conidial pigment biosynthesis and virulence. Med Microbiol Immunol (Berl), 187(2):79–89, 1998.
[188] L. Latchinian-Sadek and D. Y. Thomas. Expression, purification, and characterization of the yeast kex1 gene product, a polypeptide precursor processing carboxypeptidase. J Biol Chem, 268(1):534–40, 1993.
[189] J. P. Latge and R. Calderone. Host-microbe interactions: fungi invasive human fungal opportunistic infections. Curr Opin Microbiol, 5(4):355–8, 2002.
[190] E. Leberer, D. Harcus, I. D. Broadbent, K. L. Clark, D. Dignard, K. Ziegelbauer, A. Schmidt, N. A. Gow, A. J. Brown, and D. Y. Thomas. Signal transduction through homologs of the ste20p and ste7p protein kinases can trigger hyphal formation in the pathogenic fungus candida albicans. Proc Natl Acad Sci U S A, 93(23):13217–22, 1996.
[191] D. W. Lee, S. Kim, S. J. Kim, D. M. Han, K. Y. Jahng, and K. S. Chae. The isda gene is necessary for sexual development inhibition by a salt in aspergillus nidulans. Curr Genet, 39(4):237–43, 2001.
[192] K. B. Lengeler, R. C. Davidson, C. D’Souza, T. Harashima, W. C. Shen, P. Wang, X. Pan, M. Waugh, and J. Heitman. Signal transduction cascades regulating fungal development and virulence. Microbiol Mol Biol Rev, 64(4):746–85, 2000.
[193] K. B. Lengeler, P. Wang, G. M. Cox, J. R. Perfect, and J. Heitman. Identification of the mata mating-type locus of cryptococcus neoformans reveals a serotype a mata strain thought to have been extinct. Proc Natl Acad Sci U S A, 97(26):14455–60, 2000.
[194] I. Letunic, R. R. Copley, and P. Bork. Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet, 11(13):1561–7, 2002.
[195] J. C. Li, L. Q. Pan, and S. X. Wu. Mycologic investigation on rhizomys pruinous senex in guangxi as natural carrier with penicillium marneffei. Chin Med J (Engl), 102(6):477–85, 1989.
[196] W. H. Li. Rate of gene silencing at duplicate loci: a theoretical study and interpretation of data from tetraploid fishes. Genetics, 95(1):237–58, 1980.
[197] W. H. Li. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol, 36:96–99, 1993.

[198] W. H. Li, C. I. Wu, and C. C. Luo. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol, 2:150–174, 1985.
[199] Wen-Hsiung Li. Molecular evolution. Sinauer Associates, Sunderland, Mass., 1997.
[200] F. Lisacek, Y. Diaz, and F. Michel. Automatic identification of group i intron cores in genomic dna sequences. J Mol Biol, 235(4):1206–17, 1994.
[201] C. Y. Lo, D. T. Chan, K. Y. Yuen, F. K. Li, and K. P. Cheng. Penicillium marneffei infection in a patient with sle. Lupus, 4(3):229–31, 1995.
[202] K. F. LoBuglio and J. W. Taylor. Phylogeny and pcr identification of the human pathogenic fungus penicillium marneffei. J Clin Microbiol, 33(1):85–9, 1995.
[203] B. J. Loftus, E. Fung, P. Roncaglia, D. Rowley, P. Amedeo, D. Bruno, J. Vamathevan, M. Miranda, I. J. Anderson, J. A. Fraser, J. E. Allen, I. E. Bosdet, M. R. Brent, R. Chiu, T. L. Doering, M. J. Donlin, C. A. D’Souza, D. S. Fox, V. Grinberg, J. Fu, M. Fukushima, B. J. Haas, J. C. Huang, G. Janbon, S. J. Jones, H. L. Koo, M. I. Krzywinski, J. K. Kwon-Chung, K. B. Lengeler, R. Maiti, M. A. Marra, R. E. Marra, C. A. Mathewson, T. G. Mitchell, M. Pertea, F. R. Riggs, S. L. Salzberg, J. E. Schein, A. Shvartsbeyn, H. Shin, M. Shumway, C. A. Specht, B. B. Suh, A. Tenney, T. R. Utterback, B. L. Wickes, J. R. Wortman, N. H. Wye, J. W. Kronstad, J. K. Lodge, J. Heitman, R. W. Davis, C. M. Fraser, and R. W. Hyman. The genome of the basidiomycetous yeast and human pathogen cryptococcus neoformans. Science, 307(5713):1321–4, 2005.
[204] M. Long, E. Betran, K. Thornton, and W. Wang. The origin of new genes: glimpses from the young and old. Nat Rev Genet, 4(11):865–75, 2003.
[205] M. Long and C. H. Langley. Natural selection and the origin of jingwei, a chimeric processed functional gene in drosophila. Science, 260(5104):91–5, 1993.
[206] M. C. Lorenz. Genomic approaches to fungal pathogenicity. Curr Opin Microbiol, 5(4):372–8, 2002.
[207] T. M. Lowe and S. R. Eddy. trnascan-se: a program for improved detection of transfer rna genes in genomic sequence. Nucleic Acids Res, 25(5):955–64, 1997.
[208] Q. Lu, L. L. Wallrath, H. Granok, and S. C. Elgin. (ct)n (ga)n repeats and heat shock elements have distinct roles in chromatin structure and transcriptional activation of the drosophila hsp26 gene. Mol Cell Biol, 13(5):2802–14, 1993.
[209] L. G. Lundin. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics, 16(1):1–19, 1993.
[210] M. Lynch and J. S. Conery. The evolutionary fate and consequences of duplicate genes. Science, 290(5494):1151–5, 2000.
[211] M. Lynch and J. S. Conery. The evolutionary demography of duplicate genes. J Struct Funct Genomics, 3(1-4):35–44, 2003.
[212] M. Lynch and A. Force. The probability of duplicate gene preservation by subfunctionalization. Genetics, 154(1):459–73, 2000.
[213] B. B. Magee and P. T. Magee. Induction of mating in candida albicans by construction of mtla and mtlalpha strains. Science, 289(5477):310–3, 2000.
[214] W. Makalowski and M. S. Boguski. Synonymous and nonsynonymous substitution distances are correlated in mouse and rat genes. J Mol Evol, 47(2):119–21, 1998.
[215] W. Makalowski, G. A. Mitchell, and D. Labuda. Alu sequences in the coding regions of mrna: a source of protein variability. Trends Genet, 10(6):188–93, 1994.

[216] G. Mannhaupt, C. Montrone, D. Haase, H. W. Mewes, V. Aign, J. D. Hoheisel, B. Fartmann, G. Nyakatura, F. Kempken, J. Maier, and U. Schulte. What’s in the genome of a filamentous fungus? analysis of the neurospora genome sequence. Nucleic Acids Res, 31(7):1944–54, 2003.
[217] E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–3, 1999.
[218] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402(6757):83–6, 1999.
[219] A. McLysaght, K. Hokamp, and K. H. Wolfe. Extensive genomic duplication during early chordate evolution. Nat Genet, 31(2):200–4, 2002.
[220] H. W. Mewes, K. Albermann, M. Bahr, D. Frishman, A. Gleissner, J. Hani, K. Heumann, K. Kleine, A. Maierl, S. G. Oliver, F. Pfeiffer, and A. Zollner. Overview of the yeast genome. Nature, 387(6632 Suppl):7–65, 1997.
[221] A. Meyer and M. Schartl. Gene and genome duplications in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr Opin Cell Biol, 11(6):699–704, 1999.
[222] K. Y. Miller, T. M. Toennis, T. H. Adams, and B. L. Miller. Isolation and transcriptional characterization of a morphological modifier: the aspergillus nidulans stunted (stua) gene. Mol Gen Genet, 227(2):285–92, 1991.
[223] T. K. Mitchell and R. A. Dean. The camp-dependent protein kinase catalytic subunit is required for appressorium formation and pathogenesis by the rice blast pathogen magnaporthe grisea. Plant Cell, 7(11):1869–78, 1995.
[224] N. P. Money. Plant pathology. reverend berkeley’s devil. Nature, 411(6838):644, 2001.
[225] S. A. Mousavi and G. D. Robson. Oxidative and amphotericin b-mediated cell death in the opportunistic pathogen aspergillus fumigatus is associated with an apoptotic-like phenotype. Microbiology, 150(Pt 6):1937–45, 2004.
[226] S. V. Muse and B. S. Gaut. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol, 11(5):715–24, 1994.
[227] K. A. Nasmyth and K. Tatchell. The structure of transposable yeast mating type loci. Cell, 19(3):753–64, 1980.
[228] M. Nei and T. Gojobori. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol, 3:418–426, 1986.
[229] Masatoshi Nei and S. Kumar. Molecular evolution and phylogenetics. Oxford University Press, Oxford, UK, 2000.
[230] A. Nekrutenko and W. H. Li. Transposable elements are found in a large number of human protein-coding genes. Trends Genet, 17(11):619–21, 2001.
[231] M. A. Nelson, S. Kang, E. L. Braun, M. E. Crawford, P. L. Dolan, P. M. Leonard, J. Mitchell, A. M. Armijo, L. Bean, E. Blueyes, T. Cushing, A. Errett, M. Fleharty, M. Gorman, K. Judson, R. Miller, J. Ortega, I. Pavlova, J. Perea, S. Todisco, R. Trujillo, J. Valentine, A. Wells, M. Werner-Washburne, D. O. Natvig, and et al. Expressed sequences from conidial, mycelial, and sexual stages of neurospora crassa. Fungal Genet Biol, 21(3):348–63, 1997.
[232] S. L. Newman, S. Chaturvedi, and B. S. Klein. The wi-1 antigen of blastomyces dermatitidis yeasts mediates binding to human macrophage cd11b/cd18 (cr3) and cd14. J Immunol, 154(2):753–61, 1995.

[233] W. C. Nierman, A. Pain, M. J. Anderson, J. R. Wortman, H. S. Kim, J. Arroyo, M. Berriman, K. Abe, D. B. Archer, C. Bermejo, J. Bennett, P. Bowyer, D. Chen, M. Collins, R. Coulsen, R. Davies, P. S. Dyer, M. Farman, N. Fedorova, T. V. Feldblyum, R. Fischer, N. Fosker, A. Fraser, J. L. Garcia, M. J. Garcia, A. Goble, G. H. Goldman, K. Gomi, S. Griffith-Jones, R. Gwilliam, B. Haas, H. Haas, D. Harris, H. Horiuchi, J. Huang, S. Humphray, J. Jimenez, N. Keller, H. Khouri, K. Kitamoto, T. Kobayashi, S. Konzack, R. Kulkarni, T. Kumagai, A. Lafton, J. P. Latge, W. Li, A. Lord, C. Lu, W. H. Majoros, G. S. May, B. L. Miller, Y. Mohamoud, M. Molina, M. Monod, I. Mouyna, S. Mulligan, L. Murphy, S. O’Neil, I. Paulsen, M. A. Penalva, M. Pertea, C. Price, B. L. Pritchard, M. A. Quail, E. Rabbinowitsch, N. Rawlins, M. A. Rajandream, U. Reichard, H. Renauld, G. D. Robson, S. Rodriguez de Cordoba, J. M. Rodriguez-Pena, C. M. Ronning, S. Rutter, S. L. Salzberg, M. Sanchez, J. C. Sanchez-Ferrero, D. Saunders, K. Seeger, R. Squares, S. Squares, M. Takeuchi, F. Tekaia, G. Turner, C. R. Vazquez de Aldana, J. Weidman, O. White, J. Woodward, J. H. Yu, C. Fraser, J. E. Galagan, K. Asai, M. Machida, N. Hall, B. Barrell, and D. W. Denning. Genomic sequence of the pathogenic and allergenic filamentous fungus aspergillus fumigatus. Nature, 438(7071):1151–6, 2005.
[234] L. R. Nunes, R. Costa de Oliveira, D. B. Leite, V. S. da Silva, E. dos Reis Marques, M. E. da Silva Ferreira, D. C. Ribeiro, L. A. de Souza Bernardes, M. H. Goldman, R. Puccia, L. R. Travassos, W. L. Batista, M. P. Nobrega, F. G. Nobrega, D. Y. Yang, C. A. de Braganca Pereira, and G. H. Goldman. Transcriptome analysis of paracoccidioides brasiliensis cells undergoing mycelium-to-yeast transition. Eukaryot Cell, 4(12):2115–28, 2005.
[235] D. I. Nurminsky, M. V. Nurminskaya, D. De Aguiar, and D. L. Hartl. Selective sweep of a newly evolved sperm-specific gene in drosophila. Nature, 396(6711):572–5, 1998.
[236] A. Odom, S. Muir, E. Lim, D. L. Toffaletti, J. Perfect, and J. Heitman. Calcineurin is required for virulence of cryptococcus neoformans. Embo J, 16(10):2576–89, 1997.
[237] S Ohno. Evolution by Gene Duplication. Springer-Verlag Inc., New York, 1970.
[238] T. Ohta. How gene families evolve. Theor Popul Biol, 37(1):213–9, 1990.
[239] T. Ohta. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J Mol Evol, 40(1):56–63, 1995.
[240] H. D. Osiewacz and E. Kimpel. Mitochondrial-nuclear interactions and lifespan control in fungi. Exp Gerontol, 34(8):901–9, 1999.
[241] C. Pal, B. Papp, and L. D. Hurst. Highly expressed genes in yeast evolve slowly. Genetics, 158(2):927–31, 2001.
[242] P. Pamilo and N. O. Bianchi. Evolution of the zfx and zfy genes: rates and interdependence between the genes. Mol Biol Evol, 10:271–281, 1993.
[243] B. Paquin and B. F. Lang. The mitochondrial dna of allomyces macrogynus: the complete genomic sequence from an ancestral fungus. J Mol Biol, 255(5):688–701, 1996.
[244] L. Patthy. Genome evolution and the evolution of exon-shuffling–a review. Gene, 238(1):103–14, 1999.
[245] W. R. Pearson. Rapid and sensitive sequence comparison with fastp and fasta. Methods Enzymol, 183:63–98, 1990.
[246] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA, 85:2444–2448, 1988.
[247] J. Pei and N. V. Grishin. Type ii caax prenyl endopeptidases belong to a novel superfamily of putative membrane-bound metalloproteases. Trends Biochem Sci, 26(5):275–7, 2001.

[248] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.
[249] G. E. Pierard, J. Arrese Estrada, C. Pierard-Franchimont, A. Thiry, and D. Stynen. Immunohistochemical expression of galactomannan in the cytoplasm of phagocytic cells during invasive aspergillosis. Am J Clin Pathol, 96(3):373–6, 1991.
[250] J. Piskur. Origin of the duplicated regions in the yeast genomes. Trends Genet, 17(6):302–3, 2001.
[251] J. B. Plotkin, J. Dushoff, and H. B. Fraser. Detecting selection using a single genome sequence of m. tuberculosis and p. falciparum. Nature, 428:942–945, 2004.
[252] S. Poggeler. Mating-type genes for classical strain improvements of ascomycetes. Appl Microbiol Biotechnol, 56(5-6):589–601, 2001.
[253] S. Poggeler. Genomic evidence for mating abilities in the asexual pathogen aspergillus fumigatus. Curr Genet, 42(3):153–60, 2002.
[254] S. Pongsunk, A. Andrianopoulos, and S. C. Chaiyaroj. Conditional lethal disruption of tata-binding protein gene in penicillium marneffei. Fungal Genet Biol, 42(11):893–903, 2005.
[255] M. Pop, D. S. Kosack, and S. L. Salzberg. Hierarchical scaffolding with bambus. Genome Res, 14(1):149–59, 2004.
[256] R. O. Poyton and J. E. McEwen. Crosstalk between nuclear and mitochondrial genomes. Annu Rev Biochem, 65:563–607, 1996.
[257] V. E. Prince and F. B. Pickett. Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet, 3(11):827–37, 2002.
[258] L. Ramsay, M. Macaulay, S. degli Ivanissevich, K. MacLean, L. Cardle, J. Fuller, K. J. Edwards, S. Tuvesson, M. Morgante, A. Massari, E. Maestri, N. Marmiroli, T. Sjakste, M. Ganal, W. Powell, and R. Waugh. A simple sequence repeat-based linkage map of barley. Genetics, 156(4):1997–2005, 2000.
[259] M. Raymond, D. Dignard, A. M. Alarco, N. Mainville, B. B. Magee, and D. Y. Thomas. A ste6p/p-glycoprotein homologue from the asexual yeast candida albicans transports the a-factor mating pheromone in saccharomyces cerevisiae. Mol Microbiol, 27(3):587–98, 1998.
[260] Y. Reiss, J. L. Goldstein, M. C. Seabra, P. J. Casey, and M. S. Brown. Inhibition of purified p21ras farnesyl:protein transferase by cys-aax tetrapeptides. Cell, 62(1):81–8, 1990.
[261] M. Remm, C. E. Storm, and E. L. Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol, 314(5):1041–52, 2001.
[262] M. Ricchetti, C. Fairhead, and B. Dujon. Mitochondrial dna repairs double-strand breaks in yeast chromosomes. Nature, 402(6757):96–100, 1999.
[263] P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biology open software suite. Trends Genet, 16(6):276–7, 2000.
[264] I. Rigoutsos, T. Huynh, A. Floratos, L. Parida, and D. Platt. Dictionary-driven protein annotation. Nucleic Acids Res, 30(17):3901–16, 2002.
[265] M. Robinson-Rechavi and V. Laudet. Evolutionary rates of duplicate genes in fish and mammals. Mol Biol Evol, 18(4):681–3, 2001.
[266] F. Rodriguez, J. L. Oliver, A. Marin, and J. R. Medina. The general stochastic model of nucleotide substitution. J Theor Biol, 142:485–501, 1990.

[267] S. Rogic, A. K. Mackworth, and F. B. Ouellette. Evaluation of gene-finding programs on mammalian sequences. Genome Res, 11(5):817–32, 2001.
[268] S. Rogic, B. F. Ouellette, and A. K. Mackworth. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics, 18(8):1034–45, 2002.
[269] Y. Rongrungruang and S. M. Levitz. Interactions of penicillium marneffei with human leukocytes in vitro. Infect Immun, 67(9):4732–6, 1999.
[270] G. M. Rubin, M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson, I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, J. M. Cherry, S. Henikoff, M. P. Skupski, S. Misra, M. Ashburner, E. Birney, M. S. Boguski, T. Brody, P. Brokstein, S. E. Celniker, S. A. Chervitz, D. Coates, A. Cravchik, A. Gabrielian, R. F. Galle, W. M. Gelbart, R. A. George, L. S. Goldstein, F. Gong, P. Guan, N. L. Harris, B. A. Hay, R. A. Hoskins, J. Li, Z. Li, R. O. Hynes, S. J. Jones, P. M. Kuehl, B. Lemaitre, J. T. Littleton, D. K. Morrison, C. Mungall, P. H. O’Farrell, O. K. Pickeral, C. Shue, L. B. Vosshall, J. Zhang, Q. Zhao, X. H. Zheng, and S. Lewis. Comparative genomics of the eukaryotes. Science, 287(5461):2204–15, 2000.
[271] A. Rzhetsky and P. Morozov. Markov chain monte carlo computation of confidence intervals for substitution-rate variation in proteins. Pac Symp Biocomput, 6:203–214, 2001.
[272] C. Sadhu, D. Hoekstra, M. J. McEachern, S. I. Reed, and J. B. Hicks. A g-protein alpha subunit from asexual candida albicans functions in the mating signal transduction pathway of saccharomyces cerevisiae and is regulated by the a1-alpha 2 repressor. Mol Cell Biol, 12(5):1977–85, 1992.
[273] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–25, 1987.
[274] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg. The database of interacting proteins: 2004 update. Nucleic Acids Res, 32(Database issue):D449–51, 2004.
[275] G. San-Blas. [dimorphic fungi: biochemical approach to their dimorphism]. Acta Cient Venez, 46(4):221–4, 1995.
[276] G. A. Sarosi and D. S. Serstock. Isolation of blastomyces dermatitidis from pigeon manure. Am Rev Respir Dis, 114(6):1179–83, 1976.
[277] A. S. Sekhon, J. S. Li, and A. K. Garg. Penicillosis marneffei: serological and exoantigen studies. Mycopathologia, 77(1):51–7, 1982.
[278] P. Sengupta and B. H. Cochran. Mat alpha 1 can mediate gene activation by a-mating factor. Genes Dev, 5(10):1924–34, 1991.
[279] C. Seoighe and K. H. Wolfe. Extent of genomic rearrangement after genome duplication in yeast. Proc Natl Acad Sci U S A, 95(8):4447–52, 1998.
[280] C. Seoighe and K. H. Wolfe. Updated map of duplicated regions in the yeast genome. Gene, 238(1):253–61, 1999.
[281] P. M. Sharp. In search of molecular darwinism. Nature, 385:111–112, 1997.
[282] P. M. Sharp and W. H. Li. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res, 15(3):1281–95, 1987.
[283] P. M. Sharp and W. H. Li. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol, 4(3):222–30, 1987.
[284] J. C. Shepherd, W. McGinnis, A. E. Carrasco, E. M. De Robertis, and W. J. Gehring. Fly and frog homoeo domains show homologies with yeast mating type regulatory proteins. Nature, 310(5972):70–1, 1984.

[285] R. Shields. Pushing the envelope on molecular dating. Trends Genet, 20(5):221–2, 2004.
[286] R. A. Sia, K. B. Lengeler, and J. Heitman. Diploid strains of the pathogenic basidiomycete cryptococcus neoformans are thermally dimorphic. Fungal Genet Biol, 29(3):153–63, 2000.
[287] A. Sidow. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev, 6(6):715–22, 1996.
[288] R. R. Sinden. Biological implications of the dna structures associated with disease-causing triplet repeats. Am J Hum Genet, 64(2):346–53, 1999.
[289] M. Sipiczki. Where does fission yeast sit on the tree of life? Genome Biol, 1(2):REVIEWS1011, 2000.
[290] T. Sirisanthana, K. Supparatpinyo, J. Perriens, and K. E. Nelson. Amphotericin b and itraconazole for treatment of disseminated penicillium marneffei infection in human immunodeficiency virus-infected patients. Clin Infect Dis, 26(5):1107–10, 1998.
[291] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147:195–197, 1981.
[292] T. F. Smith, M. S. Waterman, and C. Burks. The statistical distribution of nucleic acid similarities. Nucleic Acids Res, 13(2):645–56, 1985.
[293] R. Sorek, G. Ast, and D. Graur. Alu-containing exons are alternatively spliced. Genome Res, 12(7):1060–7, 2002.
[294] P. Staib, M. Kretschmar, T. Nichterlein, H. Hof, and J. Morschhauser. Differential activation of a candida albicans virulence gene family during infection. Proc Natl Acad Sci U S A, 97(11):6102–7, 2000.
[295] M. A. Steel. Recovering a tree from the leaf colourations it generates under a markov model. Appl Math Lett, 7:19–32, 1994.
[296] B. R. Steen, T. Lian, S. Zuyderduyn, W. K. MacDonald, M. Marra, S. J. Jones, and J. W. Kronstad. Temperature-regulated transcription in the pathogenic fungus cryptococcus neoformans. Genome Res, 12(9):1386–400, 2002.
[297] L. M. Steinmetz, C. Scharfe, A. M. Deutschbauer, D. Mokranjac, Z. S. Herman, T. Jones, A. M. Chu, G. Giaever, H. Prokisch, P. J. Oefner, and R. W. Davis. Systematic screen for human disease genes in yeast. Nat Genet, 31(4):400–4, 2002.
[298] A. Stoltzfus. On the possibility of constructive neutral evolution. J Mol Evol, 49(2):169–81, 1999.
[299] J. N. Strathern, E. Spatola, C. McGill, and J. B. Hicks. Structure and organization of transposable mating type cassettes in saccharomyces yeasts. Proc Natl Acad Sci U S A, 77(5):2839–43, 1980.
[300] K. Supparatpinyo, C. Khamwan, V. Baosoung, K. E. Nelson, and T. Sirisanthana. Disseminated penicillium marneffei infection in southeast asia. Lancet, 344(8915):110–3, 1994.
[301] K. Supparatpinyo, K. E. Nelson, W. G. Merz, B. J. Breslin, C. R. Cooper Jr., C. Kamwan, and T. Sirisanthana. Response to antifungal therapy by human immunodeficiency virus-infected patients with disseminated penicillium marneffei infections and in vitro susceptibilities of isolates from clinical specimens. Antimicrob Agents Chemother, 37(11):2407–11, 1993.
[302] K. Supparatpinyo, J. Perriens, K. E. Nelson, and T. Sirisanthana. A controlled trial of itraconazole to prevent relapse of penicillium marneffei infection in patients infected with the human immunodeficiency virus. N Engl J Med, 339(24):1739–43, 1998.

[303] Y. Suzuki and T Gojobori. Analysis of coding sequences. In M. Salemi and A.M. Vandamme, editors, The phylogenetic handbook: a practical approach to DNA and protein phylogeny, pages 283–311. Cambridge University Press, Cambridge, UK, 2003.
[304] A. Tam, W. K. Schmidt, and S. Michaelis. The multispanning membrane protein ste24p catalyzes caax proteolysis and nh2-terminal processing of the yeast a-factor precursor. J Biol Chem, 276(50):46798–806, 2001.
[305] W. Tang, T. M. Gunn, D. F. McLaughlin, G. S. Barsh, S. F. Schlossman, and J. S. Duke-Cohan. Secreted and membrane attractin result from alternative splicing of the human atrn gene. Proc Natl Acad Sci U S A, 97(11):6025–30, 2000.
[306] D. Taramelli, S. Brambilla, G. Sala, A. Bruccoleri, C. Tognazioli, L. Riviera-Uzielli, and J. R. Boelaert. Effects of iron on extracellular and intracellular growth of penicillium marneffei. Infect Immun, 68(3):1724–6, 2000.
[307] D. Taramelli, C. Tognazioli, F. Ravagnani, O. Leopardi, G. Giannulis, and J. R. Boelaert. Inhibition of intramacrophage growth of penicillium marneffei by 4-aminoquinolines. Antimicrob Agents Chemother, 45(5):1450–5, 2001.
[308] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The cog database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res, 29(1):22–8, 2001.
[309] S. Tavare. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.
[310] R. D. Teasdale and M. R. Jackson. Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev Biol, 12:27–54, 1996.
[311] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–80, 1994.
[312] JD Thompson, DG Higgins, and TJ Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res, 22:4673–4680, 1994.
[313] C. Thrane, U. Kaufmann, B. M. Stummann, and S. Olsson. Activation of caspase-like activity and poly (adp-ribose) polymerase degradation during sporulation in aspergillus nidulans. Fungal Genet Biol, 41(3):361–8, 2004.
[314] W. E. Timberlake. Molecular genetics of aspergillus development. Annu Rev Genet, 24:5–36, 1990.
[315] R. B. Todd, J. R. Greenhalgh, M. J. Hynes, and A. Andrianopoulos. Tupa, the penicillium marneffei tup1p homologue, represses both yeast and spore development. Mol Microbiol, 48(1):85–94, 2003.
[316] S. Trewatcharegon, S. Sirisinha, A. Romsai, B. Eampokalap, R. Teanpaisan, and S. C. Chaiyaroj. Molecular typing of penicillium marneffei isolates from thailand by noti macrorestriction and pulsed-field gel electrophoresis. J Clin Microbiol, 39(12):4544–8, 2001.
[317] H. F. Tsai, Y. C. Chang, R. G. Washburn, M. H. Wheeler, and K. J. Kwon-Chung. The developmentally regulated alb1 gene of aspergillus fumigatus: its role in modulation of conidial morphology and virulence. J Bacteriol, 180(12):3031–8, 1998.

[318] H. F. Tsai, M. H. Wheeler, Y. C. Chang, and K. J. Kwon-Chung. A developmentally regulated gene cluster involved in conidial pigment biosynthesis in aspergillus fumigatus. J Bacteriol, 181(20):6469–77, 1999.
[319] N. Tsuchimori, L. L. Sharkey, W. A. Fonzi, S. W. French, J. E. Edwards Jr., and S. G. Filler. Reduced virulence of hwp1-deficient mutants of candida albicans and their interactions with host cells. Infect Immun, 68(4):1997–2002, 2000.
[320] B. G. Turgeon and O. C. Yoder. Proposed nomenclature for mating type genes of filamentous ascomycetes. Fungal Genet Biol, 31(1):1–5, 2000.
[321] Y. Van de Peer, J. S. Taylor, I. Braasch, and A. Meyer. The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. J Mol Evol, 53(4-5):436–46, 2001.
[322] K. Vandepoele, Y. Saeys, C. Simillion, J. Raes, and Y. Van De Peer. The automatic detection of homologous regions (adhore) and its application to microcolinearity between arabidopsis and rice. Genome Res, 12(11):1792–801, 2002.
[323] N. Vanittanakom, C. R. Cooper Jr., S. Chariyalertsak, S. Youngchim, K. E. Nelson, and T. Sirisanthana. Restriction endonuclease analysis of penicillium marneffei. J Clin Microbiol, 34(7):1834–6, 1996.
[324] N. Vanittanakom, W. G. Merz, N. Sittisombut, C. Khamwan, K. E. Nelson, and T. Sirisanthana. Specific identification of penicillium marneffei by a polymerase chain reaction/hybridization technique. Med Mycol, 36(3):169–75, 1998.
[325] N. Vanittanakom, P. Vanittanakom, and R. J. Hay. Rapid identification of penicillium marneffei by pcr-based detection of specific sequences on the rrna gene. J Clin Microbiol, 40(5):1739–42, 2002.
[326] J. Varga and B. Toth. Genetic variability and reproductive mode of aspergillus fumigatus. Infect Genet Evol, 3(1):3–17, 2003.
[327] D. Venet. Matarray: a matlab toolbox for microarray data. Bioinformatics, 19:659–660, 2003.
[328] K. J. Verstrepen, A. Jansen, F. Lewitter, and G. R. Fink. Intragenic tandem repeats generate functional variability. Nat Genet, 37(9):986–90, 2005.
[329] K. J. Verstrepen, T. B. Reynolds, and G. R. Fink. Origins of variation in the fungal cell surface. Nat Rev Microbiol, 2(7):533–40, 2004.
[330] P. E. Verweij, J. F. Meis, P. van den Hurk, J. Zoll, R. A. Samson, and W. J. Melchers. Phylogenetic relationships of five species of aspergillus and related taxa as deduced by comparison of sequences of small subunit ribosomal rna. J Med Vet Mycol, 33(3):185–90, 1995.
[331] K. Vienken, M. Scherer, and R. Fischer. The zn(ii)2cys6 putative aspergillus nidulans transcription factor repressor of sexual development inhibits sexual development under low-carbon conditions and in submersed culture. Genetics, 169(2):619–30, 2005.
[332] M. Viswanathan, G. Muthukumar, Y. S. Cong, and J. Lenard. Seripauperins of saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins. Gene, 148(1):149–53, 1994.
[333] M. A. Viviani, A. M. Tortorano, G. Rizzardini, T. Quirino, L. Kaufman, A. A. Padhye, and L. Ajello. Treatment and serological studies of an italian case of penicilliosis marneffei contracted in thailand by a drug addict infected with the human immunodeficiency virus. Eur J Epidemiol, 9(1):79–85, 1993.
[334] A. Wagner. The fate of duplicated genes: loss or new function? Bioessays, 20(10):785–8, 1998.
[335] A. Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–92, 2001.

[336] J. B. Walsh. How often do duplicated genes evolve new functions? Genetics, 139(1):421–8, 1995.
[337] J. D. Walton. Horizontal gene transfer and the evolution of secondary metabolite gene clusters in fungi: an hypothesis. Fungal Genet Biol, 30(3):167–71, 2000.
[338] W. Wang, F. G. Brunet, E. Nevo, and M. Long. Origin of sphinx, a young chimeric RNA gene in Drosophila melanogaster. Proc Natl Acad Sci U S A, 99(7):4448–53, 2002.
[339] W. Wang, H. Zheng, S. Yang, H. Yu, J. Li, H. Jiang, J. Su, L. Yang, J. Zhang, J. McDermott, R. Samudrala, J. Wang, H. Yang, J. Yu, K. Kristiansen, and G. K. Wong. Origin and evolution of new exons in rodents. Genome Res, 15(9):1258–64, 2005.
[340] J. L. Weber and P. E. May. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet, 44(3):388–96, 1989.
[341] M. H. Wheeler and A. A. Bell. Melanins and their importance in pathogenic fungi. Curr Top Med Mycol, 2:338–87, 1988.
[342] S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol, 18(5):691–9, 2001.
[343] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu Rev Biochem, 46:573–639, 1977.
[344] K. H. Wolfe and P. M. Sharp. Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J Mol Evol, 37(4):441–56, 1993.
[345] K. H. Wolfe and D. C. Shields. Molecular evidence for an ancient duplication of the entire yeast genome. Nature, 387(6634):708–13, 1997.
[346] K. H. Wong and S. S. Lee. Comparing the first and second hundred AIDS cases in Hong Kong. Singapore Med J, 39(6):236–40, 1998.
[347] L. P. Wong, P. C. Woo, A. Y. Wu, and K. Y. Yuen. DNA immunization using a secreted cell wall antigen mp1p is protective against Penicillium marneffei infection. Vaccine, 20(23-24):2878–86, 2002.
[348] S. S. Wong, H. Siau, and K. Y. Yuen. Penicilliosis marneffei–west meets east. J Med Microbiol, 48(11):973–5, 1999.
[349] S. S. Wong, K. H. Wong, W. T. Hui, S. S. Lee, J. Y. Lo, L. Cao, and K. Y. Yuen. Differences in clinical and laboratory diagnostic characteristics of penicilliosis marneffei in human immunodeficiency virus (HIV)- and non-HIV-infected patients. J Clin Microbiol, 39(12):4535–40, 2001.
[350] S. S. Wong, P. C. Woo, and K. Y. Yuen. Candida tropicalis and Penicillium marneffei mixed fungaemia in a patient with Waldenstrom’s macroglobulinaemia. Eur J Clin Microbiol Infect Dis, 20(2):132–5, 2001.
[351] P. C. Woo, C. M. Chan, A. S. Leung, S. K. Lau, X. Y. Che, S. S. Wong, L. Cao, and K. Y. Yuen. Detection of cell wall galactomannoprotein afmp1p in culture supernatants of Aspergillus fumigatus and in sera of aspergillosis patients. J Clin Microbiol, 40(11):4382–7, 2002.
[352] P. C. Woo, K. T. Chong, A. S. Leung, S. S. Wong, S. K. Lau, and K. Y. Yuen. Aflmp1 encodes an antigenic cel wall protein in Aspergillus flavus. J Clin Microbiol, 41(2):845–50, 2003.
[353] P. C. Woo, H. Zhen, J. J. Cai, J. Yu, S. K. Lau, J. Wang, J. L. Teng, S. S. Wong, R. H. Tse, R. Chen, H. Yang, B. Liu, and K. Y. Yuen. The mitochondrial genome of the thermal dimorphic fungus Penicillium marneffei is more closely related to those of molds than yeasts. FEBS Lett, 555(3):469–77, 2003.

[354] V. Wood, R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne, A. Stewart, J. Sgouros, N. Peat, J. Hayles, S. Baker, D. Basham, S. Bowman, K. Brooks, D. Brown, S. Brown, T. Chillingworth, C. Churcher, M. Collins, R. Connor, A. Cronin, P. Davis, T. Feltwell, A. Fraser, S. Gentles, A. Goble, N. Hamlin, D. Harris, J. Hidalgo, G. Hodgson, S. Holroyd, T. Hornsby, S. Howarth, E. J. Huckle, S. Hunt, K. Jagels, K. James, L. Jones, M. Jones, S. Leather, S. McDonald, J. McLean, P. Mooney, S. Moule, K. Mungall, L. Murphy, D. Niblett, C. Odell, K. Oliver, S. O’Neil, D. Pearson, M. A. Quail, E. Rabbinowitsch, K. Rutherford, S. Rutter, D. Saunders, K. Seeger, S. Sharp, J. Skelton, M. Simmonds, R. Squares, S. Squares, K. Stevens, K. Taylor, R. G. Taylor, A. Tivey, S. Walsh, T. Warren, S. Whitehead, J. Woodward, G. Volckaert, R. Aert, J. Robben, B. Grymonprez, I. Weltjens, E. Vanstreels, M. Rieger, M. Schafer, S. Muller-Auer, C. Gabel, M. Fuchs, A. Dusterhoft, C. Fritzc, E. Holzer, D. Moestl, H. Hilbert, K. Borzym, I. Langer, A. Beck, H. Lehrach, R. Reinhardt, T. M. Pohl, P. Eger, W. Zimmermann, H. Wedler, R. Wambutt, B. Purnelle, A. Goffeau, E. Cadieu, S. Dreano, S. Gloux, et al. The genome sequence of Schizosaccharomyces pombe. Nature, 415(6874):871–80, 2002.
[355] J. Wu and B. L. Miller. Aspergillus asexual reproduction and sexual reproduction are differentially affected by transcriptional and translational mechanisms regulating stunted gene expression. Mol Cell Biol, 17(10):6191–201, 1997.
[356] Z. Yan, X. Li, and J. Xu. Geographic distribution of mating type alleles of Cryptococcus neoformans in four areas of the United States. J Clin Microbiol, 40(3):965–72, 2002.
[357] J. Yang, Z. Gu, and W. H. Li. Rate of protein evolution versus fitness effect of gene deletion. Mol Biol Evol, 20(5):772–4, 2003.
[358] Z. Yang. Estimating the pattern of nucleotide substitution. J Mol Evol, 39:105–111, 1994.
[359] Z. Yang. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 13(5):555–6, 1997.
[360] Z. Yang. Phylogenetic Analysis by Maximum Likelihood (PAML). Version 3.0. London: University College, 2000.
[361] R. F. Yeh, L. P. Lim, and C. B. Burge. Computational inference of homologous gene structures in the human genome. Genome Res, 11(5):803–16, 2001.
[362] G. Yona, N. Linial, and M. Linial. ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res, 28(1):49–55, 2000.
[363] K. Y. Yuen, C. M. Chan, K. M. Chan, P. C. Woo, X. Y. Che, A. S. Leung, and L. Cao. Characterization of afmp1: a novel target for serodiagnosis of aspergillosis. J Clin Microbiol, 39(11):3830–7, 2001.
[364] K. Y. Yuen, G. Pascal, S. S. Wong, P. Glaser, P. C. Woo, F. Kunst, J. J. Cai, E. Y. Cheung, C. Medigue, and A. Danchin. Exploring the Penicillium marneffei genome. Arch Microbiol, 179(5):339–53, 2003.
[365] K. Y. Yuen, S. S. Wong, D. N. Tsang, and P. Y. Chau. Serodiagnosis of Penicillium marneffei infection. Lancet, 344(8920):444–5, 1994.
[366] M. Zagulski, B. Babinska, R. Gromadka, A. Migdalski, J. Rytka, J. Sulicka, and C. J. Herbert. The sequence of 24.3 kb from chromosome X reveals five complete open reading frames, all of which correspond to new genes, and a tandem insertion of a Ty1 transposon. Yeast, 11(12):1179–86, 1995.
[367] E. M. Zdobnov and R. Apweiler. InterProScan–an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 17(9):847–8, 2001.
[368] C. T. Zhang, J. Wang, and R. Zhang. A novel method to calculate the G+C content of genomic DNA sequences. J Biomol Struct Dyn, 19:333–341, 2001.

[369] J. Zhang, Y. P. Zhang, and H. F. Rosenberg. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat Genet, 30(4):411–5, 2002.
[370] L. Zhang, T. J. Vision, and B. S. Gaut. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol Biol Evol, 19(9):1464–73, 2002.
[371] P. Zhang, Z. Gu, and W. H. Li. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol, 4(9):R56, 2003.
[372] R. Zhang and C. T. Zhang. Z curves, an intutive tool for visualizing and analyzing the DNA sequences. J Biomol Struct Dyn, 11:767–782, 1994.