Exploring Genome Structure and Gene Regulation Related to Virulence in Fungal Phytopathogens Using Next Generation Sequencing Techniques

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Jinnan Hu, M.S.

Graduate Program in Plant Pathology

The Ohio State University

2013

Dissertation Committee:

Thomas K. Mitchell, Adviser

Michael J. Boehm

Guo-liang Wang

Kun Huang

Copyright By

Jinnan Hu

2013

Abstract

Over the last decade, a technological revolution has provided enormous advances in the knowledge of complex biological processes largely enabled by the development of next-generation sequencing (NGS) techniques. Applications of NGS include studies of entire genomes, characterization of the entire transcriptome (RNA-Seq), and detection of protein-DNA binding sites (ChIP-Seq). The cost of sequencing a fungal genome, for example, has decreased from more than one hundred thousand dollars to currently only three thousand dollars. With the development of the applications and the affordable cost, NGS is changing the way biologists designing and carrying out research. This dissertation describes developed analysis pipelines of

NGS data in the field of fungal phytopathogens using four different projects as examples. In a genome comparison project, sequenced short reads are de novo assembled to form a genome draft, then gene models are predicted either ab initial or assisted by RNA-Seq, followed by the comparison between genomes at different resolutions such as the nucleotide level and the genome structural level. Two chapters in this dissertation serve as examples of our genome comparison pipeline put to work to address biological questions. In Chapter 2, the Alternaria arborescens sequences of the unique conditionally dispensable chromosome (CDC) were separated from essential chromosomes (EC) using a novel bioinformatics approach. A pair-wise

ii comparison between the CDC and ECs showed that CDC sequences had significant variation and

that it may have been originally acquired through a horizontal gene transfer event. In Chapter 4,

seven field isolates of Magnaporthe oryzae were sequenced and their genome content

compared. Over 10,000 SNP and Indel locations were identified as well as genes under strong

positive selection, which are considered potential virulence related genes. While in a RNA-Seq

analysis pipeline, sequence reads are first assembled de novo or mapped to a reference genome,

and the expression level for individual genes in each sequencing library is calculated to identify

differentially expressed genes. Chapter 3 describes a RNA-Seq analysis project, in which the

transcriptome profile in the dollar spot pathosystem has been sequenced using a combination of

Illumina and Roche 454 NGS technologies. A large number of genes were found up-regulated

during the interaction between Sclerotinia homoeocarpa and Agrostis stolonifera with some

having annotations suggesting their roles in virulence related processes. With regard to

protein-DNA binding, reads from ChIP-Seq experiment are mapped to the reference genome and

the “peak regions” of mapped reads are identified as candidate binding regions, within which binding motifs are predicted. Chapter 5 describes the identifications of the binding sites and motifs of the M. oryzae transcription factor MoCRZ1 using a combination of ChIP-chip and microarray data, and then the prediction accuracy is improved by a novel approach utilizing the spatial distribution pattern of the candidate motifs. Finally, the last chapter summarizes a large

iii scale mutagenesis project to identify two avirulence genes in M. oryzae by generating random

mutants, followed by a pathogenicity screen on rice cultivars containing the corresponding

resistance genes. Although imperfections and challenges remain, this dissertation shows four successful NGS applications in fungal phytopathogens. With the continued development of sequencing techniques and bioinformatics tools, NGS based projects with more sophisticated experimental designs will undoubtedly produce larger and more accurate data to biologists in the near future.

iv Dedication

Dedicated to the students at The Ohio State University

v Acknowledgments

I would like to gratefully and sincerely thank Dr. Thomas Mitchell for his guidance,

understanding, patience, and most importantly, his friendship during my PhD studies. His

mentorship was paramount in providing a well-rounded experience consistent my long-term

career goals. He encouraged me to not only grow as a plant pathologist but also as an instructor

and an independent thinker. For everything you’ve done for me, Dr. Mitchell, I thank you.

I would also like to thank all the three professors served in my dissertation committee. Dr.

Michael Boehm, Dr. Guo-Liang Wang, and Dr. Kun Huang, for their assistance and guidance in getting my graduate career started on the right foot and providing me with the foundation for becoming a qualified scientist.

I would like to thank the faculties of Department of Plant Pathology, and students of PPGSA for

their valuable discussions and friendship. In particular, I would like to thank Dr. Stephen Opiyo

for his data analysis help, Maria Bellizzi and Dr. Songbiao Chen for the guidance and support in

my bench experiments, and Dr. Angela Orshinsky, with whom I worked closely and finished a

publication together.

Finally, and most importantly, I would like to thank my wife Chenxi Chen. Her support,

encouragement, quiet patience and unwavering love were undeniably the bedrock upon which

the past 4 years of my life have been built. I would never reach this far without her company

and support. I thank my parents, for their faith in me and allowing me to be as ambitious as I

wanted. vi Vita

2008…………………………………………………………………B.S. Biochemistry, Nanjing University, China

2008 to present………………………………………………Graduate Research Associate, Department of

Plant Pathology, The Ohio State University

Publications

1. Kim S, Hu J, Oh Y, Park J, Choi J, Lee YH, Dean RA, Mitchell TK: Combining ChIP-chip and expression profiling to model the MoCRZ1 mediated circuit for Ca/calcineurin signaling in the rice blast fungus. PLoS Pathog 2010, 6(5):e1000909. 2. Hu J, Chen C, Peever T, Dang H, Lawrence C, Mitchell T: Genomic characterization of the conditionally dispensable chromosome in Alternaria arborescens provides evidence for horizontal gene transfer. BMC Genomics 2012, 13:171. 3. Orshinsky AM*, Hu J*, Opiyo SO, Reddyvari-Channarayappa V, Mitchell TK, Boehm MJ: RNA-Seq analysis of the Sclerotinia homoeocarpa--creeping bentgrass pathosystem. PloS one 2012, 7(8):e41150. *authors contribute evenly 4. Chen S, Songkumarn P, Venu R, Gowda M, Bellizzi M, Hu J, Liu W, Ebbole D, Mitchell T, Wang GL: Identification and Characterization of In-planta Expressed Secreted Effectors from Magnaporthe oryzae that Induce Cell Death in Rice. Molecular plant-microbe interactions : MPMI 2012. 5. Hu, J. , Chen, C. , Huang, K. and Mitchell, T. (2013) A distribution pattern assisted method of transcription factor discovery for both yeast and filamentous fungi. Advances in Bioscience and Biotechnology, 4, 509-517.

Fields of Study

Major Field: Plant Pathology

vii Table of Contents

Abstract ……………………………………………………………………………………………………………………….…… ii

Dedication ………………………………………………………………………………………………………………………… v

Acknowledgments …………………………………………………………………………………………………….…….. vi

Vita …………………………………………………………………………………………………………………………………. vii

List of Tables ……………………………………………………………………………………………………………..…….. ix

List of Figures ……………………………………………………………………………………………………………..…… xii

Chapter 1: Review of the knowledge obtained from whole genome studies of fungal phytopathogens by applying next-generation sequencing technologies …………………….……... 1

Chapter 2: Genomic characterization of the conditionally dispensable chromosome in

Alternaria arborescens provides evidence for horizontal gene transfer …………………..…….... 39

Chapter 3: RNA-Seq Analysis of the Sclerotinia homoeocarpa – Creeping Bentgrass

Pathosystem ………………………………………….………………………………………………………………..………. 87

Chapter 4: Whole genome sequencing of seven Magnaporthe oryzae field isolates reveals genome content variation and predicted pathogenicity determinants …………………………... 127

Chapter 5: A Distribution Pattern Assisted Method of Transcription Factor Binding Site

Discovery for Both Yeast and Filamentous Fungi ……………………….………………………………………. 158

Chapter 6: Identification of novel Avr genes by Screening of Gain-of-virulence Mutants in rice blast fungus Magnaporthe oryzae …………………………………………………………………………... 187

Conclusions ……………………………………………………………………………………………………………………. 211

References ….………………………………………………………………………………………………………………….. 215

viii List of Tables

Table 1.1. Genomes sequenced of plant pathogen fungi and oomycetes ………………………………… 32

Table 1.2. Comparison of the three mainstream 2nd generation sequencing technologies and

PacBio which is regarded as the 3rd generation sequencing technique. …………………………………… 33

Table 1.3. Comparison of the technique specification and application of RNA-Seq, tilling microarray, and low-throughput techniques …………………………………………………………………………… 34

Table 2.1. Velvet de-novo assembly statistics ………………………..………………………………………………… 67

Table 2.2. Alignment of A. arborescens marker gene contigs and A. brassicicola contigs ………… 68

Table 2.3. Repeat region identification ……………………………………………………………………………………. 69

Table 2.4. Codon usage correlation analysis ……………………………………………………………………………. 70

Table 2.5. Differences in codon usage between CDC and EC genes …………………………………………. 71

Table 2.6. Ka/Ks ratio of CDC protein conversed domains ……………………………………………………….. 72

Table 2.7. General statistical facts of A. arborencens and other sequenced fungi genome ……… 73

Table 2.8. Conserved domains in CDC putative PKS genes ………………………………………………………. 74

Table 2.9. Primers used for Southern hybridization ………………………………………………………………… 75

Table 3.1. Characteristics of the 454 RNA-seq reads …………………………………………………………….. 110

Table 3.2. Mapping characteristics of the Sclerotinia homoeocarpa (SH) and Agrostis stolonifera

ix (AS) SBS reads to the SH and AS transcript assemblies …………………………………..…………………….. 111

Table 3.3. Summary of select Sclerotinia homoeocarpa transcript types that are significantly

increased at 96 hpi after inoculation onto creeping bentgrass ……………………………………………… 112

Table 3.4. Summary of select Agrostis stolonifera transcript types that are significantly increased at 96 hpi after inoculation with Sclerotinia homoeocarpa …………………………………………………….. 113

Table 3.5. Top enriched Sclerotinia homoeocarpa domains at 96 h post inoculation on creeping bentgrass ……………………………………………………………………………………………………………………………… 114

Table 3.6. Top enriched Agrostis stolonifera domains at 96 h post inoculation with Sclerotinia homoeocarpa ……………………………………………………………………………………………………………………….. 115

Table 3.7. Simple sequence repeat (SSR) markers from the Sclerotinia homoeocarpa and Agrostis stolonifera transcriptome ……………………………………………………………………………………………………… 116

Table 4.1. General statistic numbers of the genome assembly and gene prediction ……………... 148

Table 4.2. Top eight genes containing the copy number variation …………………………………………. 149

Table 4.3. Number of SNPs & IDPs identified, their location, and the number of those causing large impact on gene function ………………………………………………………………………………………………. 150

Table 5.1. List of yeast TFs involved in training and testing …………………………………………………………… 174

Table 5.2. TFs in training group …………………………………………………………………………………………………….. 175

Table 5.3. Fkh1 consensus motifs testing result ……………………………………………………………………………. 176

x Table 5.4. Test results of all testing group …………………………………………………………………………………….. 177

Table 5.5. Statistical enrichment test for the three candidate motifs ……………………………………………. 178

Table 5.6. The occurrence of the three candidate motifs in ORFs and upstream regions ………………. 179

Table 5.7. MoCrz1 consensus motifs testing result ……………………………………………………………………….. 180

Table 6.1. Pathogenicity test conducted in Dr. Mitchell’s lab …………………………………………………. 199

Table 6.2. Pathogenicity test conducted in Dr. Pan’s lab ………………………………………………………… 200

Table 6.3. Summary of TAIL-PCR results of different clones generated from Iso_2 and Iso_9 ……….. 201

Table 6.4. Features of 6 candidate Avr genes ………………………………………………………………………… 202

Table 6.5. Calculation of possibility to identify one specific gene or two genes using random insertion approach in different number of mutants generated ……………………………………………… 203

Table 6.6. Primers used in TAIL-PCR ………………………………………………………………………………………. 204

xi List of Figures

Figure 1.1. Number of genomes available for the fungi and oomycete species ……………………..… 35

Figure 1.2. Sequencing cost released by The National Human Genome Research Institute …….. 36

Figure 1.3. Contribution of different stages to the total cost of a genome sequencing project … 37

Figure 1.4. A general pipeline for genome comparison analysis used for this dissertation ………. 38

Figure 2.1. Annotation of the initial two CDC contigs ……………………………………………………………… 76

Figure 2.2. Southern hybridization to validate CDC contigs prediction ……………………………………. 77

Figure 2.3. Global contig alignment between A. arborescens contigs (blue blocks) and A. brassicicola contigs (red blocks) ……………………………………………………………………………………………… 78

Figure 2.4. The Codon Adaptation Index (CAI) distribution of A. arborescens CDC genes compared to EC genes …………………………………………………………………………………………………………………………….. 79

Figure 2.5. BLAST taxonomy report of all CDC genes against NCBI “nr/nt” database ………………. 80

Figure 2.6. GO term of CDC genes …………………………………………………………………………………………… 81

Figure 2.7. Structure of predicted secondary metabolite biosynthesis (SMB) clusters …………….. 82

Figure 2.8. Average Ka/Ks ratio, Ka, and Ks for all aligned EC genes and CDC genes ………………… 83

Figure 2.9. Phylogenetic relationship of proteins in A. arborescens CDC shows discordance compared to EC proteins ………………………………………………………………………………………………………… 84

xii Figure 2.10. Scatter plots of BLAST Score Ratio (BSR) of EC proteins (A) and CDC proteins (B) … 85

Figure 2.11. Diversity of marker genes on CDC and ECs among different Alternaria isolates …… 86

Figure 3.1. Read length of 454 sequencing data. The range of sequence read length (bp) of 454 data is shown as the shaded area …………………………………………………………………………………………. 119

Figure 3.2. Signal tracks of SBS reads for two genes (SH3549 and SH8608) from either the PDA,

PDB, or interaction library …………………………………………………………………………………………………….. 120

Figure 3.3. Venn diagram of RNA-Seq reads unique and common to the SH library, AS library, and

Interaction library …………………………………………………………………………………………………………………. 121

Figure 3.4. Top Species Blast hits: (a) SH transcript library (b) AS transcript library ……………….. 122

Figure 3.5. GO terms associated with up-regulated SH transcripts ………………………………………… 123

Figure 3.6. GO terms associated with up-regulated AS transcripts ………………………………………… 124

Figure 3.7. Comparison of semi-quantitative RT-PCR data to RPKM values …………………………… 125

Figure 3.8. Validation of RPKM data for selected SH and AS transcripts using real time PCR …. 126

Figure 4.1. Dot plots of the alignments among 7 sequenced isolates and reference 70-15 sequence ……………………………………………………………………………………………………………………………… 151

Figure 4.2. Phylogenetic tree of different isolates constructed using genome-wise SNP data . 152

Figure 4.3. Gene pool analysis among the four isolate(s) groups ………………………………………….. 153

Figure 4.4. Copy number variation of the 7 sequenced isolates along genome …………………….. 154

xiii Figure 4.5. Number of Genes with higher/lower copy number compared to 70-15 ………………. 155

Figure 4.6. SNPs & IDPs density (number per 100KB) along genomes of 7 sequenced isolates.156

Figure 4.7. Ka/Ks ratio distribution along genome in the 7 sequenced isolates ……………………… 157

Figure 5.1. The MoCRZ1 binding region identification experiments design ………………………………….. 181

Figure 5.2. Analysis Pipeline for generating model of MDP approach and test it using yeast transcription factor data ………………………………………………………………………………………….………………….. 182

Figure 5.3. Distribution frequency curve of training group …………………………………………………………… 183

Figure 5.4. Frequency curves of Fkh1 candidate motifs ……………………………………………………………….. 184

Figure 5.5. Predicted MoCRZ1 binding motifs identified ………………………………………………………………. 185

Figure 5.6 Frequency curves of MoCrz1 candidate motifs ……………………………………………………………. 186

Figure 6.1. Screening Pipeline …………………………………………………………..…………………………………………. 205

Figure 6.2. Lesions in the first re-inoculation of the 5 isolates from pooled inoculation ……………….. 206

Figure 6.3. Lesions from pathogenicity test conducted in Dr. Mitchell’s lab ………………………………….. 207

Figure 6.4. All four candidate mutants showed strong GFP signal ………………………………………………… 208

Figure 6.5. Position of the T-DNA insertion in Iso_2 and Iso_4 mutants ……………………………………….. 209

Figure 6.6. Structure of the T-DNA inserted …………………………………………………………………………………. 210

xiv

Chapter 1: Review of the knowledge obtained from whole genome studies of fungal

phytopathogens by applying next-generation sequencing technologies

1 With the rapid development of the next-generation sequencing (NGS) technologies, genome sequencing of many organisms in every branch of the life tree has been completed. With regard to fungi and oomycetes, which were often regarded as containing the simplest eukaryotic genomes, the genome of Saccharomyces cerevisiae was the first sequenced in 1996 [1] followed shortly by

Neurospora crassa and Magnaporthe grisea in 2000 and 2002 respectively [2, 3]. Following on the success of these early efforts, more than 100 fungi and oomycete species have had their genomes sequenced, with most completed using NGS technologies. This wealth of data enabled many genome comparison studies, including species that cause diseases of crop plants. Specifically, much work has focused on the identification of virulence determinants and their evolution. NGS technologies have also been applied in studies focusing on the transcriptome (RNA-Seq) or

DNA-protein interactions (ChIP-Seq). This chapter reviews some of the important projects of these efforts, summarizes the biological discoveries, and notes the technical advances made along the way.

Also discussed is the accuracy, reproducibility and cost of NGS studies.

Discoveries made using next-generation sequencing technologies

Genomes sequenced

As of January 2013, there are 2,298 genomes uploaded in the National Center of Biotechnology

Information (NCBI) database for Eukaryotic organism, with 720 of them belong to fungi, of which 406

2 are evolved to the “Complete” category. Of the fungal genomes, 306 belong to the phyla Ascomycota,

70 belong to Basidiomycota, and 32 placed in the “Other” category. Some species have multiple genomes updated for different isolates (Figure 1.1). These are mostly model organisms such as

Saccharomyces cerevisiae, which has genomes available for as many as 52 isolates.

Large variation of genome size

It has been widely known from long before the application of NGS, that fungal genomes are variable with regard to length, but the degree of variability is only now being confirmed following NGS projects. (Table 1.1) Fungal phytopathogens and oomycetes have extreme variability in genome size.

For example, the genome size difference between smut fungi Ustilago maydis (~19-21Mb) and

Phytophthora infestans (~220-280Mb) is as large as nearly 15-fold [4-7]. Generally, filamentous fungal phytopathogens have larger genome than their yeast type relatives [4, 8, 9]. For example, the ascomycete fungi powdery mildews (Golovinomyces orontii) have the largest genome in that phyla – around 160 Mb [8]. For basidiomycetes, the rust fungus (Melampsora larci-populina) has the largest genome at over 89 Mb [9]. Some oomycete pathogens have genomes that are around or larger than

100 Mb, including downy mildew (Hyaloperonospora arabidopsidis) with ~100 Mb genome [10], and the P. infestans clade (~240 Mb) [4, 7]. Compared with pathogens, the genomes of non-pathogenic fungi sequenced so far are more likely to stay at the size of 40 Mb, such as Aspergillus oryzae at 37

Mb [11], Neurospora crassa at 41 Mb [3] and Schizophyllum commune at 39 Mb [12]. The closest

3 non-pathogen relatives of oomycetes – diatoms – have the genome size ranging from 56 Mb for

Aureococcus anophagefferens [13] to 27 Mb for Phaeodactylum tricornutum [14]. However, the trend of fungal phytopathogens towards larger genomes is not absolute, and some filamentous pathogens actually have relatively small genomes, possibly because of gene loss as in Albugo laibachii [15], intron loss as in U. maydis [5], or reduced transposon content in Sclerotinia sclerotiorum [16].

An interesting point discovered through NGS projects was the number of gene models do not correlate with the genome size. An example is the Blumeria graminis which has the genome about

6-fold larger than that of U. maydis but contains fewer gene models [5, 8]. Reasons for this lack of correlation is the fact that most expansion in phytopathogen genomes is not the coding regions but repetitive DNA such as low complexity repeats or transposons [2, 4, 8-10, 17]. In some extreme cases, the amount of repetitive DNA can reach as high as 74% in P. infestans and 65% in Leptosphaeria maculans.

“Lineage-specific” chromosomes & sequences

An interesting finding associated with the genome size variation is “lineage-specific” chromosomes, genome regions, and gene families. Firstly, comparative genome studies revealed that large genomic regions may be variable among isolates of a given species. Category of these variable regions is

4 unique chromosomes are referred to as supernumerary or conditionally dispensable, because they are not typically required for saprophytic growth [18-20]. These chromosomes have been identified in many fungi including Magnaporthe oryzae [2, 21, 22], Fusarium oxysporum [23], Nectria haematococca [24, 25], Mycosphaerella graminicola [26], Cochliobolus heterostrophus [27],

Leptosphaeria maculans [28], and Alternaria alternata [29, 30]. In addition, “lineage-specific” regions can also be found as part of a chromosome, or even distributed in multiple chromosomes as single or multiple copies [25, 31-34]. Some gene families were identified only in the pathogenic species but not in their non-pathogen relatives, which are then regarded as pathogenicity related genes [32]. Such expanded genes may encode putative transporters and lytic necessary for pathogens to infect hosts. A recent study on the the poplar leaf rust M. larci-populina and wheat stem rust P. graminis f. sp. tritici found that around 26% of the predicted genes are lineage-specific, including peptidases, lipases, glycosyl , ion superoxide dismutases, oligopeptide membrane transporters, kinases and transcription factors [9]. Additionally secreted protein coding genes were found upregulated during infection [9]. By comparison, biotrophic fungal species often lack or have a reduced number of these gene families [5].

In some cases, the lineage-specific genes are found sequestered on the “lineage specific” chromosomes, making the question of their evolution more straight forward. For example, in a study of Magnaporthe oryzae, a 1.68 Mb isolate specific genome region was identified, with 3 AVR genes were found in the region [21]. Another example is the pea-pathogenic form of N. haematococca, which contains genes involved in host nutrient utilization and resistance to plant antimicrobials, that

5 all locate on the “lineage-specific” chromosome [24, 35, 36].

Pathogenicity gene cluster

In fungal phytopathogens, genes encoding or being involved in pathogenicity sometimes co-locate in clusters [37, 38]. A typical cluster may include a set of co-regulated genes, or genes that code for biosynthetic enzymes, transcription factors or transporters [39]. An example is the gene cluster in

Fusarium graminearum, which contains 10-12 genes locates next to each other and 2 genes locate outside the cluster [40]. In Alternaria alternata, metabolite gene clusters were found on the

“lineage-specific” chromosomes, such as AF-toxin, and AK-toxin [41, 42]. A slightly different case in U. maydis genes encoding secreted proteins were not located in a large cluster but in small sub-clusters containing 3-26 co-regulated genes dispersed throughout the genome. Deletion of nine genes in the clusters resulted changing of virulence towards its host maize [5, 6].

Not all the pathogenicity related genes are clustered, but most of them show some preference in genome location, including “lineage-specific” chromosome or “conditionally dispensable chromosome” (CDC), and telomere-proximal regions. One shared feature of these regions rich with virulence related genes is the break or absence of synteny with their related species when the genome sequences are aligned [43]. As mentioned above, host-specific toxin genes in Alternaria pathogens are located at the CDC [44], while three AVR genes were found in the 1.68 Mb additional

6 genomic regions of the rice blast fungus Magnaporthe oryzae [21]. Subtelomeric regions are also locations preferred by virulence related genes, since usually they tent to evolve at a higher rate than sequences closer to the centromere [43, 45, 46]. An example also from M. oryzae is the avirulence gene Pita (Avr-Pita), which was found originally to reside in an unstable location only 48bp from the telomeric repeats of Chromosome 3 [45]. It was then reported recently that multiple translocation events were identified for Avr-Pita in multiple M. oryzae species, with this gene found most of the time located near or within retrotansposon-rich subtelomeric regions [47].

Genome structures

Focusing on the regions outside coding sequences, researchers found some interesting points of genome structure: gene-sparse regions, isochore-like regions, and transposable elements. Several filamentous fungal phytopathogens were found to contain a discontinuous distribution pattern of gene density along genome: gene-dense regions and gene-spares regions. This pattern is particularly obvious in the genome of the oomycete Phytophthora infestans, in which two types of regions could be distinguished by the length of flanking noncoding regions. Roughly 2,000 gene-spares regions, which typically contain less than 10 genes, were identified [4, 7, 48]. It should be noted that gene-dense and gene-sparse regions do not differ in GC-content. Differences in GC-content, however, generated “isochore-like regions” in the genome of L. maculans – pathogen of Brassica spp. In this case a sharply changing of GC-content in the isochore-like region was identified compared to the rest

7 of genome, with these regions lacking of coding genes, and have only 148 out of 12,469 (1.2%) genes located in them [17]. It has been proposed that both gene-sparse regions and isochore-like regions were created by invasion of transposable elements [17, 49].

Transposable elements (TE) were often proposed as a driver of phytopathogen genome evolution because of the fact that they mediate chromosomal rearrangements by ectopic recombination. So studies applying NGS, and focusing on TE location were performed, with the goal to elucidate mechanism of TE events and the genome changes they cause. The first finding is the extreme variation of the proportion of TEs in different genomes. In F. graminearum, only 0.1% of total genome and only two classes of TEs were identified [43]. U. maydis is another example of a genome that lacks a large number TEs with only 1.1% presented [5]. Other fungi show modest amounts of

TEs, such as 9.7% in M. oryzae [2], and 28% in F. oxysporum [23]. On the other side, powdery mildew fungi show about 64% TEs which caused huge genome expansion and possible gene losses as it has only ~5900 genes identified [8]. TEs were found enriched in supernumerary chromosomes and telomere-proximal regions [43, 50], as being reported in F. oxysporum f.sp. lycopersici with 74% of the TEs located on the supernumerary chromosomes [23].

8 Mechanisms leading to genome modification

Single nucleotide mutations

Genome structures can be modified by a variety of mechanisms throughout evolutions. Although the mechanism could not be directly studied by NGS genome comparison studies, data from analyzing genome sequences can be used as evidence to support or reject assumptions of what the driving force is. DNA point mutations are one of the most direct mechanisms that lead to genome sequence changing and the alternation of protein function. Usually more than 10,000 Single-nucleotide polymorphisms (SNPs) can be identified in a fungal phytopathogen genome when multiple isolates are compared to a reference. SNPs are not distributed evenly along genome, but in fact several “hot spots” can be identified near telomeres and discrete AT-rich chromosome regions, as showed in the

F. graminearum genome [43, 51]. Additionally the ratio of synonymous to non-synonymous SNPs

(dN/dS) is valuable and can be calculated to serve as a signature of local selection force. The dN/dS ratio can be calculated for individual genes, a genome regions, or the whole genome, with the higher the ratio representing the stronger positive selection force [7, 15, 26, 52]. Usually high values are observed in regions correlated with high TE density or in the CDCs. Individual genes, or domains of particular genes, showing high dN/dS values can often be the ones associated with pathogenicity or virulence effectors. As reported in the RXLR containing effector genes in oomycetes, C-terminal domains showed much higher dN/dS values than the other domains [53]. Another mechanism driving genome sequence change is “repeat-induced point” (RIP) mutations. RIP occurs in a process

9 that mutates a duplicated DNA sequences from CpA to TpA and from TpG to TpA in meiosis [54, 55].

This mutation process has been found in most ascomycete fungi but not oomycetes to date, and only one basidiomycetes Pucciniomycotina subphylum [56]. The function of RIP mutations were generally regarded as deactivation of the duplicated genes by introducing premature stop codons or non-synonymous substitutions. However, this process sometimes affects the flanking sequences outside the duplicated region and thus introduces mutation to the “wild type” sequences [57, 58]. In

Fusarium spp., the higher mutation rate contributed by RIP is speculated to accelerate the evolution of the effector genes [57]. RIP has also being found in L. maculans and may contribute to the generation of isochore-like regions in the genome [17]. Conversely, powdery mildew fungi do not have RIP mutation machinery, and its expanded genome size may be a result of this ommission [8].

Horizontal gene transfer

Horizontal gene transfer (HGT) is another important mechanism leading to large genome reorganization. HGT is the movement, without recombination, of stable genetic material between two individuals [59]. HGT may not only occur between different individuals of the same species, but also between species or even between bacteria and fungi or between fungi and oomycetes [60, 61].

In fungi, the movement of plasmids, mycoviruses, transposable elements, gene clusters, and whole chromosomes have been demonstrated from one individual to another [62]. The most well studied example of HGT in fungi is the movement of the ToxA gene from the wheat blotch pathogen

10 Stagonospora nodorum to Pyrenophora tritici-repentis, the causal agent of tan spot of wheat [63, 64].

This horizontal transfer event was identified by nucleotide sequence similarity and structural comparisons between genes from both species. The direction of transfer was inferred by the fact that the ToxA gene consisted of a single haplotype in P. tritici-repentis but 11 haplotypes in S. nodorum isolates.

Technical development of the next-generation sequencing technologies

The mainstream NGS techniques

Three sequencing methods developed by Roche, Illumina, and SOLid are generally regarded as the second generation sequencing techniques. Other techniques also exist in the market, such as Pacific

Biosciences, but have not been widely adopted. The core of these three approaches is the amplification of input DNA as single molecule so the number copies can reach a scale large enough to be detected. There are millions of DNA molecules to be amplified simultaneously but each molecule has to be amplified within a closed system. (Table 1.2)

The Roche’s 454 sequencing platform amplifies DNA inside water droplets in an oil solution. Each droplet containing a single DNA template connected to a single primer-coated bead, which then forms a colony by so called “emulsion PCR”. The sequencing platform contains a large number of picolitre-volume wells with the size that only one single bead could fit in with the requesite

11 sequencing enzymes. Pyrosequencing uses luciferase to generate a light signal to detect individual nucleotides added to the nascent DNA strand, and the combined light signal records are then used to generate sequence read-outs. 454 sequencing usually generates long sequencing reads length, usually around 400-500 bp, easy for de novo assembly and detection of RNA alternate splicing. But it is difficult to read polymers (continues A/C/G/T) because the fluorescent intensity and length of polymers is not in linear relationship. (Information acquired from official website http://www.454.com)

Illumina’s DNA sequencing technology is called “sequence-by-synthesis”. DNA molecules are first connected to primers on a flow cell surface. Then single molecules are amplified to form clusters, each of which contains thousands of the same nucleotide molecules. Then the four types of Dideoxy nucleoside triphosphates (ddNTPs) are added and non-incorporated nucleotides are washed away. In the next step, DNA is extended by one nucleotide at a time and images of the fluorescently labeled nucleotides are captured by a camera. Then a dye with the 3’ terminal chemically removed is added so that the nucleotide sequence is ready for another cycle. The advantage of Illumina’s SBS sequencer includes its fully automation procedures and relatively low sequencing cost compared to other methods. Sequencing coverage can reach very deep level, so it is good for RNA-Seq and mixed sample sequencing. However, it’s based on a reversible reaction, so sequencing quality goes low when number cycle is high; in other words it only generates short reads which makes it difficult for de novo assembly. (Information acquired from official website http://www.illumina.com)

12 SOLiD sequencing technology uses “sequence-by-ligation”. The process is similar to that used by 454 sequencing, where the DNA is first amplified by emulsion PCR before nucleotide detection.

Therefore, each of the beads containing copies of the same DNA molecule are then deposited on a glass slide. All oligonucleotides of a fixed length are pooled and labeled based on the first two nucleotides. Oligonucleotides are then annealed and ligated; the preferential ligation by DNA for matching sequences results in a signal detection of the nucleotide at that position. In SOLiD, every position in genome is detected twice, so it possesses higher sequencing quality and accuracy, which makes it a good option for SNP detection. But like Illumina’s sequencer, reads are too short for de novo assembly. (Information acquired from official website http://www.appliedbiosystems.com)

Transcriptome detection techniques

While sequencing detects and reports genome information, it does not contain expression data, which at times can be more impacting to the phenotype of the orgasm. Many types of transcriptome detection techniques exist, including SAGE and MPSS, which are tag-based approaches, and microarray and RNA-Seq which are so called “high throughput” approaches. (Table 1.3)

To perform SAGE (Serial Analysis of Gene Expression) a researcher must first isolate short tags from all mRNA transcripts in a cell, then link multiple tags in tandem and clone the newly formed chimeric

DNA strand into E. coli. Following sequencing, tags are mapped to genes and expression level is

13 determined by enumerating tags for each gene. One most important advantage of SAGE is it does not require a priori knowledge of genome structure as only part of genome sequences are required.

Other advantages include its high efficient and ability to detect unpredicted ORFs, anti-sense transcripts and alternate splicing forms. However, not all genes have the required NIaIII site, and different genes may contain the same SAGE tag, which will mislead the downstream analysis [65].

The main Principle behind MPSS (Massively Parallel Signature Sequencing) is to isolate short tags from all mRNA transcripts in a cell, then ligate with an adapter and sequencing primers. Finally after

PCR is performed within beads, they are loaded onto a flow cell for high throughput sequencing. By applying high throughput sequencing methods, the cost is reduced for the amount of data obtained.

However, this technique has the similar limitations as SAGE [66].

In a tiling microarray application in which probes were designed along genome with a certain density, researchers firstly generate a chip containing oligonucleotide probes in each spot, then extract mRNA and label with fluorescence. Dyes cDNAs are then loaded to the premade chips. At this time, the probe binds specifically to target cDNA sequence. After washing away the unbound cDNA sequences, the amount of cDNA bound to each cell is detected by the fluorescence intensity [67].

There are many types of chips on the market, including spotted vs in situ synthesized arrays (such as provided by Agilent and Affymetrix), and two-channel vs. one-channel detections. The general advantages of microarray technologies are that they are relatively mature techniques, which means it is easier be designed to detect specific gene groups and easy for bioinformatics analysis. However,

14 a successful application of microarray need to know gene sequences to design probes. Also, microarrays designed for a whole transcriptome detection are expensive compared to RNA-seq.

The RNA-Seq technique is a newly developed technique that enables researchers to get insight of whole transcriptome in a single sequencing row [68-71]. To perform RNA-Seq, a cDNA library is made from extracted mRNA, then the cDNA is fragmented and ligated with adapters and primers to be sequenced by NGS sequencing, typically 454 or illumina, to generate millions of short reads representing the transcriptome. Following sequencing, there are generally two ways to process the short reads, either align them to the reference genome or perform a de-novo assembly, if no proper reference genome is available. From the mapping data or assembly contigs, researchers derive gene models and the expression level for individual genes in genome scales [72].

Although RNA-Seq is a recently developed technology and is now under active development, it provides some key advantages over other transcriptome detection methods. First, and maybe the most important advantage, is it can be applied to organisms without a mature genome sequence available. For example, a 454-based RNA-Seq approach has been used to sequence the transcriptome of the Glanville fritillary butterfly, and transcriptome contig libraries were de novo assembled to aid in gene model annotation [73]. Another advantage is the accuracy provided by

RNA-Seq: the resolution can reach single base pair sensitivity, which makes it especially useful to detect exon boundaries. The third advantage is that it has a wide detection range compared to microarrays. Because mapping of 75bp reads could be unique if less than two mismatches are

15 allowed, so RNA-Seq has very low background noise. Additionally there’s no upper limitation to the expression level, since the number of mapped reads can be highly correlated to the expression level.

As a result, RNA-Seq has a very large dynamic range of expression level detection, estimated to be greater than 9,000-fold in a recent yeast transcriptome study [70].

Despite the advantages of RNA-Seq, there are still challenges both in the wet lab and dry lab studies.

Unlike small RNAs, which can be sequenced directly, mRNA usually requires a fragmentation step before sequencing and those manipulations complicate further analysis. For example, bias created during RNA fragmentation over the transcript body [70] and bias during cDNA fragmentation towards the identification of sequences from the 3’ ends [68]. Also, in the amplification stage, PCR artifacts may be generated and it’s very difficult to distinguish these from the highly expressed transcripts reads. There are also bioinformatics challenges, like other NGS technologies, that include retrieving and processing the large amount of data efficiently. Further, accurate tools specifically designed to map reads, remove bias, and calculate expression level need to be developed.

Cost of Next-generation sequencing study

The cost of sequencing the first genome of human reached roughly $3 billion, and human labor costs were as high as hundreds of researchers in several international institutes all spending 13 years to complete the project (National Human Genome Research Institute

16 http://www.genome.gov/11006943) . The advance of NGS technology has now made the price drop dramatically during the last decade. James Watsons’ genome was completed for less than $1 million in the year 2007 [74]. By 2009, cost to conduct a human genome sequencing dropped to $100,000

(http://www.genomesunzipped.org/2011/06/guest-post-by-clive-brown-the-disruptive-power-of-ch eap-dna-sequencing.php). Today, after nearly 10 year development, the cost to generate a genome assembly of fungi is as low as surprisingly $3,000. The goal to sequence a genome for only $1,000

[75, 76] seems very close.

One method to analyze the cost of NGS projects is to use the National Human Genome Research

Institute (NHGRI) database, which tracks the cost of sequencing projects it funds (Figure 1.2). The data shows that the cost of sequencing one base pair dropped quickly after 2008, much faster than predicted by Moore’s law, which states that the number of transistors of an integrated circuit doubles roughly every 2 years. However, the NHGRI only calculates the direct cost of sequencing while ignoring other project components, which include continuously developing and maintaining sequencing and bioinformatics tools to analyse data, as well as project control to ensure the quality of analysis.

The cost of a NGS project can be divided roughly into 4 parts: 1) project design, sample collection, and preparation; 2) sample sequencing; 3) data storage and reduction; 4) downstream analysis. After sequencing costs decreased, the cost from other 3 parts (1,3,4) of a project, especially the sample preparation and downstream analysis, became the largest contributors to the total cost (Figure 1.3).

17 NGS projects have currently passed the stage of simply generating a draft genome assembly or even simple genome comparisons of two related species, but now encompass large scale genome comparisons including hundreds of individuals or transcriptome analyses involving multiple individuals and more than ten conditions with replications. In such projects, samples are carefully collected and prepared to ensure accuracy, which often requires collaboration of several labs and escelating costs. The downstream analysis is the most important step as this is where scientific inquiring happens. However, downstream analysis not only requires significant computational resources, but also expensive manpower to install and configure the tools to build pipelines and explain the output results. Confounding the matter, proper tools are often not available and thus new ones need to be developed, adding significant time and cost.

Challenge in the bioinformatics analysis of NGS projects

One important evaluation of scientific research is the reproducibility, which means others should be able to repeat a procedure and obtain the same results based on the methods and materials in the publication. However, for NGS projects, it’s often difficult or even impossible for others to repeat published works. In a survey included 19 recent NGS projects, only 6 provided the name and version of the tools they used in the analysis stages, while others either failed to provide the version of tools or failed to even provide name of tools they used [77]. In another survey of 50 publications using

BWA (a popular aligner), only 7 of them provided detailed version and parameter setting used in the

18 project [77]. Obviously, others could not repeat the analysis without knowing the name of tools, but version and parameters used in the project are necessary. Since some of the widely used tools are under continual development, newly released versions are not only for fixing bugs, but also modification of algorithms occurs. Parameters are also playing fatal roles in analysis, and sometimes even a modest change will dramatically change the results. For example, the mismatches allowed in mapping reads will change the number of predicted SNPs. As such, publications need to include detailed parameters, and the actual default settings.

Another problem in NGS projects is that the customized analysis pipeline makes the results difficult to compare with each other. Since each tool is designed to function well for specific organisms, specific sequencing techniques, or answering specific questions, the result may be not reliable. One possible method to enhance the reproducibility and comparability is the adoption of the analysis pipeline applied and recommended by the 1000 Genome project [78]. If every research team carries the exact same pipeline from 100 Genome project and standardizes every step, then the results will be easily repeated and comparable. However, it’s impracticable for everyone to adopt the same pipeline, either because it is difficult to install and configure the particular tools, or in some stages additional information is required, which is actually not available for most of the non-model organisms.

A compromised solution is to choose the most reasonable tools, but provide detailed documentation for every step in the analysis pipeline. Luckily some tools are being developed to integrate the most

19 used functions and generate documentation with parameters automatically. These “frameworks”, include: Bioextract [79], Galaxy [80], GenePattern [81], GeneProf [82], Mobyle [83] and others, which actually bring together diverse tools under a unified interface.

A general pipeline for genome comparison analysis used for this dissertation

As mentioned above, there are no “one-fit-all” analysis pipelines for all NGS research, since the sequencing technique, organism, and research questions vary dramatically between projects. Here is a generalized pipeline generated and used in the following chapters of this dissertation (Figure 1.4).

Each stage in the pipeline is described, with the hurdles at that stage highlighted.

Stage 1: Sequencing reads manipulation

1) Quality checking and filtering: It’s important to check the average quality score of FastQ files to

insure they are within the normal range (>35). If a deep enough coverage of sequencing reads is

achieved, 20%-50% low quality score reads and especially those at each end of short reads are

trimmed, this process is especially important if these reads are going to map to a reference

genome assembly.

2) Adapter removal: Some reads contain sequencing adapter nucleotides. It is important to remove

these sequences if mapping is the next step.

Hurdle at this step: It’s difficult to perform the filtering work while still keep paired-end sequences in 20 the right order. An additional script in Velvet [84] could pick out paired reads from filtered files in their original paired order and input them into downstream assembly.

Stage 2: De novo assembly

1) Find the most appropriate assembler that performs best and is most compatible with the

sequencing files. For the projects in this dissertation, all available assemblers were tested

including Velvet, Edena [85], and SOAPdenovo [86].

2) Optimize parameters: The range of major parameters for assembly were decided based on the

manual description and reads length. Then testing runs were performs to optimize the

parameters.

Hurdle at this step: Formatting problems arise when merging results from Velvet and Edema using

Minimus2 [87].

Stage 3: Gene prediction & annotation

1) Find the most appropriate gene predictor: the gene predictor should have a high prediction

accuracy in comparison to other available tools and have a pre-trained matrix. For the projects in

this dissertation, Augustus [88] for Magnaporthe species and FGENESH [89] for Alternaria

species were used.

2) Convert gene prediction result to Fasta file format.

3) Annotate genes using Blast2GO [90], a batch blasting software.

21 4) Assign GO terms to the gene sequences using Blast2GO.

Hurdle at this step: FGENESH is online only and has a limited upload volume. Blast2GO becomes unstable when too many genes were analyzed at the same time. It is sometimes necessary to divide data into small pieces and combined them after processing.

Stage 4: Genome characteristic comparison

1) Calculate general statistics such as GC-content, GC3-content [49], average gene length, average

exon numbers by using CLC sequence viewer or in-house written script.

2) Codon usage bias detection: calculate indicator numbers such as RSCU [91] and CAI [92].

Stage 5: Alignment & Structure variation identification

1) Find proper tool for alignment: not many alignment tools are capable enough to conduct contig

verses contig alignment, Mummer suite [93] was chosen for the research in this dissertation.

2) Identification of Structure variation: There are two types of variation: 1) present/absent variation

which is easily detected by alignment; 2) relocation variation: this can be identified by first

aligning entire contigs and then dividing contigs into smaller pieces and aligning each to compare

results with the original whole contig. Any difference may suggest a relocation event.

Hurdle at this step: It’s subjective to a set cut-off to determine alignment; and segmented contigs make it difficult to detect relocation events at a large scale.

22 Stage 6: SNP identification

1) Mapping short reads to reference: this is now pretty mature process; BWA [94], bowtie [95], and

SOAPaligner [96] all perform well and rapidly complete mapping since they use basically the

same algorithms.

2) Calling SNP: It’s better to use a SNP caller that takes quality scores of each base as extra

information such as SOAPsnp.

Hurdle at this step: It’s pretty easy and fast to generate SNP files, however, there are usually more than 10,000 SNPs at the genome scale for fungi, so the hurdle is how to analyze them? For the work in this dissertation, each SNP was labeled as synonymous/non-synonymous.

Stage 7: Phylogenetic analysis

1) Choose sequences: It’s important to choose sequences from different species that have homology and can be aligned. It is ideal to select several sequences from each species to avoid bias. For this dissertation, 6 genes were selected in the Alternaria chapter and more than 10,000 SNP markers were selected in the Magnaporthe comparative chapter.

2) Choose methods: There are multiple algorisms to generate a phylogenetic tree, choose one that best fit your purpose. For this dissertation, maximum likeliy-hood approach was selected for all phylogenetic analysis.

Stage 8: Selection force calculation 23 1) Extract sequences showing homology from different isolates identified by BLAST.

2) Perform a codon alignment: which translate ORFs to protein first and do alignment, then go back

to align nucleotide sequences based on protein alignment. Prank [97] is the alignment tool used

in this dissertation.

3) For projects in this dissertation, input alignment files into Codeml tools in the PAML suite [98]

and choose most appropriate model (M7 or M8) to calculate dN/dS.

24 References

1. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M et al: Life with 6000 genes. Science 1996, 274(5287):546, 563-547. 2. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H et al: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434(7036):980-986. 3. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma LJ, Smirnov S, Purcell S et al: The genome sequence of the filamentous fungus Neurospora crassa. Nature 2003, 422(6934):859-868. 4. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, Cano LM, Grabherr M, Kodira CD, Raffaele S, Torto-Alalibo T et al: Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 2009, 461(7262):393-398. 5. Kamper J, Kahmann R, Bolker M, Ma LJ, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Muller O et al: Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature 2006, 444(7115):97-101. 6. Schirawski J, Mannhaupt G, Munch K, Brefort T, Schipper K, Doehlemann G, Di Stasio M, Rossel N, Mendoza-Mendoza A, Pester D et al: Pathogenicity determinants in smut fungi revealed by genome comparison. Science 2010, 330(6010):1546-1548. 7. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, Thines M, Jiang RHY, Zody MC, Kunjeti SG, Donofrio NM et al: Genome Evolution Following Host Jumps in the Irish Potato Famine Pathogen Lineage. Science 2010, 330(6010):1540-1543. 8. Spanu PD, Abbott JC, Amselem J, Burgis TA, Soanes DM, Stuber K, van Themaat EVL, Brown JKM, Butcher SA, Gurr SJ et al: Genome Expansion and Gene Loss in Powdery Mildew Fungi Reveal Tradeoffs in Extreme Parasitism. Science 2010, 330(6010):1543-1546. 9. Duplessis S, Cuomo CA, Lin YC, Aerts A, Tisserant E, Veneault-Fourrey C, Joly DL, Hacquard S, Amselem J, Cantarel BL et al: Obligate biotrophy features unraveled by the genomic analysis of rust fungi. Proceedings of the National Academy of Sciences of the United States of America 2011, 108(22):9166-9171. 10. Baxter L, Tripathy S, Ishaque N, Boot N, Cabral A, Kemen E, Thines M, Ah-Fong A, Anderson R, Badejoko W et al: Signatures of Adaptation to Obligate Biotrophy in the Hyaloperonospora arabidopsidis Genome. Science 2010, 330(6010):1549-1551. 11. Machida M, Asai K, Sano M, Tanaka T, Kumagai T, Terai G, Kusumoto K, Arima T, Akita O, Kashiwagi Y et al: Genome sequencing and analysis of Aspergillus oryzae. Nature 2005, 438(7071):1157-1161. 12. Ohm RA, de Jong JF, Lugones LG, Aerts A, Kothe E, Stajich JE, de Vries RP, Record E, Levasseur A, Baker SE et al: Genome sequence of the model mushroom Schizophyllum commune. Nat Biotechnol 2010, 28(9):957-963. 13. Gobler CJ, Berry DL, Dyhrman ST, Wilhelm SW, Salamov A, Lobanov AV, Zhang Y, Collier JL, Wurch LL, Kustka AB et al: Niche of harmful alga Aureococcus anophagefferens revealed

25 through ecogenomics. Proceedings of the National Academy of Sciences of the United States of America 2011, 108(11):4352-4357. 14. Bowler C, Allen AE, Badger JH, Grimwood J, Jabbari K, Kuo A, Maheswari U, Martens C, Maumus F, Otillar RP et al: The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 2008, 456(7219):239-244. 15. Kemen E, Gardiner A, Schultz-Larsen T, Kemen AC, Balmuth AL, Robert-Seilaniantz A, Bailey K, Holub E, Studholme DJ, Maclean D et al: Gene gain and loss during evolution of obligate parasitism in the white rust pathogen of Arabidopsis thaliana. PLoS Biol 2011, 9(7):e1001094. 16. Amselem J, Cuomo CA, van Kan JA, Viaud M, Benito EP, Couloux A, Coutinho PM, de Vries RP, Dyer PS, Fillinger S et al: Genomic analysis of the necrotrophic fungal pathogens Sclerotinia sclerotiorum and Botrytis cinerea. PLoS genetics 2011, 7(8):e1002230. 17. Rouxel T, Grandaubert J, Hane JK, Hoede C, van de Wouw AP, Couloux A, Dominguez V, Anthouard V, Bally P, Bourras S et al: Effector diversification within compartments of the Leptosphaeria maculans genome affected by Repeat-Induced Point mutations. Nat Commun 2011, 2:202. 18. Covert SF: Supernumerary chromosomes in filamentous fungi. Current genetics 1998, 33(5):311-319. 19. VanEtten H, Jorgensen S, Enkerli J, Covert SF: Inducing the loss of conditionally dispensable chromosomes in Nectria haematococca during vegetative growth. Current genetics 1998, 33(4):299-303. 20. E.B. W: Studies on Chromosomes. V. The chromosomes of Metapodius. A contribution to the hypothesis of the genetic continuity of chromosomes. J Exp Zool 1906(6):147-205. 21. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S et al: Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. The Plant cell 2009, 21(5):1573-1591. 22. Chuma I, Tosa Y, Taga M, Nakayashiki H, Mayama S: Meiotic behavior of a supernumerary chromosome in Magnaporthe oryzae. Current genetics 2003, 43(3):191-198. 23. Ma LJ, van der Does HC, Borkovich KA, Coleman JJ, Daboussi MJ, Di Pietro A, Dufresne M, Freitag M, Grabherr M, Henrissat B et al: Comparative genomics reveals mobile pathogenicity chromosomes in Fusarium. Nature 2010, 464(7287):367-373. 24. Han Y, Liu X, Benny U, Kistler HC, VanEtten HD: Genes determining pathogenicity to pea are clustered on a supernumerary chromosome in the fungal plant pathogen Nectria haematococca. The Plant journal : for cell and molecular biology 2001, 25(3):305-314. 25. Coleman JJ, Rounsley SD, Rodriguez-Carres M, Kuo A, Wasmann CC, Grimwood J, Schmutz J, Taga M, White GJ, Zhou S et al: The genome of Nectria haematococca: contribution of supernumerary chromosomes to gene expansion. PLoS genetics 2009, 5(8):e1000618. 26. Stukenbrock EH, Jorgensen FG, Zala M, Hansen TT, McDonald BA, Schierup MH: Whole-genome and chromosome evolution associated with host adaptation and speciation of the wheat pathogen Mycosphaerella graminicola. PLoS genetics 2010,

26 6(12):e1001189. 27. Tzeng TH, Lyngholm LK, Ford CF, Bronson CR: A restriction fragment length polymorphism map and electrophoretic karyotype of the fungal maize pathogen Cochliobolus heterostrophus. Genetics 1992, 130(1):81-96. 28. Leclair S, Ansan-Melayah D, Rouxel T, Balesdent M: Meiotic behaviour of the minichromosome in the phytopathogenic ascomycete Leptosphaeria maculans. Current genetics 1996, 30(6):541-548. 29. Johnson LJ, Johnson RD, Akamatsu H, Salamiah A, Otani H, Kohmoto K, Kodama M: Spontaneous loss of a conditionally dispensable chromosome from the Alternaria alternata apple pathotype leads to loss of toxin production and pathogenicity. Current genetics 2001, 40(1):65-72. 30. Hatta R, Ito K, Hosaki Y, Tanaka T, Tanaka A, Yamamoto M, Akimitsu K, Tsuge T: A conditionally dispensable chromosome controls host-specific pathogenicity in the fungal plant pathogen Alternaria alternata. Genetics 2002, 161(1):59-70. 31. Tyler BM, Tripathy S, Zhang X, Dehal P, Jiang RH, Aerts A, Arredondo FD, Baxter L, Bensasson D, Beynon JL et al: Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science 2006, 313(5791):1261-1266. 32. Powell AJ, Conant GC, Brown DE, Carbone I, Dean RA: Altered patterns of gene duplication and differential gene gain and loss in fungal pathogens. BMC Genomics 2008, 9:147. 33. Ellwood SR, Liu Z, Syme RA, Lai Z, Hane JK, Keiper F, Moffat CS, Oliver RP, Friesen TL: A first genome assembly of the barley fungal pathogen Pyrenophora teres f. teres. Genome biology 2010, 11(11):R109. 34. Seidl MF, Van den Ackerveken G, Govers F, Snel B: Reconstruction of oomycete genome evolution identifies differences in evolutionary trajectories leading to present-day large gene families. Genome Biol Evol 2012, 4(3):199-211. 35. Temporini ED, VanEtten HD: An analysis of the phylogenetic distribution of the pea pathogenicity genes of Nectria haematococca MPVI supports the hypothesis of their origin by horizontal transfer and uncovers a potentially new pathogen of garden pea: Neocosmospora boniensis. Current genetics 2004, 46(1):29-36. 36. Rodriguez-Carres M, White G, Tsuchiya D, Taga M, VanEtten HD: The supernumerary chromosome of Nectria haematococca that carries pea-pathogenicity-related genes also carries a trait for pea rhizosphere competitiveness. Applied and environmental microbiology 2008, 74(12):3849-3856. 37. Osbourn A: Secondary metabolic gene clusters: evolutionary toolkits for chemical innovation. Trends Genet 2010, 26(10):449-457. 38. Palmer JM, Keller NP: Secondary metabolism in fungi: does chromosomal location matter? Curr Opin Microbiol 2010, 13(4):431-436. 39. Yu JH, Keller N: Regulation of secondary metabolism in filamentous fungi. Annual review of phytopathology 2005, 43:437-458. 40. Proctor RH MS, Alexander NJ, Desjardins AE.: Evidence that a secondary metabolic

27 biosynthetic gene cluster has grown by gene relocation during evolution of the filamentous fungus Fusarium. Molecular microbiology 2009, 74(5):1128-1142. 41. Nakatsuka S, Ueda, K., Goto, T., Yamamoto, M., Nishimura, S. and Kohmoto, K.: Stucture of AF-toxin II, one of the host-specific toxins produced by Alternaria alternata strawberry pathotype. Tetrahedron Lett 1986, 27:2753-2756. 42. Nakashima T, Ueno, T., Fukami, H., Taga, T., Masuda, H., Osaki, K., et al.: Isolation and stuctures of AK-Toxin I and II, host-specific phytotoxic metabolites produced by Alternaria alternata Japanese pear pathotype. Agric Biol chem 1985, 49:807-815. 43. Cuomo CA, Guldener U, Xu JR, Trail F, Turgeon BG, Di Pietro A, Walton JD, Ma LJ, Baker SE, Rep M et al: The Fusarium graminearum genome reveals a link between localized polymorphism and pathogen specialization. Science 2007, 317(5843):1400-1402. 44. Hu J, Chen C, Peever T, Dang H, Lawrence C, Mitchell T: Genomic characterization of the conditionally dispensable chromosome in Alternaria arborescens provides evidence for horizontal gene transfer. BMC Genomics 2012, 13:171. 45. Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B: A telomeric avirulence gene determines efficacy for the rice blast resistance gene Pi-ta. The Plant cell 2000, 12(11):2019-2032. 46. Skamnioti P, Pedersen C, Al-Chaarani GR, Holefors A, Thordal-Christensen H, Brown JK, Ridout CJ: Genetics of avirulence genes in Blumeria graminis f.sp. hordei and physical mapping of AVR(a22) and AVR(a12). Fungal Genet Biol 2008, 45(3):243-252. 47. Chuma I, Isobe C, Hotta Y, Ibaragi K, Futamata N, Kusaba M, Yoshida K, Terauchi R, Fujita Y, Nakayashiki H et al: Multiple Translocation of the AVR-Pita Effector Gene among Chromosomes of the Rice Blast Fungus Magnaporthe oryzae and Related Species. Plos Pathogens 2011, 7(7). 48. Raffaele S, Win J, Cano LM, Kamoun S: Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC Genomics 2010, 11:637. 49. Jiang RH, Govers F: Nonneutral GC3 and retroelement codon mimicry in Phytophthora. J Mol Evol 2006, 63(4):458-472. 50. Thon MR, Pan H, Diener S, Papalas J, Taro A, Mitchell TK, Dean RA: The role of transposable element clusters in genome evolution and loss of synteny in the rice blast fungus Magnaporthe oryzae. Genome biology 2006, 7(2):R16. 51. Baer CF, Miyamoto MM, Denver DR: Mutation rate variation in multicellular eukaryotes: causes and consequences. Nat Rev Genet 2007, 8(8):619-631. 52. Stukenbrock EH, Bataillon T, Dutheil JY, Hansen TT, Li R, Zala M, McDonald BA, Wang J, Schierup MH: The making of a new pathogen: insights from comparative population genomics of the domesticated wheat pathogen Mycosphaerella graminicola and its wild sister species. Genome research 2011, 21(12):2157-2166. 53. Win Jea: Adaptive evolution has targeted the C-terminal domain of the RXLR effectors of plant pathogenic oomycetes. The Plant cell 2007, 19:2349-2369.

28 54. Selker EU, Cambareri EB, Jensen BC, Haack KR: Rearrangement of duplicated DNA in specialized cells of Neurospora. Cell 1987, 51(5):741-752. 55. Clutterbuck AJ: Genomic evidence of repeat-induced point mutation (RIP) in filamentous ascomycetes. Fungal Genet Biol 2011, 48(3):306-326. 56. Horns F, Petit E, Yockteng R, Hood ME: Patterns of repeat-induced point mutation in transposable elements of basidiomycete fungi. Genome Biol Evol 2012, 4(3):240-247. 57. Rep M, Kistler HC: The genomic organization of plant pathogenicity in Fusarium species. Current opinion in plant biology 2010, 13(4):420-426. 58. Van de Wouw AP, Cozijnsen AJ, Hane JK, Brunner PC, McDonald BA, Oliver RP, Howlett BJ: Evolution of linked avirulence effectors in Leptosphaeria maculans is affected by genomic environment and exposure to resistance genes in host plants. PLoS Pathog 2010, 6(11):e1001180. 59. Syvanen m: Cross-species gene transfer; implications for a new theory of evolution. Journal of Theoretical Biology 1985, 112(2):333-343. 60. Schmitt I, Lumbsch HT: Ancient horizontal gene transfer from bacteria enhances biosynthetic capabilities of fungi. PloS one 2009, 4(2):e4437. 61. Richards TA, Dacks JB, Jenkinson JM, Thornton CR, Talbot NJ: Evolution of filamentous plant pathogens: gene exchange across eukaryotic kingdoms. Current biology : CB 2006, 16(18):1857-1864. 62. Rosewich UL, Kistler HC: Role of Horizontal Gene Transfer in the Evolution of Fungi. Annual review of phytopathology 2000, 38:325-363. 63. Timothy L Friesen EHS, Zhaohui Liu, Steven Meinhardt, Hua Ling, Justin D Faris1, Jack B Rasmussen, Peter S Solomon, Bruce A McDonald & Richard P Oliver: Emergence of a new disease as a result of interspecific virulence gene transfer. Nature Genetics 2006, 38:953-956. 64. Markham JE, Hille J: Host-selective toxins as agents of cell death in plant-fungus interactions. Molecular plant pathology 2001, 2(4):229-239. 65. Hu M, Polyak K: Serial analysis of gene expression. Nat Protoc 2006, 1(4):1743-1760. 66. Reinartz J, Bruyns E, Lin JZ, Burcham T, Brenner S, Bowen B, Kramer M, Woychik R: Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief Funct Genomic Proteomic 2002, 1(1):95-104. 67. Johnson JM, Edwards S, Shoemaker D, Schadt EE: Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet 2005, 21(2):93-102. 68. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320(5881):1344-1349. 69. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 2008, 453(7199):1239-1243.

29 70. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621-628. 71. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 2008, 133(3):523-536. 72. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63. 73. Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 2008, 17(7):1636-1647. 74. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT et al: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452(7189):872-876. 75. Bennett ST, Barnes C, Cox A, Davies L, Brown C: Toward the 1,000 dollars human genome. Pharmacogenomics 2005, 6(4):373-382. 76. Service RF: Gene sequencing. The race for the $1000 genome. Science 2006, 311(5767):1544-1546. 77. Nekrutenko A, Taylor J: Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 2012, 13(9):667-672. 78. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061-1073. 79. Lushbough CM, Brendel VP: An overview of the BioExtract Server: a distributed, Web-based system for genomic analysis. Advances in experimental medicine and biology 2010, 680:361-369. 80. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology 2010, 11(8):R86. 81. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500-501. 82. Halbritter F, Vaidya HJ, Tomlinson SR: GeneProf: analysis of high-throughput sequencing experiments. Nat Methods 2012, 9(1):7-8. 83. Neron B, Menager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S, Tuffery P, Letondal C: Mobyle: a new full web bioinformatics framework. Bioinformatics 2009, 25(22):3005-3011. 84. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 2008, 18(5):821-829. 85. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome research 2008, 18(5):802-809. 86. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome

30 research 2010, 20(2):265-272. 87. Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC bioinformatics 2007, 8:64. 88. Stanke M, Morgenstern B: AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic acids research 2005, 33:W465-W467. 89. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome research 2000, 10(4):516-522. 90. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21(18):3674-3676. 91. Sharp PM, Tuohy TM, Mosurski KR: Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic acids research 1986, 14(13):5125-5143. 92. Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic acids research 1987, 15(3):1281-1295. 93. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome biology 2004, 5(2):R12. 94. Lai JS, Li RQ, Xu X, Jin WW, Xu ML, Zhao HN, Xiang ZK, Song WB, Ying K, Zhang M et al: Genome-wide patterns of genetic variation among elite maize inbred lines. Nature Genetics 2010, 42(11):1027-U1158. 95. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology 2009, 10(3):R25. 96. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(15):1966-1967. 97. Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(30):10557-10562. 98. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution 2007, 24(8):1586-1591.

31

Organism/Name SubGroup Natural Host Genome GC% Predicted Size (Mb) Genes Phytophthora ramorum Oomycete Oak 54.427 53.9 - Phytophthora sojae Oomycete Soybean 79.327 54.6 28142 Phytophthora infestans Oomycete Potato,Tomato 190.142 51 19344 Ustilago maydis 521 Basidiomycetes Maize 19.8578 53.7 6671 Sporisorium reilianum SRZ2 Basidiomycetes Maize 18.4769 59.5 6806 Puccinia striiformis f. sp. tritici Basidiomycetes Wheat 64.799 44.5 - PST-130 Melampsora larici-populina 98AG31 Basidiomycetes Poplar 97.683 41 16380 Pyrenophora teres f. teres 0-1 Ascomycetes Barley 33.581 50.9 11799 Leptosphaeria maculans JN3 Ascomycetes Crucifers 45.1246 - 12469 Sclerotinia sclerotiorum 1980 UF-70 Ascomycetes Multiple 38.204 41.8 14637 Erysiphe pisi Ascomycetes Pea 69.276 39.2 - Blumeria graminis f. sp. hordei DH14 Ascomycetes Grasses 98.284 44 - Magnaporthe oryzae 70-15 Ascomycetes Rice 40.9791 - 13032 Verticillium albo-atrum VaMs.102 Ascomycetes Multiple 30.331 56 10488 Verticillium dahliae VdLs.17 Ascomycetes Multiple 32.975 55.8 10811 Fusarium oxysporum f. sp. lycopersici Ascomycetes Multiple 61.4707 48.2 - 4287 Nectria haematococca mpVI 77-13-4 Ascomycetes Multiple 51.23 50.8 15708 Alternaria arborescens EGS 39-128 Ascomycetes Tomato 33.89 50.9 -

Table 1.1. Genomes sequenced of plant pathogen fungi and oomycetes.

32

NGS Sequencing Read Reads per Company Amplification Claim to fame technique strategy length run First Next-Gen Sequencer 454 Roche Synthesis emPCR 400 bp 1 million long reads First short-read sequencer Up to 640 Hi-Seq Illumina Synthesis BridgePCR 100 bp current leader in million advantages Second short-read Life Up to 1.4 SOLiD Ligation emPCR 50 bp sequencer Technologies billion low error rates First real-time Pacific PacBio Synthesis None 2 Kb 45,000 single-molecule Biosciences sequencing

Table 1.2. Comparison of the three mainstream 2nd generation sequencing technologies and PacBio which is regarded as the 3rd generation sequencing technique.

33

Technology RNA-Seq Tilling Microarray cDNA or EST sequencing Principle NGS Hybridization Sanger sequencing Resolution Single base ~ 10-100 bp Single base Throughput High High Low Genomic sequences Not necessary Necessary Not necessary Background noise Low High Low Detection range High Low Low Isoforms detection yes limited Yes Allelic detection yes limited Yes Amount of input RNA Low High High

Table 1.3. Comparison of the technique specification and application of RNA-Seq, tilling microarray, and low-throughput techniques.

34

Number of genomes available in NCBI 60 50 40 30 20 10 0

Figure 1.1. Number of genomes available for the fungi and oomycete species. These species all have multiple isolates being sequenced.

35

Figure 1.2. Sequencing cost released by The National Human Genome Research Institute (NHGRI).

(http://www.genome.gov/sequencingcosts/) Blue line with dots in it represents the cost per raw

MB of DNA sequencing performed in the institute. The while line represents the expectation made

by Moore’s Law.

36

100% 90% 80% Project design and sample collection 70% Sample sequencing 60% 50% Data storage and reduction 40% 30% Downstream analysis 20% 10% 0% Pre-NGS Now Future

Figure 1.3. Contribution of different stages to the total cost of a genome sequencing project over time. Note the sharp shrink of the component of sequencing stage. The three time points have interval of approximately ten years. (Sboner A, et al.: The real cost of sequencing: higher than you think! Genome Biology 2011, 12:125)

37

Figure 1.4. A general pipeline for genome comparison analysis used for this dissertation. Blue square frames represent analysis steps, Oval frames represent outside data source, purple frames represent tools used for analysis in this dissertation.

38

Chapter 2: Genomic characterization of the conditionally dispensable chromosome in Alternaria

arborescens provides evidence for horizontal gene transfer

39 Introduction

The rapid development of next-generation sequencing technologies over the past decade has led to a flood of both de-novo sequencing and re-sequencing projects in almost every branch of the tree of life. Within the fungal kingdom, comparative genome studies have led to the unexpected finding that large genomic regions may be variable among isolates of a given species. One category of these variable regions is unique chromosomes referred to as supernumerary or conditionally dispensable because they are not typically required for saprophytic growth [1-3]. These chromosomes have been identified in many fungi including Magnaporthe oryzae [4-6], Fusarium oxysporum [7], Nectria haematococca [8, 9], Mycosphaerella graminicola [10], Cochliobolus heterostrophus [11],

Leptosphaeria maculans [12], and Alternaria alternata [13, 14].

Plant pathogenic fungi in the genus Alternaria infect a remarkable range of host plants and are major causes of agricultural yield losses [15]. Conditionally dispensable chromosomes (CDCs) are carried by several of the small-spored, plant-pathogenic Alternaria species [13, 14, 16]. These chromosomes are generally less than 2.0 MB in size, and may be transmitted horizontally between isolates in a population, potentially conferring new pathogenic attributes to the receiving isolate [17-20]. Loss of the CDC can also occur during repeated sub-culturing, resulting in the transition from a pathogenic to saprophytic form of the fungus [13]. Several genes coding host specific toxins (HSTs) have been located to gene clusters on CDCs, including those producing AF-toxin from the strawberry pathotype

[21], AK-toxin from the Japanese pear pathotype [22], and ACT-toxin from the tangerine pathotype

40 [23]. These toxins share a common 9,10-epoxy-8-hydroxy-9-methyl-decatrienoic acid structural moiety, with the genes encoding each toxin sharing a high degree of homology [21-25]. In addition, the AMT gene from the apple pathotype, a gene involved in host-specific AM-toxin cyclic peptide biosynthesis, is located on a small chromosome of 1.1 to 1.7 Mb [13, 26], with at least four copies involved in AM-toxin biosynthesis [27]. The only other gene sequences identified to date on CDCs are extended families of tranposon-like sequences (TLSs) [14].

Horizontal gene transfer (HGT) is the movement, without recombination, of stable genetic material between two individuals [28]. HGT may not only occur between different individuals of the same species, but also between species or even between bacteria and fungi or between fungi and oomycetes [29, 30]. In fungi, the movement of plasmids, mycoviruses, transposable elements, gene clusters, and whole chromosomes have been demonstrated from one individual to another [31]. The first theory to explain gain and loss of HSTs was proposed in 1983 [32]. It has then been hypothesized that the genome content of CDCs in Alternaria species were acquired through HGT events [14]. The most well studied example of HGT in fungi is the movement of the ToxA gene from the wheat blotch pathogen Stagonospora nodorum to Pyrenophora tritici-repentis, the causal agent of tan spot of wheat [33, 34]. This horizontal transfer event was identified by nucleotide sequence similarity and structural comparisons between genes from both species. The direction of transfer was inferred by the fact that the ToxA gene consisted of a single haplotype in P. tritici-repentis but 11 haplotypes in S. nodorum isolates.

41 Alternaria arborescens (synonym A. alternata f. sp. lycopersici), the fungus that produces host-specific AAL toxin, is the causal agent of stem canker of tomato [35, 36]. It has been observed in pulsed field gel electrophoresis (PFGE) studies that A. arborescens carries one CDC of 1.0-Mb [16, 37].

To date, only two genes have been reported to be carried on this CDC including ALT1, which is a PKS gene involved in AAL toxin biosynthesis [38, 39], and AaMSAS, also a PKS gene [40, 41]. A CDC deletion mutant of A. arborescens generated through restriction mediated integration (REMI) showed a toxin- and pathogenicity-minus phenotype [41]. In addition, in protoplast fusion experiments, a CDC from A. arborescens was observed to transfer into the strawberry pathotype, and subsequently introduced new tomato pathogenicity to the fusant [41].

In this study, we used a next generation sequencing approach to produce a (90X) draft sequence of the A. arborescens genome and used a novel bioinformatic approach to separate CDC contigs from the essential chromosome (EC) contigs. The gene content of the CDC was analyzed to answer the following questions: (1) What is the difference between the CDC and EC genome content at the nucleotide level? (2) Are CDC genes under positive selection and could they represent additional virulence factors in addition to the known toxin encoding genes? (3) Is the evolutionary history of the CDC the same as that of the ECs, and is there any evidence of a HGT event? In answering these questions, we confirmed a different genome content pattern of the A. arborescens CDC and found evidence for HGT.

42 Results

Sequencing & assembly

A. arborescens strain EGS 39-128 (CBS 102605) [42] was sequenced by a whole genome shotgun

approach using the Illumina Genome Analyzer II, which resulted in ~50 million paired-end short

reads of 75 bp representing 90X average coverage of the predicted genome content. De-novo

assembly was performed using Velvet [43] (version 0.7), and confirmed by Edena [44] and Minimus2

[45]. The assembly resulted in 1,332 contigs with a N50 of 624KB and total size of 34.0MB (Table 2.1;

Assembly has been deposited at DDBJ/EMBL/GenBank under the accession AIIC00000000. The

version described in this paper is the first version, AIIC01000000.) One hundred thirty-seven large

contigs with lengths greater than 10 KB and representing 98% of the genome assembly content were

chosen for further analysis.

Marker-assisted identification of contigs carrying toxin biosynthetic genes

The first challenge in analyzing the CDC was to isolate its assembly contigs away from EC contigs. For this genome, the process was made more challenging as there is no defined reference genome, few genetic markers, and no optical mapping. It is known from previous studies of Alternaria CDC that most Alternaria species, including the isolate used in this study, have a single CDC [16, 37]. To begin assembly of this chromosome, two previously identified CDC genes that belong to the toxin

43 biosynthetic cluster, ALT1 and AaMSAS [38, 40], were used as markers to search all contigs. Through this strategy, two putative CDC contigs of 15 KB and 48 KB in length were identified as containing

ALT1 and AaMSAS, respectively. These two contigs were annotated to identify PKS genes, other toxin biosynthetic genes, as well as genes with orthologs in other fungi and bacteria (Figure 2.1). Multiple putative HST genes were identified on both contigs, consistent with predictions based on previous reports [14].

Identification of the remaining CDC contigs and validation by Southern hybridization

To identify additional CDC contigs, the Alternaria brassicicola (Ab) genome sequence was used as a

reference (downloaded from The Genome Institute at Washington University). A. brassicicola is a

related species to A. arborescens (Aa) but does not carry CDCs. All contigs from Aa were aligned to

Ab contigs using MUMmer [46] as the alignment tool with an identity cut-off of 90%. Eight

previously identified marker genes from Aa, 6 from the EC and 2 from CDC, were used to set criteria

to distinguish contigs belonging to CDCs versus ECs [41] (Table 2.2). After comparing the alignments

of contigs containing the 8 markers genes, we set a CDC contig cutoff as those contigs with less than

20% coverage of sequence aligned to Ab with higher than 90% identity (including both coding and

non-coding regions). Through this method, 29 predicted CDC contigs were identified with total

length of 1.0 MB, same as the expected size from clamped homogenous electric fields (CHEF) gel

analysis [16, 37] (Figure 2.2). The remaining 108 contigs were considered essential chromosomes

44 (ECs) with total genome size of 32.3 MB (Figure 2.3). Validation that the selected contigs belong to

the CDC was performed by Southern hybridization using genes predicted to reside on the CDC and

EC contigs respectively as probes (Figure 2.2C-F). Five probes were hybridized including 4 from the

CDC and 1 from the EC. Each probe gene is predicted to be present in a single copy. The first probe

was ALT1, a toxin gene known to reside on the CDC [41]. This was followed by hybridizing with three

CDC genes predicted from the de-novo assembly, including a transporter CDC_92, a polysaccharide

export protein CDC_102, and an o-methyltransferase CDC_147. The fifth probe was EC_97_90_g721,

a gene annotated as PKS and predicted to reside on the ECs.

Gene prediction, length, GC3-content, and repeat identification

Nine thousand, one hundred sixty-seven genes were predicted by FGENESH [47] using pre-trained

Alternaria parameters, of which 209 genes were assigned to CDC contigs and 8958 to EC contigs. The average length of each predicted gene was 1.8 KB, and the gene density was 3.7 KB per gene.

Compared to gene predictions for A. brassicicola (average gene length=1.3KB, gene density=3.0KB per gene), A. arborescens genes were longer and present in lower density. To evaluate the origin of the CDC, the predicted genes residing on the CDC and EC contigs were compared at the nucleotide level, including gene length, GC3-content [48], repeat load, and codon usage bias. This analysis showed that CDC genes are about 200 bp shorter on average than EC genes (P=2.36E-09) and have significantly lower GC3-content (P=0.028). Repeat regions composed 5.3% of CDC contigs while only

45 0.6% of EC contigs (Table 2.3). It should be noted that some repeat regions could be lost in short read sequences de-novo assemblies, however even with possible suppressed numbers, this result indicates approximately 10X repeat enrichment in the CDC compared to the EC.

Codon usage analysis

Codon usage comparisons of CDC and EC genes were used to determine whether a bias in codon usage exists between the two groups. Both the Codon Adaptation Index [49] (CAI, P=0, Figure 2.4) and Relative Synonymous Codon Usage (RSCU) [50] correlation (P=1.14E-14, Table 2.4) from the two groups were significantly different, suggesting a different origin. The largest codon usage bias was observed for the amino acids Tyrosine, Lysine, and Asparagine, with a preference for A over G and T over C in CDC genes (Table 2.5). These three amino acids were not biased for CDCs in Fusarium [7], indicating there's no universal codon usage bias pattern between CDCs from Alternaria and Fusarium and ECs.

Annotation of EC genes

The assembly results show the size of essential chromosomes region collectively to be 33.0 MB with

8958 predicted genes. RepeatMasker identied only 0.12% of the EC region as simple repeats (about

50bp in length) and 0.08% as low complexity, indicating that short repeats may be lost during de 46 novo assembly of Illumina sequencing reads. For secreted protein identification, 1099 (12.2%) of the

EC proteins were predicted to contain signal peptides, and were functionally annotated using BLAST to the NCBI database with more than 98% of the genes returning at least one hit with an E-value <

1.0E-3. From the BLAST results, we identified 212 transcription factors, 98 oxidase proteins, 202 kinase proteins, 279 transporters, 81 Cytochrome P450s, and 45 different proteases.

Annotation of CDC genes

Several host-specific toxin genes and transposon-like sequences have been reported to be carried by

CDCs in Alternaria [14]. We used two methods to annotate the functions of resident CDC genes: (1) they were blasted against the NCBI non-redundant database as well as Pfam [51] and NCBI CDD [52] to search for functional domains; (2) they were scanned to identify transcription factors, PKS genes,

NRPS genes, P450s, transporters, and pathogenicity related genes.

From 160 NCBI BLASTN hits of putative CDC genes (Figure 2.5), the top five species matches were

Pyrenophora teres, Pyrenophora tritici-repentis, Phaeosphaeria nodorum, Leptosphaeria maculans,

and Nectria haematococca, all of which are fungal phytopathogens. Interestingly, A. alternata was

ranked 7th in this list, demonstrating that CDC genes were more similar to genes present in other

fungal species rather than other Alternaria spp. Moreover, besides N. haematococca, all these other

fungi are closely related taxonomically belonging to the class, Dothideomycetes.

47 Gene ontology terms were assigned to CDC genes based on BLAST matches with sequences whose function was previously characterized [53]. Ninety CDC genes were assigned to a biological process,

51 for molecular function, and 15 for cellular component (Figure 2.6). Among the biological process assignments, 54% of genes were assigned to “metabolic process”, and 10% to “biosynthetic process”.

Enrichment of metabolic and biosynthetic process in CDC genes as compared to EC genes supported the observation that Alternaria CDC genes are enriched for polyketide synthases (PKS) and toxin synthases. Molecular function terms showed a significant percentage (39%) to “nucleotide/nucleic acid binding”, which shows an enrichment of transcription factors and gene regulation elements.

To provide a more detailed characterization of putative CDC genes, each was translated to identify protein families. Among the 209 predicted CDC proteins, 31 were identified as carrying PKS domains.

Two proteins were found to carry highly modular domains: KS-AT-KR-ACP on CDC_141 and

KS-AT-DH-ER-KR-ACP on CDC_165. The remaining 29 PKS proteins each carried 1 or 2 ACPs (Acyl carrier protein) domains. Seven proteins were found to carry NRPS domains: 3 Enterobactin domains,

2 Bacitrancin domains, 1 Pyochelin domain, and 1 CDA1 domain. Two proteins were identified as hybrid PKS-NRPS. Seven proteins were identified as P450 monooxygenase proteins. For transcription factors, 24 proteins were characterized to contain TF domains, in which Zn2Cys6 was the prominent group. Multiple ADP/ATP transporters, ABC transporters, ion transporters and major facilitator superfamily (MFS) transporters were also found in CDC protein group. Additionally, it was found that multiple proteins carrying FAD binding domains and . Finally, 37 proteins were identified as putative pathogenicity related genes through scanning CDC genes in the pathogen-host

48 interactions database (PHI-base) [54] (E-value < 0.05). See supplementary information for a complete CDC gene annotation list.

Secondary metabolite biosynthetic gene clusters

In fungi, it has been reported that genes responsible for secondary metabolite biosynthesis (SMB) may be clustered [4, 55]. Typically these include PKS or NRPS genes, as well as genes responsible for structural modifications of initial metabolites, for transport, and for transcription regulation [56]. In this study, we screened each CDC gene and those surrounding them, looking for evidence of clustering of PKS, NRPS, transcription factors, transporters, P450 proteins, FAD binding proteins, , and oxidoreductases. We identified 10 putative SMB clusters (Figure 2.7). A typical SMB cluster is formed by 3-6 genes, with 1 or 2 PKS or NRPS genes, and other metabolite syntheses related genes.

Evolutionary selection of CDC genes and domains

To estimate selection on CDC and EC genes, Ka/Ks ratios were calculated, with the assumption that genes with Ka > Ks were likely under positive selection, genes with Ka = Ks were likely evolving neutrally, and genes with Ka < Ks were likely under purifying (negative) selection. Twenty-eight CDC and 6,036 EC genes were successfully aligned to A. brassicicola genes and Ka and Ks values were 49 calculated for each. It was observed CDC genes had about a double Ka (0.08/0.043) and larger Ka/Ks ratios (0.133/0.084) than EC genes (Figure 2.8), possibly indicating greater positive selection on CDC genes. The two CDC genes with highest Ka/Ks ratio was CDC_102 (PKSs) and CDC_146

(phosphotransferase). However, no CDC genes showed Ka/Ks>1, suggesting that in these two

Alternaria species strong positive selection may only occur in specific regions of a protein. The selection ratio only at conserved domains of CDC genes was then estimated. Domains of aligned CDC genes were identified using the NCBI CDD database. Each individual domain was extracted then the

Ka/Ks ratio was calculated and compared to that from same full length protein (Table 2.6). We found two domains from CDC_151 with a higher Ka/Ks ratio compared to whole length protein: a 12x increase for the haloacid dehalogenase-like hydrolases (HAD_like) domain, which uses a nucleophilic aspartate in their phosphoryl transfer reaction, and 2x increase for heavy-metal-associated (HMA) domain, which transports or detoxifies heavy metals. Another interesting example was CDC_144, whose domain patatin-like phospholipase (Pat17, belonging to the alpha-beta family) showed a 1.5x increase compared to whole length protein.

Origin of CDC

In the taxonomy report of CDC BLAST results, top hits came primarily from closely related dothideomycete fungi, indicating CDC may have fungal origin, or a transfer event occurred between

CDC content and one or more fungal genomes. To test whether CDCs have the same phylogenetic

50 placement with ECs, a phylogenetic analysis was conducted, including EC and CDC genes from A. arborescens, genes from A. brassicicola, and from three other ascomycete species: P. tritici-repentis,

L. maculans, and A. oryzae. Proteins coded by 6 genes showing homology in all 6 groups were used to build a distance tree (Figure 2.9) using the neighbor-joining method [57]. Results show the CDC clade was within but basal to the two Alternaria clades.

A BLAST score ratio (BSR) [58] analysis was performed to test whether individual proteins on the A. arborescens CDC had more similarity to A. brassicicola or other fungi, and the result was compared with the same analysis to EC proteins. Complete genome protein sequences of three fungal species:

P. tritici-repentis, L. maculans, and A. oryzae were extracted and built into a library called “3-fungi”, representing proteins from closely related fungal species. Then proteins from the CDC and ECs were compared to the A. brassicicola protein library and “3-fungi” protein library respectively (Figure 2.10).

It was clear, that there was less divergence between EC proteins and A. brassicicola proteins compared to that with other fungi (35.5% vs 15.1%), consistent with the species phylogeny. In contrast, CDC proteins are more similar to proteins from other fungi (16.7% vs 10.0%), suggesting they have different evolutionary history other than EC proteins.

51 Discussion

Comparison of CDCs in other sequenced fungal genomes

Sequenced genomes of phytopathogenic fungi have been increasing quickly due to the availability of the next-generation sequencing technologies and lowered costs. Genome size of phytopathogenic fungi ranges from ~20MB (Sporisorium reilianum) [59] to 150MB (Erysiphe pisi) [60] , but the average are around 40MB. Biotrophs usually have larger genome size than necrotrophs or hemibiotrophs

[61]. Within the Dothideomycete class, there are 13 genome projects with public data available including A. brassicicola from the JGI fungal genomics program (Table 2.7). Most species have genome sizes ranging from 30MB to 45MB except Mycosphaerella fijiensis which has a 74 MB genome (produced at the Joint Genome Institute of the United States Department of Energy in 2007).

A. arborescens has a 34MB genome, smaller than most other species but larger than A. brassicicola, and also has the smallest gene numbers of model and relatively short gene lengths.

Compared to other recently published assemblies of CDCs in filamentous fungi, A. arborescens has a relatively small number of CDCs (one) and CDC size (1.0Mb). M. graminicola, is reported to have the highest number of dispensable chromosomes with upwards of 8 ranging in size from 0.39 to 0.77MB

[62]. Three CDCs in N. haematococca [9, 63], and 4 complete CDCs and partial sequences of another

2 in F. oxysporum [7] were identified. In other Alternaria species, identified CDCs are relatively larger such as 1.05Mb in the strawberry pathotype [14], 1.1 to 1.7 Mb (depending on strains) in the apple pathotype [13], and 4.1 Mb in the Japanese pear pathotype [64, 65]. In A. arborescens, only 1

52 dispensable chromosome is present, representing only 3% of the genome content, which is significantly smaller than other cases and may suggest a more recent acquisition or different origin.

PKS and NRPS clusters

Phytopathogenic fungi produce a diverse array of secondary metabolites, including host-selective toxins conferring pathogenicity [66]. It was reported in two basidiomycete maize pathogens, the candidate effector genes are located in small clusters that are dispersed throughout the genome [67].

However, in other fungi, especially ascomycetes, genes coding for toxins can co-locate in clusters consisting of more than 10 contiguous genes. A well-known example is the trichothecene biosynthetic gene cluster in F. graminearum, which contains 10-12 genes including a terpene synthase gene, P450 monooxygenase genes, acyl genes, regulatory genes, and transporter genes [55]. While in A. fumigatus 26 SMB clusters were identified, each contains 5-48 genes [68]. In our study, 29 PKS, 5 NRPS, and 2 hybrid PKS-NRPS genes were found on the CDC, with larger density compared to other fungi. However, among 10 predicted SMB clusters, most were relatively small and only carried 3-8 genes, which may not represent the true cluster size due to short contigs length that may divide one large cluster into two or more. One example of an identified cluster was located on contig Node_309, which consists of 5 genes, including 2 PKSs, 1 NRPS putatively coding for enterobactin, a phosphate transferase gene, and a MFS transporter. It lacks regulators, P450s, and transporters compared to other typical clusters. However, this cluster locates

53 at the edge of the contig. Only 5 genes away from this cluster, another small cluster was identified containing a PKS, NRPS, P450s and an ABC transporter, suggesting these two could be part of a larger cluster. In this study, PKS genes were identified by screening the PKS sequence database, especially the domain database, which include: ketoacyl synthase (KS), acyl transferase (AT), ketoreductase (KR), (DH), enoyl reductase (ER), and acyl carrier protein (ACP, also known as PP domain). The

KS, AT, and ACP domains are essential for PKS function [69]. Two PKS genes were identified to have multiple domains above: KS-AT-KR-ACP in CDC_141, and KS-AT-DH-ER-KR-ACP in CDC_165. The remaining 29 PKS genes each carries 1 or 2 ACP domains. Despite these conserved domains, other domains carried by these genes were divergent, indicating variance and multifunction of each PKS genes (Table 2.8). However, at least 3 domain families were found to be enriched in the identified

PKS genes: ABC_membrane (4 identified), NADB_Rossmann (7 identified), and P-loop NTPase (6 identified), suggesting these proteins are transmembrane and may catalyze enzymatic reactions. In the NRPS and hybrid PKS-NRPS gene group, enterobactin, bacitracin, pyoverdine, syringomycin, and

CDA1 domains were identified, 4 of which were reported from bacteria [70-73]. We eliminated the possibility of these genes originating from bacterial sequencing contamination by BLAST comparing all assembly contig sequences against the NCBI All Bacterial database with 2017 genome sequences, and found that the species with most hits was Streptomyces coelicolor with > 80% identity. However, only 0.7% of the entire S. coelicolor genome was covered. Indicating that either these genes have an origin from bacteria or their product proteins interact with each other and require a highly conserved structure that was retained during evolution.

54 Comparison of CDC genes from different isolates

After comparing CDC genes with EC genes and other species, the divergence of CDCs in different A. alternata isolates was examined. Sequences of 8 genes, 2 from CDC and 6 from EC [39]were downloaded from NCBI GeneBank. For each gene, homologous were extracted for each from isolates collected worldwide, including both pathogens and non-pathogens (for EC gene only). We also used homologs of marker genes in A. arborescens and a phylogenetic analysis was conducted to calculate the mean distance among different isolates (Figure 2.11). While the 6 EC genes showed some diversity, the 2 CDC genes were almost identical among all isolates, suggesting they may be transmitted among different isolates and had a different evolutionary history than EC genes.

Horizontal gene transfer

According to the horizontal gene transfer hypothesis, A. arborescens may have acquired its CDC from

another Alternaria species, from a fungus other than Alternaria, or possibly from a bacterium or

virus [74]. There are at least two possible explanations for its origin: (1) CDCs were present in an

Alternaria ancestor, but were independently lost during vertical transmission in other

non-pathogenic Alternaria species. (2) CDCs arose from the essential chromosome as a copy first but

then went under divergence so no obvious orthology could be detected. To test which of the three

models fits this case best, we built a complete EC protein library and blasted all CDC proteins against

55 it to detect any possible orthology. Out of 209 CDC proteins, we found 12 (5.7%) showing orthology

to EC proteins. Although the low orthology percentage alone could not exclude the “duplication and

divergence” model, taken together with differences on GC3-content and codon usage bias, the

possibility that this model fits is minimal.

To distinguish between HGT and vertical transmission hypothesis, we identified differences between

A. arborescens CDC and EC genes in length, GC3-content, and codon usage bias. There was limited orthology detected between two groups; CDC genes showed discordant phylogenetic relation with

EC, and had higher similarity to other fungi than A. brassicicola. From previous phylogenetic analysis of 13 A. alternata isolates collected worldwide, CDC genes from different isolates were almost identical despite diverse EC background [41]. Taken together these results, we concluded that the

HGT model may serve as the best fit model in this case. Additionally, these data support the 1983 theory proposed by Nishimura that Alternaria species acquired HSTs by HGT [32].

In this study, we identified evidence for the possibility of HGT event occurred in A. arborescens. For

Alternaria, this strategy has its advantages. First, as a pathogen with a wide host range, as observed in nature, transportable pathogenicity chromosome may increase pathogen's adaptation to environment. Second, loss of a CDC when there's no host may reduce the cost of carrying extra genome content. Third, as asexual fungi, horizontal transfer may compensate the lack of genetic recombination.

We were also trying to identify the donor in A. arborencens’s HGT events, as HGTs have been

56 reported previously from bacteria to fungi, and from fungi to oomycetes [29, 30]. Phytopathogenic fungi were also reported to carry exogenous genes from plants [75]. Here, we tried to find evidence of CDC genes originating from either bacteria or its host tomato by blasting CDC genes against bacteria (available genomes on NCBI) and tomato genome sequences. Two CDC genes were identified showing homology to the tomato genome sequences but no homologs to A. arborescens

EC sequences: CDC_7 to gi47104509 (E_value = 2e-26), and CDC_36 to gi47104495

(E_value=1.1e-20). These two genes are annotated as and NRPS. We were not able to identify any bacterial origin CDC genes. However, when we blasted complete CDC contig sequences to the NCBI bacteria database, we found that Streptomyces as a genus had the most number of hits.

The genome sequence of Streptomyces coelicolor A3 (8.6MB) [76] was downloaded from genebank then used as reference for CDC contigs blast. Setting 80% identity as cutoff, we found 7.0% of CDC contigs had homology with the Streptomyces coelicolor genome. Although these homologous regions were usually fragmented into small hits, bias was observed as in Node_136, the contig containing ALT1 toxin gene, 14.0% aligned with a Streptomyces genome region. This suggests a possible HGT event between Alternaria and Streptomyces. Fragmentation structure of Streptomyces homologs indicates either the horizontal transfer in this case did not include large continuous sequence or it is an ancient event.

57 Conclusions

In this study, we identified A. arborescens CDC sequences through a whole genome sequencing and de-novo assembly process. By comparing nucleotide usage between CDC and EC contigs, we found evidence supporting HGT in A. arborescens. We also identified some predicted CDC genes under positive selection that may serve as virulence factors. However, questions still remain, such as the similarity and difference among CDCs from different A. arborescens isolates. To better understand

CDC characteristics and mechanisms of HGT, other isolates need to be sequenced.

Materials and methods

Sequencing, assembly & alignment

A.arborescens DNA was extracted following a protocol described [77] and the sequencing library was prepared using the Illumina Paired-End DNA Sample Prep Kit. Sequencing was performed using

Illumina Genome Analyzer II. Short reads were assembled de-novo using Velvet, and quality improved by a pipeline including two alternate assemblers: Edena [44], and Minimus2 [45].

Parameters including k-mer length for Velvet and hash length for Edena were optimized by sequential step changes. The alignment between A.arborescens and A.brassicicola was conducted using the Nucmer program in the MUMMER suit, with parameter c=15, l=10. Alignments with identities lower than 90% or lengths shorter than 100bp were removed.

58 Southern hybridization

On the CHEF gel membrane presented in Figure 2.2, lane 1 contains size markers, lane 2 contains A. arborescens chromosomes that had degraded, and lane 3 contains intact A. arborescens chromosomes. Southern hybridization was conducted using the GE health CDP-Star kit with 5 gene probes, including 1 CDC marker gene ALT1, 3 predicted CDC genes, and 1 predicted EC gene. Primers

(Table 2.9) were designed using Primer3 [78] (v0.4.0). Blots were stripped between hybridizations to ensure no probes from previous hybridization remained. Film was exposed for 48 hours.

Gene prediction, codon usage analysis & repetitive DNA identification

Gene prediction was conducted using FGENESH [47], an ab initio gene predictor provided in the

Softberry website. A pre-trained Alternaria matrix was used to optimize predictions. Both CDSs and

protein sequences were generated and converted into fasta format files. ACUA [79] was used for

calculating CAI and RSUC for each gene, and CAI distribution curves from the CDC group and EC

group were compared to each other. Student's f-test was used to analyse data. RepeatScout [80] was

used for de-novo identification of repeat sequences in both CDC and EC sequences. The repeat

libraries were then aligned back to CDC and EC contigs using Nucmer to calculate the repeat

percentage for each group.

59 Gene annotation

Blast2go [81] was used to annotate genes by “BLASTX” to the NCBI non-redundant protein database

and then GO term assignment from the gene ontology database. Annotation of conserved domains

was identified by scanning proteins through Pfam and NCBI CDD. PKS and NRPS genes were

identified through scanning an online database SBSPKS [82]. The Fungal transcription factor

database (FTFD) [83] was used to identify transcription factors. Transporters, P450s, and

oxidoreducatases were identified based on BLAST and domain inspection. Potential secreted

proteins were predicted using Signal 3.0 [84]. Pathogenicity and virulence factors were identified

through scanning CDC genes in the pathogen-host interactions database (PHI-base) [54].

Estimating Ka/Ks Ratios

A. arborescens proteins were blasted against A.brassicicola proteins to generate a match list between the two groups with a bits score cut-off at 300. The gene sequences coding for aligned proteins were extracted by an in-house PERL script. Prank [85] was used to conduct codon alignment, in which two protein sequences were aligned first and then DNA sequences were aligned based on the corresponding protein alignments. The codon alignment result was then entered into “Codem” in PAML [86] (v4.0) for Ka and Ks calculation with model M0. In calculating, the Nei and Gojobori [87] method and Yang and Nielsen [88] method were used.

60 References

1. Covert SF: Supernumerary chromosomes in filamentous fungi. Current genetics 1998, 33(5):311-319. 2. VanEtten H, Jorgensen S, Enkerli J, Covert SF: Inducing the loss of conditionally dispensable chromosomes in Nectria haematococca during vegetative growth. Current genetics 1998, 33(4):299-303. 3. E.B. W: Studies on Chromosomes. V. The chromosomes of Metapodius. A contribution to the hypothesis of the genetic continuity of chromosomes. J Exp Zool 1906(6):147-205. 4. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H et al: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434(7036):980-986. 5. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S et al: Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. The Plant cell 2009, 21(5):1573-1591. 6. Chuma I, Tosa Y, Taga M, Nakayashiki H, Mayama S: Meiotic behavior of a supernumerary chromosome in Magnaporthe oryzae. Current genetics 2003, 43(3):191-198. 7. Ma LJ, van der Does HC, Borkovich KA, Coleman JJ, Daboussi MJ, Di Pietro A, Dufresne M, Freitag M, Grabherr M, Henrissat B et al: Comparative genomics reveals mobile pathogenicity chromosomes in Fusarium. Nature 2010, 464(7287):367-373. 8. Han Y, Liu X, Benny U, Kistler HC, VanEtten HD: Genes determining pathogenicity to pea are clustered on a supernumerary chromosome in the fungal plant pathogen Nectria haematococca. The Plant journal : for cell and molecular biology 2001, 25(3):305-314. 9. Coleman JJ, Rounsley SD, Rodriguez-Carres M, Kuo A, Wasmann CC, Grimwood J, Schmutz J, Taga M, White GJ, Zhou S et al: The genome of Nectria haematococca: contribution of supernumerary chromosomes to gene expansion. PLoS genetics 2009, 5(8):e1000618. 10. Stukenbrock EH, Jorgensen FG, Zala M, Hansen TT, McDonald BA, Schierup MH: Whole-genome and chromosome evolution associated with host adaptation and speciation of the wheat pathogen Mycosphaerella graminicola. PLoS genetics 2010, 6(12):e1001189. 11. Tzeng TH, Lyngholm LK, Ford CF, Bronson CR: A restriction fragment length polymorphism map and electrophoretic karyotype of the fungal maize pathogen Cochliobolus heterostrophus. Genetics 1992, 130(1):81-96. 12. Leclair S, Ansan-Melayah D, Rouxel T, Balesdent M: Meiotic behaviour of the minichromosome in the phytopathogenic ascomycete Leptosphaeria maculans. Current genetics 1996, 30(6):541-548. 13. Johnson LJ, Johnson RD, Akamatsu H, Salamiah A, Otani H, Kohmoto K, Kodama M: Spontaneous loss of a conditionally dispensable chromosome from the Alternaria alternata apple pathotype leads to loss of toxin production and pathogenicity. Current genetics 2001, 40(1):65-72.

61 14. Hatta R, Ito K, Hosaki Y, Tanaka T, Tanaka A, Yamamoto M, Akimitsu K, Tsuge T: A conditionally dispensable chromosome controls host-specific pathogenicity in the fungal plant pathogen Alternaria alternata. Genetics 2002, 161(1):59-70. 15. Rotem J: The Genus Alternaria. Biology, epidemiology, and pathogenicity. St. Paul: APS Press; 1994. 16. Akamatsu H, Taga M, Kodama M, Johnson R, Otani H, Kohmoto K: Molecular karyotypes for Alternaria plant pathogens known to produce host-specific toxins. Current genetics 1999, 35(6):647-656. 17. Yasunori Akagi MT, Mikihiro Yamamoto, Takashi Tsuge, Yukitaka Fukumasa-Nakai, Hiroshi Otani and Motoichiro Kodama: Chromosome constitution of hybrid strains constructed by protoplast fusion between the tomato and strawberry pathotypes of Alternaria alternata. J Gen Plant Pathol 2009, 75(2):101-109. 18. SALAMIAH HA, Yukitaka FUKUMASA-NAKAI, Hiroshi OTANI and Motoichiro KODAMA: Construction and Genetic Analysis of Hybrid Strains between Apple and Tomato Pathotypes of Alternaria alternata by Protoplast Fusion J Gen Plant Pathol 2001, 67(2):97-105. 19. SALAMIAH YF-N, Hajime AKAMATSU, Hiroshi OTANI, Keisuke KOHMOTO and Motoichiro KODAMA: Genetic Analysis of Pathogenicity and Host-specific Toxin Production of Alternaria alternata Tomato Pathotype by Protoplast Fusion J Gen Plant Pathol 2001, 67(1):7-14. 20. Masunaka A, Ohtani K, Peever TL, Timmer LW, Tsuge T, Yamamoto M, Yamamoto H, Akimitsu K: An Isolate of Alternaria alternata That Is Pathogenic to Both Tangerines and Rough Lemon and Produces Two Host-Selective Toxins, ACT- and ACR-Toxins. Phytopathology 2005, 95(3):241-247. 21. Nakatsuka S, Ueda, K., Goto, T., Yamamoto, M., Nishimura, S. and Kohmoto, K.: Stucture of AF-toxin II, one of the host-specific toxins produced by Alternaria alternata strawberry pathotype. Tetrahedron Lett 1986, 27:2753-2756. 22. Nakashima T, Ueno, T., Fukami, H., Taga, T., Masuda, H., Osaki, K., et al.: Isolation and stuctures of AK-Toxin I and II, host-specific phytotoxic metabolites produced by Alternaria alternata Japanese pear pathotype. Agric Biol chem 1985, 49:807-815. 23. Kohmoto K, Itoh, Y., Shimomura, N., Kondoh, Y., Otani, H., Kodama, M., et al.: Isolation and biological activities of 2 host-specific toxins from the tangerine pathotype of Alternaria alternata. Phytopathology 1993, 83:495-502. 24. Feng BN, Nakatsuka, S., Goto, T., Tsuge, T. and Nishimura, S.: Biosynthesis of host-selective toxins produced by alternaria alternata pathogens, I: (8r,9s)-9,10-epoxy-8-hydroxy-9-methyl-deca-(2e,4z,6e)-trienoic acid as a biological precursor of AK-toxins. Agric Biol Chem 1990, 54:845-848. 25. Nakatsuka S, Feng BN, Goto T, Tsuge T, Nishimura S: Biosynthesis of Host-Selective Toxins Produced by Alternaria-Alternata Pathogens .2. Biosynthetic Origin of (8r,9s)-9,10-Epoxy-8-Hydroxy-9-Methyl-Deca-(2e,4z,6e)-Trienoic Acid, a Precursor of

62 Ak-Toxins Produced by Alternaria-Alternata. Phytochemistry 1990, 29(5):1529-1531. 26. Johnson RD, Johnson L, Itoh Y, Kodama M, Otani H, Kahmoto K: Cloning and characterization of a cyclic peptide synthetase gene from Alternaria alternata apple pathotype whose product is involved in AM-toxin synthesis and pathogenicity. Mol Plant Microbe In 2000, 13(7):742-753. 27. Harimoto Y, Tanaka T, Kodama M, Yamamoto M, Otani H, Tsuge T: Multiple copies of AMT2 are prerequisite for the apple pathotype of Alternaria alternata to produce enough AM-toxin for expressing pathogenicity. J Gen Plant Pathol 2008, 74(3):222-229. 28. Syvanen m: Cross-species gene transfer; implications for a new theory of evolution. Journal of Theoretical Biology 1985, 112(2):333-343. 29. Schmitt I, Lumbsch HT: Ancient horizontal gene transfer from bacteria enhances biosynthetic capabilities of fungi. PloS one 2009, 4(2):e4437. 30. Richards TA, Dacks JB, Jenkinson JM, Thornton CR, Talbot NJ: Evolution of filamentous plant pathogens: gene exchange across eukaryotic kingdoms. Current biology : CB 2006, 16(18):1857-1864. 31. Rosewich UL, Kistler HC: Role of Horizontal Gene Transfer in the Evolution of Fungi. Annual review of phytopathology 2000, 38:325-363. 32. Nishimura S, Kohmoto K: Host-Specific Toxins and Chemical Structures from Alternaria Species. Annu Rev Phytopathol 1983, 21:87-116. 33. Timothy L Friesen EHS, Zhaohui Liu, Steven Meinhardt, Hua Ling, Justin D Faris1, Jack B Rasmussen, Peter S Solomon, Bruce A McDonald & Richard P Oliver: Emergence of a new disease as a result of interspecific virulence gene transfer. Nature Genetics 2006, 38:953-956. 34. Markham JE, Hille J: Host-selective toxins as agents of cell death in plant-fungus interactions. Molecular plant pathology 2001, 2(4):229-239. 35. Gilchrist DG, Grogan RG: Production and Nature of a Host-Specific Toxin from Alternaria-Alternata F-Sp Lycopersici. Phytopathology 1976, 66(2):165-171. 36. Peever TL, Su G, Carpenter-Boggs L, Timmer LW: Molecular systematics of citrus-associated Alternaria species. Mycologia 2004, 96(1):119-134. 37. Akamatsu H: Molecular biological studies on the pathogenicity of Alternaria alternata tomato pathotype. J Gen Plant Pathol 2004(70):389. 38. Yamagishi D, H. Akamatsu, H. Otani, and M. Kodama.: Pathological evaluation of host-specific AAL-toxins and fumonisin mycotoxins produced by Alternaria and Fusarium species. J Gen Plant Pathol 2006, 72(5):323-327. 39. Akamatsu H, H. Otani, and M. Kodama: Characterization of a gene cluster for host-specific AAL-toxin biosynthesis in the tomato pathotype of Alternaria alternata. Fungal genet Newsl 2003, 50:355. 40. Kroken S, Glass NL, Taylor JW, Yoder OC, Turgeon BG: Phylogenomic analysis of type I polyketide synthase genes in pathogenic and saprobic ascomycetes. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(26):15670-15675.

63 41. Akagi Y, Akamatsu H, Otani H, Kodama M: Horizontal Chromosome Transfer, a Mechanism for the Evolution and Differentiation of a Plant-Pathogenic Fungus. Eukaryot Cell 2009, 8(11):1732-1738. 42. Simmons EG: Alternaria themes and variations (236-243) - Host-specific toxin producers. Mycotaxon 1999, 70:325-369. 43. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 2008, 18(5):821-829. 44. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome research 2008, 18(5):802-809. 45. Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC bioinformatics 2007, 8:64. 46. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome biology 2004, 5(2):R12. 47. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome research 2000, 10(4):516-522. 48. Gojobori MIBaT: Significant differences between the G+C content of synonymous codons in orthologous genes and the genomic G+C content. Gene 1999, 238(1):33-37. 49. Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic acids research 1987, 15(3):1281-1295. 50. Sharp PM, Tuohy TM, Mosurski KR: Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic acids research 1986, 14(13):5125-5143. 51. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al: The Pfam protein families database. Nucleic acids research 2004, 32(Database issue):D138-141. 52. Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z et al: CDD: a Conserved Domain Database for protein classification. Nucleic acids research 2005, 33(Database issue):D192-196. 53. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29. 54. Baldwin TK, Winnenburg R, Urban M, Rawlings C, Koehler J, Hammond-Kosack KE: The pathogen-host interactions database (PHI-base) provides insights into generic and novel themes of pathogenicity. Molecular plant-microbe interactions : MPMI 2006, 19(12):1451-1462. 55. Proctor RH, McCormick SP, Alexander NJ, Desjardins AE: Evidence that a secondary metabolic biosynthetic gene cluster has grown by gene relocation during evolution of the filamentous fungus Fusarium. Molecular microbiology 2009, 74(5):1128-1142.

64 56. Yu JH, Keller N: Regulation of secondary metabolism in filamentous fungi. Annual review of phytopathology 2005, 43:437-458. 57. N. Saitou MN: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406-425. 58. Parkinson J, Blaxter M: SimiTri--visualizing similarity relationships for groups of sequences. Bioinformatics 2003, 19(3):390-395. 59. Schirawski J, Mannhaupt G, Munch K, Brefort T, Schipper K, Doehlemann G, Di Stasio M, Rossel N, Mendoza-Mendoza A, Pester D et al: Pathogenicity determinants in smut fungi revealed by genome comparison. Science 2010, 330(6010):1546-1548. 60. Spanu PD, Abbott JC, Amselem J, Burgis TA, Soanes DM, Stuber K, van Themaat EVL, Brown JKM, Butcher SA, Gurr SJ et al: Genome Expansion and Gene Loss in Powdery Mildew Fungi Reveal Tradeoffs in Extreme Parasitism. Science 2010, 330(6010):1543-1546. 61. Schmidt SM, Panstruga R: Pathogenomics of fungal plant parasites: what have we learnt about pathogenesis? Current opinion in plant biology 2011. 62. Mehrabi R, Taga M, Kema GH: Electrophoretic and cytological karyotyping of the foliar wheat pathogen Mycosphaerella graminicola reveals many chromosomes with a large size range. Mycologia 2007, 99(6):868-876. 63. Miao VP, Covert SF, VanEtten HD: A fungal gene for antibiotic resistance on a dispensable ("B") chromosome. Science 1991, 254(5039):1773-1776. 64. Tsuge T, Tanaka A, Shiotani H, Yamamoto M: Insertional mutagenesis and cloning of the genes required for biosynthesis of the host-specific AK-toxin in the Japanese pear pathotype of Alternaria alternata. Mol Plant Microbe In 1999, 12(8):691-702. 65. Tsuge T, Tanaka A: Structural and functional complexity of the genomic region controlling AK-toxin biosynthesis and pathogenicity in the Japanese pear pathotype of Alternaria alternata. Mol Plant Microbe In 2000, 13(9):975-986. 66. Wolpert TJ, Dunkle LD, Ciuffetti LM: Host-selective toxins and avirulence determinants: what's in a name? Annual review of phytopathology 2002, 40:251-285. 67. Kamper J, Kahmann R, Bolker M, Ma LJ, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Muller O et al: Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature 2006, 444(7115):97-101. 68. Nierman WC, Pain A, Anderson MJ, Wortman JR, Kim HS, Arroyo J, Berriman M, Abe K, Archer DB, Bermejo C et al: Genomic sequence of the pathogenic and allergenic filamentous fungus Aspergillus fumigatus. Nature 2005, 438(7071):1151-1156. 69. Gokhale RSaT, D.: Biochemistry of polyketide synthases. Biotechnology 2001, 10:341-372. 70. Dertz EA, Xu J, Stintzi A, Raymond KN: Bacillibactin-mediated iron transport in Bacillus subtilis. Journal of the American Chemical Society 2006, 128(1):22-23. 71. Johnson BA, Anker H, Meleney FL: Bacitracin: A New Antibiotic Produced by a Member of the B. Subtilis Group. Science 1945, 102(2650):376-377. 72. S. Wendenbaum PD, A. Dell, J. M. Meyer and M. A. Abdallah: The structure of pyoverdine Pa, the siderophore of Pseudomonas aeruginosa. Tetrahedron Letters 1983, 24(44):4877-4880.

65 73. Scholz-Schroeder BK, Soule JD, Gross DC: The sypA, sypS, and sypC synthetase genes encode twenty-two modules involved in the nonribosomal peptide synthesis of syringopeptin by Pseudomonas syringae pv. syringae B301D. Molecular plant-microbe interactions : MPMI 2003, 16(4):271-280. 74. Lawrence CB, Mitchell TK, Craven KD, Cho Y, Cramer RA, Kim KH: At death's door: Alternaria pathogenicity mechanisms. Plant Pathology J 2008, 24(2):101-111. 75. Richards TA, Soanes DM, Foster PG, Leonard G, Thornton CR, Talbot NJ: Phylogenomic analysis demonstrates a pattern of rare and ancient horizontal gene transfer between plants and fungi. The Plant cell 2009, 21(7):1897-1911. 76. Bentley SD, Chater KF, Cerdeno-Tarraga AM, Challis GL, Thomson NR, James KD, Harris DE, Quail MA, Kieser H, Harper D et al: Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 2002, 417(6885):141-147. 77. Al-Samarrai TH, Schmid J: A simple method for extraction of fungal genomic DNA. Lett Appl Microbiol 2000, 30(1):53-56. 78. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 2000, 132:365-386. 79. Vetrivel U, Arunkumar V, Dorairaj S: ACUA: a software tool for automated codon usage analysis. Bioinformation 2007, 2(2):62-63. 80. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21 Suppl 1:i351-358. 81. Gotz S, Garcia-Gomez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talon M, Dopazo J, Conesa A: High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids research 2008, 36(10):3420-3435. 82. Anand S, Prasad MV, Yadav G, Kumar N, Shehara J, Ansari MZ, Mohanty D: SBSPKS: structure based sequence analysis of polyketide synthases. Nucleic acids research 2010, 38(Web Server issue):W487-496. 83. Park J, Jang S, Kim S, Kong S, Choi J, Ahn K, Kim J, Lee S, Park B, Jung K et al: FTFD: an informatics pipeline supporting phylogenomic analysis of fungal transcription factors. Bioinformatics 2008, 24(7):1024-1025. 84. Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. Journal of molecular biology 2004, 340(4):783-795. 85. Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(30):10557-10562. 86. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution 2007, 24(8):1586-1591. 87. Nei M. gT: Simple methods for estimating the numbers of syonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986, 3:418-426. 88. Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Molecular biology and evolution 2000, 17(1):32-43.

66

Ins_length Exp_ Min_lgth N50 Max_lgth Total size Reads K Cut-off (bp) cov (bp) (bp) (bp) (Mb) used 31 350 42 5 100 84390 485954 33.9 46910423 41 350 33 5 100 390559 1219988 33.9 46820556 51 350 24 5 100 624394 1544694 34.0 46047345 57 350 16 5 100 441130 2135139 34.0 45564058

Table 2.1. Velvet de-novo assembly statistics. Results from assembly attempts using varying parameters. The row in shadow shows the optimized assembly.

67

Contig ID Length (bp) Alignment Marker gene New ID GC% coverage NODE_9 48862 13.6% AaMSAS CDC_contig_23 53.015% NODE_136 14729 15.7% ALT1 CDC_contig_27 24.942% NODE_758 174048 26.1% AKS17 EC_contig_009 51.127% NODE_8 199480 44.1% AKS21 EC_contig_038 51.143% NODE_58 833831 61.9% MAT1-2-1 EC_contig_063 51.209% NODE_82 1201916 69.9% ALM EC_contig_085 51.495% NODE_274 1472031 72.1% VKS2 EC_contig_091 51.489% NODE_151 812677 73.5% AaTUB EC_contig_094 51.705%

Table 2.2. Alignment of A. arborescens marker gene contigs and A. brassicicola contigs. The first two contigs contain two CDC marker genes and the following six contigs contain six EC marker genes.

68

Chromosome Total length (bp) Repeat length (bp) Repeat coverage CDC 906,671 47,990 5.29% ECs 32,675,713 186,054 0.57%

Table 2.3. Repeat region identification.

69

CDC_1 CDC_2 EC_1 EC_2 EC_3 EC_4 EC_5 EC_6 EC_7 EC_8 CDC_1 1.000 CDC_2 0.995 1.000 EC_1 0.988 0.991 1.000 EC_2 0.992 0.995 0.999 1.000 EC_3 0.992 0.994 0.999 0.999 1.000 EC_4 0.989 0.991 1.000 0.999 0.999 1.000 EC_5 0.990 0.992 1.000 0.999 1.000 1.000 1.000 EC_6 0.991 0.995 0.999 1.000 0.999 0.999 0.999 1.000 EC_7 0.992 0.994 0.999 0.999 1.000 0.999 1.000 0.999 1.000 EC_8 0.992 0.994 0.999 0.999 1.000 0.999 1.000 0.999 1.000 1.000

Table 2.4. Codon usage correlation analysis. Numbers in red show lower correlation scores between

CDC and EC contigs compared to correlation within EC groups or within CDC groups, indicating existence of codon usage bias of CDC genes.

70

Codon Amino Acid Frequency in Frequency in EC Significance CDC AAA K 35.0% 29.6% P=0.61 AAG K 65.0% 70.4% P=0.73 AAT N 36.8% 32.3% P=0.62 AAC N 63.3% 67.7% P=0.72 TAT Y 37.7% 32.2% P=0.54 TAC Y 62.4% 67.8% P=0.66

Table 2.5. Differences in codon usage between CDC and EC genes. Top three amino acids showing the most bias are listed.

71 Domain_ID Ka Ks Domain Protein_ Domain/pr Ka/Ks Ka/Ks otein-ratio CDC_151_HAD_like 67.10 28.90 0.74 0.06 12.33 CDC_151_HMA 51.70 17.30 0.13 0.06 2.10 CDC_84_GAL4 22.00 2.00 0.21 0.11 1.91 CDC_151_E1-E2_ATPase 103.80 49.20 0.10 0.06 1.72 CDC_144_Pat17_PNPLA8_PNP 113.50 48.50 0.41 0.30 1.36 LA9_like CDC_178_AbfB 327.90 107.10 0.12 0.10 1.17 CDC_68_p450 442.30 172.70 0.14 0.14 1.01 CDC_161_RTA1 184.00 77.00 0.12 0.13 0.99 CDC_118_MOZ_SAS 441.40 149.60 0.03 0.03 0.97 CDC_46_TA_like 395.20 153.80 0.10 0.11 0.93 CDC_92_Aa_trans 900.30 293.70 0.07 0.08 0.92 CDC_28_SDR_c 533.30 216.70 0.03 0.03 0.90 CDC_66_P21-Arc 438.40 128.60 0.03 0.04 0.84 CDC_183_Syja_N 916.00 146.00 0.02 0.02 0.84 CDC_169_Rubis-subs-bind 106.90 34.10 0.09 0.12 0.74 CDC_161_Peptidases_S8_S53 256.10 106.90 0.09 0.13 0.71 CDC_102_CAP59_mtransfer 163.00 53.00 0.34 0.51 0.66 CDC_188_Peptidases_S53 957.50 212.50 0.03 0.05 0.64 CDC_178_ArabFuran-catal 629.00 238.00 0.06 0.10 0.60 CDC_67_Rep_fac_C 196.40 70.60 0.02 0.03 0.58 CDC_184_Rab5_related 240.70 56.30 0.01 0.03 0.52 CDC_146_APH 72.30 29.70 0.20 0.45 0.45 CDC_107_PLN02844 469.90 133.10 0.06 0.14 0.44 CDC_161_PA_PoS1_like 245.50 114.50 0.05 0.13 0.42 CDC_161_Peptidases_S8_5 405.40 176.60 0.05 0.13 0.38 CDC_161_DUF1034 237.50 95.50 0.05 0.13 0.37 CDC_51_ACC_central 210.10 68.90 0.01 0.02 0.34 CDC_67_AAA 336.40 116.60 0.01 0.03 0.17 CDC_189_Na_Ca_ex 277.80 88.20 0.00 0.02 0.17

Table 2.6. Ka/Ks ratio of CDC protein conversed domains. The last column shows the enrichment of

Ka/Ks domains as compared to the whole protein.

72

Species Genome Size Gene Model Number Average Transcript Length Alternaria arborescens 34 9167 1300 Cochliobolus heterostrophus 34.9 9633 1639 Dothistroma septosporum 30.2 12580 1778 Hysterium pulicare 38.4 12352 1524 Leptosphaeria maculans 44.9 12469 1258 Mycosphaerella fijiensis 74.1 13107 1833 Mycosphaerella graminicola 39.7 10952 ? Pyrenophora teres f. teres 33.6 11799 1390 Pyrenophora tritici-repentis 37.8 12171 ? Rhytidhysteron rufulum 40.2 12117 1505 Septoria musiva 29.4 10233 1841 Setosphaeria turcica 43 11702 1490 Stagonospora nodorum 37.2 16597 ? Alternaria brassicicola 31 10688 1333

Table 2.7. General statistical facts of A. arborescens and other sequenced fungi genome.

73 Query From To Bitscore Accession Short name CDC_4 2 295 212.038 cl02872 DHQ_Fe-ADH superfamily CDC_6 41 299 123.824 cl10444 Ras_like_GTPase superfamily CDC_23 23 257 189.418 cd05233 SDR_c CDC_23 23 257 189.418 cl09931 NADB_Rossmann superfamily CDC_23 3 259 208.475 PRK05653 fabG CDC_28 50 299 211.374 cd05233 SDR_c CDC_28 50 299 211.374 cl09931 NADB_Rossmann superfamily CDC_28 43 303 220.031 PRK05653 fabG CDC_36 154 381 314.006 cd03249 ABC_MTABC3_MDL1_MDL2 CDC_36 154 381 314.006 cl09099 P-loop NTPase superfamily CDC_36 695 934 313.621 cd03249 ABC_MTABC3_MDL1_MDL2 CDC_36 695 934 313.621 cl09099 P-loop NTPase superfamily CDC_36 695 923 142.167 COG1126 GlnQ CDC_36 154 370 141.397 COG1126 GlnQ CDC_37 67 496 169.38 cl12078 p450 superfamily CDC_41 457 662 196.457 cl09099 P-loop NTPase superfamily CDC_41 150 688 161.391 COG1132 MdlB CDC_42 694 1102 421.667 cd06206 bifunctional_CYPOR CDC_42 694 1102 421.667 cl06868 FNR_like superfamily CDC_42 26 439 227.546 cl12078 p450 superfamily CDC_42 466 1102 260.325 COG0369 CysJ CDC_49 180 552 285.031 cl00285 superfamily CDC_80 1022 1205 116.368 cl00160 LbetaH superfamily CDC_141 33 454 533.223 cd00833 PKS CDC_141 33 454 533.223 cl09938 cond_enzymes superfamily CDC_141 1237 1608 202.13 cd08955 KR_2_FAS_SDR_x CDC_141 1237 1608 202.13 cl09931 NADB_Rossmann superfamily CDC_141 564 879 270.822 cl08282 Acyl_transf_1 superfamily CDC_141 34 457 623.228 smart00825 PKS_KS CDC_144 271 596 127.448 cd07199 Pat17_PNPLA8_PNPLA9_like CDC_144 271 596 127.448 cl11396 Patatin_and_cPLA2 superfamily CDC_147 172 365 121.98 cl14604 Methyltransf_2 superfamily CDC_165 30 447 519.356 cd00833 PKS CDC_165 30 447 519.356 cl09938 cond_enzymes superfamily CDC_165 1909 2144 276.372 cd05195 enoyl_red CDC_165 1909 2144 276.372 cl14614 MDR superfamily CDC_165 2158 2381 171.314 cd08955 KR_2_FAS_SDR_x CDC_165 2158 2381 171.314 cl09931 NADB_Rossmann superfamily CDC_165 640 940 283.123 cl08282 Acyl_transf_1 superfamily CDC_165 1005 1179 110.38 cl11739 PKS_DH superfamily CDC_165 32 449 595.108 smart00825 PKS_KS CDC_165 1913 2144 293.921 smart00829 PKS_ER CDC_183 57 410 355.778 cl11995 Syja_N superfamily

Table 2.8. Conserved domains in CDC putative PKS genes. 74

Probe Primer_F Primer_R ALT1 TGCAGTCGAGCTGTCACTTT GCGATCAGAGATGACGAACA CDC_92 CGTCCGTTATCCTGGTCACT GAATCGCAGATGCAATGATG CDC_102 CCTCCGCAGCTCTACGATAC CCTATGCCGTTCCAACAACT CDC_147 ATGATTCGGCAAATCTCTGG CTTGAGGTAGGCAGGCAAAG EC_97_90_g721 GCCTATCTGCACCGCTCTAC GATGGCGACTGCTAGACCTC

Table 2.9. Primers used for Southern hybridization.

75

Figure 2.1. Annotation of the initial two CDC contigs. The two lines represent two contigs containing

CDC marker genes. Each box represents a BLAST annotation with the direction indicated by arrows.

76

Figure 2.2. Southern hybridization to validate CDC contigs prediction. (A) Chromosomes of A.

arborescens separated by CHEF. Hybridization of (B) ALT1 gene, three predicted CDC genes including

(C) CDC_92, (D) CDC_102, (E) CDC_147, and (F) a predicted EC gene EC_97_90_g721 are shown. It

should be noted that the membrane used for hybridization was prepared in 2004, so only a limited

number of hybridization experiments were possible.

77

Figure 2.3. Global contig alignment between A. arborescens contigs (blue blocks) and A. brassicicola contigs (red blocks). The first 11 lines contain EC contigs alignment and the last line (and the magnified inset) contains CDC contigs alignment. Alignment identity cut-off was set as 70%. Contig numbers are given above blocks. The corresponding alignment identity is shown as spots under contigs.

78

Figure 2.4. The Codon Adaptation Index (CAI) distribution of A. arborescens CDC genes compared to

EC genes. The CAI derived from the RSCU estimations was computed using “Automated Codon Usage

Analysis Software” (ACUA).

79

Figure 2.5. BLAST taxonomy report of all CDC genes against NCBI “Nucleotide collection (nr/nt)” database. Species of BLAST hits (E-value < 1.0E-3) were ranked by their occurrence.

80

Figure 2.6. GO term of CDC genes. GO terms were assigned to all CDC genes and level three terms in the categories Biological Process (A), Molecular Function (B), and Cellular Component (C) are showed in the corresponding pie chart.

81

Figure 2.7. Structure of predicted secondary metabolite biosynthesis (SMB) clusters. Each block

represents a predicted CDC gene with the annotation listed above the block. Contig (NODE) numbers

and color legend are given. Type I PKSs are multifunctional enzymes that are organized into modules,

each harbors a set of distinct, non-iteratively acting activities responsible for the catalysis, while type

II PKSs are multienzyme complexes carrying a single set of iteratively acting activities.

82

Figure 2.8. Average Ka/Ks ratio, Ka, and Ks for all aligned EC genes and CDC genes. 6323 EC proteins and 34 CDC proteins were involved and codem program in the PAML suite used.

83

Figure 2.9. Phylogenetic relationship of proteins in A. arborescens CDC shows discordance compared to EC proteins. The maximum-likelihood tree was constructed by “MEGA 5” using six protein sequences showing homology in all six groups.

84

Figure 2.10. Scatter plots of BLAST Score Ratio (BSR) of EC proteins (A) and CDC proteins (B). The numbers in yellow regions indicate the percentage of genes that either lack homologous sequences

(lower left corner) or contain homologous sequences (upper right corner) in both A. brassicicola (Ab) library and 3-fungi library (P. tritici-repentis, L. maculans, and A. oryzae). The numbers in pink regions

(upper left) indicate the percentage of genes with homologous sequences in 3-fungi library but not in Ab library, while numbers in green regions (lower right) indicate the percentage of genes with homologous sequences in the Ab library but not in the 3-fungi library.

85

Figure 2.11. Diversity of marker genes on CDC and ECs among different Alternaria isolates. The first 6 genes in black color are EC marker genes and the two in red are CDC markers.

86

Chapter 3: RNA-Seq Analysis of the Sclerotinia homoeocarpa – Creeping Bentgrass Pathosystem

87 Introduction

Sclerotinia homoeocarpa is an ascomycete fungus, which causes dollar spot disease on turfgrass all around the world [1]. S. homoeocarpa’s hosts include all species of turfgrass and some of dicot plants [1]. One of its hosts, creeping bentgrass (also known as Agrostis stolonifera L.), is a cool-season turfgrass commonly planted on golf course greens in both the United States and Canada

[2]. Many commonly planted creeping bentgrass cultivars contain no resistance to dollar spot disease and the frequent, low mowing practices applied to them in a golf course setting promote disease outbreaks [1, 3].

Straw-colored, hourglass-shaped lesions with characteristic reddish-brown borders can be observed in dollar spot infected plants. Diseased areas may grow to about 2.5 cm wide [1], and the diseased areas will merge to form larger patches of diseased turf under environmental conditions for the pathogen [4-6]. However, predictive models designed to reduce fungicide applications have developed slowly because of a lack of a full understanding of S. homoeocarpa lifestyle, epidemiology, and disease etiology [1, 4].

The overwinter structure of S. homoeocarpa is a stroma and the pathogen becomes active in the spring. The fungus infects the newly emerging leaf tissue through the stomate, wounds, and directly with an appressoriu [7-9]. From reports, only the infertile apothecia have been seen for North

American isolates [10], but population studies indicate the possibility of genetic recombination in this fungus [11]. Early studies recorded root browning and cell death through the production of

88 diffusible toxins [12]. Recently, several tetranorditerpenoid compounds have been identified that can cause root-browning via extremely phytotoxic properties; however, a potential correlation between the production of these compounds and disease symptoms was still missing [13].

To control dollar spot, cultural management includes maintaining adequate nitrogen balance, sufficient air flow to remove dew, and planting moderately resistant cultivars or species of turf [14,

15]. Fungicides are often applied biweekly to weekly on highly maintained areas such as golf courses because cultural practices are not sufficient for management of the disease. As a result, high amounts of fungicide use have led to pathogen resistance to several chemical classes commonly used on turf [1]. Except for traditional chemical fungicides, other disease controlling products have been used including plant defense activators that function by stimulating two different plant defense pathways: systemic acquired resistance and induced systemic resistance. However, it is unclear which pathway is more efficient for preventing dollar spot epidemics. To develop more sustainable and practical management strategies, it will be essential to elucidate the molecular interactions between S. homoeocarpa and creeping bentgrass, which may greatly accelerate the development of plant defense activators and the cultivars containing increased resistance to S. homoeocarpa.

One of the available methods that enable researchers to investigate molecular change of pathogen and host in a transcriptome scale level is the 2nd generation sequencing (NGS) technology, which could also reduce the cost to a relatively low level in return for a vast amounts of data with quantitative properties [16-19]. In this chapter, two NGS technologies were used to generate

89 sequence data for RNA-seq analysis: Illumina’s sequencing-by-synthesis (SBS) and Roche’s

454-pyrosequencing. The 454 reads were used for the de novo assembly of S. homoeocarpa and creeping bentgrass transcriptome libraries. SBS reads were mapped to the 454 assemblies to calculate transcript levels from S. homoeocarpa and creeping bentgrass during dollar spot disease development. The objective of this study was to identify transcripts that may be important for fungal virulence and creeping bentgrass defense. The results of the analysis will be used to form testable hypotheses for future studies on dollar spot etiology and turfgrass defense mechanisms.

Results

Transcriptomic analysis

Six cDNA libraries for 454 sequencing and 4 cDNA libraries for SBS sequencing were prepared by Dr.

Venu Reddyvari-Channarayappa. The S. homoeocarpa (SH) 454 sequencing data contained 600,760 reads in total, and the A. stolonifera (AS) 454 sequence data contained 205,40 reads in total (Table

3.1). The 454 read lengths ranged from 50 bp to > 500 bp, with a majority of the reads between 40 and 500 bp in length (Figure 3.1). The transcriptome coverage for each of the 454 assemblies was

3.3x coverage for AS and 17.2x coverage for the SH assembly. These sequencing coverage numbers were calculated by dividing the total number of sequence reads by the size of the respective, assembled transcriptome libraries.

90 The SBS reads, which was used for calculating significant differences in transcript levels between libraries, resulted in 4.3 – 7. Million reads (Table 3.2); however, these reads were only 16 bp long, because of the first generation of Illumina sequencing protocols we used. The SBS reads provided ample coverage of the entire transcript assemblies: 14x coverage for SH and 12x for AS. The 454 library construction resulted in 9,319 SH transcripts with length ranging from 400 bp to 3,500 bp, with a distribution peak at 500 bp. Construction of the 454 transcript library for AS resulted in

20,293 transcript sequences with length ranging from 200bp to 8,500 bp with a distribution peak at

480 bp. The full length of transcripts was not estimated since very few annotated coding sequences from either SH or AS were available at Genebank. The mapping percentage of SBS transcripts was quite low. Only 67.4% of SH PDA-only reads, 63.8% of SH PDB-only reads, and 33.3% of Interaction reads were mapped to the SH transcripts assembly, respectively. Only 16.8% of AS only reads, and

5.7% of the Interaction reads mapped to the AS transcripts library (Table 3.2). This is likely due to the short length of the SBS reads and stringent mapping parameters used in this experiment, which discarded reads mapped to multiple locations and did not allow any base mismatches.

The quality of the SBS mapping was confirmed by visualizing coverage of previously annotated transcripts, elongation factor-1 and calmodulin-type gene (Figure 3.2). The signal tracks demonstrated that both transcript types provided a coverage of mapped reads from SH from PDA, from PDB, and from the interaction library.

A Venn diagram was constructed to identify the distribution of the unique and common reads in

91 SH-only, AS-only, and interaction libraries for both SBS and 454 data (Figure 3.3). There are discrepancies in the reads distribution between the SBS and 454 libraries. The most obvious example is the increase in the number of reads that are unique to the S?H-only library from 6.2% in the SBS to

58.9% in the 454 reads. This is because there were more total SH reads in the 454 dataset, where four of the six libraries (69.8% of the reads) were constructed using RNA from S. homoeocarpa grown in pure culture. In the SBS dataset, only two out of four (31.7% of the reads) of the libraries were constructed from S. homoeocarpa grown in pure culture. The number of reads shared by the

SH-only and AS-only libraries was consistently low in both SBS and 454 datasets. These shared reads are likely a result of similar sequences in both organisms, including those of transposable elements found in both organisms.

Comparative analysis

The transcript assemblies were blasted against the NCBI protein database for functional annotation using Blast2GO [20] ‘blastx’ function. There were 57% (5,257 of the 9,319) SH transcripts returned at least one annotation hit. Likewise, 57% (11,551 of the 20,293) AS transcripts returned at least one hit. From the ‘blastn’ taxonomy report, top hits of the S. homoeocarpa sequences were primarily from Sclerotinia sclerotiorum (44.1%) and Botryotinia fuckeliana (34.9%) (Figure 3.4a). This is not surprising considering their taxonomic similarity to S. homoeocarpa, and the abundance of S. sclerotiorum and B. fuckeliana sequence data available on GeneBank. The 26 fungal species

92 represented in the taxonomy report belong to various classes of the subphylum Pezizomycotina, with seven of the 26 species being pathogens of monocot plants. There were a small number of sequences related to monocots including rice (Oryza sativa, 1.6%), bamboo (Phyllostachys edulis,

0.3%), and barley (Hordeum vulgare, 0.4%). We attributed this either to contamination of the fungal sequences by A. stolonifera sequences in the interaction library, to the presence of very highly conserved sequences, or to the presence of sequences occurring in both organisms, such as those f transposable elements. The top blast hit species for the A. stolonifera sequences included predominantly monocot grass species (Figure 3.4b). Fungal sequences that were found included those of Pyrenophora teres Smed.-Pet. And Phaeosphaeria nodorum (E. Mull) Hedjar, pathogens of barley and wheat, respectively.

Differentially expressed gene Identification

Reads Per Kilobase of exon model per Million mapped reads (RPKM) value was calculated of all SH and AS transcripts in different libraries and the Log2 ratio was calculated to reflect the fold changes of interaction library compared to control libraries (SH PDA and AS-only).

To identify transcripts that were up-regulated in the interaction condition, transcripts with statistically significant changes of RPKM values during infection were examined to determine if there were patterns in the types of genes with increased or decreased transcription level. Table 3.3

93 summarizes some of the types of SH transcripts of interest during infection including a long list of glycosyl hydroalase transcripts and various proteases. Table 3.4 summarizes transcripts of interest in the AS library including a long list of retrotransposons, defense-related proteins including 13 different germin-like proteins, and various enzymes involved in plant hormone synthesis and regulation. Interestingly, all jasmonate-induced protein transcripts were significantly down regulated at 96 hpi.

Gene ontology terms analysis

The list of significantly up-regulated SH and AS transcripts were sorted into gene ontology (GO) term categories for molecular function, biological processes, and cellular component using the Blast2GO

[20] program (Figures 3.5 and 3.6). The molecular function terms associated with up-regulated SH transcripts include a large percentage of plant cell-wall degrading enzymes categories (Figure 3.5a) including polygalacturonase activity (4.0%), cellulose activity (5.1%), endopeptidase activity (4.0%),

α-arabinofuranosidase activity (2.9%), and serine-type peptidase activity (5.1%). Other up-regulated molecular function of GO terms included those for xenobiotic-transporting activity (5.1%) and activity (7.4%). Biological process of up-regulated SH transcript GO terms included auxin biosynthesis (11.4%), transmembrane transport (18.9%), and oxidation reduction (25.1%)

(Figure 3.5b). A majority of the up-regulated SH cellular component GO terms were for those integral to the membrane (37.2%) and those for extracellular regions (34.0%) (Figure 3.5c), perhaps

94 reflecting the secretion of cell wall degrading enzymes and other secondary metabolites important for pathogenicity and virulence of S. homoeocarpa.

The molecular function of GO terms associated with the up-regulated AS transcripts included those associated with hydrolase activity (19.2%), DNA binding activity (10.9%), and nucleotide binding

(24.3%) (Figure 3.6a). The Biological process GO terms associated with the up-regulated AS transcript library included those related to translation (10.8%), transport (14.6%), and DNA metabolic processes (11.1%) (Figure 3.6b). Finally, an overwhelming number of cellular process GO terms from the AS up-regulated library were associated with the mitochondrion (30.8%) and plastids (31.8%)

(Figure 3.6c).

Conserved domain analysis

Protein domains were identified in the S. homoeocarpa and A. stolonifera transcripts libraries using

HMMER software (v3.0) [21]. Also among SH transcripts, 1588 were predicted to contain a signal peptide by SignalP (v4.0) [22], and thus could be candidates for secreted proteins. Three proteases with secretion signals included two subtilisin/sedolisin proteases with log-fold increases of 64 and

6.2 as well as a cuticle degrading serine protease with a log-fold increase of 4.0. Conserved domains significantly enriched in the interaction library (P<0.01) are listed in Table 3.5 (SH conserved domains) and Table 3.6 (AS conserved domains). Domains enriched in the SH interaction library include a

95 variety of glycosyl hydrolases, proteases domains, and transporters. Domains of interest in the enriched AS interaction library include cytochrome P450, various ABC transporters, and cupin domains, which include germin-class enzymes.

Identification of microsatellite markers

Simple sequence repeats (SSRs) were identified using Microsatellate Identification Tool (MISA). In the AS transcriptome library, 380 SSR markers were identified. The distribution in size of these SSRs is shown in Table 3.7. The largest groups of SSR markers in grass were the single and tri-nucleotide repeats that made up approximately 47.7% and 27.2% of the identified markers, respectively. There were 715 SSR sequences identified in the SH transcripts library (Table 3.7). Similar to the AS SSRs, the largest groups were the single and tri-nucleotide repeats that made up 37.2% and 30.9% of the identified marers, respectively.

Validation of RPKM value by RT-PCR

To verify the RPKM calculation, RT-PCR and semi-quantitive PCR of selected genes were performed by Dr. Angela Orshinsky. Firstly, Five SH transcripts up-regulated in infection stage were selected for semi-quantitive PCR, which all showed positive correlation between RPKM fold change and relative intensity of gel bands (Figure 3.7). To further confirm RPKM calculations, real-time RT-PCR assays 96 were performed for genes with significantly different transcription levels during infection. The relative transcription data verified that the RPKM calculation and statistical analysis of transcriptome data was accurate (Figure 3.8). The real time RT-PCR data confirms the up-regulateion of two fungal xylanases (SH_5411 and SH_5726), a fungal laccase (SH_7961), and a creeping bentgrass germin protein (AS_608). Down-regulateed transcripts SH_6925 (scyatalone dehydrogenase) and SH_8369

(β-mannosidase) were also verified.

Discussions

Both SBS and 454 technologies have been reported to show high reproducibility and accuracy for transcriptome studies [18, 23], and reached higher-quality data generation than microarray and cloning based EST method [18, 19]. Compared to 454, SBS technology is considered to be superior because of the greater sequencing coverage for a similar cost. In this study, we obtained a mapping percentage of SBS to the S. homoeocarpa assembly ranging from 33.3-67.4%, and A. stolonifera mapping percentages ranging from 5.7-16.8%. The low mapping percentage could be explained by the highly stringent parameters we applied, which eliminated all reads with even a single mismatch and multi-locus mapping reads, contributing factors may also be transcriptome misassembly. The use of our analysis pipeline and the statistics applied was verified with real-time PCR. It should be noted that mapping uncertainty exists and accounts for a large portion of the reads in the mapping, especially for SBS reads, since short sequence lengths make it difficult to distinguish paralogous

97 genes, alternatively spliced isoforms, and low complexity sequences, all of which result in multi-locus mapping reads that are discarded in process [24]. Low mapping percentage may also be due to the incompletely assembled transcriptome library using 454 reads.

It is now possible to identify potential pathogenicity factors highly transcribed by S. homoeocarpa during infection of A. stolonifera by RNA-seq analysis. The most surprising discovery was the number of enriched S. homoeocarpa transcripts within the interaction library that encoded glycosyl hydrolase enzymes. In total, 52 of the up-regulated SH transcripts encoded glycosyl hydrolase genes and 22.3% of molecular function GO terms were associated with glycosyl hydrolase activity. A keyword search using search terms “glycosyl hydrolase”, “cellulose” and “pectinase” in the SH transcript library revealed over 100 transcripts annotated as glycosyl hydrolase, representing 85 different families of the total 125 glycosyl hydrolase families [25, 26]. The enriched number and such a varied group of glycosyl hydrolase families is consistent with the ability of this pathogen to cause disease on a wide range of plant hosts. From the data in this project, S. homoeocarpa glycosyl hydrolase genes showing up-regulateion were mostly xylanases and arabinose. This was expected since grass cell walls contain roughly 40% xylans as well as glucuronoarabinoxylans (GAX) which make up a majority of monocot hemicellulose [27]. It was reported that pectinases, xylanases, and celluases were often virulence and pathogenicity factors for fungal phytopathogens. For example, Botrytis cinera Pers. and Septoria nodorum (Berk.), showed reduced virulence in a mutant of a cell wall degrading enzyme [28-30].

However, the large number of up-regulated and functionally redundant cell wall degrading enzymes in the interaction library will make it very difficult to verify the role of any single enzyme.

98 A large number of proteinases, mostly serine proteases, were also found associated with pathogen inoculation. Secretion signals were detected on three of the proteases of interest, including two subtilisin/sedolisin proteases and one identified as a cuticle degrading serine protease. There were a

4.0-6.4 log-fold up-regulateion for these secreted proteases in the interaction library. In the molecular function GO terms, 5.1% were associated specifically with serine-proease activity. Serine proteases are known as potential pathogenicity determinants for a variety of fungal phytopathogens including nematode-parasitizing fungi, entomopathogens [31].

It was observed that multi-drug resistance (MDR) ATP binding cassette (ABC) transporter proteins were also up-regulated in the S. homoeocarpa interaction condition. Members of the MDR-ABC transporter family play important roles in resistance to plant defense compounds and fungicide, as well as efflux of endogeneous toxins [32, 33]. In B. cinerea, both ABC and MFS transporters have been implicated in resistance of the pathogen to a wide range of fungicides including phenylpyrroles, azoles, anilopyrimidines, dicarboximides, and strobilurins [34-36]. Also, the enrichment of different transport proteins with redundant functions in the S. homoeocarpa interaction library may explain the development of isolates that show resistance to multiple fungicide class.

On the plant side, some of the most prominent up-regulated defense-related transcripts include the germin and germin-like proteins. Germins belong to the cupin superfamily and one important defense enzyme – oxalate oxidase [37] – were put in this category. Germin-type oxalates are enzymes which are very unique to Gramineaeious plants [37]. Oxalate oxidase has been known to be

99 more active in moderately resistant creeping bentgrass cultivars such as L-93, compared to Crenshaw which is a highly susceptible cultivar [38]. In addition to the plant defense-related transcripts, other significantly up-regulated creeping bentgrass transcripts were annotated as transposons and retrotransposons, which is not surprising since it has been reported that transposons make up about

80% of the DNA in cereal crops [39]. Retrotransposons were reported to be transcriptionally active in plants undergoing stress such as during wounding and in the presence of salicylic acid and jasmonate

[40]. Although the clear connection between transposons and plant defense was not discovered, their increased transcription level during infection should be further investigated.

The RNA-seq analysis presented in this project showed the value of next generation sequencing technology for helping elucidate mechanism of a plant-host interaction system. The rapid development of the sequencing technology since the completion of this project has already increase reads lengths and provided deeper sequencing coverage with a relatively lower cost. Also, the development and upgrade of analysis tools has made the assembly process faster and more accurate.

This resulting data analysis will be used to support future research efforts to characterize S. homoeocarpa virulence factors and creeping bentgrass defense-related transcripts.

100 Materials and Methods

Preparation RNA for SBS sequencing libraries

RNA libraries for SBS and 454 sequencing were prepared by Dr. Venu Reddyvari-Channarayappa in this project.

Sclerotinia homoeocarpa isolate MB01 was isolated from creeping bentgrass at The Ohio State

University Turfgrass Research and Education Facility. The fungus was cultured on PDA media and

5mm plugs of actively growing mycelium were transferred to PDB medium and incubated at 26oC with shaking at 160 rpm for 96 h. Cultures grown on PDA overplayed with cellophane membranes were also used for SBS library construction. At 96 h, mycelia were filtered from PDB or scraped from the cellophane membrane on PDA, and total RNA was extracted using TriReagent (Sigma) according to manufacturer’s instructions. For interaction and creeping bentgrass libraries, creeping bentgrass cv. Crenshaw was grown in a growth chamber at 26oC day and 22oC night temperatures with a 12 h photoperiod and at 70% relative humidity. Three-week old seedlings were challenged with millet seeds colonized by S. homoeocarpa MB01. These plants were incubated in clear bags to create humidity for 48 h after which the plants were uncovered during the day and bagged again for the night. After 96 h, leaves from inoculated and non-inoculated plants were harvested, homogenized in liquid nitrogen, and RNA was extracted using TriReagent according to the manufacturer’s instruction.

The SBS library templates were prepared using the Illumina Duplex-Specific thermostable nuclease

101 (DSN) normalization kit and were analyzed using the Illumina Genome Analyzer I (GAI) at the

Molecular and Cellular Imaging Center of The Ohio State University in Wooster, Ohio. The resulting libraries included a 96 h PDB culture control, a 96 h PDA culture control, a 96 h creeping bentgrass control, and a 96 h S. homoeocarpa – creeping bentgrass interaction library.

Preparation RNA for 454 sequencing libraries

S. homoeocarpa isolate MB01 was cultured on PDA and 5 mm plugs of actively growing mycelium were transferred to PDB medium and incubated at 26oC with shaking at 160 rpm for 48, 96, or 144h.

At each time point, mycelia were filtered and total RNA was extracted using TriReagent according to manufacturer’s instructions. For interaction and creeping bentgrass libraries, creeping bentgrass cv.

Crenshaw was grown in a growth chamber as described previously. Three-week old seedlings were inoculated in one of two ways: with a 5mm plug of actively growing mycelium from a PDA culture or misting of a mycelia homogenate, in water, from a 96h PDB culture of S. homoeocarpa MB01. These plants were incubated as described for SBS library preparation. After 96 h, leaves from inoculated and non-inoculated plants were harvested, homogenized in liquid nitrogen, and RNA was extracted using TriReagent according to the manufacturer’s instructions. Total RNA was sent to the Core

Genomics Facility at Purdue University, Indiana for preparation and sequencing using Roche GS-FLX

(454) sequencer with Titanium chemistry. Seven RNA-seq libraries were prepared that included S. homoeocarpa PDB cultures at 48, 96, and 144 h after inoculation, S. homoeocarpa grown on PDA at

102 96 h, creeping bentgrass non-inoculated controls, and for creeping bentgrass incubated for 96 h with

S. homoeocarpa applied as a PDA culture plug or mycelia homogenate in water.

Sequencing and transcripts library assembly

The 454 raw sequence reads were assembled using GS Data Analysis Software (v2.5, Roche,

Indianapolis, IN) after removal of adaptor sequences. S. homoeocarpa (SH) reads from the PDA, PDB, and 96 hip interaction libraries, and the A. stolonifera (AS) reads from the 96h control and 96 h interaction libraries were assembled using GS de-novo assembler (v2.5, Roche) with parameter setting as 90% identity and a minimum 40 bp overlap. To eliminate SH reads in AS assembly library, reads from the SH interaction libraries that matched AS GeneBank EST reads were removed from the

SH transcript assembly. Then reads from the AS interaction library matching SH assembly were also removed. Both SH and AS transcriptome assembly as used in this study was submitted to GeneBank

TSA database (Accession: PRJNA84359).

Full length SBS reads were directly mapped to the SH and AS transcriptome libraries using the Maq

(v0.7) [41] as mapping tool. Mapping parameters allowed for only unique mapped reads with no mismatches. The number of mapped SBS reads for each transcript was counted by “soap.coverage” tool in SOAPAligner package[42].

103 Transcript expression analysis

RPKM value were calculated for each transcript by dividing number of mapped reads by length of the transcripts and number of total sequenced reads in this library. Fisher’s Exact Test, a test used for analyzing unreplicated transcript data [43] was used to determine statistical significance of transcript

RPKM value change. The test was done using a pairwise comparison of interaction RPKM value versus RPKM from fungus grown on PDA, or for interaction RPKM values non-inoculated control grass. The SH assembly contained 9,319 unique annotated transcripts and the AS assembly contained 20,293 unique annotated transcripts. Therefore, the statistical significance of the Fisher’s

Test was evaluated against a Bonferroni corrected P-value of 1.86x10-5 and 2.43x10-6 for fungal and grass transcripts, respectively. Furthermore, only transcripts with at least 2-log fold change in transcript abundance were selected as up-regulated. These conservative criteria were applied to avoid false positives.

The log-fold RPKM values of selected transcripts were validated using real-time RT-PCR reaction on an iQ5 real time thermocycler. RNA was extracted from fungus grown on PDA, non-inoculated grass, and infected grass as descript above. RNA was extracted using Trizol (Invitrogen) and RNA extracts were treated with RQ1 DNAse (Promega) according to the manufacturer’s directions, except that

RNAse Inhibitor (Promega) was added to each 100ul of RNA extract. Complementary DNA was created from 1ug of total RNA using iScript (BioRad). PCR was conducted using the SYBR supermix

(Biorad) as per manufacturer’s directions. The cycle conditions were 95oC for 3 minutes followed by

104 40 cycles of 95oC for 30 s, 58.5oC for 30 s, and 72oC for 30 s. The melting temperature profiles and gel electrophoresis was used to evaluate the specificity of the reactions and the absence of primer dimers. Real-time RT-PCR was conducted with three biological replicates for each of grass, interaction, and fungal cDNA samples. Three technical replicates of each cDNA sample were used in each experiment.

Reaction efficiency and relative expression data were analyzed using the relative expression software tool (REST) program (Qiagen). Log base two values of relative expression ratios calculated in the REST program and the corresponding log base two RPKM expression ratios were compared graphically.

Transcripts library annotation

Taxonomic and functional annotation of the SH and AS transcripts were conducted by using

Blast2GO [20] software to run “blastx” and “blastn” algorithms against non-redundant nucleo-tide/protein database from NCBI. Within the SH library, any transcripts that resulted in top blastn hit species of Hordeum, Oryza, Sorghum, Agrostis, or Vitis were removed. Combined graphs for GO terms associated with statistically up-regulated transcripts are presented in the results section.

105 Conserved domain analysis

Transcript sequences were translated to proteins in all 6 possible frames, and conserved protein domains were identified using HMMER (v3.0) [21] to screen the Pfam-A database. All the predicted domains with E-value <0.01 were reported. The enrichment ratio and significant value was calculated based on a Fiser’s Exact Test. Descriptions for enriched domains were found on the Pfam website.

Identification of microsatellite markers

SSR markers were identified using Micro Satellite identification tool (MISA) in both SH and AS transcripts libraries. Input files were the complete sequences from SH and AS libraries. Parameters allowed for the inclusion of mononucleotide repeats that were a minimum of 10ntd long, dinucleotides with minimum of 6 repeats, trinucleotide minimum of 5 repeats, tetranucleotide minimum of 5 repeats, pentanucleotides with a minimum of 5 repeats, and hexanucleotides with a minimum of 5 repeats. The maximum interruption distance between 2 SSRs was 10 ntd.

106 References

1. Walsh B, Ikeda SS, Boland GJ: Biology and management of dollar spot (Sclerotinia homoeocarpa); an important disease of turfgrass. Hortscience 1999, 34(1):13-21. 2. Chakraborty N, Chang T, Casler MD, Jung G: Response of bentgrass cultivars to Sclerotinia homoeocarpa isolates representing 10 vegetative compatibility groups. Crop Sci 2006, 46(3):1237-1244. 3. Pigati RL, Dernoeden PH, Grybauskas AP, Momen B: Simulated Rainfall and Mowing Impact Fungicide Performance When Targeting Dollar Spot in Creeping Bentgrass. Plant Dis 2010, 94(5):596-603. 4. Burpee LL CL: Evaluation of two dollarspot forecasting systems for creeping bentgrass. Canadian Journal of Plant Science 1986, 5:345-351. 5. Couch HB BJ: Influence of environment on disease of turfgrasses. II. Effect of nutrition, pH, and soil moisture on Sclerotinia Dollar Spot. Phytopathology 1960, 257:761-765. 6. Hall R: Relationship between Weather Factors and Dollar Spot of Creeping Bentgrass. Canadian Journal of Plant Science 1984, 64(1):167-174. 7. Orshinsky AM: Characterization of the growth and host colonization of virulent, asymptomatic, and hypovirulent isolates of Sclerotinia homoeocarpa. Guelph: University of Guelph 2010:203. 8. Endo RM: Control of Dollar Spot of Turfgrass by Nitrogen and Its Probable Bases. Phytopathology 1966, 56(8):877-&. 9. Monteith J DA: Turfgrass diseases and their control. Bulletin of the United States Golf Association 1932, 12(85). 10. Orshinsky AM, Boland GJ: Ophiostoma mitovirus 3a, ascorbic acid, glutathione, and photoperiod affect the development of stromata and apothecia by Sclerotinia homoeocarpa. Canadian journal of microbiology 2011, 57(5):398-407. 11. Hsiang T, Mahuku GS: Genetic variation within and between southern Ontario populations of Sclerotinia homoeocarpa. Plant Pathol 1999, 48(1):83-94. 12. RM E: Influence of temperature on rate of growth of five fungus pathogens of turfgrass and on rate of disease spread. Phytopathology 1963, 53:857-877. 13. Herath HMTB, Herath WHMW, Carvalho P, Khan SI, Tekwani BL, Duke SO, Tomaso-Peterson M, Nanayakkara NPD: Biologically Active Tetranorditerpenoids from the Fungus Sclerotinia homoeocarpa Causal Agent of Dollar Spot in Turfgrass. Journal of natural products 2009, 72(12):2091-2097. 14. Bonos SA BR, Clarke BB (ed.): An integrated approach to dollar spot disease in turfgrasses. New Brunswick, NJ: Rutgers Cooperative Extension; 2007. 15. R L (ed.): Turfgrass disease profiles: Dollar Spot: West Lafayette: Purdue Extension; 2009. 16. Huse SM, Huber JA, Morrison HG, Sogin ML, Mark Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome biology 2007, 8(7). 17. Birch P KS: Studying interaction transcriptomes: coordinated analyses of gene expression

107 during plant-microorganism interactions. New technologies for life sciences: A Trends Guide 2000, 0:1471-1931. 18. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome research 2008, 18(9):1509-1517. 19. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63. 20. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21(18):3674-3676. 21. Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic acids research 2011, 39:W29-W37. 22. Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 2011, 8(10):785-786. 23. Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 2008, 17(7):1636-1647. 24. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010, 26(4):493-500. 25. Henrissat B: A Classification of Glycosyl Hydrolases Based on Amino-Acid-Sequence Similarities. Biochemical Journal 1991, 280:309-316. 26. Henrissat B, Bairoch A: New Families in the Classification of Glycosyl Hydrolases Based on Amino-Acid-Sequence Similarities. Biochemical Journal 1993, 293:781-788. 27. Vogel J: Unique aspects of the grass cell wall. Current opinion in plant biology 2008, 11(3):301-307. 28. Kars I vKJ (ed.): Extracellular enzymes and metabolites involved in pathogenesis of Botrytis. Norwell, MA, USA: Kluwer Academic Publishers; 2007. 29. Rowe HC, Kliebenstein DJ: Elevated genetic variation within virulence-associated Botrytis cinerea polygalacturonase loci. Molecular plant-microbe interactions : MPMI 2007, 20(9):1126-1137. 30. Lehtinen U: Plant-Cell Wall Degrading Enzymes of Septoria-Nodorum. Physiol Mol Plant P 1993, 43(2):121-134. 31. Li J, Yu L, Yang JK, Dong LQ, Tian BY, Yu ZF, Liang LM, Zhang Y, Wang X, Zhang KQ: New insights into the evolution of subtilisin-like serine protease genes in Pezizomycotina. Bmc Evol Biol 2010, 10. 32. de Waard MA, Andrade AC, Hayashi K, Schoonbeek HJ, Stergiopoulos I, Zwiers LH: Impact of fungal drug transporters on fungicide sensitivity, multidrug resistance and virulence. Pest Manag Sci 2006, 62(3):195-207. 33. Del Sorbo G, Schoonbeek H, De Waard MA: Fungal transporters involved in efflux of natural toxic compounds and fungicides. Fungal Genet Biol 2000, 30(1):1-15.

108 34. Kretschmer M, Leroch M, Mosbach A, Walker AS, Fillinger S, Mernke D, Schoonbeek HJ, Pradier JM, Leroux P, De Waard MA et al: Fungicide-Driven Evolution and Molecular Basis of Multidrug Resistance in Field Populations of the Grey Mould Fungus Botrytis cinerea. Plos Pathogens 2009, 5(12). 35. Schoonbeek H, Del Sorbo G, De Waard MA: The ABC transporter BcatrB affects the sensitivity of Botrytis cinerea to the phytoalexin resveratrol and the fungicide fenpiclonil. Mol Plant Microbe In 2001, 14(4):562-571. 36. Stefanato FL, Abou-Mansour E, Buchala A, Kretschmer M, Mosbach A, Hahn M, Bochet CG, Metraux JP, Schoonbeek HJ: The ABC transporter BcatrB from Botrytis cinerea exports camalexin and is a virulence factor on Arabidopsis thaliana. Plant Journal 2009, 58(3):499-510. 37. Davidson RM, Reeves PA, Manosalva PM, Leach JE: Germins: A diverse important for crop improvement. Plant Sci 2009, 177(6):499-510. 38. Da Rocha AB, Hammerschmidt R: Increase in oxalate oxidase activity in bentgrass response to Sclerotinia homoeocarpa and chemical treatment. Phytopathology 2004, 94(6):S157-S157. 39. Sabot F, Simon D, Bernard M: Plant transposable elements, with an emphasis on grass species. Euphytica 2004, 139(3):227-247. 40. Grandbastien MA: Activation of plant retrotransposons under stress conditions. Trends Plant Sci 1998, 3(5):181-187. 41. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research 2008, 18(11):1851-1858. 42. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5):713-714. 43. Pfaffl MW, Horgan GW, Dempfle L: Relative expression software tool (REST (c)) for group-wise comparison and statistical analysis of relative expression results in real-time PCR. Nucleic acids research 2002, 30(9).

109

Sequencing Mean of Transcript Number Number of Number of Libraries Isotig Length Assembly of Reads Isotigs Singletons Included (bp) Total SH SH PDA 96h SH PDB 48h / 600,760 10,101 1,172 51,502 96h / 144h Interaction 96h Total AS AS 96h 205,403 5,017 898 58,446 Interaction 96h

Table 3.1. Characteristics of the 454 RNA-Seq reads.

110

SBS Library SH PDA SH PDB AS 96h Interaction Number of Reads 6,101,988 4,350,510 6,885,250 7,182,868 Mapping Percentage 67.4% 63.8% 0.9% 33.3% to SH Assembly Mapping Percentage 2.6% 2.4% 16.8% 5.7% to AS Assembly

Table 3.2. Mapping characteristics of the Sclerotinia homoeocarpa (SH) and Agrostis stolonifera (AS)

SBS reads to the SH and AS transcript assemblies.

111 Transcript Group Description Transcripts Log fold change

ferruloyl esterase 2 4.8 - 11.5 dihydroceramidase 1 3.1 amylase 1 3.0 arabinofuranosidase 7 2.4-9.9 rhamnosidase 1 2.1 endoglucanase 4 2.9-5.7 glucosidase 4 3.1-4.9 xylosidase 5 2.6-5.3 cellobiohydrolase 4 4.3-6.9 cellobiose dehydrogenase 1 4.1 Glycosyl hydrolases cellulase 1 3.6 cutinase 1 3.0 mannosidase 2 3.1-4.6 xylanase 3 4.6-8.5 glucanase 8 2.2-7.3 polygalacturonase 3 3.0-8.1 rhamnogalacturonase 2 2.9-8.1 lysozyme 1 5.7 pectinesterase 1 2.0 Total 52 2.0-11.5 serine protease 4 2.9-9.4 aspergillopepsin-2 1 2.3 Proteases peptidases 8 2.4-6.5 neutral protease 1 5.8 Total 14 2.3-9.4 mdr-like abc transporter 7 4.1-6.6 Transporters mfs multidrug 2 2.7-2.9 Total 9 2.7-6.6

Table 3.3. Summary of select Sclerotinia homoeocarpa transcript types that are significantly increased at 96 hpi after inoculation onto creeping bentgrass.

112

Transcript Group Product Transcripts Log fold change athila retroelement 1 5.8 copia-type retroelement 2 2.6-4.3 gag-pol polyprotein 10 2.4-4.9 mutator-like transposase 1 4.6 retroelement pol poly 2 2.2-2.7 retrotransposon line 1 3.0 Transposon retrotransposon ty1-copia 6 3.4-5.8 retrotransposon ty3-gypsy 13 2.5-8.6 transposon en spm 15 2.0-5.2 unidentified transposon 2 2.3-3.7 retrotransposon unclassified 18 2.4-7.6 Total 72 2.0-8.6 anthranilate synthase 1 4.5 calcineurin 1 3.5 cinnamoyl reductase 1 3.5 cytochrome p450 5 2.0-5.5 disease resistance nbs-lrr 3 2.3-4.0 e3 ubiquitin-protein ligase 2 3.7-4.0 fusarium resistance, i2c-5-like 1 2.9 germin a 13 3.4-6.2 Defense mdr abc transporter 6 5.9-7.7 pathogenesis associated pep2 1 9.1 rust resistance kinase lr10 1 4.5 terpene synthase 1 2.9 ubiquitin 2 2.1-4.8 ubiquitin-conjugating enzyme 1 2.1 ubiquitin-specific protease 1 3.8 zingiberene synthase 1 2.8 Total 47 2.0-9.1

Table 3.4. Summary of select Agrostis stolonifera transcript types that are significantly increased at

96 hpi after inoculation with Sclerotinia homoeocarpa.

113 Domain name Enrichment P-value Description Glyco_hydro_45 17.0 0.0030 Endoglucanase Glyco_hydro_62 17.0 0.0030 Alpha -L-arabinofuranosidase Glyco_hydro_61 15.3 <0.0001 Endoglucanase Peptidase_S28 13.6 0.0010 Serine carboxypeptidase S28 CBM_1 13.4 <0.0001 Fungal cellulose binding domain Lipid recognition, recognition of E1_DerP2_DerF2 12.8 0.0003 pathogen products Pectinesterase 12.8 0.0003 Pectinesterase NAD(P)H binding domain, Eno-Rase_NADH_b 12.2 0.0003 trans-2-enoyl-CoA reductase Glyco_hydro_43 11.4 <0.0001 Arabinanase COX3 11.4 0.0016 Cytochrome c oxidase subunit III Flavodoxin_2 11.4 0.0016 Flavodoxin-like fold Glyco_hydro_11 11.4 0.0016 Xylanases Glyco_hydro_12 11.4 0.0016 Endoglucanase and xyloglucan hydrolase Nucleoplasmin 11.4 0.0016 Chromatin decondensation proteins PLA2_B 11.4 0.0016 Lysophospholipase catalytic domain Polygalacturonase, Glyco_hydro_28 11.0 <0.0001 Rhamnogalacturonase A ABC_ATPase 10.8 <0.0001 Predicted ATPase of the ABC class Syja_N 10.2 0.0006 SacI homology domain Cellulase 9.7 0.0023 Cellulase Cysteine rich,putative role in fungal CFEM 9.7 0.0023 pathogenesis Flavodoxin_5 8.5 0.0030 Flavodoxin FMN_red 8.5 0.0030 NADPH-dependent FMN reductase Glyco_hydro_7 8.5 0.0030 Endoglucanase; cellobiohydrolase Glyco_hydro_92 8.5 0.0030 Alpha-1,2-mannosidases Mannosyl_trans3 8.5 0.0030 Mannosyltransferase putative Transport between ER and the Golgi Yos1 8.5 0.0030 complex 7tm_1 8.5 0.0089 7 transmembrane receptor Chorismate_synt 8.5 0.0089 Chorismate synthesis COX6B 8.5 0.0089 Cytochrome oxidase c subunit VIb

Continued

Table 3.5. Top enriched Sclerotinia homoeocarpa domains at 96 h post inoculation on creeping bentgrass 114 Table 3.5 continued

Cu-oxidase_2 8.5 0.0089 Multicopper oxidase Cupin_5 8.5 0.0089 Cupin superfamily DHDPS 8.5 0.0089 Dihydrodipicolinate synthetase family Ca2+ regulator and membrane fusion Fig1 8.5 0.0089 protein Fig1 Metallophos_2 8.5 0.0089 Calcineurin-like phosphoesterase N terminal of a number of known and Pex2_Pex12 8.5 0.0089 predicted peroxins Ribosomal_L35Ae 8.5 0.0089 Ribosomal protein L35Ae SAPS 8.5 0.0089 SIT4 phosphatase-associated protein UbiA 8.5 0.0089 UbiA prenyltransferase family ABC_membrane 8.2 <0.0001 ABC transporter transmembrane region Fungal_lectin 6.8 0.0061 Fungal fucose-specific lectin OPT 6.2 0.0080 OPT oligopeptide transporter protein Peptidase_S8 6.1 <0.0001 Subtilase family ABC_tran 5.8 <0.0001 ABC transporter

115 Domain name Enrichment P-value Description eIF-5_eIF-2B 20.2 <0.0001 Zinc binding C4 finger. ATS3 20.2 0.002 Embryo-specific protein 3 BDS_I_II 20.2 0.002 Antihypertensive protein BDS-I/II CSD 20.2 0.002 'Cold-shock' DNA-binding domain Folate_rec 20.2 0.002 Folate receptor family MaoC_dehydratas 20.2 0.002 Synthesis of monoamine oxidase Signal transduction related to cytoskeletal SH3_1 20.2 0.002 organisation Assembly of the transcription preinitiation TAFII28 20.2 0.002 complex. ABC2_membrane_3 15.1 <0.0001 ABC-2 type transporter Fer4 13.5 <0.0001 Iron sulfur binding domain Fer4_2 13.5 <0.0001 Iron sulfur binding domain Fer4_6 13.5 <0.0001 Iron sulfur binding domain Ribosomal_S2 13.5 <0.0001 Ribosomal protein S2 Calcium binding and coiled-coil domain CALCOCO1 13.5 0.0077 (CALCOCO1) iPGM_N 13.5 0.0077 independent phosphoglycerate mutase tRNA synthetases class I (E and Q), tRNA-synt_1c 12.6 <0.0001 catalytic domain TATA element modulatory factor 1 DNA TMF_DNA_bd 12.1 0.0004 binding UDP-glucoronosyl and UDP-glucosyl UDPGT 12.1 0.0004 transferase Cupin_1 11.9 <0.0001 Cupins, including germins Cupin_2 11.1 <0.0001 Cupin domain ABC2_membrane 10.9 <0.0001 ABC-2 type transporter GrpE 10.1 0.0011 GrpE nucleotide exchange factor EFG_C 9.0 0.0002 Elongation factor G C-terminus GTP_EFTU_D3 8.8 <0.0001 Elongation factor Tu C-terminal domain ABC_ATPase 8.7 <0.0001 Predicted ATPase of the ABC class IF-2B 8.7 0.0025 Initiation factor 2 subunit family

Continued

Table 3.6. Top enriched Agrostis stolonifera domains at 96 h post inoculation with Sclerotinia homoeocarpa.

116 Table 3.6 continued

RNA_pol_Rpb1_5 8.7 0.0025 RNA polymerase Rpb1, domain 5 Cytochrom_B_C 7.6 0.0049 Cytochrome b/b6, C-terminal structural maintenance of chromosomes SMC_N 7.3 <0.0001 protein ABC_membrane 6.7 <0.0001 ABC transporter transmembrane region ATP-synt_C 6.7 0.0086 ATP synthase subunit C ATP-synt_ab_C 6.7 0.0012 ATP synthase alpha/beta chain DUF1602 6.1 <0.0001 Protein of unknown function GTP_EFTU 5.8 <0.0001 Elongation factor Tu GTP binding domain ABC_tran 5.5 <0.0001 ABC transporters Terpene synthase family, metal binding Terpene_synth_C 5.4 0.0049 domain p450 5.0 <0.0001 Oxidative degradation of compounds GTP_EFTU_D2 5.0 0.0005 Elongation factor Tu domain 2

117

Type of SSR Number Percentage Length (bp) Sclerotinia homoeocarpa Mononucleotide repeat 266 37.2% 10 – 23 Dinucleotide repeat 100 14.0% 12 – 42 Trinucleotide repeat 221 30.9% 15 – 84 Tetranucleotide repeat 56 7.8% 20 – 72 Pentanucleotide repeat 25 3.5% 25 – 45 Hexanucleotide repeat 24 3.4% 30 – 54 Complex 23 3.2% 21 – 112 Total 715 Agrostis stolonifera Mononucleotide repeat 186 47.7% 10 – 14 Dinucleotide repeat 66 16.9% 12 – 52 Trinucleotide repeat 106 27.2% 15 – 36 Tetranucleotide repeat 12 3.1% 24 – 28 Pentanucleotide repeat 0 0.0% n/a Hexanucleotide repeat 2 0.5% 48 – 54 Complex 18 4.6% 21 - 157 Total 390

Table 3.7. Simple Sequence Repeat (SSR) markers from the Sclerotinia homoeocarpa and Agrostis stolonifera transcriptome.

118

Figure 3.1. Read length of 454 sequencing data. The range of sequence read length (bp) of 454 data is shown as the shaded area. The length of the SBS reads is 16 bp indicated with the arrow.

119

Figure 3.2. Signal tracks of SBS reads for two genes (SH3549 and SH8608) from either the PDA, PDB, or interaction library.

120

Figure 3.3. Venn diagram of RNA-Seq reads unique and common to the SH library, AS library, and

Interaction library. A: Reads unique to the Inoculation library, B: Reads unique to the SH only library,

C: Reads unique to the AS only library, D: Reads common to the Inoculation and SH library, E: Reads common to the Inoculation and AS library, F: Reads common to the SH and AS libraries, G: Reads common to the Inoculation, SH, and AS libraries.

121

Figure 3.4. Top Species Blast hits: (a) SH transcript library (b) AS transcript library.

122

Figure 3.5. GO terms associated with up-regulated SH transcripts. (a) Molecular Function, (b)

Biological Process, (c) Cellular Component. 123

Figure 3.6. GO terms associated with up-regulated AS transcripts. (a) Molecular Function, (b)

Biological Process, (c) Cellular Component. 124

Figure 3.7. Comparison of semi-quantitative RT-PCR data to RPKM values. Creeping bentgrass was infected with Sclerotinia homoeocarpa from a PDA culture or sprayed on as a mycelia homogenate in

PDB.

125

Figure 3.8. Validation of RPKM data for selected SH and AS transcripts using relative expression real time PCR.

126

Chapter 4: Whole genome sequencing of seven Magnaporthe oryzae field isolates reveals genome

content variation and predicted pathogenicity determinants

127 Introduction

Rice has been served as major food source worldwide, especially in Asia and Africa, however, a large portion of yields are lost through disease annually. Rice blast, caused by fungal pathogen

Magnaporthe oryzae, is one of the most severe rice diseases and is found almost everywhere rice is grown. Severe disease outbreaks can destroy up to 90% of rice yields for an entire field, region, or country causing a dramatic impact on human welfare and regional economies [1-3]. Condium of

Magnaporthe oryzae are spread by rain or water flush, and accomplish the infection by penetrating into rice leaves using a special structure called an appressorium. Mycelia then explore inside host tissue and eventually cause cell death [4, 5]. The most effective method to control of rice blast comes from the introduction of resistance (R) gene into elite rice cultivars, which will recognize the corresponding avirulence (Avr) genes and trigger “Hyper sensitive response” [6]. However, this resistance is often broken within a few years [7, 8], mostly due to the mutation or function loss of the Avr genes under strong selection force.

M. oryzae and rice serve as model organisms for studying host-pathogen interaction since both genomes are publically available [9]. This pathosystem generally controlled by the gene-for-gene model [10-13], which was proposed by Flor in 1971 [14]. While many R genes in rice have been identified and cloned, the identification of Avr genes in the fungus has been limited. There are three

Avr genes involved in host species specificity: PWL1 [15], PWL2 [16], and Avr-CO39 [17], and at least six Avr genes being involved in rice cultivar specificity, including Avr-Pita [18], ACE1 [19], AvrPiz-t

128 [20], AvrPia [21, 22], Avr-Pii [22], and Avr-Pik/km/kp [22]. While R proteins are reported to contain some conserved functional domains - such as NBS-LRR domains, no obvious sequence or structural patterns can be identified for Avr genes, which make their identification difficult. However,

“genome-wide association analysis” (GWAS) can help in identifying novel Avr genes based on the association between Avr genes and cultivar specific virulence.

In one study focusing on the Avr-Pita family that includes AvrPita1, AvrPita2, and AvrPita3, it was found that AvrPita1 and AvrPita2 were associated with transposon elements. These located randomly in different chromosomes in various field isolates, which may be a result of high selection imposed by the corresponding R gene [23]. This dramatic location variation, as well as high diversifying selection, makes it almost impossible to identify avirulence genes without genome sequencing and association analysis.

The reference strain of M. oryzae is 70-15, which was also used as the first rice blast isolate sequenced in 2002 [9]. This strain was generated from a cross between a rice pathotype and a weeping love grass (Eragrostis curvula) pathotype which was then backcrossed with rice isolate

Guy11 from French Guyana [24, 25]. The resulting 70-15 isolate has been used as reference strain in many laboratories and used for many fungal genome comparison studies. It was reported that in

70-15, some weeping love grass pathogen sequences were retained.

With the fast development of next-generation sequencing technology, genome re-sequencing and comparative studies have been reported on multiple fungal phytopathogens including: Alternaria

129 [26], Fusarium [27], Aspergillus [28], and Utilagus [29] et al. Some of common findings from these studies include high level of variation among the genomes of different isolates and unique genome regions that contain isolate specific virulence factor genes.

In M. oryzae, several genome comparison projects have been initiated recently. The first identification of novel Avr genes via genome comparison was reported in 2009 [22]. A research group for that report first tried to screen the secreted proteins in 70-15 in different field isolates and associate polymorphisms with pathogenicity. However, no orthologous genes were found. Then they sequenced field isolate “Ina168” which is known to contain 9 Avr genes, and identified a

1.68Mb region unique to 70-15. Through an association study for genes located on this specific region, three novo Avr genes were identified. The project demonstrated the value of 70-15 reference genome, which can be used as an Avr gene minimum reference for comparison to field isolates.

Recently, another genome comparison between two field isolates of M. oryzae has been performed with 70-15 included as a reference [30]. The comparison discovered many isolate-specific genomic regions. For those shared genes, many predicted to be involved in host-pathogen interactions were expanded and found to be under diversifying selection. This study also identified a large number of transposon-like elements, with less than 30% of them conserved among all three isolates, suggesting recent active transposition events in this pathogen.

While the genome content of M. oryzae has been characterized through multiple sequencing

130 projects, other important genome-wide transcript profiling techniques have been investigated. In

2012, the transcriptome libraries of appressorium development have been sequenced using a NGS technique called “SuperSAGE”. By analyzing the expression level change during appressorium formation, Pmk1 MAP kinase was revealed as a key global regulator in appressorium formation and related process [31]. Other noticeable examples included a ChIP-Seq study to examine the regulation network of a calcium signal transcription factor MoCrz1 [32]. These successfully high throughput studies on Magnaporthe, as well as other organisms, demonstrated the power of these approaches, and built base of applying NGS studies to studying pathogen-host interactions.

This study investigated “isolate-specific” genome content. Major genome content differences in isolates of M. oryzae have been reported. The first report was in Ina168 by comparison to the reference 70-15 genome [22]. Differences were also identified easily through CHEF gels in an

AVR-Pita study [23]. Both studies confirmed that Magnaporthe oryzae contains supernumerary chromosomes that vary in size and AVR genes may locate on them. Supernumerary chromosomes are sometimes called “conditionally dispensable chromosome”. They were firstly discovered and achieved notoriety in fungi in the genus Alternaria [33, 34]. Subsequently, through genome sequencing studies, similar chromosomes were reported in other fungi included Fusarium oxysporum [27], Nectria haematococca [35, 36], Mycosphaerella graminicola [37], Cochliobolus heterostrophus [38], Leptosphaeria maculans [39].

In this study, we performed whole genome sequencing on seven Magnaporthe oryzae field isolates

131 CHL1, CHL2, CHL42, CHL43, CHL346, CHL359, and CHL381, and then compare the de novo assembly and aligned to the reference sequence of 70-15. Through this analysis we attempted to answer the following questions: 1) Do isolate-specific genes or genome regions exist between the seven field isolates? What are the functions of unique genomic regions and can we determine where were they originated? 2) How many variations can be found between each isolate and 70-15? Do

SNPs/Indels/CNVs (copy number variation) serve as a major mutation sources and how many genes do they affect? 3) How many secreted proteins are under positive selection force and showed polymorphisms that suggest a function as potential virulence effectors?

Results and Discussions

Genome sequencing and assembly

Genomes of seven M. oryzae isolates strains CHL1, CHL2, CHL42, CHL43, CHL346, CHL359, and

CHL381 were sequenced by a Illumina Genome Analyzer II, a next-generation sequencing approach.

The resulting 75bp paired-end short reads were assembled by Velvet [40] in a hybrid manner. De novo assembly was followed by aligning to the 70-15 finished sequence Version 8 as reference to build a scaffold. It was observed that the genome size of the 7 field isolates showed significant variation ranging from 37.2MB of CHL43 to 39.3MB of CHL346 (Table 4.1). The N50 of CHL42 and

CHL43 showed a significantly lower level compared to other 5 isolates, which may be caused by a

132 higher percentage of repeat sequences that made de novo assembly more difficult. All field isolates have average 10.5K predicted ORFs. Gene density and average CDS length also showed a similar number as the 70-15 genome. The number of predicted gene models was about 2000 less than

70-15 on average[9], and also less than the number in another two publically available sequenced field isolates P131 and Y34 [30], which can be explained by the different gene prediction tools used.

FGENESH [41] was used for the prediction of P131 and Y34, and it was reported that the predictor tended to produce a significant number of false positive predictions. Another possible reason for our reduced predicted genes number was the lower quality of the genome assembly where many paralogous genes may have assembled into one location in the assembly.

Genome content comparison

Genome assembles comparison among all the 7 sequenced isolates and the reference 70-15 showed that each of sequenced genome could be aligned with each other with identity of at least 82.4%.

There’s much more noise in the dot plot of comparison between 70-15 and field isolates, compared to the alignment between field isolates themselves, suggesting that more diversifying sequences exist between 70-15 and field isolates. While each strain contains unique genome content, strains collected from same geographical location also shared unique content within the group, but not in other geographic locations (Figure 4.1).

133 Phylogenetic tree construction

The phylogenetic relationships of the 7 sequenced field strains were determined using the genetic distances calculated from the whole genome SNPs, which contain about 10K SNP locations. The resulting neighbor-joining tree showed three divergent groups generally segregating according to their geographical origins, with field isolates CHL1 and CHL2 grouping more closely to the reference

70-15. (Figure 4.2)

It was surprising to find the close relation between the reference 70-15 and the CHL1/CHL2 group as

70-15 was generated from rice pathotype in France and CHL1/CHL2 were collected from Barley in

Thailand. The possible explanation could be that 70-15 contains some genome content from love grass pathogen, while CHL1 and CHL2 are barley pathogens, thus they could share some sequences.

Unique genes / Presence-absence variation

Among the ~10K predicted ORFs in each of the 7 sequenced strains and 70-15, unique genes were counted. Since the 7 filed isolates belong to 3 phylogenetic and geographical groups, to simplify the

Venn diagram, CHL1, CHL42, CHL346, and 70-15 were chosen to represent all 4 groups. As can be seen in Figure 4.3, most genes (84.7%) fell into “core genes”, and 5% were shared among all field isolate but not 70-15. An interesting finding was that 2,914 genes were found unique in 70-15 but not in any of the field isolates. As mentioned above, the large number of unique 70-15 genes is most

134 likely the result of the extra predicted genes compared to the field isolates. Examination of these

2,914 genes found that most of them were short fragmented genes with the average length of only

618 bp, compared to the average length of all the 70-15 genes at 1,317bp. However, despite the fact the 70-15 unique genes may be generated by false positive gene predictions, it’s still worth considering their functions. Another important fact is the field isolates specific genes, where 1,525 such genes were not found in 70-15. In a previous study aiming to identify Avr genes, 330 candidate genes were found only in the field isolate “Ina168” with 3 of them being confirmed as Avr genes

[22].

Copy number variation analysis

Copy number variation (CNV) is now regarded as having significant effects on gene function, which in this case was calculated using sequenced reads coverage along the genome. The coverage was counted based on 70-15 as the reference. CNV was calculated for each sequenced strain individually and results are showed in Figure 4.4. It can be observed that 1) copy number variation does exist along the genome especially at the ends of chromosomes; 2) All field isolates shared a similar CNV distribution; 3) there are some unique CNV positions that only exist in some of the genomes but not the others. We also calculated the read coverage of all the predicted ORFs using 70-15 as the reference genome (Figure 4.5). When five-fold was used as a cut-off, typically a field isolate contained about 300 genes showing an expansion of gene copy numbers and 200 genes showing

135 reduced gene copy number. Table 4.2 shows the top 8 70-15 genes, which were under obvious CNVs.

MGG_15335 in CHL346 had a copy number of 3, while this gene in other isolates showed no copy number difference.

SNP and Indel polymorphisms

SNP profiles were generated by mapping short reads to the reference 70-15 genome sequence, with one mismatch allowed, using SOAPsnp [42] to call SNPs. IDPs were identified by short reads mapped allowing gaps. As showed in Table 4.3, all isolates had about 10,000 to 15,000 SNPs and 3,000-10,000

Indels with CHL346, CHL359, and CHL381 showing the most SNPs and Indels as compared to 70-15.

Among the SNPs, we identified those that have a potentially disabling effect on gene function as showed in Table 4.3. In total 355 SNPs were expected to introduce premature stop codons, 752 were expected to alter initiation methionine residues, and 132 were expected to disrupt splicing donor or acceptor sites. Large effect IDPs were identified as those cause frame shifts. There were 47,083 IDPs identified for all strains and 910 of them were large effect IDPs. Genes within large effect SNPs and

IDPs in their coding regions were regarded as losing the original functions, and thus could be a good candidate for virulence determinates.

It was reported in previous studies that SNPs and IDPs were not distributed evenly along the genome, but in fact they were found enriched in localized regions and thus make sequences in those regions

136 highly variable. Here we calculated the density of SNPs and IDPs in the genomes to identify enrichment regions and compare them among isolates. Figure 4.6 shows the SNP density along genome. The result is that 1) SNP hot spots do exists and majority of them locate at the telomere regions; 2) Different isolates have unique hot spots but also share common locations where the SNPs and IDPs were enriched.

Selection force

Ka/Ks ratios reflect the selection force of either a genomic region or a specific gene with the assumption that Ka/Ks ratio=1 means it is under neutral selection and a higher ratio means it’s under positive selection. Similar as SNP enriched regions, Ka/Ks ratio hot spots were also reported previously. As showed in Figure 4.7 as sharp peaks, most selection force hot spots were common in all sequenced genomes, however some unique peaks were observed in one or several genomes, indicating events of isolate-specific selection enriched regions.

Compared to SNP enriched regions, the Ka/Ks ratio peaks were less in number but much sharper, indicating some regions were under strong diversifying selection. It should be noted that more than half of such peaks occur at the telomere regions, which is consistent with previous reports that the edge of chromosomes and subtelomere regions are unstable.

137 Genes under positive selection

Evolution occurs when fungal pathogens interact with host and adapt to environmental change. We are interested in detecting genes involved in this adaptation. We found several genes containing a

Ka/Ks ratio > 1 and thus under positive selection associated with compelling functional annotations.

MGG_15380 encodes a salicylate hydroxylase. Endogenesis of salicylic acid has been reported to be required for systemic acquired resistance in plants [43]. Salicylate hydroxylase from pathogen hydroxylates salicylic acid, convert to catechol, which suppresses systemic acquired resistance in plants and favors infection of pathogens. The salicylate hydroxylase locus shows positive selection during co-evolution with rice plants harboring salicylic acid, which indicates a correlation between virulence and activity of salicylate hydroxylase in Magnaporthe.

MGG_17750 contains a SET structural domain. SET proteins methylate lysine residues on histone tails [44]. A previous study showed that SET gene families were involved in epigenetic modification and relevant to growth and development of multi-cellular organisms. SET family members SET1-6,

SET9, SET-JmJc and SET-TPR genes have been reported in fungi. The analysis of fungal SET gene family phyletic evolution shows that SET-domain proteins were highly conserved in plants, fungi and mammals. Interestingly, SET-domain proteins differentiate species-specific lineages. JmJc-SET,

SET-TPR and SET5/6 are specific to filamentous fungi and their homologous genes in rice blast isolates (MGG_09180, MGG_08578, MGG_09389, MGG_06798, MGG_04672) are classified in the same cluster during systemic phyletic evolution analysis, while there is no homologous genes in

138 Neurospora crassa and other non-pathogenic fungi [45]. This fact suggests that SET gene families may be involved in pathogenicity in rice blast isolates. Yeast SET3 orthologous gene SET3 in

Magnaporthe serves as important component of histone HDACs, which interacts with Tig1.

Conclusion

In this study, we sequenced the genome of 7 Magnaporthe oryzae field isolates using next generation sequencing technology and performed a reference assisted de novo genome assembly.

We found the genome statistics of the sequenced isolates were similar to the reference 70-15, but less gene models were predicted, probably due to the different set of prediction tools that we used.

One interesting finding was the close relationship between 70-15 and two barley pathotype

Magnaporthe oryzae. Variations at the gene level in the form of SNPs, Indels, CNVs were identified, and it was surprising to find most of these polymorphisms were enriched in select of the genome regions, which we call “hot spots”, and that different isolates contain unique “hot spots”. The selection ratios in genes were also calculated. Thus, those genes unique to 70-15 located in the “hot spots” and containing high Ka/Ks ratio we predict to be good candidates for novel virulence related and Avr genes.

139 Materials and Methods

Strains

All sequenced strains were provided by Dr. Qinghua Pan from China South Agricultural University.

CHL1 and CHL2 were collected from Barley in Thailand; CHL42 and CHL43 were collected from rice in

Yunnan Province, China; CHL346, CHL359, and CHL381 were collected from rice in Jiangsu Province,

China.

Sequencing

Genome DNA was extracted from the plate cultivation following an in-house DNA extraction protocol.

Whole genome sequencing library was generated with Illumina kit with an insert length around

300bp. Sequencing was performed using Illumina Genome Analyzer II at MCIC, the Ohio State

University. 78pb paired-end reads were generated with coverage of about 100X for each isolate.

Assembly

Since we have both a polished reference genome 70-15 and a deep sequenced short reads library, a hybrid assembly method: “velvet with Columbus module”, [40] which firstly map reads to reference genome to guide the assembly and then perform a de novo assembly. Assembly parameters 140 including k-mer were optimized by test running. In the assembly we tried to avoid misguidance by the reference so we compared sequences of this hybrid assembly to a pure de novo assembly and found most sequences were identical between the two methods expect hybrid assembly generated longer contigs. Gene prediction was performed using Augustus [46] with pre-trained “Magnaporthe grisea” matrix.

Genome content comparison

Genome of different isolates were compared with each other by performing a genome-wide sequence alignment using “Nucmer” tool in “MUMmer” package [47]. The alignment identity cut-off was set as 80%, and then the alignment results were visualized as plot dots pictures using

“mummerplot” tools in the package.

Unique gene identification

To identify the common and unique genes, whole genome gene set of CHL1, CHL42, CHL346, and

70-15 were firstly input into orthologous proteins were acquired by “Inparanoid” [48] to perform a pair-wised alignment. Then the result was processed by “Quickparanoid”

(http://pl.postech.ac.kr/QuickParanoid/) to generate gene ortholog cluster with default parameter. A

Venn diagram was then drawn based on this ortholog cluster. 141 Copy number variation (CNV) analysis

Short sequencing reads from individual isolates were mapped to reference 70-15 genome using

SOAPaligner2 [49] with only 1 mismatch allowed. Then the reads coverage depth of genome was scanned in a 1KB window to search for CNV. We also checked the average coverage depth of individual predicted genes using Cufflinks [50]. Then standard deviation and coefficient of variation were calculated for each gene in order to pick genes showing strong CNV.

SNPs & IDPs analysis

To identify the SNPs, short reads were mapped to reference 70-15 genome by SOAPaligner2 [49].

After sorting alignment file, SOAPsnp was used to identify SNP locations with default parameters.

Then the SNP identification file was converted into VCF format and input into snpEff [51] together with gene prediction to predict the SNP effect. SNP density was calculated along genome in a 10KB window and was visualized to show the existence of SNP hot spots. IDPs were identified in a similar way, but pipeline including SAMtools for alignment and BCFtools for Indel identification were used.

IDP effects were also predicted by snpEff.

142 Phylogenetic tree construction

All the nucleotide bases (about 10KB) in SNP locations were extracted and form an artificial sequence for each of the 7 isolates and 70-15 genome. Then these 10KB sequences which contain all the SNP information were input into MEGA5 [52]to generate a phylogenetic tree using maximum-likelihood algorithm.

Selection force calculation

From the SNP analysis, every SNP in coding region was counted for either synonymous mutation of non-synonymous mutation. And the ratio between number of non-synonymous and synonymous mutation was calculated in every 10KB region window. Then a genome wide Ka/Ks ratio picture was drawn based on this calculation.

143 References

1. Valent B, Chumley FG: Molecular Genetic-Analysis of the Rice Blast Fungus, Magnaporthe-Grisea. Annual review of phytopathology 1991, 29:443-467. 2. Talbot NJ: On the trail of a cereal killer: Exploring the biology of Magnaporthe grisea. Annual Review of Microbiology 2003, 57:177-202. 3. Ebbole DJ: Magnaporthe as a model for understanding host-pathogen interactions. Annual review of phytopathology 2007, 45:437-456. 4. Wilson RA, Talbot NJ: Under pressure: investigating the biology of plant infection by Magnaporthe oryzae. Nat Rev Microbiol 2009, 7(3):185-195. 5. Kankanala P, Czymmek K, Valent B: Roles for rice membrane dynamics and plasmodesmata during biotrophic invasion by the blast fungus. The Plant cell 2007, 19(2):706-724. 6. Heath MC: A Generalized Concept of Host-Parasite Specificity. Phytopathology 1981, 71(11):1121-1123. 7. Leach JE, Cruz CMV, Bai JF, Leung H: Pathogen fitness penalty as a predictor of durability of disease resistance genes. Annual review of phytopathology 2001, 39:187-224. 8. Kiyosawa S: Genetics and Epidemiological Modeling of Breakdown of Plant-Disease Resistance. Annual review of phytopathology 1982, 20:93-117. 9. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H et al: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434(7036):980-986. 10. Silue D, Notteghem JL, Tharreau D: Evidence of a Gene-for-Gene Relationship in the Oryza-Sativa-Magnaporthe-Grisea Pathosystem. Phytopathology 1992, 82(5):577-580. 11. Takabayashi N, Tosa Y, Oh HS, Mayama S: A gene-for-gene relationship underlying the species-specific parasitism of Avena/Triticum isolates of Magnaporthe grisea on wheat cultivars. Phytopathology 2002, 92(11):1182-1188. 12. Tosa Y, Tamba H, Tanaka K, Mayama S: Genetic analysis of host species specificity of Magnaporthe oryzae isolates from rice and wheat. Phytopathology 2006, 96(5):480-484. 13. Valent B, Khang CH: Recent advances in rice blast effector research. Current opinion in plant biology 2010, 13(4):434-441. 14. H.H. F: Current status of the gene-for-gene concept. Annual review of phytopathology 1971, 9:275-296. 15. Kang SC, Sweigard JA, Valent B: The PWL host specificity gene family in the blast fungus Magnaporthe grisea. Mol Plant Microbe In 1995, 8(6):939-948. 16. Sweigard JA, Carroll AM, Kang S, Farrall L, Chumley FG, Valent B: Identification, Cloning, and Characterization of Pwl2, a Gene for Host Species-Specificity in the Rice Blast Fungus. The Plant cell 1995, 7(8):1221-1233. 17. Farman ML, Leong SA: Chromosome walking to the AVR1-CO39 avirulence gene of Magnaporthe grisea: Discrepancy between the physical and genetic maps. Genetics 1998, 150(3):1049-1058.

144 18. Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B: A telomeric avirulence gene determines efficacy for the rice blast resistance gene Pi-ta. The Plant cell 2000, 12(11):2019-2032. 19. Bohnert HU, Fudal I, Dioh W, Tharreau D, Notteghem JL, Lebrun MH: A putative polyketide synthase peptide synthetase from Magnaporthe grisea signals pathogen attack to resistant rice. The Plant cell 2004, 16(9):2499-2513. 20. Li W, Wang BH, Wu J, Lu GD, Hu YJ, Zhang X, Zhang ZG, Zhao Q, Feng QY, Zhang HY et al: The Magnaporthe oryzae Avirulence Gene AvrPiz-t Encodes a Predicted Secreted Protein That Triggers the Immunity in Rice Mediated by the Blast Resistance Gene Piz-t. Mol Plant Microbe In 2009, 22(4):411-420. 21. Miki S, Matsui K, Kito H, Otsuka K, Ashizawa T, Yasuda N, Fukiya S, Sato J, Hirayae K, Fujita Y et al: Molecular cloning and characterization of the AVR-Pia locus from a Japanese field isolate of Magnaporthe oryzae. Molecular plant pathology 2009, 10(3):361-374. 22. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S et al: Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. The Plant cell 2009, 21(5):1573-1591. 23. Chuma I, Isobe C, Hotta Y, Ibaragi K, Futamata N, Kusaba M, Yoshida K, Terauchi R, Fujita Y, Nakayashiki H et al: Multiple Translocation of the AVR-Pita Effector Gene among Chromosomes of the Rice Blast Fungus Magnaporthe oryzae and Related Species. Plos Pathogens 2011, 7(7). 24. Leung H, Borromeo ES, Bernardo MA, Notteghem JL: Genetic-Analysis of Virulence in the Rice Blast Fungus Magnaporthe-Grisea. Phytopathology 1988, 78(9):1227-1233. 25. Chao CCT, Ellingboe AH: Selection for Mating Competence in Magnaporthe-Grisea Pathogenic to Rice. Can J Bot 1991, 69(10):2130-2134. 26. Hu J, Chen C, Peever T, Dang H, Lawrence C, Mitchell T: Genomic characterization of the conditionally dispensable chromosome in Alternaria arborescens provides evidence for horizontal gene transfer. BMC Genomics 2012, 13:171. 27. Ma LJ, van der Does HC, Borkovich KA, Coleman JJ, Daboussi MJ, Di Pietro A, Dufresne M, Freitag M, Grabherr M, Henrissat B et al: Comparative genomics reveals mobile pathogenicity chromosomes in Fusarium. Nature 2010, 464(7287):367-373. 28. Andersen MR, Salazar MP, Schaap PJ, van de Vondervoort PJI, Culley D, Thykaer J, Frisvad JC, Nielsen KF, Albang R, Albermann K et al: Comparative genomics of citric-acid-producing Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome research 2011, 21(6):885-897. 29. Schirawski J, Mannhaupt G, Munch K, Brefort T, Schipper K, Doehlemann G, Di Stasio M, Rossel N, Mendoza-Mendoza A, Pester D et al: Pathogenicity determinants in smut fungi revealed by genome comparison. Science 2010, 330(6010):1546-1548. 30. Xue MF, Yang J, Li ZG, Hu SNA, Yao N, Dean RA, Zhao WS, Shen M, Zhang HW, Li C et al: Comparative Analysis of the Genomes of Two Field Isolates of the Rice Blast Fungus Magnaporthe oryzae. PLoS genetics 2012, 8(8).

145 31. Soanes DM, Chakrabarti A, Paszkiewicz KH, Dawe AL, Talbot NJ: Genome-wide Transcriptional Profiling of Appressorium Development by the Rice Blast Fungus Magnaporthe oryzae. Plos Pathogens 2012, 8(2). 32. Kim S, Hu J, Oh Y, Park J, Choi J, Lee YH, Dean RA, Mitchell TK: Combining ChIP-chip and expression profiling to model the MoCRZ1 mediated circuit for Ca/calcineurin signaling in the rice blast fungus. PLoS Pathog 2010, 6(5):e1000909. 33. Johnson LJ, Johnson RD, Akamatsu H, Salamiah A, Otani H, Kohmoto K, Kodama M: Spontaneous loss of a conditionally dispensable chromosome from the Alternaria alternata apple pathotype leads to loss of toxin production and pathogenicity. Current genetics 2001, 40(1):65-72. 34. Hatta R, Ito K, Hosaki Y, Tanaka T, Tanaka A, Yamamoto M, Akimitsu K, Tsuge T: A conditionally dispensable chromosome controls host-specific pathogenicity in the fungal plant pathogen Alternaria alternata. Genetics 2002, 161(1):59-70. 35. Han Y, Liu X, Benny U, Kistler HC, VanEtten HD: Genes determining pathogenicity to pea are clustered on a supernumerary chromosome in the fungal plant pathogen Nectria haematococca. The Plant journal : for cell and molecular biology 2001, 25(3):305-314. 36. Coleman JJ, Rounsley SD, Rodriguez-Carres M, Kuo A, Wasmann CC, Grimwood J, Schmutz J, Taga M, White GJ, Zhou S et al: The genome of Nectria haematococca: contribution of supernumerary chromosomes to gene expansion. PLoS genetics 2009, 5(8):e1000618. 37. Stukenbrock EH, Jorgensen FG, Zala M, Hansen TT, McDonald BA, Schierup MH: Whole-genome and chromosome evolution associated with host adaptation and speciation of the wheat pathogen Mycosphaerella graminicola. PLoS genetics 2010, 6(12):e1001189. 38. Tzeng TH, Lyngholm LK, Ford CF, Bronson CR: A restriction fragment length polymorphism map and electrophoretic karyotype of the fungal maize pathogen Cochliobolus heterostrophus. Genetics 1992, 130(1):81-96. 39. Leclair S, Ansan-Melayah D, Rouxel T, Balesdent M: Meiotic behaviour of the minichromosome in the phytopathogenic ascomycete Leptosphaeria maculans. Current genetics 1996, 30(6):541-548. 40. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 2008, 18(5):821-829. 41. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome research 2000, 10(4):516-522. 42. Li RQ, Li YR, Fang XD, Yang HM, Wang J, Kristiansen K, Wang J: SNP detection for massively parallel whole-genome resequencing. Genome research 2009, 19(6):1124-1132. 43. Mandal S, Mallick N, Mitra A: Salicylic acid-induced resistance to Fusarium oxysporum f. sp. lycopersici in tomato. Plant Physiol Biochem 2009, 47(7):642-649. 44. Dillon SC, Zhang X, Trievel RC, Cheng X: The SET-domain : protein lysine methyltransferases. Genome biology 2005, 6(8):227. 45. Veerappan CS, Avramova Z, Moriyama EN: Evolution of SET-domain protein families in the

146 unicellular and multicellular Ascomycota fungi. Bmc Evol Biol 2008, 8:190. 46. Stanke M, Morgenstern B: AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic acids research 2005, 33:W465-W467. 47. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome biology 2004, 5(2):R12. 48. Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic acids research 2010, 38(Database issue):D196-203. 49. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(15):1966-1967. 50. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010, 28(5):511-515. 51. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012, 6(2):80-92. 52. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular biology and evolution 2011, 28(10):2731-2739.

147

Genome Contig predicted Gene density Average CDS Isolates N50 (Kb) GC% size (Mb) number ORFs (Kb/per gene) length (Kb)

CHL1 38.3 4227 116.7 51.56 10666 3.59 1524

CHL2 36.8 2990 221.2 51.88 10286 3.58 1543

CHL42 38.0 4282 50.9 51.57 10674 3.56 1531

CHL43 37.2 4254 52.3 51.61 10457 3.56 1513

CHL346 39.3 7523 151.9 51.00 10352 3.80 1523

CHL359 38.5 5870 175.2 51.70 10447 3.69 1536

CHL381 38.4 5619 175.2 51.73 10473 3.67 1536

Table 4.1. General statistic numbers of the genome assembly and gene prediction of sequenced isolates.

148

Standard Coefficient Gene ID CHL1 CHL2 CHL42 CHL43 CHL346 CHL359 CHL381 Deviation of Variation MGG_15335 0.2 0.5 0.6 0.3 3.1 0.3 0.7 1.0 1.3 MGG_11967 7.0 0.6 1.4 0.8 1.2 1.5 0.5 2.3 1.2 MGG_15435 0.1 0.4 0.3 1.1 3.0 0.5 1.4 1.0 1.0 MGG_15793 0.2 1.9 1.6 1.3 4.8 0.2 0.9 1.5 1.0 MGG_16934 4.5 1.6 1.1 1.1 0.4 1.0 0.5 1.4 1.0 MGG_17291 0.6 1.0 1.9 1.8 9.7 3.3 5.3 3.2 1.0 MGG_15398 1.8 3.6 2.2 2.6 1.2 9.6 13.3 4.7 1.0 MGG_16562 0.2 1.3 1.8 1.4 3.7 0.2 0.4 1.2 1.0

Table 4.2. Top eight genes containing the copy number variation. Copy number of genes in each isolates were calculated using 70-15 as reference, standard Deviation (SD) and Coefficient of

Variation (C.V.) were calculated for each genes and sorted by C.V.

149

Region Distribution of SNPs Number of High Impact Alternation Start Stop Strains SNPs Indels 1KB up Interg Altered Frame Exon Intron Codon Codon stream enic Splice Site Shift Changed Changed CHL1 9397 4704 22.9% 5.8% 23.6% 29.7% 88 53 16 116

CHL2 13188 7020 18.4% 5.2% 24.1% 38.4% 83 50 18 130

CHL42 12643 5610 23.3% 5.3% 22.1% 30.1% 105 70 17 122

CHL43 10590 3259 21.4% 5.2% 22.2% 35.0% 98 41 14 97

CHL346 14787 8456 20.0% 5.1% 22.8% 37.0% 131 46 21 163

CHL359 13462 8757 21.3% 5.3% 22.4% 34.8% 120 47 21 142

CHL381 13752 9277 21.0% 5.1% 23.1% 34.4% 127 48 25 140

Table 4.3. Number of SNPs & IDPs identified, their location, and the number of those causing large impact on gene function. (change of start codon, stop codon, splicing place frame shift)

150

Figure 4.1. Dot plots of the alignments among 7 sequenced isolates and reference 70-15 sequence.

Percentage of genome region that could be successfully aligned were noted under each dot plots.

Much more noises dots (non-synteny regions) could be observed in the alignments between 70-15 and field isolates.

151

Figure 4.2. Phylogenetic tree of different isolates constructed using genome-wise SNP data. This tree was calculated and drawn by software “MEGA” with maximum-likelihood algorithm. It should be noted that Reference 70-15 has relatively close relationship with CHL1/CHL2 group, even closer compared to other filed isolates.

152

Figure 4.3. Gene pool analysis among the four isolate(s) groups. Number of gene families was marked in every individual region. It should be noted that majority of gene families were shared by all isolates, and 70-15 has about 3000 genes unique to all the field isolates.

153

Figure 4.4. Copy number variation of the 7 sequenced isolates along genome. The base line represents no CNV in the region, and Peaks show the increased copy number enriched in those regions.

154

Figure 4.5. Number of Genes with higher/lower copy number compared to 70-15. Red bar represents the number of genes showing a 5 folds higher copy number compared to the gene in 70-15, blue bar represents the genes with reduced copy number.

155

Figure 4.6. SNPs & IDPs density (number per 100KB) along genomes of 7 sequenced isolates. Peaks represent enrichments of the SNPs and IDPs in those regions.

156

Figure 4.7. Ka/Ks ratio distribution along genome in the 7 sequenced isolates. Peaks represent regions under strong diversifying selections.

157

Chapter 5: A Distribution Pattern Assisted Method of Transcription Factor Binding Site Discovery

for Both Yeast and Filamentous Fungi

158 Introduction

Transcription factors (TFs) are proteins containing one or more DNA-binding domains, which bind to specific DNA sequences to activate or repress the recruitment of RNA polymerase, thereby up or down regulating transcription of downstream genes [1, 2]. In fungi, TFs bind to either enhancer or promoter regions, usually 1-1500bp upstream to the ORFs they regulate [3-5].

Understanding networks of transcriptional regulation is one of the most challenging yet important tasks in genome analysis. Transcription factors function by binding to recognition sites in gene regulatory regions, which are generally degenerate motifs of 5−15 base pairs [6]. Extensive research has focused on identifying transcription factor binding sites (TFBSs) by biological validation.

Nevertheless, experiments identifying TFBS are usually time-consuming and laborious, leaving the binding sites of most transcription factors unclear [7]. Therefore, prediction of potential TFBSs using bioinformatics approaches has become an essential tool to explore gene structure and regulation.

Different approaches have been tried and applied to discover novel TFBSs over years, which could be put into two categories. Prediction tools in the first category search for the over-represented motifs in a given set of sequences - usually promoter regions of co-regulated genes or ChIP-chip/ChIP-Seq identified binding regions. This strategy has been used widely since it does not require additional information other than the sequences. In the searching process, TFBSs could be treated as either a

Position Weight Matrix as in MEME [8] and GLAM [9], or consensuses as in SMILE [10] or Weeder

[11], when heuristic approach was applied and exhaustive enumeration was avoided. Top candidates

159 after scoring and sorting are then represented in the form of a position frequency matrix (PFM) [12], or a position weight matrix (PWM). However, application of this approach was limited by the computational speed and scale of input data, thus could hardly be applied on the thousands of sequences generated by ChIP-Seq. As a result, new tools such as DREME [13] and WordSeeker [14] were designed specifically for large input dataset, which utilize new mapping algorithms and multi-nodes computation platform to significantly increase speed. Another way to search TFBSs focuses on biological information such as evolutionary conservation. This strategy is based on the assumption that the conserved non-coding regions across related species are likely to be under negative selection force and thus contain functional motifs. Due to the fast increase of available genome sequences, this method has been developed quickly in recent years, and generate lower rate of false-positive compared to other methods. [15, 16]. To date, various software have been developed to analyze possible binding motifs, however, the multifaceted biochemical interactions between proteins and DNA may easily lead to false-positive results and make theoretical predictions of TFBSs error prone [17].

Yeast transcription factors were used in this study as training and testing data sets. Saccharomyces cerevisiae is the most widely used and most well studied yeast species, whose genome serves as one of the most thoroughly analyzed to date [18]. Recently, multiple yeast databases have been built and published, including two transcription factor databases used in this study [19, 20].

To expand this method to filamentous fungi, the MoCRZ1 transcription factor from Magnaporthe

160 oryzae, a fungal pathogen that cause severe rice blast disease worldwide, was included [21].

MoCRZ1 is a C2H2 zinc-finger type transcription factor that is activated by calcineurin dephosphorylation and functions as a mediator of calcineurin signaling [22]. In a project conducted by Dr. Soonok Kim – a senior researcher in Dr. Thomas Mitchell’s lab, binding sequences of MoCRZ1 have been identified by applying both ChIP-chip to identify the binding region of this transcription factor and microarray approaches to identify the regulated genes and their promoters [23] (Figure

5.1), which was then used to test different binding motif prediction approaches.

In this study, we developed a strategy to improve the predictions made by MEME, using the transcription factor binding motif distribution pattern (MDP) information. This MDP approach was then tested with data from both yeast and Magnaporthe transcription factors binding sites and predicted motifs. Although similar spatial distribution of yeast TFBSs has been reported [24], we focused on TFBS prediction improvement by using a novel MDP approach.

Results and Discussions

The general distribution frequency curve

As showed in the analysis pipeline (Figure 5.2), we checked all the 113 TFs with their 301 documented binding motifs from the yeast transcription factor databases “YEASTRACT” [19]. Then in the filtering step 32 TFs which have more than 50 documented regulated genes were chosen. Some

161 TFs had multiple binding motifs but similar to each other, so we referred to the positional-weight matrix provided by the database “JASPAR” [20] and merged those similar motifs to generate a consensus motif sequence (details in methods). In total, 32 TFs with 63 target motifs were included in the analysis dataset (Table 5.1). 25 TFs with 37 validated binding motifs were randomly chosen to build the training dataset (Table 5.2), and the other 7 TFs formed the testing dataset. For each motif, their occurrence locations in the validated regulated gene models were scanned in 1000bp upstream of transcription start site (TSS). Next, a general distribution curve was drawn from the average distribution of all motifs (Figure 5.3). It was observed from the curve that the lowest frequency was

2.3% at a region -50bp to 0bp, while the highest frequency was 9.2% at a region -200bp to -150bp. A peak was observed from the region -275bp to -100bp, with center at about -200bp. This distribution pattern was similar to those from previous studies in yeast [24, 25] and human [26].

Estimating the reliability of motif prediction

It was assumed from the pattern of the distribution curve, that majority of transcription factors worked as short distance cis-elements binding specifically at -300 to -100bp - a region referred to as the “PR” (peak region), but not at the other two regions: -150 to 0bp regarded as the “NBR”

(non-binding region) where most transcription initiation complexes bind; and -1000 to -250bp regarded as the “DR” (distal region). To quantify pattern fitness, a DP (distribution pattern) value was introduced to estimate fitness of any TF binding motif to the general frequency curve. The DP value

162 was generated from shape of the general distribution curve and an assumption: a “true” binding motif should occur more often in the PR, but not NBR or DR, while a random over-represented motif sequences may not have any specific distribution preference in the 1KB upstream region. So if you compare the average occurrence/frequency of PR and that of NBR and DR, the former should be higher in a “true” binding motif and a larger difference represents a more reliable prediction. We thus proposed a formula to calculate the DP value which was expected to be close to zero in a random motif. A motif with DP value zero or negative then has lower possibility to be the biologically

“true” TF binding motif.

The DP value was calculated as the following formula:

DP value = (Average PR Occurrence x 2) – (Average NBR Occurrence) – (Average DR Occurrence)

Testing on yeast

Seven TFs were chosen to be involved to test if the utilization of DP value could assist in improving

TFBS predictions. Four public motif finding tools were included in the performance comparison:

MEME (published in 1994), MDscan (published in 2002), WordSeeker (published in 2010), and

DREME (published in 2011). These four tools all search for statistically over-represented motifs in a given sequences set. MEME uses the expectation maximization to fit a two-component finite mixture model to the input sequences, and multiple motifs are found by probabilistically erasing the

163 occurrences of the top motif and then repeating the process [8]. MDscan combines the advantages of two motif search strategies: position-specific weight matrix updating and word enumeration to enhance the success rate [27]. DREME [13] and WordSeeker [14] were developed in recent years and specifically designed to process large size of ChIP-chip/ChIP-Seq datasets on scalable analysis platforms.

For each TF, the 1000bp upstream sequences of their documented regulated genes were firstly selected as input into MEME for consensus motif search with the target motif length parameter set from 5bp to 9bp. The top ten consensus motifs in the results were then processed to calculate their

OR and DP value.

The OR (over-represent) value, or the observed:expected frequency ratios (O/E) descripted and utilized in a previous study [28] reflects the statistical over-represent of these consensus motifs. OR value was calculated as following formula, where O refers to the overall occurrence of a motif across the 1KB upstream sequences set, ‘ln’ is the natural logarithm, and Eo represents the expected occurrence of that motif:

OR value = O x ln (O / Eo)

Taking transcription factor “Fkh1” for example, the top ten consensus motifs from MEME were originally ranked by P-value which represents the possibility of obtaining this motif solely by chance.

The first two motifs AAA[AG]A[AG]AAA and TT[TC][TC]T[TC]TT[CT], were likely to be simple sequence

164 repeats and thus being removed from the ranked results. The 2nd motif [GT]GTAAACAA and the 3rd motif [TCG]TTGTTTAC were reverse complimentary to each other and matched documented Fkh1 motifs [AG][CT]AAACA[AT][AT] [29] and [AG]TAAA[CT]AA [30]. The remaining 6 consensus motifs were not Fkh1 documented binding motifs (Figure 5.4, Table 5.3).

The MDP approach utilize both the over-representation information – measured by OR value, and the distribution pattern information – measured by DP value. Since both values represent the reliability of the motif prediction – higher value represents higher reliability - a Re-rank value was introduced to combine both values so the “true” motifs with over-represented occurrence and distribution pattern fitting the general curve will obtain higher value and thus be picked up from all the candidates. To even the contribution of the two values, we checked the OR and DP value in all the training dataset, and decide to amplify DP value by 1000 times so the average and standard deviation of both value is close to each other.

The Re-rank value was calculated as following formula:

Re-rank value = OR value + DP value x 1000

After re-ranking based on the Re-rank value, the motif originally in the 1st place ranked by MEME showed a negative DP value and thus dropped to 5th in the MDP rank, since its distribution pattern showed little similarity to the general TF frequency curve, while the two documented Fkh1 target motifs were raised from 2nd/3rd to 1st /2nd. Since these two target motifs are reverse

165 complimentary to each other, we recorded the rank change as 2nd in MEME and 1st in MDP.

Same as descripted in “Fkh1”, the upstream sequences of the 7 TFs were input into MDscan,

WordSeeker and DREME, seeking for motifs with expected length around 8bp and other parameters were left as default. Out of 7 tested TFs, 2 motifs (Xbp1 and Gzf3) failed to find validated target binding motifs in the top 10 consensus motifs predicted by MEME or other predictors. Summary of the remaining 5 TFs were shown in table 5.4. Among the 5 target motifs, MDP predicted 4 motifs as the 1st and one motif as the 2nd in rank. While MEME and DREME only predicted two target motifs as the 1st in rank; MDscan and WordSeeker each failed to predict two target motifs in the top ten results. Some repeat-like sequences were noticed in MDscan and WordSeeker results, indicating their detections were somehow disturbed. Overall, MDP generated a better rank for the target motifs compared to other four tools.

Prediction of Magnaporthe MoCRZ1 binding motifs using ChIP-chip data

MoCRZ1, a transcription factor involved in Ca2+/Calcineurin signaling in Magnaporthe oryzae, was also used to estimate this MDP approach. The strategy for the test was to firstly predict binding motifs from ChIP-chip identified binding regions using traditional methods and then apply our new

MDP approach without ChIP-chip data to see if a similar prediction could be reached. In the other word, we identify the “true” motifs first and then tested if the MDP approach could identify “true”

166 MoCRZ1 binding motifs without the need of ChIP-chip data.

To identify the binding regions and regulated promoters of MoCRZ1, both ChIP-chip and microarray experiments were conducted previously. Then the exact binding sequences of MoCRZ1 (50–1247 bp) revealed from ChIP-chip studies were retrieved from the promoters of genes in common between

ChIP-chip and microarray expression studies and subjected to motif signature analyses (Figure 5.5A).

Initially, 106 sequences from each of the 83 genes differentially regulated in the WT/Dmocrz1 comparison of the microarray data were analyzed with MEME [8] and MDScan [27]. There were more sequences than genes as 21 genes had 2 peaks in their promoter region and 2 genes had 3 peaks. Candidate motifs from both algorithms were manually interrogated and enumerated to identify the top 3 candidates, which were subsequently screened against randomly retrieved 106 intergenic sequences with an average length of 509 bp (Figure 5.5A). The most enriched motif of

CAC[AT]GCC was identified in 33 sequences in front of 24 genes, a 16X enrichment in MoCRZ1 bound sequences. The most common motif of TTGNTTG was found in 68 promoter sequences in front of 42 genes with 4X enrichment. Motif TAC[AC]GTA occurred in 22 promoter sequences of 18 genes with

4X enrichment (Table 5.5). Fifty-six genes had at least one motif, while all three of these motifs occurred in front of 5 genes. These three candidate motifs were also screened against the

Magnaporthe genome to check the occurrence positions. Table 5.6 showed that motif TAC[AC]GTA occurred 1900 times in ORF but 4727 times in promoter region, but CAC[AT]GCC showed a higher occurrence in ORF regions. These motifs were searched against yeast motif database using TOMTOM

[31]. The top match for CAC[AT]GCC was MET28 (p-value = 0.0013), while the second match was

167 CRZ1 with significant p-value (0.0022), showing Crz1p of S. cerevisiae has this motif in its promoters although it was not previously identified as a calcineurin dependent response element (CDRE)

(Figure 5.5B). Pbx1b (P-value = 0.00039) and Zec (P-value = 0.00045) were best two matches for

TTGNTTG, while no significant match was returned for TAC[AC]GTA. Binding of MoCRZ1 to the promoter region was confirmed by Electrophoretic Mobility Shift Assay (EMSA) conducted by Dr.

Soonok Kim. A 209 bp PCR fragment having 1 TTGNTTG and 2 CAC[AT]GCC motifs from the MoCRZ1 promoter region was bound to purified MoCRZ1 protein (Figure 5.5C, left panel). A 325 bp fragment of CBP1 (MGG_03218) having 1 TTGNTTG and 1 CAC[AT]GCC motifs was also shown to bind to purified MoCRZ1 (Figure 5.5C, right panel).

Testing MDP approach to predict MoCRZ1 binding motifs

Then we tested if the MDP approach could reach the same binding motif predictions with only the microarray but not the use of ChIP-chip data. From the microarray data[23], 190 genes were picked as predicted MoCRZ1 regulated genes as they all showed a 2 fold or greater expression change between the control and all three conditions including Ca2+ deficiency, MoCrz1 inhibitor added, and the MoCRZ1 deletion background. Results of the top ten consensus motifs predicted by MEME from

1000bp upstream of MoCRZ1 regulated genes were shown in figure 5.6 and Table 5.7 with their distribution pattern. Three target motifs (TTGNTTG, CAC[AT]GCC, TAC[AC]GTA) were originally ranked as 3rd, 7th, and 8th, with simple repeats at the 1st and 2nd in rank. After re-ranking, these three

168 target motifs went up to 4th, 6th, and 7th. It was observed that another two motifs originally ranked at 5th and 10th went to 3rd and 5th, showing a significant rank improvement. These two motifs were next searched in the TOMTOM [31] database and identified as “Rim101” binding motifs, which was reported as a TF involved in a pathway acting in parallel to Crz1 via calcineurin [32].

Conclusions

In this study, we developed the MDP approach to improve TFBS prediction. Genome-wide TFBS analysis is generally challenging with both experimental validation and computational analysis required to refine TFBS predictions. The use of TFBS distribution profiles improves the accuracy of predictions by estimating both the over-representative level of the candidate motifs and their general distribution pattern as well. The major originality here is that we are more focusing on improving TFBSs prediction by using distribution pattern.

Methods

To select transcription factors, all the 113 TFs from the yeast transcription factor database

“YEASTRACT” were checked and the number of their documented regulated genes was counted.

Those TFs having less than 50 documented regulated genes were filtered out. We noticed that some

TFs recorded in “YEASTRACT” had multiple documented binding motifs, however, regular expression 169 sequences of some binding motifs from the same TF showed high similarity, and thus could be clustered into a single motif. In those cases, the PWM (position weight matrix) was checked from the

“JASPAR” database. If the clustering was supported by the PWM, then the different motifs were merged into a new one. For example, TF “Aft1” had 4 recorded binding motifs in “YESTRACT” database: YRCACCCR, TGCACCC, GGCACCC, and TGCACCCA, while only one matrix is recorded in

“JASPAR”. So in the clustering process, consensus binding motif sequences of Aft1 was generated as

“TRCACCY”. After clustering, each of the target motifs was scanned in the 1kb upstream TSS region of all genes in the yeast genome to check for number of occurrence. Any motif with less than 100 or more than 2000 occurrence were removed. The remaining 32 TFs were randomly divided into two groups: 25 TFs in training group and 7 in testing group.

List of documented regulated genes of the 32 TFs was downloaded from “YEASTRACT” database.

Their 1KB upstream sequences were extracted from the yeast genome sequence and were used to scan for the binding motif sequences. The 1KB upstream region was divided into twenty 50bp windows and motif occurrence in each window was counted. Then the average occurrence of each window was calculated and a general frequency curve was generated.

For each TF in the testing group, the 1KB upstream sequences of their regulated genes were input into MEME running on a local cluster, to search for consensus motifs with their expected length around 8bp, as well as into MDscan, WordSeeker, and DREME. Then the top ten consensus motifs reported from MEME were used as queries to define the MDP.

170 The two commonly used motif discovery programs, MEME and MDScan, were used to identify the

MoCRZ1 binding motif. Input data consisted of the exact binding sequences retrieved from the promoters of 83 genes with differential expression in the WT/Dmocrz1 comparison. Candidate motifs from both algorithms were manually interrogated and enumerated to identify the 3 top candidates, and queried against the yeast motif database using TOMTOM [31]. Enrichment was calculated over the 106 background sequence set which was randomly retrieved from intergenic region of the whole genome. Consensus motif sequences were calculated using WebLogo server at http://weblogo.berkeley.edu.

171 References

1. Latchman DS: Transcription factors: An overview. Int J Biochem Cell B 1997, 29(12):1305-1312. 2. Karin M: Too many transcription factors: positive and negative interactions. New Biol 1990, 2(2):126-131. 3. Roeder RG: The role of general initiation factors in transcription by RNA polymerase II. Trends Biochem Sci 1996, 21(9):327-335. 4. Nikolov DB, Burley SK: RNA polymerase II transcription initiation: a structural view. Proceedings of the National Academy of Sciences of the United States of America 1997, 94(1):15-22. 5. Lee TI, Young RA: Transcription of eukaryotic protein-coding genes. Annu Rev Genet 2000, 34:77-137. 6. Biggin MD: To bind or not to bind. Nat Genet 2001, 28(4):303-304. 7. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005, 21(11):2657-2666. 8. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36. 9. Frith MC, Hansen U, Spouge JL, Weng Z: Finding functional sequence elements by multiple local alignment. Nucleic acids research 2004, 32(1):189-200. 10. Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2000, 7(3-4):345-362. 11. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17 Suppl 1:S207-214. 12. Stormo GD: Consensus patterns in DNA. Methods Enzymol 1990, 183:211-221. 13. Bailey TL: DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 2011, 27(12):1653-1659. 14. Lichtenberg J, Kurz K, Liang X, Al-ouran R, Neiman L, Nau LJ, Welch JD, Jacox E, Bitterman T, Ecker K et al: WordSeeker: concurrent bioinformatics software for discovering genome-wide patterns and word-based genomic signatures. BMC bioinformatics 2010, 11 Suppl 12:S6. 15. Levy S, Hannenhalli S: Identification of transcription factor binding sites in the human genome sequence. Mamm Genome 2002, 13(9):510-514. 16. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 2005, 434(7031):338-345. 17. Bulyk ML: Computational prediction of transcription-factor binding site locations. Genome biology 2003, 5(1):201. 18. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq 172 C, Johnston M et al: Life with 6000 genes. Science 1996, 274(5287):546, 563-547. 19. Abdulrehman D, Monteiro PT, Teixeira MC, Mira NP, Lourenco AB, dos Santos SC, Cabrito TR, Francisco AP, Madeira SC, Aires RS et al: YEASTRACT: providing a programmatic access to curated transcriptional regulatory associations in Saccharomyces cerevisiae through a web services interface. Nucleic acids research 2011, 39(Database issue):D136-140. 20. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):D91-94. 21. Ebbole DJ: Magnaporthe as a model for understanding host-pathogen interactions. Annual review of phytopathology 2007, 45:437-456. 22. Choi J, Kim Y, Kim S, Park J, Lee YH: MoCRZ1, a gene encoding a calcineurin-responsive transcription factor, regulates fungal growth and pathogenicity of Magnaporthe oryzae. Fungal Genet Biol 2009, 46(3):243-254. 23. Kim S, Hu J, Oh Y, Park J, Choi J, Lee YH, Dean RA, Mitchell TK: Combining ChIP-chip and expression profiling to model the MoCRZ1 mediated circuit for Ca/calcineurin signaling in the rice blast fungus. PLoS Pathog 2010, 6(5):e1000909. 24. Lin Z, Wu WS, Liang H, Woo Y, Li WH: The spatial distribution of cis regulatory elements in yeast promoters and its implications for transcriptional regulation. BMC Genomics 2010, 11:581. 25. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J et al: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99-104. 26. Koudritsky M, Domany E: Positional distribution of human transcription factor binding sites. Nucleic acids research 2008, 36(21):6795-6805. 27. Liu XS, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol 2002, 20(8):835-839. 28. Lichtenberg J, Yilmaz A, Welch JD, Kurz K, Liang XY, Drews F, Ecker K, Lee SS, Geisler M, Grotewold E et al: The word landscape of the non-coding segments of the Arabidopsis thaliana genome. BMC Genomics 2009, 10. 29. Zhu G, Spellman PT, Volpe T, Brown PO, Botstein D, Davis TN, Futcher B: Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature 2000, 406(6791):90-94. 30. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297. 31. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS: Quantifying similarity between motifs. Genome biology 2007, 8(2):R24. 32. Kullas AL, Martin SJ, Davis D: Adaptation to environmental pH: integrating the Rim101 and calcineurin signal transduction pathways. Molecular microbiology 2007, 66(4):858-871.

173

Number of Number of TF TF TF Structure Regulated TF Structure Regulated Name Name Genes Genes

Rgt1 Fungal Zn cluster 63 Xbp1 Rel 501 Azf1 Beta-beta-alpha zinc finger 127 Reb1 Myb 512 Rtg1 Helix-loop-helix (bHLH) 129 Pdr8 Fungal Zn cluster 547 Mot3 Beta-beta-alpha zinc finger 135 Hsf1 E2F 571 Gzf3 GATA 149 Tec1 Homeo 571 Rlm1 Beta-beta-alpha zinc finger 205 Swi4 Rel 614 Gis1 Beta-beta-alpha zinc finger 223 Gln3 GATA 668 Fkh1 Forkhead-associated (FHA) 241 Abf1 Helix-loop-helix (bHLH) 669 Fkh2 Forkhead-associated (FHA) 313 Msn Beta-beta-alpha zinc 740 Cbf1 Helix-loop-helix (bHLH) 337 Aft1 No confident structure 1114 Pho4 Helix-loop-helix (bHLH) 379 Msn Beta-beta-alpha zinc 1187 Nrg1 Beta-beta-alpha zinc finger 399 Gcn4 Leucine zipper 1260 Mcm MADS 403 Met4 Leucine zipper 1260 Adr1 Beta-beta-alpha zinc finger 443 Yap1 Leucine zipper 1824 Leu3 Fungal Zn cluster 495 Ste1 Homeo 2142 Mbp Rel 498 Yap3 Leucine zipper 59

Table 5.1. List of yeast TFs involved in training and testing.

174 Number of TF Documented Motif Motif Length Peak Location Binding Sites RTCRYBN{4}ACG 13 212 175 RTCRN{6}ACGNR 15 156 175 Abf1 TNNCGTN{6}TGAT 16 61 175 TCN{7}ACG 12 507 175 Adr1 TTGGRG 6 146 275 Aft1 TRCACCY 7 138 125 Azf1 AAMRGHA 7 221 75 RYAAACAWW 9 85 75 Fkh2 RTAAAYAA 8 118 100 TGASTCAY 9 85 200 Gcn4 RRTGACTC 8 79 175 CACGTG 8 168 275 Gis1 AGGGG 5 120 300 Gln3 GATWDG 6 618 Unidentifiable TTCNNGAA 8 240 200 Hsf1 GAANNTTC 8 188 225 DCCYWWWNNRG 11 201 225 Mcm1 CCYWWWNNRG 10 234 200 Met4 TCACGTG 7 79 375 Mot3 AGGYA 5 203 325 Msn2 CCCCT 5 536 175 Msn4 CCCCT 5 336 275 CCCTC 5 167 325 Nrg1 CCCCT 5 165 325 Pdr8 TCCGHGGA 8 51 325 Pho4 CACGTK 6 182 300 Rgt1 WWNNTCCK 8 105 350 Rlm1 TAWWWWTAGM 10 52 325 Rtg1 GTCAC 5 233 Unidentifiable Ste12 TGAAACA 7 210 225 Swi4 CRCGAAW 7 150 300 RMATTCYY 8 234 275 CATTCTT 7 89 Unidentifiable Tec1 CATTCT 6 217 Unidentifiable CATTCC 6 137 Unidentifiable Yap1 TTACGTAA 8 50 175 Yap3 TGACTCA 7 88 175

Table 5.2. TFs in training group.

175

Rank by OR DP Value Re-rank Rank by Candidate Motif Annotation MEME Value (x1000) Value MDP AAA[AG]A[AG]AAA Simple repeat removed 671.5 47.2 718.7 removed TT[TC][TC]T[TC]TT[CT] Simple repeat removed 255.8 60.3 316.1 removed [TC]TG[TC]TG[TC]TG 1 115 -37.0 78.0 5 [GT]GTAAACAA Fhk1 2 79.1 114.9 194.0 Re-rank 2 [TCG]TTGTTTAC Fhk1 3 71.8 147.6 219.4 Process 1 CAGC[AG]GC 4 54.6 -28.5 26.1 8 CAAGAAA 5 61.1 -26.6 34.5 7 T[AG]TATATAT 6 43.2 56.5 99.7 4 GAAAAAG 7 53.8 11.2 65.0 6 T[TG][TGC]CC[CT]TTT 8 72.6 98.4 171.0 3

Table 5.3. Fkh1 consensus motifs testing result. The two motifs in shadow represent two documented Fhk1 binding motifs.

176

TF Documented binding Rank by Rank by Rank by Rank by Rank by Name Motif MDP MEME MDscan WordSeeker DREME Fkh1 [TCG]TTGTTTAC 1 2 N/A 3 2 Cbf1 [TG]CACGTG[AC][TC] 1 1 1 1 1 Leu3 G[CG]C[AG][CAT]GGCC 2 6 5 N/A 8 Mbp1 [AT]GC[TGA]GC[TA]G[CA] 1 3 N/A N/A 8 Reb1 [GA]TTACCCG[GC] 1 1 1 1 1

Table 5.4. Test results of all testing group.

177

Candidate Occurrence in Statistical Enrichment to Occurrence in Enrichment in motifs binding region expectation expectation random region binding region TTGNTTG 68 15 4.5 17 4 CAC[AT]GCC 33 5 6.6 2 16.5 TAC[AC]GTA 22 7 3.1 6 3.7

Table 5.5. Statistical enrichment test for the three candidate motifs. The statistical expectation was calculated based on the length and frequency of each nucleotide of the motif; the Occurrence in random region was calculated based on a set of randomly extracted intergenic region with the same length of the binding region sequences.

178

Occurrence in Candidate Motifs Total occurrence Occurrence in ORFs upstream regions TTGNTTG 38124 12246 14854 CAC[AT]GCC 12895 7275 3253 TAC[AC]GTA 9648 1900 4727

Table 5.6. The occurrence of the three candidate motifs in ORFs and upstream regions.

179

Rank by OR DP Value Re-rank Rank by Candidate Motif Annotation MEME Value (x1000) Value MDP AAAAAAAAA Simple repeat removed 356.2 -1.3 354.9 removed TTTT[TC]TTTT Simple repeat removed 369.5 1.1 370.6 removed TA[GC][GC]TACCT MoCRZ1 1 165.7 40.4 206.2 2 AGGTAGGTA 2 84.8 0 84.8 Re-rank 8 [GT]CTTGGC Rim101 3 136.3 88.5 224.8 Process 1 CTAG[AT]CTAG 4 68.1 48.6 116.8 6 CACAGCC MoCRZ1 5 85.5 46.6 132.2 5 T[GT]GTT[TG]T[GT]G MoCRZ1 6 77.2 56.6 133.9 4 TTT[GT][GCT]TTGC 7 101.3 0.7 102.1 7 TGCCAAG Rim101 8 65.7 78.7 144.4 3

Table 5.7. MoCrz1 consensus motifs testing result. The five motifs in shadow are three MoCRZ1 binding motifs and two Rim101 binding motifs.

180

Figure 5.1. The MoCRZ1 binding region identification experiments design. (A) ChIP-chip experimental design to identify MoCRZ1 targets activated by calcium treatment. The Ca2+/FK506 treated sample served as the negative control treatment. (B) Expression microarray design. Wild type strain and

MoCRZ1 deletion mutant (Dmocrz1) were treated with Ca2+ and/or FK506 as depicted. After global normalization of signal intensities to the average expression level of all probes among the replicated data sets, pairwise comparison between treatments was conducted. (C) Venn diagram showing number of genes identified from ChIP-chip and up-regulated / down-regulated in transcriptome profiling Number of genes with more than 2 fold differential expression with P-value <0.05 were noted in Ca2+ treated wild type samples in each comparison.

181

Figure 5.2. Analysis Pipeline for generating model of MDP approach and test it using yeast transcription factor data.

182

Figure 5.3. Distribution frequency curve of training group. Each block coded by colors represents the frequency of one training motif in each scanning window. The curve was generated as average of all blocks.

183 Figure 5.4. Frequency curves of Fkh1 candidate motifs.

184

Figure 5.5. Predicted MoCRZ1 binding motifs identified. (A) Analysis pipeline to identify predicted

MoCRZ1 binding motifs. (B) WebLogo of Top 2 motifs and the best hits in the yeast motif database.

Best hit from yeast motif database was displayed for comparison. (C) Electrophoretic mobility shift assay. Probe DNA was amplified by PCR from the promoter regions of MoCRZ1 (left panel, lanes 1 to

3) and CBP1 (right panel, lanes 4 to 6). Lanes 1 and 4, Biotin-labeled probe DNA; lanes 2 and 5, probe

DNA with purified MoCRZ1 protein; lanes 3 and 6, competition reaction with 200 fold molar excess of unlabeded DNA. FP: free probe; SP: shifted probe.

185

Figure 5.6 Frequency curves of MoCrz1 candidate motifs.

186

Chapter 6: Identification of novel Avr genes by Screening of Gain-of-virulence Mutants in rice blast

fungus Magnaporthe oryzae

187 Introduction

Magnaporthe oryzae is a filamentous fungus that causes rice blast disease, which is also an important model organism for investigating fungal-plant interactions. Unlike other fungal pathogens of plants, the genomic sequences of both of the pathogen and the rice host are available [1, 2], providing a unique opportunity to study a host–parasite interaction from both sides using functional genomic approaches. The gene-for-gene model has been widely accepted and utilized in agricultural disease control since it was proposed decades ago (Flor 1971). To control rice blast, introducing a variety of confirmed resistance genes (R genes) into elite rice cultivars is considered as an effective strategy. For this purpose, it becomes essential to identify a large number of R genes in rice and corresponding Avr genes in the fungus, as well as understand their interactions. While more than 73 rice blast R genes to date have been identified and physically mapped [3], only a limited number of

Avr genes have been identified [4-11]. This is partially because there are no obvious sequence patterns that can be easily recognized.

This study focuses on the corresponding Avr genes of the two R genes Pi-2 and Pi-9. The Pi-2 gene was introgressed from a highly resistant indica cultivar 5173 into the susceptible cultivar CO39 and named C101A51 [12]. Through wild hybridization and repeated backcrossing, the resistance gene

Pi-9 was transferred from O. minuta into the elite breeding line IR31917 [13]. To examine the resistance spectra of Pi-2 and Pi-9 to isolates from other rice-growing areas, Dr. Guo-liang Wang’s group in the Ohio State University has tested them on 43 M. oryzae isolates collected from 13

188 different countries. The inoculation experiments confirmed that these two genes conferred high resistance to many blast isolates. Using the identified RAPD markers and BAC ends, a high-resolution map of the Pi-9 locus was constructed [14]. Interestingly, Pi-2 was found tightly linked to all of the

Pi-9 markers, suggesting that Pi-2 and Pi-9 are co-localized on rice chromosome 6, or may actually be allelic [15].

To generate a library of Magnaporthe random mutants efficiently, Agrobacterium tumefaciens-mediated transformation (ATMT) was used in this study. ATMT is used to transfer genes to a wide variety of plants [16] as well as filamentous fungi [17]. Recently, several fungi were transformed using an ATMT approach to generate banks of random mutants [18, 19]. For insertion mutagenesis, this technique offers a huge potential [20, 21]. One of the principal advantages of

ATMT over conventional transformation techniques is the versatility in choosing transformation system. A. tumefaciens is able to transform using protoplasts, hyphae, spores, and blocks of mushroom mycelia tissue [17, 22, 23]. Furthermore, ATMT generates a high percentage of transformants with only one single insert of T-DNA in the fungal genome, which will facilitate the subsequent isolation of tagged genes [24].

In this study, we generated random insertion transformants of M. oryzae through ATMT approach.

By conducting pathogenicity assay on rice cultivars containing Pi-2 and Pi-9 R genes, we re-isolated 4 mutants that were able to break Pi-2 or Pi-9 resistance (Pipeline in Figure 6.1). Then we used a

TAIL-PCR (Thermal Asymmertric Interlaced PCR) approach to amplify the insertion flanking

189 sequences to clone the affected locus. Several candidate genes were identified. However, the complementary experiments to confirm their avirulence gene functions remain.

Results

ATMT transformants generation and pathogenicity screening

A pipeline was built to produce ATMT mutants with the speed of 200 mutants a week. In total, roughly 5,000 ATMT mutants were generated during six months of continual transformations. The transformation cassette used to insert into M. oryzae isolate KJ201 genome contains a hygromycin resistant gene and a GFP expression gene with proper promoters. Each transformant was labeled and stored, before being put through pathogenicity tests on susceptible Nipponbare (NPB), and resistant Pi-2 and Pi-9 rice plants.

During pathogenicity screens, each possible lesion was transferred onto selection media to increase sensitivity. Among the 5,000 tested mutants, only 5 mutated strains were successfully re-isolated from lesions (Figure 6.2). They were then re-inoculated onto rice cultivars again, where 4 of them were re-isolated from lesions and thus became candidate mutants. They were renamed as Iso_1,

Iso_2, Iso_3, and Iso_4 (Figure 6.3).

All the 4 candidate gain-of-virulence mutants were confirmed to have GFP signal gained from the

ATMT inserted vectors under fluorescence microscope, with strong signal observed in conidia (Figure 190 6.4).

Pathogenicity tests of four candidate mutants

The 4 candidate mutants were inoculated to rice cultivars to test their pathogenicity. The pathogenicity test has been performed both in Dr. Thomas Mitchell’s lab at The Ohio State University

(Table 6.1, Figure 6.3) as well in Dr. Qinghua Pan’s lab in South Agricultural University in China to independently verifing results (Table 6.2). The pathogenicity test performed at OSU included M. oryzae isolate KJ201 as the negative control and NPB as the susceptible rice cultivar control, as well as rice lines containing the Pi-2 and Pi-9 resistant genes respectively. Lesions from each pots were counted. Positive controls for infection for all assays NPB showed average lesion counts more than

46. Negative control with wild KJ201 failed to make a single lesion on Pi-2 and Pi-9 containing plants, with an average lesion count less than one. Iso_1 had 7 average lesions on Pi-2, while this number for Iso_2 and Iso_4 were 15 and 5 respectively, while Iso_3 only generated tiny black spots and no typical lesions (Table 6.1, Figure 6.3). This pathogenicity test supported Iso_1, Iso_2 and Iso_4 strong

Avr gene candidate mutants. For pathogenicity tests conducted in Dr. Qinghua-Pan’s lab, all controls performed as expected with Iso_4 showing virulence on Pi-2, Iso_2 showing virulence on Pi-9, and

Iso_3 showing modest virulence on Pi-2. To summarize pathogenicity test from both labs, Iso_2 and

Iso_4 were supported to gain the virulence to Pi-2 or Pi-9, and were advanced for futher analysis and identification of the mutated gene.

191 TAIL-PCR to identify insertion regions

Direct TAIL-PCR of Iso_2 and Iso_4 showed multiple bands and Sanger sequencing results showed mixed nucleotide calls, so a cloning procedure was conducted prior to TAIL-PCR. After cloning and isolating, TAIL-PCR and sequencing was performed successfully in Dr. Qinghua Pan’s lab (Table 6.3,

Figure 6.5).

Characterization of candidate genes

Sequences of TAIL-PCR products were aligned to the M. oryzae 70-15 reference genome using The

Broad Institute’s web service. As a result, 6 candidate genes were identified. Features of these 6 candidate genes were characterized as showed in table 6.4. The six candidate genes were located at chromosomes I, III, and IV. Although none of them were predicted to contain a signal peptide, 4 of them were predicted to localize extracellular. The insertion position of T-DNA varies, as 2 were inserted at 5’ upstream regions, 3 were inserted at 3’ downstream regions, and 1 was inserted in the coding region. Although inserting in upstream or coding region may have larger effect on gene function, the 3 downstream insertions may also change gene expression. We cannot confirm or eliminate any of these candidate genes based on the characterization information so far.

192 Future work

This is an ongoing project and the there’s still work to do before the identification of Avr-Pi2 and

Avr-Pi9 genes. Future direction includes:

1. Functional annotation of all the candidate genes, estimate locus structure in all sequenced M.

oryzae genomes, and predict the candidate Avr gene conferring virulence.

2. Transfer the candidate Avr-Pi2 genes into rice cultivar Chinos or other isolates which do not

contain AvrPi2 gene to confirm its function.

Methods and Materials

Strains used

The Fugal strain used to generate mutants is M. oryzae strain KJ201. The agrobacterium tumefaciens strains used in ATMT approach is TMC55, which contained Kanamycin resistance gene as selection marker. The Rice strains used in this project were Nipponbare as a positive control and two strains containing Pi-2 and Pi-9, respectively provided by Dr. Guo-liang Wang’s group.

193 Number of mutants necessary to generate

Given the fact that the average gene length of M. oryzae is 1.7kb and the whole genome size is around 38Mb [1], possibility of transformants covering any one particular gene is calculated in Table

6.5. So if we want to reach an acceptable possibility to cover both Avr-Pi2 and Avr-Pi9, 30,000 transformants is the number that we consider. There’s limited information from previous study focusing on this screen of gain-of-virulence, which made it difficult to expect how many among

30,000 transformants will make visible changes in pathogenicity. However, in one reported study,

639 transformants of M. oryzae were obtained by REMI and 4 of them showed gain of ability to infect a rice strain which was resistant to wild-type [25]. If this proportion works in our case, then 31 transformants containing virulence ability should be expected among 5,000 transformants.

Agrobacterium tumefaciens-mediated (ATMT) transformation protocol

1. Grow Agrobacterium strain containing an appropriate binary vector in 5 ml Minimal medium (MM) with an appropriate antibiotics (Kanamycin for our binary vectors) for 1-2 days at 28 °C.

2. Inoculate the culture to Induction Medium (IM), containing Kan (75 ppm) and acetosyringone (AS), make O.D.600=0.15.

3. Culture the Agrobacterium cells for 6 hours at 28 °C by shaking at 200 rpm.

194 (Prepare Co-cultivation media with membrane on it, we can make 3 square plates with 100ml, 5 membranes per plate; Prepare TB3 agar with 200ppm hygromycin in square plates)

6 4. Prepare the M. oryzae spore suspension with dH2O at concentration of 10 spores/ml.

5. Mix 100 µl of the spore suspension with 100 µl of Agrobacterium culture.

6. Spread the 200 µl mixture onto the 3 nitrocellulose membrane (Whatman Cat. #7141

104; 47mm in diameter; 0.45 um in pore size) placed on the co-cultivation medium.

7. Incubate the plates for 26 hrs (24 to 48 hrs) under room temperature.

8. Transfer the membrane to the selection medium for 4 days.

Pathogenicity assay

Pathogenicity of the mutants generated was tested on pot-grown rice plants at the 6-8 leaf stage. In order to scan large number of mutants efficiently, 50 ATMT mutants were transformed to one V8 medium for growth and sporulation. Conidial suspensions (5 × 104 conidia/ml) from 1 week cultures on V8 agar plates were collected and spray-inoculated onto rice plants grown in commercial potting soil in vinyl pots (25cm2). Each mixture of conidial suspension were inoculated on a group of rice including 2 pots of Pi-2, 2 pots of Pi-9 and 1 pots of Nipponbare. Inoculated plants were kept in a dew chamber for 24 hrs and allowed to grow in a growth chamber at the Ohio Department of 195 Agriculture. Disease symptoms were examined 1 week after inoculation [8].

Lesions on each plant were counted and scanned for documentation. Then the edge of lesion was cut into 0.5cm x 0.5cm pieces to process through a surface sterilization and then plated on a water agar plate. Mycelia growing out from the lesions were then transferred to selection medium containing hygromycin. Mycelia growing on the selection medium were then transferred to a V8 medium agar for sporulation and then inoculated again to Pi-2, Pi-9 and Nipponbare plants to confirm its gain-of-virulence.

Recovery of flanking sequences

The insertion sequences and their associated flanking sequences were recovered from ATMT lines using TAIL-PCR (Thermal Asymmetric Interlaced PCR)following the protocol described [26]. Genomic

DNA of the confirmed mutants were extracted and amplified by TAIL-PCR using primers in Table 6.6

[27]. The PCR products were then purified and sequenced using traditional Sanger sequencing

(Figure 6.6). Sequencing results were then mapped to M. oryzae reference 70-15 genome sequences downloaded from the Broad Institute using their online blast service to localize and identify the insertion regions [28].

196 References

1. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR, Pan H et al: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434(7036):980-986. 2. Matsumoto T, Wu JZ, Kanamori H, Katayose Y, Fujisawa M, Namiki N, Mizuno H, Yamamoto K, Antonio BA, Baba T et al: The map-based sequence of the rice genome. Nature 2005, 436(7052):793-800. 3. Ballini E, Morel JB, Droc G, Price A, Courtois B, Notteghem JL, Tharreau D: A genome-wide meta-analysis of rice blast resistance genes and quantitative trait loci provides new insights into partial and complete resistance. Mol Plant Microbe In 2008, 21(7):859-868. 4. Kang SC, Sweigard JA, Valent B: The PWL host specificity gene family in the blast fungus Magnaporthe grisea. Mol Plant Microbe In 1995, 8(6):939-948. 5. Sweigard JA, Carroll AM, Kang S, Farrall L, Chumley FG, Valent B: Identification, Cloning, and Characterization of Pwl2, a Gene for Host Species-Specificity in the Rice Blast Fungus. The Plant cell 1995, 7(8):1221-1233. 6. Farman ML, Eto Y, Nakao T, Tosa Y, Nakayashiki H, Mayama S, Leong SA: Analysis of the structure of the AVR1-CO39 avirulence locus in virulent rice-infecting isolates of Magnaporthe grisea. Mol Plant Microbe In 2002, 15(1):6-16. 7. Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B: A telomeric avirulence gene determines efficacy for the rice blast resistance gene Pi-ta. The Plant cell 2000, 12(11):2019-2032. 8. Bohnert HU, Fudal I, Dioh W, Tharreau D, Notteghem JL, Lebrun MH: A putative polyketide synthase peptide synthetase from Magnaporthe grisea signals pathogen attack to resistant rice. The Plant cell 2004, 16(9):2499-2513. 9. Li W, Wang BH, Wu J, Lu GD, Hu YJ, Zhang X, Zhang ZG, Zhao Q, Feng QY, Zhang HY et al: The Magnaporthe oryzae Avirulence Gene AvrPiz-t Encodes a Predicted Secreted Protein That Triggers the Immunity in Rice Mediated by the Blast Resistance Gene Piz-t. Mol Plant Microbe In 2009, 22(4):411-420. 10. Miki S, Matsui K, Kito H, Otsuka K, Ashizawa T, Yasuda N, Fukiya S, Sato J, Hirayae K, Fujita Y et al: Molecular cloning and characterization of the AVR-Pia locus from a Japanese field isolate of Magnaporthe oryzae. Molecular plant pathology 2009, 10(3):361-374. 11. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S et al: Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. The Plant cell 2009, 21(5):1573-1591. 12. LInukai T MD, Bonman JM, Sarkarung S, Zeigler R, Nelson R, Takamure I, Kinoshita T: Blast resistance gene Pi2(t) and Pi-Z may be allelic. Rice Genet Newsl 1992, 9:90-92. 13. Amantebordeos A, Sitch LA, Nelson R, Dalmacio RD, Oliva NP, Aswidinnoor H, Leung H: Transfer of Bacterial-Blight and Blast Resistance from the Tetraploid Wild-Rice Oryza-Minuta to Cultivated Rice, Oryza-Sativa. Theoretical and Applied Genetics 1992, 84(3-4):345-354.

197 14. Qu SH, Liu GF, Zhou B, Bellizzi M, Zeng LR, Dai LY, Han B, Wang GL: The broad-spectrum blast resistance gene Pi9 encodes a nucleotide-binding site-leucine-rich repeat protein and is a member of a multigene family in rice. Genetics 2006, 172(3):1901-1914. 15. Liu G, Lu G, Zeng L, Wang GL: Two broad-spectrum blast resistance genes, Pi9(t) and Pi2(t), are physically linked on rice chromosome 6. Mol Genet Genomics 2002, 267(4):472-480. 16. Valvekens D, Vanmontagu M, Vanlijsebettens M: Agrobacterium-Tumefaciens-Mediated Transformation of Arabidopsis-Thaliana Root Explants by Using Kanamycin Selection. Proceedings of the National Academy of Sciences of the United States of America 1988, 85(15):5536-5540. 17. de Groot MJA: Agrobacterium tumefaciens-mediated transformation of filamentous fungi (vol 16, pg 839, 1998). Nature Biotechnology 1998, 16(11):1074-1074. 18. Dobinson KF, Grant SJ, Kang S: Cloning and targeted disruption, via Agrobacterium tumefaciens-mediated transformation, of a trypsin protease gene from the vascular wilt fungus Verticillium dahliae. Current genetics 2004, 45(2):104-110. 19. Bundock P, Hooykaas PJJ: Integration of Agrobacterium tumefaciens T-DNA in the Saccharomyces cerevisiae genome by illegitimate recombination. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(26):15272-15275. 20. Bundock P, van Attikum H, den Dulk-Ras A, Hooykaas PJJ: Insertional mutagenesis in yeasts using T-DNA from Agrobacterium tumefaciens. Yeast 2002, 19(6):529-536. 21. Blaise F, Remy E, Meyer M, Zhou LG, Narcy JP, Roux J, Balesdent MH, Rouxel T: A critical assessment of Agrobacterium tumefaciens-mediated transformation as a tool for pathogenicity gene discovery in the phytopathogenic fungus Leptosphaeria maculans. Fungal Genetics and Biology 2007, 44(2):123-138. 22. Michielse CB, Hooykaas PJJ, van den Hondel CAMJJ, Ram AFJ: Agrobacterium-mediated transformation as a tool for functional genomics in fungi. Current genetics 2005, 48(1):1-17. 23. Chen X, Stone M, Schlagnhaufer C, Romaine CP: A fruiting body tissue method for efficient Agrobacterium-mediated transformation of Agaricus bisporus. Applied and environmental microbiology 2000, 66(10):4510-4513. 24. Rho HS, Kang S, Lee YH: Agrobacterium tumefaciens-mediated transformation of the plant pathogenic fungus, Magnaporthe grisea. Mol Cells 2001, 12(3):407-411. 25. Ruyi Xiong JL, Yijun Zhou, Yongjian Fan, Xiaobo Zheng: Screening and identification of mutants of Magnaporthe oryzae by REMI. Front Agric China 2007, 1(2):179-182. 26. Liu YG, Chen Y: High-efficiency thermal asymmetric interlaced PCR for amplification of unknown flanking sequences. Biotechniques 2007, 43(5):649-+. 27. Meng Y, Patel G, Heist M, Betts MF, Tucker SL, Galadima N, Donofrio NM, Brown D, Mitchell TK, Li L et al: A systematic analysis of T-DNA insertion events in Magnaporthe oryzae. Fungal Genetics and Biology 2007, 44(10):1050-1064. 28. Betts MF, Tucker SL, Galadima N, Meng Y, Patel G, Li L, Donofrio N, Floyd A, Nolin S, Brown D et al: Development of a high throughput transformation system for insertional mutagenesis in Magnaporthe oryzae. Fungal Genetics and Biology 2007, 44(10):1035-1049.

198

KJ201 Iso_1 Iso_2 Iso_3 Iso_4 Host Lesion Lesion Lesion Lesion Lesion Ave Ave Ave Ave Ave count count count count count 63 71 101 51 115 NPB 65 52 21 46 43 66 50 51 75 94 28 45 53 51 93 0 0 23 8 2 6 0 0 Pi-2 1 7 15 Black Spots 5 8+spot 0 17 22 s

Table 6.1. Pathogenicity test conducted in Dr. Mitchell’s lab. Result indicates Iso_1, Iso_2, and Iso_4 may gain the virulence to break Pi-2 resistance.

199

Host KJ201 Iso_1 Iso_2 Iso_3 Iso_4 NPB S S S S S Pi-2 ND R R ND S Pi-9 R R ND R R Piz-t R R R R R Q1063 S S S S S S: Susceptible; R: Resistant; ND: Not determined;

Table 6.2. Pathogenicity test conducted in Dr. Pan’s lab. Result indicates Iso_2, Iso_3, and Iso_4 may gain the virulence to break Pi-2 or Pi-9 resistance

200

Iso_2 T-DNA Clone Flanking sequences T-DNA sequences LB Mu1 Supercontig 12:390652-391443 207-bp Mu2 No specificity No Mu3 Supercontig 12:390652-391443 207-bp RB Mu13 Surpercontig 22: 490226-491286 211-bp Mu18 Surpercontig 21: 1734002-1734834 No Iso_4 LB Mu9 Supercontig 27:1359766-1360697 160-bp Mu10 Supercontig 27:1360719-1361435 208-bp RB Mu29 Surpercontig 22: 490349-491286 213-bp Mu30 Surpercontig 16: 245651-246865 No

Table 6.3. Summary of TAIL-PCR results of different clones generated from Iso_2 and Iso_4.

201

Target gene AvrPi9 AvrPi9 AvrPi2 AvrPi2 AvrPi2/9 AvrPi2/9 MGG_042 MGG_121 MGG_088 Unnamed MGG_04843 Locus MGG_04842.6 11.6 75.6 31.6 locus .6 C6 zinc Tyrocidine finger Gene Predicted hypothetic Homoaconitas synthetase Unkown domain-cont product protein al e 1 aining protein Chromosom IV IV I I III III e 12: 12: 27: 27: 22: Location 22: 388793-38 392138-40 1358931-1 1360826-1 492076-493 Supercontig 486294-48874 9263 7245 360184 360885 950 Length 109 4837 269 19 776 599 Amino acid Signal No No No No No No Peptides Extracellular Extracellular Extracellular Mitochondrial Subcellular (4.43) (4.43) Extracellular (4.4) (3.16) location Plasma No Plasma (9.7) Cytoplasmic Cytoplamsic (score) membrane membrane (1.09) (2.86) (2.20) (2.20) transmembr No No Yes No Yes No ane helices GPI No No No No No No T-DNA 5’ 5’ Coding 3’ 3’ 3’ inserted upstream upstream region downstrea downstream downstream position 2183-bp 695-bp 835-bp m 105-bp 2512-bp 790-bp

Table 6.4. Features of 6 candidate Avr genes.

202

Chance to identify Chance to identify both Chance to identify at Number of mutants one gene (Pi2 or Pi9) genes (Pi2 and Pi9) least one gene 5,000 20.0% 4.0% 36.0% 10,000 36.1% 13.0% 59.2% 20,000 59.1% 34.9% 83.3% 30,000 73.9% 54.6% 93.2% 40,000 83.3% 69.4% 97.2% 50,000 89.3% 79.7% 98.9%

Table 6.5. Calculation of possibility to identify one Avr gene or two genes using random insertion approach when different number of mutants generated.

203

TMP633 LB1_T_DNA GTCCGAGGGCAAAGAAATAGAGTA TMP634 LB2_T_DNA CATGTGTTGAGCATATAAGAAACCCT TMP635 LB3_T_DNA GAATTAATTCGGCGTTAATTCAGT TMP636 RB1_T_DNA TTACAACGTCGTGACTGGGAAAAC TMP637 RB2_T_DNA CTGGCGTAATAGCGAAGAGG TMP638 RB3_T_DNA CCCTTCCCAACAGTTGCGCA TMP639 AD1_T_DNA NGTCGASWGANAWGAA TMP640 AD2_T_DNA TGWGNAGSANCASAGA TMP641 AD3_T_DNA AGWGNAGWANCAWAGG TMP642 AD4_T_DNA WAGTGNAGWANCANGAA TMP643 AD6_T_DNA WGTGNAGWANCANAGA TMP644 AD-1_T_DNA WAGTGNAGWANCANAGA TMP904 LAD1-1 ACGATGGACTCCAGAGCGGCCGCVNVNNNGGAA TMP905 LAD1-3 ACGATGGACTCCAGAGCGGCCGCVVNVNNNCCAA TMP906 AC1 ACGATGGACTCCAGAG

Table 6.6. Primers used in TAIL-PCR.

204

Figure 6.1. Screening Pipeline. ATMT mutants were generated and pooled together, then the mixtures were inoculated onto rice leaves. Candidate mutants were reisolated from lesions and selected by hygromycin selection medium.

205

Figure 6.2. Lesions in the first re-inoculation of the 5 isolates from pooled inoculation. Note Iso_5 showed only minimum lesion, similar to the wild type KJ201, so Iso_5 were eliminated as candidates from this step.

206

Figure 6.3. Lesions from pathogenicity test conducted in Dr. Mitchell’s lab. Both positive and negative control works and Iso_1, Iso_2, and Iso_4 all made lesions on Pi-2 containing rice strain.

207

Figure 6.4. All four candidate mutants showed strong GFP signal. (A) Iso_1; (B) Iso_2; (C) Iso_3; (D)

Iso_4;

208

Figure 6.5. Position of the T-DNA insertion in Iso_2 and Iso_4 mutants. The nucleotide in the brackets was deleted by T-DNA insergration. The arrows (▼) indicate the T-DNA (2.2 kb) insertion positions in Iso_2 and Iso_4, respectively. The thick arrows represent the coding regions and the thin line joining these coding regions indicates the position of the introns, or the intergenic regions.

209

Figure 6.6. Structure of the T-DNA inserted. In TAIL-PCR, multiple primers from LB or RB were matched with Arbitrary primers to capture the outer sequences of T-DNA.

210

Conclusions

This dissertation focuses on the application of next generation sequencing (NGS) approaches on the studies of fungal phytopathogens. The first chapter summarizes the development of NGS techniques and their successful application to studies to understand the molecular underpinning of fungal virulence to plants. NGS plays an important role in investigating biological questions at the genome scale, such as comparisons of between genomes, identification of differentially expressed genes, and characterization of binding regions of transcription.

Chapter 2 and Chapter 4 described two genome sequence comparison projects. In the first project, we identified Alternaria arborescens conditionally dispensable chromosomes (CDC) sequences through whole genome sequencing and a de novo assembly process. By comparing nucleotide variations between CDC and essential chromosome (EC) contigs, we found evidence supporting horizontal gene transfer of the CDC to A. arborescens. We also identified predicted CDC genes under positive selection that may serve as virulence factors. In Chapter 4 the M. oryzae comparative study, genomes of 7 M. oryzae field isolates were sequenced and a reference assisted de novo assembly was performed. We found the general genome content of the sequenced field isolates to be similar to the reference 70-15, but with fewer gene models predicted. One interesting finding was the close

211 relationship between 70-15 and two barley pathogens. Variations at the gene level in the form of

SNPs, Indels, copy number variations were identified, and it was interesting to find that most of these polymorphisms were enriched in local genome neighborhoods, which we call “hot spots”. The selection ratios (Ka/Ks) for individual genes were also calculated. Thus, those genes unique to 70-15 located in the “hot spots” and containing Ka/Ks ratio >1 we predict to be candidates for novel virulence related and avirulence genes.

A RNA-Seq project is described in Chapter 3, where two NGS platforms were used to generate transcriptome sequencing data: Illumina’s sequencing-by-synthesis (SBS) and Roche’s

454-pyrosequencing. The 454 reads were used for the de novo assembly of S. homoeocarpa and creeping bentgrass transcriptome libraries, while SBS reads were mapped to the 454 assemblies to calculate transcript levels in order to identify differentially expressed genes associated with the infection. Fungal genes associated with infections include glycosyl hydrolases, proteases, and ABC transporters. Particularly, a large number of glycosyl hydrolase transcripts that target a wide range of plant cell wall compounds were found. Transcripts in creeping bentgrass associated with infections include germins, ubiquitin transcripts involved in proteasome degradation, and cinnamoyl reductase, which is involved in lignin production.

In Chapter 5, we proposed a novel motif distribution pattern (MDP) approach to improve transcription factor binding site (TFBS) predictions using data generated from ChIP-Seq techniques.

In the MDP approach, the distance to the transcription start site of every documented binding motif

212 was collected and utilized to build a spatial model. The MDP approach improves the accuracy of predictions by estimating both the over-representative level of the candidate motifs and their general distribution pattern. The improved performance of MDP was observed in both 5 transcription factors in yeast and MoCRZ1 in M. oryzae.

All these high through-put projects using NGS techniques showed enormous variations in genome content between and within species and the dramatic expression changes in the transcriptome expression profiles during infection related development. For example, the number of SNP and Indel locations among different M. oryzae field isolates can reach as high as over ten thousand, and over

20% of genes in the RNA-Seq projects showed alternate splicing that generate multiple isoforms with different expression profiles. Studying systems with this extreme complexity can only be approached through NGS techniques to provide data at the whole genome scale. Concurrently, advanced bioinformatics tools are needed to reduce the complexity so that questions can be asked of the data.

When the proper analysis pipeline is built, another issue researchers need to cope with is uncertainty, which is represented by the extensive number of parameters and cut-off options. By setting these values, researchers unavoidably introduce a bias, which is necessary but can be subjective and possibly misleading. To arrive at a robust conclusion, biological and technical replications as well as statistical support are required but rarely performed. Another way to validate the in silico analysis is to carry out experimentally validation with smaller data sets, such as Southern hybridizations for confirming the CDC gene identification in Chapter 2, RT-PCR for confirming the

RPKM values as in Chapter 3, and gel shift assays for confirming the binding motif predictions as in

213 Chapter 5. Overall, projects in this dissertation demonstrated the practicability of applying NGS techniques to the analysis of fungal genomes and transcriptomes to explore the underlying mechanisms of virulence.

214

References

1. Abdulrehman D, Monteiro PT, Teixeira MC, Mira NP, Lourenco AB, dos Santos SC, Cabrito TR,

Francisco AP, Madeira SC, Aires RS et al: YEASTRACT: providing a programmatic access to

curated transcriptional regulatory associations in Saccharomyces cerevisiae through a web

services interface. Nucleic acids research 2011, 39(Database issue):D136-140.

2. Akagi Y, Akamatsu H, Otani H, Kodama M: Horizontal Chromosome Transfer, a Mechanism for

the Evolution and Differentiation of a Plant-Pathogenic Fungus. Eukaryot Cell 2009,

8(11):1732-1738.

3. Akamatsu H, H. Otani, and M. Kodama: Characterization of a gene cluster for host-specific

AAL-toxin biosynthesis in the tomato pathotype of Alternaria alternata. Fungal genet Newsl

2003, 50:355.

4. Akamatsu H, Taga M, Kodama M, Johnson R, Otani H, Kohmoto K: Molecular karyotypes for

Alternaria plant pathogens known to produce host-specific toxins. Current genetics 1999,

35(6):647-656.

5. Akamatsu H: Molecular biological studies on the pathogenicity of Alternaria alternata tomato

pathotype. J Gen Plant Pathol 2004(70):389.

6. Al-Samarrai TH, Schmid J: A simple method for extraction of fungal genomic DNA. Lett Appl

Microbiol 2000, 30(1):53-56.

7. Amantebordeos A, Sitch LA, Nelson R, Dalmacio RD, Oliva NP, Aswidinnoor H, Leung H:

Transfer of Bacterial-Blight and Blast Resistance from the Tetraploid Wild-Rice Oryza-Minuta

to Cultivated Rice, Oryza-Sativa. Theoretical and Applied Genetics 1992, 84(3-4):345-354.

215 8. Amselem J, Cuomo CA, van Kan JA, Viaud M, Benito EP, Couloux A, Coutinho PM, de Vries RP,

Dyer PS, Fillinger S et al: Genomic analysis of the necrotrophic fungal pathogens Sclerotinia

sclerotiorum and Botrytis cinerea. PLoS genetics 2011, 7(8):e1002230.

9. Anand S, Prasad MV, Yadav G, Kumar N, Shehara J, Ansari MZ, Mohanty D: SBSPKS: structure

based sequence analysis of polyketide synthases. Nucleic acids research 2010, 38(Web Server

issue):W487-496.

10. Andersen MR, Salazar MP, Schaap PJ, van de Vondervoort PJI, Culley D, Thykaer J, Frisvad JC,

Nielsen KF, Albang R, Albermann K et al: Comparative genomics of citric-acid-producing

Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome research 2011,

21(6):885-897.

11. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight

SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology

Consortium. Nat Genet 2000, 25(1):25-29.

12. Baer CF, Miyamoto MM, Denver DR: Mutation rate variation in multicellular eukaryotes:

causes and consequences. Nat Rev Genet 2007, 8(8):619-631.

13. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in

biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36.

14. Bailey TL: DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 2011,

27(12):1653-1659.

15. Baldwin TK, Winnenburg R, Urban M, Rawlings C, Koehler J, Hammond-Kosack KE: The

pathogen-host interactions database (PHI-base) provides insights into generic and novel

themes of pathogenicity. Molecular plant-microbe interactions : MPMI 2006,

19(12):1451-1462.

16. Ballini E, Morel JB, Droc G, Price A, Courtois B, Notteghem JL, Tharreau D: A genome-wide

meta-analysis of rice blast resistance genes and quantitative trait loci provides new insights

into partial and complete resistance. Mol Plant Microbe In 2008, 21(7):859-868.

17. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M,

216 Moxon S, Sonnhammer EL et al: The Pfam protein families database. Nucleic acids research

2004, 32(Database issue):D138-141.

18. Baxter L, Tripathy S, Ishaque N, Boot N, Cabral A, Kemen E, Thines M, Ah-Fong A, Anderson R,

Badejoko W et al: Signatures of Adaptation to Obligate Biotrophy in the Hyaloperonospora

arabidopsidis Genome. Science 2010, 330(6010):1549-1551.

19. Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides:

SignalP 3.0. Journal of molecular biology 2004, 340(4):783-795.

20. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of

transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005,

21(11):2657-2666.

21. Bennett ST, Barnes C, Cox A, Davies L, Brown C: Toward the 1,000 dollars human genome.

Pharmacogenomics 2005, 6(4):373-382.

22. Bentley SD, Chater KF, Cerdeno-Tarraga AM, Challis GL, Thomson NR, James KD, Harris DE,

Quail MA, Kieser H, Harper D et al: Complete genome sequence of the model actinomycete

Streptomyces coelicolor A3(2). Nature 2002, 417(6885):141-147.

23. Betts MF, Tucker SL, Galadima N, Meng Y, Patel G, Li L, Donofrio N, Floyd A, Nolin S, Brown D

et al: Development of a high throughput transformation system for insertional mutagenesis in

Magnaporthe oryzae. Fungal Genetics and Biology 2007, 44(10):1035-1049.

24. Biggin MD: To bind or not to bind. Nat Genet 2001, 28(4):303-304.

25. Birch P KS: Studying interaction transcriptomes: coordinated analyses of gene expression

during plant-microorganism interactions. New technologies for life sciences: A Trends Guide

2000, 0:1471-1931.

26. Blaise F, Remy E, Meyer M, Zhou LG, Narcy JP, Roux J, Balesdent MH, Rouxel T: A critical

assessment of Agrobacterium tumefaciens-mediated transformation as a tool for

pathogenicity gene discovery in the phytopathogenic fungus Leptosphaeria maculans. Fungal

Genetics and Biology 2007, 44(2):123-138.

27. Bohnert HU, Fudal I, Dioh W, Tharreau D, Notteghem JL, Lebrun MH: A putative polyketide

217 synthase peptide synthetase from Magnaporthe grisea signals pathogen attack to resistant

rice. The Plant cell 2004, 16(9):2499-2513.

28. Bonos SA BR, Clarke BB (ed.): An integrated approach to dollar spot disease in turfgrasses.

New Brunswick, NJ: Rutgers Cooperative Extension; 2007.

29. Bowler C, Allen AE, Badger JH, Grimwood J, Jabbari K, Kuo A, Maheswari U, Martens C,

Maumus F, Otillar RP et al: The Phaeodactylum genome reveals the evolutionary history of

diatom genomes. Nature 2008, 456(7219):239-244.

30. Bulyk ML: Computational prediction of transcription-factor binding site locations. Genome

biology 2003, 5(1):201.

31. Bundock P, Hooykaas PJJ: Integration of Agrobacterium tumefaciens T-DNA in the

Saccharomyces cerevisiae genome by illegitimate recombination. Proceedings of the National

Academy of Sciences of the United States of America 1996, 93(26):15272-15275.

32. Bundock P, van Attikum H, den Dulk-Ras A, Hooykaas PJJ: Insertional mutagenesis in yeasts

using T-DNA from Agrobacterium tumefaciens. Yeast 2002, 19(6):529-536.

33. Burpee LL CL: Evaluation of two dollarspot forecasting systems for creeping bentgrass.

Canadian Journal of Plant Science 1986, 5:345-351.

34. Chakraborty N, Chang T, Casler MD, Jung G: Response of bentgrass cultivars to Sclerotinia

homoeocarpa isolates representing 10 vegetative compatibility groups. Crop Sci 2006,

46(3):1237-1244.

35. Chao CCT, Ellingboe AH: Selection for Mating Competence in Magnaporthe-Grisea Pathogenic

to Rice. Can J Bot 1991, 69(10):2130-2134.

36. Chen X, Stone M, Schlagnhaufer C, Romaine CP: A fruiting body tissue method for efficient

Agrobacterium-mediated transformation of Agaricus bisporus. Applied and environmental

microbiology 2000, 66(10):4510-4513.

37. Choi J, Kim Y, Kim S, Park J, Lee YH: MoCRZ1, a gene encoding a calcineurin-responsive

transcription factor, regulates fungal growth and pathogenicity of Magnaporthe oryzae.

Fungal Genet Biol 2009, 46(3):243-254.

218 38. Chuma I, Isobe C, Hotta Y, Ibaragi K, Futamata N, Kusaba M, Yoshida K, Terauchi R, Fujita Y,

Nakayashiki H et al: Multiple Translocation of the AVR-Pita Effector Gene among

Chromosomes of the Rice Blast Fungus Magnaporthe oryzae and Related Species. Plos

Pathogens 2011, 7(7).

39. Chuma I, Tosa Y, Taga M, Nakayashiki H, Mayama S: Meiotic behavior of a supernumerary

chromosome in Magnaporthe oryzae. Current genetics 2003, 43(3):191-198.

40. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM: A

program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:

SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012,

6(2):80-92.

41. Clutterbuck AJ: Genomic evidence of repeat-induced point mutation (RIP) in filamentous

ascomycetes. Fungal Genet Biol 2011, 48(3):306-326.

42. Coleman JJ, Rounsley SD, Rodriguez-Carres M, Kuo A, Wasmann CC, Grimwood J, Schmutz J,

Taga M, White GJ, Zhou S et al: The genome of Nectria haematococca: contribution of

supernumerary chromosomes to gene expansion. PLoS genetics 2009, 5(8):e1000618.

43. Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for

annotation, visualization and analysis in functional genomics research. Bioinformatics 2005,

21(18):3674-3676.

44. Couch HB BJ: Influence of environment on disease of turfgrasses. II. Effect of nutrition, pH,

and soil moisture on Sclerotinia Dollar Spot. Phytopathology 1960, 257:761-765.

45. Covert SF: Supernumerary chromosomes in filamentous fungi. Current genetics 1998,

33(5):311-319.

46. Cuomo CA, Guldener U, Xu JR, Trail F, Turgeon BG, Di Pietro A, Walton JD, Ma LJ, Baker SE, Rep

M et al: The Fusarium graminearum genome reveals a link between localized polymorphism

and pathogen specialization. Science 2007, 317(5843):1400-1402.

47. Da Rocha AB, Hammerschmidt R: Increase in oxalate oxidase activity in bentgrass response to

Sclerotinia homoeocarpa and chemical treatment. Phytopathology 2004, 94(6):S157-S157.

219 48. Davidson RM, Reeves PA, Manosalva PM, Leach JE: Germins: A diverse protein family

important for crop improvement. Plant Sci 2009, 177(6):499-510.

49. de Groot MJA: Agrobacterium tumefaciens-mediated transformation of filamentous fungi (vol

16, pg 839, 1998). Nature Biotechnology 1998, 16(11):1074-1074.

50. de Waard MA, Andrade AC, Hayashi K, Schoonbeek HJ, Stergiopoulos I, Zwiers LH: Impact of

fungal drug transporters on fungicide sensitivity, multidrug resistance and virulence. Pest

Manag Sci 2006, 62(3):195-207.

51. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu JR,

Pan H et al: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005,

434(7036):980-986.

52. Del Sorbo G, Schoonbeek H, De Waard MA: Fungal transporters involved in efflux of natural

toxic compounds and fungicides. Fungal Genet Biol 2000, 30(1):1-15.

53. Dertz EA, Xu J, Stintzi A, Raymond KN: Bacillibactin-mediated iron transport in Bacillus subtilis.

Journal of the American Chemical Society 2006, 128(1):22-23.

54. Dillon SC, Zhang X, Trievel RC, Cheng X: The SET-domain protein superfamily: protein lysine

methyltransferases. Genome biology 2005, 6(8):227.

55. Dobinson KF, Grant SJ, Kang S: Cloning and targeted disruption, via Agrobacterium

tumefaciens-mediated transformation, of a trypsin protease gene from the vascular wilt

fungus Verticillium dahliae. Current genetics 2004, 45(2):104-110.

56. Duplessis S, Cuomo CA, Lin YC, Aerts A, Tisserant E, Veneault-Fourrey C, Joly DL, Hacquard S,

Amselem J, Cantarel BL et al: Obligate biotrophy features unraveled by the genomic analysis

of rust fungi. Proceedings of the National Academy of Sciences of the United States of

America 2011, 108(22):9166-9171.

57. E.B. W: Studies on Chromosomes. V. The chromosomes of Metapodius. A contribution to the

hypothesis of the genetic continuity of chromosomes. J Exp Zool 1906(6):147-205.

58. Ebbole DJ: Magnaporthe as a model for understanding host-pathogen interactions. Annual

review of phytopathology 2007, 45:437-456.

220 59. Ellwood SR, Liu Z, Syme RA, Lai Z, Hane JK, Keiper F, Moffat CS, Oliver RP, Friesen TL: A first

genome assembly of the barley fungal pathogen Pyrenophora teres f. teres. Genome biology

2010, 11(11):R109.

60. Endo RM: Control of Dollar Spot of Turfgrass by Nitrogen and Its Probable Bases.

Phytopathology 1966, 56(8):877-&.

61. Farman ML, Eto Y, Nakao T, Tosa Y, Nakayashiki H, Mayama S, Leong SA: Analysis of the

structure of the AVR1-CO39 avirulence locus in virulent rice-infecting isolates of Magnaporthe

grisea. Mol Plant Microbe In 2002, 15(1):6-16.

62. Farman ML, Leong SA: Chromosome walking to the AVR1-CO39 avirulence gene of

Magnaporthe grisea: Discrepancy between the physical and genetic maps. Genetics 1998,

150(3):1049-1058.

63. Feng BN, Nakatsuka, S., Goto, T., Tsuge, T. and Nishimura, S.: Biosynthesis of host-selective

toxins produced by alternaria alternata pathogens, I:

(8r,9s)-9,10-epoxy-8-hydroxy-9-methyl-deca-(2e,4z,6e)-trienoic acid as a biological precursor

of AK-toxins. Agric Biol Chem 1990, 54:845-848.

64. Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching.

Nucleic acids research 2011, 39:W29-W37.

65. Frith MC, Hansen U, Spouge JL, Weng Z: Finding functional sequence elements by multiple

local alignment. Nucleic acids research 2004, 32(1):189-200.

66. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND, Jaffe D, FitzHugh W, Ma LJ, Smirnov S,

Purcell S et al: The genome sequence of the filamentous fungus Neurospora crassa. Nature

2003, 422(6934):859-868.

67. Gilchrist DG, Grogan RG: Production and Nature of a Host-Specific Toxin from

Alternaria-Alternata F-Sp Lycopersici. Phytopathology 1976, 66(2):165-171.

68. Gobler CJ, Berry DL, Dyhrman ST, Wilhelm SW, Salamov A, Lobanov AV, Zhang Y, Collier JL,

Wurch LL, Kustka AB et al: Niche of harmful alga Aureococcus anophagefferens revealed

through ecogenomics. Proceedings of the National Academy of Sciences of the United States

221 of America 2011, 108(11):4352-4357.

69. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible,

reproducible, and transparent computational research in the life sciences. Genome biology

2010, 11(8):R86.

70. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq

C, Johnston M et al: Life with 6000 genes. Science 1996, 274(5287):546, 563-547.

71. Gojobori MIBaT: Significant differences between the G+C content of synonymous codons in

orthologous genes and the genomic G+C content. Gene 1999, 238(1):33-37.

72. Gokhale RSaT, D.: Biochemistry of polyketide synthases. Biotechnology 2001, 10:341-372.

73. Gotz S, Garcia-Gomez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talon M,

Dopazo J, Conesa A: High-throughput functional annotation and data mining with the

Blast2GO suite. Nucleic acids research 2008, 36(10):3420-3435.

74. Grandbastien MA: Activation of plant retrotransposons under stress conditions. Trends Plant

Sci 1998, 3(5):181-187.

75. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS: Quantifying similarity between motifs.

Genome biology 2007, 8(2):R24.

76. H.H. F: Current status of the gene-for-gene concept. Annual review of phytopathology 1971,

9:275-296.

77. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, Cano LM, Grabherr M, Kodira CD,

Raffaele S, Torto-Alalibo T et al: Genome sequence and analysis of the Irish potato famine

pathogen Phytophthora infestans. Nature 2009, 461(7262):393-398.

78. Halbritter F, Vaidya HJ, Tomlinson SR: GeneProf: analysis of high-throughput sequencing

experiments. Nat Methods 2012, 9(1):7-8.

79. Hall R: Relationship between Weather Factors and Dollar Spot of Creeping Bentgrass.

Canadian Journal of Plant Science 1984, 64(1):167-174.

80. Han Y, Liu X, Benny U, Kistler HC, VanEtten HD: Genes determining pathogenicity to pea are

clustered on a supernumerary chromosome in the fungal plant pathogen Nectria

222 haematococca. The Plant journal : for cell and molecular biology 2001, 25(3):305-314.

81. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB,

Reynolds DB, Yoo J et al: Transcriptional regulatory code of a eukaryotic genome. Nature 2004,

431(7004):99-104.

82. Harimoto Y, Tanaka T, Kodama M, Yamamoto M, Otani H, Tsuge T: Multiple copies of AMT2

are prerequisite for the apple pathotype of Alternaria alternata to produce enough AM-toxin

for expressing pathogenicity. J Gen Plant Pathol 2008, 74(3):222-229.

83. Hatta R, Ito K, Hosaki Y, Tanaka T, Tanaka A, Yamamoto M, Akimitsu K, Tsuge T: A conditionally

dispensable chromosome controls host-specific pathogenicity in the fungal plant pathogen

Alternaria alternata. Genetics 2002, 161(1):59-70.

84. Heath MC: A Generalized Concept of Host-Parasite Specificity. Phytopathology 1981,

71(11):1121-1123.

85. Henrissat B, Bairoch A: New Families in the Classification of Glycosyl Hydrolases Based on

Amino-Acid-Sequence Similarities. Biochemical Journal 1993, 293:781-788.

86. Henrissat B: A Classification of Glycosyl Hydrolases Based on Amino-Acid-Sequence

Similarities. Biochemical Journal 1991, 280:309-316.

87. Herath HMTB, Herath WHMW, Carvalho P, Khan SI, Tekwani BL, Duke SO, Tomaso-Peterson M,

Nanayakkara NPD: Biologically Active Tetranorditerpenoids from the Fungus Sclerotinia

homoeocarpa Causal Agent of Dollar Spot in Turfgrass. Journal of natural products 2009,

72(12):2091-2097.

88. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J: De novo bacterial genome

sequencing: millions of very short reads assembled on a desktop computer. Genome research

2008, 18(5):802-809.

89. Horns F, Petit E, Yockteng R, Hood ME: Patterns of repeat-induced point mutation in

transposable elements of basidiomycete fungi. Genome Biol Evol 2012, 4(3):240-247.

90. Hsiang T, Mahuku GS: Genetic variation within and between southern Ontario populations of

Sclerotinia homoeocarpa. Plant Pathol 1999, 48(1):83-94.

223 91. Hu J, Chen C, Peever T, Dang H, Lawrence C, Mitchell T: Genomic characterization of the

conditionally dispensable chromosome in Alternaria arborescens provides evidence for

horizontal gene transfer. BMC Genomics 2012, 13:171.

92. Hu M, Polyak K: Serial analysis of gene expression. Nat Protoc 2006, 1(4):1743-1760.

93. Huse SM, Huber JA, Morrison HG, Sogin ML, Mark Welch D: Accuracy and quality of massively

parallel DNA pyrosequencing. Genome biology 2007, 8(7).

94. Jiang RH, Govers F: Nonneutral GC3 and retroelement codon mimicry in Phytophthora. J Mol

Evol 2006, 63(4):458-472.

95. Johnson BA, Anker H, Meleney FL: Bacitracin: A New Antibiotic Produced by a Member of the

B. Subtilis Group. Science 1945, 102(2650):376-377.

96. Johnson JM, Edwards S, Shoemaker D, Schadt EE: Dark matter in the genome: evidence of

widespread transcription detected by microarray tiling experiments. Trends Genet 2005,

21(2):93-102.

97. Johnson LJ, Johnson RD, Akamatsu H, Salamiah A, Otani H, Kohmoto K, Kodama M:

Spontaneous loss of a conditionally dispensable chromosome from the Alternaria alternata

apple pathotype leads to loss of toxin production and pathogenicity. Current genetics 2001,

40(1):65-72.

98. Kamper J, Kahmann R, Bolker M, Ma LJ, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE,

Muller O et al: Insights from the genome of the biotrophic fungal plant pathogen Ustilago

maydis. Nature 2006, 444(7115):97-101.

99. Kang SC, Sweigard JA, Valent B: The PWL host specificity gene family in the blast fungus

Magnaporthe grisea. Mol Plant Microbe In 1995, 8(6):939-948.

100. Kankanala P, Czymmek K, Valent B: Roles for rice membrane dynamics and plasmodesmata

during biotrophic invasion by the blast fungus. The Plant cell 2007, 19(2):706-724.

101. Karin M: Too many transcription factors: positive and negative interactions. New Biol 1990,

2(2):126-131.

102. Kars I vKJ (ed.): Extracellular enzymes and metabolites involved in pathogenesis of Botrytis.

224 Norwell, MA, USA: Kluwer Academic Publishers; 2007.

103. Kemen E, Gardiner A, Schultz-Larsen T, Kemen AC, Balmuth AL, Robert-Seilaniantz A, Bailey K,

Holub E, Studholme DJ, Maclean D et al: Gene gain and loss during evolution of obligate

parasitism in the white rust pathogen of Arabidopsis thaliana. PLoS Biol 2011, 9(7):e1001094.

104. Kim S, Hu J, Oh Y, Park J, Choi J, Lee YH, Dean RA, Mitchell TK: Combining ChIP-chip and

expression profiling to model the MoCRZ1 mediated circuit for Ca/calcineurin signaling in the

rice blast fungus. PLoS Pathog 2010, 6(5):e1000909.

105. Kiyosawa S: Genetics and Epidemiological Modeling of Breakdown of Plant-Disease Resistance.

Annual review of phytopathology 1982, 20:93-117.

106. Kohmoto K, Itoh, Y., Shimomura, N., Kondoh, Y., Otani, H., Kodama, M., et al.: Isolation and

biological activities of 2 host-specific toxins from the tangerine pathotype of Alternaria

alternata. Phytopathology 1993, 83:495-502.

107. Koudritsky M, Domany E: Positional distribution of human transcription factor binding sites.

Nucleic acids research 2008, 36(21):6795-6805.

108. Kretschmer M, Leroch M, Mosbach A, Walker AS, Fillinger S, Mernke D, Schoonbeek HJ,

Pradier JM, Leroux P, De Waard MA et al: Fungicide-Driven Evolution and Molecular Basis of

Multidrug Resistance in Field Populations of the Grey Mould Fungus Botrytis cinerea. Plos

Pathogens 2009, 5(12).

109. Kroken S, Glass NL, Taylor JW, Yoder OC, Turgeon BG: Phylogenomic analysis of type I

polyketide synthase genes in pathogenic and saprobic ascomycetes. Proceedings of the

National Academy of Sciences of the United States of America 2003, 100(26):15670-15675.

110. Kullas AL, Martin SJ, Davis D: Adaptation to environmental pH: integrating the Rim101 and

calcineurin signal transduction pathways. Molecular microbiology 2007, 66(4):858-871.

111. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile

and open software for comparing large genomes. Genome biology 2004, 5(2):R12.

112. Lai JS, Li RQ, Xu X, Jin WW, Xu ML, Zhao HN, Xiang ZK, Song WB, Ying K, Zhang M et al:

Genome-wide patterns of genetic variation among elite maize inbred lines. Nature Genetics

225 2010, 42(11):1027-U1158.

113. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of

short DNA sequences to the human genome. Genome biology 2009, 10(3):R25.

114. Latchman DS: Transcription factors: An overview. Int J Biochem Cell B 1997,

29(12):1305-1312.

115. Lawrence CB, Mitchell TK, Craven KD, Cho Y, Cramer RA, Kim KH: At death's door: Alternaria

pathogenicity mechanisms. Plant Pathology J 2008, 24(2):101-111.

116. Leach JE, Cruz CMV, Bai JF, Leung H: Pathogen fitness penalty as a predictor of durability of

disease resistance genes. Annual review of phytopathology 2001, 39:187-224.

117. Leclair S, Ansan-Melayah D, Rouxel T, Balesdent M: Meiotic behaviour of the

minichromosome in the phytopathogenic ascomycete Leptosphaeria maculans. Current

genetics 1996, 30(6):541-548.

118. Lee TI, Young RA: Transcription of eukaryotic protein-coding genes. Annu Rev Genet 2000,

34:77-137.

119. Lehtinen U: Plant-Cell Wall Degrading Enzymes of Septoria-Nodorum. Physiol Mol Plant P

1993, 43(2):121-134.

120. Leung H, Borromeo ES, Bernardo MA, Notteghem JL: Genetic-Analysis of Virulence in the Rice

Blast Fungus Magnaporthe-Grisea. Phytopathology 1988, 78(9):1227-1233.

121. Levy S, Hannenhalli S: Identification of transcription factor binding sites in the human genome

sequence. Mamm Genome 2002, 13(9):510-514.

122. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN: RNA-Seq gene expression estimation with

read mapping uncertainty. Bioinformatics 2010, 26(4):493-500.

123. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using

mapping quality scores. Genome research 2008, 18(11):1851-1858.

124. Li J, Yu L, Yang JK, Dong LQ, Tian BY, Yu ZF, Liang LM, Zhang Y, Wang X, Zhang KQ: New insights

into the evolution of subtilisin-like serine protease genes in Pezizomycotina. Bmc Evol Biol

2010, 10.

226 125. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program.

Bioinformatics 2008, 24(5):713-714.

126. Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for

short read alignment. Bioinformatics 2009, 25(15):1966-1967.

127. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo

assembly of human genomes with massively parallel short read sequencing. Genome

research 2010, 20(2):265-272.

128. Li RQ, Li YR, Fang XD, Yang HM, Wang J, Kristiansen K, Wang J: SNP detection for massively

parallel whole-genome resequencing. Genome research 2009, 19(6):1124-1132.

129. Li W, Wang BH, Wu J, Lu GD, Hu YJ, Zhang X, Zhang ZG, Zhao Q, Feng QY, Zhang HY et al: The

Magnaporthe oryzae Avirulence Gene AvrPiz-t Encodes a Predicted Secreted Protein That

Triggers the Immunity in Rice Mediated by the Blast Resistance Gene Piz-t. Mol Plant Microbe

In 2009, 22(4):411-420.

130. Li W, Wang BH, Wu J, Lu GD, Hu YJ, Zhang X, Zhang ZG, Zhao Q, Feng QY, Zhang HY et al: The

Magnaporthe oryzae Avirulence Gene AvrPiz-t Encodes a Predicted Secreted Protein That

Triggers the Immunity in Rice Mediated by the Blast Resistance Gene Piz-t. Mol Plant Microbe

In 2009, 22(4):411-420.

131. Lichtenberg J, Kurz K, Liang X, Al-ouran R, Neiman L, Nau LJ, Welch JD, Jacox E, Bitterman T,

Ecker K et al: WordSeeker: concurrent bioinformatics software for discovering genome-wide

patterns and word-based genomic signatures. BMC bioinformatics 2010, 11 Suppl 12:S6.

132. Lichtenberg J, Yilmaz A, Welch JD, Kurz K, Liang XY, Drews F, Ecker K, Lee SS, Geisler M,

Grotewold E et al: The word landscape of the non-coding segments of the Arabidopsis

thaliana genome. BMC Genomics 2009, 10.

133. Lin Z, Wu WS, Liang H, Woo Y, Li WH: The spatial distribution of cis regulatory elements in

yeast promoters and its implications for transcriptional regulation. BMC Genomics 2010,

11:581.

134. LInukai T MD, Bonman JM, Sarkarung S, Zeigler R, Nelson R, Takamure I, Kinoshita T: Blast

227 resistance gene Pi2(t) and Pi-Z may be allelic. Rice Genet Newsl 1992, 9:90-92.

135. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly

integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 2008,

133(3):523-536.

136. Liu G, Lu G, Zeng L, Wang GL: Two broad-spectrum blast resistance genes, Pi9(t) and Pi2(t),

are physically linked on rice chromosome 6. Mol Genet Genomics 2002, 267(4):472-480.

137. Liu XS, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications

to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol 2002,

20(8):835-839.

138. Liu YG, Chen Y: High-efficiency thermal asymmetric interlaced PCR for amplification of

unknown flanking sequences. Biotechniques 2007, 43(5):649-+.

139. Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with

insertions. Proceedings of the National Academy of Sciences of the United States of America

2005, 102(30):10557-10562.

140. Lushbough CM, Brendel VP: An overview of the BioExtract Server: a distributed, Web-based

system for genomic analysis. Advances in experimental medicine and biology 2010,

680:361-369.

141. Ma LJ, van der Does HC, Borkovich KA, Coleman JJ, Daboussi MJ, Di Pietro A, Dufresne M,

Freitag M, Grabherr M, Henrissat B et al: Comparative genomics reveals mobile pathogenicity

chromosomes in Fusarium. Nature 2010, 464(7287):367-373.

142. Machida M, Asai K, Sano M, Tanaka T, Kumagai T, Terai G, Kusumoto K, Arima T, Akita O,

Kashiwagi Y et al: Genome sequencing and analysis of Aspergillus oryzae. Nature 2005,

438(7071):1157-1161.

143. Mandal S, Mallick N, Mitra A: Salicylic acid-induced resistance to Fusarium oxysporum f. sp.

lycopersici in tomato. Plant Physiol Biochem 2009, 47(7):642-649.

144. Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S,

Hurwitz DI, Jackson JD, Ke Z et al: CDD: a Conserved Domain Database for protein

228 classification. Nucleic acids research 2005, 33(Database issue):D192-196.

145. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical

reproducibility and comparison with gene expression arrays. Genome research 2008,

18(9):1509-1517.

146. Markham JE, Hille J: Host-selective toxins as agents of cell death in plant-fungus interactions.

Molecular plant pathology 2001, 2(4):229-239.

147. Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an

application to promoter and regulatory site consensus identification. J Comput Biol 2000,

7(3-4):345-362.

148. Masunaka A, Ohtani K, Peever TL, Timmer LW, Tsuge T, Yamamoto M, Yamamoto H, Akimitsu

K: An Isolate of Alternaria alternata That Is Pathogenic to Both Tangerines and Rough Lemon

and Produces Two Host-Selective Toxins, ACT- and ACR-Toxins. Phytopathology 2005,

95(3):241-247.

149. Matsumoto T, Wu JZ, Kanamori H, Katayose Y, Fujisawa M, Namiki N, Mizuno H, Yamamoto K,

Antonio BA, Baba T et al: The map-based sequence of the rice genome. Nature 2005,

436(7052):793-800.

150. Mehrabi R, Taga M, Kema GH: Electrophoretic and cytological karyotyping of the foliar wheat

pathogen Mycosphaerella graminicola reveals many chromosomes with a large size range.

Mycologia 2007, 99(6):868-876.

151. Meng Y, Patel G, Heist M, Betts MF, Tucker SL, Galadima N, Donofrio NM, Brown D, Mitchell

TK, Li L et al: A systematic analysis of T-DNA insertion events in Magnaporthe oryzae. Fungal

Genetics and Biology 2007, 44(10):1050-1064.

152. Miao VP, Covert SF, VanEtten HD: A fungal gene for antibiotic resistance on a dispensable ("B")

chromosome. Science 1991, 254(5039):1773-1776.

153. Michielse CB, Hooykaas PJJ, van den Hondel CAMJJ, Ram AFJ: Agrobacterium-mediated

transformation as a tool for functional genomics in fungi. Current genetics 2005, 48(1):1-17.

154. Miki S, Matsui K, Kito H, Otsuka K, Ashizawa T, Yasuda N, Fukiya S, Sato J, Hirayae K, Fujita Y et

229 al: Molecular cloning and characterization of the AVR-Pia locus from a Japanese field isolate

of Magnaporthe oryzae. Molecular plant pathology 2009, 10(3):361-374.

155. Monteith J DA: Turfgrass diseases and their control. Bulletin of the United States Golf

Association 1932, 12(85).

156. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying

mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621-628.

157. N. Saitou MN: The neighbor-joining method: a new method for reconstructing phylogenetic

trees. Mol Biol Evol 1987, 4(4):406-425.

158. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional

landscape of the yeast genome defined by RNA sequencing. Science 2008,

320(5881):1344-1349.

159. Nakashima T, Ueno, T., Fukami, H., Taga, T., Masuda, H., Osaki, K., et al.: Isolation and

stuctures of AK-Toxin I and II, host-specific phytotoxic metabolites produced by Alternaria

alternata Japanese pear pathotype. Agric Biol chem 1985, 49:807-815.

160. Nakatsuka S, Feng BN, Goto T, Tsuge T, Nishimura S: Biosynthesis of Host-Selective Toxins

Produced by Alternaria-Alternata Pathogens .2. Biosynthetic Origin of

(8r,9s)-9,10-Epoxy-8-Hydroxy-9-Methyl-Deca-(2e,4z,6e)-Trienoic Acid, a Precursor of Ak-Toxins

Produced by Alternaria-Alternata. Phytochemistry 1990, 29(5):1529-1531.

161. Nakatsuka S, Ueda, K., Goto, T., Yamamoto, M., Nishimura, S. and Kohmoto, K.: Stucture of

AF-toxin II, one of the host-specific toxins produced by Alternaria alternata strawberry

pathotype. Tetrahedron Lett 1986, 27:2753-2756.

162. Nei M. gT: Simple methods for estimating the numbers of syonymous and nonsynonymous

nucleotide substitutions. Mol Biol Evol 1986, 3:418-426.

163. Nekrutenko A, Taylor J: Next-generation sequencing data interpretation: enhancing

reproducibility and accessibility. Nat Rev Genet 2012, 13(9):667-672.

164. Neron B, Menager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S, Tuffery P, Letondal C:

Mobyle: a new full web bioinformatics framework. Bioinformatics 2009, 25(22):3005-3011.

230 165. Nierman WC, Pain A, Anderson MJ, Wortman JR, Kim HS, Arroyo J, Berriman M, Abe K, Archer

DB, Bermejo C et al: Genomic sequence of the pathogenic and allergenic filamentous fungus

Aspergillus fumigatus. Nature 2005, 438(7071):1151-1156.

166. Nikolov DB, Burley SK: RNA polymerase II transcription initiation: a structural view.

Proceedings of the National Academy of Sciences of the United States of America 1997,

94(1):15-22.

167. Nishimura S, Kohmoto K: Host-Specific Toxins and Chemical Structures from Alternaria Species.

Annu Rev Phytopathol 1983, 21:87-116.

168. Ohm RA, de Jong JF, Lugones LG, Aerts A, Kothe E, Stajich JE, de Vries RP, Record E, Levasseur

A, Baker SE et al: Genome sequence of the model mushroom Schizophyllum commune. Nat

Biotechnol 2010, 28(9):957-963.

169. Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B: A telomeric avirulence gene

determines efficacy for the rice blast resistance gene Pi-ta. The Plant cell 2000,

12(11):2019-2032.

170. Orshinsky AM, Boland GJ: Ophiostoma mitovirus 3a, ascorbic acid, glutathione, and

photoperiod affect the development of stromata and apothecia by Sclerotinia homoeocarpa.

Canadian journal of microbiology 2011, 57(5):398-407.

171. Orshinsky AM: Characterization of the growth and host colonization of virulent, asymptomatic,

and hypovirulent isolates of Sclerotinia homoeocarpa. Guelph: University of Guelph

2010:203.

172. Osbourn A: Secondary metabolic gene clusters: evolutionary toolkits for chemical innovation.

Trends Genet 2010, 26(10):449-457.

173. Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL:

InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic acids

research 2010, 38(Database issue):D196-203.

174. Palmer JM, Keller NP: Secondary metabolism in fungi: does chromosomal location matter?

Curr Opin Microbiol 2010, 13(4):431-436.

231 175. Park J, Jang S, Kim S, Kong S, Choi J, Ahn K, Kim J, Lee S, Park B, Jung K et al: FTFD: an

informatics pipeline supporting phylogenomic analysis of fungal transcription factors.

Bioinformatics 2008, 24(7):1024-1025.

176. Parkinson J, Blaxter M: SimiTri--visualizing similarity relationships for groups of sequences.

Bioinformatics 2003, 19(3):390-395.

177. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA

sequences. Bioinformatics 2001, 17 Suppl 1:S207-214.

178. Peever TL, Su G, Carpenter-Boggs L, Timmer LW: Molecular systematics of citrus-associated

Alternaria species. Mycologia 2004, 96(1):119-134.

179. Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides

from transmembrane regions. Nat Methods 2011, 8(10):785-786.

180. Pfaffl MW, Horgan GW, Dempfle L: Relative expression software tool (REST (c)) for group-wise

comparison and statistical analysis of relative expression results in real-time PCR. Nucleic

acids research 2002, 30(9).

181. Pigati RL, Dernoeden PH, Grybauskas AP, Momen B: Simulated Rainfall and Mowing Impact

Fungicide Performance When Targeting Dollar Spot in Creeping Bentgrass. Plant Dis 2010,

94(5):596-603.

182. Powell AJ, Conant GC, Brown DE, Carbone I, Dean RA: Altered patterns of gene duplication

and differential gene gain and loss in fungal pathogens. BMC Genomics 2008, 9:147.

183. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes.

Bioinformatics 2005, 21 Suppl 1:i351-358.

184. Proctor RH, McCormick SP, Alexander NJ, Desjardins AE: Evidence that a secondary metabolic

biosynthetic gene cluster has grown by gene relocation during evolution of the filamentous

fungus Fusarium. Molecular microbiology 2009, 74(5):1128-1142.

185. Qu SH, Liu GF, Zhou B, Bellizzi M, Zeng LR, Dai LY, Han B, Wang GL: The broad-spectrum blast

resistance gene Pi9 encodes a nucleotide-binding site-leucine-rich repeat protein and is a

member of a multigene family in rice. Genetics 2006, 172(3):1901-1914.

232 186. R L (ed.): Turfgrass disease profiles: Dollar Spot: West Lafayette: Purdue Extension; 2009.

187. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, Thines M, Jiang RHY, Zody MC,

Kunjeti SG, Donofrio NM et al: Genome Evolution Following Host Jumps in the Irish Potato

Famine Pathogen Lineage. Science 2010, 330(6010):1540-1543.

188. Raffaele S, Win J, Cano LM, Kamoun S: Analyses of genome architecture and gene expression

reveal novel candidate virulence factors in the secretome of Phytophthora infestans. BMC

Genomics 2010, 11:637.

189. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006,

38(5):500-501.

190. Reinartz J, Bruyns E, Lin JZ, Burcham T, Brenner S, Bowen B, Kramer M, Woychik R: Massively

parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression

profiling in all organisms. Brief Funct Genomic Proteomic 2002, 1(1):95-104.

191. Rep M, Kistler HC: The genomic organization of plant pathogenicity in Fusarium species.

Current opinion in plant biology 2010, 13(4):420-426.

192. Rho HS, Kang S, Lee YH: Agrobacterium tumefaciens-mediated transformation of the plant

pathogenic fungus, Magnaporthe grisea. Mol Cells 2001, 12(3):407-411.

193. Richards TA, Dacks JB, Jenkinson JM, Thornton CR, Talbot NJ: Evolution of filamentous plant

pathogens: gene exchange across eukaryotic kingdoms. Current biology : CB 2006,

16(18):1857-1864.

194. Richards TA, Soanes DM, Foster PG, Leonard G, Thornton CR, Talbot NJ: Phylogenomic analysis

demonstrates a pattern of rare and ancient horizontal gene transfer between plants and fungi.

The Plant cell 2009, 21(7):1897-1911.

195. RM E: Influence of temperature on rate of growth of five fungus pathogens of turfgrass and

on rate of disease spread. Phytopathology 1963, 53:857-877.

196. Rodriguez-Carres M, White G, Tsuchiya D, Taga M, VanEtten HD: The supernumerary

chromosome of Nectria haematococca that carries pea-pathogenicity-related genes also

carries a trait for pea rhizosphere competitiveness. Applied and environmental microbiology

233 2008, 74(12):3849-3856.

197. Roeder RG: The role of general initiation factors in transcription by RNA polymerase II. Trends

Biochem Sci 1996, 21(9):327-335.

198. Rosewich UL, Kistler HC: Role of Horizontal Gene Transfer in the Evolution of Fungi. Annual

review of phytopathology 2000, 38:325-363.

199. Rotem J: The Genus Alternaria. Biology, epidemiology, and pathogenicity. St. Paul: APS Press;

1994.

200. Rouxel T, Grandaubert J, Hane JK, Hoede C, van de Wouw AP, Couloux A, Dominguez V,

Anthouard V, Bally P, Bourras S et al: Effector diversification within compartments of the

Leptosphaeria maculans genome affected by Repeat-Induced Point mutations. Nat Commun

2011, 2:202.

201. Rowe HC, Kliebenstein DJ: Elevated genetic variation within virulence-associated Botrytis

cinerea polygalacturonase loci. Molecular plant-microbe interactions : MPMI 2007,

20(9):1126-1137.

202. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers.

Methods Mol Biol 2000, 132:365-386.

203. Ruyi Xiong JL, Yijun Zhou, Yongjian Fan, Xiaobo Zheng: Screening and identification of mutants

of Magnaporthe oryzae by REMI. Front Agric China 2007, 1(2):179-182.

204. S. Wendenbaum PD, A. Dell, J. M. Meyer and M. A. Abdallah: The structure of pyoverdine Pa,

the siderophore of Pseudomonas aeruginosa. Tetrahedron Letters 1983, 24(44):4877-4880.

205. Sabot F, Simon D, Bernard M: Plant transposable elements, with an emphasis on grass species.

Euphytica 2004, 139(3):227-247.

206. SALAMIAH HA, Yukitaka FUKUMASA-NAKAI, Hiroshi OTANI and Motoichiro KODAMA:

Construction and Genetic Analysis of Hybrid Strains between Apple and Tomato Pathotypes of

Alternaria alternata by Protoplast Fusion J Gen Plant Pathol 2001, 67(2):97-105.

207. SALAMIAH YF-N, Hajime AKAMATSU, Hiroshi OTANI, Keisuke KOHMOTO and Motoichiro

KODAMA: Genetic Analysis of Pathogenicity and Host-specific Toxin Production of Alternaria

234 alternata Tomato Pathotype by Protoplast Fusion J Gen Plant Pathol 2001, 67(1):7-14.

208. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome

research 2000, 10(4):516-522.

209. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access

database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004,

32(Database issue):D91-94.

210. Schirawski J, Mannhaupt G, Munch K, Brefort T, Schipper K, Doehlemann G, Di Stasio M,

Rossel N, Mendoza-Mendoza A, Pester D et al: Pathogenicity determinants in smut fungi

revealed by genome comparison. Science 2010, 330(6010):1546-1548.

211. Schmidt SM, Panstruga R: Pathogenomics of fungal plant parasites: what have we learnt

about pathogenesis? Current opinion in plant biology 2011.

212. Schmitt I, Lumbsch HT: Ancient horizontal gene transfer from bacteria enhances biosynthetic

capabilities of fungi. PloS one 2009, 4(2):e4437.

213. Scholz-Schroeder BK, Soule JD, Gross DC: The sypA, sypS, and sypC synthetase genes encode

twenty-two modules involved in the nonribosomal peptide synthesis of syringopeptin by

Pseudomonas syringae pv. syringae B301D. Molecular plant-microbe interactions : MPMI

2003, 16(4):271-280.

214. Schoonbeek H, Del Sorbo G, De Waard MA: The ABC transporter BcatrB affects the sensitivity

of Botrytis cinerea to the phytoalexin resveratrol and the fungicide fenpiclonil. Mol Plant

Microbe In 2001, 14(4):562-571.

215. Seidl MF, Van den Ackerveken G, Govers F, Snel B: Reconstruction of oomycete genome

evolution identifies differences in evolutionary trajectories leading to present-day large gene

families. Genome Biol Evol 2012, 4(3):199-211.

216. Selker EU, Cambareri EB, Jensen BC, Haack KR: Rearrangement of duplicated DNA in

specialized cells of Neurospora. Cell 1987, 51(5):741-752.

217. Service RF: Gene sequencing. The race for the $1000 genome. Science 2006,

311(5767):1544-1546.

235 218. Sharp PM, Li WH: The codon Adaptation Index--a measure of directional synonymous codon

usage bias, and its potential applications. Nucleic acids research 1987, 15(3):1281-1295.

219. Sharp PM, Tuohy TM, Mosurski KR: Codon usage in yeast: cluster analysis clearly

differentiates highly and lowly expressed genes. Nucleic acids research 1986,

14(13):5125-5143.

220. Silue D, Notteghem JL, Tharreau D: Evidence of a Gene-for-Gene Relationship in the

Oryza-Sativa-Magnaporthe-Grisea Pathosystem. Phytopathology 1992, 82(5):577-580.

221. Simmons EG: Alternaria themes and variations (236-243) - Host-specific toxin producers.

Mycotaxon 1999, 70:325-369.

222. Skamnioti P, Pedersen C, Al-Chaarani GR, Holefors A, Thordal-Christensen H, Brown JK, Ridout

CJ: Genetics of avirulence genes in Blumeria graminis f.sp. hordei and physical mapping of

AVR(a22) and AVR(a12). Fungal Genet Biol 2008, 45(3):243-252.

223. Soanes DM, Chakrabarti A, Paszkiewicz KH, Dawe AL, Talbot NJ: Genome-wide Transcriptional

Profiling of Appressorium Development by the Rice Blast Fungus Magnaporthe oryzae. Plos

Pathogens 2012, 8(2).

224. Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler.

BMC bioinformatics 2007, 8:64.

225. Spanu PD, Abbott JC, Amselem J, Burgis TA, Soanes DM, Stuber K, van Themaat EVL, Brown

JKM, Butcher SA, Gurr SJ et al: Genome Expansion and Gene Loss in Powdery Mildew Fungi

Reveal Tradeoffs in Extreme Parasitism. Science 2010, 330(6010):1543-1546.

226. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D,

Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast

Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273-3297.

227. Stanke M, Morgenstern B: AUGUSTUS: a web server for gene prediction in eukaryotes that

allows user-defined constraints. Nucleic acids research 2005, 33:W465-W467.

228. Stefanato FL, Abou-Mansour E, Buchala A, Kretschmer M, Mosbach A, Hahn M, Bochet CG,

Metraux JP, Schoonbeek HJ: The ABC transporter BcatrB from Botrytis cinerea exports

236 camalexin and is a virulence factor on Arabidopsis thaliana. Plant Journal 2009,

58(3):499-510.

229. Stormo GD: Consensus patterns in DNA. Methods Enzymol 1990, 183:211-221.

230. Stukenbrock EH, Bataillon T, Dutheil JY, Hansen TT, Li R, Zala M, McDonald BA, Wang J,

Schierup MH: The making of a new pathogen: insights from comparative population genomics

of the domesticated wheat pathogen Mycosphaerella graminicola and its wild sister species.

Genome research 2011, 21(12):2157-2166.

231. Stukenbrock EH, Jorgensen FG, Zala M, Hansen TT, McDonald BA, Schierup MH:

Whole-genome and chromosome evolution associated with host adaptation and speciation of

the wheat pathogen Mycosphaerella graminicola. PLoS genetics 2010, 6(12):e1001189.

232. Sweigard JA, Carroll AM, Kang S, Farrall L, Chumley FG, Valent B: Identification, Cloning, and

Characterization of Pwl2, a Gene for Host Species-Specificity in the Rice Blast Fungus. The

Plant cell 1995, 7(8):1221-1233.

233. Syvanen m: Cross-species gene transfer; implications for a new theory of evolution. Journal of

Theoretical Biology 1985, 112(2):333-343.

234. Takabayashi N, Tosa Y, Oh HS, Mayama S: A gene-for-gene relationship underlying the

species-specific parasitism of Avena/Triticum isolates of Magnaporthe grisea on wheat

cultivars. Phytopathology 2002, 92(11):1182-1188.

235. Talbot NJ: On the trail of a cereal killer: Exploring the biology of Magnaporthe grisea. Annual

Review of Microbiology 2003, 57:177-202.

236. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular

evolutionary genetics analysis using maximum likelihood, evolutionary distance, and

maximum parsimony methods. Molecular biology and evolution 2011, 28(10):2731-2739.

237. Temporini ED, VanEtten HD: An analysis of the phylogenetic distribution of the pea

pathogenicity genes of Nectria haematococca MPVI supports the hypothesis of their origin by

horizontal transfer and uncovers a potentially new pathogen of garden pea: Neocosmospora

boniensis. Current genetics 2004, 46(1):29-36.

237 238. Thon MR, Pan H, Diener S, Papalas J, Taro A, Mitchell TK, Dean RA: The role of transposable

element clusters in genome evolution and loss of synteny in the rice blast fungus

Magnaporthe oryzae. Genome biology 2006, 7(2):R16.

239. Timothy L Friesen EHS, Zhaohui Liu, Steven Meinhardt, Hua Ling, Justin D Faris1, Jack B

Rasmussen, Peter S Solomon, Bruce A McDonald & Richard P Oliver: Emergence of a new

disease as a result of interspecific virulence gene transfer. Nature Genetics 2006, 38:953-956.

240. Tosa Y, Tamba H, Tanaka K, Mayama S: Genetic analysis of host species specificity of

Magnaporthe oryzae isolates from rice and wheat. Phytopathology 2006, 96(5):480-484.

241. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ,

Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts

and isoform switching during cell differentiation. Nat Biotechnol 2010, 28(5):511-515.

242. Tsuge T, Tanaka A, Shiotani H, Yamamoto M: Insertional mutagenesis and cloning of the genes

required for biosynthesis of the host-specific AK-toxin in the Japanese pear pathotype of

Alternaria alternata. Mol Plant Microbe In 1999, 12(8):691-702.

243. Tsuge T, Tanaka A: Structural and functional complexity of the genomic region controlling

AK-toxin biosynthesis and pathogenicity in the Japanese pear pathotype of Alternaria

alternata. Mol Plant Microbe In 2000, 13(9):975-986.

244. Tyler BM, Tripathy S, Zhang X, Dehal P, Jiang RH, Aerts A, Arredondo FD, Baxter L, Bensasson D,

Beynon JL et al: Phytophthora genome sequences uncover evolutionary origins and

mechanisms of pathogenesis. Science 2006, 313(5791):1261-1266.

245. Tzeng TH, Lyngholm LK, Ford CF, Bronson CR: A restriction fragment length polymorphism

map and electrophoretic karyotype of the fungal maize pathogen Cochliobolus

heterostrophus. Genetics 1992, 130(1):81-96.

246. Valent B, Chumley FG: Molecular Genetic-Analysis of the Rice Blast Fungus,

Magnaporthe-Grisea. Annual review of phytopathology 1991, 29:443-467.

247. Valent B, Khang CH: Recent advances in rice blast effector research. Current opinion in plant

biology 2010, 13(4):434-441.

238 248. Valvekens D, Vanmontagu M, Vanlijsebettens M: Agrobacterium-Tumefaciens-Mediated

Transformation of Arabidopsis-Thaliana Root Explants by Using Kanamycin Selection.

Proceedings of the National Academy of Sciences of the United States of America 1988,

85(15):5536-5540.

249. Van de Wouw AP, Cozijnsen AJ, Hane JK, Brunner PC, McDonald BA, Oliver RP, Howlett BJ:

Evolution of linked avirulence effectors in Leptosphaeria maculans is affected by genomic

environment and exposure to resistance genes in host plants. PLoS Pathog 2010,

6(11):e1001180.

250. VanEtten H, Jorgensen S, Enkerli J, Covert SF: Inducing the loss of conditionally dispensable

chromosomes in Nectria haematococca during vegetative growth. Current genetics 1998,

33(4):299-303.

251. Veerappan CS, Avramova Z, Moriyama EN: Evolution of SET-domain protein families in the

unicellular and multicellular Ascomycota fungi. Bmc Evol Biol 2008, 8:190.

252. Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid

transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol

2008, 17(7):1636-1647.

253. Vetrivel U, Arunkumar V, Dorairaj S: ACUA: a software tool for automated codon usage

analysis. Bioinformation 2007, 2(2):62-63.

254. Vogel J: Unique aspects of the grass cell wall. Current opinion in plant biology 2008,

11(3):301-307.

255. Walsh B, Ikeda SS, Boland GJ: Biology and management of dollar spot (Sclerotinia

homoeocarpa); an important disease of turfgrass. Hortscience 1999, 34(1):13-21.

256. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev

Genet 2009, 10(1):57-63.

257. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V,

Roth GT et al: The complete genome of an individual by massively parallel DNA sequencing.

Nature 2008, 452(7189):872-876.

239 258. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J,

Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide

resolution. Nature 2008, 453(7199):1239-1243.

259. Wilson RA, Talbot NJ: Under pressure: investigating the biology of plant infection by

Magnaporthe oryzae. Nat Rev Microbiol 2009, 7(3):185-195.

260. Win Jea: Adaptive evolution has targeted the C-terminal domain of the RXLR effectors of plant

pathogenic oomycetes. The Plant cell 2007, 19:2349-2369.

261. Wolpert TJ, Dunkle LD, Ciuffetti LM: Host-selective toxins and avirulence determinants: what's

in a name? Annual review of phytopathology 2002, 40:251-285.

262. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M: Systematic

discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several

mammals. Nature 2005, 434(7031):338-345.

263. Xue MF, Yang J, Li ZG, Hu SNA, Yao N, Dean RA, Zhao WS, Shen M, Zhang HW, Li C et al:

Comparative Analysis of the Genomes of Two Field Isolates of the Rice Blast Fungus

Magnaporthe oryzae. PLoS genetics 2012, 8(8).

264. Yamagishi D, H. Akamatsu, H. Otani, and M. Kodama.: Pathological evaluation of host-specific

AAL-toxins and fumonisin mycotoxins produced by Alternaria and Fusarium species. J Gen

Plant Pathol 2006, 72(5):323-327.

265. Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under

realistic evolutionary models. Molecular biology and evolution 2000, 17(1):32-43.

266. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and

evolution 2007, 24(8):1586-1591.

267. Yasunori Akagi MT, Mikihiro Yamamoto, Takashi Tsuge, Yukitaka Fukumasa-Nakai, Hiroshi

Otani and Motoichiro Kodama: Chromosome constitution of hybrid strains constructed by

protoplast fusion between the tomato and strawberry pathotypes of Alternaria alternata. J

Gen Plant Pathol 2009, 75(2):101-109.

268. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Tosa Y, Chuma I, Takano Y, Win J,

240 Kamoun S et al: Association genetics reveals three novel avirulence genes from the rice blast

fungal pathogen Magnaporthe oryzae. The Plant cell 2009, 21(5):1573-1591.

269. Yu JH, Keller N: Regulation of secondary metabolism in filamentous fungi. Annual review of

phytopathology 2005, 43:437-458.

270. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn

graphs. Genome research 2008, 18(5):821-829.

271. Zhu G, Spellman PT, Volpe T, Brown PO, Botstein D, Davis TN, Futcher B: Two yeast forkhead

genes regulate the cell cycle and pseudohyphal growth. Nature 2000, 406(6791):90-94.

241