ABSTRACT

SAILSBERY, JOSHUA KENT. Comparative Genomic and Transcriptional Analyses of Magnaporthe oryzae and other Eukaryotes. (Under the direction of Dr. Ralph A. Dean).

Magnaporthe oryzae devastates rice crops around the world, destroying enough rice to feed at least 60 million people annually. Herein, I investigate this pathogen to elucidate the genomic components utilized during infection. Investigated components include regulatory elements, RNA sequences, and genes (especially transcription factors).

Chapter 1 contains the background and introduction to my research. In Chapter 2, I characterize a common transcription factor component, the basic Helix-Loop-Helix (bHLH) domain, in M. oryzae and other fungi. Through phylogenetic analyses, I identified 12 major groupings within Fungi, along with conserved motifs and functions specific to each group. Several classification models were built to distinguish the 12 groups and elucidate the most discerning sites in the domain. These models were highly accurate and led to the identification of 12 highly discerning sites (1, 4, 6, 7, 8, 12, 15, 16, 19, 20, 50, and 53), which were incorporated into a set of rules to classify sequences into one of these 12 groups. Conservation of amino acid sites and phylogenetic analyses established that, like plant bHLH proteins, fungal bHLH-containing proteins are most closely related to animal Group B.

In Chapter 3, I assessed the bHLH domain across Plants, Animals, and Fungi to identify unique sequence characteristics pertaining to each Kingdom. Using classification models, I identified five essential amino acid sites that are highly characteristic of these Kingdoms. Hidden Markov Models, built on expertly aligned domains, were used with the classification models to identify and classify bHLH sequences from a marine environmental sample. Last, I created an online tool that can align, extract, and classify bHLH sequences.

In Chapter 4, next-generation sequencing was used to perform a detailed examination and characterization of small RNA molecules from mycelia and appressoria. In a collaborative project, my work showed that genomic features contributed differentially to the RNA sequence libraries. Mycelia RNAs were enriched for intergenic and repetitive elements, while a higher proportion of appressoria RNAs mapped to tRNA loci. Differential mapping of small RNAs to the 5’ and 3’ halves of mature tRNAs was also observed. This led to the identification of sites with post-transcriptional modification within tRNAs and showed a difference in that modification between the two tissues.

In a second collaborative RNA study (Chapter 5), methylguanosine-capped and polyadenylated small RNAs (CPA-sRNAs) were sequenced with 454 technologies. My work showed that CPA-sRNAs mapped to rRNAs, tRNAs, snRNAs, transposable elements, and intergenic regions. Where CPA-sRNAs mapped to protein-coding genes, they were predominantly associated with transcriptional start and termination sites. Enrichment for CPA-sRNAs, especially among genes encoding ribosomal proteins, was positively correlated with gene expression.

Finally, in Chapter 6, I designed a new comparative genomics software package (D-SynD) that can detect regions of syntenic DNA between multiple large genomes simultaneously. D-SynD requires no gene models and makes no assumptions with regard to gene order or orientation. Additionally, detected syntenic regions are statistically evaluated for significance. The software offers many user options, such as defining the preferred syntenic region size and complexity. D-SynD is released as an open-source software package for use in comparative genomic studies.

Comparative Genomic and Transcriptional Analyses of Magnaporthe oryzae and other Eukaryotes

by Joshua Kent Sailsbery

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Bioinformatics

Raleigh, North Carolina

2011

APPROVED BY:

Dr. Ralph A. Dean, Co-Chair of Advisory Committee

Dr. Ignazio Carbone, Co-Chair of Advisory Committee

Dr. Gary A. Payne

Dr. Eric A. Stone

Dr. Jeffrey Thorne

DEDICATION

Without question, this work is dedicated to my beloved wife Stacy Dawn Sailsbery

and our two boys

Aiden Kent and Ryan James Sailsbery.

BIOGRAPHY

Joshua was born on May 31st, 1979 and raised in the Pacific Northwest. In high school, with the support of his entire family, he participated in the Washington State-funded Running Start program. As a result, Joshua graduated with his two-year degree a day prior to obtaining his high school diploma. In 1998, Joshua moved to Provo, UT and started his undergraduate research at Brigham Young University. After a year working toward his Bachelor’s degree, Joshua put his academic career on hold to serve the wonderful people struggling along the Rio Grande.

In 2001, after completing two years of service in southeast Texas, he returned to BYU to complete his studies. During this time he met and married his sweetheart, Stacy Searle. While planning his professional career, Joshua overheard a Bioinformatics student explain the field to a career counselor, and Joshua was hooked. From there, he was employed by Dr. David A. McClellan, an Integrated Biology professor, to create two Bioinformatics software packages (CDM and TreeSAAP).

After graduating from BYU with a Bachelor’s degree in Computer Science, Computational Biology, and Bioinformatics in 2005, Joshua pursued graduate school at North Carolina State University. With funding provided by the NIH endowed IGERT grant, Joshua joined the Bioinformatics Research Center. During the next three years, Joshua had the opportunity to learn from and work with many talented professionals in his field.

Following his third year of study, Joshua joined Dr. Ralph A. Dean’s lab at the Center for Integrated Fungal Research. While there, he had the opportunity to make large contributions to two projects and to lead three projects of his own. He was also fortunate, over the course of three summers, to mentor junior researchers recruited through the NSF-supported Research Experience for Undergraduates program.

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Dr. Ralph A. Dean, for the opportunity to work in the Fungal Genomics Lab. He has provided immeasurable support in critical thinking, manuscript editing, and project direction. In many ways he embodies scientific excellence. I will be ever grateful to have been a part of his team at the Center for Integrated Fungal Research (CIFR).

I am also grateful to my committee members Eric A. Stone and Jeffrey Thorne for their wonderful lectures and for supporting my academic career since my arrival at NCSU. I would like to thank my other committee members, Gary Payne and Ignazio Carbone, for their suggestions on several projects and their support in finishing the requirements for this degree.

To the people behind the IGERT, NSF, and NIH grants that have funded my academic career, thank you. The funding has allowed me to participate in many exciting fields of study, support my family financially, and complete my Ph.D. work.

I am sincerely grateful to Douglas E. Brown for his example of professionalism in the workplace. His optimistic outlook on life was especially appreciated. Also, I would like to thank Minfeng Xue, who showed me the finer art of Perl scripting.

Many thanks to all the current and former members of CIFR for the expertise, professionalism, and great working environment they provided, including Vickie Randleman, Greg C. Bernard, Junhyun Jeon, William Sharpee, Dr. Malali Gowda, and Dr. Yeon Yee Oh.

I would like to thank all the Research Experience for Undergraduates students I had the pleasure of working with. It’s hard not to succeed when you have such talented and bright people working with you. I’m particularly grateful to Brent Clay, who toiled for more than a year, applying his considerable skills to advance our work.

Finally, I would like to thank my supportive family. Both of my parents, who have nothing but the utmost confidence in me. My sister Tawnie and her family, who moved to North Carolina and helped Stacy and me in innumerable ways. My two wonderful boys, who have sacrificed so much time with their father so he could “write his paper”. And most importantly, my wife Stacy, without whom none of this would even be possible. Thank you, Dear.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1-Insights to rice blast disease provided by the Magnaporthe oryzae genome
  Background
  References
CHAPTER 2-Phylogenetic Analysis and Classification of the Fungal bHLH Domain
  Abstract
  Introduction
  Material and Methods
  Whole genome Fungal bHLH Sequence Identification and Analysis
  Phylogenetic Analysis by Taxonomic grouping
  Classification Models
  Testing Methods
  Results
  Identification of Fungal bHLH sequences from whole genome projects
  Positional Conservation and Consensus Motif
  Phylogenic Analysis of Fungal sequences
  Expansion of F4 in Sordariomycetes
  Conserved Motifs in Fungal bHLH proteins
  Sequence classification using Decision Trees
  Sequence classification using Discriminant analysis
  Simplified Model for F1-F12
  Classification Model Testing
  Comparison of Positional Amino Acid Conservation
  Phylogenetic relationship of Fungal bHLH to Animal and Plant bHLH domains
  Cross Kingdom Classification
  Discussion
  Supplementary Material
  Acknowledgements
  References
CHAPTER 3-Fundamental Characteristics and Discerning Sites of the Eukaryotic bHLH Domain
  Abstract
  Introduction
  Material and Methods
  Classification Models
  Hidden Markov Models
  Testing Methods
  Environmental Sample
  Results
  Comparing conserved sites in bHLH domains for Plants, Animals, and Fungi
  Decision Trees Analysis
  StepWise Discriminant Analysis
  Canonical Variate Analysis
  Classification Model Testing
  bHLH Origins from Environmental Samples
  Discussion
  Supplementary Material
  Acknowledgements
  References
CHAPTER 4-Diverse and tissue-enriched small RNAs in the plant pathogenic fungus, Magnaporthe oryzae
  Summary
  Abstract
  Results
  Discussion
  Conclusions
  References
CHAPTER 5-Genome-wide characterization of methylguanosine-capped and polyadenylated small RNAs in the rice blast fungus Magnaporthe oryzae
  Summary
  Abstract
  Materials and Methods
  Results
  Discussion
  References
CHAPTER 6-D-SynD: Dimensional Synteny Detection, Identification of Syntenic Regions between Multiple Genomes
  Abstract
  Introduction
  Implementation
  Algorithm
  Finding Homologous DNA
  Superblock Formation
  CSS Segments
  Distance Matrix
  Single-link Hierarchical Clustering
  Thresholds
  Cluster Scoring Balanced Statistic
  Cluster Visualization
  Statistical Validation of the CSBS
  Determining Significant Clusters
  Case Studies
  Conclusion
  Supplemental
  Acknowledgements
  References
APPENDICES
  Appendix A
  Appendix B
  Appendix C
  Appendix D
  Appendix E

LIST OF TABLES

Phylogenetic Analysis and Classification of the Fungal bHLH Domain
  Table 1-Gene count summary of the 12 fungal bHLH groups by completed fungal genome
  Table 2-Structural attributes and significant sites of the bHLH domain
  Table 3-Known biological and molecular functions for bHLH proteins by fungal group
  Table 4-Validation of bHLH classification methods
  Table 5-Simplified model for the 12 fungal groups
  Table 6-Cross Kingdom classification of bHLH domains
  Table 7-Mahalanobis distance between animal and fungal bHLH groups

Fundamental Characteristics and Discerning Sites of the Eukaryotic bHLH Domain
  Table 1-Structural attributes and significant sites of the bHLH domain
  Table 2-Validation of bHLH classification methods
  Table 3-Discerning ability of determined canonical variates
  Table 4-Classification of unknown bHLH sequences

Diverse and tissue-enriched small RNAs in the plant pathogenic fungus, Magnaporthe oryzae
  Table 1-Summary statistics of small RNA libraries from mycelia and appressoria tissues
  Table 2-Distribution of mycelia small RNAs with perfect map to nuclear and mitochondria genomes
  Table 3-Distribution of appressoria small RNAs with perfect map to nuclear and mitochondria genomes
  Table 4-Association of mycelia small RNAs with ESTs and ESS
  Table 5-Association of appressoria small RNAs with ESTs and ESS

Genome-wide characterization of methylguanosine-capped and polyadenylated small RNAs in the rice blast fungus Magnaporthe oryzae
  Table 1-Distribution of CPA-sRNAs mapped to genomic and mitochondrial features
  Table 2-Association of CPA-sRNAs with other transcriptional evidence

D-SynD: Dimensional Synteny Detection, Identification of Syntenic Regions between Multiple Genomes
  Table 1-Cluster Attributes based on Threshold cutoff for Sordariomycetes
  Table 2-Pair-wise summary statistics of the comparative analyses
  Table 3-Multidimensional summary statistics of the comparative analyses
  Table 4-Significant cluster attributes for M. oryzae

LIST OF FIGURES

Phylogenetic Analysis and Classification of the Fungal bHLH Domain

Figure 1–Fungi bHLH entropies, logos, and consensus. A. The bHLH normalized group entropy by position. Lower values indicate conservation, while values close to one approach complete randomness. B. The graphical representation of the amino acids at each position of the bHLH domain. Symbols representing amino acids are scaled by their bit score (a derivation of entropy) at a given position. C. The 50-10 consensus sequences for Fungi. Using an alignment of bHLH domains, amino acids occurring at a frequency of more than 50% at a given site are displayed. At each of these sites, additional amino acids are displayed beneath if they are conserved in 10% or more of the sequences.

Figure 2–Phylogenetic analysis of fungal bHLH. Phylogenetic relationships, taxonomic representation, bHLH motif statistics, and architecture of conserved protein motifs for 12 fungal bHLH groups. ML tree of 490 fungal bHLH proteins (full representation of Basidiomycota, Pezizomycotina, Saccharomycotina, and Fungal trees available in Supplemental Figure S1A-D). The tree is drawn to scale (branch lengths proportional to evolutionary distances) and has been rooted with a single representative from Chlamydomonas reinhardtii. Groups, determined by clades with strong support, are collapsed as triangles with width and depth proportional to the size and sequence divergence of each group, respectively. Groups supported by bootstrap values >30 in Neighbor-Joining (NJ) or Maximum Parsimony (MP) analyses are colored black. The shaded group F9 was ambiguously retrieved in NJ, MP, or BA trees. Ungrouped genes are indicated as single lines, and the scale bar represents the estimated number of amino acid replacements per site. Basidiomycota (B), Pezizomycotina (P), and Saccharomycotina (S) clades associated with each fungal group are noted in brackets. The architecture of conserved motifs, as determined through MEME, is graphically represented as boxes drawn to scale. Box enumeration corresponds to specific motifs found by MEME (Supplemental Figure 2C). Grey boxes represent motifs that match the bHLH domain. Last of all, the sequence logos for B4, P4, and S4 for 21 amino acids downstream of the bHLH domain are shown (motif 20).

Figure 3–NJ analyses of F4 proteins from Sordariomycete organisms. The NJ trees were inferred from the bHLH domain only and from the entire bHLH-containing protein sequence. The trees are displayed with corresponding bootstrap values, where branches with a bootstrap of less than 30 have been collapsed. Six strongly supported F4 subclades have been noted, where subclades A-C contain one member from each Sordariomycete organism and subclades D-F each have one member from P. anserina, S. macrospora, and N. crassa. * The Q2GSP1 bHLH domain has been determined to be highly divergent by 1) containing 7 mismatches to the fungal consensus motif; 2) possessing an uncharacteristic amino acid at site 12 for an F4 protein (D); and 3) containing a simple sequence repeat through both the basic and Helix 2 regions (DDDDDD). The full Q2GSP1 sequence, however, is strongly supported in the A subclade, suggesting the highly divergent bHLH arose from either evolutionary pressure or sequencing errors.

Figure 4–Fungal decision tree analysis. The decision tree describing the separation of bHLH fungal groups F1-F12 by amino acid sites found in the bHLH domain. Each box of the figure represents a step in the decision tree, which consists of a number of bHLH sequences from each fungal group and the amino acids at a given bHLH position (state). The sample size and proportion of group representatives are provided in the accompanying table. Diamonds contain the bHLH amino acid site which bifurcates the data into subsets of the previous state.

Figure 5–Canonical Variate Analysis of fungal bHLH groups. A. Projection of 488 fungal bHLH sequences onto eigenvectors (Canonical Variates) for the all dataset. Plot contains the first and second Canonical Variate out of 11 total. Axes reflect the Mahalanobis distance between fungal groups F1-F12. B. Pairwise Mahalanobis distance between fungal group centroids in the CVA{all} analysis.

Fundamental Characteristics and Discerning Sites of the Eukaryotic bHLH Domain

Figure 1–Animal, Plant, and Fungi bHLH entropies, logos, and consensus. A. The bHLH normalized group entropy by position. Lower values indicate conservation, while values close to one approach complete randomness. B. The graphical representation of the amino acids at each position of the bHLH domain. Symbols representing amino acids are scaled by their bit score (a derivation of entropy) at a given position. C. The 50-10 consensus sequences for Animal, Plant, and Fungi. Using an alignment of bHLH domains within each Kingdom, amino acids occurring at a frequency of more than 50% at a given site are displayed. At each of these sites, additional amino acids are displayed if they are conserved in 10% or more of the sequences. The Plant and Animal consensus sequences are derived from previous work (Carretero-Paulet et al., 2010; Atchley et al., 1999). Boundaries between bHLH subdomains are shown at the bottom.

Figure 2–Decision tree describing the classification of Plant, Animal, and Fungal bHLH sequences by amino acid sites found in the bHLH domain. Each box of the figure represents a step in the decision tree, which consists of a number of bHLH sequences from each Kingdom and the amino acids at a given bHLH position (state). The sample size and proportion of group representatives are provided in the accompanying table. Diamonds contain the bHLH amino acid site which bifurcates the data into subsets of the previous state. Analysis is similar to a dichotomous taxonomic key.

Figure 3–Projection of PAF sequences onto canonical vectors for the factor score transformed datasets. Each plot contains the two canonical variates from the CVA and the square root of the Mahalanobis pairwise distance between the centroids of Plants, Animals, and Fungi.

Figure 4–HMM results from test dataset and the marine environmental sample. A. Overlap of bHLH sequences found with the Kingdom-specific HMMs on the environmental dataset. B. The number of bHLH sequences found using Kingdom-specific HMMs on the 6987 test dataset. C. SWDA and CVA classification of the 376 bHLH sequences identified by combinatorial HMMs on the environmental data. D. Classification of the Kingdom-specific HMM sequences on the environmental sample. SWDA and CVA results are reported as “Correctly Classified (Total Classified)” for a given set of Kingdom-specific sequences.

D-SynD: Dimensional Synteny Detection, Identification of Syntenic Regions between Multiple Genomes

Figure 1–Workflow of the D-SynD software package.

Figure 2–Homologous DNA segments of a Superblock. Query (Q) and subject (O1, O2) organisms are designated. Connections between subject and query organisms represent homologous DNA used to construct Q’s Superblock. The Superblock represents a nearly contiguous segment of homologous DNA within Q.

Figure 3–M. poae vs. G. graminis dot plots. A. All pair-wise CSS between M. poae and G. graminis. The points represent homologous DNA between two genomic locations of the organisms. Genomes are graphically represented by sequentially appending supercontigs and contigs end to end. B. Pair-wise dots located only within significant clusters.

Figure 4–Formation of a three-dimensional CSS from pair-wise CSS. The axes O1, O2, and O3 represent the genomes of three organisms, and black boxes are Superblocks on each organism. Blue and red dots represent two-dimensional and three-dimensional CSS, respectively.

Figure 5–Geometric representation of two pair-wise CSS segments. The distance between the CSS is determined by the DAB, CD function.

Figure 6–Single-Linkage Clustering of a Sample Data Set. A. Dot plot of 16 pair-wise CSS, numbered 1-16. Subsets 1 and 2 show the dendrogram created by SLH clustering, which connects all the CSS segments. A threshold cut-off at a distance of 3 is also demonstrated. Only clusters that were formed below this threshold are kept. B. The combine and merge steps of SLH clustering are shown in detail. CSS segments in parentheses are connected into clusters of two. These clusters are then represented by brackets and the step number on which they were created. Dots are then added to existing clusters, e.g. CSS 1 (1) into cluster 2 [2]. Finally, clusters are merged together, resulting in a new larger cluster, e.g. cluster [6] is formed by combining the CSS in clusters [4] and [5].

Figure 7–CSBS measurements on hypothetical clusters from SLHC. This figure demonstrates two hypothetical clusters C1, C2 between two organisms OA, OB. LC,O is the full length of a given cluster and lS,O is the length of the involved Superblock S on organism O. Pair-wise CSS are represented as two-dimensional segments.

Figure 8–Three-dimensional representation of a cluster from the Sordariomycetes comparative analysis. Gray boxes represent each 3D-CSS and are connected using the convex hull function. Rendering of 3D objects was performed using the Blender open-source suite.

Figure 9–Distributions of the Global Cluster. A. Tally of possible clusters by size (CSS count) from a set of 10 CSS. B. The CSBS of 10,000 clusters sampled randomly from the Global Cluster distribution of 300 CSS. C. Tally of clusters by size from a simulated SLHC run with 300 CSS. D. The CSBS of a simulated SLHC run on 300 CSS.

Figure 10–Noise reduction in the Hominidae study. Two sets of graphs are shown. “All 3D-CSS” contains the two-dimensional representations of the 3D-CSS; these are essentially the dot plots of homologous DNA between the organisms on the axes. “Sig. Clusters” plots only those 3D-CSS that belong to significant clusters at the α = 0.05 level.

Figure 11–GEVO views of syntenic regions in the Sordariomycetes analysis. A. A nearly perfect syntenic block. B. A syntenic region with few to no conserved genes.

Chapter 1

Insights to rice blast disease provided by the Magnaporthe oryzae genome

Joshua K. Sailsbery1,2

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA

2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA

BACKGROUND

Three billion people depend on rice as a principal source of nutrition. Indeed, with an ever-growing world population, the demand for rice is continually increasing. Each year, rice yield must increase by 3% to meet demand, with a total increase of 38% by the year 2030 [1]. Crop loss due to disease is one of the main factors limiting increases in rice production, even outweighing threats such as global warming, soil fertility, pests, and water availability. Accounting for a 2% loss in total rice production [2], rice blast disease, caused by Magnaporthe oryzae, is the most damaging disease of rice. Rice blast has been responsible for numerous large-scale epidemics, including the destruction of 5.7 million hectares of rice in China between 2001 and 2005. One of the factors that makes M. oryzae so devastating is its ability to quickly adapt to and infect blast-resistant rice cultivars. For these reasons, elucidating the fundamental processes, pathways, and functions employed by M. oryzae to overcome plant resistance is essential in order to feed future world populations.

Due to its importance as a rice pathogen, M. oryzae has quickly become a model organism for plant pathogenesis. With the release of its completed genome, new types of studies have been performed to analyze the secretory and transcriptional mechanisms of this organism as they contribute to pathogenesis [3]. The genome itself is composed of 40 Mb over 7 chromosomes, containing approximately 11,000 genes. Additionally, over 10% of the genome is repetitive DNA, creating additional challenges in aligning and identifying origins of small sequence data. Repetitive DNA is essential for M. oryzae genome functionality, providing the structural diversity necessary for adapting to resistant rice cultivars. Further, genetic work on M. oryzae does not end with a sequenced genome, as the functions of many genes important for pathogenesis remain to be characterized.

Transcriptional studies have utilized the genome to identify genes involved in various stages of pathogenesis [4, 5]. For example, Oh et al. provided new clues to the role of protein turnover up-regulated in pathogenesis. Gowda et al. used the genome to develop resources such as massively parallel signature sequencing (MPSS) and robust-long serial analysis of gene expression (RL-SAGE) [6]. By sequencing RNA strands, Gowda was able to compare tissue samples taken from different stages in M. oryzae’s life cycle, further elucidating genes involved in pathogenicity. Such RNA studies have the potential to yield even greater insights into M. oryzae genomic features, regulatory patterns, and host interactions. Within animals and plants, small RNAs are known to be involved in a number of critical biological processes, including DNA damage response, biotic stress response, and genome integrity [7, 8]. Knowledge of these small RNAs is lacking for fungi. Thus, in Chapters 4 and 5 we conduct two such studies, further expanding the knowledge base of M. oryzae RNA.

Transcription of genes within the genome is regulated by a variety of different transcription factors, such as Max, MAT, TFIIa-1, and SREBP, among many others [9]. Nearly all transcription factors interact with genomic sequences through a DNA-binding domain (a short conserved sequence). One of the most ancient and conserved DNA-binding domains is the basic Helix-Loop-Helix (bHLH) domain [10, 11]. This extremely important domain is found across all eukaryotic life and plays a role in many biological processes such as neurogenesis, hematopoiesis, root and stem development, embryonic development, and many more. Proteins containing this domain have also been known to have a role in pathogenesis [12]; however, their roles within M. oryzae remain unknown. We set out to characterize the 10 bHLH domains found in M. oryzae by comparing the bHLH domains from pathogenic and non-pathogenic fungi in an evolutionary context. Our studies found 12 distinct patterns between the different bHLH sequences within fungi (Chapter 2). These 12 groups did not represent pathogens or non-pathogens solely, suggesting that there is more to the transcription factor/pathogenesis link than the bHLH binding site. To determine if these 12 patterns were distinct to Fungi, we characterized bHLH domains found across Plants, Animals, and Fungi (Chapter 3). This work led to the creation of an online tool that allows researchers to identify and characterize their own bHLH domains.

The ability to directly compare the genomes of related organisms can provide valuable insight into the evolution of traits, including the ability to infect plants. However, suitable bioinformatic tools are limited. Here I developed a tool to compare the genomes of two closely related plant pathogens, Magnaporthe poae and Gaeumannomyces graminis var. tritici, with M. oryzae. Most software used for comparative genomics detects small, highly syntenic (collinear) regions and requires accurate gene models [13–17], which were unavailable for M. poae and G. graminis. This led to the development of software (D-SynD) that can detect small collinear syntenic blocks and large conserved genomic structures (syntenic regions) without the need for gene models (Chapter 6). D-SynD is freely available software that researchers can use as another means to conduct a comparative genomic analysis.
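As an illustration only, and not the D-SynD implementation, the idea of grouping nearby homologous segments by single-link hierarchical clustering with a distance threshold (both named among the Chapter 6 algorithm steps) can be sketched in Python. The two-dimensional points, the Euclidean metric, the function name single_link_clusters, and the choice to discard unclustered points are all assumptions made for this sketch.

```python
from itertools import combinations


def single_link_clusters(points, threshold):
    """Return clusters (lists of point indices) linked at <= threshold.

    Cutting a single-linkage dendrogram at `threshold` yields the same
    groups as the connected components of the graph joining every pair of
    points no farther apart than `threshold`, so union-find is sufficient.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk to the root of i's set, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def distance(a, b):
        # Assumed metric for this sketch: plain Euclidean distance.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Link every pair of points that lies within the threshold.
    for i, j in combinations(range(len(points)), 2):
        if distance(points[i], points[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    # Assumption: points that never joined another point are dropped.
    return sorted(c for c in clusters.values() if len(c) > 1)


# Hypothetical coordinates standing in for matched genome positions.
pts = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (30, 5)]
print(single_link_clusters(pts, 3))  # [[0, 1, 2], [3, 4]]
```

Because single linkage chains neighbours together, points 0 and 2 above share a cluster even though each is linked through point 1; raising or lowering the threshold trades cluster size against how loosely the members may be spaced.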

Overall, this work has led to a better understanding of M. oryzae and the features that make it an effective pathogen. The transcriptional work has led to a better understanding of the roles of small RNA within M. oryzae. Studies of the bHLH domain led to the identification of 12 distinct groups, which provide insight into many of its transcription factors. Finally, this work led to the development of a comparative genomic tool which will allow researchers to further characterize genomic features conserved in M. oryzae and closely related species.

REFERENCES

1. Wilson RA, Talbot NJ: Under pressure: investigating the biology of plant infection by Magnaporthe oryzae. Nat. Rev. Microbiol. 2009, 7:185-195.

2. Skamnioti P, Gurr SJ: Against the grain: safeguarding rice from rice blast disease. Trends Biotechnol. 2009, 27:141-150.

3. Dean RA, Talbot NJ, Ebbole DJ, Farman ML, Mitchell TK, Orbach MJ, Thon M, Kulkarni R, Xu J-R, Pan H, Read ND, Lee Y-H, Carbone I, Brown D, Oh YY, Donofrio N, Jeong JS, Soanes DM, Djonovic S, Kolomiets E, Rehmeyer C, Li W, Harding M, Kim S, Lebrun M-H, Bohnert H, Coughlan S, Butler J, Calvo S, Ma L-J, Nicol R, Purcell S, Nusbaum C, Galagan JE, Birren BW: The genome sequence of the rice blast fungus Magnaporthe grisea. Nature 2005, 434:980-986.

4. Mathioni SM, Beló A, Rizzo CJ, Dean RA, Donofrio NM: Transcriptome profiling of the rice blast fungus during invasive plant infection and in vitro stresses. BMC Genomics 2011, 12:49.

5. Oh Y, Donofrio N, Pan H, Coughlan S, Brown DE, Meng S, Mitchell T, Dean RA: Transcriptome analysis reveals new insight into appressorium formation and function in the rice blast fungus Magnaporthe oryzae. Genome Biol. 2008, 9:R85.

6. Gowda M, Venu RC, Raghupathy MB, Nobuta K, Li H, Wing R, Stahlberg E, Couglan S, Haudenschild CD, Dean R, Nahm B-H, Meyers BC, Wang G-L: Deep and comparative analysis of the mycelium and appressorium transcriptomes of Magnaporthe grisea using MPSS, RL-SAGE, and oligoarray methods. BMC Genomics 2006, 7:310.

7. Chapman EJ, Carrington JC: Specialization and evolution of endogenous small RNA pathways. Nat Rev Genet 2007, 8:884-896.

8. Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993, 75:843-854.

9. Pabo CO, Sauer RT: Transcription factors: structural families and principles of DNA recognition. Annu. Rev. Biochem. 1992, 61:1053-1095.

10. Robinson KA, Lopes JM: SURVEY AND SUMMARY: Saccharomyces cerevisiae basic helix-loop-helix proteins regulate diverse biological processes. Nucleic Acids Res 2000, 28:1499-1505.

11. Jones S: An overview of the basic helix-loop-helix proteins. Genome Biol 2004, 5:226.

12. Osborne TF, Espenshade PJ: Evolutionary conservation and adaptation in the mechanism that regulates SREBP action: what a long, strange tRIP it’s been. Genes Dev. 2009, 23:2578-2591.

13. Soderlund C, Bomhoff M, Nelson WM: SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011, 39:e68.

14. Darling ACE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14:1394-1403.

15. Zeng X, Nesbitt MJ, Pei J, Wang K, Vergara IA, Chen N: OrthoCluster. In ACM Press; 2008:656.

16. Rödelsperger C, Dieterich C: CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PLoS ONE 2010, 5:e8861.

17. Boyer F, Morgat A, Labarre L, Pothier J, Viari A: Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functional data. Bioinformatics 2005, 21:4209-4215.

Chapter 2

Phylogenetic Analysis and Classification of the Fungal bHLH Domain

Joshua K. Sailsbery1,2, William Atchley3, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA
2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA
3 Department of Genetics, North Carolina State University, Raleigh, NC 27606, USA

Submitted for publication as a research paper in Molecular Biology and Evolution

* To whom correspondence should be addressed.

Ralph A. Dean 851 Main Campus Drive, Suite 233 Centennial Campus Center for Integrated Fungal Research North Carolina State University Raleigh, NC 27606 Tel.: +1-919-513-0020; Fax: +1-919-513-0024; Email: [email protected]

ABSTRACT

The basic Helix-Loop-Helix (bHLH) domain is an essential, highly conserved DNA-binding domain found in many transcription factors in all eukaryotic organisms. The bHLH domain has been well studied in the Animal and Plant Kingdoms, but has yet to be characterized within Fungi. Herein, we obtained and evaluated the phylogenetic relationship of 490 fungal specific bHLH containing proteins from 55 whole genome projects composed of 49 Ascomycota and 6 Basidiomycota organisms. We identified 12 major groupings within Fungi (F1-F12), along with conserved motifs and functions specific to each group. Several classification models were built to distinguish the 12 groups and elucidate the most discerning sites in the domain. Performance testing on these models, for correct group classification, resulted in a maximum sensitivity and specificity of 98.5% and 99.8%, respectively. We identified 12 highly discerning sites and incorporated those into a set of rules (simplified model) to classify sequences into the correct group. Conservation of amino acid sites and phylogenetic analyses established that, like plant bHLH proteins, fungal bHLH containing proteins are most closely related to animal Group B. The models used in these analyses were incorporated into a software package, the source code for which is available at www.fungalgenomics.ncsu.edu.

INTRODUCTION

The basic Helix-Loop-Helix (bHLH) domain is a highly conserved DNA-binding motif found in Eukarya and Bacteria that is involved in a number of important cellular signaling processes, including differentiation, metabolism, and environmental response [1–3]. Proteins containing the bHLH domain compose a superfamily of transcription factors commonly found in large numbers within plant, animal, and fungal genomes [4–6]. Across such transcription factors, the bHLH domain is evolutionarily conserved while little sequence similarity exists beyond the motif itself [7].

The ~60 amino acid bHLH region is divided into two main components: the basic and dimerization regions. The first 13 N-terminal amino acids are responsible for DNA interaction, generally containing 5 to 6 basic residues that facilitate DNA binding [8]. Many bHLH domains bind to the hexanucleotide sequence known as the E-box (CANNTG). The dimerization region consists of two amphipathic alpha-helices separated by a loop of variable length. These alpha-helices either homo- or hetero-dimerize with a second alpha-helix containing protein to facilitate transcription [9, 10].
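For illustration only, the E-box consensus can be written as a regular expression. The Python sketch below is not part of the analyses described here; the function name `find_eboxes` and the example fragment are hypothetical. Because the reverse complement of CANNTG is itself CANNTG, scanning a single strand is sufficient to locate sites on both.

```python
import re

# E-box consensus CANNTG, where N is any nucleotide. The lookahead lets the
# scan report overlapping matches as well as adjacent ones.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_eboxes(dna):
    """Return (1-based start, hexamer) for every E-box in a DNA string."""
    return [(m.start() + 1, m.group(1)) for m in EBOX.finditer(dna.upper())]


# Hypothetical promoter fragment carrying a CACGTG and a CAGCTG site.
example_hits = find_eboxes("ttCACGTGaaCAGCTGcc")
```

The hexamer returned for each hit can then be compared against the group-specific E-box variants (for example CACGTG versus CAGCTG).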

The bHLH domain was first elucidated in Animals, where bHLH proteins have been grouped into six major groups (A-F) based on evolutionary relatedness, DNA-binding motifs, and functional properties [6, 11]. Group A includes proteins such as MyoD, dHand, Twist, and E12. Group A sequences bind the E-box sequence CAGCTG or CACCTG and are identified by an Arginine (R) at position 8 in the basic region. Group B sequences are known to bind the E-box sequence CACGTG and contain a Histidine (H) or Lysine (K) at position 5 and an Arginine (R) at position 13 in the basic region. Members of Group B include Myc, Mad, Max, SREBP and Tfe. Many Group B proteins are known to contain an additional Leucine Zipper domain directly adjacent to the second helix. Group C members, such as Sim, Trh and Ahr, have a conserved downstream Per-Arnt-Sim (PAS) domain that facilitates dimerization to other PAS containing proteins and generally bind non-E-box sequences. Group D includes Id and Emc; however, these proteins lack a conserved basic region and act as transcription regulators through hetero-dimerization [12]. Group E proteins bind the target sequence CACGNG, contain a Proline (P) in the basic region at site 6, and include members such as E(spl), Gridlock, Hairy and Hey. Finally, Group F consists of COE-bHLH proteins, which have more divergent bHLH sequences than Groups A-E and contain an additional PAS domain [13].
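The residue signatures quoted above can be expressed as a simple lookup. This is a hedged sketch and not a validated classifier: the function name is hypothetical, the input is assumed to be an aligned, ungapped bHLH domain numbered from site 1, and only the three animal groups for which a diagnostic residue is stated (A, B, and E) are covered. A sequence may match none or several signatures, so every match is returned instead of a single call.

```python
def animal_group_signatures(domain):
    """List the animal groups (A, B, E) whose basic-region signature matches.

    `domain` is assumed to be an aligned bHLH domain whose first character
    is site 1 of the basic region (illustrative assumption).
    """
    if len(domain) < 13:
        raise ValueError("need at least the 13 basic-region sites")

    def site(n):
        # Convert the 1-based site number used in the text to a string index.
        return domain[n - 1]

    hits = []
    if site(8) == "R":                          # Group A: R at position 8
        hits.append("A")
    if site(5) in "HK" and site(13) == "R":     # Group B: H/K at 5, R at 13
        hits.append("B")
    if site(6) == "P":                          # Group E: P at site 6
        hits.append("E")
    return hits
```

Returning a list keeps ambiguous cases visible; resolving them would require the phylogenetic and DNA-binding evidence described in the text.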

Early studies of plant bHLH proteins primarily focused on Arabidopsis thaliana and Oryza sativa, which contain 167 and 177 bHLH sequences, respectively, compared to 39 and 125 in Caenorhabditis elegans and Homo sapiens, respectively [7, 14]. With the recent abundance of genome initiatives, current studies include a more diverse selection such as algae, bryophytes and other land plants. In contrast to animals, phylogenetic analyses of plant bHLH proteins classify them into 26-33 subgroups [7, 13, 15]. Characterized members within these groups influence many biological processes including light and hormone signaling [16, 17], wound and drought response [18], fruit and flower development [19, 20], and stomata and root development [21, 22].

Phylogenetic analyses suggest that plant sequences are most related to animal Group B [15, 23]. From the few fungi included in these studies, it has been noted that fungal sequences also appear to share most similarity to Group B [6, 11, 24].

Here, we conduct a comprehensive analysis of bHLH containing proteins from 55 completed fungal genomes encompassing Ascomycota and Basidiomycota organisms. Classification of these proteins is essential for understanding the evolutionary diversification of the bHLH domain and the biological roles it plays in fungal organisms. Using a variety of bioinformatic and phylogenetic tools, we were able to identify and characterize 12 conserved bHLH fungal groups and determine patterns of gain and loss of bHLH proteins from a taxonomic perspective. Several statistical tools were then applied to evaluate the fundamental differences in molecular architecture between the 12 fungal groups, including several classification models to accurately assign sequences to the groups. Some models not only distinguished groups but also provided a measure of the biological significance of discerning amino acid sites. These models were then tested against a larger set of known bHLH sequences, providing a measurement of the performance of each model. Finally, we show that, like plants, fungal bHLH proteins are most closely related to animal Group B, suggesting that animal Groups A, C-E were likely not present in the metazoan common ancestor. The models, sequence data, and source code obtained and built for these analyses were incorporated into a software package available at www.fungalgenomics.ncsu.edu.

MATERIAL AND METHODS

Whole genome Fungal bHLH Sequence Identification and Analysis

Fungal bHLH sequences were aligned against plant and animal bHLH amino acid sets available from previous work [25, 26]. Each fungal sequence was aligned to these expert sets using an iterative approach that retained the length and structure of the bHLH domain as follows [27, 28]. 1) A full length protein sequence was chosen from the set to be aligned. 2) BLAST [29] was used to identify up to 10 orthologs from the expert sets, choosing hits with the lowest e-value. 3) The query and orthologs were then globally aligned with MUSCLE 5.0 [30]. 4) The alignment was then evaluated for retention of the bHLH structure; i.e., there were no gaps inserted into either the query or orthologs within the basic, Helix 1 or Helix 2 sub-domains. 5) Depending on fulfillment of step 4, either the newly aligned bHLH motif contained in the query sequence was placed into the expertly aligned set or the query sequence was placed back into the unaligned set. Steps 1-5 were repeated until most query sequences were aligned. Those few sequences still not aligned were then manually edited to meet bHLH domain requirements. This resulted in a new sequence dataset of expertly aligned fungal bHLH domains.
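The acceptance test in step 4 can be sketched as follows. This is a minimal illustration rather than the code used in this study: it assumes each aligned string has already been trimmed to the 64 enumerated bHLH sites (basic 1-13, Helix 1 14-28, Loop 29-49, Helix 2 50-64, following the animal enumeration [11]), that "-" marks a gap, and the names `STRUCTURED_REGIONS` and `retains_bhlh_structure` are hypothetical.

```python
# Sub-domains in which no gap may be introduced (1-based, inclusive sites).
# The loop (sites 29-49) is deliberately absent because it varies in length.
STRUCTURED_REGIONS = {"basic": (1, 13), "helix1": (14, 28), "helix2": (50, 64)}


def retains_bhlh_structure(aligned_domains, regions=STRUCTURED_REGIONS, gap="-"):
    """Return True when no sequence has a gap inside a structured sub-domain.

    `aligned_domains` holds the query and its orthologs, each trimmed to
    the domain columns of the global alignment (illustrative assumption).
    """
    for seq in aligned_domains:
        for first, last in regions.values():
            if gap in seq[first - 1:last]:
                return False
    return True
```

A query whose alignment fails this check would be returned to the unaligned set and retried in a later iteration, as described in step 5.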

Consensus sequences were determined by using the 50-10 rule [7]. A given site of the bHLH domain was included in the consensus sequence if an amino acid at that site was present in over 50% of the sequences. For each site included in the consensus, an additional amino acid was added if it existed in at least 10% of all sequences.
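As a minimal sketch of the 50-10 rule (not the original implementation), the function below scans an alignment column by column. It assumes equal-length aligned strings, treats "-" as a gap that never enters the consensus, and computes frequencies over all sequences; the function name and thresholds as keyword arguments are illustrative choices.

```python
from collections import Counter


def consensus_50_10(aligned, major=0.5, minor=0.1, gap="-"):
    """Map each consensus site (1-based) to its residues, most frequent first.

    A site enters the consensus when one residue exceeds `major`; every
    residue reaching `minor` at that site is then reported alongside it.
    """
    n = len(aligned)
    consensus = {}
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned if seq[i] != gap)
        ranked = counts.most_common()
        if ranked and ranked[0][1] / n > major:
            consensus[i + 1] = [aa for aa, c in ranked if c / n >= minor]
    return consensus
```

For a column with R in 8 of 10 sequences and K in the other 2, the site is reported as R followed by K; a column with no residue above 50% is omitted.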

The Boltzmann-Shannon entropy value was calculated for each site in the sequence alignment for fungal sequences. To determine the normalized group entropy value: 1) amino acids were grouped based on molecular characteristics (acidic, basic, aromatic, aliphatic, aminic, hydroxylated, Cysteine, Proline) resulting in eight sets (DE, HKR, FWY, AGILMV, NQ, ST, C, P, respectively) [28, 31]; 2) the Boltzmann-Shannon entropy values, based on individual amino acids and the eight amino acid groups, were calculated at each site [24]; 3) the entropy values were normalized to range from 0 to 1, with respect to possible minimum and maximum values, respectively. Amino acid sites were then interpreted from conserved to variable based on entropy values closer to 0 or 1, respectively.
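The entropy calculation can be sketched as below. This is an illustration under stated assumptions, not the original code: entropy is taken as the Shannon form, normalization divides by the logarithm of the alphabet size (20 residues or 8 groups) so that values fall between 0 and 1, and gaps or non-standard symbols are ignored. The function names are hypothetical; the eight groups are those listed above.

```python
import math
from collections import Counter

# The eight physicochemical groups listed in the text.
GROUPS = ["DE", "HKR", "FWY", "AGILMV", "NQ", "ST", "C", "P"]
AA_TO_GROUP = {aa: group for group in GROUPS for aa in group}


def normalized_entropy(symbols, alphabet_size):
    """Shannon entropy of `symbols`, scaled to [0, 1] by log(alphabet_size)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(alphabet_size)


def site_entropies(column):
    """Return (entropy, group entropy) for one alignment column."""
    residues = [aa for aa in column if aa in AA_TO_GROUP]
    groups = [AA_TO_GROUP[aa] for aa in residues]
    return normalized_entropy(residues, 20), normalized_entropy(groups, len(GROUPS))
```

A column mixing V, I, L and M illustrates why both measures are reported: its residue entropy is well above zero while its group entropy is zero, because all four residues are aliphatic.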

Conserved motifs within bHLH containing proteins were identified using MEME 3.5.7 [32]. MEME parameters: minimum motif width, 8; maximum motif width, 100; maximum motifs to find, 50. Functionality of detected motifs was determined, where possible, by evaluating said motifs through MAST [33], NCBI’s Conserved Domain Database [34], Prosite [35], and InterPro [36].

Phylogenetic Analysis by Taxonomic grouping

Evolutionary relationships of the bHLH domain were determined in the same manner for several different fungal sequence datasets (All Fungi; Basidiomycota; Pezizomycotina; and Saccharomycotina). Each dataset’s phylogeny was determined with Maximum Likelihood (ML), Neighbor-Joining (NJ), and Maximum Parsimony (MP) analyses. Bayesian Analysis (BA) was conducted on the entire set of plant, animal and fungal aligned bHLH domain sequences.

ProtTest 1.4 [37] was used to determine the best fit amino acid substitution model and parameter values for each dataset. In each case, the Le and Gascuel [38] (LG) model with an estimated γ-distribution parameter (G) and the proportion of invariant sites (I) was the best fit according to the Akaike information criterion, with the JTT+I+G (Jones, Taylor, Thornton) model a close second.

PHYML 2.4.5 [39], with the LG+I+G model, was used to run the ML analysis. The proportion of invariant sites and the γ-parameter were set to the values obtained with ProtTest, and eight relative substitution rate categories were used to correct for the heterogeneity of amino acid substitution rates. The Subtree Pruning and Regrafting (SPR) method was used to search tree topology. Branch support for the resulting topology was determined by both the Shimodaira-Hasegawa-like approximate Likelihood Ratio Test (LRT) and a 1,000 replicate bootstrap analysis.

MEGA 4.0 [40] was used to run the NJ and MP analyses, including a 1,000 replicate bootstrap test to estimate topology support. The JTT+I+G model was used for the NJ analysis. The NJ running options used were: 1) Pairwise deletion for Gaps/Missing data to account for highly variable sites, specifically in the loop subdomain; 2) rates among sites set to Gamma distributed; 3) the Gamma Parameter set to the γ value determined by ProtTest. For the MP analysis the Gaps/Missing Data parameter was set to “Use All Sites” to account for variable amino acid sites.

BA was performed with MrBayes 3.1.2 [41] with the following parameters: two independent runs with four Markov chains each, 10 million generations, sampling every 1000th generation, invgamma model, and 8 categories. The standard deviation of split frequencies was below 0.01 at generation 10 million, at which point a consensus tree was constructed from 1800 trees (900 from each run) after first discarding 100,000 generations as burn-in.

Classification Models

Decision Trees [26, 42] were built using SAS software, Enterprise Miner 5.2. A Chi-square test with a significance level of 20% was used as the splitting criterion. The bifurcating tree was limited to a depth of 5 nodes, requiring a minimum of 10 observations for a split and at least 4 observations per leaf.

Following the data transformation process described in Atchley and Zhao 2007, amino acids for each sequence were transformed into a 1 × 5 vector of factor scores using the HDMD package [43]. Factor scores are quantitative values for amino acids based on amino acid properties. The five factor scores, which can be interpreted as independent physicochemical indices, were derived by Atchley et al. 2005 from 495 measurable amino acid properties. The factor scores (pah; pss; ms; cc; and ec) are associated with biological properties (polarity, accessibility, and hydrophobicity; propensity for secondary structure; molecular size or volume; codon composition; and electrostatic charge; respectively). Because the factor scores are independent, we created an additional dataset containing the combination of all five factor scores (all). This resulted in a total of six factor score transformed datasets (pah, pss, ms, cc, ec, and all) from the 488 grouped fungal sequences.

Two discriminant analyses [44], Canonical Variate Analysis (CVA) and StepWise Discriminant Analysis (SWDA), were used to build models on all six factor score datasets to evaluate molecular differences between the 12 fungal group sequences. These discriminant analyses were used to define the latent structure of among-group covariation and obtain a set of amino acid sites that best differentiate between the groups in the fungal group datasets.

The step-up SWDA procedure was used to rank amino acid sites based on their ability to discriminate defined groups (r²) [26]. In the step-up procedure, variables (amino acid sites) were added sequentially (step) based on the site’s discriminating power. Amino acid sites were added until the average squared canonical correlation (ASCC) reached a value of 70% for the pah, pss, ms, cc, and ec datasets and 80% for the all dataset. The ASCC describes the relative distinctiveness of the groups at a given step in the model, meaning a 100% ASCC would imply complete discrimination between the defined groups. SAS software, Version 9.2, was utilized in the SWDA. Those variables with r² >70% were considered the most discerning sites.

CVA assesses the discriminatory ability of all variables (factor score transformed amino acid sites) simultaneously to generate a linear model that differentiates between defined groups. The CVA includes the calculation of eigenvectors (canonical variates) from the among-group covariance matrix. CVA for the six factor score datasets resulted in 11 canonical variates for each analysis. The square root of the Mahalanobis pairwise distance was also calculated, providing a relative measure of the divergence between groups. CVA and plotting of canonical variates were conducted utilizing the statistical software package R [45]. Amino acid sites were considered discerning if they met the following criteria: 1) contained within canonical variates that explained >5% of the among-group covariation; 2) had absolute magnitudes >1 for the pah, pss, ms, cc, and ec analyses; 3) had absolute magnitude >8 for the all analysis.

Testing Methods

Fungal protein sequences annotated with a bHLH domain (707) were obtained from InterPro 31.0 [36]. A dataset was then constructed for the testing of classification models from the 198 fungal sequences not used in model construction (F.198). These sequences were assigned fungal groups by utilizing BLAST to find homologous sequences that had a priori defined groups. In the few instances where a sequence aligned to more than one fungal group, assignment was based on majority rule. The bHLH sequences in F.198 were evaluated as follows: 1) the bHLH domain was extracted from the full amino acid sequence; 2) the domain was transformed into factor scores; 3) the transformed domain was subjected to several classification methods as described under Classification Models. F.198 was used to test the performance of each classification method.

To determine the performance of the classification models, confusion matrices were generated by classifying sequences from the F.198 dataset. We then measured the sensitivity (ability to identify positive results) and specificity (ability to identify negative results) for each model (data not shown). These measures were calculated from the One versus All approach commonly used with multiclass classification models [46]. Finally, model performance was measured by determining the overall accuracy (ability to correctly identify results) and its assessment (coefficient of agreement; Equation 1) [47, 48]. Good accuracies have assessments >80%.
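The performance measures can be sketched from a confusion matrix as follows. This is an illustrative sketch, not the code used in this study. Equation 1 is not reproduced in this text, so the coefficient of agreement is assumed here to be Cohen's kappa, (p_o - p_e) / (1 - p_e); that choice, the function names, and the toy matrix in the usage note are assumptions.

```python
def one_vs_all(confusion, labels):
    """Per-group sensitivity and specificity from a square confusion matrix.

    confusion[i][j] counts sequences of true group i predicted as group j.
    Each group in turn is treated as "positive" and all others as "negative".
    Every group is assumed to have at least one positive and one negative.
    """
    total = sum(map(sum, confusion))
    stats = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(row[k] for row in confusion) - tp
        tn = total - tp - fn - fp
        stats[label] = {"sensitivity": tp / (tp + fn),
                        "specificity": tn / (tn + fp)}
    return stats


def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa (assumed coefficient of agreement)."""
    size = len(confusion)
    total = sum(map(sum, confusion))
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals for each group.
    expected = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
                   for i in range(size)) / total ** 2
    return observed, (observed - expected) / (1 - expected)
```

For a toy two-group matrix [[8, 2], [1, 9]], the first group has sensitivity 0.8 and specificity 0.9, and the overall accuracy of 0.85 corresponds to a kappa of 0.7, which would fall below the 80% threshold quoted above.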

RESULTS

Identification of Fungal bHLH sequences from whole genome projects

Previous bHLH analyses have focused primarily on animal or plant sequences, with token references to fungal organisms [15, 23, 24, 49]. This has provided insightful, but limited, information on the phylogenetic relationship of fungal bHLH sequences to those of plants and animals. However, it has provided no insight into the diversity of the bHLH domain within Fungi.

To obtain fungal specific gene sequences containing the bHLH domain, we utilized the protein sequence analysis and classification database InterPro. Using protein signatures built on known bHLH domains (IPR001092) and classified based on , we identified 707 fungal bHLH containing sequences. From this set, 198 sequences not belonging to whole genome projects or originating from projects with incomplete assemblies and gene calls were set aside. This resulted in 509 full amino acid sequences putatively containing the bHLH domain from 55 genome projects representing major evolutionary fungal lineages, encompassing the Ascomycota (49 members) and Basidiomycota (6 members) phyla. An iterative global alignment to a reference set of 147 plant bHLH [15] and 284 animal bHLH domains [26] resulted in the identification and alignment of all 509 fungal bHLH domains (Supplemental Data 1).

The location of each bHLH domain in each protein sequence predicted by the protein signature from InterPro directly corresponded to the location of the domain determined through our alignment method (Supplemental Data 1). Using this iterative global alignment approach, we were able to ensure direct comparison of homologous amino acids by enumerating the bHLH domain as described in previous work on Animals (Region: basic, first Helix, Loop, second Helix; Sites: 1-13, 14-28, 29-49, 50-64; respectively) [11].
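The enumeration can be captured as a small lookup table. The sketch below is illustrative only; the names `REGIONS` and `region_of` are hypothetical, and the example tallies the 12 highly discerning sites reported in this work (1, 4, 6, 7, 8, 12, 15, 16, 19, 20, 50, and 53) by region.

```python
from collections import Counter

# Region boundaries as enumerated in the text (1-based, inclusive).
REGIONS = (
    ("basic", 1, 13),
    ("Helix 1", 14, 28),
    ("Loop", 29, 49),
    ("Helix 2", 50, 64),
)


def region_of(site):
    """Return the bHLH region containing an enumerated site."""
    for name, first, last in REGIONS:
        if first <= site <= last:
            return name
    raise ValueError(f"site {site} is outside the 64-site enumeration")


DISCERNING_SITES = (1, 4, 6, 7, 8, 12, 15, 16, 19, 20, 50, 53)
sites_per_region = dict(Counter(region_of(s) for s in DISCERNING_SITES))
```

Under this enumeration, six of the discerning sites fall in the basic region, four in Helix 1, two in Helix 2, and none in the loop.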

We identified 34 perfect duplicate bHLH domains within eight fungal species (data not shown). Duplicate bHLH domains arose from a variety of genome sequencing artifacts (including inconsistent gene calls and strain specific sequencing differences) and were not likely due to recent sequence duplications. A representative was chosen from each set of duplicates, resulting in 19 sequences being removed from our analyses.
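A representative can be chosen from each duplicate set with a single pass over the data. The following is a hedged sketch, not the procedure actually used: it assumes duplicates are defined as identical domain strings within the same species, keeps the first record encountered as the representative, and uses hypothetical record tuples of (sequence identifier, species, domain).

```python
def drop_duplicate_domains(records):
    """Split records into (kept, removed), keeping one per (species, domain).

    Each record is a (sequence_id, species, domain) tuple; the first record
    seen for an identical species/domain pair is kept as the representative.
    """
    seen = set()
    kept, removed = [], []
    for record in records:
        key = (record[1], record[2])
        if key in seen:
            removed.append(record)
        else:
            seen.add(key)
            kept.append(record)
    return kept, removed
```

Identical domains from different species are retained, since those reflect conservation and not an assembly or annotation artifact.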

The remaining 490 bHLH fungal sequences from 422 Ascomycota and 59 Basidiomycota proteins are shown in Table 1 arranged by organism and taxonomy.

The number of bHLH proteins in the fungal genomes ranged from a maximum of 16 (Podospora anserina) to as few as 4 within the Taphrinomycotina Subphylum (Table 1). Members of the Saccharomycotina Subphylum typically contained 8 bHLH sequences; however, some contained 9 or 10 proteins, while Candida tropicalis, Eremothecium gossypii, Lodderomyces elongisporus and Scheffersomyces stipitis each contained only 7. The Sordariomycetes Class members contained between 10 and 16 bHLH proteins with a median of 12 while members of the Class ranged between 7 and 11. The Onygenales and Eurotiales Orders, within the Eurotiomycetes Class, typically contained 8 and 10 proteins each, respectively. The number of bHLH proteins in Basidiomycota ranged from 7 to 14. An insufficient number of sequenced taxa were available to identify clear patterns within the Basidiomycota Phylum. In summary, we observed distinct differences in the typical number of bHLH proteins within the Sordariomycetes and Saccharomycetes classes and the Onygenales and Eurotiales orders.

Positional Conservation and Consensus Motif

To determine the conservation of amino acid sites of the fungal bHLH domain we performed Boltzmann-Shannon Entropy and Group Entropy analyses [25, 28, 50], generated a bit score weblogo [51], and determined the consensus sequence motif (Figure 1) on the set of non-redundant, aligned sequences.

We evaluated the conformity of bHLH sequences to the entire fungal set by determining the number of mismatches between each sequence and the consensus sequence (Supplemental Data 1). In previous work, sequences were considered highly divergent and removed from subsequent analyses if they contained more than 8 to 10 mismatches to the consensus sequence [15, 23, 52]. We retained all 490 fungal bHLH sequences, as there were no sequences with more than 7 such mismatches.
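Counting mismatches against a consensus can be sketched as below. This is an illustration only: the function name is hypothetical, the consensus is assumed to be a mapping from enumerated site to the residues accepted at that site, and the example consensus is deliberately partial, using just the four basic-region residues reported in this chapter (R, H, E, and R at sites 2, 5, 9, and 13).

```python
def count_mismatches(domain, consensus):
    """Count consensus sites where `domain` carries a non-consensus residue.

    `consensus` maps a 1-based site to a string of accepted residues;
    sites absent from the mapping are not scored. A gap counts as a mismatch.
    """
    return sum(1 for site, allowed in consensus.items()
               if domain[site - 1] not in allowed)


# Partial consensus for illustration (basic region only).
BASIC_CONSENSUS = {2: "R", 5: "H", 9: "E", 13: "R"}
```

A divergence filter then reduces to comparing the returned count with a chosen threshold, such as the 8 to 10 mismatches used in previous work.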

We identified 17 conserved positions in Fungi based on amino acid frequency (Table 2). Six additional conserved sites were identified based on low group entropy (conserved amino acid properties). As shown in Figure 1, in the basic region (sites 1-13) of the fungal consensus motif, amino acid positions 2, 5, 9, and 13 had low entropies, high bit scores and were represented by amino acids R, H, E, and R, respectively, at a frequency of at least 50%. Sites 8, 10, 11 and 12 were considered moderately conserved, having group entropies between 0.276 and 0.308. Sites 16, 23, and 28 were highly conserved in the first Helix (sites 14-28), having I, L and P amino acids at frequencies of 59%, 88%, and 92%, respectively. Site 27 had high entropy but low group entropy, being highly conserved for aliphatic amino acids, with V, I, L and M at frequencies of 49%, 24%, 21%, and 3%, respectively. Additionally, moderately conserved Helix 1 sites 17, 20 and 26 had group entropies between 0.327 and 0.366. In Helix 2 (sites 50-64), highly conserved sites included 50, 53, 54, 60, 61, and 64. These sites had amino acids K, I, L, Y, I, and L at frequencies of 90%, 58%, 84%, 68%, 67%, and 85%, respectively. Site 57 contained A in over 50% of fungal sequences; however, it could only be considered moderately conserved, as it had entropy and group entropy values of 0.357 and 0.330, respectively.

A number of these sites are conserved in plant and animal bHLH domains [27]. At site 9, Glutamic acid (E) was present in over 90% of E-box binding animal proteins and has been shown to directly contact DNA [13, 28]. In a recent plant study, site 9 was represented by E in more than 74% of such sequences [13]. We found that in Fungi, >98% of bHLH sequences contained an E at site 9.

Site 28 is another highly conserved site; its conserved P breaks the first Helix and starts the loop region. This site, highly conserved in Plants and Animals, contained P in 92% of fungal sequences. Sites 23 and 64 contained L (helix stabilization) in over 80% of plant and animal sequences [13], and over 85% of fungal sequences. Aliphatic amino acids, essential for dimerization, were conserved within sites 54 and 61 at 98% and 89% in Fungi and over 98% and 93% in Animals and Plants. The presence of these highly conserved sites demonstrates that the fungal bHLH domain shares similar architecture to those identified in Plants and Animals.

Phylogenetic Analysis of Fungal sequences

To elucidate the evolutionary relationships between bHLH domains within and between fungal lineages, we determined the phylogeny of sequences in four datasets (Basidiomycota, Pezizomycotina: filamentous members of Ascomycota, Saccharomycotina: yeast-like members of Ascomycota, and all Fungi) using five phylogenetic analyses (Maximum Likelihood, Neighbor-joining, Maximum Parsimony, Maximum Likelihood Bootstrap, and Bayesian). Based on high support values, tree topology, branch lengths and majority support from each phylogeny, the 59 Basidiomycota, 286 Pezizomycotina, and 137 Saccharomycotina bHLH proteins were split into 11, 9, and 10 clades, respectively (Supplemental Figure S1A-C). Based on the same methods, the 490 fungal bHLH proteins were split into 12 major clades (fungal groups F1-F12) (Supplemental Figure S1D). Annotated sequences, where available, shared similar biological and molecular functions with their group members (Table 3).

Each group was further supported by conserved loop length [15], consistency of basic amino acids in the basic subdomain [11], and low divergence from the consensus sequence (Supplemental Data 1). Several groups had average loop lengths of >40 amino acids, uncommon in either plant or animal bHLH sequences (Supplemental Data 1). However, sequences with these extended loops typically were found in the same clade, such as F2. Conservation of such clades across Basidiomycota and Ascomycota fungi possibly arose from additional functionality provided by an elongated loop.

Each fungal group was composed of one or more clades from the Basidiomycota (B1-B12), Pezizomycotina (P1-P12) or Saccharomycotina (S1-S12) phylogenies, as denoted on the Maximum Likelihood (ML) tree in Figure 2. Groups B1-B12, P1-P12, and S1-S12 were enumerated to reflect their associated fungal group, e.g., B1 is a clade within F1. Based on the composition of each fungal group, many bHLH domain gains and losses have occurred since the Most Recent Common Ancestor (MRCA) of Basidiomycota, Pezizomycotina and Saccharomycotina organisms. The MRCA likely contained bHLH domains found in F2-F5 and F11, as Basidiomycota, Pezizomycotina, and Saccharomycotina organisms were all represented in these groups. Additionally, we observed expansion of F2 but not F3 in the Saccharomycotina subphylum (Table 1). Groups F8 and F10 were represented in Basidiomycota and Pezizomycotina fungi but have been lost from the Saccharomycotina branch since the MRCA. Similarly, we observed that Pezizomycotina fungi have lost bHLH representation in F9 since the MRCA. F6 was either gained by the MRCA of the Pezizomycotina and Saccharomycotina subphyla, or was present in the MRCA and lost by Basidiomycota. Finally, Saccharomycotina fungi have gained novel bHLH sequences present in F12, and Basidiomycota fungi in F1 and F7. Expansion and loss patterns were also observed at various taxonomic ranks within the Basidiomycota and Ascomycota phyla (Table 1). In F4, most fungi within the Ascomycota phylum experienced large expansions (2 - 8 copies), except for members of the Onygenales order (1 copy). P. anserina had the largest expansion in F4 sequences, accounting for half of its 16 bHLH sequences.

Several other taxonomic groups experienced expansion, such as Dothideomycetes members in F6 (2 copies), while the other Ascomycota retained only a single representative. Within Basidiomycota fungi, an expansion of F9 occurred in the Agaricomycotina as compared to the Ustilaginomycotina, in which only Ustilago maydis had a single F9 sequence. Thus, we observed many instances of expansion and loss among taxonomic ranks, except within F3, which has retained constant representation in all taxonomic groups (1 copy). In summary, the phylogenetic analysis shows that fungal bHLH proteins form 12 groups, each correlated with sequence characteristics, such as conserved loop length. Many of these groups remain distinct throughout fungal evolution despite the dramatic diversification of Fungi.

Expansion of F4 in Sordariomycetes

The most dramatic expansion observed was that of the Sordariomycetes within F4. Each member contained a minimum of 3 copies, with Sordaria macrospora, Neurospora crassa, and P. anserina having 6, 7 and 8 copies, respectively. To determine if the expansion was due to recent duplications within each organism or due to distinct bHLH sequences likely found in the Sordariomycete MRCA, we performed an additional phylogenetic analysis. Two NJ phylogenies were built, one from the bHLH domain and another from the entire bHLH containing protein sequence (Figure 3). We found that the 33 sequences formed 6 distinct subclades, each with bootstrap values between 52 and 100 in both trees. Subclades A-C were composed of one copy from each Sordariomycete organism. Also, subclades E-F each contained one protein from P. anserina, S. macrospora, and N. crassa. All members of subclade A were homologous to SRE1 & 2 (SREB) proteins. These findings support the MRCA containing an expansion of F4 rather than a large number of recent duplications in each of the Sordariomycete organisms. Additionally, these subclades generally support the published phylogeny of Sordariomycete organisms [53–55], with the notable exception of P. anserina, which shared 6 subclades with S. macrospora and N. crassa sequences, but only 3 with Chaetomium globosum sequences.

Conserved Motifs in Fungal bHLH proteins

To identify conserved motifs in fungal bHLH proteins, we used MEME [32] to search for 50 frequently occurring motifs in 490 sequences and correlated the results with Basidiomycota, Pezizomycotina and Saccharomycotina groups (Figure 2). Motifs ranged in length from 11-86 amino acids, were significant with e-values from 2.3e-6014 to 2.7e-146 and were non-overlapping. The results provided additional support for Basidiomycota, Pezizomycotina and Saccharomycotina group designations as the protein architecture (occurrence and location of motifs) was highly conserved within each fungal group.

The first and second most abundant motifs (Motifs 1 and 2) corresponded to components of the bHLH domain: the basic and Helix 1 regions, and the Helix 2 region, respectively. Both motifs were present in all sequences, with only a few exceptions. Pezizomycotina clade P2 was the only group to contain motifs that matched to the highly variable loop region, where the average loop length was ~63 amino acids. We also noted that the loop between motifs 1 and 2 was exceptionally long in the Basidiomycota clade B2, with an average length of ~70 amino acids. However, the B2 clade contained no identified motifs in the loop region. Therefore, the conserved domain within P2 loops may be an artifact of sampling of Pezizomycotina organisms or of the conserved nature of the full bHLH containing proteins within P2. Thus, the loop remains a highly variable subdomain with undetermined function within Fungi.

Several motifs were found to be linked to functional properties besides the bHLH domain. For instance, motifs 3, 4, 7, 8, 12, 17, 26, and 35 were found in the C-terminal region of many P4 proteins, such as those in subclade A (Figure 2, Figure 3). These motifs were found to be part of ER membrane-bound transcription factors (SREB). Within the fungal group F3, motif 6 was found to be related to functional components of the centromere-binding protein (CPB-1). Despite being found in many fungal bHLH sequences across several Basidiomycota, Pezizomycotina and Saccharomycotina clades, the biological role of the highly repetitive motif 13 (Q-P-Q{22}) has yet to be defined.

The bHLH-ZIP domain consists of a conserved heptad Leucine repeat (Leucine Zipper) adjacent to the bHLH domain. The bHLH-ZIP has been found in both plant and animal sequences; however, these are extremely divergent between Kingdoms, with previous work supporting convergence [11, 13, 56]. We found evidence of Leucine Zippers in fungal groups F2 and F4, and in Basidiomycota, Pezizomycotina, and Saccharomycotina clades B5, B7, P5, P10 and S11 (Supplemental Figure S2A). Motif 20, found extensively in F4, was composed of conserved Leucines at positions 7, 14, and 21 downstream of the bHLH domain (Figure 2), indicative of the bHLH-ZIP domain. Motif 20 was the only motif with a known molecular function besides motifs 1 and 2 (bHLH). While many motifs were linked to specific groups of bHLH-containing transcription factors, the role of these proteins and consequently the function of the majority of the motifs remain to be determined.
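The heptad pattern described for Motif 20 can be checked mechanically. The following is a minimal sketch, not the procedure used in this study: it assumes a hypothetical input string, `downstream`, holding the residues that directly follow the bHLH domain in one-letter code, and it only tests for Leucine at offsets 7, 14, and 21.

```python
def has_leucine_heptad(downstream, offsets=(7, 14, 21)):
    """Return True if a Leucine (L) occupies every 1-based offset.

    `downstream` is a hypothetical input: the residues immediately
    C-terminal to the bHLH domain, in one-letter code.
    """
    return all(
        len(downstream) >= k and downstream[k - 1] == "L" for k in offsets
    )


# Toy sequences written for illustration only (not real proteins).
zipper_like = "AAAAAAL" * 3
no_zipper = "AAAAAAL" + "AAAAAAK" + "AAAAAAL"
print(has_leucine_heptad(zipper_like), has_leucine_heptad(no_zipper))
```

A real screen would also need to tolerate conservative substitutions and confirm the helical context, which this sketch deliberately ignores.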

We observed that the spatial orientation of the bHLH domain with respect to the protein sequence (NH2-terminus, middle, or COOH-terminus) was conserved within many of the Basidiomycota, Pezizomycotina, and Saccharomycotina groups (Figure 2). The approximate location of the bHLH domain within members of the fungal groups F3, F5, F8, and F10 was consistent within those groups. In addition, motifs 6, 9, 19, 20, 29 and 32 showed low spatial variation with respect to the bHLH domain. Conservation of spatial location within groups is likely indicative of a functional link between the motif, the bHLH domain, and the protein function.

Sequence classification using Decision Trees

To identify key sites that distinguish fungal group sequences, we performed a Decision Tree analysis using the state of amino acid sites in the basic, Helix 1 and Helix 2 regions (Figure 4). Before starting the decision tree analysis, we created a new dataset from the set of 490 fungal sequences by removing two sequences that were not placed into groups F1-F12. Starting with the entire dataset of 488 sequences, each step bifurcates the data based on the amino acids at a given site. Steps are added until there are too few sequences to split, the tree reaches a user-set maximum depth, or the data subset converges on a group. Discerning sites 1, 4, 7, 8, 11, 12, 15, 19, 20 and 50 (Table 2) placed fungal sequences into their a priori defined groups with an accuracy rate over 98% for each group, with the exception of F9, which was 88%. Overall, the accuracy of the decision tree was 95.5% (Table 4).

All groups were accurately separated within 5 steps. For instance, all group F4 sequences were deduced in two steps: first, they contained an S or A at site 8 (step 2); and second, they had a Y at site 12 (step 5). The amino acid composition at discriminating sites used in the decision tree was readily visualized in the fungal group weblogos (Supplemental Figure S2B).
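To make the bifurcation step concrete, the sketch below picks the single (site, residue) test that best separates a set of labelled sequences, scored by weighted Gini impurity. It is an illustration under stated assumptions rather than the software used here: the splitting criterion, the restriction to single-residue tests (the tree described above also splits on residue sets such as S or A), and the toy records are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of group labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(records, sites):
    """Find the (site, residue) test with the lowest weighted impurity.

    records: list of (site -> residue dict, group label) pairs.
    Returns (score, site, residue), or None if no test splits the data.
    """
    best = None
    for site in sites:
        for aa in sorted({seq[site] for seq, _ in records}):
            left = [g for seq, g in records if seq[site] == aa]
            right = [g for seq, g in records if seq[site] != aa]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[0]:
                best = (score, site, aa)
    return best


# Toy records (illustrative only): residues at sites 8 and 12 plus a group label.
toy = [
    ({8: "S", 12: "Y"}, "F4"),
    ({8: "A", 12: "Y"}, "F4"),
    ({8: "S", 12: "F"}, "F6"),
    ({8: "I", 12: "F"}, "F1"),
]
print(best_split(toy, sites=[8, 12]))
```

Applying `best_split` recursively to each resulting subset, and stopping on the three conditions listed above, would yield a full tree.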

Sequence classification using Discriminant analysis

To evaluate and compare the discriminating power each site had in separating groups F1-F12, we performed a stepwise discriminant analysis. Amino acid data for each site were transformed into numerical values using five numerical indices (Factor Scores) based on measured physicochemical amino acid properties, as described in previous work [26]. Factor scores 1-5 have been linked to biological properties: polarity, accessibility and hydrophobicity (pah); propensity for secondary structure (pss); molecular size (ms); codon composition (cc); and electrostatic charge (ec), respectively. The transformed dataset resulted in five numeric values for each amino acid at every position in each bHLH sequence (all).
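The transformation can be pictured as a table lookup. The sketch below is a hedged illustration only: the function name `encode` and the `toy_scores` table are hypothetical, and the numbers in `toy_scores` are placeholders chosen for readability, not the published factor scores from [26].

```python
FACTORS = ("pah", "pss", "ms", "cc", "ec")


def encode(sequence, scores):
    """Turn a residue string into one numeric vector per factor plus 'all'.

    scores: dict mapping a one-letter residue to its five factor values,
    ordered as in FACTORS.
    """
    encoded = {name: [] for name in FACTORS}
    for aa in sequence:
        for name, value in zip(FACTORS, scores[aa]):
            encoded[name].append(value)
    # 'all' keeps five values per site, concatenated in site order.
    encoded["all"] = [value for aa in sequence for value in scores[aa]]
    return encoded


# Placeholder values for illustration only; NOT the real factor scores.
toy_scores = {
    "R": (1.0, 0.0, 0.0, 0.0, 0.0),
    "E": (0.0, 1.0, 0.0, 0.0, 0.0),
}
print(encode("RE", toy_scores)["all"])
```

Each single-factor vector has one value per site, while the `all` vector has five values per site, matching the six datasets analysed below.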

To identify the discerning sites between fungal groups F1-F12, StepWise Discriminant Analysis (SWDA) and Canonical Variate Analysis (CVA) were performed on the factor score transformed fungal data [26], denoted as SWDA{factor score} and CVA{factor score}, respectively. SWDA{pah}, SWDA{pss}, SWDA{ms} and SWDA{cc} each required more than 30 amino acid sites to explain >70% of the among-group variance, whereas SWDA{ec} required only 20 sites (Supplemental Table S1). SWDA using the all dataset, where each amino acid site was represented by five values, obtained an ASCC of 80% using 16 sites in 28 steps. These results showed that SWDA alone requires numerous amino acid sites to completely distinguish between the different fungal groups. However, SWDA did reveal a few highly discerning sites, such as 6, 8, 12, 50 and 51.

The first Canonical Variate (CV) of the CVA explained the vast majority of the variance in each of the six analyses, i.e. it revealed highly discerning sites (Supplemental Table S2). For example, the pah, pss, ms, and cc CVA separated F4 from the other groups in the first CV. Additionally, the first CV in ec and all separated out group F11.

Plotting the first and second CV of the CVA{all} revealed clear separation between all 12 fungal groups (Figure 5). The first two CVs in the other five CVAs did not fully separate the groups. However, they each explained more than 65% of the variance in their respective analyses. In addition, while each CV contains all amino acid sites, only a few sites (2 - 11) contributed to the CVs’ discerning power (Supplemental Table S2). Thus, overall, only a small number of amino acids were required to discriminate between fungal groups using CVA.

Both SWDA and CVA had highly supported (>99% Coefficient of Agreement) and nearly perfect (>99.9%) accuracies (Table 4). CVA and SWDA both determined sites 6, 8, 10, 11, 12, 15, 50, 51, and 54 (Table 2) to be discerning. Site 12 appeared most often in these analyses and was a discerning site for each of the five factor score datasets in both SWDA and CVA. The other eight sites were found across the six different CVA and SWDA analyses. In summary, using two independent statistical methods, we found 9 sites common to both sets of analyses that were central in distinguishing between fungal groups F1-F12.

Simplified Model for F1-F12

To identify the inherent characteristics that effectively separated F1-F12, we utilized the consensus sequence, the decision tree and the discriminant analyses to manually build a simplified model that characterizes each set of fungal group sequences. As shown in Table 5, each fungal group was characterized by a model that used 4 or fewer amino acid sites, and groups F4 and F11 were discerned by a single amino acid site (12 and 50, respectively). Site 8 was the most frequently used discerning amino acid site in the model, where amino acids S or A were characteristic of groups F6-F10 and I or V of groups F1 and F3.

To assess the effectiveness of the simplified model, we tested it against the sequences from completed fungal genomes that were a priori assigned to a fungal group by our phylogenetic analyses (488 sequences). The simplified model was extremely accurate, with a score of 99% and a coefficient of agreement of 98.8% (Table 4). Only 8 of the 488 sequences were left unclassified. Thus, the performance of the simplified model in differentiating fungal groups was very similar to that of the fungal CVA, SWDA, and decision tree analyses.
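A rule-based model of this kind is straightforward to express in code. The sketch below covers only the two single-site rules stated in the text (Y at site 12 for F4, E at site 50 for F11); the remaining rules live in Table 5 and are not reproduced here. The function name, the rule order, and the assumption that the input is an aligned bHLH domain string in which character i corresponds to site i are all hypothetical.

```python
def classify_simplified(domain):
    """Apply the two single-site rules; return 'F4', 'F11', or None.

    domain: aligned bHLH domain in one-letter code, where the character at
    index i - 1 is assumed to correspond to site i.
    """
    def site(i):
        return domain[i - 1] if len(domain) >= i else None

    if site(12) == "Y":
        return "F4"
    if site(50) == "E":
        return "F11"
    return None  # left unclassified by this partial rule set


# Toy domains built for illustration only (not real sequences).
f4_like = "A" * 11 + "Y" + "A" * 42
f11_like = "A" * 49 + "E" + "A" * 4
print(classify_simplified(f4_like), classify_simplified(f11_like))
```

Returning `None` mirrors the unclassified outcome reported above, so that unclassified sequences can be counted separately from misclassified ones.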

Classification Model Testing

To test the effectiveness of the different classification models in discerning fungal groups, we determined the sensitivity, specificity and accuracy for a set of 198 bHLH domains from fungal sequences not used to build the classification models (F.198) (Table 4). As shown, all the classification methods had accuracies >92.9% with high coefficients of agreement (>94.7%). While CVA{all} and SWDA{all} had nearly identical accuracies, SWDA{all} performed better, as it was able to classify 98.4% of the sequences, whereas CVA{all} was only able to classify 90.4% of the F.198 sequence set. The simplified model performed well in both accuracy (96.4%) and sequences classified (195 out of 198). However, while the simplified model performed very well, each model tested was extremely accurate for fungal bHLH sequence classification.
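For readers who want to reproduce this kind of evaluation, the sketch below computes one-vs-rest sensitivity, specificity and accuracy for a single group from paired lists of true and predicted labels. It is a generic illustration, not the evaluation code behind Table 4: the function name and toy labels are hypothetical, and treating an unclassified sequence (`None`) as a miss is an assumption.

```python
def group_metrics(truth, predicted, group):
    """Return (sensitivity, specificity, accuracy) for one group.

    A prediction of None (unclassified) is counted as 'not this group',
    which is an assumption of this sketch.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t == group and p == group:
            tp += 1
        elif t == group:
            fn += 1
        elif p == group:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / len(truth) if truth else float("nan")
    return sensitivity, specificity, accuracy


# Toy labels for illustration only.
truth = ["F4", "F4", "F11", "F1", "F1"]
predicted = ["F4", None, "F11", "F1", "F1"]
print(group_metrics(truth, predicted, "F4"))
```

Reporting the fraction of sequences classified alongside these metrics, as done above, keeps a model from looking accurate simply because it declines to classify difficult sequences.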

Comparison of Positional Amino Acid Conservation

To determine the relationship between fungal and animal sequences, we characterized the conservation of amino acids at specific bHLH positions and compared those patterns to each animal binding group. Sites 5, 8, and 13 were used by Atchley and Fitch (2003) to classify animal bHLH proteins into either Group A or B. Group A contains an R at site 8, while Group B contains amino acids H or K at site 5 and R at site 13. All the fungal sequences fit best into Group B, where 94% had an H (0% had a K) at site 5 and 99% had an R at site 13. No fungal bHLH sequences fit the Group A pattern, as none had an R at position 8. Animal Group E proteins follow the 5-8-13 Group B rule with the addition of P at site 6. Fungal sequences did not follow this pattern, as there were no sequences with P at site 6. Group C bHLH proteins contain an additional PAS domain, which is not typically found within fungal bHLH proteins. In our dataset, the PAS domain was found in only a single protein from S. macrospora. Group F proteins contain the COE domain, which is not found in Fungi. Last, Group D proteins do not bind DNA; however, given the conservation of E at site 9, the vast majority of our sequences are E-box binders. These results support previous studies indicating that fungal bHLH domains are most closely related to animal Group B.
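The site-based rules summarized above can be written as a short function. This is a sketch of the rules exactly as stated in this section (R at site 8 for Group A; H or K at site 5 with R at site 13 for Group B; the Group B pattern plus P at site 6 for Group E), not a full implementation of the published animal classification. The function name, the order in which rules are tested, and the assumption that character i of the input corresponds to site i are hypothetical.

```python
def animal_binding_group(domain):
    """Classify an aligned bHLH domain as Group 'A', 'B', 'E', or None.

    domain: one-letter residues where index i - 1 is assumed to be site i.
    Groups C, D and F depend on extra domains or on DNA binding and are
    outside the scope of this sketch.
    """
    if len(domain) < 13:
        return None
    s5, s6, s8, s13 = (domain[i - 1] for i in (5, 6, 8, 13))
    if s5 in ("H", "K") and s13 == "R":
        return "E" if s6 == "P" else "B"
    if s8 == "R":
        return "A"
    return None


# Toy 13-residue domains built for illustration only.
group_b_like = "AAAA" + "H" + "A" * 7 + "R"
group_e_like = "AAAA" + "H" + "P" + "A" * 6 + "R"
group_a_like = "A" * 7 + "R" + "A" * 5
print(
    animal_binding_group(group_b_like),
    animal_binding_group(group_e_like),
    animal_binding_group(group_a_like),
)
```

Testing the Group B pattern before the Group A pattern is a design choice of the sketch; the text does not say how a sequence matching both would be resolved.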

Phylogenetic relationship of Fungal bHLH to Animal and Plant bHLH domains

To further determine the relationship between plant and animal families and our fungal sequences, we built a BA phylogeny based on all sequences from the three Kingdoms. The analysis was based on 916 total sequences, including 147 from Plants, 490 from Fungi, and 279 from Animals (six fungal sequences removed). The majority of the previously defined plant, animal, and fungal bHLH groups were identified in corresponding phylogenetic clades with high posterior probabilities (Supplemental Figure S1E). Fungal sequences predominantly clustered with animal Group B; however, animal Group B was not conserved as a single clade. This resulted in many previously unidentified evolutionary relationships between Group B and fungal groups F1-F12. For instance, fungal group F2 and four Group B sequences, including TFE3, TFEB, TFEC, and MITF (MiT/TFE family) from H. sapiens, were located in a strongly supported clade.

Group F4 was closely related to four animal sequences belonging to the sterol regulatory element-binding family (SREB). These four Group B sequences from Mus musculus, Sus scrofa, and Drosophila melanogaster have biological roles similar to the SRE1 and SRE2 proteins found in fungal group F4. Though interesting, additional comparisons between fungal groups and animal Group B were not supported by high posterior probabilities.

Cross Kingdom Classification

To gain deeper insight into the evolutionary relationship between fungal and animal sequences, we classified 707 fungal bHLH sequences available from InterPro using the animal group classification models described in Atchley and Zhao 2007 (Table 6). Every animal model, except the classical animal binding-group model, classified >72% of fungal sequences as animal Group B, with CVA{all} classifying over 88% of fungal sequences as Group B. In most instances, the remaining fungal sequences matched animal Group E. Thus, we found that fungal bHLH sequences were predominantly classified as members of animal Group B.

Finally, to determine which groups were most closely related, we calculated the pairwise distances between all fungal and animal groups by building a CVA{all} classification model on the combined F1-F12 and animal Group A-E datasets. Of the 16 canonical variates (not shown), the first 7 explained 94% of the among-groups variation. Animal Groups A, C, D, and E could be separated from each other and all fungal groups within the first four CVs. Additionally, fungal groups F3, F4, F6, and F8-F12 were all distinguishable within the first seven CVs. Group B could not be discerned from the remaining fungal groups until after the 7th CV. The Mahalanobis distance between animal and fungal groups (Table 7) supported the close relationship of Group B to fungal groups, as Group B had the lowest relative distance from each fungal group, averaging 37.6, compared to 121.9 for animal Group D. The average relative distances of fungal groups to animal groups were much more consistent, with values between 61.5 and 74.1, except for F11 with a distance of 129.1. Within this analysis, we also observed that animal Groups B and E were more closely related to each other, with a Mahalanobis distance of 29.1, than other animal pairs, whose pairwise distances ranged from 56.7 to 120.4. Thus, we determined that animal Group B was more similar to fungal groups than any other animal group, and there was no particular fungal group to which animal groups were more closely associated.
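As a reminder of what a Mahalanobis distance between two groups measures, the sketch below computes it for two-dimensional data from the group centroids and a pooled within-group covariance matrix. It is a generic textbook illustration under stated assumptions, not the computation behind Table 7: the data are toy points, the restriction to two dimensions is for brevity, and whether the reported values are distances or squared distances is not specified in the text.

```python
import math


def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def pooled_covariance(groups):
    """2x2 pooled within-group covariance across several groups."""
    dof = sum(len(g) for g in groups) - len(groups)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m = centroid(g)
        for p in g:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return [[v / dof for v in row] for row in s]


def mahalanobis(a, b, cov):
    """Mahalanobis distance between 2-D points a and b under cov."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [
        [cov[1][1] / det, -cov[0][1] / det],
        [-cov[1][0] / det, cov[0][0] / det],
    ]
    d = [a[0] - b[0], a[1] - b[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


# Toy groups for illustration only (e.g. scores on two canonical variates).
g1 = [(0, 0), (2, 0), (0, 2), (2, 2)]
g2 = [(4, 0), (6, 0), (4, 2), (6, 2)]
cov = pooled_covariance([g1, g2])
print(mahalanobis(centroid(g1), centroid(g2), cov))
```

Because the distance is scaled by within-group variation, two groups with the same centroid separation appear closer when their members are more dispersed.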

DISCUSSION

Based on analysis of whole genome projects of fungi, we identified between 4 and 16 bHLH sequences per genome. Overall, the copy count of bHLH proteins is fairly invariant in the Fungal kingdom, with the majority of Fungi containing 9 bHLH proteins. The bHLH copy count was more consistent within taxonomic groups such as the Onygenales and Eurotiales orders and the Saccharomycetes and Sordariomycetes classes. Thus, the number of bHLH proteins within specific fungal lineages is, in general, strictly conserved. The occurrence of bHLH proteins in Plants and Animals differs dramatically from that in Fungi: Plants contain copy counts of >160, and Animals have a wider range (50-200 proteins per organism). The lower bHLH copy count, as compared to Animals and Plants, is consistent with but not proportionate to lower gene counts in Fungi.

In Ascomycota and Basidiomycota fungi we identified 12 distinct phylogenetic groups of bHLH domains. Fungal groups F2-F5 and F11 contained representatives with ties to essential biological functions, such as chromosome segregation, interorganelle communication, sexual development, and phospholipid synthesis (Table 3). These five groups are found in all fungi examined and were likely present in the MRCA of Ascomycota and Basidiomycota.

It is unclear whether the 12 fungal groups are linked to specific binding motifs, as observed in Animals. Within Animals, the six bHLH groups are linked to specific binding motifs, with the exception of animal Group D, which does not bind DNA. On the other hand, plant bHLH groups are not currently tied to specific binding motifs. While many of the discerning sites between fungal groups are located in the basic region, this is not always the case. Determination of binding properties of the fungal groups will require additional experimentation.

We observed expansions and losses in all fungal groups except F3, which had a single representative in every fungal organism. F4 had the largest number of expansions, with at least one set of expansions (subclade A) linked to SREB proteins (Table 1, Figure 3). Most fungal organisms were represented at least once in this SREB subclade. This was exemplified by members of the Onygenales, each of which had only a single F4 sequence that was a member of subclade A (data not shown). When evaluating expansions within F4 of Sordariomycete organisms, the results favored ancient divergence rather than recent duplication events. N. crassa and other Sordariomycetes exhibit Repeat-Induced Point mutation (RIP), which inactivates duplicated genes [57–60]. The presence of these expansions in F4 suggests that these duplications either predate or were protected from RIP. Additionally, F4 was the only group to have the Leucine Zipper domain found across both Basidiomycota and Ascomycota members (Figure 2).

Saccharomycotina organisms appear to have lost the F8 bHLH domain, which has been tied to sexual differentiation and conjugation in Taphrinomycotina organisms [61]. Additionally, most Saccharomycotina organisms lack the F10 domain, known to be associated with alkane response [62]. Likewise, Basidiomycota fungi either lost or never gained the F6 group, which has been linked to phosphate starvation response and chromatin remodeling in Saccharomyces cerevisiae [63, 64] as well as copper ion response and sexual/asexual development in N. crassa [65]. Basidiomycota organisms, however, do not lack these biological functions [66–68], possibly utilizing transcription factors with degenerate or missing bHLH domains. As shown in Table 3, to date very little is known of the function of bHLH proteins in Fungi. However, we were able to identify that group F3 is associated with chromosomal segregation and several essential biological processes within several Pezizomycotina, Saccharomycotina and Basidiomycota organisms. Also, we found in Aspergillus fumigatus that the group F5 protein Q6MYV5 is essential in nitrate assimilation and quinate utilization. Thus, bHLH proteins belonging to specific Phyla, Subphyla, and Orders were associated with particular biological functions and conserved motifs. Additionally, many of these associations correlated with bHLH gains and losses within fungal groups.

S. cerevisiae bHLH heterodimers YAS1/YAS2 and INO2/INO4 are found in groups F10-F12, with both INO4 and YAS1 in F11. Interestingly, group F10 contains only two Saccharomycotina sequences, while group F12 contains them exclusively. From the phylogenetic analysis we know that these groups are more closely related to each other than to the other groups (Figure 2). We also know that the relative distance between F10 and F12 is much smaller than the distance from either one to F11 (Figure 5). Given these lines of evidence, it is reasonable to view F10 and F12 as a larger group that is closely related to F11. Thus, the F10/F12 and F11 clades portray the relationship of heterodimers as two distinct yet functionally tied groups. This relationship provides additional insight into potential heterodimers in other Fungi with F10/F12 and F11 bHLH domains.

We built several different models to classify bHLH domains into different groups and determined that fungal group origin could be deduced using only a handful of amino acids. This is very similar to the classical animal binding group model, in that only a few amino acid sites are needed to discern between groups [26]. Our fungal simplified model required only 12 amino acid positions to accurately distinguish F1-F12 sequences. In the model, groups F4 and F11 were so distinct that they were identified by a single site. The simple model (Table 5) was nearly as accurate as the discriminant analyses (Table 4) in testing and was very useful for rapid assessment of fungal bHLH proteins. For example, if a bHLH-containing protein of interest contained a Y at site 12, the simplified model identified it as an F4 sequence. Thus, in many instances the sequence would be similar to SRE1 & 2 and likely contain a SREB domain.

Many of the most discriminating sites between fungal groups are tied to the fundamental molecular architecture of the bHLH domain, as described primarily by crystal structure studies on animal proteins Max [27, 69], E47 [70], USF [71], MyoD [9], PHO4 [10], and SREBP [72]. For example, site 12 is a highly discerning site useful in identifying group F4 sequences. Site 12 was identified as highly discerning in the decision tree analysis and in each of the SWDA and CVA analyses. It was also used in the simplified model and found to be moderately conserved in the consensus sequence analysis. This site is conserved in animals and has been found to bind the phosphate backbone and/or the DNA within the E-box [73].

Site 50 is another site that is conserved in both Fungi and Animals. It has been determined to pack against buried site 20 and to contact the DNA and/or phosphate backbone in Max, MyoD, PHO4, and USF. In our analyses, it was determined to be a discerning site within the decision tree analysis, significant in many of the SWDA and CVA analyses, and used in the simplified model. From these analyses, we were able to determine that group F11 sequences were uniquely identified by having an E at position 50.

Site 8, known to contact the phosphate backbone and/or DNA of the E-box in MyoD and E47, was a moderately conserved site in fungal sequences. It was the first discerning site in the decision tree analysis and was found to be a highly discerning site by both SWDA{all} and CVA{all}. Site 8 was also utilized in the conventional classification of animal binding groups [11], in which amino acids R or K were characteristic of animal Group A. However, the fact that it was a discerning site in both models is where the similarity to the animal model ends, as neither R nor K was found at site 8 in any fungal sequence.

Use of classification models can find weak linkages not found using conventional approaches. For example, all members of the Pezizomycotina contained a single copy in F10, except Magnaporthe oryzae. Absence of the F10 bHLH domain was assessed using two methods. Performing a BLAST [74] search with several F10 representative domains and a large e-value (10.0) returned only sequences assigned to other fungal groups. A Hidden Markov Model [75] was also constructed from F10 sequences and used to scan the entire M. oryzae genome, with results similar to the BLAST analysis (data not shown).

As shown in Table 1, M. oryzae protein MGG_01090 contained an unclassified bHLH domain. However, when we applied the classification models discussed here, we found that four of the nine models classified MGG_01090 as F10 (CVA{pss}, CVA{ec}, SWDA{all}, and the simplified model). The results were not unanimous, as the models CVA{ms} and CVA{all} both classified the protein into the closely related F9 group. The Decision Tree classified MGG_01090 as F2, which deviated from all of the statistical models. Thus, the developed classification models may be of considerable utility in identifying potential group origins of phylogenetic outliers.

Previous work has hinted at a link between the fungal bHLH domain and animal Group B [24, 60, 76]. In our analyses we provided multiple lines of evidence that fungal bHLH domains are closely related to Group B. First, in the consensus sequence, it was shown that fungal sequences follow the BxR rule for bHLH positions 5-8-13, characteristic of Group B [11]. Second, the most highly supported clades joining Fungi and Animals involved only Group B sequences, specifically linking F2 and F4 to Group B proteins. Third, in our cross-Kingdom classification analysis, we determined that fungal sequences were predominantly classified into animal Group B. Last, the Mahalanobis distance between Group B and groups F1-F12 was much shorter than that for any other animal group. Thus, in a comprehensive analysis of fungal bHLH domains, there is clear evidence that fungal sequences are directly related to animal Group B sequences.

We did note that some fungal sequences were classified as animal Group E in the cross-Kingdom analysis. The binding domains for Groups B and E are very similar. Furthermore, we show that Groups B and E are closely related to each other, as evidenced by the Mahalanobis distance between these two groups. However, no fungal sequences contained a P at position 6, required by the classical animal binding group model for animal Group E [6, 26]. Recent studies of Class VI proteins in C. elegans suggest that P is not absolutely required at site 6 to be a member of Group E [77–79]. If P is not required, our findings would support that metazoan bHLH sequences may not be uniquely derived from Group B.

In summary, we have determined the conserved sites for the fungal bHLH domain using entropy and consensus sequences. We have also identified 12 major fungal bHLH groups through phylogenetic analysis and tied these groups to conserved domains and biological functions. Using statistical classification models, we have shown that fungal group origin (F1-F12) can be determined with a high degree of accuracy, utilizing only a handful of highly conserved sites that are directly correlated to molecular functions. We have demonstrated the utility of these classification models by identifying the group origin of degenerate sequences. Finally, we have made these models, source code, and experimental data publicly available at www.fungalgenomics.ncsu.edu.

SUPPLEMENTARY MATERIAL

Supplementary Data 1, Tables S1 and S2, Figures S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

ACKNOWLEDGEMENTS

We would like to thank Lisa McFerrin and members of the Center for Integrated Fungal Research for their critical comments and discussion. This work was supported by a grant to the Bioinformatics Research Center of North Carolina State University from the National Institutes of Health and a grant to RAD from the National Science Foundation (MCB-0731808).

REFERENCES

1. Robinson KA, Lopes JM: SURVEY AND SUMMARY: Saccharomyces cerevisiae basic helix-loop-helix proteins regulate diverse biological processes. Nucleic Acids Res 2000, 28:1499-1505.

2. Jones S: An overview of the basic helix-loop-helix proteins. Genome Biol 2004, 5:226.

3. Castillon A, Shen H, Huq E: Phytochrome Interacting Factors: central players in phytochrome-mediated light signaling networks. Trends Plant Sci 2007, 12:514-521.

4. Murre C, McCaw PS, Baltimore D: A new DNA binding and dimerization motif in immunoglobulin enhancer binding, daughterless, MyoD, and myc proteins. Cell 1989, 56:777-783.

5. Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, Creelman R, Pilgrim M, Broun P, Zhang JZ, Ghandehari D, Sherman BK, Yu G: Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 2000, 290:2105-2110.

6. Ledent V, Vervoort M: The basic helix-loop-helix protein family: comparative genomics and phylogenetic analysis. Genome Res 2001, 11:754-770.

7. Carretero-Paulet L, Galstyan A, Roig-Villanova I, Martinez-Garcia JF, Bilbao-Castro JR, Robertson DL: Genome-Wide Classification and Evolutionary Analysis of the bHLH Family of Transcription Factors in Arabidopsis, Poplar, Rice, Moss, and Algae. Plant physiology 2010, 153:1398.

8. Massari ME, Murre C: Helix-loop-helix proteins: regulators of transcription in eucaryotic organisms. Mol. Cell. Biol 2000, 20:429-440.

9. Ma PC, Rould MA, Weintraub H, Pabo CO: Crystal structure of MyoD bHLH domain-DNA complex: perspectives on DNA recognition and implications for transcriptional activation. Cell 1994, 77:451-459.

10. Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, Oshima Y, Hakoshima T: Crystal structure of PHO4 bHLH domain-DNA complex: flanking base recognition. EMBO J 1997, 16:4689-4697.

11. Atchley WR, Fitch WM: A natural classification of the basic helix-loop-helix class of transcription factors. Proc. Natl. Acad. Sci. U.S.A 1997, 94:5172-5176.

12. Fairman R, Beran-Steed RK, Anthony-Cahill SJ, Lear JD, Stafford WF 3rd, DeGrado WF, Benfield PA, Brenner SL: Multiple oligomeric states regulate the DNA binding of helix-loop-helix peptides. Proc. Natl. Acad. Sci. U.S.A 1993, 90:10429-10433.

13. Pires N, Dolan L: Origin and Diversification of Basic-Helix-Loop-Helix Proteins in Plants. Molecular Biology and Evolution 2009, 27:862-874.

14. Ledent V, Paquet O, Vervoort M: Phylogenetic analysis of the human basic helix-loop-helix proteins. Genome Biol 2002, 3:1-18.

15. Buck MJ, Atchley WR: Phylogenetic Analysis of Plant Basic Helix-Loop-Helix Proteins. Journal of Molecular Evolution 2003, 56:742-750.

16. Ni M, Tepperman JM, Quail PH: PIF3, a phytochrome-interacting factor necessary for normal photoinduced signal transduction, is a novel basic helix-loop-helix protein. Cell 1998, 95:657-667.

17. Friedrichsen DM, Nemhauser J, Muramitsu T, Maloof JN, Alonso J, Ecker JR, Furuya M, Chory J: Three redundant brassinosteroid early response genes encode putative bHLH transcription factors required for normal growth. Genetics 2002, 162:1445-1456.

18. Smolen GA, Pawlowski L, Wilensky SE, Bender J: Dominant alleles of the basic helix-loop-helix transcription factor ATR2 activate stress-responsive genes in Arabidopsis. Genetics 2002, 161:1235-1246.

19. Liljegren SJ, Roeder AHK, Kempin SA, Gremski K, Østergaard L, Guimil S, Reyes DK, Yanofsky MF: Control of fruit patterning in Arabidopsis by INDEHISCENT. Cell 2004, 116:843-853.

20. Szécsi J, Joly C, Bordji K, Varaud E, Cock JM, Dumas C, Bendahmane M: BIGPETALp, a bHLH transcription factor is involved in the control of Arabidopsis petal size. EMBO J 2006, 25:3912-3920.

21. Pillitteri LJ, Sloan DB, Bogenschutz NL, Torii KU: Termination of asymmetric cell division and differentiation of stomata. Nature 2007, 445:501-505.

22. Menand B, Yi K, Jouannic S, Hoffmann L, Ryan E, Linstead P, Schaefer DG, Dolan L: An ancient mechanism controls the development of cells with a rooting function in land plants. Science 2007, 316:1477-1480.

23. Heim MA, Jakoby M, Werber M, Martin C, Weisshaar B, Bailey PC: The basic helix-loop-helix transcription factor family in plants: a genome-wide study of protein structure and functional diversity. Mol. Biol. Evol 2003, 20:735-747.

24. Atchley WR, Fernandes AD: Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network. Proceedings of the National Academy of Sciences of the United States of America 2005, 102:6401.

25. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol. Biol. Evol 2000, 17:164-178.

26. Atchley WR, Zhao J: Molecular Architecture of the DNA-Binding Region and Its Relationship to Classification of Basic Helix–Loop–Helix Proteins. Molecular biology and evolution 2007, 24:192.

27. Ferré-D’Amaré AR, Prendergast GC, Ziff EB, Burley SK: Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature 1993, 363:38-45.

28. Atchley WR, Terhalle W, Dress A: Positional dependence, cliques, and predictive motifs in the bHLH protein domain. J. Mol. Evol 1999, 48:501-516.

29. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

30. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.

31. Wang Z, Atchley WR: Spectral analysis of sequence variability in basic-helix-loop-helix (bHLH) protein domains. Evol. Bioinform. Online 2006, 2:187-196.

32. Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2:28-36.

33. Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics 1998, 14:48-54.

34. Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N, Bryant SH: CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res 2009, 37:D205-210.

35. Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N: PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 2010, 38:D161-166.

36. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJA, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: InterPro: the integrative protein signature database. Nucleic Acids Res 2009, 37:D211-215.

37. Abascal F, Zardoya R, Posada D: ProtTest: selection of best-fit models of protein evolution. Bioinformatics 2005, 21:2104-2105.

38. Le SQ, Gascuel O: An improved general amino acid replacement matrix. Molecular biology and evolution 2008, 25:1307.

39. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol 2003, 52:696-704.

40. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol 2007, 24:1596-1599.

41. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19:1572-1574.

42. Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees. 1st edition. Chapman and Hall/CRC; 1984.

43. HDMD: Statistical Analysis Tools for High Dimension Molecular Data [http://cran.r-project.org/web/packages/HDMD/].

44. Johnson RA, Wichern DW: Applied Multivariate Statistical Analysis. 5th edition. Prentice Hall; 2001.

45. R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2009.

46. Rifkin R, Klautau A: In defense of one-vs-all classification. The Journal of Machine Learning Research 2004, 5:101–141.

47. Gross ST: The Kappa Coefficient of Agreement for Multiple Observers When the Number of Subjects is Small. Biometrics 1986, 42:883-893.

48. Tsoumakas G, Katakis I: Multi-label classification: An overview. International Journal of Data Warehousing and Mining 2007, 3:1–13.

49. Li X, Duan X, Jiang H, Sun Y, Tang Y, Yuan Z, Guo J, Liang W, Chen L, Yin J, Ma H, Wang J, Zhang D: Genome-wide analysis of basic/helix-loop-helix transcription factor family in rice and Arabidopsis. Plant Physiol 2006, 141:1167-1184.

50. Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc. Natl. Acad. Sci. U.S.A 2000, 97:3288-3291.

51. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome research 2004, 14:1188.

52. Toledo-Ortiz G, Huq E, Quail PH: The Arabidopsis basic/helix-loop-helix transcription factor family. Plant Cell 2003, 15:1749-1770.

53. Robbertse B, Reeves JB, Schoch CL, Spatafora JW: A phylogenomic analysis of the Ascomycota. Fungal genetics and biology 2006, 43:715–725.

54. Zhang N, Castlebury LA, Miller AN, Huhndorf SM, Schoch CL, Seifert KA, Rossman AY, Rogers JD, Kohlmeyer J, Volkmann-Kohlmeyer B, Sung G-H: An overview of the systematics of the Sordariomycetes based on a four-gene phylogeny. Mycologia 2006, 98:1076-1087.

55. Nowrousian M, Stajich JE, Chu M, Engh I, Espagne E, Halliday K, Kamerewerd J, Kempken F, Knab B, Kuo H-C, Osiewacz HD, Pöggeler S, Read ND, Seiler S, Smith KM, Zickler D, Kück U, Freitag M: De novo Assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis. PLoS Genet 2010, 6:1-22.

56. Morgenstern B, Atchley WR: Evolution of bHLH transcription factors: modular evolution by domain shuffling? Mol. Biol. Evol 1999, 16:1654-1663.

57. Cambareri EB, Singer MJ, Selker EU: Recurrence of Repeat-Induced Point Mutation (Rip) in Neurospora Crassa. Genetics 1991, 127:699-710.

58. Graïa F, Lespinet O, Rimbault B, Dequard-Chablat M, Coppin E, Picard M: Genome quality control: RIP (repeat-induced point mutation) comes to Podospora. Mol. Microbiol 2001, 40:586-595.

59. Ikeda K, Nakayashiki H, Kataoka T, Tamba H, Hashimoto Y, Tosa Y, Mayama S: Repeat‐induced point mutation (RIP) in Magnaporthe grisea: implications for its sexual cycle in the natural field context. Molecular Microbiology 2002, 45:1355-1364.

60. Osborne TF, Espenshade PJ: Evolutionary conservation and adaptation in the mechanism that regulates SREBP action: what a long, strange tRIP it’s been. Genes & Development 2009, 23:2578-2591.

61. Benton BK, Reid MS, Okayama H: A Schizosaccharomyces pombe gene that promotes sexual differentiation encodes a helix-loop-helix protein with homology to MyoD. EMBO J 1993, 12:135-143.

62. Endoh-Yamagami S, Hirakawa K, Morioka D, Fukuda R, Ohta A: Basic helix-loop-helix transcription factor heterocomplex of Yas1p and Yas2p regulates cytochrome P450 expression in response to alkanes in the yeast Yarrowia lipolytica. Eukaryotic Cell 2007, 6:734-743.

63. O’Neill EM, Kaffman A, Jolly ER, O’Shea EK: Regulation of PHO4 Nuclear Localization by the PHO80-PHO85 Cyclin-CDK Complex. Science 1996, 271:209-212.

64. Then Bergh F, Flinn EM, Svaren J, Wright AP, Hörz W: Comparison of nucleosome remodeling by the yeast transcription factor Pho4 and the glucocorticoid receptor. J. Biol. Chem 2000, 275:9035-9042.

65. Park G, Colot HV, Collopy PD, Krystofova S, Crew C, Ringelberg C, Litvinkova L, Altamirano L, Li L, Curilla S, Wang W, Gorrochotegui-Escalante N, Dunlap JC, Borkovich KA: High-throughput production of gene replacement mutants in Neurospora crassa. Methods Mol. Biol 2011, 722:179-189.

66. Morrow CA, Fraser JA: Sexual reproduction and dimorphism in the pathogenic basidiomycetes. FEMS Yeast Res 2009, 9:161-177.

67. Tatry M-V, El Kassis E, Lambilliotte R, Corratgé C, van Aarle I, Amenc LK, Alary R, Zimmermann S, Sentenac H, Plassard C: Two differentially regulated phosphate transporters from the symbiotic fungus Hebeloma cylindrosporum and phosphorus acquisition by ectomycorrhizal Pinus pinaster. Plant J 2009, 57:1092-1102.

68. Mendonça Maciel MJ, Castro e Silva A, Telles Ribeiro HC: Industrial and biotechnological applications of ligninolytic enzymes of the basidiomycota: a review. Electron Journ of Biotechnol 2010, 13:1-6.

69. Brownlie P, Ceska T, Lamers M, Romier C, Stier G, Teo H, Suck D: The crystal structure of an intact human Max-DNA complex: new insights into mechanisms of transcriptional control. Structure 1997, 5:509-520.

70. Ellenberger T, Fass D, Arnaud M, Harrison SC: Crystal structure of transcription factor E47: E-box recognition by a basic region helix-loop-helix dimer. Genes Dev 1994, 8:970-980.

71. Ferré-D’Amaré AR, Pognonec P, Roeder RG, Burley SK: Structure and function of the b/HLH/Z domain of USF. EMBO J 1994, 13:180-189.

72. Párraga A, Bellsolell L, Ferré-D’Amaré AR, Burley SK: Co-crystal structure of sterol regulatory element binding protein 1a at 2.3 A resolution. Structure 1998, 6:661-672.

73. De Masi F, Grove CA, Vedenko A, Alibés A, Gisselbrecht SS, Serrano L, Bulyk ML, Walhout AJM: Using a structural and logics systems approach to infer bHLH-DNA binding specificity determinants. Nucleic Acids Res. 2011, 39:4553-4563.

74. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology 1990, 215:403–410.

75. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res 1999, 27:260-262.

76. Skinner MK, Rawls A, Wilson-Rawls J, Roalson EH: Basic helix-loop-helix transcription factor gene family phylogenetics and nomenclature. Differentiation 2010, 80:1-8.

77. Sablitzky F: Protein Motifs: the Helix-Loop-Helix Motif. In Encyclopedia of Life Sciences. Chichester: John Wiley & Sons, Ltd; 2005.

78. Guimera J, Vogt Weisenhorn D, Echevarría D, Martínez S, Wurst W: Molecular characterization, structure and developmental expression of Megane bHLH factor. Gene 2006, 377:65-76.

79. Grove CA, De Masi F, Barrasa MI, Newburger DE, Alkema MJ, Bulyk ML, Walhout AJM: A Multiparameter Network Reveals Extensive Divergence between C. elegans bHLH Transcription Factors. Cell 2009, 138:314-327.

Table 1 Gene count summary of the 12 fungal bHLH groups by completed fungal genome

Taxonomy F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 Un Tot
Basidiomycota
  Ustilaginomycotina
    Malassezia globosa 1 1 1 1 1 1 1 7
    Ustilago maydis 2 1 1 1 2 1 1 1 1 1 12
  Agaricomycotina
    Postia placenta 1 1 1 1 1 1 1 7
    Laccaria bicolor 1 4 1 1 1 1 1 2 1 1 14
    Coprinopsis cinerea 1 1 1 1 1 1 2 1 1 10
    Filobasidiella neoformans 1 1 1 1 1 1 1 2 9
Ascomycota
  Taphrinomycotina
    Schizosaccharomyces japonicus 1 2 1 4
    Schizosaccharomyces pombe 1 2 1 4
  Saccharomycotina
    Saccharomycetes, Saccharomycetales
      Metschnikowiaceae
        Clavispora lusitaniae 2 1 3 1 1 8
      Dipodascaceae
        Yarrowia lipolytica 1 1 2 1 1 1 1 1 9
      Candida (mitosporic Saccharomycetales)
        Candida dubliniensis 2 1 3 1 1 1 9
        Candida tropicalis 2 3 1 1 7
        Candida albicans 2 1 3 1 1 8
      Saccharomycetaceae
        Pichia pastoris 1 1 2 1 2 1 1 1 10
        Lachancea thermotolerans 2 1 2 1 1 1 8
        Vanderwaltozyma polyspora 2 1 3 1 1 8
        Eremothecium gossypii 2 1 2 1 1 7
        Kluyveromyces lactis 2 1 3 1 1 8
        Candida glabrata 2 1 3 1 1 1 9
        Zygosaccharomyces rouxii 2 1 2 1 1 1 8
        Saccharomyces cerevisiae 2 1 2 1 1 1 8
      Debaryomycetaceae
        Lodderomyces elongisporus 2 1 2 1 1 7
        Debaryomyces hansenii 2 1 2 1 1 1 8
        Meyerozyma guilliermondii 2 1 3 1 1 8
        Scheffersomyces stipitis 2 1 2 1 1 7

Table 1 (continued)

Taxonomy F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 Un Tot
Ascomycota
  Pezizomycotina
    Dothideomycetes
      Pyrenophora tritici-repentis 1 1 4 2 1 1 1 11
      Phaeosphaeria nodorum 1 1 4 1 2 1 1 1 12
    Leotiomycetes
      Sclerotinia sclerotiorum 1 1 2 1 1 1 1 1 9
      Botryotinia fuckeliana 1 1 2 1 1 1 1 1 9
    Sordariomycetes
      Nectria haematococca 1 1 5 1 1 1 1 1 12
      Magnaporthe oryzae 1 1 3 1 1 1 1 1 10
      Chaetomium globosum 1 4 1 2 1 1 1 11
      Podospora anserina 1 1 8 1 2 1 1 1 16
      Sordaria macrospora 1 1 6 1 1 1 1 12
      Neurospora crassa 1 1 7 1 1 1 1 1 14
    Eurotiomycetes
      Onygenales
        Trichophyton verrucosum 1 1 1 1 1 1 1 1 8
        Arthroderma benhamiae 1 1 1 1 1 1 1 1 8
        Arthroderma otae 1 1 1 1 1 1 1 1 8
        Ajellomyces dermatitidis 1 1 1 1 1 1 1 1 8
        Ajellomyces capsulatus 1 1 1 1 1 1 6
        reesii 1 1 1 1 1 1 1 7
        Paracoccidioides brasiliensis 1 1 1 1 1 1 1 7
        posadasii 1 1 1 1 1 1 1 1 8
      Eurotiales
        Talaromyces stipitatus 1 1 3 1 1 1 1 1 10
        Emericella nidulans 1 4 1 2 1 1 1 1 12
        Neosartorya fischeri 1 1 3 1 1 1 1 1 10
        Aspergillus fumigatus 1 1 3 1 1 1 1 9
        Penicillium chrysogenum 1 1 2 1 1 1 2 1 10
        Penicillium marneffei 1 1 3 1 1 1 1 1 10
        Aspergillus niger 1 1 2 1 1 1 1 1 9
        Aspergillus terreus 1 1 3 1 1 1 1 1 10
        Aspergillus flavus 1 1 4 1 1 1 1 1 11
        Aspergillus oryzae 1 1 4 1 1 1 1 10
        Aspergillus clavatus 1 1 2 1 1 1 1 1 9

NOTE.–Listed fungal organisms have completed genome projects, fully annotated gene sets, and contain bHLH genes. A simplified taxonomic classification, the total bHLH copy count, and the bHLH copy count within fungal groups F1-F12 are provided for each organism.

Table 2 Structural attributes and significant sites of the bHLH domain

Site Structural CS DT SM pah pss ms cc ec all
1 DP √ √ C
2 DP √ C C
3
4 √ √
5 DP √ C C
6 P √ SC S C C SC SC
7 √ √
8 DP * √ √ SC SC C SC
9 DP √ C C C
10 P * C C SC S C
11 P * √ S SC C
12 DP * √ √ SC SC SC SC SC SC
13 DP √ C C C C C
14 S
15 P √ √ SC SC
16 B √ √ C C C
17 P * C C C C
18
19 √ √ S S
20 B * √ √ C C C C C
21
22
23 B √ C C C C C
24 S
25
26 *
27 B √ C
28 B √

Table 2 (continued)

Site Structural CS DT SM pah pss ms cc ec all
50 DPB √ √ √ C SC SC C SC SC
51 S SC C C
52
53 B √ √ C C
54 B √ C SC C
55
56
57 *
58
59
60 √
61 B √
62
63
64 B √

NOTE.–The molecular architecture of bHLH positions is compiled from previous work on crystalline structures of animal proteins. Structural attributes noted are DNA contact of the E-box (D), phosphate backbone contact (P), or buried site within the hydrophobic core of the dimerized helices (B). Highly (√) and moderately (*) conserved sites are denoted (CS). Sites integral in the decision tree analysis (DT) and the simplified model (SM) are also reported. Last, SWDA (S) and CVA (C) significant sites are shown within each factor score dataset (pah, pss, ms, cc, ec, and all).

Table 3 Known biological and molecular functions for bHLH proteins by fungal group

Group | Reported Members | Biological Function
F2 | RTG1, RTG3, MGG_05709 | Interorganelle communication between mitochondria, peroxisomes, and nucleus.
F3 | CBF1, CBF1P, CaCBF1, AnBH1, CPF1 | Chromosome segregation, Methionine auxotrophic growth, rRNA transcription, Repression of Penicillin Biosynthesis, Regulation of sulfur utilization, ribosome biogenesis, and glycolysis.
F4 | TYE7, SAH-2, HMS1, SRE1, SRE2, CPH2, CAP1P | Sexual development, Aerial hyphae development, Hypoxic response, Carbon catabolite transcription activation, Regulation of glycolysis, ergosterol biosynthesis, heme biosynthesis, phospholipid biosynthesis.
F5 | Q6MYV5 | Nitrate assimilation, Quinate utilization
F6 | PHO4, NUC-1, PalcA | Response to copper ion, Regulate phosphate acquisition and metabolic process, Promotes sexual development, Represses asexual development.
F8 | ESC1, devR | Sexual differentiation, Sexual conjugation, Development under standard growth conditions.
F10 | YAS2 | Alkane response.
F11 | INO4, YAS1 | Derepression of phospholipid synthesis, Alkane response.
F12 | INO2 | Derepression of phospholipid synthesis

NOTE.–No functional annotations were found for members of groups F1 and F7. Literature describing biological functions of the reported members of groups F2-F12 is cited in the manuscript.

Table 4 Validation of bHLH classification methods

Dataset | Stat. Measure | Decision Tree | CVA{pah} | CVA{pss} | CVA{ms} | CVA{cc} | CVA{ec} | CVA{all} | SWDA{all} | Simplified Model
488 | Accuracy | 95.5 | 100.0 | 99.6 | 99.6 | 98.4 | 99.8 | 100.0 | 99.2 | 99.0
488 | Coefficient | 94.7 | 100.0 | 99.5 | 99.5 | 98.1 | 99.8 | 100.0 | 99.1 | 98.8
488 | Unclassified | 0 | 4 | 4 | 4 | 4 | 4 | 4 | 2 | 8
198 | Accuracy | 92.9 | 96.1 | 96.1 | 97.2 | 93.9 | 96.1 | 95.0 | 95.4 | 96.4
198 | Coefficient | 92.0 | 95.6 | 95.6 | 96.8 | 93.0 | 95.6 | 94.3 | 94.8 | 95.9
198 | Unclassified | 0 | 19 | 19 | 19 | 19 | 19 | 19 | 3 | 3

NOTE.–The accuracy and coefficient of agreement are reported for the Decision Tree, SWDA{all}, each CVA, and the Simplified Model classification methods. The measures are derived from two datasets. The first measurements are based on the 488 fungal sequences used in building the models. The second set assesses the models with the F.198 sequence set. The number of sequences that could not be classified (Unclassified) by each model is also reported.

Table 5 Simplified model for the 12 fungal groups

Grp 1 4 6 7 8 12 15 16 19 20 50 53
F1 QTM IV Y
F2 Y KR I
F3 E V
F4 Y
F5 S K
F6 A R
F7 A S
F8 L A
F9 QAL SA
F10 N S I E
F11 E
F12 K L

NOTE.–The bHLH positions and their states (i.e., amino acids) that best distinguish groups F1-F12 are given. Those amino acids in bold italics are uncharacteristic of a given fungal group (e.g., F2 sequences do not contain Y at site 12).

Table 6 Cross Kingdom classification of bHLH domains

Stat. Measure | Decision Tree | CVA{pah} | CVA{pss} | CVA{ms} | CVA{cc} | CVA{ec} | CVA{all} | SWDA{all} | Classic Model
A | 0.1 | 0.4 | 18.4 | 8.1 | 3.6 | 0.0 | 4.4 | 0.1 | 0.1
B | 97.4 | 81.2 | 72.0 | 72.7 | 84.4 | 87.8 | 87.8 | 82.4 | 1.4
C | 0.1 | 1.0 | 5.7 | 1.8 | 2.3 | 0.7 | 0.3 | 0.9 | 2.1
D | 2.3 | 0.0 | 0.3 | 0.0 | 0.0 | 0.3 | 0.4 | 0.1 | 0.0
E | 0.0 | 14.4 | 0.7 | 14.4 | 6.8 | 8.3 | 4.1 | 15.9 | 0.0
Unclass. | 0 | 21 | 21 | 21 | 21 | 21 | 21 | 4 | 677

NOTE.–The percentages of 707 fungal sequences classified into animal Groups A-E are reported for the animal classification models: Decision Tree, SWDA{all}, each CVA, and the Classic model. Unclass. gives the number of bHLH sequences that could not be classified by each model.

Table 7 Mahalanobis distance between animal and fungal bHLH groups

Grp | A | B | C | D | E | Avg
F1 | 89.4 | 32.5 | 63.4 | 120.5 | 45.3 | 70.2
F2 | 82.1 | 17.5 | 58.3 | 117.8 | 31.7 | 61.5
F3 | 86.8 | 28.0 | 57.2 | 116.4 | 33.6 | 64.4
F4 | 86.1 | 26.1 | 65.7 | 117.1 | 40.3 | 67.1
F5 | 81.0 | 23.1 | 61.0 | 118.6 | 27.8 | 62.3
F6 | 89.1 | 40.4 | 62.6 | 118.1 | 43.4 | 70.7
F7 | 89.0 | 34.2 | 62.4 | 120.7 | 43.0 | 69.9
F8 | 91.0 | 43.4 | 72.0 | 118.1 | 46.2 | 74.1
F9 | 83.9 | 26.5 | 62.3 | 117.0 | 36.6 | 65.3
F10 | 83.1 | 30.3 | 66.0 | 118.9 | 43.3 | 68.3
F11 | 132.9 | 109.7 | 128.4 | 156.8 | 117.7 | 129.1
F12 | 92.0 | 39.6 | 67.2 | 122.2 | 48.5 | 73.9
Avg | 90.5 | 37.6 | 68.9 | 121.9 | 46.4 |

NOTE.–A CVA{all} was constructed on the entire set of grouped animal and fungal proteins. The relative distances (Mahalanobis distance between group centroids) between F1-F12 and animal Groups A-E are reported. The average of these distances for each fungal and animal group is also shown.

Equation 1–Coefficient of Agreement K.

For a given confusion matrix:

N = number of trials
k = number of states (rows and columns)
x_ij = value at row i and column j of the matrix
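As a hedged illustration, the following sketch computes a kappa-type coefficient of agreement from a confusion matrix using N, k, and x_ij as defined above. It assumes the standard Cohen's kappa form, which may differ from the exact formulation used for Equation 1; the function name is hypothetical.

```python
def coefficient_of_agreement(matrix):
    """Cohen's kappa for a square confusion matrix (an ASSUMED form of K).

    matrix[i][j] is x_ij: the count at row i and column j.
    """
    k = len(matrix)                                  # number of states
    n = sum(sum(row) for row in matrix)              # number of trials, N
    if n == 0:
        raise ValueError("confusion matrix is empty")
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of trials on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Agreement expected by chance from the row and column marginals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement between predicted and true groups, while 0 indicates agreement no better than chance.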

Figure 1–Fungi bHLH entropies, logos, and consensus. A. The bHLH normalized group entropy by position. Lower values indicate conservation, while values close to one approach complete randomness. B. The graphical representation of the amino acids at each position of the bHLH domain. Symbols representing amino acids are scaled by their bit score (a derivation of entropy) at a given position. C. The 50-10 consensus sequences for Fungi. Using an alignment of bHLH domains, amino acids occurring at a frequency of more than 50% at a given site are displayed. At each of these sites, additional amino acids are displayed beneath if they are conserved in 10% or more of the sequences.

[Figure 1 image: panels A, B, and C; domain regions labeled Basic, Helix 1, Loop, and Helix 2]

Figure 2–Phylogenetic analysis of fungal bHLH. Phylogenetic relationships, taxonomic representation, bHLH motif statistics, and architecture of conserved protein motifs for 12 fungal bHLH groups. ML tree of 490 fungal bHLH proteins (full representation of Basidiomycota, Pezizomycotina, Saccharomycotina, and Fungal trees available in Supplemental Figure S1A-D). The tree is drawn to scale (branch lengths proportional to evolutionary distances) and has been rooted with a single representative from Chlamydomonas reinhardtii. Groups, determined by clades with strong support, are collapsed as triangles with width and depth proportional to the size and sequence divergence of each group, respectively. Groups supported by bootstrap values >30 in Neighbor-Joining (NJ) or Maximum Parsimony (MP) analyses are colored black. The shaded group F9 was ambiguously retrieved in NJ, MP, or BA trees. Ungrouped genes are indicated as single lines and the scale bar represents the estimated number of amino acid replacements per site. Basidiomycota (B), Pezizomycotina (P), and Saccharomycotina (S) clades associated with each fungal group are noted in brackets. The architecture of conserved motifs, as determined through MEME, is graphically represented as boxes drawn to scale. Box enumeration corresponds to specific motifs found by MEME (Supplemental Figure 2C). Grey boxes represent motifs that match the bHLH domain. Last, the sequence logos for B4, P4, and S4 for the 21 amino acids downstream of the bHLH domain are shown (motif 20).

Figure 3–NJ analyses of F4 proteins from Sordariomycete organisms. The NJ trees were inferred from the bHLH domain only and from the entire bHLH containing protein sequence. The trees are displayed with corresponding bootstrap values; branches with a bootstrap value of less than 30 have been collapsed. Six strongly supported F4 subclades are noted: subclades A-C contain one member from each Sordariomycete organism, and subclades D-F each have one member from P. anserina, S. macrospora, and N. crassa. * The Q2GSP1 bHLH domain has been determined to be highly divergent by 1) containing 7 mismatches to the fungal consensus motif; 2) possessing an uncharacteristic amino acid at site 12 for an F4 protein (D); and 3) containing a simple sequence repeat through both the basic and Helix 2 regions (DDDDDD). The full Q2GSP1 sequence, however, is strongly supported in the A subclade, suggesting the highly divergent bHLH arose from either evolutionary pressure or sequencing errors.

Figure 4–Fungal decision tree analysis. The decision tree describing the separation of bHLH fungal groups F1-F12 by amino acid sites found in the bHLH domain. Each box of the figure represents a step in the decision tree, which consists of a number of bHLH sequences from each fungal group and the amino acids at a given bHLH position (state). The sample size and proportion of group representatives are provided in the accompanying table. Diamonds contain the bHLH amino acid site that bifurcates the data into subsets of the previous state.

[Figure 4 image: decision tree with nodes 1-27; diamonds by level: Site 8; Sites 12 and 15; Sites 20, 50, and 11; Sites 15, 19, 15, and 1; Sites 4, 7, and 4]

Figure 5–Canonical Variate Analysis of fungal bHLH groups. A. Projection of 488 fungal bHLH sequences onto eigenvectors (Canonical Variates) for the all dataset. Plot contains the first and second Canonical Variate out of 11 total. Axes reflect the Mahalanobis distance between fungal groups F1-F12. B. Pairwise Mahalanobis distance between fungal group centroids in the CVA{all} analysis.

Chapter 3

Fundamental Characteristics and Discerning Sites of the Eukaryotic bHLH Domain

Joshua K. Sailsbery1,2, William Atchley3, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA
2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA
3 Department of Genetics, North Carolina State University, Raleigh, NC 27606, USA

For publication as a research paper in BMC Bioinformatics

* To whom correspondence should be addressed.

Ralph A. Dean 851 Main Campus Drive, Suite 233 Centennial Campus Center for Integrated Fungal Research North Carolina State University Raleigh, NC 27606 Tel.: +1-919-513-0020; Fax: +1-919-513-0024; Email: [email protected]

ABSTRACT

The highly conserved bHLH domain, found in many transcription factors, has been well characterized separately in Plants, Animals, and Fungi. Herein, we investigated the fundamental architecture of the bHLH domain across all Eukarya. We identified five DNA-binding amino acid sites that are highly characteristic of each Kingdom, and developed classification methods that accurately determined the origin of a bHLH sequence. Expertly aligned bHLH domains were used to generate Hidden Markov Models (HMM), which were also highly accurate for identifying bHLH domains. The HMMs and classification methods were tested against a well known environmental sample to determine the Kingdom of previously unknown bHLH domains. Last, we have created an online tool that can align, extract, and classify bHLH sequences.

INTRODUCTION

Few amino acid sequences are well conserved across all Eukaryotic life. One of the exceptions is the basic Helix-Loop-Helix (bHLH) domain. The bHLH domain is an essential part of many transcription factors involved in a myriad of regulatory processes, from neurogenesis in mammals [1, 2] to environmental response in Plants [3]. It provides two of the crucial molecular roles for transcription factors, DNA binding and transcriptional regulation.

The tripartite bHLH domain is ~60 amino acids in length. The DNA-binding region (basic) is located at the N-terminus. This 13 amino acid region is known to bind the hexanucleotide sequence E-box (CANNTG) or its degenerate forms in most bHLH proteins [1]. The two alpha-helices (Helix 1, Helix 2) stabilize DNA interaction and dimerize with secondary proteins to promote transcription or facilitate protein-protein interactions [4, 5]. The two helices contain ~15 amino acids and are separated by a loop of variable length.

Animal bHLH domains form six distinct phylogenetic clades, called groups A-F [6, 7]. Group B contains many ancient, highly conserved members such as Myc, Mad, Hairy, and Pho4 [8]. These E-box binders can be identified by the BxR motif at sites 5, 8, and 13, where B is either H or K. Animal group A proteins follow the xRx motif at sites 5, 8, and 13 [6]. Members of group A include E12, dHand, MyoD, and Twist, which bind the specific E-box sequences CAGCTG or CACCTG. Group E proteins also possess a motif in the basic region: they contain Proline (P) at site 6, which allows them to bind the degenerate E-box sequence CACGNG. Distinguishing characteristics for groups C and F include the conservation of additional downstream domains, PAS (Per-Arnt-Sim) and COE (Collier/Olf-1/EBF) [7, 9], respectively. Finally, group D, which includes members such as Id and Emc, lacks any type of basic region and acts as a transcriptional regulator through hetero-dimerization [10].
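The basic-region motifs described above can be sketched as simple site checks. The following is a minimal illustration, assuming an aligned bHLH domain string in which character i (counting from 1) corresponds to enumerated site i; the function name is hypothetical, and the sketch reports every matching motif rather than assuming a precedence among groups. Groups C, D, and F are not covered because they are defined by features outside these three motifs.

```python
def matching_basic_motifs(domain):
    """Return the animal groups (A, B, E) whose basic-region motif is present.

    `domain` is ASSUMED to be an aligned bHLH sequence where the first
    character is site 1; at least 13 sites are required.
    """
    if len(domain) < 13:
        raise ValueError("need at least 13 enumerated sites")

    def site(number):
        return domain[number - 1]   # convert 1-based site to 0-based index

    matches = []
    if site(8) == "R":                               # xRx at sites 5, 8, 13
        matches.append("A")
    if site(5) in ("H", "K") and site(13) == "R":    # BxR, where B is H or K
        matches.append("B")
    if site(6) == "P":                               # Proline at site 6
        matches.append("E")
    return matches
```

For example, a domain with H at site 5 and R at site 13, but no R at site 8 or P at site 6, would return only group B.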

From algae to angiosperms, recent work on plants has identified ~33 distinct bHLH domain groups [7, 11]. These plant groups are tied to biological functions through characterized members and include processes from light and hormone signaling [12, 13] to tissue development [14, 15]. Studies have shown that plant bHLH domains are most closely related to animal group B [16, 17].

We have recently identified 12 distinct phylogenetic groups in Fungi [18]. Members of these groups have been tied to specific biological roles such as sexual development and glycolysis regulation. Like plant bHLH sequences, fungal bHLH sequences are most closely related to animal group B. Similar to animal groups, the 12 fungal groups can be distinguished with only a few amino acids. For example, only fungal group F4 sequences contain a Tyrosine (Y) at site 12 in the basic region. Statistical classification analyses led to the identification of the sites that discern between fungal groups.

Amino acids are typically represented as alphabetic letters, according to the IUPAC standard. However, such representation limits many statistical analyses, as they require numerical data. To address this, Atchley et al. [19] developed five numerical indices (factor scores) that allow translation of amino acids from alphabetical to numerical data. These factor scores were derived from Factor Analysis on 495 physicochemical amino acid properties. Each of these independent factor scores (pah, pss, ms, cc, ec) represents distinct molecular functions of amino acids. Factor score pah represents the amino acid properties of polarity, accessibility, and hydrophobicity. The propensity for secondary structure is linked to the pss index. Molecular size, codon composition, and electrical charge are represented by the ms, cc, and ec factor scores, respectively.

Numerical amino acid data permit the use of classification methods such as StepWise Discriminant Analysis (SWDA) and Canonical Variate Analysis (CVA) [20]. These analyses examine patterns of covariation to obtain a unique classification solution. SWDA and CVA have been used to identify the most discerning sites between animal groups and also fungal groups [21, 18]. These discerning sites are characteristic of the sequences within each group and are often tied to the molecular function of the bHLH domain itself.

Herein, we conducted a study of the discerning sites that exist between plant, animal, and fungal bHLH domains. We also built several classification models to discern Kingdom origin and incorporated these into a web-based tool to classify any bHLH sequence. These models provided valuable insight into the fundamental differences between plant, animal, and fungal bHLH domains. We then built several HMMs (Hidden Markov Models) and used these to identify and classify bHLH sequences from a large marine environmental sample genome project. The tool, all source code, and relevant data are available at www.fungalgenomics.ncsu.edu.

MATERIAL AND METHODS

A set of 1302 expertly aligned and enumerated bHLH domains was compiled from previous work [9, 18, 21, 22]. This set was used to build classification models and included 514 plant, 279 animal, and 509 fungal sequences. A bHLH sequence dataset for testing classification models was generated as follows. First, protein sequences containing the IPR001092 signature were identified using InterPro 31.0 [23]. Next, using an iterative alignment [18], each bHLH domain was enumerated in a manner comparable to the expert build set. Last, bHLH sequences were extracted and placed into the test dataset. The resulting test dataset comprised 6987 bHLH sequences: plant (1900), animal (4722), and fungal (365).

The Boltzmann-Shannon entropy value was calculated for each site in the build sequence alignment for: 1) plant sequences, 2) animal sequences, and 3) fungal sequences. To determine the normalized group entropy value: 1) amino acids were grouped based on their physicochemical properties (acidic, basic, aromatic, aliphatic, aminic, hydroxylated, Cysteine, Proline) resulting in eight sets (DE, HKR, FWY, AGILMV, NQ, ST, C, P, respectively) [6, 24]; 2) the Boltzmann-Shannon entropy value, based on the eight amino acid groups, was calculated at each site [25]; 3) these group entropy values were normalized to range from 0 to 1, with respect to possible minimum and maximum values, respectively.
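The three steps above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the maximum possible entropy is log2(8) (a uniform distribution over the eight groups), assumes that gaps and non-standard symbols are ignored, and uses a hypothetical function name.

```python
import math
from collections import Counter

# The eight physicochemical amino acid groups listed in the text.
AA_GROUPS = ["DE", "HKR", "FWY", "AGILMV", "NQ", "ST", "C", "P"]
AA_TO_GROUP = {aa: index for index, group in enumerate(AA_GROUPS) for aa in group}

def normalized_group_entropy(column):
    """Normalized group entropy (0 to 1) for one alignment column.

    `column` is an iterable of one-letter amino acid codes at a single site.
    Symbols outside the eight groups (e.g. gaps) are skipped (an ASSUMPTION).
    """
    counts = Counter(AA_TO_GROUP[aa] for aa in column if aa in AA_TO_GROUP)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no recognized amino acids")
    # Shannon entropy in bits over the group frequencies.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Divide by the assumed maximum, log2(8) = 3 bits.
    return entropy / math.log2(len(AA_GROUPS))
```

Under these assumptions, a site occupied only by acidic residues (D or E) scores 0, while a site with equal representation of all eight groups scores 1.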

Consensus sequences were determined by using the 50-10 rule [11]. For a given site, an amino acid was included in the consensus sequence if it had a frequency >50% across the entire sequence alignment. For each site meeting the 50% threshold, every amino acid with a frequency >10% was also included. Using these rules, the animal consensus motif was adapted from previous work [6]. Fungal and plant consensus motifs were taken from previously published work [11, 18].
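The 50-10 rule can be sketched as below. This is an illustrative implementation under stated assumptions: frequencies are computed over all sequences in the alignment (gaps count toward the denominator but are never reported), both thresholds are strict inequalities as written above, and the function name is hypothetical.

```python
from collections import Counter

def consensus_50_10(alignment, major=0.50, minor=0.10):
    """Apply the 50-10 rule to a list of equal-length aligned sequences.

    Returns one entry per site: None when no amino acid exceeds `major`,
    otherwise (consensus_residue, [additional residues exceeding `minor`]).
    """
    consensus = []
    for column in zip(*alignment):
        n = len(column)                      # all sequences, gaps included
        counts = Counter(aa for aa in column if aa != "-")
        top = [aa for aa, c in counts.items() if c / n > major]
        if not top:
            consensus.append(None)           # site not in the consensus
            continue
        extras = sorted(
            aa for aa, c in counts.items() if aa != top[0] and c / n > minor
        )
        consensus.append((top[0], extras))
    return consensus
```

At most one amino acid can exceed 50% at a site, so `top` holds zero or one residue; the additional residues are only reported for sites that already have a consensus residue.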

Classification Models

Decision Trees [21, 26] were built using SAS software, Enterprise Miner 5.2. A Chi-square test with a significance level of 20% was used as the splitting criterion. The bifurcating tree was limited to a depth of 4 nodes, requiring a minimum of 10 observations for a split and at least 4 observations per leaf.

Each amino acid in the 1302 bHLH sequence alignment was transformed from its alphabetic code into a 1 x 5 vector of numeric values using the HDMD package [27]. The transformation used independent factor scores, which are quantitative values for amino acids based on 495 amino acid properties [21]. The five factor scores (pah, pss, ms, cc, ec) are associated with the biological properties: polarity, accessibility, and hydrophobicity; propensity for secondary structure; molecular size or volume; codon composition; and electrostatic charge; respectively. We created an additional dataset containing the combination of all five factor scores (all). This resulted in a total of six factor score transformed datasets: pah, pss, ms, cc, ec, and all from the 1302 bHLH sequences.
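The transformation into the six datasets can be sketched in a few lines. The study used the HDMD package in R; the Python sketch below is only an illustration of the idea, with a hypothetical function name and a caller-supplied score table. The toy table in the usage lines contains PLACEHOLDER numbers, not the published factor scores.

```python
FACTORS = ("pah", "pss", "ms", "cc", "ec")

def encode_sequence(sequence, score_table, factor="all"):
    """Convert an aligned amino acid sequence into numeric values.

    `score_table` maps each residue to a 5-tuple ordered as FACTORS.
    With factor="all", every residue contributes its full 1 x 5 vector;
    otherwise only the named factor score is kept for each site.
    """
    if factor != "all" and factor not in FACTORS:
        raise ValueError(f"unknown factor: {factor}")
    encoded = []
    for residue in sequence:
        scores = score_table[residue]        # KeyError if residue is missing
        if factor == "all":
            encoded.extend(scores)
        else:
            encoded.append(scores[FACTORS.index(factor)])
    return encoded

# PLACEHOLDER scores for illustration only; NOT the published factor scores.
toy_table = {"A": (1.0, 2.0, 3.0, 4.0, 5.0), "R": (6.0, 7.0, 8.0, 9.0, 10.0)}
ms_values = encode_sequence("AR", toy_table, factor="ms")
all_values = encode_sequence("AR", toy_table)
```

A sequence of L sites therefore yields L variables in each single-factor dataset and 5L variables in the all dataset.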

Canonical Variate Analysis (CVA) and StepWise Discriminant Analysis (SWDA) [20] were used to build classification models on all six factor score datasets. These discriminant analyses defined the latent structure of covariation among the Kingdoms and identified the sites that best differentiated between plant, animal, and fungal sequences.

The step-up SWDA procedure ranked amino acid sites based on their ability to discriminate Plants, Animals, and Fungi, as measured by the Wilks’ lambda likelihood ratio [21]. In the step-up procedure, variables (amino acid sites) were added sequentially (steps) based on Wilks’ lambda, until the Average Squared Canonical Correlation (ASCC) reached a value of 70% for the pah, pss, ms, cc, and ec datasets and 80% for the all dataset. The ASCC describes how distinct the groups are at a given step in the model, meaning a 100% ASCC would imply complete discrimination between the defined groups. The partial correlation (r²) was also measured, which is a measure of each site’s ability to discriminate between groups while controlling for the effects of other variables already in the model. Those variables with r² > 20% were considered the most discerning sites. SAS software, Version 9.2, was utilized in the SWDA.
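For reference, the quantities used in the step-up procedure can be written out as below. These are the standard textbook definitions and are offered as an assumption about what the SAS procedure reports; the symbols are introduced here for illustration only. W and T denote the within-group and total sums-of-squares-and-cross-products matrices for the variables in the model, ρ_i are the canonical correlations, and g is the number of groups (g = 3 Kingdoms, so there are at most q = 2 canonical correlations).

```latex
\Lambda = \frac{\det(W)}{\det(T)} = \prod_{i=1}^{q} \left(1 - \rho_i^{2}\right),
\qquad
\mathrm{ASCC} = \frac{1}{g-1} \sum_{i=1}^{q} \rho_i^{2},
\qquad
r^{2}_{\mathrm{partial}} = 1 - \frac{\Lambda_{\text{with site}}}{\Lambda_{\text{without site}}}
```

Under these definitions an ASCC of 100% requires every squared canonical correlation to equal 1, which drives Wilks’ lambda to 0 and corresponds to complete separation of the groups.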

CVA assesses the discriminatory ability of all variables simultaneously to generate a linear model to differentiate between defined groups. The CVA includes the calculation of eigenvectors (canonical variates) from the among-group covariance matrix. CVA for the six factor score datasets resulted in 2 canonical variates for each analysis. The square root of the Mahalanobis pairwise distance was also calculated, providing a relative measure of the divergence between groups. CVA and plotting of canonical variates were conducted utilizing the statistical software package R [28], specifically the lda function as described in the HDMD package [27]. Amino acid sites were considered discerning if they had absolute magnitudes >1 in either canonical variate for the pah, pss, ms, cc, ec, and all CVAs.
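The distance between two group centroids in the plane of the two canonical variates can be sketched in Python. This is a hypothetical, simplified stand-in for the R workflow described above: the function names are illustrative, the pooled within-group covariance is an assumed choice of scaling, and any coordinates passed in would be canonical variate scores produced elsewhere.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def pooled_covariance(groups):
    """Pooled within-group 2x2 covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for points in groups:
        cx, cy = centroid(points)
        for x, y in points:
            sxx += (x - cx) ** 2
            sxy += (x - cx) * (y - cy)
            syy += (y - cy) ** 2
        dof += len(points) - 1
    return sxx / dof, sxy / dof, syy / dof

def mahalanobis(a, b, cov):
    """Square root of d' S^-1 d, with the 2x2 inverse written out explicitly."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)
```

Scaling by the within-group covariance means that two centroids count as far apart only relative to the spread of the sequences around them.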

Hidden Markov Models

Hidden Markov Models (HMM) were built using HMMER 2.3.2 [29, 30]. Two sets of HMMs were built to cover very specific sets of sequences: Kingdom specific HMMs and combinatorial HMMs. The 1302 model dataset was first divided by Kingdom. HMMs were then separately constructed on the first and second helix for sequences from each Kingdom. Two more were constructed from the first and second helix of the entire model dataset. There are three Kingdom specific HMMs (Plant Helix 1 - Plant Helix 2; Animal Helix 1 - Animal Helix 2; Fungal Helix 1 - Fungal Helix 2). Combinatorial HMMs include a total of 16 HMMs, one for every combination of Helix 1 and Helix 2. Sequences were identified as bHLH if and only if a Helix 1 hit completely preceded a Helix 2 hit and both helices had scores greater than 0.1.
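The pairing of helix models and the acceptance rule can be sketched in Python. The `Hit` record and function name are hypothetical and do not correspond to HMMER output structures; the coordinates are assumed to be inclusive residue positions on the scanned protein.

```python
from itertools import product
from typing import NamedTuple

# Four Helix 1 models paired with four Helix 2 models gives 16 combinations.
SOURCES = ("Plant", "Animal", "Fungal", "All")
COMBINATIONS = list(product(SOURCES, SOURCES))

class Hit(NamedTuple):
    start: int    # first residue covered by the hit (inclusive)
    end: int      # last residue covered by the hit (inclusive)
    score: float

def is_bhlh(helix1: Hit, helix2: Hit, min_score: float = 0.1) -> bool:
    """True when Helix 1 ends before Helix 2 starts and both scores pass."""
    return (helix1.end < helix2.start
            and helix1.score > min_score
            and helix2.score > min_score)
```

The three Kingdom specific models correspond to the pairs in `COMBINATIONS` where both helices come from the same Kingdom.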


Testing Methods

The performance of classification models was determined by comparing predicted Kingdom origin to the actual origin in the 7692 test sequence set [18]. Positive and negative classification results were recorded for each model in 3x3 confusion matrices (data not shown), where sensitivity and specificity were measured using the One versus All approach [31]. The overall ability of each model to correctly identify results (accuracy) was then determined.
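The One versus All bookkeeping on a 3x3 confusion matrix can be sketched as follows. The function name is hypothetical, and defining a per-Kingdom accuracy as (TP + TN) / total is an assumption made here for illustration.

```python
def one_vs_all(confusion, labels):
    """confusion[i][j] = sequences of true class i that were predicted as class j."""
    total = sum(sum(row) for row in confusion)
    stats = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in confusion) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        stats[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / total,
        }
    overall = sum(confusion[k][k] for k in range(len(labels))) / total
    return stats, overall
```

Under this definition each per-Kingdom accuracy counts correct rejections as well as correct assignments, so it can be higher than the overall accuracy computed from the diagonal alone.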

Environmental Sample

Translated amino acid sequences from the Sargasso Sea marine sequencing project [32] were provided by JCVI (www.jcvi.org). This set included 6.125 × 10⁶ putative protein sequences. The sequences were scanned by InterPro Scan installed on compute clusters provided by the NC State Office of Information Technology High Performance Computing (www.hpc.ncsu.edu). BLAST [33] analyses were performed on selected environmental sequences using the non-redundant NCBI database, an e-value threshold of 10, a word size of 3, and the BLOSUM62 matrix.


RESULTS

Comparing conserved sites in bHLH domains for Plants, Animals, and Fungi

To quantitatively measure the conservation of amino acid sites in the bHLH domain we performed Boltzmann-Shannon entropy analyses [6, 22, 34] for 1302 plant (523), animal (279), and fungal (509) sequences (Figure 1). Within these analyses we grouped like amino acids to account for substitutions of functionally similar amino acids. Normalized group entropy values, ranging from 0 to 1, indicate conserved to variable sites, respectively. Finally, we visualized the frequency of amino acids per site in bit-score weblogos [35] and provided the published consensus sequences [8, 11, 18].

Overall, the conservation of amino acids by site of the bHLH domain was very similar in plants, animals and fungi (Table 1). The pattern of entropy in the three analyses was most similar for the first and second helices. Within the basic region, fungal sequences contained more conserved sites and lower entropy values in general.

Further inspection of the basic region revealed that animals shared several moderately conserved sites with fungi. These shared sites included 2, 9, 10, and 12 where the most frequent amino acid was R, E, R, and R, respectively. Plants had three strongly conserved sites in the basic region, including sites 10, 12, and 13. Of these highly conserved plant positions, site 13 was also highly conserved in fungi, while sites 10 and 12 were only moderately conserved in animals and fungi.

In the first helix, many sites were highly conserved in all three kingdoms. These included sites 16, 23 and 27. Site 27 was one of the most conserved in all three analyses, with L being found almost exclusively at this position. Sites 16 and 23 were composed of hydrophobic amino acids I, L, V, and M. Site 20 was moderately conserved in all three kingdoms with nearly the same set of amino acids (F, I, L, and M). Fungi shared conserved site 17 with Animals only, and site 28 only with Plants. Site 17 was only moderately conserved in Animals and Fungi, with R being the most frequently occurring amino acid. At site 28, at the end of the first helix, P occurred at a frequency of 92%, 88%, and 73% for Fungi, Plants, and Animals, respectively.

We identified six conserved sites in the second helix, shared between Animals, Plants, and Fungi, including positions 53, 54, 57, 60, 61, and 64. The most frequently occurring amino acids in these sites were the same for all three Kingdoms (Figure 1).

One of the most dramatic differences in conservation occurred at the beginning of the second helix (site 50), where animal and fungal sequences contained a K and plant sequences were much more variable. Finally, site 51 was the only conserved site shared exclusively between Plants and Animals; however, the most frequently occurring amino acids differed greatly: V, A, I, and L as compared to A, S, and V, respectively.

Decision Trees Analysis

Building a bifurcating decision tree based on observable characteristics has been the traditional method for classifying subjects. Using the model bHLH sequences, we built a decision tree based on the amino acids at given sites of the domain (Figure 2).


Starting from the entire 1302 sequence dataset, bifurcating steps were added until there were fewer than 10 sequences in a subset, the tree reached a depth of 4 steps, or the subset was almost completely composed of plant, animal, or fungal sequences.

The resulting decision tree provided a straightforward method for classifying sequences, requiring only eight steps. At step 1, site 8 effectively discriminated animal sequences from plant and fungal sequences, with the exception of animal groups D and E. Site 8 has previously been noted as a discerning site between animal groups [21]. Steps 2 and 4 split sequences based on sites 56 and 50, respectively. This separated plants from fungi and animal group D and E sequences. Finally, site 9 at step 8 split fungal sequences from animal groups D and E.

The decision tree accurately classified sequences as either plant, animal, or fungal. It had accuracies of 95.6% for Plants, 95.8% for Animals, and 94.3% for Fungi with an overall accuracy of 92.8% (Table 2). Thus, using only sites 8 and 9 (basic); 19 (Helix 1); and 50 and 56 (Helix 2), the decision tree was able to discern Kingdom of origin for bHLH sequences with an accuracy over 92%.

StepWise Discriminant Analysis

To measure the discerning ability of individual amino acid sites at determining plant, animal, or fungal origin, we performed a StepWise Discriminant Analysis (SWDA). In order to perform this analysis, alphabetic amino acid data were transformed into numerical values by utilizing five numerical indices (factor scores). The five factor scores (pah, pss, ms, cc, ec) were based on measured physicochemical amino acid properties [21]. The transformation resulted in five numeric values for each amino acid at every position in each bHLH sequence. An additional factor score set (all) was created from the combination of all five factor scores. The result was a total of six factor score transformed datasets, each containing 1302 numeric bHLH sequences.

To determine the most discriminating sites, numerical sequences were analyzed using the SWDA function in SAS (Supplemental Table S2), denoted as SWDA{factor score}. The discriminating power of each site was evaluated by its partial correlation (r²) and the accumulated Average Squared Canonical Correlation (ASCC). Factor Scores pah and ec performed the best in SWDA, explaining 70% of the among-group variance in only 20 steps each. In contrast, Factor Scores pss, ms, and cc could only obtain ASCC values of 57%, 61% and 64%, respectively, when using 36 amino acid sites. In the first three steps, pah and ec obtained 44% and 42% ASCC, respectively, and shared sites 2 and 13 within the top three discriminating amino acid sites.

When utilizing all, an ASCC of 80% was obtained using only 14 different amino acid sites within the 20 steps. The three most discerning sites, accounting for 49% of the among-group variability, were 5, 56 and 50. The most discerning site in this analysis, site 5, was transformed using the codon composition (cc) factor score. Sites 56 and 50 had been transformed by the ec and pah factor scores, respectively.


StepWise models were built from SWDA at cutoffs described in Supplemental Table S2. The 1302 model sequence dataset was then classified by the StepWise classification models and the results recorded in confusion matrices (not shown). The models provided overall accuracies ranging from 88.1% to 95.2% for individual factor scores (Table 2). SWDA{all} was the most accurate (95.2%) of the stepwise analyses and utilized only 20 amino acid sites.

Amino acid sites with partial correlation (r²) > 20% were found to be highly diagnostic of plant, animal, or fungal bHLH origin (Table 1). SWDA using the five factor scores (pah, pss, ms, cc, ec) identified between one and three discerning sites for each analysis. With the most accurate StepWise model, SWDA{all}, 4 sites (2, 5, 50, 56) were found to be highly discerning.

Canonical Variate Analysis

To leverage more discerning power, we utilized all bHLH amino acid sites simultaneously in a Canonical Variate Analysis (CVA) on the six numeric datasets, denoted CVA{factor score}. Figure 3 shows the discerning power of each factor score set by plotting the two canonical variates (eigenvectors), along with the square root of the Mahalanobis pairwise distance between the centroids of plant, animal, and fungal sequences. The visual distinction between groups was most evident in the CVA{all} (Figure 3). The Mahalanobis distances between plant, animal, and fungal sequences were greatest in this same analysis. Although the Mahalanobis distances between Plants, Animals, and Fungi were smaller in pah, pss, ms, cc, and ec, the sequences clearly clustered by their Kingdom of origin.

Both of the canonical variates (CV) in the factor score analyses were important in discerning between fungal, plant, and animal sequences (Table 3). The first CV in pah, pss, ms, cc, ec, and all explained 67%, 62%, 72%, 57%, 69%, and 62% of the total variation within their respective analyses. Because each analysis produced only two CVs, the second CV accounted for the remaining variance. Fungi and Animals were separated from Plants by the first CV in pah, ms, ec and all, with the second CV discerning between Fungi and Animals. In the pss analysis, the first CV separated out Plants and Fungi from Animals, while in cc, Plants and Animals were first separated from Fungi.

Classification models were then built from the CVA and used to classify the 1302 model sequence dataset. Accuracies of the CVA classification models ranged from 90.7% to 99.6% for Fungi, 91.5% to 100% for Plants, and 92.0% to 99.7% for Animals (Table 2). CVA{all} was the most accurate model with a total accuracy over 99%. Thus, through CVA, we effectively distinguished bHLH domains from Fungi, Animals, and Plants.

In addition to the first CV of pah, ms, ec, and all separating Plants from Animals and Fungi, the Mahalanobis distances between animal and fungal bHLH domains were smaller in each CVA than the distances to plant domains. Taken together, these findings suggest, as expected, that Animals and Fungi are more closely related to each other than they are to Plants.


Many of the top ranked sites of each CV, based on the magnitude of CV coefficients (Supp Table 8), were identified as major discerning sites in SWDA. Sites 2, 5, 12, 13, 50 and 51 were identified by SWDA and CVA within the same factor score transformed datasets. For example, sites 2 and 5 were highly discerning when using the cc transformation. Likewise for pss, the most discerning sites were 12 and 13. Site 51 was discerning for the ec dataset in both analyses, as were sites 13 and 50 in the pah dataset. While several sites were discerning for the ms dataset, none were shared by CVA and SWDA. Each of the most discerning sites had known molecular characteristics consistent with the biological roles of the factor scores. Sites 2 and 5, discerning sites in cc, contact DNA and are crucial for correct E-box binding. Sites 12 and 13 are located at the transition between secondary structures, from the basic region to Helix 1. Last, sites 13 and 50 play crucial roles at the start of their helices, and were unsurprisingly detected as discerning sites in the pah analyses. Thus, through CVA, we accurately discerned Kingdom origin and linked highly discriminating sites to their conserved molecular functions.

Classification Model Testing

To evaluate the effectiveness of the different classification models in discerning Kingdom origin, we assessed their accuracy on a set of 6987 test bHLH domains (Table 2). This dataset was composed of plant, animal, and fungal sequences not used to build models. The overall accuracy for each test was above 89% for Plants and lower, at just over 79%, for Animals or Fungi. CVA{all} was the most accurate of the tests with an overall score of 95.6%. However, this test requires an amino acid in every position (no gaps) of the basic region, Helix 1 and Helix 2, and could not classify 6.8% of the 6987 sequences. SWDA{all}, which used only 20 positions, was able to classify a further 86 sequences that CVA{all} could not. While CVA{all} had a higher accuracy compared to SWDA (88.2%), SWDA may be preferable when CVA{all} cannot classify a given sequence.
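The trade-off between the two models suggests a simple fallback rule, sketched below in Python. This is a hypothetical illustration rather than the logic of the published tool: the site ranges are assumed from the numbering in Table 1, and the four SWDA sites listed are placeholders for the full set of positions, which is given only in the supplemental material.

```python
GAP = "-"

# Assumption: aligned positions follow the site numbering of Table 1, with the
# basic region and Helix 1 at sites 1-28 and Helix 2 at sites 50-64.
CVA_SITES = list(range(1, 29)) + list(range(50, 65))
# Placeholder: stands in for the full list of SWDA{all} positions.
SWDA_SITES = [2, 5, 50, 56]

def choose_model(aligned_domain):
    """Return the model able to score this aligned domain, or None."""
    def complete(sites):
        return all(aligned_domain[s - 1] != GAP for s in sites)
    if complete(CVA_SITES):
        return "CVA{all}"   # preferred: highest accuracy, needs every site
    if complete(SWDA_SITES):
        return "SWDA{all}"  # fallback: needs only its own subset of sites
    return None
```

A gap at a site that only CVA uses sends the sequence to SWDA, while a gap at one of the SWDA sites leaves it unclassified by both models.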

bHLH Origins from Environmental Samples

Classification of bHLH sequences of unknown origin is one possible application of these classification models. To test such an application, we obtained a publicly available marine environmental sample with >6.125 million amino acid sequences [32].

We then ran InterPro Scan [36] to identify the presence of the bHLH domain in these sequences. This resulted in only 10 sequences being annotated with the IPR001092 signature. These 10 sequences were then extracted and classified using SWDA{all} and CVA{all}. Additionally, we investigated Kingdom origin through the best BLAST hit for each sequence (Table 4). For 6 of the 10 sequences, CVA{all} and BLAST agreed on Kingdom origin. Three of the remaining four sequences could not be compared directly to the CVA{all} results as they had very low identity and coverage in the BLAST analysis. The hit with the best coverage and identity (ECU95545.1) was identified as an Ostreococcus lucimarinus (algae) sequence by BLAST. CVA{all} classified this same sequence as animal; however, the Mahalanobis distance between the animal centroid and the algae sequence was greater than that between the animal and plant centroids (Figure 3). This distance is not unexpected, as many algae bHLH sequences have been shown to be highly divergent from other plant bHLH sequences [9], and only a handful of algae organisms were used to build the classification models.

To further explore the environmental sample, we built several HMMs using the alignment of the 1302 sequences in the expert data set. Each of these models was very accurate (99%) in correctly identifying the origin of sequences in the 6987 test data set (Table 5). To further test the ability of the HMMs, the complete proteomes of Magnaporthe oryzae and Arabidopsis thaliana were scanned and compared to bHLH annotations from InterPro. Using Kingdom specific models, 18 and 268 bHLH sequences were identified in M. oryzae and A. thaliana, respectively. Within M. oryzae, the fungal HMM identified 10 bHLH sequences, the same set available with bHLH annotations from InterPro. The A. thaliana scan with the plant HMM identified 256 sequences, including the 234 sequences annotated by InterPro.

Scanning the environmental sample with the combinatorial HMMs resulted in the identification of 376 bHLH domains, 8 of which were also identified by InterPro. Of the 376 sequences, approximately twice as many were assigned to plant or animal origin as were assigned to fungal origin by our classification models (Table 6). Only 117 sequences were uniquely assigned to plant (68), animal (44), or fungal (5) origin by the Kingdom specific HMMs (Figure 4). Eight sequences were found by more than one Kingdom model, all of which were also identified by InterPro. Although more than a third of the 117 sequences could not be assigned by our classification models, where an assignment could be made, the models largely supported the Kingdom origin assigned by the HMM (Table 7). For example, 26 of 32 sequences identified by the animal HMM were assigned as animal by SWDA and CVA. These findings suggest that our HMMs are useful, in combination with our classification models, for determining the origin of bHLH domains in complex samples.
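Separating uniquely assigned sequences from those found by more than one Kingdom model is a set operation, sketched here in Python. The function name and the sequence identifiers used to exercise it are hypothetical; the input is assumed to map each Kingdom model to the set of sequence identifiers it detected.

```python
from collections import Counter

def split_by_uniqueness(hits_by_model):
    """hits_by_model: {model name: set of sequence IDs found by that model}."""
    # Count how many Kingdom models detected each sequence.
    counts = Counter(seq for hits in hits_by_model.values() for seq in hits)
    unique = {model: {seq for seq in hits if counts[seq] == 1}
              for model, hits in hits_by_model.items()}
    shared = {seq for seq, n in counts.items() if n > 1}
    return unique, shared
```

The sizes of the per-model unique sets and the shared set correspond to the counts reported for the Kingdom specific HMMs above.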


DISCUSSION

Using a decision tree, SWDA and CVA models, we found that only a few amino acid sites were necessary to classify bHLH domains by Kingdom. In the decision tree, only sites 8, 9, 19, 50 and 56 were needed to accurately classify sequences as Plant, Animal or Fungal. However, decision trees have many solutions and therefore do not uniquely identify discerning sites. Thus, we used discriminant analysis. SWDA and CVA were performed on the numerically (factor score) transformed bHLH data, which revealed 5 highly discerning sites in common: 2, 5, 12, 13, and 50 (Table 1). It is noteworthy that four of the five sites are in the basic region and that all five (including site 50) are known to contact either the phosphate backbone or the DNA based on animal bHLH crystal structures [5, 37, 38]. From Figure 1, we also observed that the group entropies at these sites were distinct. For example, site 50 was highly conserved in Animals and Fungi, but more variable in Plants. These differences in conservation were characteristic of all highly discerning sites shared by CVA and SWDA.

As shown in Table 1, CVA typically identified discerning sites more readily than SWDA. For example, CVA, but not SWDA, identified a set of 6 amino acid sites spanning Helix 1 and Helix 2 that were highly discerning. Sites 16, 20, 23, 27, 53 and 54 were each determined to be discriminating by CVA{all}, and by at least one of the other CVA models. Each of these sites is important to the molecular architecture of the bHLH as a buried site within the helices [4, 39–41]. We also observed that each site was moderately to highly conserved in Plants, Animals, and Fungi. However, differences in the degree of conservation between the Kingdoms were not clearly evident at each of these sites. Our findings suggest that the CVA model is more sensitive than SWDA to variations of conservation. One possible explanation for the lack of sensitivity of the SWDA model is that SWDA considers each site separately, whereas CVA considers all sites simultaneously.

We established that the decision tree, SWDA and CVA models were highly accurate using the model data set (Table 2). On the test dataset, the decision tree’s accuracy fell by more than 12% for Animals and Fungi, but remained high for Plants. SWDA{all} accuracy also declined by 8% for Animals and Fungi, while improving for Plants. Last, CVA{all} exhibited this same behavior; however, the drop in accuracy for Animals and Fungi was only about 4%. As expected, accuracies declined using the test dataset, but overall remained high, particularly for Plants.

The discrepancy in accuracies between the tests may be due to underrepresentation of animal and fungal organisms in the model dataset. Fungal sequences in the model dataset only included bHLH proteins from completed genome projects, encompassing the Ascomycota and Basidiomycota phyla. Other phyla, such as Zygomycota and Chytridiomycota, were not represented in the fungal model dataset. Model animal sequences, obtained from published work, are known to be highly representative of higher order Metazoans. However, this set of animal model bHLH domains lacks sequences from ancestral (early Metazoan) organisms. The higher accuracies for Plants using the test dataset may reflect a lack of variation in the available sequences, in particular a lack of sequences from algae, as evidenced by the inability to classify an algae sequence in the environmental sample. Upon further examination, this sequence was quite distant from any group and would likely have been classified as Plant if more algae sequences had been incorporated into the model dataset.

From our testing of classification models, it was evident that CVA{all} was the most accurate of all the analyses. However, CVA{all} does not tolerate missing data (gaps), and could not classify nearly 7% of the test sequences. Since SWDA{all} requires fewer sites, it was more tolerant of missing data, classifying 86 sequences that CVA could not. Thus, even with lower accuracies, SWDA is the preferable classification model in instances of missing data.

Accuracy measurements established the effectiveness of the classification models. However, the CVA and SWDA models are not designed simply to classify sequences; they also elucidate the underlying architecture of the bHLH domain (discerning sites). Other classification methods, such as phylogeny or BLAST based methods, may accurately classify sequences to the correct Kingdom origin, but ignore these fundamental attributes. Furthermore, CVA and SWDA models have been used to classify highly divergent sequences that BLAST failed to identify [18].

The models we have developed are valuable for identification of bHLH domains of unknown origins within large datasets, such as an environmental sample. Using InterPro Scan, only 10 bHLH sequences were identified. However, 376 sequences were identified by our combinatorial HMMs (each combination of Plant, Animal, Fungal, and all Helix 1 HMMs to Helix 2 HMMs).

When applied to the 376 sequences, the 125 Kingdom specific HMM selections, and the 10 InterPro Scan selections, CVA and SWDA classified twice (or more) as many Animal and Plant sequences as Fungal. This may be due to the sample containing fewer fungal organisms or may be a reflection of the sampling technique used to collect the sample (i.e. passing the sample through a 10 μm filter).

The environmental sample contained 6.125 × 10⁶ protein sequences, yet our scans found that only 10 to 376 contain the bHLH domain. This is likely due to a combination of many factors, including: 1) 1.2 × 10⁶ sequences contained fewer than 100 amino acids, 2) sequencing and assembly errors, 3) incorrect translations of DNA, and 4) the 10 μm filter removing the majority of organisms that are represented in the model building set.

In conclusion, we have shown that SWDA and CVA classification models can be used to study the underlying architecture of protein sequences and to accurately classify unknown sequences, particularly in high-throughput compute environments. All models and data have been incorporated into an open source online tool at www.fungalgenomics.ncsu.edu.


SUPPLEMENTARY MATERIAL

Supplementary Tables S1, S2, and S3 and Supplementary Data 1 are available at BMC Bioinformatics online (http://www.biomedcentral.com/bmcbioinformatics).

ACKNOWLEDGEMENTS

We would like to thank members of the Center for Integrated Fungal Research for their critical comments and discussion. This work was supported by a grant to the Bioinformatics Research Center of North Carolina State University from the National Institutes of Health and a grant to RAD from the National Science Foundation: MCB-0731808.


REFERENCES

1. Massari ME, Murre C: Helix-loop-helix proteins: regulators of transcription in eucaryotic organisms. Mol. Cell. Biol 2000, 20:429-440.

2. Jones S: An overview of the basic helix-loop-helix proteins. Genome Biol 2004, 5:226.

3. Castillon A, Shen H, Huq E: Phytochrome Interacting Factors: central players in phytochrome-mediated light signaling networks. Trends Plant Sci 2007, 12:514-521.

4. Ma PC, Rould MA, Weintraub H, Pabo CO: Crystal structure of MyoD bHLH domain-DNA complex: perspectives on DNA recognition and implications for transcriptional activation. Cell 1994, 77:451-459.

5. Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, Oshima Y, Hakoshima T: Crystal structure of PHO4 bHLH domain-DNA complex: flanking base recognition. EMBO J 1997, 16:4689-4697.

6. Atchley WR, Terhalle W, Dress A: Positional dependence, cliques, and predictive motifs in the bHLH protein domain. J. Mol. Evol 1999, 48:501-516.

7. Ledent V, Vervoort M: The basic helix-loop-helix protein family: comparative genomics and phylogenetic analysis. Genome Res 2001, 11:754-770.

8. Atchley WR, Fitch WM: A natural classification of the basic helix-loop-helix class of transcription factors. Proc. Natl. Acad. Sci. U.S.A 1997, 94:5172-5176.

9. Pires N, Dolan L: Origin and Diversification of Basic-Helix-Loop-Helix Proteins in Plants. Molecular Biology and Evolution 2009, 27:862-874.

10. Fairman R, Beran-Steed RK, Anthony-Cahill SJ, Lear JD, Stafford WF 3rd, DeGrado WF, Benfield PA, Brenner SL: Multiple oligomeric states regulate the DNA binding of helix-loop-helix peptides. Proc. Natl. Acad. Sci. U.S.A 1993, 90:10429-10433.

11. Carretero-Paulet L, Galstyan A, Roig-Villanova I, Martinez-Garcia JF, Bilbao-Castro JR, Robertson DL: Genome-Wide Classification and Evolutionary Analysis of the bHLH Family of Transcription Factors in Arabidopsis, Poplar, Rice, Moss, and Algae. Plant physiology 2010, 153:1398.


12. Ni M, Tepperman JM, Quail PH: PIF3, a phytochrome-interacting factor necessary for normal photoinduced signal transduction, is a novel basic helix-loop-helix protein. Cell 1998, 95:657-667.

13. Friedrichsen DM, Nemhauser J, Muramitsu T, Maloof JN, Alonso J, Ecker JR, Furuya M, Chory J: Three redundant brassinosteroid early response genes encode putative bHLH transcription factors required for normal growth. Genetics 2002, 162:1445-1456.

14. Liljegren SJ, Roeder AHK, Kempin SA, Gremski K, Østergaard L, Guimil S, Reyes DK, Yanofsky MF: Control of fruit patterning in Arabidopsis by INDEHISCENT. Cell 2004, 116:843-853.

15. Menand B, Yi K, Jouannic S, Hoffmann L, Ryan E, Linstead P, Schaefer DG, Dolan L: An ancient mechanism controls the development of cells with a rooting function in land plants. Science 2007, 316:1477-1480.

16. Buck MJ, Atchley WR: Phylogenetic Analysis of Plant Basic Helix-Loop-Helix Proteins. Journal of Molecular Evolution 2003, 56:742-750.

17. Heim MA, Jakoby M, Werber M, Martin C, Weisshaar B, Bailey PC: The basic helix-loop-helix transcription factor family in plants: a genome-wide study of protein structure and functional diversity. Mol. Biol. Evol 2003, 20:735-747.

18. Sailsbery J, Atchley W, Dean D: Phylogenetic Analysis and Classification of the Fungal bHLH Domain. Molecular Biology and Evolution TBA.

19. Atchley WR, Zhao J, Fernandes AD, Drüke T: Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. U.S.A 2005, 102:6395-6400.

20. Johnson RA, Wichern DW: Applied Multivariate Statistical Analysis. 5th edition. Prentice Hall; 2001.

21. Atchley WR, Zhao J: Molecular Architecture of the DNA-Binding Region and Its Relationship to Classification of Basic Helix–Loop–Helix Proteins. Molecular biology and evolution 2007, 24:192.

22. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol. Biol. Evol 2000, 17:164-178.

23. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJA, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: InterPro: the integrative protein signature database. Nucleic Acids Research 2009, 37:D211-D215.

24. Wang Z, Atchley WR: Spectral analysis of sequence variability in basic-helix-loop-helix (bHLH) protein domains. Evol. Bioinform. Online 2006, 2:187-196.

25. Atchley WR, Fernandes AD: Sequence signatures and the probabilistic identification of proteins in the Myc-Max-Mad network. Proceedings of the National Academy of Sciences of the United States of America 2005, 102:6401.

26. Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees. 1st edition. Chapman and Hall/CRC; 1984.

27. HDMD: Statistical Analysis Tools for High Dimension Molecular Data [http://cran.r-project.org/web/packages/HDMD/].

28. R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2009.

29. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14:755-763.

30. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer EL: Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res 1999, 27:260-262.

31. Rifkin R, Klautau A: In defense of one-vs-all classification. The Journal of Machine Learning Research 2004, 5:101–141.

32. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers Y-H, Smith HO: Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004, 304:66-74.

33. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology 1990, 215:403–410.

34. Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc. Natl. Acad. Sci. U.S.A 2000, 97:3288-3291.

35. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome research 2004, 14:1188.


36. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Research 2005, 33:W116-W120.

37. Ellenberger T, Fass D, Arnaud M, Harrison SC: Crystal structure of transcription factor E47: E-box recognition by a basic region helix-loop-helix dimer. Genes Dev 1994, 8:970-980.

38. Ferré-D’Amaré AR, Prendergast GC, Ziff EB, Burley SK: Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature 1993, 363:38-45.

39. Brownlie P, Ceska T, Lamers M, Romier C, Stier G, Teo H, Suck D: The crystal structure of an intact human Max-DNA complex: new insights into mechanisms of transcriptional control. Structure 1997, 5:509-520.

40. Párraga A, Bellsolell L, Ferré-D’Amaré AR, Burley SK: Co-crystal structure of sterol regulatory element binding protein 1a at 2.3 Å resolution. Structure 1998, 6:661-672.

41. Ferré-D’Amaré AR, Pognonec P, Roeder RG, Burley SK: Structure and function of the b/HLH/Z domain of USF. EMBO J 1994, 13:180-189.


Table 1 Structural attributes and significant sites of the bHLH domain Site Structural Attributes CS-F CS-A CS-P DT pah pss ms cc ec all 1 Contacts DNA Contacts phosphate backbone 2 Contacts phosphate backbone √ * S SC S S 3 4 5 Contacts DNA √ SC S 6 Contacts phosphate backbone S C 7 8 Contacts DNA * √ Contacts phosphate backbone 9 Contacts DNA √ * √ C 10 Contacts phosphate backbone * * √ C C 11 Contacts phosphate backbone * 12 Contacts phosphate backbone * * √ SC C C 13 Contacts DNA √ √ SC SC S C Contacts phosphate backbone 14 15 Contacts phosphate backbone * 16 Buried Site √ √ √ C C 17 Contacts phosphate backbone * * 18 19 * √ 20 Buried Site * * * C C C 21 22 23 Buried Site √ √ √ C C C C 24 * 25 26 * √ 27 Buried Site √ √ √ C C 28 Buried Site √ √

Table 1-Structural attributes and significant sites of the bHLH domain


Table 1 (continued) Site Structural Attributes CS-F CS-A CS-P DT pah pss ms cc ec all 50 Contacts DNA √ √ √ SC C C SC Contacts phosphate backbone Buried Site 51 √ * C C 52 53 Buried Site √ * √ C C 54 Buried Site √ √ √ C C C 55 S 56 √ S S 57 * * √ C 58 √ 59 60 √ * * 61 Buried Site √ √ √ C 62 63 64 Buried Site √ √ √

NOTE.–The molecular architecture of bHLH positions is compiled from previous work on crystal structures of animal proteins. Highly (√) and moderately (*) conserved sites are denoted for plant (CS-P), animal (CS-A), and fungal (CS-F) sequences. Sites integral to the decision tree analysis (DT) are also reported. Last, sites significant in the SWDA (S) and CVA (C) are shown for each factor score dataset (pah, pss, ms, cc, ec, and all).


Table 2 Validation of bHLH classification methods

                       Decision  SWDA                                CVA
Dataset  Kingdom       Tree      pah   pss   ms    cc    ec    all   pah   pss   ms    cc    ec    all
1302     Plant         95.6      96.5  90.7  96.5  92.7  96.8  97.3  97.9  91.5  97.0  92.1  97.7  100
1302     Animal        95.8      94.2  94.0  91.1  93.4  94.8  97.2  95.7  94.3  92.0  94.3  96.0  99.7
1302     Fungal        94.3      92.9  91.6  90.4  90.6  94.6  95.8  94.0  90.7  90.8  90.7  95.0  99.6
1302     Total         92.8      91.8  88.1  89.0  88.4  93.1  95.2  93.8  88.2  89.9  88.5  94.4  99.6
1302     Unclassified  5         62    60    65    74    65    61    80    80    80    80    80    80
6987     Plant         94.9      96.6  86.6  95.0  93.4  96.0  97.9  97.5  89.1  96.3  92.8  98.6  99.3
6987     Animal        81.7      81.4  79.5  79.1  81.9  81.2  89.3  84.0  81.1  81.0  82.7  82.4  95.8
6987     Fungal        82.7      82.0  84.8  81.6  84.1  82.4  89.2  84.6  85.4  82.3  85.3  82.6  96.1
6987     Total         79.6      80.0  75.4  77.9  79.7  79.8  88.2  83.1  77.8  79.8  80.4  81.7  95.6
6987     Unclassified  37        404   424   447   461   422   395   481   481   481   481   481   481

NOTE.–The accuracies (%) are reported for several classification models, including the decision tree analysis, SWDAs, and CVAs. The first measurements are based on the 935 plant, animal, and fungal sequences used in building the models. The second set assesses the models with the 7692 sequence set, which was not used in building the models. The number of sequences that could not be classified (Unclassified) by each model is also provided.

Table 2-Validation of bHLH classification methods


Table 3 Discerning ability of determined canonical variates

Factor Score  CV  Var    Grps  Amino Acid Sites  all Amino Acid Sites
pah           1   66.5%  P     50, 54            9, 13, 20, 23, 50, 53, 57
pah           2   33.5%  A, F  13, 23            10, 16, 27
pss           1   61.9%  A     12, 13            6, 13, 20, 23, 54
pss           2   38.1%  P, F  23, 53            10, 23
ms            1   71.6%  P     20, 50            13, 20, 23, 54
ms            2   28.4%  A, F  12, 23            12, 13, 16, 23
cc            1   56.6%  F     5, 10             13
cc            2   43.4%  P, A  2, 50             10, 12
ec            1   69.1%  P     20, 27, 51, 54    9, 13, 16, 20, 23, 27, 51, 54, 57, 61
ec            2   30.9%  A, F  16, 20, 54        12, 20, 27, 51, 54
all           1   62.4%  P     *                 -
all           2   37.6%  A, F  *                 -

NOTE.–The variance between plant, animal, and fungal sequences explained by each canonical variate (CV) is reported. The type of sequences (Grps) that are separated by each CV is also shown. Within each CV, the top 2 amino acid sites (as ranked by weight) and those sites with weights >1 are provided. CV1 and CV2 in the all analysis had 27 and 16 discerning sites, respectively (all Amino Acid Sites).

Table 3-Discerning ability of determined canonical variates


Table 4 Classification of unknown bHLH sequences

                                                           BLAST          Mahal. Dist.
Seq Id      Species Name (Common)                          Cov.  Ident.  Animal  Fungi  Plant  Kingdom (Prob)
ECU95545.1  Ostreococcus lucimarinus (green algae)         93%   90%     7.60    13.6   11.0   Animal (100%)
ECR14344.1  Sus scrofa (pig)                               87%   32%     4.90    9.18   8.22   Animal (100%)
ECB02340.1  Micromonas sp. RCC299 (green algae)            85%   68%     7.58    8.58   4.93   Plant (100%)
ECZ39823.1  Ostreococcus tauri (green algae)               79%   53%     9.14    6.53   3.51   Plant (100%)
EBY00838.1  Monosiga brevicollis (choanoflagellates)       49%   66%     3.34    8.02   14.3   Animal (100%)
EBG07815.1  Methanocaldococcus vulcanius (euryarchaeotes)  38%   29%     1.62    9.99   13.2   Animal (100%)
EBZ65459.1  Branchiostoma belcheri (Japanese lancelet)     36%   38%     2.31    10.2   12.1   Animal (100%)
ECI19714.1  Daphnia pulex (common water flea)              34%   44%     7.36    2.60   12.0   Fungal (100%)
ECP37863.1  Aedes aegypti (yellow fever mosquito)          30%   37%     4.63    4.81   8.19   Animal (52%)
EBW92904.1  Cryptococcus neoformans (basidiomycetes)       15%   38%     10.9    7.74   2.11   Plant (100%)

NOTE.–Ten unknown bHLH sequences from a marine environmental sample, as identified by InterPro. The attributes of the best BLAST hit are reported (species, coverage, identity). The results of the CVA{all} are also reported, along with the Mahalanobis distance of the bHLH sequence to the Plant, Animal, and Fungal Kingdoms. The predicted Kingdom of origin and its probability are also reported.

Table 4-Classification of unknown bHLH sequences


Figure 1–Animal, Plant, and Fungi bHLH entropies, logos, and consensus. A. The bHLH normalized group entropy by position. Lower values indicate conservation, while values close to one indicate high variability. B. The graphical representation of the amino acids at each position of the bHLH domain. Symbols representing amino acids are scaled by their bit score (derived from entropy) at a given position. C. The 50-10 consensus sequences for Animal, Plant, and Fungi. Using an alignment of bHLH domains within each Kingdom, amino acids occurring at a frequency of more than 50% at a given site are displayed. At each of these sites, additional amino acids are displayed if they are conserved in 10% or more of the sequences. The Plant and Animal consensus sequences are derived from previous work (Carretero-Paulet et al., 2010; Atchley et al., 1999). Boundaries between bHLH subdomains are shown at the bottom.

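[Figure 1 image not reproduced. Panel A: entropy by position. Panel B: sequence logos for Animal, Plant, and Fungi. Panel C: consensus sequences for Animal, Plant, and Fungi, with the subdomain boundaries Basic, Helix1, Loop, and Helix2.]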

Figure 2–Decision tree describing the classification of Plant, Animal, and Fungal bHLH sequences by amino acid sites found in the bHLH domain. Each box of the figure represents a step in the decision tree, which consists of a number of bHLH sequences from each Kingdom and the amino acids at a given bHLH position (state). The sample size and proportion of group representatives are provided in the accompanying table. Diamonds contain the bHLH amino acid site that bifurcates the data into subsets of the previous state. The analysis is similar to a dichotomous taxonomic key.


[Figure 2 image not reproduced. Tree structure: node 1 splits on Site 8 into nodes 2 and 3; node 2 splits on Site 56 into nodes 4 and 5; node 3 splits on Site 19 into nodes 6 and 7; node 4 splits on Site 50 into nodes 8 and 9; node 7 splits on Site 56 into nodes 10 and 11; node 8 splits on Site 9 into nodes 12 and 13.]

Node (Site)  State            Animal       Plant        Fungi
1            -                21.4%  279   39.5%  514   39.1%  509
2 (8)        ACDGHINSTVWY     5.9%   59    43.5%  435   50.6%  506
3 (8)        EFKLMPQR         72.8%  220   26.2%  79    1.0%   3
4 (56)       ACFGHIKLMNQRSTV  8.4%   56    16.7%  111   74.9%  499
5 (56)       ED               0.9%   3     97.0%  324   2.1%   7
6 (19)       GASCE            97.7%  209   1.4%   3     0.9%   2
7 (19)       HKLMNQRTY        12.5%  11    86.4%  76    1.1%   1
8 (50)       KRE              9.2%   53    4.9%   28    86.0%  496
9 (50)       ILMNPQSTVY       3.4%   3     93.3%  83    3.4%   3
10 (56)      ED               1.5%   1     98.5%  66    0.0%   0
11 (56)      ACGHKLQRY        47.6%  10    47.6%  10    4.8%   1
12 (9)       EPGDHQ           5.9%   33    5.0%   28    89.0%  495
13 (9)       A                95.2%  20    0.0%   0     4.8%   1
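The tree in Figure 2 can be read as a dichotomous key, and the following Python sketch walks a sequence through it. The split sites and amino acid states are taken from the node table; everything else is an assumption made for illustration: the function name `classify`, the input convention (a string in which character i is the residue at bHLH site i), the majority-vote labels at the terminal nodes, and the choice to return `None` for the tied node 11 or for residues not listed at a split.

```python
# Split rules read from the Figure 2 node table: node -> (site, {child: states}).
SPLITS = {
    1: (8, {2: "ACDGHINSTVWY", 3: "EFKLMPQR"}),
    2: (56, {4: "ACFGHIKLMNQRSTV", 5: "ED"}),
    3: (19, {6: "GASCE", 7: "HKLMNQRTY"}),
    4: (50, {8: "KRE", 9: "ILMNPQSTVY"}),
    7: (56, {10: "ED", 11: "ACGHKLQRY"}),
    8: (9, {12: "EPGDHQ", 13: "A"}),
}

# Assumption: each terminal node is labeled with its majority Kingdom.
# Node 11 is an Animal/Plant tie in the table, so it is left unlabeled.
LEAVES = {5: "Plant", 6: "Animal", 9: "Plant", 10: "Plant",
          12: "Fungi", 13: "Animal"}


def classify(domain):
    """Walk an aligned bHLH domain through the tree.

    `domain` is a string whose character i (1-based) is the residue at site i.
    Returns a Kingdom name, or None when the sequence cannot be classified.
    """
    node = 1
    while node in SPLITS:
        site, children = SPLITS[node]
        residue = domain[site - 1]
        for child, states in children.items():
            if residue in states:
                node = child
                break
        else:
            return None  # residue not listed at this split
    return LEAVES.get(node)
```

A sequence with E at site 8 and G at site 19, for example, follows nodes 1, 3, and 6 and is labeled Animal.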


Figure 3–Projection of Plant, Animal, and Fungal (PAF) sequences onto canonical vectors for the factor score transformed datasets. Each plot contains the two canonical variates from the CVA and the square root of the pairwise Mahalanobis distance between the centroids of Plants, Animals, and Fungi.


Figure 4–HMM results from the test dataset and the marine environmental sample. A. Overlap of bHLH sequences found with the Kingdom-specific HMMs on the environmental dataset. B. The number of bHLH sequences found using Kingdom-specific HMMs on the 6987-sequence test dataset. C. SWDA and CVA classification of the 376 bHLH sequences identified by combinatorial HMMs on the environmental data. D. Classification of the Kingdom-specific HMM sequences on the environmental sample. SWDA and CVA results are reported as “Correctly Classified (Total Classified)” for a given set of Kingdom-specific sequences.


A
[Overlap diagram not reproduced.]

B
Kingdom   Animal  Fungal  Plant  Possible
Animal    4704    4134    3787   4722
Fungal    304     363     275    365
Plant     1481    1466    1895   1900

C
Kingdom   SWDA  CVA
Animal    103   112
Fungal    61    43
Plant     136   109
Unclass.  76    112

D
Kingdom   HMM  SWDA     CVA
Animal    44   26 (36)  26 (32)
Fungal    5    2 (5)    0 (5)
Plant     68   22 (35)  22 (28)
Unclass.  -    41       52


Chapter 4

Diverse and tissue-enriched small RNAs in the plant pathogenic fungus, Magnaporthe oryzae

Cristiano C. Nunes1, Malali Gowda1,2, Joshua Sailsbery1, Minfeng Xue1,3, Feng Chen4, Douglas Brown1, YeonYee Oh1, Thomas K. Mitchell5, Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC, 27606, USA
2 Next-Generation Genomics Laboratory, Center for Cellular and Molecular Platform, NCBS-GKVK Campus, Bangalore 560065, India
3 Department of Plant Pathology, China Agricultural University, Beijing 1000193, PR China
4 US DOE Joint Genome Institute, Walnut Creek, CA 94598, USA
5 Department of Plant Pathology, The Ohio State University, Columbus, OH 43210, USA

BMC Genomics 2011, 12:288

* To whom correspondence should be addressed.

Ralph A. Dean 851 Main Campus Drive, Suite 233 Centennial Campus Center for Integrated Fungal Research North Carolina State University Raleigh, NC 27606 Tel.: +1-919-513-0020; Fax: +1-919-513-0024; Email: [email protected]


SUMMARY

As a member of the Fungal Genomics Lab, I had the opportunity to work with large amounts of sequence data from a variety of sources, including small RNAs (15-40 nt). Together with Dr. Cristiano Nunes and several members of the lab, I studied small RNAs from two tissue types of Magnaporthe oryzae. Small RNA samples were prepared from two pathogenically important stages of M. oryzae, mycelia and appressoria (tissues). Small RNAs were sequenced by next-generation sequencing technologies and then characterized through a variety of bioinformatic methods.

The sequence libraries for these tissue types were very different, and small RNAs from the two tissues showed enrichment for different genomic features. Mycelia RNAs were enriched for intergenic and repetitive elements, while a higher proportion of appressoria RNAs were enriched for tRNA loci. We also observed differential mapping of small RNAs to the 5’ and 3’ halves of mature tRNAs. From these observations we were able to identify sites of post-transcriptional modification within the tRNAs and showed a difference in that modification between the two tissues. My contributions included mapping and prorating of small RNAs from both libraries to genomic features and expressed sequences (EST, MPSS, SAGE, CPA-sRNA). This work also included characterization of the sequence libraries, GBrowse visualization of sequence data on the M. oryzae genome, and the generation of tRNA logos with small RNA frequencies. From these analyses, I produced Tables 2-5, Figure 8, and Additional Files 3, 5-7 and provided parts of Figures 2, 4, 5, 7, 10 and Additional File 2 in the published manuscript provided in this chapter.


ABSTRACT


RESULTS


Table 1-Summary statistics of small RNA libraries from mycelia and appressoria tissues


Table 2-Distribution of mycelia small RNAs with perfect map to nuclear and mitochondria genomes


Table 3-Distribution of appressoria small RNAs with perfect map to nuclear and mitochondria genomes


Table 4-Association of mycelia small RNAs with ESTs and ESS

Table 5-Association of appressoria small RNAs with ESTs and ESS


DISCUSSION


CONCLUSIONS


REFERENCES


Chapter 5

Genome-wide characterization of methylguanosine-capped and polyadenylated small RNAs in the rice blast fungus Magnaporthe oryzae

Malali Gowda1,2, Cristiano C. Nunes1, Joshua Sailsbery1, Minfeng Xue1,3, Feng Chen4, Cassie A. Nelson5, Douglas E. Brown1, Yeonyee Oh1, Shaowu Meng1, Thomas Mitchell6, Curt H. Hagedorn5 and Ralph A. Dean1

1Fungal Genomics Laboratory, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC 27606, 2Plant Biology, Michigan State University, East Lansing, MI 48824, USA, 3Department of Plant Pathology, China Agricultural University, Beijing 100094, China, 4US DOE Joint Genome Institute, Walnut Creek, CA 94598, 5University of Utah School of Medicine and Huntsman Cancer Institute, Salt Lake City, UT 84132 6Department of Plant Pathology, Ohio State University, Columbus OH 43210, USA

Nucleic Acids Research, 2010, Vol. 38, No. 21, pages 7558 – 7569

* To whom correspondence should be addressed.

Ralph A. Dean 851 Main Campus Drive, Suite 233 Centennial Campus Center for Integrated Fungal Research North Carolina State University Raleigh, NC 27606 Tel.: +1-919-513-0020; Fax: +1-919-513-0024; Email: [email protected]


SUMMARY

During the course of my Ph.D. dissertation, I had the opportunity to work with many new and exciting sequence datasets. Together with Dr. Malali Gowda and several members of the Fungal Genomics Lab, I discovered and characterized methylguanosine-capped and polyadenylated small RNAs (CPA-sRNAs) from Magnaporthe oryzae. CPA-sRNAs were prepared from mycelia tissue, sequenced by next-generation sequencing technologies, and then characterized through a variety of bioinformatic methods.

CPA-sRNAs were found to be predominantly associated with transcriptional start and termination sites of protein-coding genes. Enrichment for CPA-sRNAs, especially among genes encoding ribosomal proteins, was positively correlated with gene expression. My work directly contributed to both of these findings. This work included genome visualization (through GBrowse), which required the local alignment, mapping, and database construction of the RNAs. It also required the definition of the complete transcriptional unit of genes, tRNA, rRNA, and repetitive elements in order to properly annotate and map CPA-sRNAs to M. oryzae genomic features. SQL queries were created to calculate the alignment, read count, and prorating numbers and to identify the CPA-sRNA enriched genes.

Results from these analyses have been published in the manuscript included in this chapter. From these analyses I produced Tables 1, 2, Supplemental Table 3, and Supplemental Figure 1 and provided parts of Figure 3 and Supplemental Figures 5-7. Identification of enriched genes was essential in order to produce Figure 4 and Supplemental Figure 7.


ABSTRACT


MATERIALS AND METHODS


RESULTS


Table 1-Distribution of CPA-sRNAs mapped to genomic and mitochondrial features

Table 2-Association of CPA-sRNAs with other transcriptional evidence


DISCUSSION


REFERENCES


Chapter 6

D-SynD: Dimensional Synteny Detection, Identification of Syntenic Regions between Multiple Genomes.

Joshua K. Sailsbery1,2, Brent J. Clay3, Carey R. Jackson4, Douglas E. Brown1, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA 2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA 3 Department of Mechanical and Aerospace Engineering, North Carolina State University, Raleigh, NC 27606, USA 4 Department of Statistics, North Carolina State University, Raleigh, NC 27606, USA

For publication as a research paper in Nucleic Acids Research

* To whom correspondence should be addressed.

Ralph A. Dean 851 Main Campus Drive, Suite 233 Centennial Campus Center for Integrated Fungal Research North Carolina State University Raleigh, NC 27606 Tel.: +1-919-513-0020; Fax: +1-919-513-0024; Email: [email protected]


ABSTRACT

D-SynD is a software package that can detect regions of syntenic DNA between multiple large genomes simultaneously. The software allows many user options, such as defining the preferred syntenic region size and complexity. D-SynD assigns each syntenic region a probability and determines those that are the most significant. The measures used to determine probability compensate for the effects of repetitive DNA. D-SynD determines syntenic regions based entirely on homologous DNA and thus requires no gene models and makes no assumptions with regard to gene order or orientation. Here, we present two test cases using D-SynD, each comparison involving three related species. D-SynD is released as an open-source software package and is available at http://www.fungalgenomics.ncsu.edu.


INTRODUCTION

Comparative genomics is an emerging field fueled by the recent availability of complete genomes. Fundamentally, comparative genomics is the study of the relationship of genome structure across multiple organisms of various evolutionary distances [1]. The majority of software developed for this field has focused on gene order and orientation [2–18]. The software varies by implementation, algorithms, target genome size, comparable number of subjects, and user friendliness, among other differences. However, these synteny software packages all require quality gene models.

The term synteny, in the context of comparative genomics, has become more specific since Nadeau and Taylor [19] first described syntenic breakpoints between closely related genomes. At first thought to be random, these breakpoints have been shown not to occur in regions where gene order is functionally constrained (e.g. operons) [20, 21]. This led to the definition of “solid” regions of synteny, which accumulate fewer breakpoints over evolutionary distance [22]. At first these syntenic regions were not tied directly to gene colinearity.

Gene colinearity is defined by a set of homologous genes with conserved order and orientation between compared genomes. Software designed to detect perfect gene colinearity has used terms such as microcolinearity [2], locally collinear blocks [3], colinearity [4, 5], and conserved synteny [6]. Software designed to detect imperfect gene colinearity (e.g. allowing for small rearrangements) has used terms such as segmental homologs [7], orthologous segments [8], and synteny blocks [9–11]. The term syntenic block can be used for both perfect and imperfect colinearity, and such blocks are detected by software such as OrthoCluster [12, 13] and SyMap v3.4 [14].

Regulatory elements have been identified within introns and within neighboring genes [23]. Thus, the disruption of colinearity would degrade the regulation of certain genes, providing a functional constraint on syntenic blocks. Another constraint would be a common promoter shared by two or more genes [24]. While the software cited herein may detect regulatory elements contained within genes, it would not detect promoter regions outside of the gene model. Such regions remain, at best, unused in synteny determination and, at worst, altogether undetected.

With more and more genomes available, the ability of software to analyze multiple genomes simultaneously (multidimensional analysis) is becoming a necessity. Many software packages perform multidimensional analyses (ccpart [15], Cynteny [16], Cyntenator [6], MCMuSeC [17], OrthoCluster, TEAM [18]), while others offer statistical validation of the results (ColinearScan [4], FISH [7]), but none do both. OrthoCluster and MCMuSeC are multidimensional packages that utilize set theory to cluster genes based on homology scores. While this clustering is based only on gene homologs and not directly on genome position, it does highlight the creation of clusters to identify syntenic blocks.


Herein we describe the algorithms of the D-SynD software package. D-SynD utilizes all homologous DNA between genomes in a multidimensional analysis to determine the regions of shared synteny. It does not require any gene information, such as order, orientation, position, or model. Syntenic regions are detected through clustering methods and assigned a probability.

Two comparative genomic case studies were performed with D-SynD. The Sordariomycetes comparative analysis includes fully sequenced genomes from the filamentous fungi Magnaporthe oryzae, Magnaporthe poae, and Gaeumannomyces graminis var. tritici. The second analysis, denoted as the Hominidae analysis, focuses on the first chromosomes of Homo sapiens, Pan troglodytes (chimp), and Pongo pygmaeus (orangutan).


IMPLEMENTATION

The software was developed in Java, Perl, and R. The D-SynD package was developed in a UNIX environment, and user interaction occurs through the command line. The flow chart in Figure 1 displays the workflow through each step in the D-SynD suite. Herein we describe, in detail, the algorithms within each step of D-SynD.

Algorithm

Finding Homologous DNA

D-SynD accepts homologous DNA input from sequence alignment programs such as LASTZ, BLASTN [25], and MUMmer [26]. The score, coverage, coordinates, and identity of each alignment (Hit) are required by D-SynD. The program requires the Hits from all possible ordered pair-wise combinations of the genomes included in the analysis, $n(n-1)$ in total, where $n$ is the number of genomes. For example, in a study with 4 genomes, D-SynD would require 12 (4*3) separate homology analyses. Optionally, D-SynD will run LASTZ and/or MUMmer bi-directionally for each pair-wise combination of user-provided genomes.
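As a small illustration of the pair-wise requirement, the sketch below enumerates every ordered (query, subject) pair for a list of genomes. The function name and genome labels are hypothetical and are not part of D-SynD.

```python
from itertools import permutations


def required_alignments(genomes):
    """Return every ordered (query, subject) pair; n genomes give n*(n-1) pairs."""
    return list(permutations(genomes, 2))


pairs = required_alignments(["G1", "G2", "G3", "G4"])
print(len(pairs))  # 12, i.e. 4*3
```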


Superblock Formation

Superblocks are contiguous sections of homologous DNA within the scope of an analysis. They are formed independently for each organism from their pair-wise alignments. Figure 2 illustrates the formation of a Superblock on organism Q. In the figure, Hits are shown between organisms Q and O1, O2, where Q was the query organism. On Q, the region of DNA within the dashed box is designated as a Superblock. It contains the overlapping set of homologous DNA from the other two organisms. Superblocks are allowed to span a user-defined gap length of non-homologous DNA (300bp by default). Additionally, Superblocks are not allowed to span genomic or sequencing structures, such as chromosomes, supercontigs, scaffolds, etc. Thus, Superblocks represent discrete sections of homologous DNA specific to each organism, derived only when that organism was the query in an alignment.
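Superblock formation as described above resembles interval merging with a gap tolerance. The sketch below is one possible reading, not the D-SynD implementation: the function name, the `(sequence_id, start, end)` tuple layout for Hits on the query organism, and the choice to measure the gap as the distance between the end of one Hit and the start of the next are all assumptions.

```python
from collections import defaultdict


def form_superblocks(hits, max_gap=300):
    """Merge query-side Hit intervals into Superblocks.

    hits: iterable of (sequence_id, start, end) on the query organism.
    Intervals on the same sequence (chromosome, scaffold, ...) are merged
    when they overlap or are separated by at most max_gap bases.
    Intervals on different sequences are never merged.
    """
    by_sequence = defaultdict(list)
    for seq_id, start, end in hits:
        by_sequence[seq_id].append((min(start, end), max(start, end)))

    superblocks = []
    for seq_id, intervals in by_sequence.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start - cur_end <= max_gap:
                cur_end = max(cur_end, end)  # extend the current Superblock
            else:
                superblocks.append((seq_id, cur_start, cur_end))
                cur_start, cur_end = start, end
        superblocks.append((seq_id, cur_start, cur_end))
    return superblocks
```

With the default gap, Hits at 100-500, 400-900, and 1100-1500 on one scaffold would merge into a single Superblock, while a Hit starting at 2000 would begin a new one.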

CSS Segments

A connected superblock set (CSS) represents a line segment in N-dimensional space, where N is the number of organisms. For example, a pair-wise (two-dimensional) CSS is a link between two Superblocks of two organisms (O1, O2). Thus, for two Superblocks A (O1) and B (O2), a pair-wise CSS will be created if there exists a bi-directional set of Hits between them. There is no limitation on the number of pair-wise CSS that a Superblock can generate. Thus, many pair-wise CSS may also exist between Superblock A and other Superblocks on O2. This ensures that duplicate Hits are detected but can give rise to noisy data if the organisms of interest contain a large number of repetitive sequences. For this reason, it is recommended that some form of repeat-masking be applied to the genomes before the comparative analysis. The complete set of pair-wise CSS between two organisms can be plotted to generate a syntenic dot plot (Figure 3A).

Multidimensional CSS (ND-CSS) represent a link between Superblocks of each organism (Figure 4). ND-CSS are constructed from a set of pair-wise CSS. ND-CSS require that there exists a pair-wise CSS for each binary combination of Superblocks within the ND-CSS. For example, a 3D-CSS is composed of three Superblocks, one from each organism (O1, O2, O3), where there exist three pair-wise CSS (Figure 4). These pair-wise CSS represent members of the intersect set between all pair-wise CSS of O1, O2, and O3.

$$R = \{ P_{1,2} \cap P_{2,3} \cap \cdots \cap P_{N,1} \}$$

Where R is the set of all ND-CSS, and P is the pair-wise CSS between organisms n, n+1 where n={1..N}.
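For the three-organism case, the requirement that every binary combination of Superblocks be linked can be sketched as a search for closed triples. The function name and the representation of pair-wise CSS as sets of Superblock-identifier pairs are assumptions made for illustration only.

```python
def three_d_css(p12, p23, p31):
    """Build 3D-CSS from pair-wise CSS.

    p12: set of (sb1, sb2) links between organisms O1 and O2
    p23: set of (sb2, sb3) links between organisms O2 and O3
    p31: set of (sb3, sb1) links between organisms O3 and O1
    A triple (sb1, sb2, sb3) is kept only if all three links exist.
    """
    o3_by_o2 = {}
    for sb2, sb3 in p23:
        o3_by_o2.setdefault(sb2, []).append(sb3)

    triples = []
    for sb1, sb2 in sorted(p12):
        for sb3 in sorted(o3_by_o2.get(sb2, [])):
            if (sb3, sb1) in p31:
                triples.append((sb1, sb2, sb3))
    return triples
```

A Superblock that takes part in several pair-wise CSS can therefore appear in several 3D-CSS, which mirrors the treatment of duplicate Hits in the pair-wise case.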


Distance Matrix

Euclidean distances between ND-CSS are calculated either between their center points only or between the closest points of each segment. As a result of Superblock formation, all pair-wise CSS are non-intersecting segments. Thus, the shortest distance between any two CSS will involve one of the four endpoints (start or end of each segment) (Figure 5). Utilizing the distance function $D_{AB,CD}$, D-SynD will then generate a distance matrix between every CSS. For ND-CSS where N>2, the distance function is used as a heuristic calculation of distance, as CSS segments can overlap in each additional dimension > 2.
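The endpoint property stated above can be sketched for the two-dimensional case as follows. This is an illustrative implementation of that property only; the exact form of the D-SynD distance function is not given in the text, and the function names and the `((x, y), (x, y))` segment representation are assumptions.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest Euclidean distance from point p to segment a-b (2D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def segment_distance(ab, cd):
    """Distance between two non-intersecting segments.

    For non-intersecting segments the minimum is always reached at one of
    the four endpoints, so four point-to-segment distances suffice.
    """
    (a, b), (c, d) = ab, cd
    return min(
        point_segment_distance(a, c, d),
        point_segment_distance(b, c, d),
        point_segment_distance(c, a, b),
        point_segment_distance(d, a, b),
    )
```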

The user also has the option of calculating the distance matrix from the Mahalanobis distance between CSS center points. However, due to the nature of CSS construction, Mahalanobis distances generally reflect a scaled version of the Euclidean distance between center points and may not provide additional benefit to the user.

Single-link Hierarchical Clustering

Utilizing the generated distance matrix, CSS are clustered using the Single-Linkage Hierarchical Clustering (SLHC) method [27, 28]. Unlike other clustering methods, SLHC does not take a cluster’s overall characteristics into consideration when deciding whether to add a singleton; the only requirement is that the singleton lies within a satisfactory distance (Euclidean, Mahalanobis, etc.) of any element of the cluster. The same logic applies when merging two clusters, as cluster centroids are not considered during the merging procedure. Two clusters are merged simply when any two points of the clusters are within a given distance of one another. Following these rules, SLHC favors linear clusters, while methods such as K-Means clustering [29] actually penalize linear clusters in favor of circular forms. The SLHC method is demonstrated in Figure 6. SLHC culminates in a single hierarchical graph interconnecting all ND-CSS. From this graph we can deduce the relationship of clusters and CSS to one another.

$$D(X, Y) = \min_{x \in X,\, y \in Y} d(x, y)$$

The SLHC distance function, where X and Y are any two sets of elements considered as clusters, and d(x,y) denotes the distance between the two elements x and y.

Thresholds

Thresholds are a means to create clusters within a maximal distance. In essence, dendrograms are trimmed at a user-specified value (Figure 6), resulting in a set of clusters whose distance between their CSS is less than the cutoff. Clusters range in size and quality based on the threshold. Cluster length, incorporation of genes, and bases covered by Superblocks are all affected by the threshold. As shown in Table 1, increasing the threshold has little effect on the number of clusters; however, the average cluster length and corresponding number of CSS per cluster increase dramatically. Furthermore, as expected with increasing thresholds, the proportion of DNA within a cluster covered by Superblocks is reduced. Utilizing thresholds, the user can specify the maximum allowed gap for desired syntenic regions based on the scope of their analysis.
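Trimming a single-linkage dendrogram at a threshold yields the same clusters as joining every pair of elements closer than that threshold and taking the connected groups. The sketch below uses that equivalence with a small union-find structure. It is an illustration rather than the D-SynD code: the function name is hypothetical, and CSS are represented here by center points compared with plain Euclidean distance, which is only one of the options described in the text.

```python
import math


def single_link_clusters(points, threshold, dist=math.dist):
    """Group points whose single-linkage distance is below `threshold`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    return sorted(clusters.values())


centers = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11)]
print(single_link_clusters(centers, threshold=2))
# [[(0, 0), (1, 1), (2, 2)], [(10, 10), (11, 11)]]
```

Note that (0, 0) and (2, 2) end up together even though they are farther apart than the threshold, because each lies within the threshold of (1, 1). This chaining is what lets single linkage follow linear clusters.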

Cluster Scoring Balanced Statistic

Defining clusters by a uniform maximum distance (thresholds) often produces noisy results. For example, larger thresholds dilute smaller syntenic regions, while a threshold that is too small removes large significant clusters from the analysis. While thresholds provide a means to limit syntenic regions by gapping, they do not provide any other measure of cluster strength. The Cluster-Scoring Balanced Statistic (CSBS) provides an additional measurement of clusters and mitigates recognition of erroneous clusters. Often, repetitive DNA leads to the formation of these invalid clusters. The three components of the CSBS (length, coverage, and correlation) have been developed to compensate for these effects. Additionally, the CSBS provides a measure of the strength of each cluster, allowing the use of statistical models and identification of significant clusters.

The lengths of syntenic regions play a vital role in reducing the effects of repetitive DNA. First, the length component measures the balance of the cluster length across each organism. This penalizes the clustering of small repeats from one organism to large syntenic regions of the others.


$$balance = \frac{\min_{O}(L_{C,O})}{\max_{O}(L_{C,O})}$$

Where balance = the balancing score, and $L_{C,O}$ = length of a cluster C on organism O.

Second, D-SynD allows the user to utilize the Gompertz function [30] to specify appropriate lengths for syntenic regions. The Gompertz function is a sigmoid that approaches 0 as the cluster length decreases and approaches a as the length increases. This allows the user to penalize small syntenic regions, while not disproportionately rewarding extremely large regions.

$$y(t) = a e^{b e^{ct}}$$

Where a=1, b=-5, $c = \frac{1}{l}\log_e(-\log_e(0.5)/5)$, and l = user-specified length. With these values, $y(l) = 0.5$.

The balancing and targeting of user-specified lengths are multiplied to yield the length component of the CSBS. The length component will always be between 0-1, as the balancing and the targeting of the length are both between 0-1.
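The length component can be sketched as the product of the balance ratio and a Gompertz score. Two assumptions are made here and should be checked against the software: the constant c is taken to scale with 1/l so that the curve equals 0.5 at the user-specified length, and the Gompertz curve is evaluated at the mean cluster length across organisms, since the text does not say which length is used. All function names are hypothetical.

```python
import math


def balance(lengths):
    """Ratio of the shortest to the longest per-organism cluster length."""
    return min(lengths) / max(lengths)


def gompertz(length, target, a=1.0, b=-5.0):
    """Gompertz curve y(t) = a*exp(b*exp(c*t)), scaled so y(target) = a/2.

    Assumption: c = ln(ln(0.5)/b) / target, which for b = -5 equals
    ln(-ln(0.5)/5) / target.
    """
    c = math.log(math.log(0.5) / b) / target
    return a * math.exp(b * math.exp(c * length))


def length_component(lengths, target):
    """Balance multiplied by the Gompertz score of the mean cluster length."""
    mean_length = sum(lengths) / len(lengths)  # assumption: mean across organisms
    return balance(lengths) * gompertz(mean_length, target)
```

Because both factors lie between 0 and 1, the product does as well, in line with the statement above.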

The coverage component measures the complexity of Hits in each given cluster and is composed of the Superblock coverage and the overlap. Superblock coverage is the amount of DNA used by the cluster over the total length of the cluster for each organism (Figure 7). Overlap is the number of repetitive Hits over the amount of DNA used. In this manner, Superblock coverage is the ratio of homologous DNA to the length of the cluster, and overlap is the ratio of repetitive DNA to homologous DNA. The coverage relates these ratios to reflect the complexity of Hits within the cluster. Additionally, $0 \le cover \le 1$.

$$sb = \frac{\sum_{O=1}^{N} l_{S,O}}{\sum_{O=1}^{N} L_{C,O}}$$

$$ov = \frac{\sum_{O=1}^{N} V_{C,O}}{\sum_{O=1}^{N} l_{S,O}}$$

$$cover = \frac{sb}{1 + ov^2}$$

Where sb = Superblock coverage, ov = overlap, cover = coverage component, $L_{C,O}$ = full length of a given cluster C on organism O, $l_{S,O}$ = length of the Superblock S on organism O, and $V_{C,O}$ = amount of DNA overlapped in a given cluster C on organism O.
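A minimal sketch of the coverage component follows, assuming the reading cover = sb / (1 + ov^2), with sb the summed Superblock length over the summed cluster length and ov the summed overlap over the summed Superblock length. The function name and the per-organism list inputs are hypothetical.

```python
def coverage_component(cluster_lengths, superblock_lengths, overlaps):
    """Coverage component of the CSBS for one cluster.

    Each argument holds one value per organism:
    cluster_lengths    -- full length of the cluster (L)
    superblock_lengths -- DNA covered by Superblocks in the cluster (l)
    overlaps           -- amount of overlapped (repetitive) DNA (V)
    """
    sb = sum(superblock_lengths) / sum(cluster_lengths)
    ov = sum(overlaps) / sum(superblock_lengths)
    return sb / (1 + ov ** 2)
```

Under this reading, a cluster with no overlapping Hits scores exactly its Superblock coverage, and heavier overlap pulls the score toward zero.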

The correlation component measures the degree of dependence between organisms for the start, center, and end points of each CSS. For a given cluster, it will reflect the linearity of the CSS.

$$corr_C = \frac{\operatorname{cov}(X_{C,O_1}, \ldots, X_{C,O_N})}{\sigma_{X_{C,O_1}} \cdots \sigma_{X_{C,O_N}}}$$

Where corr = correlation component, N = number of genomes, O = organism, and X = set of start, center, and end points for each CSS on organism O in cluster C.


$$CSBS_C = corr_C \cdot cover_C \cdot length_C$$

Where C is a given cluster.

The CSBS measures the complexity, balance, and linearity of a given cluster. Clusters that accurately identify syntenic regions maximize each of the components of the CSBS.

Cluster Visualization

At several points in the workflow of D-SynD, the user has the option to create 2 and/or 3 dimensional graphs. First, users can plot each pair-wise CSS, which yields a dot plot of all contiguous Hits between two organisms. After constructing the ND-CSS, the user can then create a three dimensional representation of 3D-CSS. After building or selecting significant clusters (as defined in following sections), the user can plot these clusters from a 2 or 3 dimensional perspective. All 2 dimensional graphs are created using R. Blender, an open-source 3D modeling package, is used to generate 3 dimensional plots (Figure 8). Blender may also be used to explore the 3 dimensional environment of cluster and 3D-CSS [31].


Statistical Validation of the CSBS

Clusters obtained from SLHC are a subset of all possible clusters in a given analysis. Thus, the null hypothesis of subsequent statistical tests would be that SLHC clusters are randomly constructed. The Global Cluster (GC) distribution contains the CSBS for all possible combinations of CSS within an analysis.

\[ C = \sum_{k=2}^{n} \binom{n}{k} \]

Where C=number of possible clusters, n=number of CSS.

For example, with just 10 CSS, there are 1022 possible clusters (Figure 9A).

Cluster size follows a normal distribution with μ = n/2 and σ² = n/4. Following this distribution, D-SynD created random clusters that properly reflected the frequency of cluster size. Sampling the CSBS of random clusters properly reflected the Global Cluster distribution (Figure 9B). However, as shown in Figure 9C, the size of clusters determined by SLHC is not normally distributed; thus, the expected distribution of experimental CSBS would be exponential. Figure 9D shows the CSBS of samples taken from clusters following the trend line of Figure 9C. Biased cluster size is expected in experimental clusters, because they are built step-up from singletons and smaller clusters. Estimating the parameters from the Global Cluster distribution will allow D-SynD to determine which experimental clusters are significant.
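The mean and variance quoted for cluster size can be checked with a short Python sketch. It counts every subset of n CSS by size, including the empty set and singletons, which is a simplifying assumption made here; the function name is hypothetical.

```python
from math import comb


def cluster_size_moments(n):
    """Mean and variance of subset size when every subset of n CSS is equally likely."""
    total = 2 ** n
    # Probability that a randomly chosen subset contains exactly k CSS.
    probs = [comb(n, k) / total for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean_10, var_10 = cluster_size_moments(10)   # compare with n/2 and n/4
```

For n = 10 this gives a mean of n/2 = 5 and a variance of n/4 = 2.5, the values annotated in Figure 9A.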


Determining Significant Clusters

Significant clusters are determined by calculating a cutoff based on the units of σ each cluster is from μ (i.e. nσ). From nσ, the significance level (α) can be calculated (e.g. n=3, α=0.00135). Next, μ̂ is determined from a 1000 sample of the GC distribution. Then, using α, μ̂, and assuming the experimental values for CSBS ~ EXP(λ), we derive a significance cutoff. All experimental clusters with a CSBS greater than the cutoff are called significant. As a user option, D-SynD allows the user to specify either n or α (default α=.001).

\[
\begin{aligned}
\alpha &= 1 - \Phi(n\sigma), \qquad \lambda = \frac{1}{\hat{\mu}} \\
\int_{0}^{c} \lambda e^{-\lambda x}\, dx &= 1 - \alpha \\
\left[ -e^{-\lambda x} \right]_{0}^{c} &= 1 - \alpha \\
-e^{-\lambda c} + 1 &= 1 - \alpha \\
e^{-\lambda c} &= \alpha \\
c &= \frac{-\ln(\alpha)}{\lambda}
\end{aligned}
\]

Where n = units of sigma above the mean, Φ is the normal cumulative distribution function, μ̂ is estimated from GC, and c = significance cutoff.
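The cutoff derivation can be sketched in a few lines of Python. The sketch assumes the exponential rate is taken as the reciprocal of the mean of the sampled Global Cluster CSBS values; the function names and the sample values are hypothetical and are not taken from D-SynD.

```python
import math


def alpha_from_sigma(n):
    """Upper-tail probability of a standard normal at n standard deviations."""
    return 0.5 * math.erfc(n / math.sqrt(2))


def significance_cutoff(gc_sample, alpha=0.001):
    """CSBS cutoff c such that P(X > c) = alpha for X ~ EXP(rate).

    gc_sample -- CSBS values sampled from the Global Cluster distribution
                 (hypothetical input); the rate is assumed to be 1 / mean.
    """
    mu_hat = sum(gc_sample) / len(gc_sample)
    rate = 1.0 / mu_hat
    return -math.log(alpha) / rate


alpha_3 = alpha_from_sigma(3)                          # about 0.00135
cutoff = significance_cutoff([1.0, 2.0, 3.0], 0.001)   # toy sample with mean 2
```

Clusters whose CSBS exceeds the returned cutoff would then be called significant at the chosen level.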


CASE STUDIES

We performed two comparative studies with D-SynD, denoted as Sordariomycetes and Hominidae (Table 2). The Hominidae analysis compared the first chromosomes (~140Mb) of H. sapiens, P. troglodytes, and P. pygmaeus. D-SynD created 12,110 total superblocks from over 279,000 BLASTZ hits (Table 2). This resulted in 15,560 3D-CSS and over 14,000 significant clusters (α = .001) (Table 3). Mapping of the CSS from significant clusters resulted in a reduction of background noise, a consequence of repetitive sequences, between these closely related organisms (Figure 10).

The Sordariomycetes analysis included fully sequenced genomes (~40Mb) from the filamentous fungi M. oryzae, M. poae and G. graminis, all members of the Magnaporthaceae family. The BLASTZ pair-wise analysis resulted in 18,196-29,928 hits between these three organisms. Using the BLASTZ hits, D-SynD created ~5,500 superblocks for M. poae and G. graminis and over 8,200 for M. oryzae. The superblocks in M. oryzae were smaller and more dispersed than the M. poae or G. graminis superblocks (data not shown). This is not unexpected as M. poae and G. graminis are more closely related to each other than they are to M. oryzae. Interconnected superblocks resulted in 900-1,680 pair-wise CSS, 7,337 3D-CSS and 5,081 clusters (Table 3). At the .1% level, D-SynD identified 1,208 out of 5,081 clusters as significant.

Before (all pair-wise CSS) and after (pair-wise CSS within significant clusters) dot plots between M. poae and G. graminis revealed a dramatic decrease in background noise between these organisms (Figure 3). Syntenic regions were then mapped from the coordinates of significant clusters, and hyperlinks created for the comparative genome browser GEVO [32, 33]. Utilizing GEVO, we visualized a wide range of syntenic regions (Figure 11, Supplemental Figures S1 and S2). These ranged from small, nearly perfect collinear blocks to large syntenic regions encompassing genomic rearrangements.

Further inspection of the 1,208 syntenic regions revealed the percentage of homologous genes within a cluster was inversely proportional to the length of that cluster (Table 4). For example, genes within clusters of length < 10 kb were 91% likely to be homologous while genes found in clusters > 100 kb were only 6% likely.

Additionally, many homologous intergenic regions (an average of 1.4-11.5 per cluster) were identified in clusters of varying lengths (for M. oryzae; Table 4). These regions are of great interest as they may contain genomic features, such as promoter regions and regulatory elements, that are conserved between these organisms in highly syntenic regions.
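The homologous-gene percentages quoted above follow from the per-cluster averages in Table 4. The short Python sketch below reproduces that arithmetic; the dictionary layout and names are illustrative only.

```python
# Average genes per cluster and average genes with homologs, copied from Table 4.
table4 = {
    "< 10": (2.2, 2.0),
    "10-50": (8.2, 4.7),
    "50-100": (25.0, 6.7),
    "> 100": (279.5, 18.0),
}


def homologous_percent(genes, homologous):
    """Share of genes in a cluster that have homologs, as a percentage."""
    return 100.0 * homologous / genes


percents = {window: round(homologous_percent(*row)) for window, row in table4.items()}
```

The rounded shares fall from 91% in the shortest window to 6% in the longest, which is the inverse relationship described in the text.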


CONCLUSION

D-SynD is an open source software package that effectively compares multiple genomes simultaneously, based solely on the DNA sequence of each genome. Gene models are not required. In our case studies, D-SynD identified syntenic regions ranging from nearly perfect syntenic blocks to regions with little to no gene conservation (i.e. regulatory elements, other genomic features) (Figure 11). The software was also able to reduce the noisiness of the data (Figures 3, 10) by compensating for repetitive DNA through computation of the CSBS. The CSBS provided the ability to quantitatively measure cluster strength. Formalization of the GC distribution and expected experimental distribution led to the creation of a significance cutoff. Application of this package produces a user defined set of syntenic regions that are statistically supported (Figures 3, 10-11). Furthermore, the user is also able to view the integrity of syntenic regions as 3 dimensional clusters or through genomic browsers such as GEVO. This package is freely available at http://www.fungalgenomics.ncsu.edu.


SUPPLEMENTAL

Supplemental Figures S1-S2 show additional significant syntenic regions and are available at Nucleic Acids Research online (http://nar.oxfordjournals.org).

ACKNOWLEDGEMENTS

We would like to thank current and former members of the Center for Integrated Fungal Research for their critical comments, discussion, and contributions. This work was supported by a grant to RAD from the National Science Foundation: MCB-0731808, and the Summer Research Experience for Undergraduates (REU) from North Carolina State University.

REFERENCES

1. Gregory TR: The Evolution of the Genome. 1st edition. Academic Press; 2005.

2. Vandepoele K, Saeys Y, Simillion C, Raes J, Van De Peer Y: The automatic detection of homologous regions (ADHoRe) and its application to microcolinearity between Arabidopsis and rice. Genome Res. 2002, 12:1792-1801.

3. Darling ACE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004, 14:1394-1403.

4. Wang X, Shi X, Li Z, Zhu Q, Kong L, Tang W, Ge S, Luo J: Statistical inference of chromosomal homology based on gene colinearity and applications to Arabidopsis and rice. BMC Bioinformatics 2006, 7:447.

5. Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 2004, 20:3643-3646.

6. Rödelsperger C, Dieterich C: CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PLoS ONE 2010, 5:e8861.

7. Calabrese PP, Chakravarty S, Vision TJ: Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics 2003, 19 Suppl 1:i74-80.

8. Hachiya T, Osana Y, Popendorf K, Sakakibara Y: Accurate identification of orthologous segments among multiple genomes. Bioinformatics 2009, 25:853-860.

9. Pevzner P, Tesler G: Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 2003, 13:37-45.

10. Cannon SB, Kozik A, Chan B, Michelmore R, Young ND: DiagHunter and GenoPix2D: programs for genomic comparisons, large-scale homology discovery and visualization. Genome Biol. 2003, 4:R68.

11. Soderlund C, Nelson W, Shoemaker A, Paterson A: SyMAP: A system for discovering and viewing syntenic regions of FPC maps. Genome Res. 2006, 16:1159-1168.

12. Zeng X, Nesbitt MJ, Pei J, Wang K, Vergara IA, Chen N: OrthoCluster. In ACM Press; 2008:656.

13. Ng M-P, Vergara IA, Frech C, Chen Q, Zeng X, Pei J, Chen N: OrthoClusterDB: an online platform for synteny blocks. BMC Bioinformatics 2009, 10:192.

14. Soderlund C, Bomhoff M, Nelson WM: SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 2011, 39:e68.

15. Boyer F, Morgat A, Labarre L, Pothier J, Viari A: Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functional data. Bioinformatics 2005, 21:4209-4215.

16. Sinha AU, Meller J: Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 2007, 8:82.

17. Ling X, He X, Xin D: Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics 2009, 25:571-577.

18. Luc N, Risler J-L, Bergeron A, Raffinot M: Gene teams: a new formalization of gene clusters for comparative genomics. Comput Biol Chem 2003, 27:59-67.

19. Nadeau JH, Taylor BA: Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. U.S.A. 1984, 81:814-818.

20. Satou Y, Mineta K, Ogasawara M, Sasakura Y, Shoguchi E, Ueno K, Yamada L, Matsumoto J, Wasserscheid J, Dewar K, Wiley GB, Macmil SL, Roe BA, Zeller RW, Hastings KEM, Lemaire P, Lindquist E, Endo T, Hotta K, Inaba K: Improved genome assembly and evidence-based global gene model set for the chordate Ciona intestinalis: new insight into intron and operon populations. Genome Biol. 2008, 9:R152.

21. Blumenthal T, Evans D, Link CD, Guffanti A, Lawson D, Thierry-Mieg J, Thierry-Mieg D, Chiu WL, Duke K, Kiraly M, Kim SK: A global analysis of Caenorhabditis elegans operons. Nature 2002, 417:851-854.

22. Pevzner P, Tesler G: Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc. Natl. Acad. Sci. U.S.A. 2003, 100:7672-7677.

23. Kikuta H, Laplante M, Navratilova P, Komisarczuk AZ, Engström PG, Fredman D, Akalin A, Caccamo M, Sealy I, Howe K, Ghislain J, Pezeron G, Mourrain P, Ellingsen S, Oates AC, Thisse C, Thisse B, Foucher I, Adolf B, Geling A, Lenhard B, Becker TS: Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res. 2007, 17:545-555.

24. Yang MQ, Koehly LM, Elnitski LL: Comprehensive annotation of bidirectional promoters identifies co-regulation among breast and ovarian cancer genes. PLoS Comput. Biol. 2007, 3:e72.

25. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215:403-410.

26. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Res. 1999, 27:2369-2376.

27. Kawaji H, Yamaguchi Y, Matsuda H, Hashimoto A: A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Inform 2001, 12:93-102.

28. Watanabe H, Otsuka J: A comprehensive representation of extensive similarity linkage between large numbers of proteins. Comput. Appl. Biosci. 1995, 11:159-166.

29. MacQueen J, others: Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. 1967, 1:14.

30. Kenney JF, Keeping ES: Mathematics of Statistics Pt. 1. 3rd edition. Van Nostrand, Princeton, NJ; 1962.

31. Mullen T: Mastering Blender. 1st edition. Sybex; 2009.

32. Lyons E, Pedersen B, Kane J, Freeling M: The Value of Nonmodel Genomes and an Example Using SynMap Within CoGe to Dissect the Hexaploidy that Predates the Rosids. Tropical Plant Biology 2008, 1:181-190.

33. Lyons E, Freeling M: How to usefully compare homologous plant genes and chromosomes as DNA sequences. The Plant Journal 2008, 53:661-673.

Table 1 Cluster Attributes based on Threshold cutoff for Sordariomycetes

Threshold  Number of Clusters  Average Number of CSS per Cluster  Average Cluster Length (bp)  Average Superblock Coverage  Gene Density (Genes/MB)
20000      1366                2.7                                25422                        87%                          402
50000      1512                3.4                                41796                        79%                          350
300000     1318                5.2                                204650                       61%                          335

NOTE.–Cluster attributes are examined at several thresholds for the Sordariomycetes comparative study. The number of clusters generated, average number of 3D-CSS segments per cluster, average cluster length, average Superblock coverage, and gene density are reported.

Table 2 Pair-wise summary statistics of the comparative analyses.

Study            Organism (O1)   Organism (O2)   Hits O1-O2  Hits O2-O1  Superblocks O1  2D-CSS
Sordariomycetes  M. oryzae       M. poae         19391       18196       8251            900
Sordariomycetes  M. poae         G. graminis     28034       29928       5545            1680
Sordariomycetes  G. graminis     M. oryzae       20526       22629       5574            1350
Hominidae        H. sapiens      P. troglodytes  45921       45876       906             3180
Hominidae        P. troglodytes  P. pygmaeus     48289       48859       4523            9278
Hominidae        P. pygmaeus     H. sapiens      45379       45081       6681            4248

NOTE.–The number of pair-wise hits between Organisms O1 and O2 is reported, along with the Superblocks built for O1 and the number of pair-wise CSS (2D-CSS) between O1 and O2.

Table 3 Multidimensional summary statistics of the comparative analyses.

                                     Significant Clusters
Study            3D-CSS  Clusters   α=.1    α=.01   α=.001
Sordariomycetes  7337    5081       1783    1402    1208
Hominidae        15560   15559      14782   14442   14299

NOTE.–The number of 3D-CSS, clusters, and significant clusters are reported.

Table 4 Significant cluster attributes for M. oryzae

Cluster Length (kb)  Number(a)  Genes(b)  Homologous Genes(c)  Homologous Inter(d)
< 10                 317        2.2       2.0                  1.4
10-50                273        8.2       4.7                  3.4
50-100               91         25.0      6.7                  4.8
> 100                527        279.5     18.0                 11.5

NOTE.–Data shown is per cluster length window. a. Number of significant clusters. b. Average number of genes per cluster. c. Average number of genes with homologs in M. poae and G. graminis. d. Average number of homologous intergenic regions shared with M. poae and G. graminis.

[Figure 1 flowchart elements: Start; all n(n-1) pair-wise alignments run?; Superblocks; CSS; 2D dot plots?; ND-CSS; 3D dot plots?; Distance Matrix; SLHC; Clusters; build 2D, 3D dot plots?; Significant Clusters; 2D, 3D dot plots?; Report Results; End.]

Figure 1-Workflow of the D-SynD software package.


Figure 2–Homologous DNA segments of a Superblock. Query (Q) and subject (O1, O2) organisms are designated. Connections between subject and query organisms represent homologous DNA used to construct Q’s Superblock. The Superblock represents a nearly contiguous segment of homologous DNA within Q.


[Figure 3, panels A and B: dot plots with M. poae (Mb) on the horizontal axis and G. graminis (Mb) on the vertical axis, each spanning 0-40 Mb.]

Figure 3–M. poae vs. G. graminis dot plots. A. All pair-wise CSS between M. poae and G. graminis. The points represent homologous DNA between two genomic locations of the organisms. Genomes are graphically represented by sequentially appending supercontigs and contigs end to end. B. Pair-wise dots located only within significant clusters.


Figure 4–Formation of a three dimensional CSS from pair-wise CSS. The axes O1, O2, and O3 represent the genomes of three organisms, and black boxes are Superblocks on each organism. Blue and red dots represent CSS, two dimensional and three dimensional, respectively.


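[Figure 5 diagram: triangle with vertices A, B, C, opposite sides a, b, c, and angles α, β, γ, together with point D.]

\[ \alpha = \cos^{-1}\left(\frac{b^{2}+c^{2}-a^{2}}{2bc}\right), \qquad \beta = \cos^{-1}\left(\frac{a^{2}+c^{2}-b^{2}}{2ac}\right), \qquad \gamma = \cos^{-1}\left(\frac{a^{2}+b^{2}-c^{2}}{2ab}\right) \]

\[ d_{C,c} = \begin{cases} a \sin\beta & \text{iff } \alpha \le 90^{\circ},\ \beta \le 90^{\circ} \\ \min(a, b) & \text{otherwise} \end{cases} \]

\[ D_{AB,CD} = \min\left(d_{A,CD},\ d_{B,CD},\ d_{C,AB},\ d_{D,AB}\right) \]

Figure 5–Geometric representation of two pair-wise CSS segments. The distance between the CSS is determined by the D_{AB,CD} function.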
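The distance in Figure 5 can be sketched in Python as follows. This is a minimal reading of the figure, assuming each CSS segment is given by two endpoint coordinates; the function names are hypothetical and are not D-SynD's own.

```python
import math


def point_to_segment(p, seg_a, seg_b):
    """Distance from point p (vertex C) to the segment seg_a-seg_b (side c)."""
    a = math.dist(seg_b, p)      # side a: from B to C
    b = math.dist(seg_a, p)      # side b: from A to C
    c = math.dist(seg_a, seg_b)  # side c: the segment itself
    if a == 0 or b == 0:
        return 0.0               # p coincides with an endpoint
    if c == 0:
        return a                 # degenerate segment: a single point
    # Law of cosines for the angles at A (alpha) and B (beta).
    cos_alpha = (b * b + c * c - a * a) / (2 * b * c)
    cos_beta = (a * a + c * c - b * b) / (2 * a * c)
    if cos_alpha >= 0 and cos_beta >= 0:
        # Both angles are at most 90 degrees: use the perpendicular distance.
        beta = math.acos(min(1.0, cos_beta))
        return a * math.sin(beta)
    return min(a, b)             # otherwise the nearest endpoint


def segment_distance(A, B, C, D):
    """Smallest of the four endpoint-to-segment distances between AB and CD."""
    return min(
        point_to_segment(A, C, D),
        point_to_segment(B, C, D),
        point_to_segment(C, A, B),
        point_to_segment(D, A, B),
    )


example = segment_distance((0, 0), (1, 0), (0, 2), (1, 3))
```

Note that, as in the figure, only endpoint-to-segment distances are compared, so two segments that cross would not be reported at distance zero.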


Figure 6–Single-Linkage Clustering of a Sample Data Set. A. Dot plot of 16 pair-wise CSS, numbered 1-16. Subsets 1 and 2 show the dendrogram created by SLH clustering which connects all the CSS segments. A threshold cut-off at a distance of 3 is also demonstrated. Only clusters that were formed below this threshold are kept. B. The combine and merge steps of SLH clustering are shown in detail. CSS segments in parentheses are connected into clusters of two. These clusters are then represented by brackets and the step number on which they were created. Dots are then added to existing clusters, e.g. CSS 1 (1) into cluster 2 [2]. Finally, clusters are merged together resulting in a new larger cluster, e.g. cluster [6] is formed by combining the CSS in clusters [4] and [5].

[Figure 6, panel A: dot plot with axes o1 (n bp) and o2 (m bp); dendrograms for Subset 1 and Subset 2 with a Distance axis and the threshold marked at 3.]

[Figure 6, panel B:]

      Subset 1                    Subset 2
Step  Merge Operation  Distance   Merge Operation  Distance
1     (5) and (6)      1.41       (12) and (13)    0.71
2     (2) and (3)      2.12       (8) and (9)      2.12
3     (1) and [2]      2.83       (11) and [1]     2.12
4     (7) and [1]      2.83       (10) and [2]     3.54
5     (4) and [3]      3.54       [3] and [4]      4.24
6     [4] and [5]      4.24       -                -
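The thresholded single-linkage step described in Figure 6 can be approximated with a union-find sketch in Python. Cutting a single-linkage dendrogram at a distance threshold is equivalent to joining every pair of items closer than that threshold, which is what the code does. Dropping singletons is an assumption made here, and the function name and toy data are hypothetical.

```python
def single_linkage_clusters(items, distance, threshold):
    """Group items linked by chains of pairwise distances below threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) < threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item)
    # Keep only clusters that were actually formed (assumption: no singletons).
    return [members for members in groups.values() if len(members) > 1]


toy_positions = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
toy_clusters = single_linkage_clusters(toy_positions, lambda a, b: abs(a - b), 3)
```

With the toy positions, the first three points and the next two form separate clusters, and the isolated point at 20 is left out.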


[Figure 7 diagram: clusters C1 and C2 plotted between organisms OA and OB, with Superblock lengths l1,A-l4,A and l1,B-l4,B and full cluster lengths LC,A and LC,B marked on the axes.]

Figure 7–CSBS measurements on hypothetical clusters from SLHC. This figure demonstrates two hypothetical clusters C1, C2 between two organisms OA, OB. LC,O are the full length of a given cluster and lS,O are the lengths of the involved Superblock S on organism O. Pair-wise CSS are represented as two dimensional segments.


Figure 8–Three dimensional representation of a cluster from the Sordariomycetes comparative analysis. Gray boxes represent each 3D-CSS and are connected using the convex hull function. Rendering of 3D objects was performed using the Blender open-source suite.


[Figure 9 panels: A. cluster size (annotated μ = 5, σ² = 2.5); B. CSBS; C. cluster size; D. CSBS.]

Figure 9–Distributions of the Global Cluster. A. Tally of possible clusters by size (CSS count) from a set of 10 CSS. B. The CSBS of 10,000 clusters sampled randomly from the Global Cluster distribution of 300 CSS. C. Tally of clusters by size from a simulated SLHC run with 300 CSS. D. The CSBS of a simulated SLHC run on 300 CSS.


[Figure 10 panels: All 3D-CSS; Sig. Clusters.]

Figure 10–Noise reduction in the Hominidae study. Two sets of graphs are shown. All 3D-CSS contains the 2 dimensional representations of the 3D-CSS. They are essentially the dot plots of homologous DNA between the organisms on the axes. Sig. Clusters plots only those 3D-CSS that belong to significant clusters at the α = .05 level.


[Figure 11, panels A and B: GEVO screenshots.]

Figure 11–GEVO views of syntenic regions in the Sordariomycetes analysis. A. A nearly perfect syntenic block. B. Syntenic region with little to no genes conserved.


APPENDICES


APPENDIX A

Chapter 2

Phylogenetic Analysis and Classification of the Fungal bHLH Domain

Joshua K. Sailsbery1,2, William Atchley3, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA 2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA 3 Department of Genetics, North Carolina State University, Raleigh, NC 27606, USA

Additional Material


Supplemental Data 1

Annotations, classification and sequences for Plant, Animal and Fungal bHLH proteins

Columns: #, Name, Common, Short Name, Description, Short Organ., Uniprot, Uniprot Full, Uniprot OLD, Legacy Changes, Organism, Kingdom, Subphylum, IPR001092 Start, IPR001092 End, bHLH Start, bHLH End, # RKH in B, Loop Len, Motif diff, Group, Subgroup, bHLH Sequence, Full Sequence

NOTE.–Fungal sequences were obtained utilizing Interpro 31.0 and Genbank 180. Plant and Animal bHLH sequences were obtained from published datasets (Toledo-Ortiz et al., 2003; Atchley and Zhao, 2007). Annotations for each sequence were obtained using Uniprot (Name, Common, Short Name, Description, Short Organism, Uniprot, Uniprot Full, Uniprot OLD). Discrepancies between published datasets and the current iteration of Uniprot are also noted (Legacy Changes). Taxonomic information was obtained using Genbank's Taxonomic browser (Organism, Kingdom, Subphylum). bHLH domain location within each sequence is given by the Interpro prediction position (IPR001092 Start and End) and the expertly aligned dataset positioning (bHLH Start and End). Using the expert alignment, the count of basic amino acids in the basic region and the length of the loop are determined (# RKH in B, Loop Len). The number of mismatches between each Fungal bHLH sequence and the Fungal consensus sequence is also given (Motif diff). Groups and Subgroups are listed for each sequence as determined through this and previous analyses (A-F for Animal, Binder types for Plant, and F1-F12 for Fungi sequences). Last of all, the expertly determined bHLH domain and the full amino acid sequence are provided (bHLH Sequence, Full Sequence).

Supplemental Table S1 SWDA on Fungal sequences

      pah                pss                ms                 cc                 ec                 all
Step  Enter   r2   ASCC  Enter   r2   ASCC  Enter   r2   ASCC  Enter   r2   ASCC  Enter   r2   ASCC  Enter   FS  r2   ASCC
1     Site 12 1.00 0.09  Site 12 0.99 0.09  Site 12 0.98 0.09  Site 12 0.99 0.09  Site 50 0.98 0.09  Site 12 pah 1.00 0.09
2     Site 6  0.92 0.17  Site 8  0.80 0.16  Site 8  0.87 0.17  Site 51 0.77 0.16  Site 12 0.97 0.18  Site 12 cc  1.00 0.09
3     Site 19 0.80 0.25  Site 6  0.75 0.23  Site 50 0.81 0.24  Site 10 0.74 0.23  Site 6  0.89 0.26  Site 50 ec  0.98 0.18
4     Site 11 0.79 0.32  Site 50 0.73 0.29  Site 11 0.73 0.30  Site 6  0.68 0.28  Site 54 0.83 0.33  Site 6  pah 0.90 0.26
5     Site 24 0.65 0.37  Site 11 0.66 0.35  Site 51 0.70 0.37  Site 11 0.53 0.32  Site 10 0.74 0.39  Site 8  ms  0.88 0.34
6     Site 1  0.51 0.40  Site 10 0.56 0.39  Site 19 0.67 0.43  Site 24 0.45 0.36  Site 15 0.71 0.45  Site 15 ec  0.81 0.41
7     Site 8  0.48 0.44  Site 17 0.52 0.42  Site 6  0.46 0.45  Site 14 0.39 0.38  Site 19 0.65 0.50  Site 24 pah 0.72 0.47
8     Site 15 0.38 0.45  Site 54 0.42 0.44  Site 10 0.42 0.47  Site 1  0.47 0.40  Site 14 0.56 0.54  Site 19 pah 0.71 0.53
9     Site 17 0.37 0.47  Site 19 0.41 0.46  Site 2  0.42 0.48  Site 20 0.40 0.42  Site 11 0.46 0.56  Site 1  cc  0.65 0.55
10    Site 10 0.37 0.49  Site 15 0.36 0.49  Site 60 0.32 0.51  Site 16 0.34 0.44  Site 8  0.45 0.59  Site 11 pah 0.60 0.59
11    Site 50 0.36 0.52  Site 57 0.34 0.51  Site 7  0.31 0.52  Site 50 0.32 0.45  Site 51 0.43 0.61  Site 51 ms  0.55 0.62
12    Site 56 0.31 0.54  Site 16 0.33 0.53  Site 15 0.31 0.54  Site 15 0.26 0.47  Site 17 0.36 0.62  Site 6  pss 0.53 0.63
13    Site 51 0.27 0.55  Site 20 0.32 0.54  Site 17 0.33 0.55  Site 58 0.24 0.49  Site 7  0.37 0.63  Site 54 ec  0.47 0.66
14    Site 7  0.28 0.57  Site 14 0.32 0.55  Site 56 0.32 0.57  Site 3  0.25 0.50  Site 24 0.34 0.65  Site 10 ec  0.47 0.68
15    Site 64 0.23 0.58  Site 4  0.29 0.57  Site 57 0.27 0.58  Site 19 0.24 0.51  Site 20 0.42 0.66  Site 14 ec  0.48 0.71
16    Site 5  0.22 0.59  Site 51 0.30 0.58  Site 24 0.26 0.59  Site 53 0.20 0.52  Site 60 0.28 0.68  Site 6  ec  0.39 0.71
17    Site 57 0.22 0.60  Site 55 0.26 0.59  Site 14 0.26 0.60  Site 7  0.23 0.53  Site 21 0.28 0.68  Site 2  ms  0.36 0.72
18    Site 55 0.21 0.61  Site 1  0.20 0.60  Site 20 0.26 0.61  Site 4  0.21 0.54  Site 56 0.27 0.70  Site 51 cc  0.35 0.72
19    Site 2  0.20 0.61  Site 53 0.19 0.60  Site 58 0.22 0.62  Site 5  0.21 0.56  Site 9  0.24 0.70  Site 11 ec  0.31 0.74
20    Site 4  0.18 0.62  Site 2  0.19 0.61  Site 3  0.21 0.62  Site 8  0.24 0.56  Site 52 0.25 0.70  Site 8  pss 0.29 0.75
21    Site 20 0.18 0.62  Site 7  0.16 0.62  Site 4  0.21 0.63  Site 54 0.14 0.57                     Site 8  pah 0.33 0.75
22    Site 60 0.19 0.63  Site 64 0.16 0.63  Site 16 0.20 0.63  Site 17 0.14 0.57                     Site 5  cc  0.33 0.77
23    Site 52 0.17 0.63  Site 24 0.15 0.63  Site 64 0.19 0.64  Site 21 0.13 0.58                     Site 5  ec  0.29 0.77
24    Site 58 0.15 0.64  Site 28 0.14 0.64  Site 55 0.19 0.65  Site 25 0.13 0.58                     Site 9  cc  0.33 0.78
25    Site 14 0.14 0.65  Site 27 0.15 0.64  Site 59 0.16 0.66  Site 57 0.14 0.59                     Site 51 ec  0.25 0.78
26    Site 3  0.16 0.65  Site 58 0.14 0.65  Site 1  0.16 0.66  Site 2  0.11 0.60                     Site 5  pah 0.25 0.79
27    Site 53 0.15 0.66  Site 26 0.13 0.65  Site 26 0.16 0.67  Site 55 0.11 0.60                     Site 15 pss 0.28 0.80
28    Site 26 0.13 0.66  Site 13 0.12 0.66  Site 63 0.15 0.67  Site 9  0.11 0.60                     Site 15 pah 0.49 0.80
29    Site 59 0.13 0.67  Site 9  0.13 0.66  Site 52 0.14 0.68  Site 64 0.10 0.60
30    Site 28 0.12 0.67  Site 56 0.12 0.66  Site 21 0.14 0.68  Site 63 0.09 0.61

NOTE.–Fungal group sequences were transformed into six factor score datasets (pah, pss, ms, cc, ec, and all). Results from SWDA on each transformed dataset are reported. Each step, up to 30, is given for each SWDA. ASCC are indicative of model performance up to a given step. Steps were not reported after an ASCC of 70% was obtained for the first five factor scores or 80% for the all dataset. Amino acid sites are entered based on their ability to discriminate groups (r2).


Supplemental Table S2 Discerning bHLH amino acid sites by CV

FS   CV  Var%  Grps        Sites  All Sites
pah  1   84.7  F4          6 12 13 23 8 13 23 50 53 54
pah  2   6.26  F10 F11     5 6 10 12 13 20 50 8 20 23 50 54
pah  3   2.55  F2          2 8 9 10 12 13 19 20 23 54 5 13 20 23 50 54
pah  4   2.36  F6 F8       10 11 13 19 27 6 8 9 10 13 16 20 23 27 50 54
pah  5   1.95  F5          6 9 12 13 15 16 23 24 57 5 8 13 23
pah  6   0.80  F1 F3       5 8 9 13 17 19 23 50 51 8 13 20 23 27 50 54
pah  7   0.51  F12         10 13 50 51 54 13 27 54
pah  8   0.36  F9          5 10 13 16 23 28 50 8 20 23 27 54
pah  9   0.28  F7          5 9 13 8 13 16 20 23 27 50
pah  10  0.12              10 12 13 15 23 27 13 20 23 27 50 54
pah  11  0.07  F1          13 17 23 53 54 57 8 13 16 20 23 27
pss  1   81.0  F4          12 23 8 50 54
pss  2   6.74  F6 F11      8 12 13 20 23 50 54 10 11 20 23 54
pss  3   4.42  F1 F2 F10   6 10 11 12 13 17 23
pss  4   2.28  F3 F5       6 8 10 12 13 20 23 64 13 20 23
pss  5   2.17              2 8 10 13 20 23 27 50 54 23
pss  6   1.26  F2 F8       11 13 16 13 20 23 54
pss  7   0.98              13 50 54 10 13
pss  8   0.51  F7          8 10 13 17 23 50 54
pss  9   0.33  F1 F12      8 16 50 54 10 13 16 20 23 54
pss  10  0.20  F9          2 5 11 12 13 50 54 64 13 20 23 54
pss  11  0.16              13 16 50 20
ms   1   63.5  F4          8 11 12 20 23 9 10 27 50 54
ms   2   13.9  F3 F10 F11  2 8 10 12 13 17 20 23 50 8 9 10 13 17 20 23 54
ms   3   8.09  F6          6 8 11 13 16 17 20 23 50 9 13 20
ms   4   4.65  F5 F8       6 10 12 13 16 50 51 2 20 27 54
ms   5   3.49  F2          6 14 17 19 23 50 8 23
ms   6   2.88              8 10 11 12 17 20 23 50 13 17 20
ms   7   1.43  F9          9 15 50 56 27
ms   8   1.00  F12         6 8 10 12 20 50 57 60 8 20 54
ms   9   0.58  F7          5 10 12 20 27 57 8 20 23 27 50
ms   10  0.26              10 12 13 17 20 23 27 50 64 20 23 27 54
ms   11  0.24  F1          9 10 11 20 64 16 20 23 27
cc   1   79.1  F4          9 12 13 50
cc   2   7.74  F2 F8       1 6 10 17 50 51 9 13 50
cc   3   4.99  F6          5 6 9 11 13 20 50 51 5 13 50
cc   4   2.95  F5          1 6 8 9 10 11 51 9 10 13 50
cc   5   2.20              1 10 11 13 20 -
cc   6   1.05  F10 F11     2 14 50 13
cc   7   0.82  F3 F12      2 5 9 12 13 57 13
cc   8   0.38              5 6 9 28 9
cc   9   0.34  F1 F7       2 12 13 16 9 13 50
cc   10  0.24  F9          5 13 13
cc   11  0.17              9 11 13 50 57 -
ec   1   43.1  F11         8 9 12 13 16 17 20 23 50 51 54 8 20 50 53 54
ec   2   22.8  F4          6 8 9 12 13 20 50 53 54 6 8 9 15 17 20 23 27 51 54
ec   3   10.6  F2 F6       6 8 9 12 13 51 54 8 15 20 54
ec   4   9.01  F8          15 20 23 50 51 54 2 16 20 27 50 53 54
ec   5   5.74  F3          6 8 9 12 20 23 50 51 54 20 23 51
ec   6   3.81  F1 F5       8 15 19 20 23 51 54 17 20 54
ec   7   2.13  F9          8 10 16 20 50 51 54 8 27 54
ec   8   1.17  F12         8 9 16 20 27 28 54 60 8 20 27 51 54
ec   9   0.95  F10         4 5 8 9 16 17 20 23 53 54 8 16 20 23 27
ec   10  0.43              8 9 20 28 51 54 20 23 27 54
ec   11  0.28  F7          6 8 16 20 27 28 51 53 57 8 20 27 54
all  1   33.3  F11         All Sites
all  2   33.3  F7
all  3   14.8  F6 F8
all  4   7.44  F9 F10
all  5   4.97  F5
all  6   2.63  F1 F3
all  7   1.70  F2 F4
all  8   0.75  F12
all  9   0.45
all  10  0.36
all  11  0.24

NOTE.–The 11 Canonical Variates (CV) for each factor score analysis (pah, pss, ms, cc, ec, all) are listed. The percentage of variance explained by each CV and the groups separated by that CV are provided. Amino acid sites that are ranked in the top 2, or have an absolute magnitude >1 in each CV are also listed. Highly ranked sites in CVA{all} with weights >8 are reported under "All Sites" within the corresponding factor score.

Supplemental Figure S1–bHLH Phylogenetic Analyses. Maximum Likelihood (ML) and Bayesian (BA) phylogenies for Fungal, Plant and Animal sequences are reported. The branch lengths in both ML and BA are not proportional to distances between sequences. ML trees have been rooted with a single representative from Chlamydomonas reinhardtii. Shimodaira-Hasegawa-like approximate Likelihood Ratio Test (aLRT) support values are shown for each branch. Those branches with less than 30% aLRT support have been collapsed. Support values for ML clades with high support from Neighbor Joining, Maximum Parsimony, Maximum Likelihood Bootstrap, and Bayesian Posterior Probabilities are shown in parentheses, respectively. A. ML of 59 Basidiomycota sequences cluster into 11 groups. These clades are designated B1-B11 associating each Basidiomycota group with a Fungal group of the same number. B. ML of 137 Saccharomycotina sequences cluster into 10 groups. These clades are designated S2-S12 associating each Saccharomycotina group with a Fungal group of the same number. C. ML of 286 Pezizomycotina sequences cluster into 9 groups. These clades are designated P2-P11 associating each Pezizomycotina group with a Fungal group of the same number. D. ML of 490 Fungal sequences cluster into 12 groups. These clades are designated F1-F12 associating each Fungal group with Basidiomycota, Pezizomycotina and Saccharomycotina groups of the same number. E. BA of 916 Plant, Animal and Fungal bHLH sequences. The tree is unrooted. Posterior probabilities are shown as support values for each branch. Branch lengths are not proportional to distances between sequences. Sequences are collapsed into groups and families represented by triangles where the height of the triangle reflects the number of collapsed sequences. Plants are grouped by binding domains (GB, NB, NE, NG), Animals are grouped by family (A-E), and Fungi are grouped (F1-F12). Several Animal B sequences have been left ungrouped to demonstrate that both F2 and F4 are closely related to Group B bHLH sequences.

Supplemental Figure S2–bHLH Group Weblogos. A. Putative Leucine Zipper domains by Basidiomycota (B1-B12), Pezizomycotina (P1-P12), and Saccharomycotina (S1-S12) clades. Logos are built on the unaligned downstream amino acid sequence of clade members. Amino acid sites are numbered starting at +1. The heptad repeat of L at positions 7, 14, 21 and 27-28 is evident in the logos. B. Bit-score weblogos were built for each Fungal group (F1-F12). The basic, Helix1, Loop, and Helix 2 regions of the bHLH domain are designated.


APPENDIX B

Chapter 3

Fundamental Characteristics and Discerning Sites of the Eukaryotic bHLH Domain

Joshua K. Sailsbery1,2, William Atchley3, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA 2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA 3 Department of Genetics, North Carolina State University, Raleigh, NC 27606, USA

Additional Material


Supplemental Data 1

Annotations, classification and sequences for Plant, Animal and Fungal bHLH proteins

Columns: #, Name, Common, Short Name, Description, Short Organ., Uniprot, Uniprot Full, Uniprot OLD, Legacy Changes, Organism, Kingdom, bHLH Start, bHLH End, # RKH in B, Loop Len, Group, bHLH Sequence, Full Sequence

NOTE.–Plant, Animal, and Fungal bHLH sequences were obtained from published datasets (Toledo-Ortiz et al., 2003; Atchley and Zhao, 2007; Sailsbery, 2011). Annotations for each sequence were obtained using Uniprot (Name, Common, Short Name, Description, Short Organism, Uniprot, Uniprot Full, Uniprot OLD). Discrepancies between published datasets and the current iteration of Uniprot are also noted (Legacy Changes). Taxonomic information was obtained using Genbank's Taxonomic browser (Organism, Kingdom, Subphylum). bHLH domain location within each sequence is given by the Interpro prediction position (IPR001092 Start and End) and the expertly aligned dataset positioning (bHLH Start and End). Using the expert alignment, the count of basic amino acids in the basic region and the length of the loop are determined (# RKH in B, Loop Len). Groups are listed for each sequence as determined through this and previous analyses (A-E for Animal, Binder types for Plant, and F1-F12 for Fungi sequences). Last of all, the expertly determined bHLH domain and the full amino acid sequence are provided (bHLH Sequence, Full Sequence).


Supplemental Table S1 SWDA results on Plant, Animal, and Fungal sequences

Each dataset column lists: Enter, r², ASCC (the all dataset lists: Enter, FS, r², ASCC).

Step | pah | pss | ms | cc | ec | all
1 | Site 13 0.35 0.18 | Site 13 0.20 0.10 | Site 55 0.28 0.14 | Site 5 0.41 0.20 | Site 56 0.39 0.20 | Site 5 FS4 0.41 0.20
2 | Site 2 0.35 0.35 | Site 6 0.12 0.15 | Site 12 0.15 0.21 | Site 2 0.21 0.31 | Site 13 0.34 0.37 | Site 56 FS5 0.38 0.39
3 | Site 50 0.30 0.44 | Site 12 0.20 0.22 | Site 19 0.12 0.25 | Site 6 0.18 0.38 | Site 2 0.18 0.42 | Site 50 FS1 0.31 0.49
4 | Site 55 0.15 0.49 | Site 50 0.11 0.28 | Site 51 0.11 0.30 | Site 50 0.13 0.42 | Site 19 0.12 0.45 | Site 2 FS1 0.22 0.54
5 | Site 22 0.11 0.52 | Site 14 0.09 0.32 | Site 8 0.11 0.34 | Site 10 0.09 0.45 | Site 11 0.11 0.48 | Site 10 FS5 0.13 0.57
6 | Site 17 0.10 0.53 | Site 51 0.09 0.34 | Site 56 0.12 0.37 | Site 51 0.07 0.47 | Site 10 0.14 0.52 | Site 6 FS2 0.12 0.60
7 | Site 12 0.11 0.56 | Site 53 0.07 0.37 | Site 20 0.09 0.40 | Site 64 0.07 0.49 | Site 50 0.10 0.54 | Site 19 FS1 0.29 0.67
8 | Site 16 0.09 0.58 | Site 19 0.07 0.39 | Site 57 0.06 0.42 | Site 26 0.05 0.50 | Site 16 0.08 0.56 | Site 51 FS5 0.13 0.69
9 | Site 24 0.09 0.60 | Site 7 0.06 0.41 | Site 21 0.07 0.45 | Site 24 0.05 0.51 | Site 6 0.14 0.60 | Site 55 FS5 0.12 0.71
10 | Site 5 0.08 0.62 | Site 28 0.06 0.43 | Site 28 0.06 0.46 | Site 63 0.05 0.52 | Site 24 0.09 0.61 | Site 56 FS1 0.11 0.72
11 | Site 54 0.06 0.63 | Site 10 0.06 0.44 | Site 50 0.06 0.48 | Site 15 0.03 0.53 | Site 21 0.09 0.63 | Site 1 FS4 0.10 0.73
12 | Site 19 0.06 0.64 | Site 20 0.05 0.46 | Site 24 0.06 0.49 | Site 52 0.04 0.54 | Site 12 0.07 0.64 | Site 9 FS1 0.09 0.75
13 | Site 21 0.05 0.65 | Site 59 0.04 0.47 | Site 7 0.06 0.50 | Site 8 0.04 0.55 | Site 51 0.06 0.65 | Site 51 FS3 0.07 0.76
14 | Site 57 0.05 0.66 | Site 1 0.03 0.48 | Site 62 0.05 0.52 | Site 12 0.03 0.55 | Site 1 0.07 0.66 | Site 5 FS1 0.08 0.77
15 | Site 6 0.06 0.67 | Site 52 0.03 0.49 | Site 14 0.04 0.52 | Site 16 0.04 0.56 | Site 20 0.06 0.67 | Site 55 FS3 0.06 0.78
16 | Site 1 0.05 0.68 | Site 22 0.04 0.50 | Site 2 0.04 0.53 | Site 18 0.03 0.57 | Site 59 0.05 0.68 | Site 12 FS2 0.06 0.78
17 | Site 56 0.04 0.68 | Site 54 0.03 0.51 | Site 59 0.04 0.54 | Site 19 0.03 0.58 | Site 18 0.04 0.69 | Site 17 FS1 0.05 0.79
18 | Site 26 0.04 0.68 | Site 55 0.03 0.51 | Site 4 0.03 0.55 | Site 53 0.03 0.58 | Site 9 0.04 0.69 | Site 55 FS1 0.05 0.79
19 | Site 53 0.04 0.69 | Site 15 0.02 0.52 | Site 26 0.03 0.56 | Site 13 0.03 0.59 | Site 5 0.04 0.70 | Site 13 FS1 0.06 0.80
20 | Site 8 0.04 0.70 | Site 18 0.03 0.52 | Site 27 0.03 0.56 | Site 60 0.03 0.59 | Site 62 0.03 0.70 | Site 19 FS3 0.06 0.81


Supplemental Table S1 (continued)

Each dataset column lists: Enter, r², ASCC. A dash marks a dataset for which no further steps were reported.

Step | pah | pss | ms | cc | ec | all
21 | Site 58 0.03 0.70 | Site 57 0.02 0.53 | Site 5 0.02 0.57 | Site 3 0.02 0.60 | - | -
22 | - | Site 23 0.02 0.53 | Site 63 0.02 0.57 | Site 27 0.02 0.60 | - | -
23 | - | Site 17 0.02 0.54 | Site 16 0.02 0.58 | Site 61 0.02 0.61 | - | -
24 | - | Site 56 0.02 0.54 | Site 6 0.03 0.58 | Site 22 0.02 0.61 | - | -
25 | - | Site 4 0.02 0.55 | Site 17 0.02 0.59 | Site 17 0.02 0.61 | - | -
26 | - | Site 8 0.01 0.55 | Site 13 0.02 0.59 | Site 11 0.02 0.62 | - | -
27 | - | Site 60 0.01 0.55 | Site 10 0.02 0.59 | Site 20 0.02 0.62 | - | -
28 | - | Site 9 0.01 0.56 | Site 22 0.02 0.60 | Site 25 0.02 0.62 | - | -
29 | - | Site 62 0.01 0.56 | Site 1 0.01 0.60 | Site 23 0.01 0.62 | - | -
30 | - | Site 58 0.01 0.56 | Site 9 0.01 0.60 | Site 4 0.01 0.63 | - | -

NOTE.–Fungal group sequences were transformed into six factor score datasets (pah, pss, ms, cc, ec, and all). Results from SWDA on each transformed dataset are reported. Each step, up to 30, is given for each SWDA. ASCC values are indicative of model performance up to a given step. Steps were not reported after an ASCC of 70% was obtained for the first five factor scores or 80% for the all dataset. Amino acid sites are entered based on their ability to discriminate groups (r²).


Supplemental Table S2 CVA results on Plant, Animal, and Fungal sequences

Each column lists: Site, Weight.

Rank | pah CV1 | pah CV2 | pss CV1 | pss CV2 | ms CV1 | ms CV2 | cc CV1 | cc CV2 | ec CV1 | ec CV2
1 | 54 -1.69 | 23 1.23 | 12 1.93 | 23 -1.31 | 20 -1.85 | 12 -1.56 | 10 -1.47 | 50 1.78 | 20 -3.29 | 16 1.50
2 | 50 -1.26 | 13 -1.11 | 13 1.23 | 53 -0.99 | 50 1.00 | 23 -0.93 | 5 -0.80 | 2 1.20 | 54 -1.45 | 54 -1.42
3 | 2 -0.79 | 54 0.93 | 10 -0.92 | 50 0.94 | 28 -0.90 | 27 -0.93 | 12 -0.74 | 6 0.97 | 51 1.44 | 20 -1.40
4 | 17 -0.77 | 53 -0.80 | 6 -0.90 | 20 -0.57 | 12 -0.88 | 28 0.92 | 17 -0.62 | 13 -0.68 | 27 -1.04 | 13 0.78
5 | 13 0.71 | 57 -0.73 | 54 0.70 | 54 0.56 | 23 0.79 | 9 0.64 | 23 -0.61 | 8 0.60 | 50 0.70 | 27 0.62
6 | 12 0.69 | 12 0.73 | 7 -0.43 | 15 -0.47 | 13 -0.78 | 57 0.58 | 2 -0.54 | 9 -0.45 | 56 0.69 | 6 0.59
7 | 57 -0.64 | 5 0.66 | 9 -0.37 | 10 0.41 | 56 -0.77 | 4 0.53 | 13 -0.53 | 10 0.45 | 10 -0.62 | 53 0.57
8 | 51 0.51 | 28 -0.64 | 53 -0.36 | 14 -0.38 | 16 0.63 | 13 -0.47 | 28 -0.49 | 64 0.43 | 19 -0.57 | 50 -0.50
9 | 19 0.49 | 55 0.55 | 27 -0.34 | 27 -0.37 | 54 0.56 | 51 0.45 | 16 -0.47 | 53 -0.41 | 23 -0.56 | 28 -0.45
10 | 26 -0.49 | 16 0.52 | 64 -0.34 | 60 0.37 | 5 0.56 | 17 -0.44 | 8 0.44 | 15 -0.36 | 28 0.55 | 12 -0.40
11 | 24 0.39 | 9 -0.50 | 19 -0.29 | 17 0.34 | 10 0.54 | 10 0.43 | 64 0.39 | 28 -0.34 | 9 -0.55 | 60 -0.40
12 | 28 0.36 | 8 0.48 | 28 0.29 | 52 0.34 | 2 -0.54 | 55 0.41 | 24 0.38 | 26 0.34 | 16 -0.47 | 23 0.31
13 | 1 -0.36 | 27 -0.47 | 1 -0.27 | 2 0.33 | 6 0.53 | 21 0.38 | 18 -0.36 | 5 -0.32 | 57 0.45 | 5 -0.29
14 | 55 -0.34 | 10 0.42 | 8 0.26 | 61 0.30 | 53 -0.53 | 62 0.37 | 63 -0.35 | 11 0.30 | 6 -0.42 | 18 -0.25
15 | 9 0.30 | 24 0.39 | 61 0.26 | 58 -0.29 | 9 0.53 | 59 0.34 | 52 -0.32 | 12 -0.30 | 2 0.38 | 9 -0.24
16 | 27 0.30 | 50 -0.35 | 51 0.25 | 1 -0.28 | 55 -0.50 | 61 0.33 | 19 0.31 | 22 0.30 | 61 0.37 | 59 -0.24
17 | 20 0.30 | 2 -0.29 | 23 0.24 | 57 -0.27 | 21 0.49 | 20 0.33 | 15 0.30 | 60 0.28 | 26 -0.36 | 8 -0.23
18 | 22 -0.30 | 21 -0.25 | 55 0.23 | 64 -0.26 | 19 0.37 | 54 -0.32 | 27 0.28 | 25 0.25 | 21 -0.36 | 57 -0.20
19 | 14 0.29 | 61 -0.23 | 26 -0.23 | 51 -0.25 | 14 0.36 | 26 -0.32 | 20 -0.25 | 23 0.23 | 53 -0.36 | 24 -0.20
20 | 16 0.28 | 4 0.20 | 20 -0.22 | 55 0.22 | 57 0.35 | 64 -0.29 | 60 -0.25 | 55 0.22 | 12 -0.34 | 62 -0.19
21 | 56 0.28 | 58 0.18 | 57 0.19 | 8 0.21 | 61 0.32 | 18 0.26 | 1 0.23 | 3 -0.22 | 52 -0.33 | 11 0.18
22 | 11 -0.27 | 18 -0.18 | 4 0.19 | 22 0.21 | 24 0.30 | 56 0.26 | 61 -0.22 | 18 -0.21 | 64 0.31 | 17 -0.18
23 | 8 0.24 | 51 -0.17 | 16 0.18 | 11 -0.18 | 27 -0.30 | 8 0.25 | 9 -0.20 | 1 0.20 | 15 -0.31 | 55 -0.18
24 | 6 -0.23 | 15 0.15 | 18 -0.18 | 12 -0.18 | 8 -0.29 | 50 0.22 | 4 0.20 | 14 0.16 | 60 0.31 | 63 -0.14
25 | 63 0.22 | 19 -0.15 | 62 0.17 | 7 0.18 | 64 0.28 | 63 -0.21 | 51 -0.20 | 27 0.15 | 24 -0.28 | 26 0.13
26 | 53 0.21 | 59 0.14 | 56 0.15 | 24 0.17 | 22 0.24 | 24 0.20 | 58 0.18 | 57 0.14 | 5 -0.28 | 64 -0.13
27 | 59 0.19 | 20 -0.14 | 5 -0.15 | 19 -0.17 | 7 -0.24 | 16 0.20 | 11 0.16 | 16 -0.13 | 63 0.26 | 21 -0.12


Supplemental Table S2 (continued)

Each column lists: Site, Weight.

Rank | pah CV1 | pah CV2 | pss CV1 | pss CV2 | ms CV1 | ms CV2 | cc CV1 | cc CV2 | ec CV1 | ec CV2
28 | 18 0.18 | 7 0.12 | 59 -0.14 | 6 -0.16 | 15 0.23 | 7 -0.18 | 14 0.13 | 59 -0.13 | 7 0.25 | 52 0.12
29 | 4 0.18 | 25 -0.11 | 60 0.14 | 21 0.15 | 1 -0.21 | 5 -0.16 | 3 -0.13 | 51 0.13 | 1 0.24 | 56 -0.11
30 | 5 -0.14 | 17 -0.10 | 22 0.13 | 59 0.14 | 17 -0.20 | 11 0.14 | 50 -0.11 | 52 0.11 | 3 -0.18 | 1 -0.11
31 | 3 0.10 | 62 0.10 | 58 -0.12 | 56 -0.14 | 11 0.19 | 2 0.14 | 26 0.10 | 58 0.09 | 4 -0.17 | 10 -0.10
32 | 58 -0.10 | 52 -0.09 | 24 -0.12 | 5 0.13 | 62 -0.19 | 52 -0.13 | 53 0.10 | 61 0.09 | 13 0.16 | 19 0.09
33 | 64 0.09 | 3 -0.08 | 63 0.10 | 4 0.09 | 25 -0.16 | 3 -0.12 | 22 -0.09 | 21 0.09 | 62 0.15 | 7 0.09
34 | 25 -0.09 | 11 -0.08 | 2 -0.09 | 62 0.08 | 60 -0.13 | 14 0.11 | 55 0.07 | 24 -0.09 | 58 -0.12 | 22 -0.07
35 | 60 -0.08 | 26 0.06 | 21 -0.08 | 26 -0.07 | 59 0.09 | 19 0.11 | 7 -0.06 | 4 -0.08 | 11 0.11 | 51 -0.07
36 | 23 0.06 | 1 0.05 | 52 0.07 | 13 -0.07 | 63 0.09 | 15 -0.09 | 62 -0.06 | 7 -0.05 | 14 -0.10 | 14 -0.07
37 | 52 -0.05 | 22 -0.05 | 15 0.06 | 9 0.06 | 26 0.08 | 1 -0.06 | 21 0.05 | 54 0.04 | 22 -0.09 | 3 -0.06
38 | 61 0.05 | 6 0.05 | 50 -0.05 | 3 0.05 | 51 0.05 | 53 -0.06 | 57 -0.05 | 63 0.04 | 55 0.06 | 25 0.05
39 | 62 0.03 | 64 0.05 | 11 -0.05 | 25 -0.05 | 52 0.05 | 58 -0.05 | 6 -0.05 | 56 0.04 | 59 0.04 | 58 -0.05
40 | 7 -0.02 | 60 0.04 | 3 0.03 | 28 -0.04 | 4 0.05 | 6 -0.05 | 54 -0.04 | 19 -0.04 | 8 0.02 | 61 -0.04
41 | 15 0.02 | 56 -0.03 | 25 -0.02 | 63 -0.04 | 58 0.03 | 25 -0.02 | 56 -0.01 | 17 0.02 | 17 -0.02 | 4 -0.04
42 | 10 0.02 | 14 -0.01 | 14 -0.01 | 16 0.02 | 18 -0.01 | 60 0.01 | 25 -0.01 | 20 0.02 | 25 0.02 | 2 0.04
43 | 21 0.00 | 63 0.00 | 17 -0.01 | 18 0.02 | 3 -0.01 | 22 0.00 | 59 0.00 | 62 0.01 | 18 0.01 | 15 -0.03

NOTE.–Two canonical variates from five different CVA are provided. Each analysis corresponds to a factor score transformed Plant, Animal, Fungal dataset (pah, pss, ms, cc, and ec). The bHLH amino acid sites are ranked by the absolute value of their corresponding coefficient (weight) in the canonical variate. Sites with weights further from zero have higher discerning power between Plant, Animal and Fungal sequences.
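The ranking rule in the note (order sites by the absolute value of their canonical-variate weight) can be illustrated with a short sketch. The function name is hypothetical; the five weights are the top five pah CV1 entries from Supplemental Table S2, deliberately given out of order.

```python
def rank_sites(weights: dict[int, float]) -> list[tuple[int, float]]:
    """Return (site, weight) pairs ordered by decreasing absolute weight."""
    return sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)


# Top five pah CV1 weights from Supplemental Table S2, shuffled.
pah_cv1 = {13: 0.71, 2: -0.79, 54: -1.69, 17: -0.77, 50: -1.26}
ranked = rank_sites(pah_cv1)
```

Sorting on the absolute value means a strongly negative weight (site 54, -1.69) outranks a moderately positive one (site 13, 0.71), which matches the order shown in the table.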


Supplemental Table S3 CVA{all} results on Plant, Animal, and Fungal sequences

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
CV1 Factor pah ec ms ec ms ms ec pss pss ec ec pah pah ec ms ec pah ec pah pah
CV1 Site 23 23 23 20 13 20 51 13 23 54 57 13 20 13 54 27 53 16 50 57
CV1 Weight 26.38 25.30 23.70 -9.61 -3.63 -3.51 3.30 -2.99 -2.85 2.56 2.36 2.16 -2.01 -1.83 1.76 -1.75 1.65 -1.52 -1.49 -1.39
CV2 Factor ms ec ec ms ms pah pss ec cc cc ec ms ec pah pah pss ec cc ms ms
CV2 Site 23 54 20 16 12 16 23 27 10 12 51 13 12 10 27 10 13 1 54 50
CV2 Weight -4.15 -3.50 -2.91 2.78 -2.66 2.61 2.34 1.85 1.68 -1.61 1.56 1.38 -1.19 1.16 1.13 -1.07 0.95 -0.92 0.92 0.92

Rank 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
CV1 Factor ec pss pss cc ec pss pah cc ms pah ec pah pah pss pss cc cc ms pss pss
CV1 Site 9 20 6 13 61 54 9 14 53 6 56 28 58 12 9 1 8 50 57 19
CV1 Weight -1.39 1.25 -1.20 1.17 1.13 -1.12 1.07 0.87 0.85 -0.80 0.79 0.79 0.76 0.75 -0.74 0.74 0.73 0.72 0.71 -0.71
CV2 Factor pss cc ec ms pah ms ec cc ms pss pss pah pah ec pah pah ec pah pss cc
CV2 Site 6 57 53 61 50 19 28 9 20 12 8 19 12 57 54 28 61 8 19 6
CV2 Weight 0.91 -0.90 0.85 0.83 -0.82 0.82 0.80 0.75 -0.75 -0.73 -0.72 -0.71 -0.69 -0.67 -0.67 -0.66 0.59 0.59 0.58 -0.56

Rank 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
CV1 Factor pss pah ms pss ec ms ec ec ec ms ec pah cc pah pah ms ms ec cc ms
CV1 Site 61 17 61 56 8 16 11 19 12 58 58 19 54 11 56 11 21 53 2 8
CV1 Weight -0.71 -0.70 -0.70 0.68 -0.64 0.64 0.64 -0.62 0.59 0.59 0.59 0.58 -0.58 0.58 0.58 0.57 -0.57 0.56 0.56 -0.56
CV2 Factor ms cc pss ec pss pss ms cc ms cc pah ms ms pah ec pah pah pss cc pss
CV2 Site 53 14 24 23 61 54 6 19 57 23 61 8 9 57 56 23 55 57 54 28
CV2 Weight 0.55 -0.54 0.54 0.54 -0.54 0.53 0.52 -0.52 0.50 0.48 0.48 0.48 0.47 -0.46 0.45 -0.43 0.42 -0.42 0.40 -0.40


Supplemental Table S3 (continued)

Rank 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
CV1 Factor cc ec cc pah pah ms ms pss ec pah cc pah pah ms pah pah ms cc ms ms
CV1 Site 16 6 15 5 8 14 19 58 21 15 3 14 26 1 55 64 28 57 2 10
CV1 Weight -0.56 -0.55 0.55 -0.53 0.52 0.52 -0.52 0.50 -0.49 0.49 -0.49 0.48 -0.48 0.48 -0.47 -0.46 0.45 0.44 0.42 0.41
CV2 Factor ec ms ms cc pss ec pss ms cc ms pss pss cc ms ms ms cc ms cc pah
CV2 Site 55 15 3 13 20 24 56 62 28 21 64 21 21 17 5 7 2 51 53 2
CV2 Weight -0.40 -0.39 -0.39 -0.38 0.38 -0.37 -0.37 -0.36 0.36 0.35 -0.35 0.34 -0.33 -0.32 -0.31 0.31 0.31 0.31 -0.31 -0.30

Rank 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
CV1 Factor ec cc cc ec pah pah cc cc ec cc pah cc cc ms cc cc pah ms ms ec
CV1 Site 28 11 5 2 61 54 51 12 17 64 12 10 61 3 17 55 16 6 18 10
CV1 Weight 0.41 0.40 -0.39 0.39 0.39 -0.37 -0.37 0.37 0.37 0.36 0.35 -0.35 0.34 0.34 -0.33 -0.32 0.32 -0.31 -0.31 -0.31
CV2 Factor cc ec ec ec ec cc ms cc cc ms ms pss ec cc pah cc ms cc ms ms
CV2 Site 25 3 15 50 5 16 26 63 52 11 52 2 9 5 13 27 58 22 59 14
CV2 Weight 0.29 -0.29 -0.29 0.28 -0.28 -0.27 0.26 0.26 0.25 0.25 -0.25 -0.24 0.24 0.24 -0.24 -0.24 0.24 0.23 0.23 0.23

Rank 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
CV1 Factor cc pah pss ec cc pss ec cc ec pah pss ms pah pss ec cc ms pss pss ms
CV1 Site 59 27 64 7 20 50 15 62 60 2 28 55 10 1 52 58 12 14 22 5
CV1 Weight -0.31 -0.31 0.31 0.30 -0.30 -0.30 -0.29 -0.29 0.29 0.28 0.28 -0.28 -0.27 0.27 -0.26 0.26 -0.25 -0.25 -0.24 -0.23
CV2 Factor ec ms pah pss cc pah pah ms cc ms ms cc pah ec pah pss pss ec ec pss
CV2 Site 4 55 20 58 24 7 52 18 64 63 64 11 56 1 58 18 14 63 18 60
CV2 Weight -0.22 -0.22 -0.22 -0.22 -0.22 0.22 -0.22 0.21 -0.21 0.20 0.20 -0.19 0.19 0.19 0.19 0.19 0.19 0.19 0.18 -0.18


Supplemental Table S3 (continued)

Rank 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
CV1 Factor pss ms ec pss pss ms ms ec ec cc pss pss pss ec pah ec cc pah ms ms
CV1 Site 55 7 1 51 18 51 4 4 50 6 8 7 15 64 18 63 18 62 17 60
CV1 Weight -0.22 -0.22 0.22 -0.21 -0.20 -0.20 0.20 -0.20 0.20 -0.20 0.20 -0.19 -0.19 0.19 0.19 0.18 0.18 -0.17 0.17 0.17
CV2 Factor ec pss ms ms cc pah cc ec pah cc ms pah cc pss ec ec cc ms cc ec
CV2 Site 62 52 28 25 50 26 51 2 22 20 10 1 61 62 21 26 60 56 17 14
CV2 Weight -0.18 0.18 -0.18 -0.18 -0.18 0.17 0.17 0.17 0.17 -0.16 -0.16 -0.16 0.16 -0.16 -0.15 0.15 0.15 0.15 0.15 0.15

Rank 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
CV1 Factor cc cc ms cc ms ms pss cc pss pah ms ec cc cc pss ec cc ms pah pss
CV1 Site 4 23 26 60 25 52 53 52 11 52 59 59 21 27 10 22 9 62 51 52
CV1 Weight 0.17 -0.16 0.16 -0.16 0.16 -0.16 -0.15 -0.15 -0.15 0.14 0.14 0.14 0.14 -0.13 0.13 -0.13 0.12 -0.12 -0.12 0.12
CV2 Factor pss pss pah pss ms pss pss pah ec pah cc cc cc pss pss pss ec pah cc ec
CV2 Site 7 5 18 26 4 3 1 25 11 17 55 4 8 27 50 53 16 53 58 8
CV2 Weight 0.14 0.14 -0.14 0.14 0.14 -0.14 -0.13 0.12 0.12 -0.12 0.12 -0.11 -0.11 -0.11 0.11 0.10 -0.10 0.10 -0.10 0.10

Rank 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
CV1 Factor pah ec ms cc pss ms pss cc cc ec pah pah pah pss pah pss ms pah pss ms
CV1 Site 22 26 56 56 63 22 59 19 50 24 3 4 60 4 7 24 63 21 25 24
CV1 Weight -0.12 -0.11 -0.11 -0.11 0.11 -0.11 0.11 0.11 -0.10 -0.10 -0.10 -0.09 0.09 0.09 0.08 0.08 -0.08 -0.08 -0.08 -0.08
CV2 Factor pss ec ec cc pss pss ms pah cc cc ec pah pah pss cc ec pss pah pah pss
CV2 Site 63 58 60 56 51 16 22 4 15 62 64 9 24 15 18 52 55 64 21 9
CV2 Weight 0.10 0.10 -0.09 -0.09 0.09 -0.09 0.09 0.09 -0.09 0.09 -0.09 -0.08 0.08 0.08 -0.07 -0.07 -0.07 0.07 -0.07 -0.07


Supplemental Table S3 (continued)

Rank 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
CV1 Factor pss ec pah pss ms cc pss pss ec ms ec pss cc cc pss pss ec cc pss pah
CV1 Site 5 18 24 3 64 28 2 27 55 27 14 62 53 63 21 17 62 24 26 59
CV1 Weight -0.08 0.07 -0.07 0.07 -0.07 -0.06 -0.06 -0.06 0.06 0.05 -0.05 0.05 -0.04 -0.04 0.04 -0.04 -0.04 0.03 -0.03 -0.03
CV2 Factor pss ms ms ms pah pah pah pss ec cc pss cc ms pah pah ec ec pah pah pah
CV2 Site 4 2 60 24 3 63 15 22 19 59 59 7 27 14 6 17 10 60 59 62
CV2 Weight -0.07 0.07 -0.07 -0.07 -0.07 -0.06 0.06 0.05 0.05 -0.05 0.05 0.05 -0.04 -0.04 0.04 0.04 0.04 0.04 0.03 -0.03

Rank 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
CV1 Factor cc ec cc ms ms pah pss pss cc pah pah cc ec ms ec
CV1 Site 7 25 22 9 15 63 60 16 25 25 1 26 5 57 3
CV1 Weight 0.03 -0.02 -0.02 -0.02 -0.02 0.02 -0.02 0.01 0.01 -0.01 0.01 -0.01 0.00 0.00 0.00
CV2 Factor ec pah pss pss ec pss ec ec pah cc cc pah ec ms pss
CV2 Site 22 5 11 25 59 17 25 7 51 3 26 11 6 1 13
CV2 Weight -0.03 0.03 -0.03 0.03 0.02 -0.02 -0.02 0.02 0.02 0.01 0.01 0.00 0.00 0.00 0.00

NOTE.–Two canonical variates from the CVA{all} are provided. The factor score (pah, pss, ms, cc, and ec) and the site are ranked within each canonical variate by their weight. Site/factor score pairs with weights further from zero have higher discerning power between Plant, Animal and Fungal sequences.


APPENDIX C

Chapter 4

Diverse and tissue-enriched small RNAs in the plant pathogenic fungus, Magnaporthe oryzae

Cristiano C. Nunes1, Malali Gowda1,2, Joshua Sailsbery1, Minfeng Xue1,3, Feng Chen4, Douglas Brown1, YeonYee Oh1, Thomas K. Mitchell5, Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC, 27606, USA
2 Next-Generation Genomics Laboratory, Center for Cellular and Molecular Platform, NCBS-GKVK Campus, Bangalore 560065, India
3 Department of Plant Pathology, China Agricultural University, Beijing 1000193, PR China
4 US DOE Joint Genome Institute, Walnut Creek, CA 94598, USA
5 Department of Plant Pathology, The Ohio State University, Columbus, OH 43210, USA

BMC Genomics 2011, 12:288

Additional Material


Additional file 2 – Small RNAs with perfect match to snRNAs. snRNA-derived small RNAs mapped predominantly to the 3' end of U2 (A) and U4 (B). In contrast, U6-derived small RNAs mapped largely toward the 5' end (C).


III:2461400..2461800


IV:2289300..2289700


VI:3317789..3317631


Additional file 3 – Distribution of mycelia small RNAs mapped to repetitive elements

Column groups: Alignment (a): Total, Sense, Antisense | Read Count (b): Total, Sense, Antisense | Prorated (c): Total, Sense, Antisense | Features (d): Mapped, Total, Coverage

Element | Alignment (a) | Read Count (b) | Prorated (c) | Features (d)
Transposable Elements | 248528 113951 134620 | 3369 1522 1883 | 3305 1484 1821 | 1898 3448 55%
AFUT1 | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
BCBOTYPOL | 0 0 0 | 0 0 0 | 0 0 0 | 0 36 0%
GYMAG1_I | 25380 10370 15012 | 955 408 549 | 953 405 548 | 72 78 92%
GYMAG1_LTR | 499 152 349 | 4 2 4 | 4 1 3 | 250 303 83%
GYMAG2_I | 21639 9684 11957 | 742 331 413 | 739 329 410 | 75 81 93%
GYMAG2_LTR | 409 139 272 | 12 3 11 | 7 2 5 | 118 155 76%
Gypsy-1-I_AN | 0 0 0 | 0 0 0 | 0 0 0 | 0 2 0%
GYPSY1_MG | 0 0 0 | 0 0 0 | 0 0 0 | 0 10 0%
Gypsy2-I_AO | 0 0 0 | 0 0 0 | 0 0 0 | 0 3 0%
Helitron-1_AN | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
Hop | 0 0 0 | 0 0 0 | 0 0 0 | 0 3 0%
MAGGY_I | 167086 75618 91470 | 1353 620 735 | 1347 616 731 | 326 329 99%
MAGGY_LTR | 7282 2250 5034 | 27 10 19 | 25 8 17 | 360 365 99%
Mariner-1_AF | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
Mariner-6_AN | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
MARY1_TM | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
MGR583 | 24865 14988 9879 | 157 95 67 | 111 72 39 | 388 425 91%
MGRL3_I | 821 350 473 | 96 43 55 | 94 42 52 | 24 39 62%
MGRL3_LTR | 1 1 1 | 1 1 1 | 0 0 0 | 1 93 1%
MOLLY_SN | 119 119 1 | 3 3 1 | 2 2 0 | 60 70 86%
NHT2_I | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
OCCAN_MG | 12 8 6 | 12 8 6 | 0 0 0 | 2 97 2%
POT2 | 162 146 18 | 28 13 17 | 3 2 1 | 137 414 33%
PYRET_I | 139 111 30 | 16 14 8 | 7 3 3 | 67 499 13%
PYRET_LTR | 4 4 1 | 2 2 1 | 0 0 0 | 4 417 1%
REALAA_I | 0 0 0 | 0 0 0 | 0 0 0 | 0 2 0%
SKIPPY | 0 0 0 | 0 0 0 | 0 0 0 | 0 6 0%
TCN1-I | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0%
TY3 | 56 1 56 | 29 1 29 | 12 0 12 | 2 2 100%
U35230 | 546 314 234 | 92 57 37 | 2 1 1 | 12 12 100%

a Alignment refers to the summation of small RNA alignments to any genomic feature.
b Read Count represents the summation of distinct reads mapping to a given feature. Note that the values for each genome feature are generally less than the sum of its sub-features because small RNAs can map to multiple features (see “Material and Methods” for more details).
c Prorated apportions the weight of any small RNA between alignments and features.
d Features represent the proportion of genomic features mapped by small RNAs, where Mapped indicates the number of members of each genomic feature mapped by small RNAs among the Total possible.
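As an illustration of footnote d, the Coverage column can be reproduced from the Mapped and Total columns. The sketch below assumes coverage is Mapped divided by Total, rounded to the nearest whole percent; the rounding rule and the function name are assumptions, checked here only against a few rows of the table above.

```python
def coverage(mapped: int, total: int) -> str:
    """Format mapped/total as a whole-number percentage (assumed rounding)."""
    if total == 0:
        return "n/a"  # no members of this feature exist to be mapped
    return f"{round(100 * mapped / total)}%"


# (Mapped, Total) pairs taken from the mycelia repetitive-element table.
rows = {
    "Transposable Elements": (1898, 3448),
    "GYMAG1_I": (72, 78),
    "MGRL3_I": (24, 39),
    "PYRET_LTR": (4, 417),
}
computed = {name: coverage(m, t) for name, (m, t) in rows.items()}
```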


Additional file 5 – Distribution of mycelia small RNAs mapped to tRNAs.

Column groups: Alignment (a): Total, Sense, Antisense | Read Count (b): Total, Sense, Antisense | Prorated (c): Total, Sense, Antisense | Features (d): Mapped, Total, Coverage

Type | Alignment (a) | Read Count (b) | Prorated (c) | Features (d)
tRNA | 32008 26258 5752 | 5710 5623 89 | 4482 4447 35 | 356 361 99%
Ala | 1032 1032 1 | 379 379 1 | 310 310 0 | 15 15 100%
Arg | 352 352 1 | 224 224 1 | 209 209 0 | 15 16 94%
Asn | 253 253 1 | 131 131 1 | 91 91 0 | 8 8 100%
Asp | 2420 2420 1 | 848 848 1 | 528 528 0 | 12 12 100%
Cys | 43 43 1 | 43 43 1 | 30 30 0 | 3 3 100%
Gln | 560 560 1 | 238 238 1 | 175 175 0 | 8 8 100%
Glu | 2959 2957 4 | 579 579 2 | 306 306 0 | 11 12 92%
Gly | 1171 1171 1 | 256 256 1 | 207 207 0 | 21 22 95%
His | 479 479 1 | 128 128 1 | 70 70 0 | 5 5 100%
Ile | 127 127 1 | 70 70 1 | 68 68 0 | 10 10 100%
Leu | 2433 2433 1 | 863 863 1 | 473 473 0 | 15 15 100%
Lys | 3236 3228 10 | 646 646 2 | 322 321 1 | 13 13 100%
Met | 242 242 1 | 87 87 1 | 71 71 0 | 8 8 100%
Phe | 276 276 1 | 88 88 1 | 47 47 0 | 8 8 100%
Pro | 706 706 1 | 207 207 1 | 167 167 0 | 9 9 100%
SeC | 28 28 1 | 14 14 1 | 13 13 0 | 2 2 100%
Ser | 806 806 1 | 376 376 1 | 290 290 0 | 13 13 100%
Thr | 1428 1428 1 | 588 588 1 | 236 236 0 | 10 10 100%
Trp | 56 56 1 | 14 14 1 | 12 12 0 | 4 4 100%
Tyr | 105 105 1 | 21 21 1 | 16 16 0 | 5 5 100%
Val | 1473 1473 1 | 759 759 1 | 273 273 0 | 11 11 100%
Pseudo | 15322 9618 5706 | 1034 948 88 | 568 534 34 | 149 151 99%
Undet | 89 53 38 | 89 53 38 | 2 1 0 | 1 1 100%

a Alignment refers to the summation of small RNA alignments to any genomic feature.
b Read Count represents the summation of distinct reads mapping to a given feature. Note that the values for each genome feature are generally less than the sum of its sub-features because small RNAs can map to multiple features (see “Material and Methods” for more details).
c Prorated apportions the weight of any small RNA between alignments and features.
d Features represent the proportion of genomic features mapped by small RNAs, where Mapped indicates the number of members of each genomic feature mapped by small RNAs among the Total possible.


Additional file 6 – Distribution of appressoria small RNAs mapped to tRNAs.

Column groups: Alignment (a): Total, Sense, Antisense | Read Count (b): Total, Sense, Antisense | Prorated (c): Total, Sense, Antisense | Features (d): Mapped, Total, Coverage

Type | Alignment (a) | Read Count (b) | Prorated (c) | Features (d)
tRNA | 8749 8476 275 | 1989 1988 3 | 1636 1635 2 | 330 361 91%
Ala | 1996 1996 1 | 341 341 1 | 313 313 0 | 13 15 87%
Arg | 165 165 1 | 103 103 1 | 94 94 0 | 12 16 75%
Asn | 107 107 1 | 48 48 1 | 37 37 0 | 8 8 100%
Asp | 1589 1589 1 | 394 394 1 | 278 278 0 | 12 12 100%
Cys | 10 10 1 | 10 10 1 | 10 10 0 | 1 3 33%
Gln | 366 366 1 | 143 143 1 | 125 125 0 | 8 8 100%
Glu | 871 871 1 | 200 200 1 | 108 108 0 | 11 12 92%
Gly | 1068 1068 1 | 114 114 1 | 94 94 0 | 20 22 91%
His | 165 165 1 | 42 42 1 | 23 23 0 | 5 5 100%
Ile | 39 39 1 | 15 15 1 | 15 15 0 | 4 10 40%
Leu | 419 419 1 | 268 268 1 | 71 71 0 | 15 15 100%
Lys | 1044 1044 1 | 227 227 1 | 128 128 0 | 13 13 100%
Met | 110 110 1 | 58 58 1 | 40 40 0 | 7 8 88%
Phe | 9 9 1 | 3 3 1 | 1 1 0 | 8 8 100%
Pro | 94 94 1 | 39 39 1 | 33 33 0 | 9 9 100%
SeC | 0 0 0 | 0 0 0 | 0 0 0 | 0 2 0%
Ser | 214 214 1 | 85 85 1 | 69 69 0 | 11 13 85%
Thr | 753 753 1 | 215 215 1 | 136 136 0 | 10 10 100%
Trp | 20 20 1 | 5 5 1 | 4 4 0 | 4 4 100%
Tyr | 50 50 1 | 10 10 1 | 7 7 0 | 5 5 100%
Val | 77 77 1 | 46 46 1 | 45 45 0 | 10 11 91%
Pseudo | 287 16 273 | 12 11 3 | 7 5 2 | 143 151 95%
Undet | 2 1 2 | 2 1 2 | 0 0 0 | 1 1 100%

a Alignment refers to the summation of small RNA alignments to any genomic feature.
b Read Count represents the summation of distinct reads mapping to a given feature. Note that the values for each genome feature are generally less than the sum of its sub-features because small RNAs can map to multiple features (see “Material and Methods” for more details).
c Prorated apportions the weight of any small RNA between alignments and features.
d Features represent the proportion of genomic features mapped by small RNAs, where Mapped indicates the number of members of each genomic feature mapped by small RNAs among the Total possible.


Additional file 7 – Logos of tRFs mapping to tRNA-Ala. Members of tRNA-Ala grouped into four types. In both libraries, tRFs mapped predominantly to the 3' half and preferentially to one tRNA-Ala type.


Mycelia Appressoria


APPENDIX D

Chapter 5

Genome-wide characterization of methylguanosine-capped and polyadenylated small RNAs in the rice blast fungus Magnaporthe oryzae

Malali Gowda1,2, Cristiano C. Nunes1, Joshua Sailsbery1, Minfeng Xue1,3, Feng Chen4, Cassie A. Nelson5, Douglas E. Brown1, Yeonyee Oh1, Shaowu Meng1, Thomas Mitchell6, Curt H. Hagedorn5 and Ralph A. Dean1

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, North Carolina State University, Raleigh, NC 27606
2 Plant Biology, Michigan State University, East Lansing, MI 48824, USA
3 Department of Plant Pathology, China Agricultural University, Beijing 100094, China
4 US DOE Joint Genome Institute, Walnut Creek, CA 94598
5 University of Utah School of Medicine and Huntsman Cancer Institute, Salt Lake City, UT 84132
6 Department of Plant Pathology, Ohio State University, Columbus OH 43210, USA

Nucleic Acids Research, 2010, Vol. 38, No. 21, pages 7558 – 7569

Additional Material


Supplementary Table 3 Detailed characterization of CPA-sRNA mapping and alignments to genomic and mitochondrial features; types of tRNA; and specific types of Transposable Elements

Column groups: Read Count: Total (Sense, Antisense), Unique3 (Sense, Antisense), Multiple4 (Sense, Antisense) | Prorated: Total (Sense, Antisense) | Features1: Mapped, Coverage | Alignment Count2: Total (Sense, Antisense), Unique (Sense, Antisense), Multiple (Sense, Antisense)

Feature | Read Count | Prorated | Features1 | Alignment Count2

Nuclear
Genome | 13340 3547 11047 2785 2293 762 | 12296 2250 | 6753 26% | 56920 8539 11047 2785 45873 5754
Genes | 8894 3507 8133 2774 761 733 | 7579 2201 | 4327 39% | 12385 6356 8133 2774 4252 3582
Introns | 456 313 283 161 173 152 | 239 139 | 467 2% | 1318 818 283 161 1035 657
Exons | 8685 3386 7949 2680 736 706 | 7340 2062 | 4977 16% | 11407 5660 7949 2680 3458 2980
5'UTR | 2526 705 2187 381 339 324 | 1967 356 | 1325 12% | 3485 1035 2187 381 1298 654
EST | 1201 46 1119 34 82 12 | 1074 31 | 375 15% | 1295 50 1119 34 176 16
No-EST | 1384 660 1068 347 316 313 | 893 325 | 950 11% | 2190 985 1068 347 1122 638
CDS | 1603 779 1208 435 395 344 | 1090 500 | 1581 14% | 2178 1032 1208 435 970 597
EST | 1233 518 988 289 245 229 | 847 320 | 1085 17% | 1707 642 988 289 719 353
No-EST | 426 340 220 146 206 194 | 243 180 | 496 11% | 471 390 220 146 251 244
3'UTR | 5653 2276 5237 1924 416 352 | 4283 1206 | 2597 23% | 6763 3735 5237 1924 1526 1811
EST | 3643 148 3498 133 145 15 | 2887 94 | 801 31% | 3782 148 3498 133 284 15
No-EST | 2090 2131 1739 1791 351 340 | 1397 1111 | 1796 21% | 2981 3587 1739 1791 1242 1796
tRNA | 394 31 247 6 147 25 | 1396 6 | 287 84% | 4447 169 247 6 4200 163
5'Leader | 237 24 211 1 26 23 | 191 1 | 151 44% | 297 24 211 1 86 23
Mature | 226 1 146 0 80 1 | 188 0 | 274 80% | 2665 1 146 0 2519 1
3'Term | 93 7 21 5 72 2 | 36 5 | 186 55% | 2185 145 21 5 2164 140
rRNA | 1740 1 675 1 1065 0 | 1642 1 | 47 98% | 3916 1 675 1 3241 0
5.8s | 82 0 38 0 44 0 | 82 0 | 3 100% | 132 0 38 0 94 0
8s | 66 0 18 0 48 0 | 46 0 | 41 100% | 1350 0 18 0 1332 0
18s | 660 1 529 1 131 0 | 592 1 | 1 50% | 660 1 529 1 131 0
28s | 932 0 90 0 842 0 | 922 0 | 2 100% | 1774 0 90 0 1684 0
snRNA | 24 0 7 0 17 0 | 16 0 | 5 31% | 41 0 7 0 34 0
Trans. Elem. | 325 102 79 5 246 97 | 278 42 | 2087 61% | 20671 2168 79 5 20592 2163
Intergenic | 2778 - 1968 - 810 - | 2498 - | 0 - | 19376 - 1968 - 17408 -

Mitochondrial
Genome | 41 3 40 3 1 0 | 40 3 | 14 36% | 40 3 40 3 1 0
Genes | 8 3 7 3 1 0 | 6 3 | 8 53% | 8 3 7 3 1 0
CDS | 3 0 2 0 1 0 | 3 0 | 2 5% | 3 0 2 0 1 0
tRNA | 3 0 3 0 0 0 | 2 0 | 4 20% | 3 0 3 0 0 0
Mature | 1 0 1 0 0 0 | 1 0 | 1 5% | 1 0 1 0 0 0
rRNA | 31 0 31 0 0 0 | 30 0 | 2 100% | 31 0 31 0 0 0
Intergenic | 3 - 3 - 0 - | 3 - | 0 - | 3 - 3 - 0 -

tRNA5
Ala | 27 0 23 0 4 0 | 20 0 | 10 67% | 29 0 23 0 6 0
Arg | 11 0 10 0 1 0 | 9 0 | 8 50% | 12 0 10 0 2 0
Asn | 4 2 3 1 1 1 | 3 1 | 5 63% | 7 2 3 1 4 1
Asp | 10 2 4 1 6 1 | 7 1 | 8 67% | 19 2 4 1 15 1
Cys | 4 0 4 0 0 0 | 2 0 | 2 67% | 4 0 4 0 0 0
Gln | 7 0 7 0 0 0 | 7 0 | 5 63% | 7 0 7 0 0 0
Glu | 21 1 15 1 6 0 | 14 1 | 11 92% | 38 1 15 1 23 0
Gly | 31 0 25 0 6 0 | 27 0 | 14 64% | 83 0 25 0 58 0
His | 4 0 1 0 3 0 | 2 0 | 5 100% | 13 0 1 0 12 0
Ile | 15 2 13 2 2 0 | 15 2 | 8 80% | 19 2 13 2 6 0
Leu | 25 0 18 0 7 0 | 18 0 | 13 87% | 37 0 18 0 19 0
Lys | 23 0 17 0 6 0 | 15 0 | 9 69% | 57 0 17 0 40 0
Met | 20 0 10 0 10 0 | 17 0 | 6 75% | 39 0 10 0 29 0
Phe | 7 1 7 1 0 0 | 6 0 | 4 50% | 7 1 7 1 0 0
Pro | 8 0 7 0 1 0 | 6 0 | 7 78% | 12 0 7 0 5 0
SeC | 1 0 1 0 0 0 | 1 0 | 1 50% | 1 0 1 0 0 0
Ser | 18 0 11 0 7 0 | 13 0 | 13 100% | 26 0 11 0 15 0
Thr | 73 0 25 0 48 0 | 19 0 | 9 90% | 95 0 25 0 70 0
Trp | 8 1 4 0 4 1 | 7 0 | 4 100% | 16 1 4 0 12 1
Tyr | 1 0 1 0 0 0 | 0 0 | 1 20% | 1 0 1 0 0 0
Val | 22 0 22 0 0 0 | 20 0 | 7 64% | 22 0 22 0 0 0
Pseudo | 73 22 29 0 44 22 | 53 1 | 140 93% | 3913 159 29 0 3884 159
Undet | 18 1 0 0 18 1 | 0 0 | 1 100% | 18 1 0 0 18 1


Supplementary Table 3 (continued)

Column groups: Read Count: Total (Sense, Antisense), Unique3 (Sense, Antisense), Multiple4 (Sense, Antisense) | Prorated: Total (Sense, Antisense) | Features1: Mapped, Coverage | Alignment Count2: Total (Sense, Antisense), Unique (Sense, Antisense), Multiple (Sense, Antisense)

Feature | Read Count | Prorated | Features1 | Alignment Count2

Transposable Elements
GYMAG1 I | 7 3 1 0 6 3 | 6 3 | 57 73% | 79 92 1 0 78 92
GYMAG1 LTR | 23 1 2 0 21 1 | 22 0 | 161 53% | 2769 1 2 0 2767 1
GYMAG2 I | 4 5 0 0 4 5 | 2 5 | 54 67% | 58 105 0 0 58 105
GYMAG2 LTR | 4 19 0 0 4 19 | 4 1 | 111 72% | 196 123 0 0 196 123
GYPSY1 MG | 1 1 0 1 1 0 | 1 1 | 5 50% | 4 1 0 1 4 0
MAGGY I | 10 2 0 0 10 2 | 8 1 | 236 72% | 848 138 0 0 848 138
MAGGY LTR | 50 0 5 0 45 0 | 46 0 | 343 94% | 6386 0 5 0 6381 0
MGR583 | 195 23 67 1 128 22 | 150 5 | 391 92% | 9153 543 67 1 9086 542
MGRL3 I | 1 3 1 1 0 2 | 1 2 | 13 33% | 1 12 1 1 0 11
MGRL3 LTR | 2 1 1 0 1 1 | 2 0 | 4 4% | 3 1 1 0 2 1
MOLLY SN | 4 1 0 1 4 0 | 2 1 | 60 86% | 115 1 0 1 115 0
OCCAN MG | 9 2 0 0 9 2 | 7 1 | 50 52% | 272 45 0 0 272 45
POT2 | 33 31 13 1 20 30 | 12 8 | 341 82% | 466 663 13 1 453 662
PYRET I | 16 16 1 0 15 16 | 3 11 | 73 15% | 43 166 1 0 42 166
PYRET LTR | 4 4 0 0 4 4 | 2 3 | 174 42% | 140 245 0 0 140 245
TY3 | 0 1 0 0 0 1 | 0 0 | 2 100% | 0 2 0 0 0 2
U35230 | 58 3 0 0 58 3 | 9 0 | 12 100% | 163 30 0 0 163 30

1 For each Feature, the number of members mapped by a CPA-sRNA is given, followed by the percentage of those mapped over the total number of features possible. Each location in the genome of a transposable element is considered a different member of that transposable element.
2 Alignments are the number of CPA-sRNA genomic alignments that map to a given feature. CPA-sRNAs may align to the genome once, many times, or not at all, and thus be annotated to different features.
3 The number of CPA-sRNAs that align to one and only one location in the genome.
4 The number of CPA-sRNAs that align to more than one location in the genome.
5 tRNA types are located in both the nuclear and mitochondrial genomes.



Supplementary Figure 1 - Example of prorating CPA-sRNA genomic alignments. The prorating of each mapping is given in parentheses.
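One plausible reading of prorating is sketched below: each CPA-sRNA carries a weight of 1 that is split evenly across all of its genomic alignments, and each alignment's share is credited to the feature it overlaps. The even split, the function name, and the toy reads are all assumptions for illustration, not the exact procedure used for the figure.

```python
from collections import defaultdict
from fractions import Fraction


def prorate(alignments: dict[str, list[str]]) -> dict[str, Fraction]:
    """Split each read's unit weight evenly over its alignments (assumed rule).

    `alignments` maps a read id to the feature hit by each of its genomic
    alignments; a feature may appear more than once for one read.
    """
    totals: dict[str, Fraction] = defaultdict(Fraction)
    for features in alignments.values():
        if not features:
            continue  # unaligned reads contribute nothing
        share = Fraction(1, len(features))
        for feature in features:
            totals[feature] += share
    return dict(totals)


# Toy reads (made up): one unique mapper and two multi-mappers.
toy_reads = {
    "read1": ["tRNA"],
    "read2": ["tRNA", "rRNA"],
    "read3": ["rRNA", "rRNA", "Intergenic", "Intergenic"],
}
prorated = prorate(toy_reads)
```

Under this rule the prorated totals across features sum to the number of aligned reads, so multi-mapping reads are not counted more than once.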


Supplementary Figure 3 - Correlation analysis of sense-mapping MPSS (A and B) or SAGE (C and D) tags with the number of mapped CPA-sRNAs (A and C, sense mapping; B and D, antisense mapping). Genes were grouped into 100 bins based on MPSS and SAGE expression. Average MPSS and SAGE expression per bin was plotted versus the average number of CPA-sRNAs for each bin.
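The binning described in the caption can be sketched as follows, assuming genes are sorted by expression and divided into equal-sized bins, and that R2 is the squared Pearson correlation of the per-bin means. Both assumptions and all names are illustrative; the caption does not specify how bin boundaries were chosen.

```python
def bin_means(pairs: list[tuple[float, float]], n_bins: int = 100) -> list[tuple[float, float]]:
    """Sort (expression, sRNA count) pairs by expression, split them into
    n_bins near-equal bins, and return the mean of each bin."""
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    means = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when there are fewer genes than bins
            xs, ys = zip(*chunk)
            means.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return means


def r_squared(points: list[tuple[float, float]]) -> float:
    """Squared Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy * sxy / (sxx * syy)


# Toy data (made up): sRNA counts rise linearly with expression.
toy_genes = [(float(i), 2.0 * i + 1.0) for i in range(1000)]
toy_bins = bin_means(toy_genes)
toy_r2 = r_squared(toy_bins)
```

Averaging within bins smooths gene-level noise, which is one reason binned correlations can be much higher than gene-level ones.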


[Figure: four scatter plots. Panel A (R2 = 0.94) and panel B (R2 = 0.02): mean CPA-sRNA number per bin versus mean MPSS tags per bin. Panel C (R2 = 0.89) and panel D (R2 = 0.02): mean CPA-sRNA number per bin versus mean SAGE tags per bin.]


Supplementary Figure 5 - Distribution of CPA-sRNAs across 18S, 5.8S and 28S rRNA loci. (A) Graphical representation of the genomic contig, ESTs, rRNA features, CPA-sRNAs, and MPSS and SAGE tags. (B) Distribution of CPA-sRNAs across different rRNA elements (5.8S, 28S and 18S). (C) Sequence diversity of CPA-sRNAs across the 5.8S rRNA locus. The gel photograph represents the 3' RACE products obtained from 5' methylguanosine-capped and polyA+ RNAs. Underlined sequences indicate non-matched nucleotides. The location of SAGE tags from the 3' end of the 5.8S region is shown at the bottom.


[Figure: panels A, B, and C. Panel C is a sequence alignment across the 5.8S rRNA locus showing the genome sequence, the annotated 5.8S rRNA, ESTs, 454-sequenced CPA-sRNAs, 3' RACE CPA-sRNA products with poly(A) tails, and CPA-sRNA variants at the 3' end of the 5.8S region; the gel photograph carries 300 bp and 100 bp size markers.]


[Figure: region I:475600..482600; tracks: 454 reads, CPA-sRNA]

Supplementary Figure 6 - Distribution of CPA-sRNAs across the retrotransposon MAGGY with LTR.


[Figure: tracks: 454 reads, CPA-sRNA]

Supplementary Figure 7 - Distribution of CPA-sRNAs in the M. oryzae mitochondrial genome.


APPENDIX E

Chapter 6

D-SynD: Dimensional Synteny Detection, Identification of Syntenic Regions between Multiple Genomes.

Joshua K. Sailsbery1,2, Brent J. Clay3, Carey R. Jackson4, Douglas E. Brown1, and Ralph A. Dean1*

1 Fungal Genomics Laboratory, Center for Integrated Fungal Research, Department of Plant Pathology, North Carolina State University, Raleigh, NC 27606, USA
2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA
3 Department of Mechanical and Aerospace Engineering, North Carolina State University, Raleigh, NC 27606, USA
4 Department of Statistics, North Carolina State University, Raleigh, NC 27606, USA

Additional Material


Supplemental Figure S1 – Imperfect syntenic block found by D-SynD in the Sordariomycetes analysis.


Supplemental Figure S2 – Imperfect syntenic block, including two sequence inversions, found by D-SynD in the Sordariomycetes analysis.
