AN ABSTRACT OF THE THESIS OF

Dabao Sun Lu for the degree of Master of Science in Botany and presented on September 6, 2016.

Title: Genome Organization in the Ectomycorrhizal Truffle Rhizopogon vesiculosus.

Abstract approved: ______Joseph W. Spatafora

Rhizopogon vesiculosus is a common ectomycorrhizal (EM) symbiont of

Pseudotusga menziesii (Douglas-fir) in the coast range of the Pacific Northwest. The has been studied for its systematics, genet size, population structure, and competitive ability in several field and experimental studies. This thesis seeks to provide a more thorough characterization of the genome of R. vesiculosus using bioinformatic analyses with the objective of improving our understanding of its genomic architecture and to obtain a more robust reference for future studies on Rhizopogon. The analysis was facilitated by Chromatin conformation capture assay (Hi-C) sequencing which enabled the assembly of our previous draft genome into 10 chromosome level linkage groups.

Kingdom wide comparative genomic studies in fungi suggest that there are particular signatures in ectomycorrhizal genomes including an increase of transposable elements

(TE) and lineage specific small secreted proteins (SSPs). This observation was true for R. vesiculosus as we mined the genome for TEs and SSPs. We sought to classify the TEs, but found that a large proportion of them had no hits to curated TE databases. The abundance of SSPs fit the profile of other EM fungi with approximately 200 SSPs and

half of them being species specific. We identified species specific genes (SSG) and orthologous clusters shared with the sister Suillus, along with clusters unique to genus Rhizopogon and R. subgenus Villosuli, as well as the mating loci and centromeric regions. The long-range contiguity allowed us to map all these elements together with single nucleotide polymorphisms (SNPs). We hypothesized that the SSPs, TE, and lineage specific genes would cluster more around the ends of our linkage groups where recombination rate tends to be elevated, and that the centromeres would be highly enriched for TEs, but the pattern turned out to be more complex. Centromeres were regional in nature, comprising an average length of 40,000 bp and a gene density similar to the remainder of the genome. Most chromosomes contained regions that were enriched for species specific genes, but these regions were not clustered towards the ends of linkage groups. However genome mapping revealed higher SNP densities in TEs and species specific genes, in contrast to reduced levels in the orthologs shared with Suillus, suggestive of a “two-speed-genome” with heterogeneous rates of evolution. Moreover, the transition/transversion ratio was found to be significantly higher in TEs, which we interpreted as a possible signature of genome defense mechanisms, highlighting the many opposing evolutionary forces that a genome must balance. This work represents a significant step forward in the genomics of Rhizopogon, and we have demonstrated the utility of Hi-C sequencing for improving assemblies of dikaryotic fungi. A number of bioinformatics pipelines were developed or refined, and these may serve to streamline future Hi-C analysis of other genomes.

©Copyright by Dabao Sun Lu September 6, 2016 All Rights Reserved

Genome Organization in the Ectomycorrhizal Truffle Rhizopogon vesiculosus

by Dabao Sun Lu

A THESIS

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Master of Science

Presented September 6, 2016 Commencement June 2017

Master of Science thesis of Dabao Sun Lu presented on September 6, 2016

APPROVED:

Major Professor, representing Botany and Plant Pathology

Head of the Department of Botany and Plant Pathology

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.

Dabao Sun Lu, Author

ACKNOWLEDGEMENTS

I feel privileged for having been given the opportunity to embark on what will be remembered as three enjoyable and formative years as a student at Oregon State

University where I have been able to grow significantly as a person and was molded as a scientist. The many people who in some way have contributed to this thesis and experience outnumber the literature references, and I will therefore limit myself to mention those who have been most directly involved with my project. First and foremost

I would like to thank my advisor Dr. Joseph Spatafora. I am truly grateful for his willingness to take me on in his lab and believing that I had what it took to undertake and complete graduate studies in , where many others, including myself, at times doubted so. His way of thinking about mycology has been mind-opening, and his passion for it is contagious. His personal support has been crucial for navigating me through what at times can be rough waters in the graduate school research experience, and I would never have felt as much at home here without him and his wife Elizabeth’s hospitality and care. I am grateful to the whole Spatafora lab for being the friendly and creative working environment which it is. Among lab members I credit Alija Mujic for introducing me to the wonders of the hypogeous world, and I am particularly appreciative for all the hours he spent patiently mentoring me in every aspect of mycology, and for being a dear friend. I am thankful to Kathryn Bushley and Alisha Quandt for showing me the ropes of molecular biology and making someone far away from home feel welcome through their cheerful presence. I thank Derek Johnson for significantly reducing the shock factor in my introduction to the world of bioinformatics, and for keeping me

company in the small hours before deadlines. I will never forget our many discussions and shared frustration over poorly documented code. I thank Rheannon Arvidson and

Ying Chang for being good lab citizens, for helpful suggestions that improved this project along the way, and for sharing my preference for tea. Thank you to Kazuya Tsukagoshi for brightening up the days with his good humor and friendship. A special thanks to

Brittny Gardner and her zealous spirit which helped me get through what otherwise would have been insurmountable amounts of labwork, for her efforts to improve my

English in spite of any tangible results, and for showing me Oregon outside Corvallis. I am indebted to our collaborators Ivan Liachko and Shawn Sullivan who provided the data, without their expertise and initiative this project would never have happened.

I know my stay here would not have been as streamlined without Dianne Simpson,

Carleen Nutt, and Blaine Baker in the BPP department handling the administrative affairs.

It would not have been as good of a learning experience without the numerous instructors

I was able to interact with through my coursework, and most importantly it would not have been as enjoyable without all my fellow students in BPP. I am thankful for the international office at OSU, and in particular I want to recognize Emiko Christopherson and Marigold Holmes for helping me through copious amounts of paperwork, and I also thank Kelly Bonin and her colleagues at the IIE for working with me the entire time. I thank my committee members, Jane Smith, Nikolaus Grunwald and David Myrold, for all their support, and for understanding the unpredictable nature of biological research. I would like to acknowledge Christopher Sullivan for all the hard work he puts in to keep the CGRB computational cluster running, which my thesis has heavily relied on.

There are many foundations from which I have received support. Foremost, I am indebted to the Fulbright program and Fulbright for making the prospect of pursuing a graduate program in the US a financially possible option for me in the first place. In addition to Fulbright, the support I received from Dr. Haavard Kauserud among several professors at the University of Oslo increased my confidence enough to apply for graduate programs, and provided me with the momentum to venture all the way across the Atlantic and the North American continent in search for new pastures. I received the

Haakon Styri Award from the American Scandinavian Foundation during my second year.

The scholarship is administered by the Norway America Association, and was key for continuing my funding, and also contributed significant amounts to my research funds.

Among other things, it facilitated a visit to the Bruns lab at UC Berkeley where Sara

Branco generously and patiently took her time to train me in several bioinformatic pipelines containing aspects which have been important in my thesis work. I thank NATS and all their members for constituting both a monetary, intellectual, and recreational resource. I have received financial support from NATS through the Henry Pavelek

Memorial Scholarship Fund and a student scholarship which sent me truffling to the

Sierra Nevada. I am also grateful to OMS and SOMA for generously supporting me through their student scholarships. I received the GSA and Anita Summers travel grant from the BPP to present this project at the annual meeting of the Mycological Society of

America at Berkeley in 2016, and I would like to extend my gratitude for the administers, founders, and fundraisers for these funds. During my last year, I have received funding through working as a research assistant for the Oregon State University Herbarium, and I would like to thank Dr. Richard Halse and all the other herbarium workers for making it a

peaceful and friendly working environment. The herbarium position involves teaching the mycology class, which proved to be a great learning experience much thanks to Dr.

Dan Luoma and Chris Bennemann.

I thank Joey Deshields, Elva Manquera, and Eric Renteria for hunting

Rhizopoogon in the field with me, although the samples we collected did not get to play a role in this project, the cultures we obtained from them represent valuable additions that will likely play a role in future Rhizopogon projects. I am grateful for all the wonderful people I have met who finally have made a third-culture child feel at home, including my housemates at Lilly Avenue who kept me grounded and have become a second family to me. Finally, I would like to thank my parents, who are probably responsible for getting me interested in mycology in the first place, for their encouragement and support for me to take on this path.

With appreciation, Dabao Sun Lu

CONTRIBUTIONS OF AUTHORS

Chapter 2: Alija B. Mujic sequenced, assembled, and annotated the Illumina draft genome of Rhizopogon vesiculosus. Ivan Liachko and Shawn Sullivan performed the Hi- C sequencing, ran the clustering analysis and centromere calling, and provided advice on the interpretation of these results together with Maitreya Dunham. Joseph W. Spatafora advised and assisted with analyses and edited the manuscript.

TABLE OF CONTENTS

Page

Chapter 1: Introduction………………….....……………………………………………..1

Order Boletales……………………………………………………………………2

Rhizopogon………....…....……………………………………………………...... 3

Genus Rhizopogon...…………………………………………………...... 3

Rhizopogon and the ectomycorrhizal symbiosis…………………………..4

Biology of Rhizopogon vesiculosus………….…..………………………..5

Population genetic structure and patterns of genet distribution in R. vesiculosus………………………..……….…..………………………..7

Agaricomycete mating genetics………………..….………………………………8

Life cycle and mating systems…………………………………………….8

Genetic function and architecture of the mating genes……………………9

The mating genetics of Rhizopogon...... ….……………………………..11

Genomics………………………………………………………………………...12

A short history of sequencing technology...……………………………..12

Hi-C sequencing…………………...……………………………………..14

Overview of fungal genomics...... ……………………………………...17

Transposable elements.…………………………………………………..20

Effectors and small secreted proteins…...…………………………….…22

Comparative genomics of ectomycorrhizal fungi…..……………………23

Conclusion……………………………………………………………………….25

References……………………….....……...…………………………………….26

TABLE OF CONTENTS (Continued) Page

Chapter 2: Hi-C assembly of Rhizopogon vesiculosus reveals the genome organization of an ectomycorrhizal truffle in Boletales...……………………………..43

Abstract……………………..……………………………………………………44

Introduction..………………..……………………………………………………46

Materials and Methods….…..……………………………………………………50

Results…………..…………..……………………………………………………57

Discussion...…………….…..……………………………………………………63

References...…………….…..……………………………………………………73

Chapter 3: Conclusion...……………..…..……………………………………………...124

References………………………………….…………………………………...131

Appendix 1: Scripts written to facilitate data analysis.…………………………………133

R script 1: sliding_window_frequency_counter……………………………….134

R script 2: snp_frequency_counter……………………………………………..135

Python script 1: cysteine_length_counter……………………………………....136

Python script 2: repeatmasker_annotator……………………………………….137

LIST OF FIGURES

Figure Page

Figure 1.1 ITS phylogram of genus Rhizopogon.………………………………...... 36

Figure 1.2 Morphology of R. vesiculosus life history stages .……………………...... 37

Figure 1.3 Rhizopogon life cycle…………………………………………………...... 38

Figure 1.4 Hi-C library construction process……………………………………………39

Figure 1.5 Main classes of transposable elements and mechanisms of transposition….40

Figure 2.1 Postclustering heatmap….…………………………………………………...81

Figure 2.2 Mummerplot of the Illumina assembly against the 10 linkage groups of the Hi-C assembly...…………………………………………………………………………82

Figure 2.3 Classification and distribution of repetitive DNA and TE in linkage groups..83

Figure 2.4 Phylogeny of Gypsy and Copia elements……………………………………84

Figure 2.5 Overview of main content in linkage groups.……………………………85

Figure 2.6 The A and B mating loci...…………………………………………………...86

Figure 2.7 Venn diagram quantifying distribution of orthologous clusters.…………….87

Figure 2.8 Frequency of orthologous clusters in linkage group 4 plotted in 50 kb sliding windows…….……………………………………………………..……………..88

Figure 2.9 Distribution of regions enrichhed for species specific genes in all linkage groups…….………………………………………………..……………………89

Figure 2.10 Distribution of small secreted proteins.………………………………….....90

Figure 2.11 Combined view of different genetic elements in linkage group 4 plotted in 50 kb sliding windows….…………………………………………………………….…91

Figure S-2.1 Frequency of orthologous clusters plotted in 50 kb windows in linkage groups 0-9…...... 98

Figure S-2.2 Combined view of different genetic elements plotted in 50kb windows in linkage groups 0-9….…………………………………………………………………..107

LIST OF FIGURES (Continued) Figure Page

Figure S-2.3 Barplot showing repeat number of repetitive sequences in linkage groups 0-9……………………………………..………………………………………116

Figure S-2.4 GO annotation pie charts of biological processes in shared and species specific genes...... …………………………………………………………...117

Figure S-2.5 Venn diagram quantifying the distribution of orthologous clusters using gene models derived from the unmasked genome…………………...………………...118

LIST OF TABLES

Table Page

Table 1.1 Sequenced and annotated genomes of the Boletales by 1KFG………………41

Table 1.2 Mating systems in the Boletales………………………………………..…….42

Table 2.1 Comparison of Hi-C and Illumina assembly statistics and repeat content...... 92

Table 2.2 Linkage group statistics.……………………..……………………………….93

Table 2.3 Repeat number and content by linkage group..….…………………………...94

Table 2.4 Summary of SNP frequency and densitiy...... ……………………………… 96

Table S-2.1 Selected centromere gene models…………………………………...…….119

Table S-2.2 Contingency tables for Fishers’ exact test ..…………………………….....123

Chapter 1. Introduction

1

Introduction

Projects like the 1000 Thousand Fungal Genome Project (1KFG) hosted by the

Joint Genome Institute (JGI) have resulted in the sequencing and annotation of a number of Boletales genomes. Included among these genomes are species of the genus

Rhizopogon, which are important ectomycorrhizal symbionts of Pinaceae and which occupy a broad range of habitats and exhibit varying degrees of host specificity. These genomes were generated from short read sequencing alone, however, and are typically fragmented over thousands of contigs. While these are sufficient for capturing gene space, they offer limited information on complex and repetitive regions, which is necessary for answering questions related to chromosome level genome organization. Chromatin conformation capture assay (Hi-C) creates genome-wide contact maps, and was originally developed to capture 3D genome architecture. However the long-range contiguity information in these maps has proven useful in numerous applications such as scaffolding in reference guided and de novo genome assemblies, haplotype phasing, and in identifying centromeres. A Hi-C assembly of Rhizopogon vesiculosus was generated by scaffolding contigs from the corresponding Illumina assembly with the Hi-C read information. The result is the first Boletales genome with chromosome level annotation, which is used to leverage analysis of large scale genome organization as a step towards a more complete genome analysis of Rhizopogon.

2

Order Boletales

The order Boletales (E.J. Gilbert) has a worldwide distribution and currently encompasses 17 families, 88 genera, and about 1400 species (Hibbett et al. 2014). The typical fruit body morphology is pileate-stipitate (~77%) with tubular or sometimes gilled hymenophores, but gasteroid morphologies, such as Rhizopogon, as well as resupinate, meruloid, and toothed forms, have evolved multiple times within the order (Binder and

Bresinsky 2002). Boletales is also ecologically diverse, comprising ectomycorrhizal

(EM), brown rot wood decay, and mycoparasitic species. Ancestral character state reconstruction analyses support the resupinate fruiting body morphology and brown rot ecology as symplesiomorphic traits for the order (Hibbett et al. 2014). The genera

Serpula and Coniophora represent examples of this morphology and ecology, and have been extensively studied due to their economic importance in their ability to decompose wooden building materials. Stipitate morphologies and ectomycorrhizal (EM) ecologies represent apomorphic character states for Boletales, and approximately 90 % of the described species are involved in EM symbiosis (Hibbett et al. 2014). EM host plants are predominantly members of Betulaceae, Caesalpiniaceae, Dipterocarpaceae, Fagaceae,

Myrtaceae, Nothofagaceae, Pinaceae, Salicaceae, and Ericaceae (in the case of arbutoid mycorrhiza), with Pinaceae likely representing the ancestral host association for EM species in Boletales (Hibbett et al. 2014). To better understand EM biology of Boletales, numerous studies have been conducted on species of Suillus and Rhizopogon with particular research foci including life history traits such as genet size (Kretzer et al. 2005,

Kretzer et al. 2004, Dunham et al. 2013, Beiler et al. 2012, Dahlberg and Stenlid 1994), competition dynamics (Kennedy and Bruns 2005, Kennedy et al. 2007), ecological

3

succession (Bruns et al. 2009), population genetic structure (Branco et al. 2015), and host specificity (e.g., Massicotte et al. 1994).

Rhizopogon

Genus Rhizopogon:

Rhizopogon (Fries 1817) is a genus of fungi forming hypogeous sporocarps (= sequestrate fruiting bodies, false truffles) with a spongy gleba (=hymenium). It represents one of the many independent origins of hypogeous sporocarps within Boletales, with

Suillus being the most closely related genus forming epigeous, stipitate fruiting bodies

(Grubisha et al 2002). The 1966 monograph on the genus by Alexander H. Smith and

Sanford M. Zeller (1966) based in still serves as the modern foundation for species concepts, while another monograph specific to Europe was published more recently by Maria P. Martin (1996). The distribution and diversity in other parts of the world remains less surveyed, although several species have been described from the

Pacific Rim in Asia (Hosford 1988, Mujic et al. 2014, Mujic 2015). The global hotspot of diversity is thought to occur in Northwestern North America, with the natural range constricted to the northern hemisphere (Grubisha et al. 2002). With 229 unique names recognized as legitimate on Mycobank as of December 2015, it represents the largest genus of hypogeous basidiomycetes (Molina et al. 1999). The sporocarps are commonly dispersed through mycophagy by small mammals, and have been shown to constitute a significant part of their diet in the Pacific Northwest (Maser et al. 2008, Jacobs and

Luoma 2008, Izzo 2005). Based on phylogenetic analysis by Grubisha et al. (2002) of the

4

internal transcribed spacer region (ITS) of ribosomal DNA, five subgenera are currently recognized: Amylopogon, Rhizopogon, Roseoli, Villosulli, and Versicolores (Fig1.1).

Rhizopogon and the ectomycorrhizal symbiosis:

All species of Rhizopogon form obligate ectomycorrhizae (EM), and they cannot complete their life cycle without their host, which almost exclusively are members of the

Pinaceae (Grubisha et al. 2002, Molina and Trappe 1994). The EM symbiosis is mutualistic, benefitting the fungal partner by granting them access to photosynthates from the plant host, which in return is supplied with nutrients (most notably nitrogen, phosphorus, potassium), enhanced water uptake, and increased protection from pathogens

(Smith and Read 2008). Colonization by EM alters the root morphology of the host plant where root hair growth is suppressed and the fungi forms a “mantle” that ensheaths the fine roots of the host plant which then typically takes on the shape of short lateral roots

(Figure 1.2D, F). The hyphae do not penetrate the root cells in their inward growth, but instead extend in between epidermal and cortical cells and form a characteristic “hartig net” (Figure 1.2E) to maximize the contact area for the reciprocal nutrient transfer which takes place here (Smith and Read 2008). Extramatrical hyphae may also radiate outwards from the mantle and into the substrate and surrounding environment (Smith and Read

2008). Plant productivity is generally positively affected, and around 20% of the plant assimilates are allocated to the EM fungi, which provide increased storage of carbon in the soil, and likely play a key role in the global carbon cycle (van der Heijden et al. 2015).

Ectomycorrhiza is the prevalent form of mycorrhizae in temperate and boreal forest systems, and they has evolved independently a number of times within ,

Basidiomycota, and (van der Heijden et al. 2015), with Tedersoo et al.

5

(2010) suggesting a many as 66 times in fungi, along with at least 10 transitions among the host plants (Brundrett 2002).

The existence of common mycorrhizal networks which may link multiple plants together has been demonstrated through the tracking of labeled isotopes (Simard et al.

1997) and the role of these networks in forest ecosystems is increasingly being recognized (van der Heijden et al. 2015). Many Rhizopogon species have been shown to form rhizomorphs (Molina and Trappe 1994), linear aggregates of hyphae thought to be important in long distance translocation of water and nutrients (Duddridge et al. 1980,

Simard et al. 1997, Brownlee et al. 1983). This attribute gives them a structural significance in common mycorrhizal networks, and may aid in the establishment of seedlings by increasing their growth and resistance to drought (reviewed in Molina et al.

1999).

Biology of Rhizopogon vesiculosus:

The foundation for this project was laid by series of studies on two sister species,

R. vesiculosus A.H. Smith (sensu Kretzer et al. 2003) and Rhizopogon vinicolor A.H.

Smith, in R. subgenus Villosulli conducted by Kretzer et al. (2004), Kretzer et al. (2005), and Dunham et al. (2013). All species in R. subgenus Villosulli are host specific to members of Pseudotsuga, which has been inferred as a single host switch to Pseudotsuga

(Grubisha et al. 2002). Rhizopogon vesiculosus and R. vinicolor are host specific and obligate ectomycorrhizal symbionts of Pseudotsuga menziesii (Mirb.) Franco (Douglas- fir), a dominant species in low to mid elevation Northwestern North America, where it is economically important as timber (Lavender and Hermann 2014). The natural range of P. menziesii extends from central British Columbia in the north, to central Mexico in the

6

south (Lavender and Hermann 2014). Differences in morphology and geographic range, as well as population genetics studies employing allozymes (Li and Adams 1989), and

RAPD markers (Aagaard et al. 1995), allow us to distinguish between the coastal form, P. menziesii var. menziesii, and the interior form; P. menziesii var. glauca. These studies also suggested that the interior form is further divided into northern and southern populations, which was confirmed by Gugger et al. (2010) using chloroplast and mitochondrial DNA sequences. The geographic distribution of R. vesiculosus mainly follows that of P. menziesii var. menziesii in the coast range, and unlike R. vinicolor, it appears to be less common or absent with P. menziesii var. glauca (Mujic 2015).

Morphologically, R. vesiculosus is nearly indistinguishable from R. vinicolor as noted by A. H. Smith, and evidenced from the common misidentifications (Luoma 2011).

Both form hypogeous basidioma, globose to ovoid oblong, or sometimes irregular in shape, measuring 1-3 cm in diameters. The peridium is dry, matted fibrillose, and white when young, with tinges of yellow increasing with age, staining pink and vinaceous upon handling (Fig 1.2 A), and turning dark vinaceous, cinnamon, or blackish brown with dark bruise spots when mature (fig 1.2 B). A character used to distinguish R. vesiculosus by

A.H. Smith (1966), is the presence of vesicles, flask shaped inflated cells in the peridium visible under a microscope. However Luoma et al. (2011) discovered that these vesicles are confined to regions where the rhizomorphs attach. Furthermore the character is not always present, and the vesicles have been reported to occur in R. vinicolor with low frequency (Luoma et al. 2011). However the two species can be more readily distinguished through phylogenetic analysis of the ITS barcode (Figure 1.1) or microsatellites, and sampling with these markers including the type material (Kretzer et

7

al 2003) as well as phylogenetic analysis of genome scale datasets (Mujic 2015) have justified the application of the names R. vesiculosus and R. vinicolor.

Population genetic structure and patterns of genet distribution in

R. vesiculosus:

R. vesiculosus and R. vinicolor are sympatric and appear with equal frequency throughout the coast range of the Pacific Northwest (Beiler et al. 2015, Dunham et al.

2013), but systematic surveys of distribution of multilocus genotypes have unveiled distinct population genetic structure between R. vesiculosus and R. vinicolor. In work pioneered by Kretzer et al (2005) using microsatellite markers, it was found that R. vesiculosus tends to have the same multilocus genotypes clustered together, interpreted as larger genets (=clones). In subsequent work by Beiler et al. (2010) a similar pattern was found, and it was indicated that the larger genets of R. vesiculosus links a greater number of individual trees. Furthermore, in Beiler et al. (2012), it was demonstrated that this pattern is 3-dimensional, with R. vesiculosus present at greater depths than its sympatric sister. Moreover the same study also showed that soil moisture has some effect on the average depth, where both species can be found deeper in wet soils, although the same vertical distribution pattern between them is maintained. Beiler et al. (2015) revealed through modeling that R. vinicolor genets tend to be nested within larger R. vesiculosus genets. The competitive dynamics between the two species were investigated further by

Mujic et al. (2016) using bioassays. They found that R. vesiculosus was a superior colonizer outcompeting R. vinicolor in the experimental settings.

R. vesiculosus also shows patterns of inbreeding with reduced genetic diversity, and more subpopulation differentiation as quantified by Fst when compared to R.

8

vinicolor (Kretzer et al. 2005). Dunham et al. (2013) further supported the disjunctive pattern using spatial autocorrelation analysis which showed a nonrandom distribution of genotypes within 120 m in R. vesiculosus, signifying significant inbreeding at this scale, while R. vinicolor showed no significant population structure at the sampled watershed level. Through whole genome single nucleotide polymorphism (SNP) analysis using rates of non-synonymous substitution Mujic (2015) found that R. vinicolor has a higher level of genome wide level of heterozygosity than R. vesiculosus, consistent with the observed patterns of inbreeding.

Agaricomycete mating genetics:

Life cycle and mating systems:

Basidiomycetes typically progress through a life cycle where the basidiospores produced by meiosis germinate into haploid mycelia known as monokaryons (Fig. 1.3).

When two monokaryons encounter each other they may undergo hyphal fusion

(=anastomosis) and form a sexually competent entity if their mating type genes are compatible. Notably, anastomosis is not immediately followed by karyogamy, the fusion of nuclei, and this stage is therefore termed dikaryotic, and consists of two distinct haploid nuclei in the cytoplasm of each cell produced by nuclear migration

(=plasmogamy) through the developing hyphae or mycelium (Lee et al. 2010). The dikaryon make up the dominant stage of ’ life in space and time

(Casselton 2002), and form root tips in the case of ectomycorrhizal Agaricomycetes (Rai and Varma 2010). The life cycle is completed through the production of a dikaryotic basidiocarp with a hymenium carrying basidia, where karyogamy rapidly succeeded by

9

meiosis results in basidiospores which are dispersed by the basidiocarp (Brown and

Casselton 2001).

The stage of plasmogamy between monokaryons is recognized as the mating event, which is governed by molecular recognition, where the presence of homogenic incompatibility factors determines and mediates syngamy (Casselton 1997). The mating system can be either homothallic or heterothallic (Whitehouse 1949b). In a homothallic system there is complete compatibility even between the same mating-types (=sexes), .i.e. self-fertility, whereas a heterothallic system requires different mating types for successful mating, i.e. cross fertility (Lee et al. 2010). It is estimated that about 10% of

Agaricomycetes are homothallic, and the remainder heterothallic (James 2007). Within heterothallic basidiomycetes two distinct subsystems are recognized, namely unifactorial or bipolar, and bifactorial or tetrapolar systems (Whitehouse 1949b). In a bipolar system there is one locus or factor acting as incompatibility factor, so that on average 50% of full siblings will be mating compatible. A tetrapolar system, on the other hand, has two independent loci or factors operating, so that on average only 25% of full siblings will be compatible in mating (Raper 1966). It is estimated that 25-35% of heterothallic

Agaricomycetes are bipolar (Whitehouse 1949a, Whitehouse 1949b) with the remainder being tetrapolar (James 2007).

Genetic function and architecture of the mating genes:

Through cloning, transformation, and generation of mutants in model systems

(Casselton and Olesnicky 1998), the two separate loci governing mating, the A (=MAT-

HD)- and B-locus (=P/R genes) have been extensively characterized to support the original hypotheses on basidiomycete mating biology put forward by H. Kniep and J.R.

10

Raper (1966). The A-locus contains two subloci, HD1 and HD2, each encoding for a different three-helical homeodomain protein, which together form a heterodimer acting as a transcription factor that activates pathways responsible for a transition to the dikaryon by clamp cell formation (Brown and Casselton 2001) and growth regulation

(Nieuwenhuis et al. 2013).

The B-locus consists of pheromones and pheromone receptors. The archetypical organization of the B-locus put forward by Lorna Casselton has three pheromone receptors each linked to a single pheromone precursor gene with the cassettes located in physical proximity, but the copy number can be highly variable between different lineages of Agaricomycetes (Kües et al. 2011). The pheromone receptor is a G-coupled

7-transmembrane receptor, while the pheromone is derived from N and –C terminal processing of the pheromone precursor gene which yields a final pheromone of 9 to 14 amino acids in length (Kües et al. 2011). A pheromone receptor can only efficiently bind a pheromone of a different, compatible mating type (Casselton and Olesnicky 1998). The binding of a compatible ligand results in a conformational change of the C-terminus of the pheromone receptor which activates a G-protein that in turn induces a MAP-kinase cascade. The result is transcriptional changes associated with syngamy and transition to a dikaryotic state, such as nuclear migration and clamp cell fusion (Brown and Casselton

2001). However the full extent of the downstream regulatory pathways activated and regulated by either mating loci remains obscure (Stajich et al. 2010).

Whole genome sequencing has allowed comparisons on the organization and structure of the mating loci across multiple taxa. The emerging pattern is a high degree of synteny at the A-locus, whereas the B-locus often exhibits greater variability with regards

11

to both location and copy number of pheromones and pheromone receptors (James 2007,

James et al. 2013, Mujic 2015). The existence of multiple paralogues in tandem forming different subloci, which can act as different alleles as they are redundant in function

(Casselton 2002), as well as different allelic versions within each paralogue at both the A and B-locus, generates an effective number of mating types that can exceed 20000

(Heitman et al. 2013). A tetrapolar system, which is assumed to be the ancestral state for

Agaricomycetes (Heitman et al. 2013), can be rendered bipolar if the two mating loci become linked, i.e., located in near physical proximity on the same chromosome. This has been proposed as the mechanism responsible for the bipolar system in Cryptococcus neoformans (Heitman et al. 2013) and Ustilago hordei (Ni et al. 2011).

The mating genetics of Rhizopogon:

To date most mating studies have been conducted on members of the and Polyporales, while mating biology in Boletales remains less surveyed (James 2007,

Heitman et al. 2013). Moreover the classic method of pairing sister isolates of monokaryons introduced a strong sampling bias towards saprotrophic fungi which can be germinated in culture (Kawai et al. 2008). The great majority of ectomycorrhizal fungi do not culture (Tedersoo et al. 2010), and additional biases are introduced for species where the dikaryon does not produce clamp connections (Raper 1966). A summary of the mating studies of Boletales is provided in table 1.2.

A study of Rhizopogon roseolus (=R. rubescens) by Kawai et al. (2008) found that this is bipolar. The researchers were able to germinate the and obtain monokaryotic isolates for test crosses using agar plates on which the host Pinus thunbergii had been grown in advance axenically. The A-locus of Japanese strains of R.

12

roseolus was further characterized by Wan et al. (2013) through cloning and PCR amplification of the 3’region of MIP and HD1. Mujic (2015) annotated the Illumina genome assemblies of R. vesiculosus and R. vinicolor, and did comparative genomic analysis of the genomic regions surrounding the mating genes using a dataset including a number of Boletales, Agaricales, and Polyporales genomes. The gene annotation revealed that both species possess a typical A-locus with a single divergently transcribed pair of

HD genes. The B-locus was characterized as possessing 4 pheromone receptor genes and

4 pheromone precursor genes in R. vesiculosus, and 6 pheromone receptor genes and 9 pheromone precursor genes in R. vinicolor. R. vinicolor was also found to carry one additional isoprenyl cysteine methyltransferase (ICMT) gene, which has been shown to function in post-translational modification of pheromones, and the presence of an extra copy number of this gene was found to correlate with a tetrapolar mating system in the

Boletales. Mujic (2015) therefore postulated that R. vinicolor displays features consistent with tetrapolar species while the loss of genetic complexity at the B-locus in R. vesiculosus is more characteristic of bipolar species in Boletales. However, the A and B loci were fragmented over multiple contigs in the genome assembly of both species, and overall synteny of the loci was inferred through comparison with other Boletales species, and the linkage status of the A and B loci could not be ascertained.

Genomics

A short history of sequencing technology

Next-generation DNA sequencing (NGS) is used to describe the sequencing technologies that were developed after Sanger sequencing. These technologies have benefitted the research community by generating large scale quantities of sequence data

13

at a drastic reduction in cost followed by increasing availability and commercialization

(Metzker 2010). The cyclic-array sequencing platforms of Illumina (San Diego, CA) and the real time sequencing of PacBio (Menlo Park, CA) are the two major NGS platforms in practice today and will be reviewed here. Illumina sequencing is conceptually similar to the classic shotgun approach in that a library of random fragments of the DNA is created which is then sequenced using alternating cycles of enzyme-driven biochemistry and imaging-based data acquisition (Metzker 2010). A genome sequence can be assembled from the reads through the employment of various algorithms, a popular one being the De Bruijn graph algorithms which first split reads up into smaller k-mers and then reassemble them based on overlap (Compeau et al. 2011). These algorithms tend to be computationally intensive, in particular in the case of de novo assemblies (Pop 2009), and the parallel development of more powerful computers and efficient algorithms to match the growing output from sequencing technology has been essential for the analysis of these data (Pop et al. 2004, Pop 2009).

Flot et al. (2015) remarked that: “In an ideal world, one would recover after assembly as many contigs as there are chromosomes in the species being sequenced,” though the authors recognize that is hardly ever the case. Illumina sequencing relies on short reads of ~150bps, which have proven to be limited in their effectiveness and accuracy in sequencing and assembling repetitive regions of the genome (English et al.

2012, Burton et al. 2013, Flot et al. 2015). This fact has led to the inability to completely assemble genomes, and resulted in draft genomes that comprise hundreds to thousands of contigs or scaffolds, as evidenced in the assembly statistics of the Boletales genomes listed in Table 1.1.

14

PacBio is part of the third generation of sequencing technology which seeks novel approaches beyond cyclic array methods to augment sequencing (Schadt et al. 2010). The method is classified as a real-time sequencing technology since it relies on real time imaging of the incorporation of dye-labeled nucleotides during DNA synthesis (Metzker

2010), which simultaneously reveals the methylation status of the sequenced DNA (Fang et al. 2012). PacBio has shown promise for mitigating or circumventing some of the limitations inherent in the Illumina method as it can consistently provide reads up to 10 kb (Berlin et al. 2015), and this long read information has proven useful for improving upon existing Illumina assemblies or even for de novo assemblies (English et al. 2012,

Faino et al. 2015, Van Kan et al. 2016). Nevertheless, PacBio represents its own set of drawbacks such as the large amount of DNA required to reach sufficient sequencing depth to compensate for a higher error rate, and in addition a higher cost (Ferrarini et al.

2013). Auxiliary approaches that can obtain information beyond one dimensional basepair order, such as CHiP-Seq which captures epigenetic modifications (Landt et al.

2012), and Hi-C derivative technologies for 3D chromatin structure, which will be reviewed more in depth here in the following section, have also become available on the market.

Hi-C sequencing:

C-technologies, also referred to as “contact genomics”, are a series of techniques designed to capture the chromosome confirmation and involve the quantification of 3D contacts along the chromosome (Flot et al. 2015). It was initially conceived to quantify physical interactions between chromosomal regions across the genome, but has since found numerous additional applications. In the first version, Chromosome Conformation

15

Capture or 3C, the interaction frequency of two selected regions in the genome was quantified individually using PCR (Barutcu et al. 2016). Chromosome Confirmation

Capture on Chip, or 4C, stretched beyond the pairwise comparisons of 3C and allowed for the capture of interactions between one locus and other genomic loci (Barutcu et al.

2016), but like 3C is limited to detection of interactions within a few megabases (van de

Werken et al. 2012). Meanwhile 5C, Chromosome Conformation Capture Carbon Copy, was a significant improvement as it can detect interactions between many loci (Dostie et al. 2006), but with the drawback of having a low coverage, and is best utilized as a targeted approach when a priori knowledge of the region is available (Barutcu et al.

2016). Finally, Hi-C has the capability to capture whole genome interactions with high coverage by employing massive parallel sequencing through platforms like Illumina

(Belton et al. 2012).

The Hi-C procedure is outlined in figure 1.4: Chromatin is first crosslinked within cells, and the crosslinked chromatin is then digested with restriction enzymes, biotinylated and ligated to form chimeric sequences. These chimeric sequences are size fractionated and ligated with adaptor sequences to form a Hi-C library for deep high throughput sequencing. The new chimeric sequences reflect interactions between the regions of chromatin, with the rate of crosslinkage being a function of physical interaction between two sequences. That is, sequences that are crosslinked at a higher rate interact at a higher rate within the nucleus. A contact map can be constructed from the sequenced library. These maps quantify the contact frequencies between all restriction fragments in the genome, but the data should first be binned to increase statistical power, and normalized to correct for biases (Belton et al. 2012). Crosslinking is more likely to

16

happen between regions in physical proximity along the chromosome. It is this direct relationship between genomic distance and interaction frequency, known as “distant dependent decay of interaction frequency” (Lajoie et al. 2015), that can uncover the spatial genome configuration (Belton et al. 2012), and allow for the establishment of co- linearity of DNA across large distances and the scaffolding of contigs in existing assemblies into linkage groups (Flot et al. 2015). The scaffolding with Hi-C is not without challenges, however, as there can be substantial noise in the data. Clustering algorithms such as Lachesis or triDNA are popular solutions that have been shown to correctly scaffold the reads into chromosomes (Burton et al. 2013, Flot et al. 2015).

Lachesis requires a pre-specified number of expected chromosomes for the clustering analysis (Burton et al. 2013), while triDNA estimates a distance matrix from the contact matrix without making any prior assumptions about the number of clusters/chromosomes

(Flot et al. 2015). A drawback for such clustering methods is their cumulative errors, and they have been shown to be susceptible to assembly errors in the contigs given to them.

On the other hand, probabilistic methods such as GRAAL and HiRISE offer a more objective and quantitative approach to assess the validity of a genome through Bayesian inference, but are unable to assess the gap size (Flot et al. 2015).

Another application for these contact maps is the phasing of haplotypes since two variants present on the same chromosome in cis position are much more likely to be crosslinked than variants across haplotypes, in trans position (e.g., see Selvaraj et al. 2013,

Korbel and Lee 2013). This pattern is likely caused by the phenomenon of “chromosome territories” (Lajoie et al. 2015). Recently, Hi-C has been applied to metagenomics, through the realization that individual genomes can be distinguished from others in a mix

17

using the same contact frequency information (Burton et al. 2014). As pointed out by Flot et al. (2015) Hi-C can also be applied for the detection of extra-genomic elements sharing the same compartment as a given genome. It is expected that the metagenomics applications of Hi-C will improve and find broader and increasingly sophisticated applications, such as strain detection in populations (Flot et al. 2015).

Hi-C has also been repurposed for the characterization and location of centromeres by exploiting the centromere’s tendency to cluster in 3-dimensional space in the nucleus. Centromeres have been difficult to annotate from shotgun genome assemblies as their sequences can be highly variable (Varoquaux et al. 2015), and are comprised of repetitive regions in the cases of regional centromeres (Lamb and Birchler

2003). It is likely that Hi-C with its long range contiguity and chromatin confirmation information will gain increasing popularity as the field of comparative genomics is moving from a focus on protein coding genes to a more holistic view including the repetitive complement.

Overview of fungal genomics:

The ascomycete yeast Saccharomyces cerevisae was the first to have its genome sequenced (Goffeau et al. 1996). Since then, fungal genomics has played an essential role in advancing the understanding of eukaryote gene expression and genome organization. This work has mainly focused on a few fungal model organisms, which in addition to S. cerevisae include Saccharomyces pombe, Neurospora crassa and

Aspergiullus nidulans. Characters like small genome size, fast life cycles, ease of cultivation and experimental manipulation, along with haploid life cycles in Neurospora and Aspergillus account for the popularity of these species as model organisms (Galagan

18

et al. 2005). The Fungal Genome Initiative (FGI) launched at Broad Institute in 2000 was instrumental in the initial sequencing of fungi, which in addition to model organisms targeted a kingdom wide sampling including many human and plant pathogens (Galagan et al. 2005). As of August 2016 the FGI has released 67 fungal genomes. More recently, the whole genome sequencing effort of fungi has largely been taken over by the Joint

Genome Institute (JGI) through the 1000 Fungal Genomes Project (1KFG)

(http://1000.fungalgenomes.org) as part of the JGI Fungal Program (Grigoriev et al.

2011). 1KFG aims to sequence at least two reference genomes from the > 500 recognized fungal families in order to fill in remaining gaps in the fungal tree of life. The project is open for the public to nominate species to be sequenced, and has drastically accelerated the sequencing of fungi, including many EM Agaricomycetes. The sequenced and annotated genomes are available on the MycoCosm webportal (Grigoriev et al. 2014), and as of August 2016, more than 600 fungal genomes have been released, consisting of the following taxon sampling by phylum: 357 Ascomycota, 231 , 6

Chytridiomycota (sensu lato), 29 (sensu lato) and 22 taxa of early diverging lineages (http://genome.jgi.doe.gov/programs/fungi/1000fungalgenomes.jsf).

The greater numbers of fungal genomes available has allowed numerous comparative genomic analyses across taxa to identify shared and unique characteristics in fungal genomes with regards to both taxonomic level and ecological mode. From these studies it has become apparent that fungi have relatively small genomes in comparison to other . Most fungal genomes are in the 10-60 Mb range (Gregory et al. 2007), but outliers like the obligate parasites of with genomes of only 2Mb, and some Pucciniomycetes and Glomeromycetes reported to be close to 1000 Mb (Tavares et

19

al. 2014, Kullman 2005) exist. Gene content ranges from a few thousand in some unicellular fungi to a typical number of 10000-20000 genes in filamentous fungi, and up to 30000 in the endomycorrhizal fungus Rhizophagus irregularis (Grigoriev et al. 2014).

Consequently, fungi can be characterized as relatively gene dense, with a gene density between ~30-60%, and possessing fewer introns than animals (Galagan et al. 2005). In a phylum wide comparison of the ascomycote genomes by Hane et al. (2011), synteny was found to be poorly conserved even among closely related taxa. Instead, fungal genomes abound with gene duplications and translocations, which in part can be attributed to chromosomal reshufflings (Gladieux et al. 2014). Furthermore, a nonrandom distribution of gene dense and of gene sparse regions have been found in many fungi, where the gene sparse regions may contain fast evolving islands enriched for transposable elements (TE) and ecologically important genes (Stukenbrock and Croll 2014, Gladieux et al. 2014).

This trend appears to be particularly common with regards to effectors in plant pathogens which tend to be located in the gene sparse regions (Dong et al. 2015), and have been termed “two-speed genomes” (Gladieux et al. 2014).

Additional noteworthy findings from genome studies of fungi include common introgression events as reviewed by Gladieux et al. (2014), numerous cases of horizontal gene transfer from bacteria (Marcet-Houben and Gabaldon 2010), and the phenomenon of Repeat-Induced Point Mutation (RIP). RIP can permanently silence repeated sequences through the introduction of transitions (C-T or A-G) until copy similarity is less than 85%, (Smith et al. 2012) which effectively mitigates the activity of TE, and as such has been defined as a genome defense mechanism (Amselem et al. 2015).

Interestingly, in the phylum Basidiomycota, RIP has so far only been found in subphylum

20

Pucciniomycotina (Horns et al. 2012). A similar defense process, methylation induced premeiotically (MIP) act through epigenetic methylation to silence TEs (Jeffrey and

Selker 1996).

The small size of fungal nuclei make the chromosomes challenging to characterize with microscopy (Wieloch 2006), and taxa where chromosomes have been characterized are mainly within the Ascomycota. The development of pulse field electrophoresis has facilitated chromosome characterization, and revealed pervasive length polymorphisms within Coprinopsis cinerea (Zolan et al. 1994). Basidiomycetes where the chromosome number has been determined include C. cinerea (n=13; Stajich et al. 2010), ishikarensis (n=7; Chang and Jung 2008), Cryptococcus neoformans

(n=14, but copy number can vary; Hu et al. 2011), Ustilago maydis (n=23; Kamper et al.

2006), Pleurotus ostreatus (n=11; Larraya et al. 1999), while Phanerochaete chrysosporium is estimated to have between 7 and 9 chromosomes (Martinez et al. 2004).

A review of the existing literature did not reveal any estimates of Boletales chromosome number. The centromeres in fungi also remain largely uncharacterized, but both simple point centromeres as those of S. cerevisae to more complex regional centromeres like the ones in C. cinerea have been described. In a review by Smith et al. (2012) the authors conclude that the centromere sequences are highly variable within fungi, and that the associated protein complexes rather than sequence identity are responsible for the kinetochore assembly.

Transposable elements:

Transposable elements (TE) are genomic elements that have the ability to multiply and move (transpose) themselves in the genomes, and have consequently been

21

characterized as selfish DNA (Rebollo et al. 2012). Together with tandem repeats, TEs comprise what is classified as repetitive DNA in genomes (Georgiev et al. 1983), and have been found to comprise the majority of genome content in many plants and certain animal species (Grandaubert 2014). In fungi, TEs generally comprise 3-20% of the assembled genome (Santana 2015), but can be significantly more numerous in some taxa, e.g., 85% in Blumeria graminis (Santana 2015). Transposable elements were first discovered by Barbara McClintock through the observation of spontaneous translocations in the maize genome (Biemont 2010), and was initially referred to as junk DNA

(Biemont 2010). They have since then been implicated in playing a major role in gene regulation, inducing genome rearrangements, and facilitating adaptive evolution of novel morphologies and ecologies; see (Grandaubert et al 2014) for a review and discussion of the impact of TEs on the evolution of fungal genomes.

The classification scheme proposed by Kapitonov and Jurka (2008) divides TE into two main classes, RNA (class I) and DNA (class II) transposons, depending on their intermediate of transposition. Each class is further classified into autonomous and nonautonomous subclasses, super families, and families by structure, transposition mechanism, and sequence similarity (Fig 1.5A). Autonomous elements encode for the enzymatic machinery necessary for copy and paste mechanisms of Class I

Retrotransposons and cut and paste mechanisms of Class II DNA transposons (Fig 1.5B).

LTR Retrotransposons are flanked by two long terminal repeats (LTR) and typically contain two genes that encode for gag viral structure protein (Gag) and a polyprotein pol

(Pol) that includes a protease (Pr), an integrase (In), a reverse transcriptase (Rv), and a ribonuclease (Rn) (Fig 1.5A). The different types of LTRs are classified into subfamilies

22

based on the number and orientation of Pol domains and sequence similarity. Non-LTR long interspersed nuclear elements, or LINEs, lack the flanking LTR regions and typically encode for Rt and Rn proteins. Gypsy and Copia tend to be the most prevalent families of TE observed in fungi (Santana 2015). Autonomous Class II DNA transposons are flanked by terminal inverted repeats (TIR) and encode for a transposase (Tr). They are grouped into different families based on sequence similarity and phylogenetic analyses with Mariner being one of the more abundant families of Class II transposons in fungi (Santana 2015). Finally, there are nonautonomous TEs that do not encode for enzymes necessary for proliferation and transposition, but rely on the host’s enzymatic machinery. Non-LTR short interspersed nuclear elements (SINE) proceed through an

RNA intermediate, whereas miniature inverted-repeat transposable elements (MITE) proceed through a DNA intermediate. Class I TE tend to be more abundant in fungal genomes, but DNA transposons may contribute significantly in species with lower TE- content (Grandaubert et al 2014).

Effectors and small secreted proteins:

Effectors are generally defined as secreted proteins that manipulate the host’s immune system and facilitate host-microbe interactions including infection and colonization (Gladieux et al. 2014). Fungal effectors tend to be small (<300 amino acids), secreted, and cysteine rich (Stergiopoulos and de Wit 2009), and are therefore frequently referred to as small secreted proteins (SSPs). They show strong sequence divergence with few recognizable paralogs, and as such the majority of them have been characterized as lineage specific (Grandaubert et al 2014). The high rates of divergence has been interpreted as accelerated rates of change, and effectors have in many cases been found to

23

be associated with TEs and AT-rich blocks in the genome, and likely being acted upon by

RIP (Grandaubert et al 2014). The majority of work on fungal effectors has focused on plant pathogens, which has revealed different selection pressures depending on whether the interaction with the plant host defense is direct or indirect (Stergiopoulos and de Wit

2009). Although EM is considered a mutualistic interaction, the physiology of it resembles that of a plant pathogen where the fungus must suppress or bypass the plant host defense through effectors in order to establish the symbiosis, using mechanisms similar to that of plant pathogens (Plett and Martin 2015).

Comparative genomics of ectomycorrhizal fungi:

The sequencing of Laccaria bicolor under the French National Institute for

Agricultural Research (INRA) by Martin et al. (2008) represented the first whole genome of an EM fungus to be released. The subsequent annotation of the genome revealed a strong reduction of plant degrading enzymes (PCWDEs), and that a significant portion, 21% of the 65 Mb genome, was comprised of TEs. The high proportion of TEs, and expansion of large lineage specific gene families including small secreted protein effectors (SSPs), accounted for the relative large genome size when compared to other non EM fungi. Using transcriptomics, gene expression in the EM root tip tissue was investigated, and mycorrhiza induced small secreted proteins (MiSSP) localized to the

EM tissue were identified (Martin et al. 2008). Plett et al. (2011) demonstrated that L. bicolor transformants with reduced expression of one of these proteins, MiSSP7, failed to enter symbiosis with plant roots. A further mechanistic basis for the role of this effector was established when Plett et al. (2014) showed that MiSSP7 in L. bicolor inhibits the

24

jasmonic acid pathway in the host plant which allows the fungus to bypass its defense system and form the mycorrhiza.

In 2010 the same group at INRA released the first genome of the ascomycete EM

Tuber melanosporum. In T. melanosporum TE and repetitive DNA comprised an even larger portion of the genome, 58% of the 125 Mb genome, but no duplications or expansion of gene families were detected (Martin et al. 2010). A large genome size and high proportion of TEs was also found to be the case for the EM ascomycete

Elaphomyces granulatus (Quandt et al. 2015). Since then, many more EM genomes have been released with an increasing rate under the MGI, resulting in a comparative study of

49 fungal genomes containing 11 EM species by Kohler et al. (2015). They found a consistent pattern in the genomes of the EM taxa including: i) reduced complement of

PCWDEs, ii) genome expansions associated with the invasion and proliferation of transposable elements, and iii) expansion of lineage specific orphan genes induced for symbiosis referred to as a “species specific symbiosis toolkit” by the authors, supporting the hypothesis which were put forward in the initial work on L. bicolor and T. melanosporum.

In a study by Hess et al. (2014) comparing genomes of 3 symbiotic and 3 saprophytic

Amanita species along with the saprophytic Vovariella volvacea as outgroup, the same trend of increased TE content among the EM species was found. The authors deployed a strategy of characterizing TE elements from raw sequencing libraries to circumvent the challenge of accurately capturing these repetitive regions in assemblies, and found that the estimates of TE content became two to five times greater than when assessing the assembled content (Hess et al. 2014). Phylogenetic analysis of the TE elements showed a

25

large portion of lineage specific LINE and Gypsy families in the EM species, which was interpreted as a recent expansion of these TE families.

Conclusion:

As of June 2016 whole genome sequences of 22 Boletales species including both

EM and saprophytic species have been released under the MGI and the 1KFG through the

JGI, as is summarized in Table 1.1. Three Rhizopogon and six Suillus species have been sequenced using Illumina technologies. Here an annotated Hi-C assembly of Rhizopogon vesiculosus is presented, and the long contiguity information is applied to advance genomic studies of chromosome level organization. Particular emphasis is placed on the genomic content and organization of mating type loci, species-specific genes including

SSPs, TEs, and centromeres, along with variation within the dikaryon measured through

SNPs. This represents the first chromosome level assembly of a species in the Boletales, and one of few EM fungi where the genome has been subjected to a more thorough characterization.

26

REFERENCES:

Aagaard, J. E., Vollmer, S. S., Sorensen, F. C. & Strauss, S. H. 1995. Mitochondrial DNA products among RAPD profiles are frequent and strongly differentiated between races of Douglas-fir. Molecular Ecology, 4, 441-446. Ainsworth, A. M. & Rayner, A. D. M. 1990. Mycelial interactions and outcrossing in the Coniophora puteana complex. Mycological Research, 94, 627-634. Amselem, J., Lebrun, M.-H. & Quesneville, H. 2015. Whole genome comparative analysis of transposable elements provides new insight into mechanisms of their inactivation in fungal genomes. BMC Genomics, 16, 1-14. Barutcu, A. R., Fritz, A. J., Zaidi, S. K., Van Wijnen, A. J., Lian, J. B., Stein, J. L., Nickerson, J. A., Imbalzano, A. N. & Stein, G. S. 2016. C-ing the Genome: A Compendium of Chromosome Conformation Capture Methods to Study Higher- Order Chromatin Organization. J Cell Physiol, 231, 31-5. Beiler, K. J., Durall, D. M., Simard, S. W., Maxwell, S. A. & Kretzer, A. M. 2010. Architecture of the wood-wide web: Rhizopogon spp. genets link multiple Douglas-fir cohorts. New Phytol, 185, 543-53. Beiler, K. J., Simard, S. W. & Durall, D. M. 2015. Topology of tree-mycorrhizal fungus interaction networks in xeric and mesic Douglas-fir forests. Journal of Ecology, 103, 616-628. Beiler, K. J., Simard, S. W., Lemay, V. & Durall, D. M. 2012. Vertical partitioning between sister species of Rhizopogon fungi on mesic and xeric sites in an interior Douglas-fir forest. Mol Ecol, 21, 6163-74. Belton, J. M., Mccord, R. P., Gibcus, J. H., Naumova, N., Zhan, Y. & Dekker, J. 2012. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods, 58, 268-76. Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M. & Phillippy, A. M. 2015. Assembling large genomes with single-molecule sequencing and locality- sensitive hashing. Nat Biotech, 33, 623-630. Biemont, C. 2010. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics, 186. Billiard, S., Lopez-Villavicencio, M., Hood, M. E. & Giraud, T. 2012. Sex, outcrossing and mating types: unsolved questions in fungi and beyond. Journal of Evolutionary Biology, 25, 1020-1038. Branco, S., Gladieux, P., Ellison, C. E., Kuo, A., Labutti, K., Lipzen, A., Grigoriev, I. V., Liao, H.-L., Vilgalys, R., Peay, K. G., Taylor, J. W. & Bruns, T. D. 2015. Genetic isolation between two recently diverged populations of a symbiotic fungus. Molecular Ecology, 24, 2747-2758. Brown, A. J. & Casselton, L. A. 2001. Mating in mushrooms: increasing the chances but prolonging the affair. Trends Genet, 17, 393-400. Brownlee, C., Duddridge, J. A., Malibari, A. & Read, D. J. 1983. The structure and function of mycelial systems of ectomycorrhizal roots with special reference to their role in forming inter-plant connections and providing pathways for assimilate and water transport. Plant and Soil, 71, 433-443.

27

Brundrett, M. C. 2002. Coevolution of roots and mycorrhizas of land plants. New Phytologist, 154, 275-304. Bruns, T. D., Peay, K. G., Boynton, P. J., Grubisha, L. C., Hynson, N. A., Nguyen, N. H. & Rosenstock, N. P. 2009. Inoculum potential of Rhizopogon spores increases with time over the first 4 yr of a 99-yr burial experiment. New Phytologist, 181, 463-470. Burton, J. N., Adey, A., Patwardhan, R. P., Qiu, R., Kitzman, J. O. & Shendure, J. 2013. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotech, 31, 1119-1125. Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. 2014. Species-Level Deconvolution of Metagenome Assemblies with Hi-C-Based Contact Probability Maps. G3: Genes|Genomes|Genetics. Casselton, L. A. 1997. Molecular recognition in fungal mating. Endeavour, 21, 159-163. Casselton, L. A. 2002. Mate recognition in fungi. Heredity (Edinb), 88, 142-7. Casselton, L. A. & Olesnicky, N. S. 1998. Molecular genetics of mating recognition in basidiomycete fungi. Microbiol Mol Biol Rev, 62, 55-70. Chang, S. W. & Jung, G. 2008. The first linkage map of the plant-pathogenic basidiomycete Typhula ishikariensis. Genome, 51, 128-136. Compeau, P. E. C., Pevzner, P. A. & Tesler, G. 2011. How to apply de Bruijn graphs to genome assembly. Nat Biotech, 29, 987-991. Dahlberg, A. & Stenlid, J. 1994. Size, distribution and biomass of genets in populations of Suillus bovinus (L.: Fr.) Roussel revealed by somatic incompatibility. New Phytologist, 128, 225-234. Dong, S., Raffaele, S. & Kamoun, S. 2015. The two-speed genomes of filamentous pathogens: waltz with plants. Current Opinion in Genetics & Development, 35, 57-65. Dostie, J., Richmond, T. A., Arnaout, R. A., Selzer, R. R., Lee, W. L., Honan, T. A., Rubio, E. D., Krumm, A., Lamb, J., Nusbaum, C., Green, R. D. & Dekker, J. 2006. Chromosome Conformation Capture Carbon Copy (5C): A massively parallel solution for mapping interactions between genomic elements. Genome Research, 16, 1299-1309. Douhan, G. W., Vincenot, L., Gryta, H. & Selosse, M. A. 2011. Population genetics of ectomycorrhizal fungi: from current knowledge to emerging directions. Fungal Biol, 115, 569-97. Duddridge, J. A., Malibari, A. & Read, D. J. 1980. Structure and function of mycorrhizal rhizomorphs with special reference to their role in water transport. Nature, 287, 834-836. Dunham, S. M., Mujic, A. B., Spatafora, J. W. & Kretzer, A. M. 2013. Within-population genetic structure differs between two sympatric sister-species of ectomycorrhizal fungi, Rhizopogon vinicolor and R. vesiculosus. Mycologia, 105, 814-26. English, A. C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D. M., Reid, J. G., Worley, K. C. & Gibbs, R. A. 2012. Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology. PLoS ONE, 7, e47768.

28

Faino, L., Seidl, M. F., Datema, E., Van Den Berg, G. C., Janssen, A., Wittenberg, A. H. & Thomma, B. P. 2015. Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome. MBio, 6. Fang, G., Munera, D., Friedman, D. I., Mandlik, A., Chao, M. C., Banerjee, O., Feng, Z., Losic, B., Mahajan, M. C., Jabado, O. J., Deikus, G., Clark, T. A., Luong, K., Murray, I. A., Davis, B. M., Keren-Paz, A., Chess, A., Roberts, R. J., Korlach, J., Turner, S. W., Kumar, V., Waldor, M. K. & Schadt, E. E. 2012. Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing. Nat Biotech, 30, 1232-1239. Ferrarini, M., Moretto, M., Ward, J. A., Surbanovski, N., Stevanovic, V., Giongo, L., Viola, R., Cavalieri, D., Velasco, R., Cestaro, A. & Sargent, D. J. 2013. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics, 14, 670. Flot, J. F., Marie-Nelly, H. & Koszul, R. 2015. Contact genomics: scaffolding and phasing (meta)genomes using chromosome 3D physical signatures. Febs Letters, 589, 2966-2974. Fries, E. M. 1817. Symbolae Gasteromycorum. Lundae: Ex officina Berlingiana. Fries, N. 1985. Intersterility groups in Paxillus involutus. Mycotaxon, 24, 403-409. Fries, N. 1994. Sexual incompatibility in Suillus variegatus. Mycological Research, 98, 545-546. Fries, N. & Neumann, W. 1990. Sexual incompatibility in Suillus luteus and S. granulatus. Mycological Research, 94, 64-70. Fries, N. & Sun, Y.-P. 1992. The mating system of Suillus bovinus. Mycological Research, 96, 237-238. Galagan, J. E., Henn, M. R., Ma, L. J., Cuomo, C. A. & Birren, B. 2005. Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res, 15, 1620-31. Georgiev, G. P., Kramerov, D. A., Ryskov, A. P., Skryabin, K. G. & Lukanidin, E. M. 1983. Dispersed Repetitive Sequences in Eukaryotic Genomes and Their Possible Biological Significance. Cold Spring Harbor Symposia on Quantitative Biology, 47, 1109-1121. Gladieux, P., Ropars, J., Badouin, H., Branca, A., Aguileta, G., De Vienne, D. M., De La Vega, R. C. R., Branco, S. & Giraud, T. 2014. Fungal evolutionary genomics provides insight into the mechanisms of adaptive divergence in eukaryotes. Molecular Ecology, 23, 753-773. Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H. & Oliver, S. G. 1996. Life with 6000 genes. Science, 274, 546, 563-7. Grandaubert, J., Rouxel T., Balesdent, M., 2014. Evolutionary and Adaptive Role of Transposable Elements in Fungal Genomes. Advances in Botanical Research, 70. Gregory, T. R., Nicol, J. A., Tamm, H., Kullman, B., Kullman, K., Leitch, I. J., Murray, B. G., Kapraun, D. F., Greilhuber, J. & Bennett, M. D. 2007. Eukaryotic genome size databases. Nucleic Acids Research, 35, D332-D338.

29

Grigoriev, I. V., Cullen, D., Goodwin, S. B., Hibbett, D., Jeffries, T. W., Kubicek, C. P., Kuske, C., Magnuson, J. K., Martin, F., Spatafora, J. W., Tsang, A. & Baker, S. E. 2011. Fueling the future with fungal genomics. Mycology, 2, 192-209. Grigoriev, I. V., Nikitin, R., Haridas, S., Kuo, A., Ohm, R., Otillar, R., Riley, R., Salamov, A., Zhao, X., Korzeniewski, F., Smirnova, T., Nordberg, H., Dubchak, I. & Shabalov, I. 2014. MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res, 42, D699-704. Grubisha, L. C., Trappe, J. M., Molina, R. & Spatafora, J. W. 2002. Biology of the ectomycorrhizal genus Rhizopogon. VI. Re-examination of infrageneric relationships inferred from phylogenetic analyses of ITS sequences. Mycologia, 94, 607-19. Gugger, P. F., Sugita, S. & Cavender-Bares, J. 2010. Phylogeography of Douglas-fir based on mitochondrial and chloroplast DNA sequences: testing hypotheses from the fossil record. Mol Ecol, 19, 1877-97. Hane, J. K., Rouxel, T., Howlett, B. J., Kema, G. H., Goodwin, S. B. & Oliver, R. P. 2011. A novel mode of chromosomal evolution peculiar to filamentous Ascomycete fungi. Genome Biology, 12, 1-16. Heitman, J., Sun, S. & James, T. Y. 2013. Evolution of fungal sexual reproduction. Mycologia, 105, 1-27. Hess, J., Skrede, I., Wolfe, B. E., Labutti, K., Ohm, R. A., Grigoriev, I. V. & Pringle, A. 2014. Transposable element dynamics among asymbiotic and ectomycorrhizal Amanita fungi. Genome biology and evolution, 6, 1564-1578. Hibbett, D. S., Bauer, R., Binder, M., Giachini, A. J., Hosaka, K., Justo, A., Larsson, E., Larsson, K. H., Lawrey, J. D., Miettinen, O., Nagy, L. G., Nilsson, R. H., Weiss, M. & Thorn, R. G. 2014. Chapter 14: Agaricomycetes. In: MCLAUGHLIN, J. D. & SPATAFORA, W. J. (eds.) Systematics and Evolution: Part A. Berlin, Heidelberg: Springer Berlin Heidelberg. Horns, F., Petit, E., Yockteng, R. & Hood, M. E. 2012. Patterns of repeat-induced point mutation in transposable elements of basidiomycete fungi. Genome Biol Evol, 4. Horton, T. R. & Bruns, T. D. 2001. The Molecular Revolution in Ectomycorrhizal Ecology: Peeking into the Black-Box. Molecular Ecology 10 (8): 1855–71. Hosford, D. R. T., J.M. 1988. A Preliminary Survey of Japanese Species of Rhizopogon. Transactions of the Mycological Society of Japan. 29:63-72. Hu, G., Wang, J., Choi, J., Jung, W. H., Liu, I., Litvintseva, A. P., Bicanic, T., Aurora, R., Mitchell, T. G., Perfect, J. R. & Kronstad, J. W. 2011. Variation in chromosome copy number influences the virulence of Cryptococcus neoformans and occurs in isolates from AIDS patients. BMC Genomics, 12, 1-19. Izzo, A. D., Marc Meyer, James M. Trappe, Malcolm North, and Thomas D. Bruns 2005. Hypogeous Ectomycorrhizal Fungal Species on Roots and in Small Mammal Diet in a Mixed- Forest. Forest Science 51(3)243-254. Jacobs, K. M. & Luoma, D. L. 2008. Small Mammal Mycophagy Response to Variations in Green-Tree Retention. Journal of Wildlife Management, 72, 1747-1755. James, T. Y. 2007. Analysis of mating-type locus organization and synteny in mushroom fungi- beyond model species. In J. Heitman, J. Kronstad, J. W. Taylor, and L. A.

30

Casselton [eds.]. Sex in fungi: molecular determination and evolutionary implications. ASM Press, Washington, D. C., 317-331. James, T. Y., Sun, S., Li, W., Heitman, J., Kuo, H. C., Lee, Y. H., Asiegbu, F. O. & Olson, A. 2013. Polyporales genomes reveal the genetic architecture underlying tetrapolar and bipolar mating systems. Mycologia, 105, 1374-90. Jeffrey, T. & Selker, E. U. 1996. Gene silencing in filamentous fungi: RIP, MIP and quelling. J Genet, 75. Kamper, J., Kahmann, R., Bolker, M., Ma, L.-J., Brefort, T., Saville, B. J., Banuett, F., Kronstad, J. W., Gold, S. E., Muller, O., Perlin, M. H., Wosten, H. a. B., De Vries, R., Ruiz-Herrera, J., Reynaga-Pena, C. G., Snetselaar, K., Mccann, M., Perez- Martin, J., Feldbrugge, M., Basse, C. W., Steinberg, G., Ibeas, J. I., Holloman, W., Guzman, P., Farman, M., Stajich, J. E., Sentandreu, R., Gonzalez-Prieto, J. M., Kennell, J. C., Molina, L., Schirawski, J., Mendoza-Mendoza, A., Greilinger, D., Munch, K., Rossel, N., Scherer, M., Vranes, M., Ladendorf, O., Vincon, V., Fuchs, U., Sandrock, B., Meng, S., Ho, E. C. H., Cahill, M. J., Boyce, K. J., Klose, J., Klosterman, S. J., Deelstra, H. J., Ortiz-Castellanos, L., Li, W., Sanchez- Alonso, P., Schreier, P. H., Hauser-Hahn, I., Vaupel, M., Koopmann, E., Friedrich, G., Voss, H., Schluter, T., Margolis, J., Platt, D., Swimmer, C., Gnirke, A., Chen, F., Vysotskaia, V., Mannhaupt, G., Guldener, U., Munsterkotter, M., Haase, D., Oesterheld, M., Mewes, H.-W., Mauceli, E. W., Decaprio, D., Wade, C. M., Butler, J., Young, S., Jaffe, D. B., Calvo, S., Nusbaum, C., Galagan, J. & Birren, B. W. 2006. Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature, 444, 97-101. Kapitonov, V. V. & Jurka, J. 2008. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet, 9, 411-412. Kawai, M., Yamahara, M. & Ohta, A. 2008. Bipolar incompatibility system of an ectomycorrhizal basidiomycete, Rhizopogon rubescens. Mycorrhiza, 18, 205-10. Kennedy, P. G., Bergemann, S. E., Hortal, S. & Bruns, T. D. 2007. Determining the outcome of field-based competition between two Rhizopogon species using real- time PCR. Molecular Ecology, 16, 881-890. Kennedy, P. G. & Bruns, T. D. 2005. Priority effects determine the outcome of ectomycorrhizal competition between two Rhizopogon species colonizing Pinus muricata seedlings. New Phytologist, 166, 631-638. Kohler, A., Kuo, A., Nagy, L. G., Morin, E., Barry, K. W., Buscot, F., Canback, B., Choi, C., Cichocki, N., Clum, A., Colpaert, J., Copeland, A., Costa, M. D., Dore, J., Floudas, D., Gay, G., Girlanda, M., Henrissat, B., Herrmann, S., Hess, J., Hogberg, N., Johansson, T., Khouja, H.-R., Labutti, K., Lahrmann, U., Levasseur, A., Lindquist, E. A., Lipzen, A., Marmeisse, R., Martino, E., Murat, C., Ngan, C. Y., Nehls, U., Plett, J. M., Pringle, A., Ohm, R. A., Perotto, S., Peter, M., Riley, R., Rineau, F., Ruytinx, J., Salamov, A., Shah, F., Sun, H., Tarkka, M., Tritt, A., Veneault-Fourrey, C., Zuccaro, A., Mycorrhizal Genomics Initiative, C., Tunlid, A., Grigoriev, I. V., Hibbett, D. S. & Martin, F. 2015. Convergent losses of decay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nat Genet, 47, 410-415.

31

Korbel, J. O. & Lee, C. 2013. Genome assembly and haplotyping with Hi-C. Nat Biotech, 31, 1099-1101. Kretzer, A. M., Dunham, S., Molina, R. & Spatafora, J. W. 2004. Microsatellite markers reveal the below ground distribution of genets in two species of Rhizopogon forming tuberculate ectomycorrhizas on Douglas fir. New Phytologist, 161, 313- 320. Kretzer, A. M., Dunham, S., Molina, R. & Spatafora, J. W. 2005. Patterns of vegetative growth and gene flow in Rhizopogon vinicolor and R. vesiculosus (Boletales, Basidiomycota). Mol Ecol, 14, 2259-68. Kretzer, A. M., Luoma, D. L., Molina, R. & Spatafora, J. W. 2003. of the Rhizopogon vinicolor species complex based on analysis of ITS sequences and microsatellite loci. Mycologia, 95, 480-7. Kües, U., James, T. & Heitman, J. 2011. 6 Mating Type in Basidiomycetes: Unipolar, Bipolar, and Tetrapolar Patterns of Sexuality. In: PÖGGELER, S. & WÖSTEMEYER, J. (eds.) Evolution of Fungi and Fungal-Like Organisms. Springer Berlin Heidelberg. Kullman, B., Tamm, H. & Kullman, K. 2005. Fungal Genome Size Database. Lajoie, B. R., Dekker, J. & Kaplan, N. 2015. The Hitchhiker's guide to Hi-C analysis: practical guidelines. Methods, 72, 65-75. Lamb, J. C. & Birchler, J. A. 2003. The role of DNA sequence in centromere formation. Genome Biology, 4, 1-4. Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., Desalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J. & Snyder, M. 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22, 1813-1831. Larraya, L. M., Pérez, G., Peñas, M. M., Baars, J. J. P., Mikosch, T. S. P., Pisabarro, A. G. & Ramírez, L. 1999. Molecular Karyotype of the White Rot Fungus Pleurotus ostreatus. Applied and Environmental Microbiology, 65, 3413-3417. Lavender, D. P. & Hermann, R. K. 2014. Douglas-Fir: The Genus Pseudotsuga, OSU College of Forestry. Lee, S. C., Ni, M., Li, W., Shertz, C. & Heitman, J. 2010. The evolution of sex: a perspective from the fungal kingdom. Microbiol Mol Biol Rev, 74, 298-340. Li, P. & Adams, W. T. 1989. Range-wide patterns of allozyme variation in Douglas-fir (Pseudotsugamenziesii). Canadian Journal of Forest Research, 19, 149-161. Luoma, D. L., Durall, D. M., Eberhart, J. L. & Sidlar, K. 2011. Rediscovery of the vesicles that characterized Rhizopogon vesiculosus. Mycologia, 103, 1074-9. Marcet-Houben, M. & Gabaldon, T. 2010. Acquisition of prokaryotic genes by fungal genomes. Trends Genet, 26.

32

Marmeisse, R., Nehls, U., Öpik, M., Selosse, M. A. & Pringle, A. 2013. Bridging mycorrhizal genomics, metagenomics and forest ecology. New Phytologist, 198, 343-346. Martin, F., Aerts, A., Ahren, D., Brun, A., Danchin, E. G., Duchaussoy, F., Gibon, J., Kohler, A., Lindquist, E., Pereda, V., Salamov, A., Shapiro, H. J., Wuyts, J., Blaudez, D., Buee, M., Brokstein, P., Canback, B., Cohen, D., Courty, P. E., Coutinho, P. M., Delaruelle, C., Detter, J. C., Deveau, A., Difazio, S., Duplessis, S., Fraissinet-Tachet, L., Lucic, E., Frey-Klett, P., Fourrey, C., Feussner, I., Gay, G., Grimwood, J., Hoegger, P. J., Jain, P., Kilaru, S., Labbe, J., Lin, Y. C., Legue, V., Le Tacon, F., Marmeisse, R., Melayah, D., Montanini, B., Muratet, M., Nehls, U., Niculita-Hirzel, H., Oudot-Le Secq, M. P., Peter, M., Quesneville, H., Rajashekar, B., Reich, M., Rouhier, N., Schmutz, J., Yin, T., Chalot, M., Henrissat, B., Kues, U., Lucas, S., Van De Peer, Y., Podila, G. K., Polle, A., Pukkila, P. J., Richardson, P. M., Rouze, P., Sanders, I. R., Stajich, J. E., Tunlid, A., Tuskan, G. & Grigoriev, I. V. 2008. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature, 452, 88-92. Martin, F., Kohler, A., Murat, C., Balestrini, R., Coutinho, P. M., Jaillon, O., Montanini, B., Morin, E., Noel, B., Percudani, R., Porcel, B., Rubini, A., Amicucci, A., Amselem, J., Anthouard, V., Arcioni, S., Artiguenave, F., Aury, J.-M., Ballario, P., Bolchi, A., Brenna, A., Brun, A., Buee, M., Cantarel, B., Chevalier, G., Couloux, A., Da Silva, C., Denoeud, F., Duplessis, S., Ghignone, S., Hilselberger, B., Iotti, M., Marcais, B., Mello, A., Miranda, M., Pacioni, G., Quesneville, H., Riccioni, C., Ruotolo, R., Splivallo, R., Stocchi, V., Tisserant, E., Viscomi, A. R., Zambonelli, A., Zampieri, E., Henrissat, B., Lebrun, M.-H., Paolocci, F., Bonfante, P., Ottonello, S. & Wincker, P. 2010. Perigord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis. Nature, 464, 1033- 1038. Martín, M. P. 1996. The genus Rhizopogon in Europe. Barcelona: Soc. Catalana Micol., Spec. Ed. no. 5., 173 pp. Martinez, D., Larrondo, L. F., Putnam, N., Gelpke, M. D., Huang, K., Chapman, J., Helfenbein, K. G., Ramaiya, P., Detter, J. C., Larimer, F., Coutinho, P. M., Henrissat, B., Berka, R., Cullen, D. & Rokhsar, D. 2004. Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nat Biotechnol, 22, 695-700. Maser, C., Claridge, A. W. & Trappe, J. M. 2008. Trees, Truffles, and Beasts: How Forests Function, Rutgers University Press. Massicotte, H. B., Molina, R., Luoma, D. L. & Smith, J. E. 1994. Biology of the ectomycorrhizal genus, Rhizopogon. New Phytologist, 126, 677-690. Metzker, M. L. 2010. Applications of Next-Generation Sequencing: Sequencing technologies - the next generation. Nature Reviews Genetics, 11, 31-46. Molina, R. & Trappe, J. M. 1994. Biology of the Ectomycorrhizal Genus Rhizopogon. 1. Host Associations, Host-Specificity and Pure Culture Syntheses. New Phytologist, 126, 653-675.

33

Molina, R., Trappe, J. M., Grubisha, L. C. & Spatafora, J. W. 1999. Rhizopogon. In: Cairney, J. G. & Chambers, S. (eds.) Ectomycorrhizal Fungi Key Genera in Profile. Springer Berlin Heidelberg. Mujic, A. B. 2015. Symbiosis in the Pacific Ring of Fire: Evolutionary-Biology of Rhizopogon Subgenus Villosuli as Mutualists of Pseudotsuga. Doctoral Dissertation Oregon State University. Mujic, A. B., Durall, D. M., Spatafora, J. W. & Kennedy, P. G. 2016. Competitive avoidance not edaphic specialization drives vertical niche partitioning among sister species of ectomycorrhizal fungi. New Phytologist, 209, 1174-1183. Mujic, A. B., Hosaka, K. & Spatafora, J. W. 2014. Rhizopogon togasawariana sp nov., the first report of Rhizopogon associated with an Asian species of Pseudotsuga. Mycologia, 106, 105-112. Ni, M., Feretzaki, M., Sun, S., Wang, X. & Heitman, J. 2011. Sex in fungi. Annu Rev Genet, 45, 405-30. Nieuwenhuis, B. P., Billiard, S., Vuilleumier, S., Petit, E., Hood, M. E. & Giraud, T. 2013. Evolution of uni- and bifactorial sexual compatibility systems in fungi. Heredity (Edinb), 111, 445-55. Peay, K. G., Kennedy, P. G. & Bruns, T. D. 2008. Fungal Community Ecology: A Hybrid Beast with a Molecular Master. BioScience, 58, 799-810. Plett, J. M., Daguerre, Y., Wittulsky, S., Vayssieres, A., Deveau, A., Melton, S. J., Kohler, A., Morrell-Falvey, J. L., Brun, A., Veneault-Fourrey, C. & Martin, F. 2014. Effector MiSSP7 of the mutualistic fungus Laccaria bicolor stabilizes the Populus JAZ6 protein and represses jasmonic acid (JA) responsive genes. Proc Natl Acad Sci U S A, 111, 8299-304. Plett, J. M., Kemppainen, M., Kale, S. D., Kohler, A., Legue, V., Brun, A., Tyler, B. M., Pardo, A. G. & Martin, F. 2011. A secreted effector protein of Laccaria bicolor is required for symbiosis development. Curr Biol, 21, 1197-203. Plett, J. M. & Martin, F. 2015. Reconsidering mutualistic plant–fungal interactions through the lens of effector biology. Current Opinion in Plant Biology, 26, 45-50. Pop, M. 2009. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10, 354-366. Pop, M., Phillippy, A., Delcher, A. L. & Salzberg, S. L. 2004. Comparative genome assembly. Briefings in Bioinformatics, 5, 237-248. Quandt, C. A., Kohler, A., Hesse, C. N., Sharpton, T. J., Martin, F. & Spatafora, J. W. 2015. Metagenome sequence of Elaphomyces granulatus from sporocarp tissue reveals Ascomycota ectomycorrhizal fingerprints of genome expansion and a Proteobacteria-rich microbiome. Environmental Microbiology, 17, 2952-2968. Rai, M. & Varma, A. 2010. Diversity and Biotechnology of Ectomycorrhizae, Springer Berlin Heidelberg. Raper, J. R. 1966. Genetics of Sexuality in Higher Fungi. Ronald Press, New York. Rebollo, R., Romanish, M. T. & Mager, D. L. 2012. Transposable Elements: An Abundant and Natural Source of Regulatory Sequences for Host Genes. Annual Review of Genetics, 46, 21-42. Santana, M. F., Queiroz M.V 2015. Transposable Elements in Fungi: A Genomic Approach. Scientific Journal of Genetics and Gene Therapy, 1, 012-016.

34

Schadt, E. E., Turner, S. & Kasarskis, A. 2010. A window into third-generation sequencing. Human Molecular Genetics, 19, R227-R240. Selvaraj, S., J, R. D., Bansal, V. & Ren, B. 2013. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat Biotechnol, 31, 1111-8. Simard, S. W., Perry, D. A., Jones, M. D., Myrold, D. D., Durall, D. M. & Molina, R. 1997. Net transfer of carbon between ectomycorrhizal tree species in the field. Nature, 388, 579-582. Skrede, I., Maurice, S. & Kauserud, H. 2013. Molecular characterization of sexual diversity in a population of Serpula lacrymans, a tetrapolar basidiomycete. G3 (Bethesda), 3, 145-52. Smith, A. H. & Zeller, S. M. 1966. A preliminary account of the North American species of Rhizopgon. Memoirs of The New York Botanical Garden, 14, 1-178. Smith, K. M., Galazka, J. M., Phatale, P. A., Connolly, L. R. & Freitag, M. 2012. Centromeres of filamentous fungi. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology, 20, 635-656. Smith, S. E. & Read, D. 2008. Mycorrhizal Symbiosis. (Third Edition) London: Academic Press. Stajich, J. E., Wilke, S. K., Ahren, D., Au, C. H., Birren, B. W., Borodovsky, M., Burns, C., Canback, B., Casselton, L. A., Cheng, C. K., Deng, J., Dietrich, F. S., Fargo, D. C., Farman, M. L., Gathman, A. C., Goldberg, J., Guigo, R., Hoegger, P. J., Hooker, J. B., Huggins, A., James, T. Y., Kamada, T., Kilaru, S., Kodira, C., Kues, U., Kupfer, D., Kwan, H. S., Lomsadze, A., Li, W., Lilly, W. W., Ma, L. J., Mackey, A. J., Manning, G., Martin, F., Muraguchi, H., Natvig, D. O., Palmerini, H., Ramesh, M. A., Rehmeyer, C. J., Roe, B. A., Shenoy, N., Stanke, M., Ter- Hovhannisyan, V., Tunlid, A., Velagapudi, R., Vision, T. J., Zeng, Q., Zolan, M. E. & Pukkila, P. J. 2010. Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proc Natl Acad Sci U S A, 107, 11889-94. Stergiopoulos, I. & De Wit, P. J. 2009. Fungal effector proteins. Annu Rev Phytopathol, 47, 233-63. Stukenbrock, E. H. & Croll, D. 2014. The evolving fungal genome. Fungal Biology Reviews, 28, 1-12. Tavares, S., Ramos, A. P., Pires, A. S., Azinheira, H. G., Caldeirinha, P., Link, T., Abranches, R., Silva Mdo, C., Voegele, R. T., Loureiro, J. & Talhinhas, P. 2014. Genome size analyses of Pucciniales reveal the largest fungal genomes. Front Plant Sci, 5, 422. Tedersoo, L., May, T. & Smith, M. 2010. Ectomycorrhizal lifestyle in fungi: global diversity, distribution, and evolution of phylogenetic lineages. Mycorrhiza, 20, 217-263. Van De Werken, H. J., De Vree, P. J., Splinter, E., Holwerda, S. J., Klous, P., De Wit, E. & De Laat, W. 2012. 4C technology: protocols and data analysis. Methods Enzymol, 513, 89-112.

35

Van Der Heijden, M. G., Martin, F. M., Selosse, M. A. & Sanders, I. R. 2015. Mycorrhizal ecology and evolution: the past, the present, and the future. New Phytol, 205, 1406-23. Van Kan, J. a. L., Stassen, J. H. M., Mosbach, A., Van Der Lee, T. a. J., Faino, L., Farmer, A. D., Papasotiriou, D. G., Zhou, S., Seidl, M. F., Cottam, E., Edel, D., Hahn, M., Schwartz, D. C., Dietrich, R. A., Widdison, S. & Scalliet, G. 2016. A gapless genome sequence of the fungus Botrytis cinerea. Molecular Plant Pathology, n/a-n/a. Varoquaux, N., Liachko, I., Ay, F., Burton, J. N., Shendure, J., Dunham, M. J., Vert, J. P. & Noble, W. S. 2015. Accurate identification of centromere locations in yeast genomes using Hi-C. Nucleic Acids Res, 43, 5331-9. Wan, J. N., Li, Y., Shimomura, N., Yamaguchi, T. & Aimi, T. 2013. Characterization of DNA polymorphisms in Rhizopogon roseolus homeodomain protein genes and their utilization for strain identification. Mycological Progress, 12, 353-365. Whitehouse, H. L. 1949a. Heterothallism and sex in the fungi. Biol Rev Camb Philos Soc, 24, 411-47. Whitehouse, H. L. K. 1949b. Multiple-Allelomorph Heterothallism in the Fungi. New Phytologist, 48, 212-244. Wieloch, W. 2006. Chromosome visualisation in filamentous fungi. J Microbiol Methods, 67, 1-8. Zolan, M. E., Heyler, N. K. & Stassen, N. Y. 1994. Inheritance of Chromosome-Length Polymorphisms in Coprinus Cinereus. Genetics, 137, 87-94.

36

Figure 1.1: ITS phylogram of Rhizopogon showing the placement of R. vesiculosus. Representatives of all 5 currently recognized subgenera in genus Rhizopogon are included. A number of the sequences were provided by A.B Mujic, and the alignment was made with muscle and manually edited. The phylogenetic tree was built in Geneious version 9.15 using UPGMA and the Tamura-Nei for genetic distance.

37

Figure 1.2: Morphology of R. vesiculosus life history stages. “R.vinicolor-type” signifies that the species has not been determined but is likely either R. vesiculosus or R. vinicolor. A) young sporocarp displaying patches of yellow and vinaceos stains in its natural fruiting habitat between duff and the mineral soil horizon. B) Gleba of young and mature sporocarp. Note change in gleba color. C) Basidiospores. D) Young root tip produced in bioassay. E) Hartig net in Rhizopogon vinicolor-type tuberculate root tip. Arrow points towards a portion of the hartig net which extends between the lighter plant cells. F) Older root tips produced in bioassay leaning towards a more coralloid growth form, arrow shows rhizomorphs. G) Scanning electron microscope caption of R. vinicolor-type coralloid root tips. H) Cross section of a R. vinicolor-type tuberculate root tip where a peridium enclose dense aggregates of mycorrhizal root tips. I) Scanning electron microscopy photo of a cross section of R. vinicolor-type tubercule. The fungal mycelium can be seen as thin threads populating the space in between the individual roots. Photo courtesy of Hugues Massicotte for E,G,H, and I.

Figure 1.3: The typical lifecycle of an EM basidiomycete represented here by Rhizopogon, proceeding in clockwise direction.

38

39

Figure 1.4: A representation of the Hi-C library construction process adapted from Belton et al. 2012. A) Covalent crosslinking of protein/DNA complexes with formaldehyde. B) Digestion of chromatin with restriction enzymes marked with a biotinylated nucleotide. C) DNA in the crosslinked complexes is ligated to produce chimeric DNA molecules. D) Biotin removal from the ends of linear molecules which are fragmented for size reduction. E) The chimeric molecules with internal biotin incorporation are pulled down with streptavidin coated magnetic beads and modified for deep sequencing.

40

Figure 1.5: Main classes of transposable elements (TE) and their mechanism of transposition (adapted from Grandaubert et al 2014 and CuboCube.com). A) TEs are divided into two classes according their transposition intermediate, which can be either RNA or DNA. Pr-protease; In-integrase; Rt-reverse transcriptase; Rn-RNase H. Elements that encode the enzymatic machinery to transpose are considered autonomous. B) Class I transposons rely on RNA as intermediate and transpose through a copy-and-paste mechanism resulting in an increase of copy number. Class II transposons use DNA as intermediate and transpose through a cut-and-pase mechanism.

41

Table 1.1: Sequenced and annotated genomes of the Boletales by 1KFG as of June 2016. Species Genes Genome Contigs Scaffolds Scaffolds Scaffold Scaffold size (>2Kb) N50 L50 (Mbp) Boletus edulis 16,933 46.64 4723 1099 779 42 0.29 Coniophora 14,928 39.07 990 863 682 80 0.14 olivacea Coniophorq 13,761 42.97 1034 210 210 7 240 puteana Gyrodon lividus 11,779 43.05 1390 369 294 14 1.16 Hydnomerulius 13,270 38.28 2315 603 428 16 0.69 pinastri Leucogyrophana 14,619 35.19 1347 1262 976 108 0.09 mollusca Paxillus 13,473 35.76 2244 1194 992 84 0.09 ammoniavirescens Paxillus involutus 17,968 58.30 6222 2681 2681 29 0.38 Paxillus 18,999 64.46 4331 1671 1335 165 0.11 rubicundulus Pisolithus 21,064 53.03 5476 1064 877 89 0.15 microcarpus Pisolithus tinctorius 22,701 71.01 5915 610 491 36 0.49 Rhizopogon 18,900 82.29 5710 1976 1629 240 0.11 salebrosus Rhizopogon 14,469 36.10 3018 2310 1633 218 0.04 vinicolor Scleroderma 21,012 56.14 3919 938 681 63 0.24 citrinum Serpula lacrymans 12,789 42.73 375 36 36 6 2.95 Serpula lacrymans 13,805 45.98 375 36 3291 553 0.02 var shastensis Suillus americanus 17,163 50.81 3604 1307 686 47 0.30

Suillus brevipes 21,458 52.03 2205 1550 1132 84 0.16

Suillus decipiens 16,894 62.78 3648 1391 873 48 0.34

Suillus granulatus 15,802 42.34 1869 628 419 24 0.51 Suillus hirtellus 17,067 49.94 1626 644 464 36 0.41

Suillus luteus 16,744 41.74 1365 649 488 49 0.25 Xerocomus badius 14,938 38.39 2580 1390 1092 110 0.09

42

Table 1.2: Determined mating systems in the Boletales. Species Mating system Study Coniophora puteana Bipolar (Ainsworth and Rayner 1990) Paxillus involutus Tetrapolar (Fries 1985) Rhizopogon roseolus Bipolar (Kawai et al. 2008) Serpula lacrymans Tetrapolar (Skrede et al. 2013) Suillus bovinus Bipolar (Fries and Sun 1992) Suillus granulatus Bipolar (Fries and Neumann 1990) Suillus luteus Bipolar (Fries and Neumann 1990) Suillus variegatus Bipolar (Fries 1994)

43

Chapter2: Hi-C assembly of Rhizopogon vesiculosus reveals the genome organization of an ectomycorrhizal truffle in Boletales

Dabao Sun Lu, Ivan Liachko, Shawn T. Sullivan, Alija B. Mujic, Maitreya Dunham,

Joseph W. Spatafora

44

ABSTRACT:

Rhizopogon species are important ectomycorrhizal symbionts of Pinaceae in western North America where they exhibit varying degrees of host specificity. Genome sequencing of fungi through efforts like the 1000 Fungal Genome Project have produced draft assemblies of multiple Rhizopogon species derived from short read sequencing.

While these assemblies are sufficient for capturing gene space, they are limited in their ability to resolve complex repetitive regions (e.g., transposable elements, centromeres) as they are fragmented over thousands of contigs. In order to map the large scale organization and advance the assembly of Rhizopogon genomes, chromatin conformation capture assay (Hi-C) sequencing was performed on R. vesiculosus. The resultant Hi-C reads were used to scaffold the existing Illumina assembly of the same strain into 10 chromosome-scale scaffolds containing 97.1% of the original assembly’s sequence length.

Blast results on single copy genes provided evidence that the individual haplotypes of the parent dikaryon were co-assembled. We predict that R. vesiculosus has 10 chromosomes, spanning lengths from 2.8 to 10.1 Mb. This Hi-C-based assembly was annotated with

RNA and protein data, resulting in the first Boletales genome with a chromosome level annotation. Centromeres were regional in nature, averaging ~40 Kb in length, and possessed a gene density comparable to the genome average. The mating type loci were determined to reside on separate linkage groups, implying a bifactorial architecture of the mating system. Comparison with other species of Rhizopogon and Suillus revealed distinct regions in the genome enriched for species specific genes across all linkage groups. We mined the genome for small secreted proteins, and found ~200 of which 50% were species specific. Repetitive DNA was found to comprise 16.7 % of the genome, and

45

46.6% of these sequences were annotated as transposable elements (TE). Mapping of

SNPs across the genome revealed higher levels of polymorphisms associated with TE,

SSPs and species specific genes as compared to orthologs shared with Suillus. SNPs associated with TEs also revealed elevated transition rates suggestive of genome defense mechanisms targeting these regions. Here, we highlight the nature of TE, the mating loci, centromeres, small secreted proteins, and the use of this Hi-C-based genome assembly in comparative analyses of Rhizopogon and its sister genus Suillus.

46

INTRODUCTION

The increased efficiency and decreased cost of next generation sequencing (NGS) technologies have enabled whole genome sequencing to become one of the standard tools in biological inquiry. In recent years, genome sequencing projects have generated a wealth of data that have benefitted a broad range of studies in fungal biology ranging from genome-scale phylogenetic analyses (Binder et al. 2013, Wisecaver et al. 2014), to populations genomics and genome-wide association studies (Ellison et al. 2011, Branco et al. 2015), to ecological genomics of fungal-host interactions e.g., (Ma et al. 2010,

Kohler et al. 2015). Most genome assemblies used in these studies have been generated from short read sequencing technologies (e.g., Illumina), and tend to be fragmented over thousands of contigs. This has led to the situation where researchers may have a good estimate for the number of genes in a species, but limited information about genome organization including the number of chromosomes. The lack of understanding of the higher level organization is compounded by repetitive regions in the genome, which include telomeres, transposons, and in many cases centromeres. As the short reads generated by NGS are inherently limited in their ability to span regions consisting of repetitive motifs which exceed the read length (Flot et al. 2015), these regions remain largely unassembled and unmapped (Galagan et al. 2005). Yet, these repetitive regions frequently comprise a significant percentage of the genome in multicellular eukaryotes where they may play significant roles in gene regulation and generation of genomic variation for adaptive evolution (Biémont 2010).

Fungi have been the subject of numerous genomic studies, and today kingdom

Fungi represents the deepest sampling of whole genomes within the eukaryotes (Taylor

47

and Berbee 2014). The popularity of fungi as eukaryote study organisms can mainly be attributed to their relatively small, and often haploid, genomes, along with the ease of cultivation and short generation times for many species (Galagan et al. 2005). However ectomycorrhizal (EM) fungi, which are common mutualists to trees and woody shrubs in boreal and temperate ecosystems (Smith and Read 2008), where they may perform essential roles pertaining to nutrient cycling and carbon storage (van der Heijden et al.

2015), frequently do not fit this pattern. Notably, they tend to have larger genomes

(Martin et al. 2008, Martin et al. 2010, Hess and Pringle 2014, Quandt et al. 2015) and often do not grow or complete their life cycle when separated from their obligate plant hosts (Tedersoo et al. 2010). Moreover, Agaricomycetes, which comprise the majority of the EM fungi (Brundrett 2002), exist in a dikaryotic state where two dissimilar nuclei remain separate in the cells throughout the majority of their life cycle (Casselton 2002).

The number of sequenced mycorrhizal genomes has accelerated in recent year largely thanks to the Mycorrhizal Genome Initiative (MGI) and the One Thousand Fungal

Genomes Project (1KFG) sponsored by the Joint Genome Institute. Comparative genomic analysis of these genomes by Kohler et al. (2015) suggest that there are characteristic genome signatures associated with the EM lifestyle including i) a reduction in plant cell wall degrading enzymes (PCWDEs), ii) increased genome size due to an increase in copy number and types of transposable elements (TEs), and iii) expansion of lineage specific small secreted proteins (SSPs) that are likely responsible for inducing the formation of mycorrhiza.

Rhizopogon (Boletales) is a diverse genus that forms EM associations with members of the Pinaceae and produces hypogeous, truffle-like fruiting bodies (Molina

48

and Trappe 1994). Rhizopogon represents a transition to the EM ecology independent from Laccaria in the Agaricales (Tedersoo et al. 2010) where the genome has been subjected to more thorough characterizations for EM features (Martin et al. 2008).

Rhizopogon has been particularly well studied with respect to systematics (Grubisha et al.

2002), ecology (Kennedy and Bruns 2005, Kennedy et al. 2007,) and population biology

(Dunham et al. 2013), and it is documented as a common EM colonizer of young seedlings (Twieg et al. 2007), and major component of EM communities both on root systems (Twieg et al. 2007) and in the soil spore bank (Bruns et al. 2009). These studies were facilitated by the development of microsatellites (Kretzer et al. 2000), and characters such as a culturable dikaryon and EM morphology that is easily recognizable in the field. Numerous ecology studies have also been conducted on Suillus, the sister genus of Rhizopogon, which exhibits the same EM host specificity for Pinaceae, but produces pileate-stipitate above ground fruiting bodies (=mushrooms). This study is concerned with the genome organization of Rhizopogon vesiculosus, which together with its sister species R. vinicolor show strict host specificity to Pseudotsuga menziesii

(Molina and Trappe 1994), and have been extensively studied for life history traits (e.g., see Beiler et al. 2012, Mujic et al. 2016). The genomes of both species have also been sequenced with the Illumina platform and annotated, resulting in short read draft assemblies (Mujic 2015).

Traditionally, the challenges in obtaining long range sequence contiguity have been faced through the creation and end sequencing of clone libraries, but these may be prone to error and bias (Galagan et al. 2005). Recently, implementation of single molecule sequencing (e.g. PacBio) has demonstrated the ability to generate reads of

49

thousands of bases sufficient for either improving existing short read assemblies or generating de novo assemblies (Rhoads and Au 2015). The end result is a more assembled, genome and in some cases identification of linkage groups (Van Kan et al.

2016). Other tools for obtaining large scale gene order include mapping approaches like

HAPPY and optical mapping (Galagan et al. 2005); the latter was used by Stajich et al.

(2010) to obtain a chromosome level assembly of the agaricomycete Coprinopsis cinerea.

Hi-C sequencing is among the technologies that can capture 3-dimensional (3D) genome architecture through the quantification of 3D contacts across the genome. The method involves the crosslinking of chromatin, which is subsequently digested with restriction enzymes and biotinylated to form chimeric sequences that are subjected to massive parallel sequencing to create a genome wide contact map (Belton et al. 2012,

Flot et al. 2015, Lajoie et al. 2015). Although Hi-C sequencing was initially developed to probe chromatin structure, the interaction maps has proven useful for a number of purposes. For instance, the distant dependent decay of interaction frequency, which is due to the fact that crosslinking is more likely to happen between regions in physical proximity along the chromosome, can be used as basis for establishing long-range contiguity information in genomes. Furthermore, the fact that chromosomes tend to interact more in cis than trans, and that centromeres tend to interact among chromosomes

(Selvaraj et al. 2013), makes Hi-C a viable sequence method for constructing chromosome-scale assemblies of fungal genomes.

Here we present an annotated Hi-C assembly of the R. vesiculosus genome, which to our knowledge represents the first chromosome level assembly of a species of

Boletales. These results provide information on large scale genome organization

50

including the nature and composition of Boletales chromosomes and centromeres, abundance and distribution of repetitive elements, organization of mating type genes, distribution of small secreted proteins, and the distribution of single nucleotide polymorphisms (SNPs) within a dikaryon. A comparative analysis that includes

Rhizopogon vinicolor, R. salebrosus, Suillus brevipes and S. luteus is used to further characterize lineage specific genes at the species, subgenus, and genus level, and to identify unique regions of chromosomes that have undergone species specific diversification.

MATERIALS AND METHODS

Isolate Information:

The sequenced isolate of R. vesiculosus, AM-OR11-056, was collected in the

Oregon coast range on the north slope of Mary’s Peak on Woods Creek Road, Benton

County, Oregon. (44 32.1 N, 123 32.1 W, elevation: 1690 meters). A portion of the collection was submitted to the mycological collection at the OSC Herbarium as a voucher with the accession number OSC# 148003. A pure culture was obtained from the dikaryotic tissue of the sporocarp, which was grown at 20°C and transferred every 3-4 months. The cultures were grown on Modified Melin Norkrans media as described in

(Rossi and Oliveira 2011) with the addition of 4.88 g/L 2-(N-morpholin) ethane sulfonic acid (MES). The Illumina DNA and RNA sequencing and assembly of AM-OR11-056, along with details on the culturing are described in Mujic (2015).

Hi-C DNA Extraction, Library Preparation, and Sequencing:

51

Hi-C libraries were generated as described in (Burton et al. 2014) with the exception that a 4-cutter restriction enzyme Sau3AI was used to fragment the crosslinked chromatin instead of HindIII. Approximately 50 microliters of fungal mycelium was used for sample preparation. The resulting libraries were sequenced on an Illumina NextSeq

500 instrument at the Department of Genome Sciences at the University of Washington.

Hi-C-based Proximity-Guided Assembly:

The Hi-C raw reads were mapped to the draft Illumina assembly (generated from the same fungal isolate; Mujic 2015) which was subsequently scaffolded using the

Lachesis software (Burton et al. 2013). Clustering analysis was run with the parameters

Nlinkage groups = 24, 16, 12, 11, 10, 9, and 8, where N=10 resulted in the most optimal clustering and was retained for further downstream analyses. The resulting Hi-C based

Proximity-Guided Assembly (PGA) will be referred to as the Hi-C assembly in the rest of this text.

Centromere and Telomere Identification:

Centromeres were identified based on known Hi-C centromere patterns in yeast caused by their clustering behavior: The default Lachesis settings created a single cluster containing Hi-C linkage patterns consistent with all centromeres being placed into a single cluster. Using the Hi-C assembly clustering results for N=10, Hi-C linkage density peaks in the hypothetical all-centromere cluster were identified, and these peaks were mapped back to contigs in the original assembly by virtue of the fact that Hi-C linkage data is expressed between contigs in the original assembly. The identified centromere containing contigs were forbidden from being placed in the same cluster in subsequent

PGA scaffolding runs. PGA scaffolding was repeated several times and centromeric

52

contig calls were refined until only one centromere-like pattern per cluster was observed.

Searches for telomeric sequences were conducted by matching for repeated strings of the motif ” TTAGGG” in the Hi-C assembly, Illumina assembly, and the Illumina raw reads.

A blastn query was also carried out for other fungal telomere repeat sequences listed in

Cohn (2006).

Hi-C comparison to Illumina:

MUMer version 3.18 (Kurtz et al. 2004) was run with the Hi-C assembly as query against the Illumina assembly from Mujic (2015) as reference using the nucmer algorithm with the maxmatch option. The resulting nucmer output was filtered with the –delta m option and plotted with mummerplot in MUMer. The summary statistics of a comparison between the Hi-C and initial Illumina assembly were obtained by running dnadiff in

MUMmer with default settings. CEGMA version 2.5 (Parra et al 2007) was run on the

Hi-C and the Illumina assembly to compare their completeness score.

Identification and Annotation of Repetitive Content :

A combined approach was employed for identifying repetitive content in the genome. RepeatScout version 1.0.5 (Price et al. 2005) was run on the Hi-C assembly to generate a custom repeat library with lmer size set as 14. The derived repeat library was filtered with RepeatMasker (Smit et al. 2015) and filter_stage2.prl in RepeatScout so that only repeats occurring ≥ 10 times in the genome and ≥ 100 bp in length were retained. In addition, repeats with significant blastx (cutoff e-value = 10-5) hits to genes found in

Uniprot/Swiss-prot (2015) were discarded unless they also had a hit to one of 124 Pfam domains associated with transposable elements in an overview compiled by Piriyapongsa et al. (2007). The Pfam search was submitted as a batch job through pfam.xfam.org (Finn

53

et al. 2014). A second search for transposable elements in the Hi-C assembly was conducted using LTRharvest (Ellinghaus et al. 2008). The LTRharvest output was filtered for sequences with ≥ 5 blastn hits to the genome (cutoff e-value =10-15), ≥ 400 bp in length, and no blastx hits (cutoff e-value = 10-5) to genes in Uniprot/Swiss-prot. The filtered repeats from RepeatScout and putative TE elements identified with LTRharvest were clustered together and deduplicated at 80% similarity with USEARCH version

4.2.66 (Edgar 2010), and submitted to RepeatMasker (Smit et al. 2015) to estimate repeat content in genome and determine location. The repeats were annotated by tblastx queries

(cutoff e-value = 10-5) against Repbase version 21.06 (Bao et al. 2015) and annotated custom TE libraries of Amanita spp. from Hess et al. (2014), in addition to Pleurotus ostreatus and Serpula lacrymans in Castanera et al. (2016). In cases were the repeats had hits to different elements in the Repbase and the custom libraries, the Repbase annotation was chosen. Classification into major TE families followed the system proposed by

Kapitonov and Jurka (2008). The Gypsy and Copia elements were aligned using MAFFT version 6.864b (Katoh et al 2009), respectively, and phylogenetic trees was constructed of the alignments using the UPGMA option with the Tamura-Nei in Geneious version

9.15 (Kearse et al. 2012).

Annotation of the Hi-C assembly:

Gene annotation was performed using the MAKER version 2.10 pipeline (Holt and Yandell 2011) on the assembly masked with the clustered and deduplicated custom repeat library from RepeatScout and LTRharvest. Transcriptome data from Mujic (2015) of the same isolate was used as EST evidence, and a dataset of Boletales protein models from R. vinicolor, Boletus edulis, S. brevipes, and S. luteus annotated by JGI as part of

54

the 1KFG and MGI projects available on the JGI MycoCosm portal (Grigoriev et al. 2014) were given as protein evidence. GeneMark version 2.3a (Besemer and Borodovsky 2005) and Augustus version 2.3.1 (Stanke and Morgenstern 2005) were used for ab initio estimation of gene models, where the species in Augustus was set as Laccaria bicolor.

The scripts maker_map_ids, map_fasta_ids, fasta_merge, map_gff_ids, and gff3_merge in MAKER were utilized to produce the final fasta and gff files. The MAKER pipeline was run both using the mask option given the filtered custom repeat library generated by

RepeatScout and without the mask option for comparison. Gene-space was calculated from the maker gene models including introns, whereas intergenic space was defined as the region that was not occupied by gene space or identified as repetitive DNA. Further functional annotation of the gene models involving Gene Ontology (GO) terms, enzyme codes, and Interpro IDs were assigned through Blast2GO (Conesa et al. 2005). The

MAKER proteins predicted in the Hi-C assembly were used to query against all MAKER proteins found in the Illumina assembly by (Mujic 2015) with blastp. The blastp output was sorted for unique top hits with bash scripts in order to transfer GenBank protein ID’s and evaluate the amount of overlap in genes models found in the two different assemblies.

Identification of Small Secreted Proteins:

The characterization of the secretome and subsequent identification of SSPs followed the pipeline in Pellegrin et al. (2015). The MAKER annotation of the unmasked

Hi-C assembly was used to allow for a wider search space. The gene models were first filtered for those carrying a signal peptide as predicted by SignalP version 4.1 (Petersen et al. 2011), using a sensitive cutoff of 0.34, and targeted for a secretory pathway as predicted by TargetP version 1.1 (Emanuelsson et al. 2000) with the –N option. Gene

55

models with more than 1 transmembrane helix as predicted by TMHMM version 2.0c

(Krogh et al. 2001) were filtered out. WoLF PSORT version 0.1 (Horton et al. 2007) was run on the remaining gene models with fungi set as model organism. Proteins with a predicted extracellular location by WoLF PSORT were retained and scanned for the kdel motif with PS-SCAN (http://www.hpa-bioinfotools.org.uk/cgi- bin/ps_scan/ps_scanCGI.pl) under the prosite accession PS00014, and positive matches were discarded. The remaining protein models were considered to represent the secretome, and further size selected for proteins ≤ 300 amino acids and with ≥ 4 cysteine residues with a custom biopython script in order to identify the SSPs.

Identification of Orthologous Gene Clusters and Lineage Specific Genes:

For the identification of orthologous gene clusters, FastOrtho (Wattam et al. 2014) was run with default settings on the R. vesiculosus MAKER gene models along with protein fasta files comprising all gene models of R. salebrosus, R. vinicolor, S. brevipes and S. luteus annotated by JGI available through the MycoCosm portal (Grigoriev et al.

2014). Custom bash scripts and Venny (Oliveros 2015) were used to parse out orthologous clusters belonging to different taxonomic levels. Gene models in R. vesiculosus that were not detected in any orthologous cluster, as well as orthologous clusters containing only R. vesiculosus genes were assigned as species specific.

Single nucleotide polymorphism (SNP) Analysis:

The Illumina raw reads were trimmed and quality filtered with the FASTX-

Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and interleaved using a custom perl script. The paired and filtered raw reads were aligned to the Hi-C assembly with Bowtie version 2.1.0 (Langmead and Salzberg 2012) using the –sensitive and –no-discordant

56

options. SAMtools version 1.3 (Li et al. 2009) was used to sort the resulting alignment and generate a mpileup file. SNP calls were made with VarScan version 2.4 (Koboldt et al. 2012) using a Bonferroni corrected p-value cutoff of 10-4. SNPs were quantified in 50 kb windows along the different linkage groups with a custom script in R (Appendix 1 R script1). The SNPs were split into transitions/transversions using bash commands, and were quantified in genes versus TEs using another R-script (Appendix 1 R script 2).

Identification of Mating loci, Nucleolar Organizing Region and RIP genes:

The mating loci were identified by a blastn search querying with the sequences from the mating gene annotations of the Illumina assembly of R. vesiculosus in Mujic

(2015) against the Hi-C assembly as subject. The top hits were then compared to the annotation of the Hi-C annotation to assess what gene models the blast hits fell within.

For the b-locus, a custom perl script described in Mujic (2015) was used to identify pheromone precursor genes as these were missed by the MAKER annotation. A dataset of ITS, LSU and 18s SSU of Rhizopogon spp. (R. occidentalis, R. roseolus, R. villosulus,

R. parksii, R. hawkerae, R. vesiculosus) obtained from GenBank (Benson et al. 2005) were used as blastn queries against the assembly to locate the nucleolar organizer region.

The read depths for the hits to these genes were estimated using SAMtools –depth on the sorted .bam file produced by the Bowtie alignment. Blastn and blastp queries were performed in the same fashion to RPB1 and RPB2 genes of Rhizopogon, Beta-tubulin in

Suillus, RIP defective (rid) genes of Neurospora spp, misc in immersus, and recQ-like helicases in Ustilago maydis.

Data Visualization and Additional Softwares:

57

Statistical tests and calculations were performed in R version 3.3.1 (2016). Venn diagrams were made in Venny (Oliveros 2015) and Venn

(http://bioinformatics.psb.ugent.be/webtools/Venn/). The distribution of GO annotations was visualized with Blast2GO. Geneious version 9.15 was used to submit batch blast jobs and facilitate intermediate steps between a number of analysis, as well as visualize the synteny of the mating loci. Coloring of phylogenetic trees were facilitated with Figtree version 1.4.2. (http://tree.bio.ed.ac.uk/software/figtree/). All other plots were made in R and the ggplot2 package (Wickham 2009) in R. BLAST queries (blastn, blastp, blastx, tblastx) were performed with ncbi-blast version 2.2.31. The Hi-C assembly using

Lachesis was generated at Phase Genomics (Seattle, WA), while all other command line programs where run in the tcsh environment on the computational cluster at the Center of

Genome Research and Biocomputing at Oregon State University.

RESULTS

Hi-C Assembly:

Lachesis analyses found 10 linkage groups to be the optimal number of clusters judged by the diagonal cis patterns and centromeric trans patterns in the post-clustering heatmap (Figure 2.1). 97.1 % of the Illumina assembly sequences were covered on the

10 linkage groups, while the remaining 2.9% were distributed over 2291 scaffolds of which 400 were > 1000 bp in length (Table 2.1). Length of the linkage groups spanned from 10.1 to 2.8 Mb. A comparison of the Hi-C and Illumina assemblies is listed in Table

2.1 and statistics of the individual linkage groups are summarized in Table 2.2. Mummer

58

analysis of the two assemblies revealed a strong pattern of colinearity, but did reveal numerous examples of small regions that were scaffolded differently (Figure 2.2). The

Hi-C assembly scored better on the CEGMA analysis than the Illumina assembly (Table

2.1), suggesting it provides a better basis for detecting gene models.

Identification of Centromeres and Telomeres:

Prior to having taken steps to account for the tendency of fungal centromeres to cluster together near the spindle pole body, the default Lachesis settings created a single cluster containing Hi-C linkage patterns consistent with all centromeres being placed into a single cluster. Using the PGA clustering results for N=10, Hi-C linkage density peaks in the hypothetical all-centromere cluster were identified; numbering 10 in all, consistent with optimal PGA clustering results. The centromeres on linkage group 1-8 show clean interaction patterns (Fig 2.2), whereas centromeres on linkage group 0 and 9 display more complex interactions. The centromeres were found to be regional, and ranged in length from 16-65 Kb, with an average length 40 Kb. Linkage group 1,2,3, 8 and 9 approximate metacentric (submetacentric) centromeres where the centromere is more centrally placed as opposed to the other linkage groups, which are more acrocentric. The centromeres contained an average gene density of 338 genes per Mb which is comparable to the genome average (Table 2.1). From blast queries to NCBI and blast2GO functional annotation on the gene models within the centromeres, it was found that linkage groups

1,2,4,7,8, and 9 contained gene models that were directly associated with the centromere or centromere associated proteins CENP (Table S-2.1) listed in (Smith et al. 2012) within or in close proximity to the identified centromere boundaries. Centromeric regions in linkage group 2, 7, and 9 also contained proteins associated with the ribosome (Table S-

59

2.1). A visual assessment of the linkage group ends did not reveal any repetitive telomeric sequences, and no clustering was found in blast queries. The motif searches for the fungal telomere motif (TTAGGG)n in the Hi-C and Illumina assembly detected the same number of repeats; 19 2X repeats, 1 3X repeat, and 0 repeats > 3X. In comparison,

2582 2X, 176 3X, 12 4X, and 7 5X repeats were found in the Illumina raw reads.

Furthermore, 6 TTAGGG repeats were repeated 7X or more, and 1 sequence contained >

10X repeats.

Transposable Elements and Repetitive DNA:

RepeatMasker estimated the repeat library to comprise 16.7 % of the genome

(Figure 2.5) of which TEs comprised 11% (65.8% of the repeat library). The distribution of repetitive DNA among the linkage groups did not follow linkage group size, but the amount was more constant (Figure 2.5). The repeat library is represented by 26484 repeat sequences of which 14135 repeat sequences comprised 53.3 % of the repeat content and did not have any hits to Repbase or the custom fungal repeat libraries, and as such could not be classified. 9.3 % of the repeats, represented by 2471 repeat sequences, had hits to

TEs but the annotations were of unknown placement with regards to the current classification system (Kapitonov and Jurka 2008), and were binned as unclassified repeats. The remaining 9878 sequences could be classified, with retrotransposons representing the most abundant class both in repeat counts and occupancy in the genome.

Among the retrotransposons Copia elements (2933/1.6%) were the most common followed by Gypsy (700/0.54%), although Ngaro made up significant portion in linkage group 4 (119/1.5%) and 6 (38/0.8%). Mariner-like elements (1556/0.82%) were the most common class of DNA transposons, and were most abundant on linkage group 0, 1, 4,

60

and 6 (Figure 2.3, Table 2.3). The phylogeny of Gypsy and Copia elements revealed clades mainly consisting of a mixed origin with regards to linkage group, but among these were also clades dominated by elements from a single linkage group (Figure 2.4) suggesting a complex pattern of interchromosomal transposition combined with more recent local expansions. In the case of the Gypsy elements, the two most prominent linkage group specific clades were represented by elements from linkage group 1, which also contains the most Gypsy elements.

Linkage group 3 had the largest proportion of repetitive content not associated with TE, represented by simple repeats (56% of the simple repeat length). The great majority of low complexity repeats were localized to one region on linkage group 0

(99.7%) (Figure 2.3A).

Gene Annotation:

The MAKER annotation pipeline predicted 10067 and 13899 gene models in the masked and unmasked genomes, respectively. The gene space in the masked assembly was found to comprise 48% of the genome, and its distribution over linkage groups was largely proportionate to linkage group size (Figure 2.5). In the blastp output of the unmasked genome against the Illumina maker gene models 7002 of these had a blastp e- value of 0.0 and could readily be identified as identical to genes in the Illumina annotation. In addition, 2086 gene models were aligned together without gaps with matching query-, subject-, and alignment lengths, but had a e-value greater than 0.0.

Both HD proteins of the A-locus were found on linkage group 0 flanked by metalloendopeptidase gene (mip) and Beta-flanking region, displaying a synteny consistent with other Agaricomycetes (Fig. 2.6) (James 2007). Four pheromone precursor

61

and 4 pheromone receptor genes were identified on linkage group 5, residing within a 20 kb window and represented the B-locus. All the pheromone receptor genes had corresponding MAKER gene models while none of the pheromone precursor genes were predicted by MAKER. The pheromone identifying custom perl script and a blastn query to additional pheromone and pheromone receptors in R. vinicolor found 8 more partial hits to pheromone precursor genes and 1 additional pheromone receptor gene within a

200 Kb window. Three putative ICMT genes were found within 20 Kb upstream of the core B-locus, two of these overlapped with MAKER gene models. The nucleolar organizing region was mapped to linkage group 5, where the ribosomal DNA repeat unit was not found in multiple copies in the blastn query. Partial hits to LSU and 18s SSU was also found on linkage group 1. Coverage analysis of the Illumina raw read alignment to the Hi-C assembly revealed that the read depth in 18s SSU and LSU averaged 6670x. In comparison, the read depth of the single copy genes RPB1, RPB2, and a partial sequence of beta-tubulin averaged 51x. These results suggest that the rDNA tandem repeat is represented in approximately 130 copies.

Blastn and blastp queries to the RIP defective (rid) genes of Neurospora spp. and the masc1 in Ascobolus immersus did not identify any hits. However a search for the

DNA MTase family methyltransferase PFAM domain (PF00045) listed in (Amselem et al.

2015) to the Blast2GO functional annotations uncovered 3 gene models containing this domain, where all three where located to linkage group 0, and 2 of these occurred in tandem. The query for the reqQ-like helicase in U. maydis did not identify any hits.

Orthologous Clusters and Species Specific Genes:

62

FastOrtho found 13350 orthologous clusters using the masked annotation of R. vesiculosus, of these, 6206 were shared among all 5 species, while 42 were specific to R. vesiculosus. The next highest number of clusters comprised orthologs shared between S. luteus and S. brevipes (1838), while R. salebrosus had the highest number of lineage specific clusters (1097) (Figure 2.7). 1983 genes in R. vesiculosus were not identified in any cluster, and together with the 90 genes in the 42 species clusters comprised 2073 genes that were identified as species specific. The mapping of species specific genes revealed islands along each linkage group enriched for species specific genes (Figure 2.8,

Figure S-2.1). All linkage groups contained regions with 75 % percent or more species specific genes, however the aggregation of these regions vary considerably between different linkage groups (Figure 2.9), where linkage group 3 and 8 possessed the highest number of these regions (Figure 2.9), and in linkage group 8 these regions made up the longest continuous stretches (Figure S-2.1).

Secretome and Small Secreted Proteins

The characterization of SSPs found 670 protein models to meet the criteria of the fungal secretome (Kim et al. 2016, Pellegrin et al. 2015). Of these, 201 could be classified as SSPs. The mapping of SSP did not reveal any clear pattern of their distribution among or within linkage groups (Figure 2.10). Of the small secreted proteins,

101 were identified as lineage specific to R. vesiculosus by referring to the FastOrtho output.

SNP analysis and correlation with SSG and TE:

95629 SNPs were identified with VarScan after a Bonferroni correction using a p- value of 10-4. The combined sliding window plots revealed a strong trend of correlation

63

between transposable elements and SNPs (Figure 2.11, Figure S-2.2), and the SNPs density within TEs was found to be 8 times higher than the SNP density in genes (Table

2.5). The TEs also displayed a disproportionately higher proportion of transitions relative to transversions (5 fold) when compared to genes (3 fold). Using Fishers exact test this difference was found to be statistically significant (p-value = 0.00017) (Table S-2.2).

Furthermore, different classes of genes displayed different SNP densities and transition to transversion ratios (Table 2.5). Notably, the SNP density in the genes with orthologs to

Suillineae was half of the overall SNP density of all genes. On the other hand, SNP density was 2 times higher for SSPs, and 4 times higher for species specific genes when compared to the average for all genes.

DISCUSSION

Evaluating the PGA Hi-C Assembly:

Several studies have demonstrated that PGA Hi-C assemblies are able to accurately scaffold contigs into chromosomes using animal genomes (Peichel et al. 2016,

Bickhart et al. 2016, Burton et al. 2013). However, the scaffolding is affected by the reference assembly that is provided (Peichel et al. 2016). In the case of R. vesiculosus, where no prior knowledge about the chromosome number is known, the initial assessment of correctness and accuracy is largely confined to visual evaluation of the intracluster linkage distribution and postclustering heatmaps (Figure 2.1). The Hi-C assembly of the Illumina scaffolds is guided by algorithms that seek to minimize the average distance from the main diagonal to Hi-C linkage pairs, and maximize the amount

64

of contact along the main diagonal Hi-C. However, it should be kept in mind that Hi-C linkage pairs are ultimately a measure of chromatin conformation, and the signal from them reflect the 3D structure of the chromatin. When this 3D data is projected onto a 1D genome sequence, it therefore is not uncommon for small regions to exhibit strong Hi-C linkage off the main diagonal simply due to the way chromatin interacts with itself in vivo.

The predicted chromosome number of R. vesiculosus (n=10) falls within the range of Agaricomycetes where the chromosome number ranges from 7-13 (Stajich et al. 2010,

Larraya et al. 1999, Martinez et al. 2004, Foulongne-Oriol et al. 2016). Compared to C. cinerea and P. ostreatus, where the linkage groups measure from 0.98-4.7 Mb (Larraya et al. 1999, Stajich et al. 2010), linkage group 0 in the Hi-C assembly deviates significantly, and is twice the length of the average in R. vesiculosus. Moreover, linkage group 0 displays a different composition with a proportionally smaller load of transposable elements (Figure 2.5), but a much higher amount of low complexity repeats (Table 2.2), most of which are concentrated on a region near the predicted centromere (Figure S-2.2).

The high density region composed of low complexity DNA could be interpreted as part of the centromere, which would suggest that the different linkage groups have different types of centromeres since none of the other linkages contain such regions. Alternatively, it is possible that the large size is partially an artifact stemming from over-assembling the linkage group due to complex intranuclear interactions.

Centromeres and Telomeres:

Like TEs, centromeres are difficult to characterize from short read assemblies on account of their frequently repetitive nature. Furthermore, they are difficult to identify

65

because they show little conservation in sequence identity, but rather are defined by their associated protein complexes (Smith et al. 2012). The point centromeres of many yeast species have been well characterized, but the nature of the centromeres is less well understood for filamentous fungi (Smith et al. 2012). Stajich et al. (2010) showed that the centromeres of C. cinerea mostly consisted of degenereate islands of transposable elements. A similar clear pattern of centromere and TE elements were not observed in R. vesiculosus, but TE enrichment around centromeres can be discerned in linkage group 2,

3, 5, 7, and 8 (Figure S-2.2). Moreover, TE content in C. cinerea is 2.5 %, and the higher

TE content of R. vesiculosus (11%) may confound observation of any simple pattern.

Yeast centromeres are known to cluster together around the spindle pole body in the nucleus (Mizuguchi et al. 2014) and as a consequence, generally exhibit very strong Hi-C links to each other (Varoquaux et al. 2015). We have shown here that the same strategy may be applied to the successful identification of regional centromeres in a filamentous fungus. Centromere associated proteins were not found on all centromeres (Table S-2.1), which may indicate that some of these genes were missed by the MAKER annotation, or suggest inaccuracies in the identified centromere boundaries, though it should be stressed that the current understanding of centromere protein complexes in fungi is incomplete

(Smith et al. 2012).

Like centromeres, telomeres tend to interact fairly strongly with each other in Hi-

C, and typically cluster (Shawn Sullivan pers. communication). Yet, no fungal telomeric repeats were detected in the Hi-C assembly, which is likely an artifact of the telomeres not being scaffolded in the Illumina reference assembly. Indeed, Schwartz and Farman

(2010) showed that telomeric sequences are frequently absent from fungal genome

66

databases. The authors suggest that the DNA shearing and size selection processes in library construction for short read shotgun sequencing play a role in disfavoring short telomeric fragments, and in addition point to the challenge of correctly assembling highly repetitive sequences. The fact that the telomeric sequence repeats were much more abundant in the raw reads of R. vesiculosus lends some support to this. Nevertheless, the overall repetitive content appears to increase towards the terminal ends in linkage group 1,

2, 3, and 6 (Figure S-2.2). It is possible that these trends are the result of recombination rates increasing towards subtelomeric ends, as was demonstrated in C. cinerea by Stajich et al. (2010).

Repetitive content and Transposable Elements:

It is known that shotgun based short read sequencing methods like Illumina do not assemble repetitive elements well (Hess et al. 2014). In addition, detection and classification of repetitive elements including TEs remain challenging due to the presence of fragmented, mosaic, and sometimes nested copies (Lerat 2010). Nevertheless, through adapting multiple strategies and repeating runs with different parameters while referencing to well annotated TE libraries of Amanita spp., Pleurotus ostreatus, and

Serpula lacrymans, we were able to classify a significantly larger portion of TEs than the sequences for fungi in Repbase alone could annotate. This demonstrates the importance of employing a combined approach for characterization and annotation of TE content in new genomes. Lerat (2010) showed that predictions are sensitive to the input parameters in the software used for the identification of TE, including RepeatScout and LTR-harvest, which were used in this study. A common issue is false positives, which likely were present when the genome was first masked with RepeatMasker and subsequently

67

annotated with MAKER as a number of veritable gene models were masked out as repeats instead of being identified as genes. This may partially explain the reduced number of orthologous clusters resulting from a FastOrtho analysis using gene models found in the masked assembly as compared to the unmasked assembly (Figure 2.7, Figure

S-2.5), and also the lower CEGMA score on the masked genome (Table 2.1).

The estimated repeat content of 16.7 % in R. vesiculosus is comparable to what has been found in many filamentous asco- and basidiomycetes (Castanera et al. 2016).

TE content tends to be high in EM genomes (Martin et al. 2008, Martin et al. 2010, Hess et al. 2014, Quandt et al. 2015), though the cause behind this trend is unknown. One popular theory invoke parallels from fungal plant pathogens where higher levels of TEs are also observed (Grandaubert et al 2014), and are thought to aid in their adaptation to new hosts through the increased genome plasticity they confer (Gladieux et al. 2014).

However, host selective pressures and host jumps are reduced and less common among mycorrhizal fungi compared to plant pathogens (Gladieux et al. 2014). Another explanation include small effective population sizes with decreased efficiency of purifying selection which may allow for more buildup of TE content in genomes

(Stukenbrock and Croll 2014). SNP comparisons to R. vinicolor revealed reduced levels of heterozygosity in R. vesiculosus (Mujic 2015), and population studies have found that

R. vesiculosus is less genetically diverse and display patterns of inbreeding (Kretzer et al.

2005, Dunham et al. 2013). These findings suggest that R. vesiculosus has recently been through a bottleneck, or reproduce clonally, though the latter explanation is unlikely as this would require a homothallic mating system and asexual reproduction is unknown for the genus. Comparisons to other Rhizopogon species along with Suillus species, which to

68

a greater extent are wind dispersed and thereby are expected to have bigger population sizes (Douhan et al. 2011), would be needed to establish a correlation between effective population size and TE content in this group.

Mating Loci and the Nucleolar Organizer Region:

The identification of the A and B mating loci on separate linkage groups demonstrates that R. vesiculosus has a bifactorial organization of the mating loci and possesses the fundamental genetic architecture to support a tetrapolar mating system. The

Hi-C assembly was able to resolve the assembly of the A-locus where the HD2 protein was previously split across two scaffolds in the Illumina assembly. The HD2 gene model of the Hi-C contained a ~200 bp insert region bridging the HD2 containing scaffolds in the Illumina assembly, and is an example of how gene models can be improved through the long contiguity information in Hi-C. Only 4 pheromone and 4 pheromone precursor genes could be unambiguously identified, which supports (Mujic 2015) proposition that the genetic diversity appears to be reduced at the B-locus compared to tetrapolar species in Boletales. The long contiguity information revealed up to 3 potential isoprenyl cysteine methyltransferase genes (ICMT) within 50 Kb upstream of the B-locus, 2 of which were not detected by (Mujic 2015) due to the fragmented nature of the Illumina assembly. The ICMT genes may act in the post translational modification of pheromones

(Caldwell et al. 1995), and Mujic (2015) speculated that the copy number of these on the

B-locus may affect the mating behavior of the fungus. Experimental studies would be needed to verify what mating behavior (homothallic/bipolar/tetrapolar) the bifactorial architecture of R. vesiculosus encodes.

69

rDNA tend to be present in multiple copies (~400 in humans) which are generally located together in several clusters (Boisvert et al. 2007). The fact that blastn queries only identified single copies that contained 130X times the read depth of selected single copy genes, suggest that the rDNA copies were collapsed into a single assembly and the actual copy number is approximately 130. It is possible that the tendency of rDNA to form expanded chromosomal loops with the nucleolus interfered with the ability of Hi-C to capture the contiguity in these regions. Furthermore, this would suggest that the 43 Mb assembly is an underestimation of the genome size. Hess et al. (2014) demonstrated that in some cases a large number of TEs present in the raw reads do not get assembled (up to

500%). If this is the case for R. vesiculosus, the true genome size could be significantly larger.

Presence and distribution of orthologous clusters:

The relatively low number of genes shared between all three Rhizopogon species in both of the FastOrtho analyses (Figure 2.7 and Figure S-2.5) as compared to orthologous clusters in the two Suillus species were unexpected. Species in genus

Rhizopogon and Suillus differ remarkably in fruit body development and morphology, if the relatively low number of orthologous genes unique to Rhizopogon is correct, it could suggest that the loss of epigeous mushroom production in Rhizopogon is in part mediated by regulatory elements as opposed to distinct genes. The larger number of clusters in R. salebrosus as well as clusters shared between S. brevipes and S. suillus is not surprising considering the large genome size of R. salebrosus and the greater number of gene models present in R. salebrosus and the Suillus species.

70

The distribution pattern of distinct clusters of species specific R. vesiculosus genes outside conserved block may indicate what has been called a “two-speed genome”.

The term is used to describe genomes consisting of rapidly evolving gene sparse regions abundant with transposable elements, which are dispersed among highly conserved and slowly evolving blocks in the genome, and has been implied for several plant pathogenic fungi (Gladieux et al. 2014). The species specific genes in R. vesiculosus possessed a higher SNP density than genes shared with Suillus (Table 2.5), an observation consistent with species specific genes evolving more rapidly than orthologs shared across taxa. The plotting of biological processes associated with these gene models revealed a different composition between the species specific and shared genes and revealed disproportionate enrichments for transmembrane transporters, protein metabolic processes, oxidation- reduction processes, and cellular nitrogen compound biosynthetic processes (Figure S-

2.4).

The mapping of SNPs revealed a strong correlation with TEs, and the covariance of SNPs and TEs could be the product of rapidly evolving regions in these areas (=higher mutation rate), but we suggest that these were deliberately induced by R. vesiculosus through a defense mechanism as the transition rate (C < > T) is significantly higher within TEs versus gene regions (Table 2.4). Within Basidiomycota, RIP has only been detected in (Horns et al. 2012) and our genome mining did not reveal any obvious homologs to rid or masc1. Castanera et al. (2016) found that genes in TE clusters tend to be repressed in P. ostratus, and the authors hypothesized that the phenomenon is caused by epigenetic defense mechanism aimed at suppressing TE expression and controlling their proliferation. Additional analyses by Castanera et al.

71

(2016) revealed that this trend was only present in species with an active cytosine methylation machinery. We did detect homologs to the MTase family of methyltransferase, but further experimentation is needed to define any potential role in genome defense. Regardless of the mechanisms, these data are consistent with a considerable fraction of variation between haplotypes of a dikaryon being a function of transposable elements, as 14% SNPs mapped to TEs (Table 2.5).

Secretome and Small Secreted Proteins:

The size of the secretome and number of SSPs were comparable to other EM fungi (Pellegrin et al. 2015), and a similarly high amount of lineage specificity in these was also found for Laccaria (Kohler et al. 2015). It is possible that part of the genetic machinery responsible for the host specificity to Pseudotsuga in R. subgenus Villosuli lies within the 27 SSPs that are shared with R. vinicolor, but further characterization of them along with functional experiments is needed. Meanwhile, only 5 SSPs were conserved for Rhizopogon, which highlight the plasticity in these effectors. SSPs did not exhibit any obvious spatial patterning within or among linkage groups, as they were found equally abundant across all linkage groups and spatially dispersed within a given linkage group, i.e., they were, contrary to expectation, not concentrated in subtelomeric regions (Figure 2.10). SSPs did, however, show a higher SNP density than other gene regions (Table 2.5) suggesting that they are characterized by higher rates of mutation or occupy regions of linkage groups that are more prone to mutation.

Conclusion:

The Hi-C assembly has presented an unprecedented opportunity to improve the assembly of R. vesiculosus and advance our understanding of genome organization in the

72

Boletales. 10 linkage groups were identified, which likely represent chromosomes, and 8 of these displayed clear centromere patterns. The A- and B-locus were found to reside on different linkage groups, which supports a bifactorial organization of the mating loci, although the functional nature of B-locus, i.e., bipolar vs. tetrapolar, requires experimentation. We examined the genome organization with regards to repetitive content, along with lineage specific and shared genes. In mapping the shared and lineage specific genes together it was revealed that enriched lineage specific regions exist across all linkage groups. These showed some correlation with TEs and SNPs. A strong correlation between TE and SNPs was revealed with TEs containing an overabundance of transitions. It is possible that this pattern is generated by a genome defense mechanism, in which case this will be the first documented case in subphylum . SSPs were characterized for the first time for a species in Rhizopogon, and revealed a high proportion of lineage specific genes comparable to what have been found for EM fungi in other lineages. Further work will be needed to dissect the function of the SSPs and determine which ones are responsible for the host specificity to Pseudotsuga. SNPs analyses of the dikaryon are consistent with what has been termed a “two-speed” genome.

SNPs were increasingly abundant in SSPs, species specific genes, and TEs as compared to orthologs shared with taxa in the genus Suillus. These findings paint a complex picture of the Rhizopogon genome as containing regions that are enriched for TEs, SSPs, and species specific genes which are more variable than regions that are enriched for orthologs shared across taxa.

73

REFERENCES

Amselem, J., Lebrun, M.-H. & Quesneville, H. 2015. Whole genome comparative analysis of transposable elements provides new insight into mechanisms of their inactivation in fungal genomes. BMC Genomics, 16, 1-14. Bao, W., Kojima, K. K. & Kohany, O. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA, 6, 1-6. Beiler, K. J., Simard, S. W., Lemay, V. & Durall, D. M. 2012. Vertical partitioning between sister species of Rhizopogon fungi on mesic and xeric sites in an interior Douglas-fir forest. Mol Ecol, 21, 6163-74. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. 2005. GenBank. Nucleic Acids Research, 33, D34-D38. Besemer, J. & Borodovsky, M. 2005. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Research, 33, W451-W454. Bickhart, D. M., Rosen, B. D., Koren, S., Sayre, B. L., Hastie, A. R., Chan, S., Lee, J., Lam, E. T., Liachko, I., Sullivan, S. T., Burton, J. N., Huson, H. J., Kelley, C. M., Hutchison, J. L., Zhou, Y., Sun, J., Crisa, A., Ponce De Leon, F. A., Schwartz, J. C., Hammond, J. A., Waldbieser, G. C., Schroeder, S. G., Liu, G. E., Dunham, M. J., Shendure, J., Sonstegard, T. S., Phillippy, A. M., Van Tassell, C. P. & Smith, T. P. L. 2016. Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes. bioRxiv. Biémont, C. 2010. A Brief History of the Status of Transposable Elements: From Junk DNA to Major Players in Evolution. Genetics, 186, 1085-1093. Binder, M., Justo, A., Riley, R., Salamov, A., Lopez-Giraldez, F., Sjökvist, E., Copeland, A., Foster, B., Sun, H., Larsson, E., Larsson, K.-H., Townsend, J., Grigoriev, I. V. & Hibbett, D. S. 2013. Phylogenetic and phylogenomic overview of the Polyporales. Mycologia, 105, 1350-1373. Boisvert, F.-M., Van Koningsbruggen, S., Navascues, J. & Lamond, A. I. 2007. The multifunctional nucleolus. Nat Rev Mol Cell Biol, 8, 574-585. Branco, S., Gladieux, P., Ellison, C. E., Kuo, A., Labutti, K., Lipzen, A., Grigoriev, I. V., Liao, H.-L., Vilgalys, R., Peay, K. G., Taylor, J. W. & Bruns, T. D. 2015. Genetic isolation between two recently diverged populations of a symbiotic fungus. Molecular Ecology, 24, 2747-2758. Britton-Davidian, J., Cazaux, B. & Catalan, J. 2012. Chromosomal dynamics of nucleolar organizer regions (NORs) in the house mouse: micro-evolutionary insights. Heredity, 108, 68-74. Brundrett, M. C. 2002. Coevolution of roots and mycorrhizas of land plants. New Phytologist, 154, 275-304. Bruns, T. D., Peay, K. G., Boynton, P. J., Grubisha, L. C., Hynson, N. A., Nguyen, N. H. & Rosenstock, N. P. 2009. Inoculum potential of Rhizopogon spores increases with time over the first 4 yr of a 99-yr spore burial experiment. New Phytologist, 181, 463-470. Burton, J. N., Adey, A., Patwardhan, R. P., Qiu, R., Kitzman, J. O. & Shendure, J. 2013. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotech, 31, 1119-1125.

74

Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. 2014. Species-Level Deconvolution of Metagenome Assemblies with Hi-C-Based Contact Probability Maps. G3: Genes|Genomes|Genetics. Caldwell, G. A., Naider, F. & Becker, J. M. 1995. Fungal lipopeptide mating pheromones: a model system for the study of protein prenylation. Microbiological Reviews, 59, 406-422. Casselton, L. A. 2002. Mate recognition in fungi. Heredity (Edinb), 88, 142-7. Castanera, R., López-Varas, L., Borgognone, A., Labutti, K., Lapidus, A., Schmutz, J., Grimwood, J., Pérez, G., Pisabarro, A. G., Grigoriev, I. V., Stajich, J. E. & Ramírez, L. 2016. Transposable Elements versus the Fungal Genome: Impact on Whole-Genome Architecture and Transcriptional Profiles. PLoS Genet, 12, e1006108. Cohn, M., Liti, G., & Barton, D. B. 2006. Telomeres in fungi. In: PISKUR, P. S. J. (ed.) Comparative genomics using fungi as models Springer. Conesa, A., Götz, S., García-Gómez, J. M., Terol, J., Talón, M. & Robles, M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674-3676. Consortium, T. U. 2015. UniProt: a hub for protein information. Nucleic Acids Research, 43, D204-D212. Dunham, S. M., Mujic, A. B., Spatafora, J. W. & Kretzer, A. M. 2013. Within-population genetic structure differs between two sympatric sister-species of ectomycorrhizal fungi, Rhizopogon vinicolor and R. vesiculosus. Mycologia, 105, 814-26. Edgar, R. C. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461. Ellinghaus, D., Kurtz, S. & Willhoeft, U. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9, 18-18. Ellison, C. E., Hall, C., Kowbel, D., Welch, J., Brem, R. B., Glass, N. L. & Taylor, J. W. 2011. Population genomics and local adaptation in wild isolates of a model microbial eukaryote. Proceedings of the National Academy of Sciences, 108, 2831-2836. Emanuelsson, O., Nielsen, H., Brunak, S. & Von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol, 300, 1005-16. Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Heger, A., Hetherington, K., Holm, L., Mistry, J., Sonnhammer, E. L. L., Tate, J. & Punta, M. 2014. Pfam: the protein families database. Nucleic Acids Research, 42, D222-D230. Flot, J. F., Marie-Nelly, H. & Koszul, R. 2015. Contact genomics: scaffolding and phasing (meta)genomes using chromosome 3D physical signatures. Febs Letters, 589, 2966-2974. Foulongne-Oriol, M., Rocha De Brito, M., Cabannes, D., Clément, A., Spataro, C., Moinard, M., Dias, E. S., Callac, P. & Savoie, J.-M. 2016. The Genetic Linkage Map of the Medicinal Mushroom Agaricus subrufescens Reveals Highly

75

Conserved Macrosynteny with the Congeneric Species Agaricus bisporus. G3: Genes|Genomes|Genetics, 6, 1217-1226. Galagan, J. E., Henn, M. R., Ma, L. J., Cuomo, C. A. & Birren, B. 2005. Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res, 15, 1620-31. Gladieux, P., Ropars, J., Badouin, H., Branca, A., Aguileta, G., De Vienne, D. M., De La Vega, R. C. R., Branco, S. & Giraud, T. 2014. Fungal evolutionary genomics provides insight into the mechanisms of adaptive divergence in eukaryotes. Molecular Ecology, 23, 753-773. Grandaubert, J., Rouxel T., Balesdent, M., 2014. Evolutionary and Adaptive Role of Transposable Elements in Fungal Genomes. Advances in Botanical Research, 70. Grigoriev, I. V., Nikitin, R., Haridas, S., Kuo, A., Ohm, R., Otillar, R., Riley, R., Salamov, A., Zhao, X., Korzeniewski, F., Smirnova, T., Nordberg, H., Dubchak, I. & Shabalov, I. 2014. MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res, 42, D699-704. Grubisha, L. C., Trappe, J. M., Molina, R. & Spatafora, J. W. 2002. Biology of the ectomycorrhizal genus Rhizopogon. VI. Re-examination of infrageneric relationships inferred from phylogenetic analyses of ITS sequences. Mycologia, 94, 607-19. Hertweck, K. L. 2013. Assembly and comparative analysis of transposable elements from low coverage genomic sequence data in Asparagales. Genome, 56, 487-494. Hess, J. & Pringle, A. 2014. The Natural Histories of Species and Their Genomes: Asymbiotic and Ectomycorrhizal Amanita Fungi. FUNGI, 70, 235-257. Hess, J., Skrede, I., Wolfe, B. E., Labutti, K., Ohm, R. A., Grigoriev, I. V. & Pringle, A. 2014. Transposable element dynamics among asymbiotic and ectomycorrhizal Amanita fungi. Genome biology and evolution, 6, 1564-1578. Holt, C. & Yandell, M. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics, 12, 1-14. Horns, F., Petit, E., Yockteng, R. & Hood, M. E. 2012. Patterns of repeat-induced point mutation in transposable elements of basidiomycete fungi. Genome Biol Evol, 4. Horton, P., Park, K.-J., Obayashi, T., Fujita, N., Harada, H., Adams-Collier, C. J. & Nakai, K. 2007. WoLF PSORT: protein localization predictor. Nucleic Acids Research, 35, W585-W587. James, T. Y. 2007. Analysis of mating-type locus organization and synteny in mushroom fungi- beyond model species. In J. Heitman, J. Kronstad, J. W. Taylor, and L. A. Casselton [eds.]. Sex in fungi: molecular determination and evolutionary implications. ASM Press, Washington, D. C., 317-331. Kapitonov, V. V. & Jurka, J. 2008. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet, 9, 411-412. Katoh, K., Asimenos, G. & Toh, H. Multiple Alignment of DNA Sequences with MAFFT. Methods in Molecular Biology, 537, 39-64. Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., Buxton, S., Cooper, A., Markowitz, S., Duran, C., Thierer, T., Ashton, B., Meintjes, P. & Drummond, A. 2012. Geneious Basic: an integrated and extendable desktop

76

software platform for the organization and analysis of sequence data. Bioinformatics, 28, 1647-9. Kennedy, P. G., Bergemann, S. E., Hortal, S. & Bruns, T. D. 2007. Determining the outcome of field-based competition between two Rhizopogon species using real- time PCR. Molecular Ecology, 16, 881-890. Kennedy, P. G. & Bruns, T. D. 2005. Priority effects determine the outcome of ectomycorrhizal competition between two Rhizopogon species colonizing Pinus muricata seedlings. New Phytologist, 166, 631-638. Kim, K.-T., Jeon, J., Choi, J., Cheong, K., Song, H., Choi, G., Kang, S. & Lee, Y.-H. 2016. Kingdom-wide analysis of fungal small secreted proteins (SSPs) reveals their potential role in host association. Frontiers in Plant Science, 7. Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., Mclellan, M. D., Lin, L., Miller, C. A., Mardis, E. R., Ding, L. & Wilson, R. K. 2012. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res, 22, 568-76. Kohler, A., Kuo, A., Nagy, L. G., Morin, E., Barry, K. W., Buscot, F., Canback, B., Choi, C., Cichocki, N., Clum, A., Colpaert, J., Copeland, A., Costa, M. D., Dore, J., Floudas, D., Gay, G., Girlanda, M., Henrissat, B., Herrmann, S., Hess, J., Hogberg, N., Johansson, T., Khouja, H.-R., Labutti, K., Lahrmann, U., Levasseur, A., Lindquist, E. A., Lipzen, A., Marmeisse, R., Martino, E., Murat, C., Ngan, C. Y., Nehls, U., Plett, J. M., Pringle, A., Ohm, R. A., Perotto, S., Peter, M., Riley, R., Rineau, F., Ruytinx, J., Salamov, A., Shah, F., Sun, H., Tarkka, M., Tritt, A., Veneault-Fourrey, C., Zuccaro, A., Mycorrhizal Genomics Initiative, C., Tunlid, A., Grigoriev, I. V., Hibbett, D. S. & Martin, F. 2015. Convergent losses of decay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nat Genet, 47, 410-415. Kretzer, A. M., Dunham, S., Molina, R. & Spatafora, J. W. 2005. Patterns of vegetative growth and gene flow in Rhizopogon vinicolor and R. vesiculosus (Boletales, Basidiomycota). Mol Ecol, 14, 2259-68. Kretzer, A. M., Molina, R. & Spatafora, J. W. 2000. Microsatellite markers for the ectomycorrhizal basidiomycete Rhizopogon vinicolor. Mol Ecol, 9, 1190-1. Krogh, A., Larsson, B., Von Heijne, G. & Sonnhammer, E. L. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305, 567-80. Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C. & Salzberg, S. L. 2004. Versatile and open software for comparing large genomes. Genome Biol, 5, R12. Langmead, B. & Salzberg, S. L. 2012. Fast gapped-read alignment with Bowtie 2. Nat Meth, 9, 357-359. Larraya, L. M., Pérez, G., Peñas, M. M., Baars, J. J. P., Mikosch, T. S. P., Pisabarro, A. G. & Ramírez, L. 1999. Molecular Karyotype of the White Rot Fungus Pleurotus ostreatus. Applied and Environmental Microbiology, 65, 3413-3417. Lerat, E. 2010. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity, 104, 520-533.

77

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. & Durbin, R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-9. Ma, L.-J., Van Der Does, H. C., Borkovich, K. A., Coleman, J. J., Daboussi, M.-J., Di Pietro, A., Dufresne, M., Freitag, M., Grabherr, M., Henrissat, B., Houterman, P. M., Kang, S., Shim, W.-B., Woloshuk, C., Xie, X., Xu, J.-R., Antoniw, J., Baker, S. E., Bluhm, B. H., Breakspear, A., Brown, D. W., Butchko, R. a. E., Chapman, S., Coulson, R., Coutinho, P. M., Danchin, E. G. J., Diener, A., Gale, L. R., Gardiner, D. M., Goff, S., Hammond-Kosack, K. E., Hilburn, K., Hua-Van, A., Jonkers, W., Kazan, K., Kodira, C. D., Koehrsen, M., Kumar, L., Lee, Y.-H., Li, L., Manners, J. M., Miranda-Saavedra, D., Mukherjee, M., Park, G., Park, J., Park, S.-Y., Proctor, R. H., Regev, A., Ruiz-Roldan, M. C., Sain, D., Sakthikumar, S., Sykes, S., Schwartz, D. C., Turgeon, B. G., Wapinski, I., Yoder, O., Young, S., Zeng, Q., Zhou, S., Galagan, J., Cuomo, C. A., Kistler, H. C. & Rep, M. 2010. Comparative genomics reveals mobile pathogenicity chromosomes in Fusarium. Nature, 464, 367-373. Martin, F., Aerts, A., Ahren, D., Brun, A., Danchin, E. G., Duchaussoy, F., Gibon, J., Kohler, A., Lindquist, E., Pereda, V., Salamov, A., Shapiro, H. J., Wuyts, J., Blaudez, D., Buee, M., Brokstein, P., Canback, B., Cohen, D., Courty, P. E., Coutinho, P. M., Delaruelle, C., Detter, J. C., Deveau, A., Difazio, S., Duplessis, S., Fraissinet-Tachet, L., Lucic, E., Frey-Klett, P., Fourrey, C., Feussner, I., Gay, G., Grimwood, J., Hoegger, P. J., Jain, P., Kilaru, S., Labbe, J., Lin, Y. C., Legue, V., Le Tacon, F., Marmeisse, R., Melayah, D., Montanini, B., Muratet, M., Nehls, U., Niculita-Hirzel, H., Oudot-Le Secq, M. P., Peter, M., Quesneville, H., Rajashekar, B., Reich, M., Rouhier, N., Schmutz, J., Yin, T., Chalot, M., Henrissat, B., Kues, U., Lucas, S., Van De Peer, Y., Podila, G. K., Polle, A., Pukkila, P. J., Richardson, P. M., Rouze, P., Sanders, I. R., Stajich, J. E., Tunlid, A., Tuskan, G. & Grigoriev, I. V. 2008. The genome of Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature, 452, 88-92. Martin, F., Kohler, A., Murat, C., Balestrini, R., Coutinho, P. M., Jaillon, O., Montanini, B., Morin, E., Noel, B., Percudani, R., Porcel, B., Rubini, A., Amicucci, A., Amselem, J., Anthouard, V., Arcioni, S., Artiguenave, F., Aury, J.-M., Ballario, P., Bolchi, A., Brenna, A., Brun, A., Buee, M., Cantarel, B., Chevalier, G., Couloux, A., Da Silva, C., Denoeud, F., Duplessis, S., Ghignone, S., Hilselberger, B., Iotti, M., Marcais, B., Mello, A., Miranda, M., Pacioni, G., Quesneville, H., Riccioni, C., Ruotolo, R., Splivallo, R., Stocchi, V., Tisserant, E., Viscomi, A. R., Zambonelli, A., Zampieri, E., Henrissat, B., Lebrun, M.-H., Paolocci, F., Bonfante, P., Ottonello, S. & Wincker, P. 2010. Perigord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis. Nature, 464, 1033- 1038. Martinez, D., Larrondo, L. F., Putnam, N., Gelpke, M. D., Huang, K., Chapman, J., Helfenbein, K. G., Ramaiya, P., Detter, J. C., Larimer, F., Coutinho, P. M., Henrissat, B., Berka, R., Cullen, D. & Rokhsar, D. 2004. Genome sequence of the lignocellulose degrading fungus Phanerochaete chrysosporium strain RP78. Nat Biotechnol, 22, 695-700.

78

Mizuguchi, T., Fudenberg, G., Mehta, S., Belton, J. M., Taneja, N., Folco, H. D., Fitzgerald, P., Dekker, J., Mirny, L., Barrowman, J. & Grewal, S. I. 2014. Cohesin-dependent globules and heterochromatin shape 3D genome architecture in S. pombe. Nature, 516, 432-5. Molina, R. & Trappe, J. M. 1994. Biology of the Ectomycorrhizal Genus Rhizopogon . I. Host Associations, Host-Specificity, Host-Specificity, and Pure Culture Syntheses. New Phytologist, 126, 653-675. Mujic, A. B. 2015. Symbiosis in the Pacific Ring of Fire: Evolutionary-Biology of Rhizopogon Subgenus Villosuli as Mutualists of Pseudotsuga. Doctoral Dissertation Oregon State University. Mujic, A. B., Durall, D. M., Spatafora, J. W. & Kennedy, P. G. 2016. Competitive avoidance not edaphic specialization drives vertical niche partitioning among sister species of ectomycorrhizal fungi. New Phytologist, 209, 1174-1183. Oliveros, J. C. 2015. Venny. An interactive tool for comparing lists with Venn's diagrams. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23, 1061-1067. Peichel, C. L., Sullivan, S. T., Liachko, I. & White, M. A. 2016. Improvement of the threespine stickleback (Gasterosteus aculeatus) genome using a Hi-C-based Proximity-Guided Assembly method. bioRxiv. Pellegrin, C., Morin, E., Martin, F. M. & Veneault-Fourrey, C. 2015. Comparative Analysis of Secretomes from Ectomycorrhizal Fungi with an Emphasis on Small- Secreted Proteins. Frontiers in Microbiology, 6, 1278. Petersen, T. N., Brunak, S., Von Heijne, G. & Nielsen, H. 2011. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Meth, 8, 785-786. Piriyapongsa, J., Rutledge, M. T., Patel, S., Borodovsky, M. & Jordan, I. K. 2007. Evaluating the protein coding potential of exonized transposable element sequences. Biology Direct, 2, 31-31. Price, A. L., Jones, N. C. & Pevzner, P. A. 2005. De novo identification of repeat families in large genomes. Bioinformatics, 21. Quandt, C. A., Kohler, A., Hesse, C. N., Sharpton, T. J., Martin, F. & Spatafora, J. W. 2015. Metagenome sequence of Elaphomyces granulatus from sporocarp tissue reveals Ascomycota ectomycorrhizal fingerprints of genome expansion and a Proteobacteria-rich microbiome. Environmental Microbiology, 17, 2952-2968. Rhoads, A. & Au, K. F. 2015. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics, 13, 278-89. Rossi, M. J. & Oliveira, V. L. 2011. Growth of the Ectomycorrhizal Fungus Pisolithus Microcarpus in different nutritional conditions. Brazilian Journal of Microbiology, 42, 624-632. Schwartz, S. L. & Farman, M. L. 2010. Systematic overrepresentation of DNA termini and underrepresentation of subterminal regions among sequencing templates prepared from hydrodynamically sheared linear DNA molecules. BMC Genomics, 11, 1-12. Selvaraj, S., J, R. D., Bansal, V. & Ren, B. 2013. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat Biotechnol, 31, 1111-8.

79

Smit, A. F. A., Hubley, R. & Green, P. 2015. RepeatMasker Open-4.0. Smith, K. M., Galazka, J. M., Phatale, P. A., Connolly, L. R. & Freitag, M. 2012. Centromeres of filamentous fungi. Chromosome research : an international journal on the molecular, supramolecular and evolutionary aspects of chromosome biology, 20, 635-656. Smith, S. E. & Read, D. 2008. Mycorrhizal Symbiosis. (Third Edition) London: Academic Press. Stajich, J. E., Wilke, S. K., Ahren, D., Au, C. H., Birren, B. W., Borodovsky, M., Burns, C., Canback, B., Casselton, L. A., Cheng, C. K., Deng, J., Dietrich, F. S., Fargo, D. C., Farman, M. L., Gathman, A. C., Goldberg, J., Guigo, R., Hoegger, P. J., Hooker, J. B., Huggins, A., James, T. Y., Kamada, T., Kilaru, S., Kodira, C., Kues, U., Kupfer, D., Kwan, H. S., Lomsadze, A., Li, W., Lilly, W. W., Ma, L. J., Mackey, A. J., Manning, G., Martin, F., Muraguchi, H., Natvig, D. O., Palmerini, H., Ramesh, M. A., Rehmeyer, C. J., Roe, B. A., Shenoy, N., Stanke, M., Ter- Hovhannisyan, V., Tunlid, A., Velagapudi, R., Vision, T. J., Zeng, Q., Zolan, M. E. & Pukkila, P. J. 2010. Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proc Natl Acad Sci U S A, 107, 11889-94. Stanke, M. & Morgenstern, B. 2005. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research, 33, W465-W467. Stukenbrock, E. H. & Croll, D. 2014. The evolving fungal genome. Fungal Biology Reviews, 28, 1-12. Taylor, J. W. & Berbee, M. L. 2014. Chapter 1: Fungi from PCR to Genomics: The Spreading Revolution in Evolutionary Biology. In: David J. McLaughlin, Joseph. W. Spatafora. (ed.) The Mycota Systematics and Evolution Part A. Berlin Heidelberg: Springer-Verlag. Team, R. C. 2016. R: A language and environment for statistical computing. Tedersoo, L., May, T. & Smith, M. 2010. Ectomycorrhizal lifestyle in fungi: global diversity, distribution, and evolution of phylogenetic lineages. Mycorrhiza, 20, 217-263. Twieg, B. D., Durall, D. M. & Simard, S. W. 2007. Ectomycorrhizal fungal succession in mixed temperate forests. New Phytol, 176, 437-47. Van Der Heijden, M. G., Martin, F. M., Selosse, M. A. & Sanders, I. R. 2015. Mycorrhizal ecology and evolution: the past, the present, and the future. New Phytol, 205, 1406-23. Van Kan, J. a. L., Stassen, J. H. M., Mosbach, A., Van Der Lee, T. a. J., Faino, L., Farmer, A. D., Papasotiriou, D. G., Zhou, S., Seidl, M. F., Cottam, E., Edel, D., Hahn, M., Schwartz, D. C., Dietrich, R. A., Widdison, S. & Scalliet, G. 2016. A gapless genome sequence of the fungus Botrytis cinerea. Molecular Plant Pathology, n/a-n/a. Varoquaux, N., Liachko, I., Ay, F., Burton, J. N., Shendure, J., Dunham, M. J., Vert, J. P. & Noble, W. S. 2015. Accurate identification of centromere locations in yeast genomes using Hi-C. Nucleic Acids Res, 43, 5331-9.

80

Wattam, A. R., Abraham, D., Dalay, O., Disz, T. L., Driscoll, T., Gabbard, J. L., Gillespie, J. J., Gough, R., Hix, D., Kenyon, R., Machi, D., Mao, C., Nordberg, E. K., Olson, R., Overbeek, R., Pusch, G. D., Shukla, M., Schulman, J., Stevens, R. L., Sullivan, D. E., Vonstein, V., Warren, A., Will, R., Wilson, M. J., Yoo, H. S., Zhang, C., Zhang, Y. & Sobral, B. W. 2014. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res, 42, D581-91. Wickham, H. 2009. gplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. Wisecaver, J. H., Slot, J. C. & Rokas, A. 2014. The Evolution of Fungal Metabolic Pathways. PLoS Genet, 10, e1004816.

81

Figure 2.1: Postlclustering heatmap (n=10). The strong diagonal signal represents intra- contig linearity while the x-like patterns are inter-linkage group interactions resulting from centromeric clustering.

82

Figure 2.2: Mummerplot of the Illumina assembly against the 10 linkage groups of the Hi-C assembly. The Illumina assembly is reference on the X-axis while the Hi-C assembly is given as query, on the y-axis. Note that the high number of scaffolds on the Illumina assembly are too dense to display on the x-axis. The red diagonal indicates a good match between the two assemblies. Blue denotes reverse matches, and there is some noise which can result from random misalignments in the nucmer algorithm, as well as unmatched content seen on the far right of the plot.

83

Figure 2.3: Classification of repetitive DNA and TE in linkage groups. A) shows distribution by main classes, while B) gives a more detailed breakdown of TE-elements by family.

84

Figure 2.4: Phylogeny of A) Gypsy elements, and B) Copia elements. Sequences were aligned with MAFFT and trees were constructed with the UPGMA Tamura-Nei option in Geneious. The sequences were color coded by linkage as displayed in the legend.

Figure 2.5: Main content in linkage groups.

85

Figure 2.6: The A- and B mating loci. A) The A-locus (red) is mapped within a region of ~50 kb. Blue denotes genes with good functional annotations, while genes were the function is unknown or questionable are marked in yellow. The MAKER gene models are shown in green in the track below. B) The B-locus is mapped to a ~50kb region using the same color codes as in A.

86

87

Figure 2.7: Venn diagram quantifying the distribution of orthologous clusters. Clusters were found by FastOrtho using the protein models from the masked genome.

Figure 2.8: Frequency of orthologous clusters in linkage group 4 plotted in 50 kb sliding windows. A) Color coded representation of different levels of orthologous clusters while B) visualizes the ratio of species specific genes to all genes. The red horizontal lines mark the 50% and 75% limit. The green dot points to the location of the centromeric region.

88

Figure 2.9: Distribution of regions enriched for species specific genes in all linkage groups (SSG=Speceis specific genes).

89

90

90

Figure 2.10: Distribution of small secreted proteins. A) mapped on linkage groups with color codes: Black=SSPs shared with Suillus.Blue=SSPs shared by all 3 Rhizopogon sp. Red=genes shared exclusively by R. vesiculosus and R. vinicolor. Yellow=R. vesiculosus species specific SSPs. B) SSPs binned by orthologous cluster or species specific identity.

Figure 2.11: A combined view of different genetic elements plotted in 50 kb windows in linkage group 4. Location of the centromere is shown in the top panel, and denoted with the dotted line for reference. 91

92

Table 2.1: Comparison Hi-C and Illumina assembly statistics and repeat content.

Illumina Hi-C

Size 43809644 43809644

Contigs 6700 2201

Contigs > 1000 3765 416

N50 raw 28895 3841056

N50 scaffold 388 4

Maker gene models 13623 138991/100672

Cegma 95.2% 96.41/94.82

Repetitive content 10.8 % 16.7 %

.

1 unmasked 2 masked

Table 2.2: Linkage group statistics

Link. group Length bp Gene Gene % % repeats Centromere Centromere gene Centromere gene

models density coding length models density

Link. group 0 10060592 3369 395 60% 8.8% 16352 5 306

Link. group 1 5228452 1674 320 48% 14.3% 32930 11 334

Link. group 2 3963460 1239 313 48% 15.0% 49033 18 367

Link. group 3 3841056 1228 320 45% 18.4% 46523 12 258

Link. group 4 3621133 1166 322 47% 17.5% 64958 17 262

Link. group 5 3407800 1043 306 40% 17.8% 36325 13 358

Link. group 6 3144919 1061 337 47% 14.4% 22373 9 402

Link. group 7 2996465 973 325 54% 13.4% 63694 22 345

Link. group 8 2848509 827 290 36% 19.9% 36808 12 326

Link. group 9 2847823 1005 353 52% 13.1% 52297 22 421

Average 4196021 1359 322 48% 15.2% 42129 14 338

93

Table 2.3A: Repeat number by linkage groups. Link. gr.=Linkage. group, Sim. repeat=Simple repeat, Low com.=Low complexity, Heli.=Helitron, Har.=Harbinger.

Link. All Sim. Low Gypsy Copia Ngaro LINE & Mariner hAT IS3EU MuDR Heli. EnSpm Har. Other Unknown gr. repeat repeat com. related & related 0 3907 42 236 100 346 40 97 253 49 18 22 31 2 17 503 2151 1 2677 32 1 100 312 38 77 199 19 9 6 22 4 16 327 1515 2 2530 36 1 78 244 26 77 136 39 5 8 15 0 28 261 1576 3 2835 61 0 84 468 41 74 188 66 4 8 38 2 24 187 1590 4 2249 21 0 44 216 119 44 152 30 9 24 18 0 25 289 1258 5 2694 47 0 79 392 45 97 210 36 7 14 24 3 15 195 1530 6 2042 18 0 31 290 38 36 127 27 6 5 33 11 11 152 1257 7 1501 15 0 49 193 24 31 109 25 8 7 16 1 15 120 888 8 2414 23 0 65 292 16 51 91 27 10 26 13 2 12 202 1584 9 1483 28 0 70 180 11 27 91 14 7 7 18 2 7 235 786 Total 26484 323 238 700 2933 398 611 1556 332 83 127 228 27 170 2471 14135

94

Table 2.3B: Repeat content in kb. Link. group=Linkage group, Sim. repeats=Simple repeats, Low com.=Low complexity, Heli.=Helitron, Har.=Harbinger.

Link. All simple low Gypsy Copia Ngaro LINE & Mariner hAT IS3EU MuDR Heli. EnSpm Har. Other Unknown gr. repeat repeat com. related & related 0 892.3 4.8 18.3 42.6 71.7 5.3 28.5 66.0 10.2 19.0 2.5 14.6 0.2 2.2 207.4 399.0 1 749.5 7.3 0 71.8 88.6 7.1 33.9 84.4 2.9 1.5 0.7 5.3 1.8 10.5 134.5 299.3 2 592.7 7.3 0 21.7 76.0 3.5 35.2 22.2 11.5 0.7 0.5 2.9 0.0 19.2 100.5 291.6 3 708.1 48.2 0 22.9 189.3 5.1 26.9 37.6 17.9 1.3 0.7 22.7 0.4 7.2 39.7 288.1 4 634.3 2.2 0 10.9 43.7 53.4 13.7 54.4 4.5 3.8 25.8 10.9 0.0 12.1 150.6 248.5 5 605.4 8.8 0 26.8 100.1 6.9 35.2 64.3 12.3 1.8 1.0 7.1 0.4 2.7 36.4 301.7 6 454.1 2.8 0 8.5 70.3 25.2 8.9 34.1 4.8 2.1 0.4 10.3 2.9 1.6 48.9 233.3 7 401.3 2.0 0 16.5 92.0 2.9 18.0 33.4 12.4 1.9 0.6 3.3 0.3 5.1 34.9 178.0 8 567.0 3.3 0 42.0 51.3 1.7 24.9 24.6 15.3 5.4 18.0 2.9 1.1 2.2 67.9 306.5 9 373.2 3.2 0 27.6 40.7 1.6 3.7 26.4 2.6 4.0 0.3 2.6 3.5 1.6 104.0 151.2 Total 5085.6 85.1 0.1 248.7 752.0 107.4 200.5 381.4 84.0 22.6 47.9 68.1 10.4 62.2 717.4 2298.1

95

96

Table 2.4: Summary SNP frequency and density.

All genes Suillineae Species Small Transposable shared specific secreted Elements othologs genes Proteins Transitions 7931 4963 2248 117 11358 Transversions 2326 946 828 50 2233 Total SNPs 10257 5909 3076 227 13591 Transitions 0.00047 0.000267 0.0020 0.0013 0.0042 per bp Transversions 0.00014 0.000066 0.000073 0.00036 0.00083 per bp Total SNPs 0.000613 0.00033 0.0027 0.0016 0.0050 per bp

97

Supplementary Figures and Tables

Figure S-2.1-A)

98

Figure S-2.1-B)

99

Figure S-2.1-C)

100

Figure S-2.1-D)

101

Figure S-2.1-E)

102

Figure S-2.1-F)

103

Figure S-2.1-G)

104

Figure S-2.1-H)

105

Figure S-2.1-I)

106

Figure S-2.2-A)

107

Figure S-2.2-B)

108

Figure S-2.2-C)

109

Figure S-2.2-D)

110

Figure S-2.2-E)

111

Figure S-2.2-F)

112

Figure S-2.2-G)

113

Figure S-2.2-H)

114

Figure S-2.2-I)

115

Figure S-2.3: Barplot showing number of repeats for each family in linkage groups.

116

117

Figure S-2.4: Pie diagrams showing the distribution biological processes associated with the shared (Fig 2.12 A) and species specific genes (Fig 2.12 B). Shared genes are defined as genes identified in orthologues clusters containing all 5 taxa

.

118

Figure S-2.5: Venn diagram quantifying the distribution of different orthologous clusters. The R. vesiculosus gene models used in this FasOrtho analysis were derived from the unmasked Hi-C assembly.

Table S-2.1: Selected gene models within the centromeres annotated with Blast2GO

linkage Protein ID Description e-Value sim #GO GO Names list group mean sc1 Rves056-HiC- hypothetical protein 0.00E+00 84% 5 F:nucleic acid binding; P:chromatin remodeling; F:metal unmasked_04330- K503DRAFT_767360 ion binding; C:cytosol; C:Ino80 complex RA sc2 Rves056-HiC- hypothetical protein 0.00E+00 87.60% 14 C:cell division site; C:Golgi apparatus; P:cell separation unmasked_05743- K503DRAFT_717551 after cytokinesis; F:SNARE binding; C:nucleolus; RA P:endocytosis; P:exocytosis; C:endosome; C:site of polarized growth; C:cell tip; P:vesicle docking; C:cytosol; P:early endosome to Golgi transport; P:endocytic recycling sc2 Rves056-HiC- ribosomal S8 1.10E-89 98.80% 3 F:structural constituent of ribosome; C:ribosome; unmasked_05747- P:translation RA sc2 Rves056-HiC- ribosomal S17e 1.20E-78 94.40% 3 F:structural constituent of ribosome; C:ribosome; unmasked_05748- P:translation RA sc4 Rves056-HiC- riboflavin synthase C:Ddb1-Wdr21 complex; F:FMN binding; P:transcription- unmasked_07963- domain coupled nucleotide-excision repair; P:regulation of RA chromatin silencing at silent mating-type cassette; P:regulation of chromatin silencing at centromere; C:nucleolus; P:protein ubiquitination involved in ubiquitin-dependent protein catabolic process; F:nucleic acid binding; F:iron ion binding; F:NADPH-hemoprotein reductase activity; C:cytosol; P:oxidation-reduction process; F:protein complex binding; C:Ddb1-Ckn1 complex

0.00E+00 89.20% 14 119

Cont. Table S-2.1 Rves056-HiC- U6 snRNA-associated 5.20E-38 95.40% 9 F:RNA binding; P:maturation of SSU-rRNA; C:U4/U6 x U5 Sc5 unmasked_09447- Sm LSm7 tri-snRNP complex; P:nuclear-transcribed mRNA RA catabolic process; C:cytosol; C:nucleolus; C:U6 snRNP; C:small nucleolar ribonucleoprotein complex; P:mRNA splicing, via spliceosome sc6 Rves056-HiC- P34-Arc-domain- 0.00E+00 94.60% 10 C:nucleus; C:cell division site; C:cell tip; F:structural unmasked_10080- containing protein molecule activity; P:Arp2/3 complex-mediated actin RA nucleation; C:cytosol; P:microtubule cytoskeleton organization; P:receptor-mediated endocytosis; C:actin cortical patch; C:Arp2/3 protein complex sc7 Rves056-HiC- hypothetical protein 4.40E-28 72.40% 4 P:chromosome organization; P:single-organism cellular unmasked_11432- CY34DRAFT_38566, process; P:DNA metabolic process; P:single-organism RA partial metabolic process sc7 Rves056-HiC- hypothetical protein 2.60E-77 82.20% 3 C:mediator complex; F:RNA polymerase II transcription unmasked_11438- K503DRAFT_767850 cofactor activity; P:regulation of transcription from RNA RA polymerase II promoter sc7 Rves056-HiC- DRMBL-domain- 0.00E+00 88.60% 1 P:DNA repair unmasked_11442- containing protein RA sc7 Rves056-HiC- ribosomal S4e 0.00E+00 96.80% 4 F:structural constituent of ribosome; C:ribosome; F:rRNA unmasked_11443- binding; P:translation RA sc7 Rves056-HiC- ATP-dependent RNA 0.00E+00 87.80% 4 F:nucleic acid binding; F:ATP binding; P:metabolic unmasked_11448- helicase dbp9 process; F:helicase activity RA sc7 Rves056-HiC- hypothetical protein 0.00E+00 85.60% 6 P:chromatin maintenance; P:chromatin silencing at unmasked_11451- K503DRAFT_736060 centromere; P:mitotic sister chromatid segregation; RA C:chromosome, centromeric region; P:CENP-A containing nucleosome assembly; C:nucleoplasm

120

Cont. Table S-2.1 sc7 Rves056-HiC- hypothetical protein 0.00E+00 86.80% 30 P:regulation of histone H2B conserved C-terminal lysine unmasked_11453- K503DRAFT_864025 ubiquitination; P:transcription elongation from RNA RA polymerase I promoter; P:rRNA processing; F:RNA polymerase II core binding; P:transcription elongation from RNA polymerase II promoter; P:negative regulation of transcription from RNA polymerase II promoter; P:positive regulation of transcription elongation from RNA polymerase II promoter; F:RNA polymerase II C- terminal domain phosphoserine binding; P:regulation of chromatin silencing at telomere; P:chromatin silencing at rDNA; P:regulation of transcription involved in G1/S transition of mitotic cell cycle; P:positive regulation of transcription elongation from RNA polymerase I promoter; P:histone modification; C:Cdc73/Paf1 complex; C:cytosol; P:chromatin organization involved in regulation of transcription; P:regulation of histone H3-K4 methylation; P:positive regulation of phosphorylation of RNA polymerase II C-terminal domain serine 2 residues; F:chromatin binding; P:DNA-templated transcription, termination; P:regulation of transcription-coupled nucleotide-excision repair; F:transcription factor activity, RNA polymerase II core promoter sequence-specific; P:snoRNA transcription from an RNA polymerase II promoter; P:negative regulation of DNA recombination; F:transcription factor activity, TFIIF-class transcription factor binding; C:transcriptionally active chromatin; P:mRNA 3'-end processing; P:global genome nucleotide- excision repair; P:positive regulation of histone H3-K36 trimethylation; P:snoRNA 3'-end processing sc8 Rves056-HiC- hypothetical protein 2.30E-92 82% 3 C:mediator complex; F:RNA polymerase II transcription unmasked_12184- K503DRAFT_737488 cofactor activity; P:regulation of transcription from RNA RA polymerase II promoter

121

Cont. Table S-2.1 sc8 Rves056-HiC- 60S ribosomal L19 1.30E-128 97.80% 3 F:structural constituent of ribosome; C:ribosome; unmasked_12187- P:translation RA sc8 Rves056-HiC- acyl- N- 0.00E+00 82% 7 C:chromosome, telomeric region; F:histone unmasked_12188- acyltransferase acetyltransferase activity; C:nuclear chromatin; P:histone RA exchange; P:histone acetylation; C:SAS acetyltransferase complex; P:chromatin silencing at telomere sc9 Rves056-HiC- 40S ribosomal S18 3.90E-124 98.20% 4 F:RNA binding; F:structural constituent of ribosome; unmasked_13185- C:ribosome; P:translation RA sc9 Rves056-HiC- ribosome biogenesis 0.00E+00 83% 2 C:nucleus; P:ribosome biogenesis unmasked_13189- YTM1 RA sc9 Rves056-HiC- Cytoplasmic 60S 0.00E+00 84% 2 F:nucleic acid binding; F:zinc ion binding unmasked_13190- subunit biogenesis RA factor sc9 Rves056-HiC- hypothetical protein 0.00E+00 88.40% 8 C:cellular bud neck; P:cellular bud site selection; unmasked_13191- K503DRAFT_861733 C:plasma membrane of growing cell tip; P:actin filament RA bundle organization; C:integral component of membrane; C:endoplasmic reticulum; P:positive regulation of establishment of bipolar cell polarity regulating cell shape; C:cellular bud scar sc9 Rves056-HiC- hypothetical protein 0.00E+00 82% 11 C:nucleus; F:DNA replication origin binding; P:negative unmasked_13198- K503DRAFT_679566 regulation of exit from mitosis; P:DNA replication RA initiation; F:zinc ion binding; F:protein serine/threonine kinase activator activity; P:positive regulation of meiosis I; C:Dbf4-dependent protein kinase complex; C:chromosome, centromeric region; P:positive regulation of protein serine/threonine kinase activity; P:premeiotic DNA replication sc9 Rves056-HiC- gamma-tubulin 0.00E+00 74.20% 6 C:microtubule organizing center; C:cytoplasm; C:spindle unmasked_13199- DGRIP91 SPC98 pole; P:positive regulation of microtubule nucleation;

RA component P:microtubule cytoskeleton organization; C:microtubule 122

Table S-2.2A: TE vs all genes Transitions per 1Mb Transversions per 1Mb TE 4202 826 Genes 474 139 p-value 0.0001715

Table S-2.2B: Species specific genes (SSG) vs all genes Transitions per 1Mb Transversions per 1Mb SSG 1990 733 Genes 474 139 p-value 0.03256

Table S-2.2C: Species specific genes vs Suillineae orthologs Transitions per 1Mb Transversions per 1Mb SSG 1990 733 Suillineae 267 66 p-value 0.005444

123

124

Chapter 3. Conclusion

125

The primary objective of this thesis project was to improve the short read shotgun assembly of a dikaryotic filamentous fungus with Hi-C sequencing, and to use these data to advance the analysis of the genome organization. I have presented here an improved assembly of Rhizopogon vesiculosus, representing the first chromosome level assembly of the Boletales, and a valuable addition to the short list of well assembled ectomycorrhizal (EM) genomes. The long contiguity information allowed for the elucidation of the genetic architecture of the mating loci in R. vesiculosus, and facilitated the detection of transposable elements (TE). Moreover, the proximity contact maps produced in Hi-C proved efficient for identifying the centromeres. The highly polymorphic nature of the mating loci, repetitive nature of

TEs and many centromeres, pose challenges for short read data, and tend to provide an incomplete picture. The chromosome level assembly we obtained through Hi-C allowed the mapping of different genetic elements to and along each linkage group.

The distribution patterns of TEs, orthologous clusters, species specific genes, and

SSPs was examined, and revealed trends which warranted further investigation and steered the project in new directions. Notably, the detection of islands within the linkage groups that were enriched for species-specific genes displaying higher densities of single nucleotide polymorphisms (SNP) as contrasted to blocks abound with genes conserved at the suborder level resonate with the notion of “two-speed genomes” where ecologically important genes are linked to gene sparse regions in the genome enriched for TEs (Gladieux et al. 2014). Confirmation that these regions of the genome are indeed evolving in a species-specific manner will require population level sampling and more sophisticated analyses of TE distribution, preferentially involving coalescence estimates. Notwithstanding the potential advantages associated

126

with genome plasticity for symbionts which must gain access to its host, a genome running wild with TE activity may pose deleterious risks, and a genome is subject to various selective forces that must be balanced. We identified elevated densities of

SNPs in TEs, and found the ratio of transitions/transversions in them to be significantly higher than in genes. It is possible that this pattern is caused by a defense silencing machinery in the genome such as Repeat Induced Point mutations (RIP), but this would need further analysis to verify. In spite of containing gaps and regions which were missed, such as the telomeres, the current assembly is a promising starting point to address and test for additional hypotheses, and may serve as reference for other short read Boletales genomes. The remaining discussion will go through ways to advance questions which were addressed in this thesis, and conclude with suggestions on studies I think could benefit our understanding of Rhizopogon and EM fungi at large.

There are several alternative methods which can verify the assembly of R. vesiculosus into 10 linkage groups including long read sequencing with PacBio, optical mapping, and pulse field electrophoresis. Unlike whole genome sequencing and mapping approaches, pulse field electrophoresis can provide an immediate assessment of both chromosome number and genome size, though the technique require the production of protoplasts, which has not been accomplished for EM species like Rhizopogon and may prove challenging. Analysis of Hi-C assemblies of animal and yeast genomes have shown that Hi-C is a robust method for correctly assembling genomes (Burton et al. 2014, Bickhart et al. 2016, Peichel et al. 2016), but more work should be directed to test whether this holds true for dikaryotic fungi, and determine to what extent the reference assembly given will affect the clustering.

PacBio has been shown to successfully assemble genomes to chromosomes (Van Kan

127

et al. 2016), and may be a preferred approach for assembly purposes, although Hi-C data facilitates additional analysis such as centromere calling. Indeed, many utilities enabled through Hi-C data were not applied in this study. The contact maps can be screened for specific regulatory interactions between sequences, and have also been used for successful haplotype assembly (Korbel and Lee 2013, Selvaraj et al. 2013).

Assembling the haplotype would be particularly conducive for understanding the nature of the dikaryon and levels of interspecies variation. It would also provide a finer scale of resolution to the patterns that were observed in this study and answer whether there is significant variation in TE content between haplotypes, guide the identification of the different alleles of the mating loci, and provide information on copy number variation in certain gene families like the pheromone/pheromone receptor genes at the B-locus. In vivo, haplotypes are obtained through the germination of basidiospores (meiospores). Kawai et al. (2008) succeeded in doing this for Rhizopogon rubescens by growing cultures in media that had contained pine seedlings. The development of such protocols for R. vesiculosus, which would likely involve Pseudotusga seedlings, and more EM fungi, may be beneficial.

Additional analyses to verify the TE annotation are also warranted. There is a variety of software that was not used in this study due to time constraints, but a more thorough search may reveal a larger proportion and also help to filter out false positives. This in turn, would be important for achieving a more precise gene calling in providing the gene annotation pipeline an optimally masked genome. It would be interesting to search for TEs in the raw reads, as Hess et al. (2014) showed that a large portion of the TE there did not get assembled, and this would be particularly relevant if additional analysis suggest a larger genome size.

128

The long contiguity information obtainable through Hi-C and long read sequencing is a desirable starting point for comparative genomic studies, especially as this information becomes available for closely related species pairs. A Hi-C assembly of R. vinicolor, the sympatric sister species to R. vesiculosus, could reveal structural difference in the karyotype, and potentially regions of introgression. Alteration in the karyotype has been invoked for speciation as it may interfere with chromosomal segregation (Ma et al. 2010). Higher level comparisons between related EM fungi and saprobes would expound on the observed trends of increased TE content that have been observed (Hess et al. 2014, Kohler et al. 2015), and add synteny and an organizational context to these comparisons. A number of genomes for saprobic

Boletales fungi are available, with extensive characterization in the case of Serpula lacrymans.

The phenomenon of host specificity has received a lot of attention in the case of fungal plant pathogens (Ma et al. 2010) but remains less characterized for EM fungi. Rhizopogon subgenus Villosuli is particularly interesting in this regard as it exhibits strict host specificity for Pseudotsuga spp. The answer to host specific interaction might partly be found among the 200 small secreted proteins that were identified in this study. That being said, the cellular and molecular mechanisms mediating EM are not completely understood, although Plett et al. (2014) have made significant advances in Laccaria bicolor. As a start, RNA expression data can be obtained through bioassays where Pseudotsuga seedling are inoculated with

Rhizopogon spores, and the expressional profiles in the resulting mycorrhizal tissue can be compared to nonmycorrhizal tissue (e.g., on rhizomorphs, in axenic culture, or from sporocarps). Increasing attention has been directed to the role bacteria may play in EM (van der Heijden et al. 2015), and their presence should be accounted for in the

129

study design and analysis. Furthermore, variations in growth conditions, and in the pairing of host and fungal populations as well as species may reveal fine-tuned aspects of the interaction. However, to fully understand the symbiosis, in vivo experimental manipulations are necessary. The potential of Rhizopogon as an experimental model organism is unknown, but likely limited as it does not fruit in culture and is reliant on a plant host. Nonetheless, it is possible that heterologous expression using C. cinerea will be informative. Moreover, the advancements in genome editing biotechnologies like CRISPR/Cas9 may make such manipulations easier in the future, and genetic manipulations may also elucidate the genes or mechanisms behind hypogeous fruit body development.

It should be possible to verify the mating system using bioassays. The mating loci, and in particular the B-locus, is hard to design reliable markers for, but the current contiguity from the Hi-C assembly can serve as a reference to map other strains in order to identify conserved regions which in turn can be used for improved marker design.

Recent advances in sequencing technologies present researchers today with vast amounts of data to analyze. Nonetheless, there is still a vital need for experimental studies to match the sequence data and verify the findings made in silico.

This study has focused on the genome of one isolate R. vesiculosus, but more population sampling of genomes will be needed to probe the variation in this species.

Additionally, population studies of Rhizopogon may better define species boundaries, inform of us population sizes, and reveal geographic/host aspects of adaptation. These studies may be compared to population studies in Suillus spp. to determine how much they differ in population size and dispersal limitation, which could be putatively linked to their differing fruit body morphology. The presence of different host

130

specificities within the two genera also presents opportunities for studying this phenomenon at the intergeneric level. Although the use of microsatellite markers are likely to be completely replaced by genome scale SNP data in the near future, a Hi-C assembly does represent a fitting resource for the design of these markers as it at immediately reveals whether they are linked or not.

“The map is not the territory” (Alfred Korzybski 1933), and likewise basepair order only constitute one aspect of a genome, still less of the living organism. Yet, the genomics revolution the past few decades has repeatedly shown that there is much that can be learned about the ecology, and evolutionary history and mechanisms in a species simply by peering into its genome, and that new, unexpected research questions are bound to present themselves as biotechnology continue to develop. For instance, advances in characterizing the epigenetic state using methods like ChIP-seq has the potential to provide a more complete picture of genome architecture and organization, and could be particularly interesting if combined with the 3D chromatin data in Hi-C.

131

REFERENCES

Bickhart, D. M., Rosen, B. D., Koren, S., Sayre, B. L., Hastie, A. R., Chan, S., Lee, J., Lam, E. T., Liachko, I., Sullivan, S. T., Burton, J. N., Huson, H. J., Kelley, C. M., Hutchison, J. L., Zhou, Y., Sun, J., Crisa, A., Ponce De Leon, F. A., Schwartz, J. C., Hammond, J. A., Waldbieser, G. C., Schroeder, S. G., Liu, G. E., Dunham, M. J., Shendure, J., Sonstegard, T. S., Phillippy, A. M., Van Tassell, C. P. & Smith, T. P. L. 2016. Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes. bioRxiv. Burton, J. N., Liachko, I., Dunham, M. J. & Shendure, J. 2014. Species-Level Deconvolution of Metagenome Assemblies with Hi-C-Based Contact Probability Maps. G3: Genes|Genomes|Genetics. Hess, J., Skrede, I., Wolfe, B. E., Labutti, K., Ohm, R. A., Grigoriev, I. V. & Pringle, A. 2014. Transposable element dynamics among asymbiotic and ectomycorrhizal Amanita fungi. Genome biology and evolution, 6, 1564-1578. Kawai, M., Yamahara, M. & Ohta, A. 2008. Bipolar incompatibility system of an ectomycorrhizal basidiomycete, Rhizopogon rubescens. Mycorrhiza, 18, 205- 10. Kohler, A., Kuo, A., Nagy, L. G., Morin, E., Barry, K. W., Buscot, F., Canback, B., Choi, C., Cichocki, N., Clum, A., Colpaert, J., Copeland, A., Costa, M. D., Dore, J., Floudas, D., Gay, G., Girlanda, M., Henrissat, B., Herrmann, S., Hess, J., Hogberg, N., Johansson, T., Khouja, H.-R., Labutti, K., Lahrmann, U., Levasseur, A., Lindquist, E. A., Lipzen, A., Marmeisse, R., Martino, E., Murat, C., Ngan, C. Y., Nehls, U., Plett, J. M., Pringle, A., Ohm, R. A., Perotto, S., Peter, M., Riley, R., Rineau, F., Ruytinx, J., Salamov, A., Shah, F., Sun, H., Tarkka, M., Tritt, A., Veneault-Fourrey, C., Zuccaro, A., Mycorrhizal Genomics Initiative, C., Tunlid, A., Grigoriev, I. V., Hibbett, D. S. & Martin, F. 2015. Convergent losses of decay mechanisms and rapid turnover of symbiosis genes in mycorrhizal mutualists. Nat Genet, 47, 410-415. Korbel, J. O. & Lee, C. 2013. Genome assembly and haplotyping with Hi-C. Nat Biotech, 31, 1099-1101. Ma, L.-J., Van Der Does, H. C., Borkovich, K. A., Coleman, J. J., Daboussi, M.-J., Di Pietro, A., Dufresne, M., Freitag, M., Grabherr, M., Henrissat, B., Houterman, P. M., Kang, S., Shim, W.-B., Woloshuk, C., Xie, X., Xu, J.-R., Antoniw, J., Baker, S. E., Bluhm, B. H., Breakspear, A., Brown, D. W., Butchko, R. a. E., Chapman, S., Coulson, R., Coutinho, P. M., Danchin, E. G. J., Diener, A., Gale, L. R., Gardiner, D. M., Goff, S., Hammond-Kosack, K. E., Hilburn, K., Hua-Van, A., Jonkers, W., Kazan, K., Kodira, C. D., Koehrsen, M., Kumar, L., Lee, Y.-H., Li, L., Manners, J. M., Miranda-Saavedra, D., Mukherjee, M., Park, G., Park, J., Park, S.-Y., Proctor, R. H., Regev, A., Ruiz-Roldan, M. C., Sain, D., Sakthikumar, S., Sykes, S., Schwartz, D. C., Turgeon, B. G., Wapinski, I., Yoder, O., Young, S., Zeng, Q., Zhou, S., Galagan, J., Cuomo, C. A., Kistler, H. C. & Rep, M. 2010. Comparative genomics reveals mobile pathogenicity chromosomes in Fusarium. Nature, 464, 367-373. Peichel, C. L., Sullivan, S. T., Liachko, I. & White, M. A. 2016. Improvement of the threespine stickleback (Gasterosteus aculeatus) genome using a Hi-C-based Proximity-Guided Assembly method. bioRxiv.

132

Plett, J. M., Daguerre, Y., Wittulsky, S., Vayssieres, A., Deveau, A., Melton, S. J., Kohler, A., Morrell-Falvey, J. L., Brun, A., Veneault-Fourrey, C. & Martin, F. 2014. Effector MiSSP7 of the mutualistic fungus Laccaria bicolor stabilizes the Populus JAZ6 protein and represses jasmonic acid (JA) responsive genes. Proc Natl Acad Sci U S A, 111, 8299-304. Selvaraj, S., J, R. D., Bansal, V. & Ren, B. 2013. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat Biotechnol, 31, 1111-8. Van Kan, J. a. L., Stassen, J. H. M., Mosbach, A., Van Der Lee, T. a. J., Faino, L., Farmer, A. D., Papasotiriou, D. G., Zhou, S., Seidl, M. F., Cottam, E., Edel, D., Hahn, M., Schwartz, D. C., Dietrich, R. A., Widdison, S. & Scalliet, G. 2016. A gapless genome sequence of the fungus Botrytis cinerea. Molecular Plant Pathology, n/a-n/a.

133

Appendix 1: Scripts written to facilitate data analysis

134

R Script 1: sliding_window_frequency_counter

####sliding window window_size <- 50000 #size of sliding window step <- 10000 #increment sliding window moves start <- 0 #start coordinate for window end <-window_size + start #end coordinate for window chr_end <- #length of scaffold rep_start <- #column containing start coordinates of each element use_frame <-seq(0,chr_end, step) #reference points listn <-c() for(i in use_frame){ if(end <= chr_end){ n <- 0 for(i in rep_start){ if(start <= i & i < end){ print(c(start, end, i)) n <- n +1 } } start <- start + step end <- end +step list0 <- c(listn, n) } } listn barplot(listn, ylab="frequency", xlab="", main="linkage group n")

135

R Script 2: snp_frequency_counter

freq_count <- function(start,end, snp_column){

list1 <-c()

n <- 0

for(i in snp_column){

if(start <= i & i <=end){

n <- n + 1

}

}

list1 <- c(list1, n)

return(list1)

}

136

Python Script 1: cysteine_length_counter

#!/local/cluster/bin/python2.7 from Bio import SeqIO from Bio.Seq import Seq import io import sys import re input_file_name = sys.argv[1] output = sys.argv[2] if len(sys.argv) < 3 : print('3 arguments required: 1: file to read, 2: output file') quit() fhandle_output = io.open(output, "wb") fhandle_output.write("ID" + '\t' "length" + '\t' + "cysteines" + '\n') for seq_record in SeqIO.parse(input_file_name, "fasta"): sequence = str(seq_record) cysteines = sequence.count("C") print(str(seq_record.id) + '\t' + str(len(seq_record)) + '\t' + str(cysteines)) fhandle_output.write(str(seq_record.id) + '\t' + str(len(seq_record)) + '\t' + str(cysteines) + '\n') fhandle_output.close() #close file handle of output file

137

Python Script 2: repeatmasker_annotator

#!/usr/bin/env python2.7 ###replaces annotations in repeatmasker.out given a blast output list import io import sys import re if len(sys.argv) <3 : print("inserts annotationis into tabular output of repeatmasker given a second file containing a sorted blast file" + "\n" + \ "repeatmasker_annotator.py ") quit() repeatmasker = sys.argv[1] id_list = sys.argv[2] output = sys.argv[3] fhandle_rep = io.open(repeatmasker, "rb") fhandle_id = io.open(id_list, "rb") fhandle_output = io.open(output, "wb") #create outfile my_list = list() anno_list = list() anno_dict = {} for line in fhandle_id: linestripped = line.strip() my_id_list = re.split(r"\s+", linestripped) my_id = my_id_list[0] my_anno = my_id_list[1] my_list.append(my_id) anno_list.append(my_anno) for i in range(len(my_list)): anno_dict[my_list[i]] = anno_list[i] for line in fhandle_rep: linestripped = line.strip() line_list = re.split(r"\s+", linestripped) id = line_list[9] status = line_list[10] str(id) j = "Unknown" if status == j: for i in anno_dict: if i == id: j = anno_dict.get(i) fhandle_output.write(id + "\t" + str(j)+ "\n") else: fhandle_output.write(id + "\t" + str(status) + "\n") fhandle_rep.close() fhandle_id.close(