bioRxiv preprint doi: https://doi.org/10.1101/2020.01.31.928846; this version posted February 1, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Research Article

2 The architecture of metabolism maximizes biosynthetic diversity in the largest class of

3 fungi

4 Authors:

5 Emile Gluck-Thaler, Department of Pathology, The Ohio State University, Columbus, OH, USA,

6 and Biological Sciences, University of Pittsburgh, Pittsburgh, PA, USA

7 Sajeet Haridas, US Department of Energy Joint Genome Institute, Lawrence Berkeley National

8 Laboratory, Berkeley, CA, USA

9 Manfred Binder, TechBase, R-Tech GmbH, Regensburg, Germany

10 Igor V. Grigoriev, US Department of Energy Joint Genome Institute, Lawrence Berkeley National

11 Laboratory, Berkeley, CA, USA, and Department of Plant and Microbial Biology, University of

12 California, Berkeley, CA

13 Pedro W. Crous, Westerdijk Fungal Biodiversity Institute, Uppsalalaan 8, 3584 CT Utrecht, The

14 Netherlands

15 Joseph W. Spatafora, Department of Botany and Plant Pathology, Oregon State University, OR, USA

16 Kathryn Bushley, Department of Plant and Microbial Biology, University of Minnesota, MN, USA

17 Jason C. Slot, Department of Plant Pathology, The Ohio State University, Columbus, OH, USA

18 corresponding author: [email protected]

19 Abstract:

20 Background - Ecological diversity in fungi is largely defined by metabolic traits, including the

21 ability to produce secondary or "specialized" metabolites (SMs) that mediate interactions with

22 other organisms. Fungal SM pathways are frequently encoded in biosynthetic gene clusters

23 (BGCs), which facilitate the identification and characterization of metabolic pathways. Variation

24 in BGC composition reflects the diversity of their SM products. Recent studies have documented

25 surprising diversity of BGC repertoires among isolates of the same fungal species, yet little is

26 known about how this population-level variation is inherited across macroevolutionary

27 timescales.

28 Results - Here, we applied a novel linkage-based algorithm to reveal previously unexplored

29 dimensions of diversity in BGC composition, distribution, and repertoire across 101 species of

30 Dothideomycetes, which are considered to be the most phylogenetically diverse class of fungi

31 and are known to produce many SMs. We predicted both complementary and overlapping sets of

32 clustered genes compared with existing methods and identified novel gene pairs that associate

33 with known secondary metabolite genes. We found that variation in BGC repertoires is due to

34 non-overlapping BGC combinations and that several BGCs have biased ecological distributions,

35 consistent with niche-specific selection. We observed that total BGC diversity scales linearly

36 with increasing repertoire size, suggesting that secondary metabolites have little structural

37 redundancy in individual fungi.

38 Conclusions - We project that there is substantial unsampled BGC diversity across specific

39 families of Dothideomycetes, which will provide a roadmap for future sampling efforts. Our

40 approach and findings lend new insight into how BGC diversity is generated and maintained

41 across an entire fungal taxonomic class.

42 Keywords:

43 chemical ecology; Fungi; metabolism; gene cluster

44

45 Background:

46 Plants, bacteria and fungi produce the majority of the earth's biochemical diversity. These

47 organisms produce a remarkable variety of secondary/specialized metabolites (SMs) that can

48 mediate ecological functions, including defense, resource acquisition, and mutualism. Standing

49 SM diversity is often high at the population level, which may affect the rates of adaptation over

50 microevolutionary timescales. For example, high intraspecific quantitative and qualitative

51 chemotype diversity in plants can enable rapid adaptation to local biotic factors (Agrawal,

52 Hastings et al. 2012; Züst, Heichinger et al. 2012; Glassmire, Jeffrey et al. 2016). However, the

53 fate of population-level chemodiversity across longer timescales is not well explored in plants or

54 other lineages. We therefore sought to identify how metabolic variation is distributed across

55 macroevolutionary timescales by profiling chemodiversity across a well-sampled taxonomic

56 class.

57 The Dothideomycetes, which originated between 247 and 459 million years ago

58 (Beimforde, Feldberg et al. 2014), comprise the largest and arguably most phylogenetically

59 diverse class of fungi. Currently, 19,000 species are recognized in 32 orders containing more

60 than 1,300 genera (Zhang, Crous et al. 2011). Dothideomycetes are divided into two major

61 subclasses, the Pleosporomycetidae (order Pleosporales) and Dothideomycetidae (orders

62 Dothideales, Capnodiales, and Myriangiales), which correspond to the presence or absence,

63 respectively, of pseudoparaphyses during development of the asci (Schoch, Crous et al. 2009).

64 Several other orders await definitive placement.

65 Dothideomycetes also display a large diversity of fungal lifestyles and ecologies. The

66 majority of Dothideomycetes are terrestrial and associate with phototrophic hosts as either

67 pathogens, saprobes, endophytes (Schoch, Crous et al. 2009), lichens (Nelsen, Lucking et al.

68 2011), or ectomycorrhizal symbionts (Spatafora, Owensby et al. 2012). Six orders contain plant

69 pathogens capable of infecting nearly every known crop species. The Pleosporales and

70 Capnodiales, in particular, are dominated by asexual plant pathogens that cause significant

71 economic losses and have been well sampled in previous genome sequencing efforts (Goodwin,

72 Ben M'Barek et al. 2011; Ohm, Feau et al. 2012; Oliver, Friesen et al. 2012; Condon, Leng et al.

73 2013; Manning, Pandelova et al. 2013). A single order (Jahnulales) contains aquatic, primarily

74 freshwater, species (Suetrong, Boonyuen et al. 2011). Other ecologies include human and animal

75 pathogens, including some taxa that can elicit allergies and asthma (Crameri, Garbani et al.

76 2014), and rock-inhabiting fungi (Ruibal, Gueidan et al. 2009).

77 This broad range of lifestyles is accompanied by extensive diversity of SMs, of which

78 very few have known ecological roles. The Dothideomycetes, and several other ascomycete

79 classes (Eurotiomycetes, Sordariomycetes, and Leotiomycetes), produce the greatest number and

80 diversity of SMs across the fungal kingdom (Spatafora and Bushley 2015; Akimitsu et al. 2014).

81 Economically important plant pathogens in the Pleosporales (Alternaria, Bipolaris, ,

82 , , and ), in particular, are known to produce host-

83 selective toxins that confer the ability to cause disease in specific plant hosts (Walton and

84 Panaccione 1993; Walton 1996; Wolpert, Dunkle et al. 2002; Ciuffetti, Manning et al. 2010;

85 Pandelova, Figueroa et al. 2012; Akimitsu, Tsuge et al. 2014). Other toxins first identified in

86 Pleosporales have roles in virulence, but are not pathogenicity determinants, including the PKS-

87 derived compounds depudecin (Wight, Kim et al. 2009) and solanapyrone (Kaur 1995).

88 Dothideomycetes are also known to produce bioactive metabolites shared with more

89 distantly related fungal classes. Sirodesmin, a virulence factor produced by Leptosphaeria

90 maculans, for example, belongs to the same class of epipolythiodioxopiperazine (ETP) toxins as

91 gliotoxin, an immunosuppressant produced by the eurotiomycete human pathogen Aspergillus

92 fumigatus (Gardiner, Waring et al. 2005; Patron, Waller et al. 2007). Dothistromin, a polyketide

93 metabolite produced by the pine pathogen Dothistroma septosporum, shares ancestry with

94 aflatoxin (Bradshaw, Slot et al. 2013), a mycotoxin produced by Aspergillus species that poses

95 serious human health and environmental risks worldwide (Horn 2003; Wang and Tang 2004).

96 A majority of bioactive metabolites in Dothideomycetes are small-molecule SMs that,

97 like those of other fungi, are frequently the products of biosynthetic gene clusters (BGCs)

98 composed of enzymes, transporters, and regulators that contribute to a common SM pathway.

99 Most of these BGCs are defined by four main classes of SM core signature enzymes: 1)

100 nonribosomal peptide synthetases (NRPS), 2) polyketide synthases (PKS), 3) terpene synthases

101 (TS), and 4) dimethylallyl tryptophan synthases (DMAT) (Hoffmeister & Keller 2007). Fungal

102 gene clusters are hotspots for genome evolution through gene duplication, loss, and horizontal

103 transfer, which recombine pathways and generate diversity (Wisecaver, Slot et al. 2014).

104 Additionally, recent studies have shown that gene clusters may evolve through recombination or

105 shuffling of modular subunits of syntenic genes (Lind, Wisecaver et al. 2017; Gluck-Thaler et al.

106 2018). Changes in BGC gene content often result in structural changes to the SM product(s), and

107 therefore BGCs can be used to monitor the evolution of chemodiversity (Lind, Wisecaver et al.

108 2017; Proctor, McCormick et al. 2018). The most widely used methods for detecting BGCs rely

109 on models of gene cluster composition based on putative functions in SM biosynthesis informed

110 by a phylogenetically limited set of taxa, but gene function-agnostic methods are being

111 developed (Slot and Gluck-Thaler 2019).

112 Here, we systematically assessed BGC richness and compositional diversity in the

113 genomes of 101 Dothideomycetes species, most of which were recently sequenced (Haridas et al. in press).

114 Using a newly benchmarked algorithm that identifies clustered genes of interest through the

115 frequency of their co-occurrence with and around signature biosynthetic genes, we identified

116 3399 putative BGCs, grouped into 719 unique cluster types, including 5 varieties of candidate

117 DHN melanin clusters. The conservation of specific gene pairs across BGC types suggests that

118 precise functional interactions contribute to the modular evolution of these loci. Numerous BGCs

119 have either over- or under-dispersed phylogenetic distributions, suggesting pathways have been

120 differentially impacted by selection. In comparisons across species, BGC repertoire diversity

121 increases linearly with repertoire size, reflecting a mode of metabolic evolution in these fungi

122 that is likely distinct from that of plants. We found little overlap in cluster repertoires among

123 genomes from different genera, and project that a wealth of unique BGCs remain to be

124 discovered within this fungal lineage.
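
The co-occurrence idea summarized above can be illustrated with a minimal sketch. It is not the CO-OCCUR implementation: the function names, the toy loci, the 95% quantile cut-off, and the way the null distribution is sampled (drawing observed homolog group pairs uniformly at random) are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def pair_counts(loci):
    """Count the loci in which each unordered pair of homolog groups co-occurs."""
    counts = Counter()
    for locus in loci:
        for pair in combinations(sorted(set(locus)), 2):
            counts[pair] += 1
    return counts


def unexpected_pairs(loci, replicates=10_000, quantile=0.95, seed=0):
    """Return (pairs above the null threshold, threshold).

    Null distribution (an assumption of this sketch): the co-occurrence counts
    of pairs drawn uniformly at random from all observed pairs.
    """
    counts = pair_counts(loci)
    pairs = list(counts)
    rng = random.Random(seed)
    null = sorted(counts[rng.choice(pairs)] for _ in range(replicates))
    threshold = null[min(int(quantile * replicates), replicates - 1)]
    hits = {pair: n for pair, n in counts.items() if n > threshold}
    return hits, threshold


# Toy data: four loci share PKS + HG1; forty unrelated loci supply background pairs.
loci = [["PKS", "HG1", f"acc{i}"] for i in range(4)]
loci += [[f"x{i}", f"y{i}"] for i in range(40)]
hits, threshold = unexpected_pairs(loci)
print(hits, threshold)
```

In this toy data set only the PKS/HG1 pair recurs across loci, so it is the only pair expected to exceed the sampled threshold.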

125 Results:

126 Dothideomycetes contain hundreds of distinct types of BGCs, a small fraction of which are

127 characterized.

128 Using a novel cluster detection approach based on shared syntenic relationships among

129 genes (CO-OCCUR, see Methods, Figure 1, Figure SA), we identified 332 gene homolog groups

130 (homolog groups) of interest (Table SA, Table SB) whose members were organized into 3399

131 candidate BGCs of at least two genes (Table SC) in 101 Dothideomycete genomes (Table SD),

132 representing an average of 33.7 BGCs per genome (SD= 15.4, Figure SG). We grouped BGCs

Figure 1. CO-OCCUR pipeline for identifying biosynthetic gene clusters (BGCs) using unexpected homolog group (HG) co-occurrences. Flowchart: 101 Dothideomycetes genomes with genes clustered into HGs; sampling of random pairs of co-occurring HGs (500,000 replicates) to build random distributions of HG co-occurrences; BGCs (3,399 total) grouped together when ~90% of their genes are in common (719 total), giving cluster groups (CGs; 422 total) and orphan clusters (OCs; 297 total).

133 into 719 unique cluster types based on a minimum gene content similarity of 90%; 422 cluster

134 types are part of homologous cluster groups (cluster groups) found in 2 or more genomes, and

135 297 are orphan clusters found in only one genome (Table SE). Of these, 345 cluster types (166

136 cluster groups and 179 orphan clusters) had 5 or more genes per BGC (Figure 2), and 459 cluster

137 types (239 cluster groups and 220 orphan clusters) had 4 or more genes per BGC (Figure SC).

138 Only 9 of the 459 cluster types with 4 or more genes were ever found more than once in any

139 given genome (Table SE). According to standard practice, we classified cluster types based on

140 the presence of biosynthetic signature genes: dimethylallyl tryptophan synthase (DMAT),

141 polyketide synthase (PKS), PKS-like, nonribosomal peptide-synthetase (NRPS), NRPS-like,

142 hybrid (HYBRID), and terpene cyclase (TC). We found that among all cluster types with 4 or

143 more genes, 186 contained only PKS and 29 contained only NRPS signature genes. Similarly,

144 we detected 4 DMAT, 38 PKS-like, 16 NRPS-like, 3 HYBRID, and 3 TC-only cluster types. A further 127

145 cluster types contained more than 1 type of signature gene, and 53 cluster types contained no

146 signature gene at all but still consisted of genes found in significant co-occurrences. By

147 searching the MIBiG database for highly similar hits (≥70% amino acid identity) to the signature

148 biosynthetic genes in CO-OCCUR BGCs, we were able to confidently annotate 158 of the BGCs

149 recovered by CO-OCCUR with 32 unique MIBiG entries, corresponding to 22 unique

150 metabolites (Table SF). BGC annotations based instead on content overlap with characterized

151 MIBiG clusters can be found in Table SG (minimum cluster size=3 genes; minimum percentage

152 of genes with similarity=70%).
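
The grouping of BGCs into cluster types, and the split into cluster groups and orphan clusters, can be sketched as follows. This is a hedged illustration only: the similarity measure (shared homolog groups as a fraction of the larger cluster), the single-linkage grouping, and all identifiers are assumptions, since the exact procedure is given in the Methods rather than here.

```python
from collections import defaultdict


def similarity(a, b):
    """Shared homolog groups as a fraction of the larger cluster (assumed definition)."""
    return len(a & b) / max(len(a), len(b))


def group_clusters(bgcs, min_similarity=0.9):
    """Single-linkage grouping of BGCs; returns one list of BGC ids per cluster type."""
    ids = list(bgcs)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if similarity(bgcs[i][1], bgcs[j][1]) >= min_similarity:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return list(groups.values())


def split_types(bgcs, groups):
    """Cluster groups span two or more genomes; orphan clusters occur in one genome."""
    cluster_groups, orphans = [], []
    for group in groups:
        genomes = {bgcs[i][0] for i in group}
        (cluster_groups if len(genomes) >= 2 else orphans).append(group)
    return cluster_groups, orphans


# Hypothetical BGCs: id -> (genome, set of homolog groups in the cluster).
bgcs = {
    "bgc1": ("genomeA", frozenset({"PKS", "HG1", "HG2", "HG3"})),
    "bgc2": ("genomeB", frozenset({"PKS", "HG1", "HG2", "HG3"})),
    "bgc3": ("genomeC", frozenset({"NRPS", "HG7"})),
}
cluster_groups, orphans = split_types(bgcs, group_clusters(bgcs))
print(cluster_groups, orphans)
```

Single linkage is only one possible choice; a stricter linkage rule would yield more, smaller cluster types from the same similarity threshold.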

153 Of the 158 BGCs with hits to the MIBiG database, some encoded non-host-selective

154 phytotoxins or other compounds with known roles in virulence to plants. Two PKS BGCs

155 encoding the non-host-specific phytotoxin and DNA polymerase inhibitor, solanapyrone

Figure 2. Dothideomycete species tree (with outgroup taxa from Eurotiomycetes, Sordariomycetes, Leotiomycetes, and Pezizomycetes; scale bar 0.10) shown with the taxonomic order of each species and the homologous cluster group linkage tree. Legend: signature gene type (PKS, NRPS, PKS-like, NRPS-like, Hybrid, > 1 signature gene); number of clusters (1, 2); signature gene ≥70% identical to known locus. Labeled loci: swainsonine, aflatoxin, ferrichrome, DHN melanin, cercosporin, depudecin, sirodesmin, dimethylcoprogen, curvupallides, betaenone A/B/C, chaetoviridin/chaetomugilin, alternapyrone.

156 (Mizushina, Kamisuki et al. 2002; Kasahara, Miyamoto et al. 2010), and the related

157 alternapyrone (Fujii, Yoshida et al. 2005), first identified in Alternaria solani (Kasahara et al.

158 2010; Mizushina et al. 2002; Fujii et al. 2005), were found across taxa primarily in the order

159 Pleosporales, especially in the closely related , , and

160 families (Figure 2, Table SF, SG). Several BGCs mapping to the MIBiG

161 cluster for the extracellular siderophore dimethylcoprogen, which plays a role in virulence in the

162 corn pathogen Cochliobolus heterostrophus (Dothideomycetes) and Fusarium graminearum

163 (Sordariomycetes), were also found in most taxa in Pleosporales (Oide, Moeder et al. 2006). In

164 contrast, BGCs mapping to the NRPS phytotoxin sirodesmin, produced by Leptosphaeria

165 maculans (Gardiner, Cozijnsen et al. 2004), and to depudecin, a histone deacetylase (HDAC) inhibitor

166 synthesized by a PKS BGC first identified in A. brassicicola (Wight et al. 2009), were found

167 discontinuously distributed in only a few unrelated species within Pleosporales (Figure 2, Table

168 SF, SG). Aside from sirodesmin, only one other BGC had hits in MIBiG to a BGC producing a

169 host-selective toxin. A BGC mapping to T-toxin, a polyketide toxin produced by race T (C4) of

170 C. heterostrophus (only Race O (C5) included in this study) that was responsible for the

171 devastating Southern Corn Leaf Blight (Daly 1982; Turgeon and Baker 2007), was detected by

172 CO-OCCUR in only two additional taxa, Ampelomyces quisqualis and L. maculans (Table SG).

173 Other BGCs matched MIBiG clusters from other ascomycete classes (Eurotiomycetes,

174 Sordariomycetes), some of which have been previously detected in Dothideomycetes while

175 others were unexpected. The aflatoxin-like dothistromin clusters, which are fragmented into six

176 mini-clusters in Dothistroma septosporum (Bradshaw, Slot et al. 2013), predictably mapped to

177 clusters detected by CO-OCCUR in D. septosporum and the closely related species Passalora fulva

178 (Capnodiales). Some unexpected findings included a cluster in Macrophomina phaseolina that

179 matched the PKS BGC for chaetoglobosins, a class of mycotoxins with both antifungal and anti-

180 cancer activities (Ali, Caggia et al. 2015; Jiang, Song et al. 2017) found in the distantly related

181 Chaetomium globosum (Sordariomycetes) and some Eurotiomycetes (Schumann and Hertweck

182 2007) (Figure 2, Table S1). Another unexpected finding was the similarity between a CO-

183 OCCUR cluster in M. phaseolina and the BGC for leucinostatin, a peptaibol compound with

184 putative antimicrobial and antifungal activity, that was previously only known from taxa in the

185 Sordariomycetes (Wang, Liu et al. 2016).

186 Cluster co-occurrence networks reveal contrasting trends in diversification

187 We visualized all significant homolog group co-occurrences predicted by CO-OCCUR as

188 networks where nodes represent homolog groups and edges connect homolog groups that co-

189 occur with unexpected frequency in genomic regions containing core biosynthetic genes (Figure

190 3a). A total of 33 discrete networks were recovered, with 71% of homolog groups located in the

191 largest two networks. Signature genes tended to be highly connected to other homolog groups in

192 two qualitatively different types of subnetworks. In one type of subnetwork, signature genes are

193 centrally connected to diverse accessory homolog groups (e.g. PKS subnetworks), while in the

194 other type one or more signature genes are non-centrally linked with fewer accessory homolog

195 groups (e.g., the NRPS and DMAT subnetwork in network 1). By quantifying the betweenness

196 centrality of each node (a function of the number of shortest network paths that pass through that

197 node) within each network, we identified signature genes and several other biosynthetic

198 enzymes, transporters, and DNA binding proteins that bridge alternate subnetworks (Figure

199 3a,b).
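
As an illustration of the betweenness centrality measure used above, the following sketch computes unnormalized betweenness for a small undirected, unweighted network using Brandes' algorithm. The toy network and node names are hypothetical, and whether the values reported in Figure 3 are normalized in the same way is not assumed here.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes) for an undirected, unweighted graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected path was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Toy network: a transporter homolog group bridges a PKS and an NRPS subnetwork.
graph = {
    "PKS": ["a1", "a2", "transporter"],
    "a1": ["PKS"],
    "a2": ["PKS"],
    "transporter": ["PKS", "NRPS"],
    "NRPS": ["transporter", "b1"],
    "b1": ["NRPS"],
}
centrality = betweenness(graph)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```

In this toy network the bridging transporter node scores nearly as high as the PKS hub, even though it has only two neighbors.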

200 PKS BGCs are more compositionally diverse than NRPS BGCs. BGCs containing PKS

201 signature genes tended to have fewer significant co-occurrences among their constituent genes

202 across various BGC sizes, compared to BGCs containing NRPS signature genes (Figure 3c).

203 This is consistent with a trend in which clusters containing PKS signature genes have more

204 unique types of BGCs for a given cluster size (corrected by the total number of BGCs of that

205 size), compared with BGCs containing NRPS signature genes (Figure 3d).
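
The size-corrected diversity comparison can be sketched as a simple ratio. The definition used below (number of unique cluster types divided by the number of BGCs of the same size) is an assumed reading of "corrected by the total number of BGCs of that size", and the input records are hypothetical.

```python
from collections import defaultdict


def diversity_by_size(bgcs):
    """Map cluster size -> (number of unique cluster types) / (number of BGCs) at that size."""
    types = defaultdict(set)
    totals = defaultdict(int)
    for size, cluster_type in bgcs:
        types[size].add(cluster_type)
        totals[size] += 1
    return {size: len(types[size]) / totals[size] for size in totals}


# Hypothetical (cluster size, cluster type) records for two signature gene classes.
pks_bgcs = [(4, "t1"), (4, "t2"), (4, "t3"), (4, "t3")]
nrps_bgcs = [(4, "u1"), (4, "u1"), (4, "u1"), (4, "u2")]
print(diversity_by_size(pks_bgcs), diversity_by_size(nrps_bgcs))
```

Under this definition, a value near 1 means that almost every BGC of a given size belongs to its own cluster type.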

206 Different algorithms annotate overlapping and complementary sets of clustered genes.

207 CO-OCCUR and the pHMM-based SMURF (Khaldi, Seifuddin et al. 2010)

208 and antiSMASH (Blin, Wolf et al. 2017) programs all predicted similar numbers but different

209 types of BGCs. antiSMASH identified a total of 1710 clusters that were part of 252 cluster

210 groups and 887 orphan clusters with 4 or more homolog groups (Table SH, Table SI). SMURF

211 identified a total of 686 clusters that were part of 194 cluster groups and 495 orphan clusters with

212 4 or more homolog groups (Table SJ, Table SK). CO-OCCUR predicted 1469 clusters that are

213 part of 239 cluster groups and 220 orphan clusters with 4 or more homolog groups (Table SC,

214 Table SE). We found that no single algorithm was able to annotate all predicted genes of interest

215 in a BGC, even those predicted to be involved in SM biosynthesis (Figure 4a, Table SL). CO-

216 OCCUR identified 51.2% and 37.7% of the clustered genes detected by SMURF and

217 antiSMASH, respectively. Conversely, SMURF and antiSMASH identified 40.7% and 42.0% of

218 the clustered genes detected by CO-OCCUR, respectively. When examining only genes

219 predicted to participate in SM biosynthesis, transport and catabolism, we found that CO-OCCUR

220 identified 51.2% and 43.3% of genes detected by SMURF and antiSMASH, respectively, while

221 SMURF and antiSMASH each identified 62.6% of those detected by CO-OCCUR (Figure 4a).
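
The pairwise percentages above are asymmetric because each is computed relative to a different reference set. A minimal sketch, using hypothetical gene identifiers and assuming that each algorithm's output is reduced to a set of clustered genes:

```python
def percent_identified(reference, other):
    """Percentage of genes in `reference` that are also present in `other`."""
    return 100 * len(reference & other) / len(reference)


# Hypothetical sets of clustered genes reported by two algorithms.
co_occur = {"g1", "g2", "g3", "g4"}
antismash = {"g3", "g4", "g5", "g6", "g7"}

# Share of antiSMASH genes also found by CO-OCCUR, and the reverse.
print(percent_identified(antismash, co_occur))
print(percent_identified(co_occur, antismash))
```

The two calls share the same intersection but divide by different set sizes, which is why neither direction alone summarizes the agreement between algorithms.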

Figure 3. a) Co-occurrence networks (Network 1, Network 2); node types: homolog group, transport-related homolog group; signature genes: DMAT, NRPS, NRPS-like, Hybrid, PKS, PKS-like, TC; nodes are labeled with betweenness centrality values; labeled nodes: n1, serine hydrolase; n2, multi-drug resistance protein; n3, trichothecene efflux pump; n4, amino acid aminotransferase; n5, FAD-binding; n6, o-methyltransferase; n7, DNA-binding protein; n8, mono-carboxylate transporter. b) Number of nodes versus betweenness centrality for Network 1 and Network 2. c) Number of significant co-occurrences per cluster versus gene cluster size, for co-occurrences involving and not involving a signature gene. d) Gene cluster diversity versus gene cluster size.

Figure 4. a) Overlap among clustered genes identified by antiSMASH, CO-OCCUR, and SMURF. b) The cercosporin locus (CTB1-CTB12, CFP, ORF11, ORF12, and neighboring genes); asterisks mark genes required for cercosporin biosynthesis (Chen et al., 2007; Newman et al., 2016; de Jonge et al., 2018). c) antiSMASH % recovery versus CO-OCCUR % recovery, and antiSMASH % discovery versus CO-OCCUR % discovery.

222 The complementary nature of the CO-OCCUR and antiSMASH algorithms is illustrated

223 by their annotations of a characterized BGC that encodes the biosynthesis of cercosporin, a non-

224 host specific polyketide produced by Cercospora spp. (Dothideomycetes) and Colletotrichum

225 (Sordariomycetes) (de Jonge, Ebert et al. 2018). All 10 genes involved in cercosporin biosynthesis

226 are encoded in a BGC and are characterized: seven biosynthetic genes (CTB1-3, CTB5-7, CTB9),

227 a regulator (CTB8), and two transporters (CTB4 and CFP) (Chen, Lee et al. 2007; de Jonge, Ebert

228 et al. 2018; Newman & Townsend 2016). At this BGC’s locus in Cercospora zeae-maydis, both

229 antiSMASH and CO-OCCUR annotated CTB1, CTB2, and CTB3 as genes of interest; only

230 antiSMASH annotated CTB4, CTB5 and CTB6; only CO-OCCUR annotated CTB10, CTB11

231 and CTB12; and no algorithm annotated CTB7, CTB8, CTB9 or CFP (Figure 4b).

232 CO-OCCUR and antiSMASH recovered similar proportions of loci homologous to

233 known BGCs and predicted additional genes of interest in the vicinity of these candidates. Using

234 BLASTP, we identified 364 BGCs with ≥ 3 genes across all Dothideomycete genomes that are

235 homologous to 58 characterized BGCs from the MIBiG database (i.e., where ≥75% of genes

236 show similarity) (Table SM). We then compared how many genes within and around these BGCs

237 were predicted to be of interest by either antiSMASH or CO-OCCUR by cross-referencing all

238 BGCs detected by each method, and found that both algorithms recovered similar percentages of

239 BGC content (antiSMASH mean percent recovery = 48.3%, SD = 37.6%; CO-OCCUR mean

240 percent recovery = 51.0%, SD = 42.6%), although for any given BGC, percent recovery often

241 differed between the two algorithms (Figure 4c, Table SG). We also found that both antiSMASH and

242 CO-OCCUR identified similar numbers of new genes of interest around BGC loci (antiSMASH

243 mean percent discovery = 65.4%, SD = 85.4%; CO-OCCUR mean percent discovery = 56.6%,

244 SD = 89.4%), and that the number of additional genes of interest often exceeded the size of the

245 recovered candidate cluster. High rates of novel gene discovery are perhaps expected given that

246 many of the clusters in MIBiG are only partially annotated.
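
The two summary statistics above can be illustrated with a minimal sketch. It assumes that percent recovery is the share of a reference BGC's genes that a method also flags, and that percent discovery is the number of additional flagged genes expressed relative to the size of the reference BGC (which would allow values above 100%, in line with the text). These definitions, the function names and the gene labels are illustrative assumptions, not taken from the Methods.

```python
def percent_recovery(reference_genes, predicted_genes):
    """Share (%) of the reference BGC's genes that the method also flagged.

    Assumed definition, for illustration only.
    """
    reference = set(reference_genes)
    recovered = reference & set(predicted_genes)
    return 100.0 * len(recovered) / len(reference)


def percent_discovery(reference_genes, predicted_genes):
    """Genes flagged at the locus but absent from the reference BGC,
    expressed as a percentage of the reference BGC's size.

    Assumed definition, for illustration only; it can exceed 100%.
    """
    reference = set(reference_genes)
    novel = set(predicted_genes) - reference
    return 100.0 * len(novel) / len(reference)


# Hypothetical locus: four characterized genes, five genes flagged by a method.
reference = {"geneA", "geneB", "geneC", "geneD"}
predicted = {"geneA", "geneB", "geneX", "geneY", "geneZ"}

recovery = percent_recovery(reference, predicted)    # two of four reference genes
discovery = percent_discovery(reference, predicted)  # three new genes vs. four reference genes
print(f"recovery = {recovery:.1f}%, discovery = {discovery:.1f}%")
```

Under these assumed definitions, a method can score low on recovery and high on discovery at the same locus, which is one way the two algorithms could show similar means yet differ for any given BGC.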

247 Some over-dispersed clusters have ecologically biased distributions

248 We found that nearly one-fifth (18%) of cluster groups are phylogenetically over-dispersed

249 according to Fritz and Purvis’ D statistic, that is, relative to the distributions expected under strict Brownian

250 evolution, in which more closely related species are predicted to

251 be more similar to each other than more distantly related species (Figure 5, Figure SD).

252 Six over-dispersed cluster groups were over-represented (present at least twice as often) in either

253 plant pathotrophs or plant saprotrophs (Figure 5). By comparison, 22.5% of cluster groups had

254 distributions that were more conserved than expected. The remaining cluster group distributions

255 either fell on a continuum between phylogenetically conserved and over-dispersed (35.1%) or

256 were present in sets of taxa too small to be analyzed (23.8%). Figure 6 presents three sets

257 of closely related cluster groups that vary in their phylogenetic conservation (see Methods).

258 Cluster groups in the first set, which partially encode the 1,8-dihydroxynaphthalene (DHN)

259 melanin pathway, were found in nearly all Dothideomycetes; cluster groups in the second set

260 were restricted to the Pleosporales; and cluster groups in the third set were found among

261 Bipolaris and Dothidotthia, two closely related genera within the Pleosporales.
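
For readers unfamiliar with the statistic, Fritz and Purvis' D for a binary trait (here, presence or absence of a cluster group) is usually written as below. The symbol names are chosen for this note, and the details of how it was applied in this study are left to the Methods.

```latex
% \sum d_{\mathrm{obs}} : observed sum of sister-clade differences in the trait across the tree
% \overline{\sum d_{r}}  : mean of that sum when tip states are randomly shuffled
% \overline{\sum d_{b}}  : mean of that sum under a Brownian threshold model
D = \frac{\sum d_{\mathrm{obs}} - \overline{\sum d_{b}}}
         {\overline{\sum d_{r}} - \overline{\sum d_{b}}}
```

On this scale, D near 0 matches the Brownian expectation, D near 1 matches a random distribution, D below 0 indicates a trait more conserved than Brownian, and D above 1 indicates phylogenetic over-dispersion.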

262 Dothideomycetes have five distinct types of DHN melanin clusters

263 We detected five cluster groups with distinct but overlapping compositions that appear to

264 encode partial pathways for 1,8-dihydroxynaphthalene (DHN) melanin biosynthesis in 87 of the

265 101 genomes (Figure 6). No genome had more than one predicted DHN melanin cluster. The two

266 most prevalent types, cluster group 131 and cluster group 113, are found in 48 fungi from 10 of

267 the 13 taxonomic orders and in 29 fungi from 6 orders, respectively. Cluster groups 131, 113,


[Figure 5: Lifestyle ratios (plant path. : sap. and plant sap. : path.) of homologous cluster groups (HCGs) plotted against Fritz and Purvis' D, with groups marked by P(Brownian motion) > 0.05 or ≤ 0.05, and the distributions of six highlighted groups on the Dothideomycete species tree. Highlighted groups (CG id; signature gene(s); size; frequency; best known signature(s)): (a) group 221; NRPS; 4; 5; Asperphenamate (35%). (b) group 6; PKS; 14; 2; Emodin, Emericellin, Pestheic acid (60%). (c) group 182; PKS,TC; 4; 3; Fusarubin (46%). (d) group 163; PKS,DMAT; 5; 3; ACT-toxin (47%). (e) group 49; PKS; 8; 6; Emodin (66%). (f) group 202; PKS; 5; 7; Azanigerone (46%).]

[Figure 6: Three sets of closely related HCGs and their distributions on the Dothideomycete species tree. Set 1, DHN melanin: (a) group 131, (b) group 113, (c) group 223, (d) group 134, (e) group 137; genes shown: polyketide synthase, 1,3,8-trihydroxynaphthalene reductase, transcription factor cmr1, prefoldin subunit, hsp40 co-chaperone JID1, unknown. Set 2, unknown NRPS-like: (f) group 185, (g) group 168, (h) group 100; genes shown: NRPS-like, glycerol-3-phosphate dehydrogenase, unknown, efflux pump, MFS multidrug transporter, ABC transporter, gliotoxin exporter. Set 3, alternapyrone-like PKS: (i) group 16, (j) group 22. Parentheses mark genes present in < 50% of clusters in an HCG.]

268 223 and 134 encode 2 of the 5 biosynthetic genes (pks1, 1,3,8-trihydroxynaphthalene [T3HN]

269 reductase) and the transcription factor (cmr1) involved in DHN melanin biosynthesis, while

270 cluster group 137 encodes 1 biosynthetic gene (T3HN reductase) and cmr1 (Figure 6b). In

271 addition to the homolog groups known to participate in DHN melanin biosynthesis, we detected

272 3 additional homolog groups (Prefoldin subunit, Heat-shock protein 40 co-chaperone JID1, and a

273 protein of unknown function) that are broadly conserved within DHN melanin clusters but that

274 have no known role in melanin biosynthesis. As an example of how CO-OCCUR is not

275 constrained by a priori assumptions of pHMMs, these additional homolog groups were not

276 detected by either antiSMASH or SMURF despite their prevalent linkage to the known

277 biosynthetic genes.

278

279 SM cluster diversity is under-sampled and increases proportionally with total number of

280 genomes in Pleosporales

281 Cluster repertoires (combinations of cluster groups found within a given genome) differ

282 markedly between fungi from different genera (mean pairwise Sørensen dissimilarity = 0.79, SD

283 = 0.12) and to a lesser extent within a genus (mean = 0.37, SD = 0.13), with dissimilarity

284 increasing linearly with phylogenetic distance across all pairwise species combinations among

285 49 Pleosporales (y = 0.84x + 0.51; r2 = 0.50), the most well-sampled Dothideomycetes order

286 (Figure 7a, Figure SF). However, given the same level of within-repertoire diversity (i.e., alpha

287 diversity) and total diversity across all repertoires (i.e., gamma diversity), dissimilarity between

288 repertoires (i.e., beta diversity) can result from either nestedness (where some repertoires are

289 subsets of others) or turnover (where no repertoire is a subset of the other), or a combination of

290 the two. When we partitioned total Sørensen dissimilarity between all cluster repertoires (βSOR =


[Figure 7: (a) Linkage tree of Sørensen dissimilarity between Pleosporalean cluster repertoires, shown with a linkage tree of Raup-Crick dissimilarity between unique cluster types. (b) Cluster repertoire diversity (total Raup-Crick branch length) versus cluster repertoire size for each sampled Pleosporalean genome. (c) Cluster richness versus number of sampled Pleosporalean genomes (observed, interpolation, extrapolation).]

291 0.969) into its nestedness and turnover components, we found that nearly all of the differences

292 between the cluster repertoires of different genomes were due to turnover (βSIM = 0.96) and not

293 repertoire nestedness (βSNE = 0.008), such that any given cluster repertoire contains a unique

294 combination of clusters (Figure SH). Furthermore, the compositional diversity of gene clusters

295 within a given repertoire (measured as the total branch length on a Raup-Crick linkage tree,

296 Figure 7a) scales linearly with repertoire size (y = 0.49x + 3.92; adj. R2 = 0.86), indicating that

297 clusters added to a given repertoire are generally dissimilar to the clusters already present in that

298 repertoire (Figure 7b). Finally, rarefaction analysis of the total number of unique cluster groups

299 and orphan clusters (i.e., cluster richness) detected over increasing numbers of sampled genomes

300 suggests genomes within Pleosporales are under-sampled with respect to BGC diversity, and

301 projects substantially more unique cluster types arising from future genome sampling within this

302 order (Figure 7c).
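
The partition of Sørensen dissimilarity into turnover and nestedness can be sketched for a single pair of repertoires as follows. This is the pairwise form of the partition, whereas the values reported above summarize all repertoires at once, so the sketch illustrates the logic rather than reproducing the analysis. The function name and the toy repertoires are hypothetical.

```python
def partition_sorensen(repertoire_a, repertoire_b):
    """Split pairwise Sorensen dissimilarity into turnover and nestedness.

    Returns (beta_sor, beta_sim, beta_sne), where
      beta_sor = total Sorensen dissimilarity,
      beta_sim = Simpson dissimilarity (turnover component),
      beta_sne = beta_sor - beta_sim (nestedness-resultant component).
    """
    set_a, set_b = set(repertoire_a), set(repertoire_b)
    shared = len(set_a & set_b)   # cluster groups present in both genomes
    only_a = len(set_a - set_b)   # cluster groups unique to genome A
    only_b = len(set_b - set_a)   # cluster groups unique to genome B
    if shared + only_a + only_b == 0:
        raise ValueError("both repertoires are empty")

    beta_sor = (only_a + only_b) / (2 * shared + only_a + only_b)
    smaller = min(only_a, only_b)
    # If one repertoire is empty, treat it as fully nested (no turnover).
    beta_sim = smaller / (shared + smaller) if shared + smaller > 0 else 0.0
    return beta_sor, beta_sim, beta_sor - beta_sim


# Toy repertoires of cluster-group identifiers (hypothetical).
nested = partition_sorensen({1, 2, 3, 4}, {1, 2})     # B is a subset of A
replaced = partition_sorensen({1, 2, 3}, {1, 4, 5})   # equal sizes, little overlap
print("nested pair:  ", nested)
print("turnover pair:", replaced)
```

In the first toy pair all dissimilarity is attributed to nestedness, and in the second all of it is attributed to turnover, which is the pattern that dominates among the Pleosporalean repertoires.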

303 Discussion:

304 BGC diversity has been investigated primarily in bacteria (Cimermancic, Medema et al.

305 2014) and within individual genera in the fungal classes Eurotiomycetes and Sordariomycetes

306 (Lind, Wisecaver et al. 2017; Theobald, Vesth et al. 2017; Villani, Proctor et al. 2019). Although

307 Dothideomycetes are producers of a number of secondary metabolites important to fungal-plant

308 interactions and toxin production, to date there has not been a systematic evaluation of BGC

309 diversity in the Dothideomycetes nor in any other fungal class. Fungal genomes experience

310 frequent reorganization and changes in gene composition that underlie large-scale differences in

311 chromosomal macro- and micro-synteny among species (Grandaubert, Lowe et al. 2014; Hane,

312 Rouxel et al. 2011; Shi‐Kunne, Faino et al. 2018). Yet despite the overall dynamic nature of

313 fungal chromatin, tight linkage is often maintained between loci with related metabolic


314 functions, manifesting as gene clusters (Del Carratore, Zych et al. 2019). Here, we developed an

315 alternative, function-agnostic approach to annotating SM genes of interest that exploits these

316 patterns of microsynteny in order to identify previously unexplored dimensions of fungal BGC

317 diversity.

318 Complementary methodologies enhance understanding of BGC composition and diversity

319 in Dothideomycetes

320 There are two main approaches to predicting genes that are functionally associated in

321 BGCs. The first uses targeted methods based on precomputed pHMMs derived from a set of

322 genes known to participate in SM metabolism to identify sequences of interest (Khaldi,

323 Seifuddin et al., 2010; Blin, Wolf et al., 2017). The second uses untargeted methods based on

324 some function-agnostic criteria, such as synteny conservation or shared evolutionary history, to

325 implicate genes as part of a gene cluster (Gluck-Thaler & Slot 2018). Due to common metabolic

326 functions employed across distantly related taxa, targeted approaches, such as those employed by

327 SMURF and antiSMASH, have proven enormously successful. However, our objective in this

328 study was to develop a complementary untargeted approach in order to capture undescribed BGC

329 diversity within a single fungal lineage.

330 Our CO-OCCUR algorithm leverages a database of 101 Dothideomycete genomes in

331 order to annotate genes of interest using unexpectedly conserved genetic linkage as an indicator

332 of selection for co-inheritance with SM signature genes. CO-OCCUR failed to recover many of

333 the genes annotated using the pHMM approaches employed by SMURF and antiSMASH,

334 indicating that it has limitations in its prediction of secondary metabolite BGC content. These

335 results suggest that it is not optimal for the de novo BGC annotation of individual genomes, and

336 its ability to annotate genes of interest is proportional to their co-occurrence frequency in a given


337 database, meaning that it is not well suited for recovering associated SM genes that are not

338 evolutionarily conserved. This may explain in part why 10,295 genes (including 2,478 genes

339 predicted to be involved in secondary metabolism) identified by antiSMASH and SMURF

340 combined were not detected with CO-OCCUR (Figure 4), and why CO-OCCUR detected only a

341 few of the host-selective toxins found in Dothideomycetes.

342 Nevertheless, our method avoids some of the limitations intrinsic to algorithms that

343 employ pHMMs to delineate cluster content. While pHMM-based approaches gain predictive

344 power by leveraging similarities in SM biosynthesis across disparate organisms, they may fail to

345 identify gene families involved in secondary metabolism that are unique to a particular lineage of

346 organisms. For example, SMURF detects accessory SM genes using pHMMs derived from

347 mostly Aspergillus (Eurotiomycetes) BGCs (Khaldi, Seifuddin et al., 2010), while antiSMASH

348 v4 and v5 use 301 pHMMs of smCOGs (secondary metabolism gene families) derived from

349 aligning SM-related proteins, of which few are currently from fungi, in order to identify genes of

350 interest in the regions surrounding signature biosynthetic genes (Blin, Wolf et al., 2017).

351 Taxonomic bias introduced by sampling a limited number of BGCs may account for the 6,051

352 proteins found in BGCs that were identified by CO-OCCUR but not any other algorithm, of

353 which 796 are predicted to participate in secondary metabolism and 617 could not be assigned to

354 a COG category but nevertheless have domains commonly observed in secondary metabolite

355 biosynthetic proteins (e.g., methyltransferase, hydrolase).

356 A linkage-based approach can also identify non-canonical accessory genes involved in

357 SM biosynthesis. For example, we detected 3 genes among the 5 variants of the DHN melanin

358 cluster that were not previously considered to be part of this BGC and not detected by either

359 antiSMASH or SMURF. One of these genes, a predicted HSP40 chaperone, is a homolog of the


360 gene JID1, whose knock-out mutants display a range of phenotypes

361 (https://www.yeastgenome.org/locus/S000006265/phenotype) related to melanin production,

362 including increased sensitivity to heat and chemical stress. We propose that natural selection (not

363 genetic hitchhiking) is responsible for conservation of synteny in these loci, because SM cluster

364 locus composition and microsynteny in general are typically highly dynamic in fungi (Lind,

365 Wisecaver et al., 2018; Proctor, McCormick et al. 2018), and therefore conserved linkage in

366 these clusters over speciation events is a strong indicator of related function (de Jonge, Ebert et

367 al. 2018; Del Carratore, Zych et al. 2019). The identification of genes with non-canonical

368 functions, including those not participating directly in SM biosynthesis, may reveal SM

369 supportive functions, including mechanisms to protect endogenous targets of the metabolic

370 product (Keller 2015), in addition to novel biosynthetic genes (de Jonge, Ebert et al. 2018).

371 Ultimately, targeted and untargeted approaches to BGC annotation reinforce and enrich

372 our understanding of BGC diversity, as no single method identifies all accessory genes of

373 interest in the regions surrounding signature biosynthetic genes (Figure 4). It is notable that the

374 cercosporin BGC was long thought to consist only of CTB1-8, based on functional analyses and

375 structural prediction. However, de Jonge et al. recently predicted CTB9-12 to be of interest after

376 observing that these genes have conserved synteny among all fungi that possessed CTB1-8, and

377 subsequently demonstrated they are essential for cercosporin biosynthesis (de Jonge, Ebert et al.

378 2018). Only CO-OCCUR detected these additional four genes, and both pHMM-based models

379 and CO-OCCUR were required to detect the complete cercosporin BGC in our study. Given the

380 complementary nature of the advantages and disadvantages of different algorithms, we suggest

381 future studies incorporate multiple lines of evidence from both targeted and untargeted

382 approaches to more fully capture BGC compositional diversity. The 332 homolog groups of


383 interest that we identified using CO-OCCUR could further be used to build pHMMs and be

384 incorporated into existing BGC annotation pipelines in order to facilitate more complete analyses

385 of single genomes.

386 Signature genes differ in mode of BGC diversification

387 Although BGCs in fungi typically display characteristics of diversification ‘hotspots’,

388 showing elevated rates of gene duplication and gene gain and loss (Wisecaver, Slot et al. 2014;

389 Lind, Wisecaver et al. 2017), modular parts of clusters and even entire clusters are often shared

390 between divergent species. BGC diversification through gain and loss of individual genes and

391 sub-clusters of genes has been demonstrated in bacterial BGC diversity (Cimermancic, Medema

392 et al. 2014). Although the extent of sub-clustering in fungal genomes has never been directly

393 addressed to our knowledge, the algorithm we designed here essentially functions by identifying

394 the smallest possible type of sub-cluster: a pair of genes found more often than expected by

395 chance. The unexpected co-occurrence of gene pairs revealed that the two largest types of

396 signature gene families, PKS and NRPS, have contrasting co-occurrence network structures.

397 NRPS homolog groups are embedded in highly reticulate cliques (i.e., form unexpected

398 associations with genes that co-occur amongst themselves). This could suggest NRPS cluster

399 diversification is constrained by interdependencies among accessory genes. By contrast, PKS

400 homolog groups are network hubs (i.e., form unexpected associations with many non-co-

401 occurring genes), which may underlie the higher compositional diversity and decreased

402 frequency of unexpected co-occurrences found within PKS clusters (Figure 3c, d). The apparent

403 contrast in how these different signature cluster types are assembled may reflect the range of

404 accessory modifications typically applied to the structures of polyketides and nonribosomal

405 peptides produced by PKSs and NRPSs. Alternatively, PKS clusters may be subject to more


406 diversifying selection, due to the ability of cognate metabolism in other organisms to utilize,

407 degrade, or neutralize the metabolites. These hypotheses remain to be tested.
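
The idea of "a pair of genes found more often than expected by chance" can be made concrete with a generic enrichment test. The sketch below uses a hypergeometric tail probability as a stand-in null model. It is an assumption chosen for illustration and is not the CO-OCCUR procedure, whose null model is described in the Methods. The function name and the counts are hypothetical.

```python
from math import comb


def cooccurrence_pvalue(total_loci, loci_with_a, loci_with_b, loci_with_both):
    """P(X >= loci_with_both) under a hypergeometric null.

    Null model (an illustrative assumption): homolog group B is placed at
    random in loci_with_b of total_loci loci, independently of where
    homolog group A occurs.
    """
    upper = min(loci_with_a, loci_with_b)
    # Count the placements of B that overlap A in at least loci_with_both loci.
    favourable = sum(
        comb(loci_with_a, i) * comb(total_loci - loci_with_a, loci_with_b - i)
        for i in range(loci_with_both, upper + 1)
    )
    return favourable / comb(total_loci, loci_with_b)


# Hypothetical counts: 200 signature-gene loci, A in 12, B in 10, both in 8.
p_value = cooccurrence_pvalue(200, 12, 10, 8)
print(f"P(co-occurrence >= 8 by chance) = {p_value:.3g}")
```

A small tail probability flags the pair as linked more often than the null predicts. Repeating such a test over all pairs of homolog groups would yield the kind of co-occurrence network in which cliques and hubs can be distinguished.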

408 Persistent gene co-occurrences reveal layers of combinatorial evolution

409 Previous large-scale analyses of BGCs suggest there is an upper limit to the number of

410 gene families that associate with signature biosynthetic genes, and that diversity is in large part

411 dependent on combinatorial re-shuffling of existing loci (Cimermancic, Medema et al. 2014).

412 Our analysis expands the number of gene families implicated in BGC diversity and identifies

413 patterns of modular combinatorial evolution among accessory homolog groups with metabolic,

414 transport and regulatory-related functions. While some of these accessory homolog groups are

415 restricted to BGCs with a particular type of signature SM gene, others are present in multiple

416 BGC types, suggesting they encode evolvable or promiscuous functions that can be readily

417 incorporated into different metabolic processes (Table SB). For example, 34 homolog groups

418 with predicted transporter functions are common features of the clusters we detected, present in

419 just under half (43%) of all predicted clusters. Among these homolog groups, 5 have been

420 recruited to compositionally diverse gene clusters and are primarily annotated as toxin efflux

421 transporters or multidrug resistance proteins. Transporters are a key component of fungal

422 chemical defense systems, well known for facilitating resistance to fungicides and host-produced

423 toxins (Coleman & Mylonakis 2009). Transporters are also increasingly recognized as integral

424 components of self-defense mechanisms against toxicity of endogenously produced SMs

425 (Menke, Dong et al., 2012).

426 Heterogeneous dispersal patterns of BGCs underpin fungal ecological diversity

427 The distribution of fungal chemodiversity remains difficult to observe and interpret

428 directly, making BGCs useful tools for elucidating underlying trends in fungal chemical ecology.


429 Although the vast majority of BGCs remain uncharacterized, their phylogenetic distributions

430 occasionally provide clues to the selective environments that promote their retention (Slot 2017).

431 For example, spotty distributions resulting from horizontal transfer of BGCs between distantly

432 related but ecologically similar species suggest that the encoded metabolites contribute to fitness in

433 the shared environment (Dhillon, Feau et al. 2015; Reynolds, Vijayakumar et al. 2018). Shared

434 ecological lifestyle may also help explain why certain clusters, such as those involved in putative

435 degradative pathways, are retained among phylogenetically distant species (Gluck-Thaler and

436 Slot 2018). Our simple eco-evolutionary screen identified 43 BGCs that are more widely

437 dispersed than expected under neutral evolutionary models, and further revealed that a subset of

438 these BGCs are present more often in fungi with specific nutritional strategies (e.g. plant

439 saprotrophs and plant pathotrophs), suggesting the molecules they encode contribute to specific

440 plant-associated lifestyles (Figure 5). For example, we found an over-dispersed NRPS BGC

441 (group 221) that is present in three plant pathogens and one plant saprotroph. In contrast, the 54

442 BGCs showing a phylogenetically under-dispersed distribution among mostly closely related

443 genomes are consistent with lineage-favored traits, which may or may not be due to shared

444 ecology. For example, a monophyletic clade of 26 pleosporalean fungi all have a 6-gene

445 NRPS-like cluster (group 100) of unknown function, fully maintained among these allied taxa (and a

446 single distant relative), suggesting it encodes a trait that contributes to the success of this lineage.

447 Phylogenetic screens, especially when coupled with more robust phylogenetic analyses, will be

448 useful for prioritizing the characterization of BGCs most likely to contribute to the success of

449 particular guilds or clades.

450 Among those BGCs with hits to the MIBiG database, we identified clusters that displayed

451 both lineage specific and spotty or sporadic distributions. The Pleosporales, for example,


452 contains many plant pathogens and the conservation of BGCs involved in production of general

453 virulence factors towards plants such as solanopyrone, alternapyrone, and the extracellular

454 siderophore dimethylcoprogen across many taxa in this order suggests a shared lineage-specific

455 trait with roles in plant-pathogenesis. In contrast, the aflatoxin-like dothistromin cluster,

456 which was proposed to be horizontally transferred from Aspergillus (Eurotiomycetes), had a very

457 spotty distribution, found only in several closely related taxa in Capnodiales, supporting a

458 hypothesis of HGT. Similarly, the BGC for the epipolythiodioxopiperazine (ETP) toxin sirodesmin shares 6 genes with the BGC producing

459 the ETP toxin gliotoxin, which plays a role in virulence towards

460 animals in the human pathogen Aspergillus fumigatus (Eurotiomycetes) (Gardiner, Cozijnsen et al.

461 2004; Bok, Chung et al. 2006). Related ETP-like BGCs have since been identified in a number

462 of other taxa of Eurotiomycetes and Sordariomycetes, but among Dothideomycetes were

463 previously known only from L. maculans and a partial cluster in Sirodesmium diversum lacking

464 the core NRPS (Patron, Waller et al. 2007). We detected homologs of this cluster

465 sporadically distributed in several other taxa within Pleosporales (Figure 2, Table SG). The

466 CO-OCCUR algorithm detected only a few BGCs with hits to host-selective toxins (sirodesmin and

467 T-toxin) but failed to detect several well-known host-selective toxins such as HC-toxin, and other

468 host-selective toxins in Alternaria alternata. Either these host-selective toxins are not represented

469 in MIBiG or, as discussed above, the uniqueness of these clusters and rarity of the linkages

470 between genes in these clusters in the overall dataset may make them difficult to detect through

471 CO-OCCUR.

472

473 Variation among BGC repertoires is due to high BGC turnover, not nestedness


474 Recent comparative studies have documented high intraspecific diversity of SM

475 pathways within and between different species of plants, bacteria and fungi (Penn et al., 2009;

476 Choudoir, Pepe-Ranney & Buckley 2018; Holeski, Hillstrom et al. 2012; Holeski, Keefover-

477 Ring et al. 2013; Vesth, Nybo et al. 2018). However, identical estimates of diversity can result

478 from two distinct processes: nestedness, where one set of features is entirely subsumed within

479 another, or turnover, where differences are instead due to a lack of overlap among the features of

480 different sets (Baselga 2012). When we partitioned diversity among BGC repertoires in

481 Pleosporales (i.e., β diversity), we found that the vast majority of variation is due to a high

482 degree of genome-specific cluster combinations, and not nestedness (Figure 7, Figure SH). Much

483 of the turnover in BGC repertoire content between genomes appears to occur over relatively

484 short evolutionary timescales (Figure SF), and then diversifies more gradually, suggesting that

485 divergence in repertoires may be closely linked to speciation processes, such as niche

486 differentiation or geographic isolation. Directional selection, especially for multi-genic traits

487 encoded at a single locus (e.g., BGCs), leads to rapid gain/loss dynamics exemplary of many SM

488 phenotypes and genotypes (Choudoir, Pepe-Ranney & Buckley 2018; Lind, Wisecaver et al.

489 2017). Niche differentiation further reinforces divergence between closely related repertoires,

490 which might lead to rapid accumulation of variation over short evolutionary timescales. Indeed,

491 evidence from within populations suggests that BGCs are occasionally located in genomic

492 regions experiencing selective sweeps in geographically isolated pathogen populations

493 (Hartmann, McDonald et al. 2018). The retention/loss of certain SM clusters is coincident with

494 speciation in bacteria (Kurmayer, Blom et al., 2015) and much of the variation in cluster

495 repertoires in Metarhizium insect pathogens is species specific (Xu, Luo et al., 2016). Within

496 Dothideomycetes, the evolution of host-selective toxins even within a single species of pathogen,


497 for example, may allow for niche differentiation, host specialization, and potentially speciation.

498 Rare chemical phenotypes, especially with regards to defense chemistry, may also increase

499 fitness in complex communities (Kursar, Dexter et al. 2009).

500 BGCs interact with dimensions of chemical diversity

501 Biological activity of a SM can increase organismal fitness, but any given molecule is not

502 likely to be biologically active. The screening hypothesis posits that mechanisms to generate and

503 retain biochemical diversity would therefore be selected, despite the energetic costs, because

504 increasing structural diversity increases the probability of “finding” those that are adaptive (Firn

505 and Jones 2003). This phenomenon is analogous to the mammalian immune system’s latent

506 capacity to generate novel antibodies, resulting in a remarkable ability to respond to diverse

507 antagonists (Firn and Jones 2003). However, while the screening hypothesis may equally apply

508 to plants and microorganisms, patterns of diversity we observe here suggest each lineage

509 generates and maintains biochemical diversity in fundamentally distinct ways. Specifically,

510 fungal individuals appear to maximize total chemical beta-diversity while simultaneously

511 minimizing alpha-diversity of similar chemical classes (Nielsen, Grijseels et al. 2017; Vesth,

512 Nybo et al. 2018). In contrast, individual plants are more likely to produce diverse suites of

513 structurally similar molecules (Li, Baldwin & Gaquerel 2015; Song, Qiao et al. 2017; Weinhold,

514 Ullah et al. 2017). We show that total cluster diversity increases linearly with repertoire size

515 across a broad sample of fungi, extending previous observations that individual fungal genomes

516 are streamlined to produce molecules that share little structural similarity. Rather than

517 maintaining sets of homologous BGCs and pathways within the same genome, evidence from

518 our study and others suggests that fungi instead maintain high genetic variation in homologous

519 BGCs across individuals at the level of the pan-genome (Ziemert, Lechner et al. 2014; Lind,


520 Wisecaver et al. 2017; Olarte et al. 2019). Although not a selectable evolvability mechanism per

521 se, greater access to the diversity of BGCs harbored in pan-genomes through recombination,

522 hybridization and horizontal transfer effectively outsources the incremental screening for

523 bioactive metabolites across many individuals, thereby decreasing the costs for generating

524 diversity for any given individual and likely accelerating the rate at which effective bioactive

525 metabolite repertoires are assembled within a given lineage (Slot and Gluck-Thaler 2019). Our

526 characterization of BGC diversity across the largest fungal taxonomic class represents a step

527 towards elucidating the broader consequences of these contrasting strategies for generating and

528 maintaining biodiversity of metabolism writ large.

529 Conclusions:

530 Fungi produce a range of secondary metabolites that are linked to different ecological functions

531 or defense mechanisms, playing a role in adaptation over time. Although studied at the intra- and

532 interspecific levels, this phenomenon has not been studied at macroevolutionary scales. The

533 Dothideomycetes represent the largest and phylogenetically most diverse class of fungi,

534 displaying a range of fungal lifestyles and ecologies. Here we assessed the patterns of diversity

535 of biosynthetic gene clusters across the genomes of 101 Dothideomycetes to dissect patterns of

536 evolution of chemodiversity. Our results suggest that different classes of BGCs (e.g. PKS versus

537 NRPS) have differing diversity of cluster content and connectedness among networks of co-

538 occurring genes and implicate high rates of BGC turnover, rather than nestedness, as the main

539 contributor to the high diversity of BGCs observed among fungi. Consequently, little overlap

540 was found in biosynthetic gene clusters from different genera, consistent with diverse ecologies

541 and lifestyles among the Dothideomycetes, and suggesting that most of the metabolic capacity of

542 this fungal class remains to be discovered.

543

544 Methods:

545 Dothideomycetes genome database and species phylogeny

546 A database of 101 Dothideomycetes annotated genomes, gene homolog groups, and the

547 corresponding phylogenomic species tree were obtained from Grigoriev et al. (2014) and Haridas et

548 al. (in press).

549 Gene cluster annotation with the SMURF algorithm

550 We used a command-line Python script based on the SMURF algorithm (Vesth, Nybo et al.

551 2018). Using genomic coordinate data and annotated PFAM domains of predicted genes as input,

552 the algorithm predicts seven types of SM clusters based on the multi-PFAM domain composition

553 of known 'backbone' genes. The cluster types are 1) polyketide synthases (PKSs), 2) PKS-like, 3)

554 nonribosomal peptide synthetases (NRPSs), 4) NRPS-like, 5) hybrid PKS-NRPS, 6)

555 prenyltransferases (DMATS), and 7) terpene cyclases (TCs). The borders of clusters are

556 determined using PFAM domains that are enriched in characterized SM clusters, allowing up to

557 3 kb of intergenic space between genes, and no more than 6 intervening genes that lack SM-

558 associated domains. SM-associated PFAM domains were borrowed from Khaldi et al. (2010).

559 Gene cluster annotation with antiSMASH

560 All genomes were annotated using antiSMASH v4.2.0 by submitting genome assemblies

561 and GFF files to the public web server with options “use ClusterFinder algorithm for BGC

562 border prediction” and “smCOG analysis” (Blin, Wolf et al., 2017). antiSMASH reports all

563 genes within the borders of a predicted cluster as part of the cluster. For our analysis, we only

564 considered genes belonging to annotated smCOGs or signature biosynthetic gene families as part

565 of a given cluster and excluded all others, in order to obtain conservative, high confidence

566 estimations of cluster content based on genes of interest.

567 Sampling null homolog group-pair distributions

568 We created null distributions from which we could empirically estimate co-occurrence

569 probabilities by randomly sampling homolog group pairs without replacement from all

570 Dothideomycete genomes (Figure SA). Before beginning, we defined null distributions based on

571 two parameters: a range of sizes for the smallest homolog group in the pair, and a range of sizes

572 for the largest homolog group in the pair, where each range progressively incremented by 25

573 from 1-800 and all combinations of ranges were considered. For example, there existed a null

574 distribution for homolog group pairs where the smallest homolog group had between 26-50

575 members, and the largest homolog group had between 151-175 members. To begin, we randomly

576 sampled a genome and then randomly selected two genes within 6 genes of one another from that

577 genome. We retrieved the homolog groups to which those genes belonged, and then counted the

578 number of times members of each homolog group were found within 6 genes of each other

579 across all Dothideomycete genomes. We counted the number of members belonging to each

580 homolog group, excluding those that were found within 6 genes of the end of a contig, in order to

581 obtain a corrected size for each homolog group that accounted for variation in assembly quality.

582 The co-occurrence observation was then stored in the appropriate null distribution based on the

583 corrected sizes of each homolog group. For example, the number of co-occurrences of a sampled

584 homolog group pair where the smallest homolog group had a corrected size of 89, and the largest

585 homolog group had a corrected size of 732 would be placed in the null distribution where the

586 smallest size bin was 76-100 members, and the largest size bin was 726-750. All homolog

587 groups with more than 800 members were assigned to the 776-800 size bin. This sampling

588 procedure was repeated 500,000 times. After evaluating various bin sizes, we ultimately decided

589 to use a range of 25 because this resulted in the most even distribution of samples across all null

590 distributions. Due to variation in the number of homolog groups with any given size across our

591 dataset, it was not possible for all null distributions to contain the same number of samples.
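
As an illustration only, the size-binning rule described above can be sketched in Python. The function names `size_bin` and `null_key` are hypothetical and are not taken from the CO-OCCUR scripts; the sketch assumes bins of width 25 that start at 1 and that all corrected sizes above 800 fall in the 776-800 bin, as stated in the text.

```python
def size_bin(corrected_size, width=25, cap=800):
    """Return the (low, high) bin for a homolog group's corrected size.

    Assumption: bins are 1-25, 26-50, ..., 776-800, and any size above
    the cap is placed in the last bin.
    """
    if corrected_size < 1:
        raise ValueError("corrected size must be at least 1")
    size = min(corrected_size, cap)  # sizes above the cap share the last bin
    low = ((size - 1) // width) * width + 1
    return (low, low + width - 1)


def null_key(size_a, size_b):
    """Key of the null distribution for a homolog group pair:
    (bin of the smaller group, bin of the larger group)."""
    small, large = sorted((size_a, size_b))
    return (size_bin(small), size_bin(large))


# Worked example from the text: corrected sizes of 89 and 732.
example_key = null_key(732, 89)
print(example_key)
```

With the corrected sizes from the worked example (89 and 732), the key corresponds to the 76-100 and 726-750 bins.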

592 The CO-OCCUR pipeline

593 Current BGC detection algorithms first identify signature biosynthetic genes using profile

594 Hidden Markov Models (pHMMs) of genes known to participate in SM biosynthesis, and then

595 search predefined regions surrounding signature genes for co-located "accessory" biosynthetic,

596 regulatory, and transport genes. The approach of CO-OCCUR, in contrast, is to define genes of

597 interest based on whether they are ever found to have unexpectedly conserved syntenic

598 relationships with other genes in the vicinity of signature biosynthetic genes, agnostic of gene

599 function. Here, we used CO-OCCUR in conjunction with a preliminary SMURF analysis to

600 arrive at our final BGC annotations (Figure SA). We first took all SMURF BGC predictions and

601 extended their boundaries to genes within a 6 gene distance that belonged to homolog groups

602 found in another SMURF BGC, effectively “bootstrapping” the BGC annotations in order to

603 ensure consistent identification of BGC content across the various genomes. SMURF BGCs at

604 this point in the analysis were considered to consist of all genes found within the cluster’s

605 boundaries. For each pair of genes in each BGC (including signature biosynthetic genes), we

606 retrieved their homolog groups, and kept track of how many times that homolog group pair was

607 observed across all BGCs. Then, for each observed homolog group pair, we divided the number

608 of randomly sampled homolog group pairs in the appropriate null distribution (based on the

609 corrected sizes of the smallest and largest homolog groups within the observed pair, see above)

610 that had a number of co-occurrences greater than or equal to the observed number of co-

611 occurrences by the total number of samples in the null distribution. In doing so, we empirically

612 estimated the probability of observing a homolog group pair with at least that many co-

613 occurrences by chance, given the sizes of the homolog groups. In this way, we were able to take

614 into account the relative frequencies of each homolog group within a pair across all genomes

615 when assessing the probability of observing that pair’s co-occurrence. For example, if we

616 observed that homolog group 1 and homolog group 2 co-occurred 19 times within SMURF-

617 predicted BGCs, and that homolog group 1 had 57 members while homolog group 2 had 391

618 members, we would count the number of randomly sampled homolog group pairs that co-

619 occurred 19 or more times within the null distribution where the smallest homolog group size bin

620 was 51-75 and the largest homolog group size bin was 376-400, and then divided this by the total

621 number of samples in that same null distribution to obtain the probability of observing homolog

622 group 1 and homolog group 2’s co-occurrences by chance. All co-occurrences with an empirical

623 probability estimate of ≤0.05 were considered significant and retained for further analysis. In

624 order to decrease the risk of false positive error, we did not evaluate the probability of observing

625 any homolog group pairs with fewer than 5 co-occurrences, and also did not evaluate any homolog

626 group pairs whose corresponding null distribution had fewer than 10 samples.
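
The empirical probability estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `empirical_probability` and the null counts in the example are hypothetical, and the two thresholds (at least 5 co-occurrences, at least 10 null samples) are taken from the text.

```python
def empirical_probability(observed, null_counts,
                          min_cooccurrences=5, min_samples=10):
    """Fraction of null samples with at least `observed` co-occurrences.

    Returns None when the pair is not evaluated (too few observed
    co-occurrences, or too few samples in the matching null distribution).
    """
    if observed < min_cooccurrences or len(null_counts) < min_samples:
        return None
    as_extreme = sum(1 for count in null_counts if count >= observed)
    return as_extreme / len(null_counts)


# Hypothetical null distribution of 100 sampled pairs for one size-bin pair.
null_counts = [0] * 90 + [3] * 6 + [19, 20, 25, 40]
p = empirical_probability(19, null_counts)  # 4 of 100 samples are >= 19
significant = p is not None and p <= 0.05
print(p, significant)
```

In this made-up example, 4 of the 100 null samples have 19 or more co-occurrences, so the pair would be retained at the ≤0.05 threshold.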

627 Next, in order to obtain our final set of predicted BGCs, we took all homolog groups

628 found in significant co-occurrences, and conducted a de novo search in each genome for all

629 clusters containing genes belonging to those homolog groups within a 6 gene distance of each

630 other. In this way, all BGC clusters in our final set consisted of genes that belonged to these

631 homolog groups of interest, while all other intervening genes were not considered to be part of

632 the cluster. We treated homolog groups containing signature biosynthetic genes as we would any

633 other homolog group: if a signature gene predicted by SMURF was not a member of a homolog

634 group that was part of an unexpected co-occurrence, we did not consider it part of any cluster. We stress

635 that co-occurrences were only used to determine homolog groups of interest, but that once those

636 homolog groups were identified, they did not need to be part of an unexpected co-occurrence

637 within a predicted cluster in order to be considered part of that cluster. By focusing only on

638 genes that form unexpected co-occurrences, it is likely that we have underestimated the

639 compositional diversity of Dothideomycetes BGCs (but this may be the case for all cluster

640 detection algorithms; see Results).
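
A minimal sketch of the de novo search step is given below. It makes several assumptions that the text does not specify: the name `find_clusters` is hypothetical, "within a 6 gene distance" is interpreted as a difference of at most 6 in gene position along a contig, genes of interest are chained greedily, and a cluster must contain at least 2 genes of interest.

```python
def find_clusters(contig, groups_of_interest, max_distance=6):
    """Chain genes of interest that lie close together on one contig.

    contig: list of (gene_id, homolog_group) tuples in chromosomal order.
    Intervening genes outside `groups_of_interest` are skipped and are
    not reported as cluster members.
    """
    clusters, current, last_pos = [], [], None
    for pos, (gene_id, group) in enumerate(contig):
        if group not in groups_of_interest:
            continue
        # Start a new cluster when the gap to the previous gene of
        # interest exceeds the allowed distance (assumed interpretation).
        if last_pos is not None and pos - last_pos > max_distance:
            if len(current) >= 2:
                clusters.append(current)
            current = []
        current.append(gene_id)
        last_pos = pos
    if len(current) >= 2:
        clusters.append(current)
    return clusters


# Hypothetical contig: HG9 marks genes outside the groups of interest.
contig = ([("g0", "HG1"), ("g1", "HG9"), ("g2", "HG2")]
          + [("g%d" % i, "HG9") for i in range(3, 10)]
          + [("g10", "HG3"), ("g11", "HG1")])
found = find_clusters(contig, {"HG1", "HG2", "HG3"})
print(found)
```

In this example the gap between g2 and g10 exceeds the allowed distance, so two separate clusters are reported, each containing only genes from the homolog groups of interest.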

641 We then grouped all predicted BGCs into homologous cluster groups (cluster groups)

642 based on a minimum of 90% similarity in their gene content, rounded down, in order to obtain a

643 strict definition of BGC homology that increases the likelihood that homologous clusters encode

644 similar metabolic phenotypes. This meant that clusters with sizes ranging from 2-10 were

645 allowed to differ in at most 1 gene; clusters with sizes ranging from 11-20 were allowed to differ

646 in at most 2 genes, etc. Clusters that were not at least 90% similar to any other cluster in the

647 dataset were designated orphan clusters. Note that because there is no perfect way to determine

648 homology when using similarity-based metrics (e.g., a 10-gene cluster could be 90% similar to a

649 9-gene cluster, which in turn could be 90% similar to an 8-gene cluster, but that 8-gene cluster

650 cannot be 90% similar to the 10-gene cluster), we developed a heuristic approach for sorting

651 clusters into groups. First, we conducted an all-vs-all comparison of content similarity to sort all

652 clusters into preliminary groups by iterating through the clusters from largest to smallest, where

653 size equaled the number of unique homolog groups, and clusters could only be assigned to a

654 single group. Then, within each preliminary group, we identified clusters most similar to all

655 other clusters within the group and used them as references to which all other clusters were

656 compared during a new round of group assignment. In this final round, clusters were grouped

657 together with a given reference into a cluster group if they were at least 90% similar to it and

658 were classified as orphan clusters if they were not 90% similar to any references. The often-

659 unique compositions of clusters mean that in most cases there is no ambiguity in how the

660 clusters are classified; however, for a small number of clusters, especially those with fewer

661 genes, there may be some ambiguity as to which group they belong.
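
The 90% content-similarity criterion, and the reason it is not transitive, can be illustrated with a short sketch. The function name `is_homologous` is hypothetical, and the sketch assumes that similarity is measured relative to the larger of the two clusters, which reproduces the examples in the text but is not confirmed as the exact rule used.

```python
def is_homologous(cluster_a, cluster_b, percent=90):
    """Test whether two clusters share at least `percent` of their content.

    Clusters are collections of homolog group identifiers. The required
    number of shared groups is rounded down, so clusters of size 2-10 may
    differ in 1 group and clusters of size 11-20 in 2 groups.
    """
    a, b = set(cluster_a), set(cluster_b)
    larger = max(len(a), len(b))
    required = (percent * larger) // 100  # integer floor of 90% of the size
    return len(a & b) >= required


# The non-transitive example from the text, with nested hypothetical clusters.
ten = {"HG%d" % i for i in range(10)}
nine = {"HG%d" % i for i in range(9)}
eight = {"HG%d" % i for i in range(8)}
print(is_homologous(ten, nine), is_homologous(nine, eight),
      is_homologous(eight, ten))
```

Here the 10-gene and 9-gene clusters pass the test, as do the 9-gene and 8-gene clusters, while the 8-gene and 10-gene clusters do not, which is why a heuristic based on reference clusters is needed to assign groups.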

662 Annotation of BGCs and gene functions

663 In order to detect loci homologous to known BGCs in Dothideomycete genomes, amino

664 acid sequences of each annotated BGC within the MIBiG database (v1.4) were downloaded and

665 used as queries in a BLASTp search of all Dothideomycete proteomes (last accessed

666 04/01/2019). All hits with a bitscore ≥50 and an e-value ≤1×10⁻⁴ were retained, and clusters composed

667 of these hits were retrieved using a maximum of 6 intervening genes. In order to retain only

668 credible homologs of the annotated MIBiG queries and to account for error in BLAST searches

669 due to overlapping hits, we retained clusters with at least 3 genes that recovered at least 75% of

670 the genes in the initial query. This set of high confidence MIBiG BGCs was then compared to

671 the set of BGCs predicted by CO-OCCUR and antiSMASH to assess the ability of each

672 algorithm to recover homologs to known clusters. For each algorithm and each BGC recovered

673 using BLASTp to search the MIBiG database, we calculated percent recovery, defined as the

674 number of genes identified by the BLASTp search that were also identified as clustered by the

675 algorithm, divided by the size of the BGC identified by the BLASTp search, multiplied by 100.

676 We also calculated percent discovery, defined as the number of clustered genes identified by the

677 algorithm but not identified in the BLASTp search, divided by the size of the BGC identified by

678 the BLASTp search, multiplied by 100.
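
The two benchmarking metrics defined above reduce to simple set operations. The sketch below is illustrative only; the function names and gene identifiers are hypothetical.

```python
def percent_recovery(blast_genes, algorithm_genes):
    """Genes found by BLASTp that the algorithm also clustered,
    as a percentage of the BLASTp-identified BGC size."""
    blast, algo = set(blast_genes), set(algorithm_genes)
    return 100 * len(blast & algo) / len(blast)


def percent_discovery(blast_genes, algorithm_genes):
    """Clustered genes found by the algorithm but not by BLASTp,
    as a percentage of the BLASTp-identified BGC size."""
    blast, algo = set(blast_genes), set(algorithm_genes)
    return 100 * len(algo - blast) / len(blast)


# Hypothetical locus: BLASTp identifies 4 genes, the algorithm clusters 5.
blast_hits = {"geneA", "geneB", "geneC", "geneD"}
clustered = {"geneB", "geneC", "geneD", "geneE", "geneF"}
recovery = percent_recovery(blast_hits, clustered)
discovery = percent_discovery(blast_hits, clustered)
print(recovery, discovery)
```

Because both metrics share the same denominator, percent discovery can exceed 100 when an algorithm clusters many genes that the BLASTp search did not identify.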

679 In order to annotate BGCs recovered by CO-OCCUR with characterized clusters, we

680 used amino acid sequences of all signature biosynthetic genes in CO-OCCUR clusters as

681 BLASTp queries in a search of the MIBiG database (min. percent similarity=70%; max

682 e-value=1×10⁻⁴; min. high scoring pairs coverage=50%). Basing our annotations on percent amino

683 acid similarity to characterized signature biosynthetic genes rather than on the number of genes

684 with similarity to BGC genes enabled a more conservative and comprehensive approach, as

685 many BGC entries within the MIBiG database are not complete.

686 Proteins within predicted BGCs were annotated using eggNOG-mapper (Huerta-Cepas,

687 Forslund et al., 2017) based on fungal-specific fuNOG orthology data (Huerta-Cepas,

688 Szklarczyk, et al., 2015). Consensus annotations for all homolog groups were derived by

689 selecting the most frequent annotation among all members of the group.

690 Comparing BGC detection algorithms

691 In order to assess the relative performances of SMURF, antiSMASH and CO-OCCUR,

692 we compared all BGCs predicted by each method, and kept track of the genes within those BGCs

693 that were identified by either one or multiple methods. We summarized these findings in a Venn

694 diagram using the “eulerr” package in R (Larsson 2019). Note that for the purposes of this

695 analysis, BGCs predicted by SMURF and antiSMASH were considered to be composed only of

696 genes that matched a precomputed pHMM, and BGCs predicted by CO-OCCUR were composed

697 only of genes belonging to homolog groups that were part of unexpected co-occurrences, while

698 all other intervening genes within the BGC’s boundaries were not considered to be part of the

699 cluster. In doing so, we effectively ignored intervening genes that were situated between or were

700 immediately adjacent to these clustered genes of interest for the purposes of defining a cluster’s

701 content. While this approach likely does not capture the full diversity of cluster composition, it is

702 expected to decrease false positive error in BGC content prediction and represents a conservative

703 approach to identifying what genes make up a given cluster.

704 Construction of a co-occurrence network

705 We visualized relationships between homolog group pairs with unexpectedly large

706 numbers of co-occurrences in a network using Cytoscape v.3.4.0 (Shannon, Markiel et al. 2003).

707 The network layout was determined using the AllegroLayout plugin with the Allegro Spring-

708 Electric algorithm. In order to identify hub nodes within the network, we calculated betweenness

709 centrality, a measure of how many of the shortest paths within a network pass through a given node,

710 for each node using Cytoscape.

711 Assessment of cluster group phylogenetic signal

712 In order to quantify the dispersion of phylogenetic distributions of cluster groups

713 predicted by CO-OCCUR, we created a binary genome x cluster group matrix for all 239 cluster

714 groups with ≥4 genes that indicated the presence or absence of these cluster groups across all 101

715 genomes. We used this matrix in conjunction with the “phylo.d” function from the “caper”

716 package v1.0.1 in R (Orme, Freckleton et al. 2012) to calculate Fritz and Purvis’ D statistic for

717 each cluster group’s distribution, where D is a measurement of phylogenetic signal for a binary

718 trait obtained by calibrating the observed number of changes in a binary trait’s evolution across a

719 phylogeny by the mean sum of changes expected under two null models of binary trait evolution.

720 The first null model simulates the phylogenetic distribution expected under a model of random

721 trait inheritance, and the second simulates the phylogenetic distribution expected under a

722 threshold model of Brownian evolution that evolves a trait along the phylogeny under a

723 Brownian process where variation in that trait’s distribution accumulates at a rate proportional to

724 branch length (Fritz & Purvis 2010). D≈1 if the trait has phylogenetically random distribution;

725 D≈0 if the trait has a phylogenetic distribution that follows the Brownian model; D>1 if the trait

726 has a phylogenetic distribution that is less conserved, or over-dispersed, compared to the

727 Brownian model; D < 0 if the trait has a phylogenetic distribution that is more conserved, or

728 under-dispersed, compared to the Brownian model.
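
For reference, the scaling described above can be written compactly. The notation below follows our reading of Fritz & Purvis (2010) and is not given in this manuscript: Σd_obs is the observed sum of sister-clade differences for the binary trait, and the barred terms are the mean sums expected under the random (r) and Brownian threshold (b) null models.

```latex
D = \frac{\sum d_{\mathrm{obs}} - \overline{\sum d_{b}}}
         {\overline{\sum d_{r}} - \overline{\sum d_{b}}}
```

Under this scaling, D equals 1 when the observed sum matches the random expectation and 0 when it matches the Brownian expectation, consistent with the interpretation given above.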

729 Dissimilarity and Diversity Analyses

730 We created cluster group and orphan cluster x homolog group matrices in order to

731 determine the dissimilarity between predicted cluster groups. In these matrices, for each cluster

732 group or orphan cluster, we indicated the presence or absence of homolog groups in at least one

733 cluster within the cluster group or orphan cluster, effectively summarizing each cluster group and

734 orphan cluster by integrating over the content of all clusters assigned to that group. We next used

735 the matrix in conjunction with the “vegdist” function from the “vegan” package in R (Oksanen,

736 Blanchet et al. 2016) to create a Raup-Crick dissimilarity matrix that was visualized as a

737 dendrogram using complete linkage clustering as implemented in the “hclust” function from the

738 core “stats” package in R. These dendrograms were then used to assess the functional diversity

739 of BGC repertoires (e.g., in the Pleosporales) by measuring the total branch distance connecting

740 all cluster groups and orphan clusters within a given repertoire using the “treedive” function

741 from the “vegan” package in R.

742 We used the same procedure as above to calculate Sørensen dissimilarity between

743 Pleosporalean genomes based on their BGC repertoires, only this time using a genome x cluster

744 group and orphan cluster matrix that depicted the presence or absence of cluster groups and

745 orphan clusters across all 49 Pleosporalean genomes. We also used this matrix to calculate and

746 partition β diversity in Pleosporalean cluster repertoires using the “beta.sample” function

747 (index.family = "sorensen", sites = 10, samples = 999) from the “betapart” v1.4 package in R

748 (Baselga & Orme 2012) in order to determine how much of the observed diversity among

749 repertoires was due to gain/loss of cluster groups and orphan clusters, and how much was due to

750 nestedness. We also used the genome x cluster group and orphan cluster matrix to conduct a

751 rarefaction of cluster richness across Pleosporalean genomes using the “iNEXT” function (q = 0,

752 datatype="incidence_raw", endpoint=98) from the “iNEXT” package in R (Hsieh, Ma & Chao

753 2016).
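
To make the partitioning of β diversity concrete, the sketch below computes the pairwise Sørensen dissimilarity between two repertoires and splits it into a turnover (Simpson) component and a nestedness-resultant component, following the standard pairwise formulas of Baselga's framework. This is a simplification: the analysis described above used the multiple-site resampling in "beta.sample", not pairwise values, and the function name and repertoires here are hypothetical.

```python
def partition_sorensen(repertoire_a, repertoire_b):
    """Pairwise Sørensen dissimilarity split into turnover and nestedness.

    Repertoires are collections of cluster group / orphan cluster IDs.
    """
    a_set, b_set = set(repertoire_a), set(repertoire_b)
    if not a_set or not b_set:
        raise ValueError("both repertoires must be non-empty")
    shared = len(a_set & b_set)
    only_a = len(a_set - b_set)
    only_b = len(b_set - a_set)
    sorensen = (only_a + only_b) / (2 * shared + only_a + only_b)
    smaller = min(only_a, only_b)
    # Turnover is undefined (0/0) only when nothing is shared and one side
    # has no unique members, which cannot happen for non-empty sets.
    turnover = smaller / (shared + smaller)
    return {"sorensen": sorensen,
            "turnover": turnover,
            "nestedness": sorensen - turnover}


# Hypothetical repertoires: replacement of clusters versus pure subset.
replaced = partition_sorensen({"CG1", "CG2", "CG3", "CG4"},
                              {"CG3", "CG4", "CG5", "CG6"})
nested = partition_sorensen({"CG1", "CG2", "CG3", "CG4"}, {"CG1", "CG2"})
print(replaced, nested)
```

In the first example all dissimilarity is attributed to turnover, whereas in the second, where one repertoire is a strict subset of the other, all dissimilarity is attributed to nestedness.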

754 Abbreviations:

755 BGC - Biosynthetic Gene Cluster; pHMM - profile Hidden Markov Model; SM - Secondary

756 Metabolite; PKS - Polyketide Synthase; NRPS - Nonribosomal Peptide Synthetase; TC -

757 Terpene Cyclase; DMATS - Dimethylallyl Tryptophan Synthase.

758 Declarations:

759 Ethics approval and consent to participate

760 No human subjects were involved in the research

761 Consent for publication

762 No human data was used in the research

763 Availability of data and materials

764 All genome data is available at https://mycocosm.jgi.doe.gov/mycocosm/home

765 and described in Haridas et al. in press.

766 All scripts used in the analyses are available at https://github.com/egluckthaler/co-occur

767 Additional data generated in this study is included in the supplemental datafile

768 Funding

769 This work was supported by the National Science Foundation (DEB-1638999, JCS), the Fonds

770 de Recherche du Québec-Nature et Technologies (EG-T), and the Ohio State University

771 Graduate School (EG-T). The work conducted by the U.S. Department of Energy Joint Genome

772 Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the

773 U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

774 Authors' contributions

775 Formulated the study EG-T, KEB, JCS

776 Designed the methodology EG-T, JCS

777 Generated resources EG-T, PWC, IG, MB

778 Collected the data EG-T, SH

779 Analyzed the data EG-T

780 Provided leadership and/or mentorship in the study JWS, JCS

781 Prepared the manuscript EG-T, JCS

782 Contributed to the writing/editing of the manuscript KEB, PWC, JWS

783 Acknowledgements

784 Computational work by EG-T was conducted using the resources of the Ohio Supercomputer

785 Center.

786 Competing interests

787 The authors declare that they have no competing interests.

788 Figure legends:

789 Figure 1. CO-OCCUR pipeline. The pipeline used genome annotations from 101

790 Dothideomycetes, and previously computed homolog groups (HGs) consisting of both orthologs

791 and paralogs (Haridas et al. in press). Biosynthetic gene clusters (BGCs) were inferred by

792 determining unexpectedly distributed shared HG pairs, determined according to a null-

793 distribution of randomly sampled gene pairs in the same genomes, and then a search for all

794 clusters containing the HG pairs. The resulting BGCs were then either consolidated into cluster

795 groups (CGs) that share ~90% of gene content or labeled orphan clusters (OCs) if found only in a

796 single taxon. A detailed pipeline is presented in Figure SA.

797 Figure 2. Diversity of the largest detected secondary metabolite gene clusters across 101

798 Dothideomycetes. A maximum likelihood phylogenomic tree of 101 Dothideomycete species

799 (Haridas et al. in press) corresponds to rows in a heatmap (right) that depicts the number of

800 secondary metabolite clusters found in each genome, delimited by order (dotted line). Each

801 cluster is assigned to a homologous cluster group (cluster group; column) defined by at least

802 90% similarity at the composition level. Only cluster groups with ≥ 5 unique homolog groups per

803 cluster are shown. A complete linkage tree (top) depicts relationships among cluster groups,

804 where distance is proportional to the Raup-Crick dissimilarity in cluster group composition.

805 Cluster groups are colored according to their core signature biosynthetic genes, and cluster

806 groups with greater than 1 signature gene are left uncolored. Cluster groups with signature genes

807 ≥70% identical to characterized BGC signature genes in MIBiG are indicated by a labeled red

808 box.

809 Figure 3. Gene co-occurrence networks among biosynthetic signature gene clusters. a) Co-

810 occurrence network of gene homolog groups (homolog groups). Nodes in the co-occurrence

811 network represent all homolog groups found in homologous cluster groups (cluster groups).

812 Edges represent significant co-occurrences between homolog groups. Node size is proportional

813 to the number of significant co-occurrences involving that homolog group, and edge width is

814 proportional to the number of unique cluster types (either cluster groups or orphan clusters) with

815 ≥ 4 homolog groups that contain the co-occurrence. Distance between nodes is proportional to

816 the number of co-occurrences they have in common, adjusted by edge width. Signature genes

817 (colored circles) and transport-related function (squares) are indicated. Betweenness centrality

818 scores ≥0.2 are indicated in brackets for signature genes and eight other nodes (n1-8). Networks

819 1 and 2 are the two largest networks. b) Histogram of betweenness centrality scores for all nodes

820 in Networks 1 and 2 (bin width = 0.1). c) Significant co-occurrences within PKS and NRPS

821 clusters. Boxplots of homolog group co-occurrences involving signature genes (top) and non-

822 signature genes (bottom) across all polyketide synthase (PKS; green) and nonribosomal

823 peptide synthetase (NRPS; purple) clusters with ≥4 unique homolog groups. Boxplots

824 display the 75th percentile (top hinge), median (middle hinge), the 25th percentile (lower hinge),

825 and outliers (dots) determined by Tukey’s method. d) Diversity of PKS and NRPS clusters. A

826 line chart tracks the diversity of PKS and NRPS clusters across all cluster sizes for both PKS

827 (green) and NRPS (purple) clusters, where diversity is defined as the total number of unique

828 cluster types (either cluster group or orphan clusters) divided by the total number of clusters.

829 Figure 4. Benchmarking three different algorithms for biosynthetic gene cluster (BGC)

830 detection. a) Proportional Venn diagram of distinct and overlapping BGC genes of interest

831 detected by SMURF, antiSMASH and CO-OCCUR. SMURF and antiSMASH use profile Hidden

832 Markov Models (pHMMs) to identify clustered genes of interest, while CO-OCCUR uses

833 linkage-based criteria (see methods). Clustered genes (unbracketed) and secondary metabolism

834 biosynthesis, transport and catabolism clustered genes (fuNOG) detected are indicated for each

835 algorithm/combination. b) Complementary recovery of the cercosporin BGC using antiSMASH

836 and CO-OCCUR. Shading of genes in the Cercospora zeae-maydis cercosporin BGC (MIBiG

837 ID BGC0001541; recovered clusterID Cerzm1_BGC0001541_h92 in Table SG) indicates genes

838 identified by antiSMASH (blue), CO-OCCUR (yellow), or both algorithms (green). Gene names

839 are as in (de Jonge et al., 2018) and those required for cercosporin biosynthesis (Chen et al.,

840 2007; Newman et al., 2016; de Jonge et al., 2018) are indicated with an asterisk. c) Gene

841 recovery and discovery in clusters homologous to known BGCs. Scatterplots show the percent of

842 genes recovered (top) or discovered (bottom) by antiSMASH vs. CO-OCCUR at each locus

843 homologous to a MIBiG BGC (search criteria: minimum 3 gene cutoff; minimum of 75% genes

844 similar to MIBiG BGC genes in locus). Percent recovery is defined as the number of genes

845 identified by BLASTp in an algorithm-identified cluster divided by the size of the BLASTp

846 identified BGC, multiplied by 100. Percent discovery is defined as the number of genes

847 identified by the cluster algorithm but not identified in the BLASTp search, divided by the size

848 of the BLASTp identified BGC, multiplied by 100. The dotted reference line marks y = x.
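The percent recovery and percent discovery defined in panel c) can be illustrated with a minimal Python sketch. It assumes each cluster is represented as a set of gene identifiers. The function name and the example gene IDs are hypothetical and are not taken from the study's pipeline.

```python
def percent_recovery_discovery(blast_bgc, algorithm_cluster):
    """Return (percent recovery, percent discovery) for one locus.

    blast_bgc: set of gene IDs in the BLASTp-identified BGC.
    algorithm_cluster: set of gene IDs in the cluster called by the algorithm.
    """
    if not blast_bgc:
        raise ValueError("BLASTp-identified BGC must contain at least one gene")
    size = len(blast_bgc)
    recovered = len(blast_bgc & algorithm_cluster)   # genes found by both
    discovered = len(algorithm_cluster - blast_bgc)  # genes only the algorithm found
    return 100.0 * recovered / size, 100.0 * discovered / size


# Hypothetical locus: 4 BLASTp-identified genes; the algorithm calls 3 of them plus 2 others.
blast_bgc = {"g1", "g2", "g3", "g4"}
algorithm_cluster = {"g2", "g3", "g4", "g5", "g6"}
recovery, discovery = percent_recovery_discovery(blast_bgc, algorithm_cluster)
print(recovery, discovery)  # 75.0 50.0
```

Because both quantities share the size of the BLASTp-identified BGC as denominator, percent discovery can exceed 100 when an algorithm adds more genes than the reference locus contains.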

849 Figure 5. Phylogenetic and ecological signal in the distributions of homologous cluster

850 groups (cluster groups). a) Scatterplot of phylogenetic and ecological signal of cluster groups.

851 Values along the X-axis correspond to Fritz and Purvis’ D statistic, representing phylogenetic

852 signal in a cluster group’s distribution. Distributions of cluster groups with D<0 are more

853 conserved compared to a Brownian model of trait evolution, and distributions of cluster groups

854 with D>1 are considered over-dispersed. Cluster groups with more pathotrophs than saprotrophs

855 have Y >0 while cluster groups with more saprotrophs than pathotrophs have Y <0. Points

856 representing cluster group distributions with probability ≤0.05 of Brownian trait evolution are in

857 black, while those >0.05 are in gray. Cluster groups with P(Brownian) ≤0.05 and a lifestyle ratio

858 ≥2 are labeled and described in b) and c). Only cluster groups with ≥ 4 unique gene homolog

859 groups (homolog groups) per cluster are shown. b) Summary descriptions of labeled cluster

860 groups. Sig. genes = signature genes present in the cluster group; Size = number of unique

861 homolog groups in the cluster group reference cluster; Freq. = number of fungi with a cluster that

862 belongs to the cluster group; Best known signature(s) = signature gene(s) from the MIBiG

863 database with the highest similarity to signature genes from the cluster group, with average

864 percentage similarity shown in parentheses. c) Phylogenetic distributions of labeled cluster

865 groups. Presence (black cells) and absence (gray cells) matrix of clusters assigned to each

866 labeled cluster group across Dothideomycetes genomes; tree as in Figure 2.

867 Figure 6. Three examples of homologous cluster groups (cluster groups) with conserved

868 phylogenetic distributions. a) Cluster group distributions. Presence (black cells) and absence

869 (gray cells) matrix of clusters assigned to various cluster groups (columns a-j, described in part

870 b), across Dothideomycetes genomes; tree as in Figure 2. Each matrix contains distinct sets of

871 cluster groups that are separated by ≤0.05 distance units on the complete linkage tree in Figure 2.

872 The number of fungi with each cluster group is indicated at the bottom of each column. b)

873 Cluster group composition. Cluster groups in set 1 are predicted to encode DHN melanin

874 biosynthesis; set 2 contains unknown cluster groups with NRPS-like signature genes; set 3

875 contains unknown cluster groups with PKS signature genes, where the PKSs from group 16 are

876 on average 84% similar to the PKS in the characterized alternapyrone cluster (MIBiG ID:

877 BGC0000012). Homolog group presence in a given cluster group is indicated by a gray box

878 below the description. Brackets surround homolog groups present in <50% of clusters assigned

879 to a given cluster group.

880 Figure 7. Diversity of secondary metabolite gene cluster repertoires in Pleosporalean fungi.

881 a) Grouping of fungi based on the combinations of gene clusters (i.e., cluster repertoires) found

882 in their genomes. Shown to the left is a complete linkage tree where distance between different

883 fungal species is proportional to the Sorensen dissimilarity between their cluster repertoires. To

884 the right is a presence (black) and absence (white) matrix where each column represents a unique

885 cluster type (either a homologous cluster group or an orphan cluster) and each row corresponds to

886 the adjacent fungal genome. On top of the heatmap is a complete linkage tree displaying

887 relationships between unique cluster types, where distance is proportional to the Raup-Crick

888 dissimilarity in cluster composition. b) Relationship between cluster repertoire size and cluster

889 repertoire diversity. Cluster repertoire diversity was calculated for each genome by finding the

890 total branch length on the Raup-Crick dissimilarity tree in a) associated with the set of clusters

891 found in that genome. Cluster repertoire diversity thus measures the diversity of a given genome’s

892 cluster repertoire in terms of the gene content of its clusters. A solid line models the linear

893 relationship between repertoire size and diversity (adj. R² = 0.855). The shaded area around the

894 line represents the 95% confidence interval associated with the model. c) Sampled and projected

895 secondary metabolite cluster richness within the Pleosporales. Rarefied (solid lines) and

896 extrapolated (dotted lines) estimates of secondary metabolite gene cluster richness (i.e., the

897 number of unique cluster types) with respect to the number of sampled genomes are shown for

898 the Pleosporales. Shaded areas represent the 95% confidence intervals for both estimate types,

899 derived from 100 bootstrap replicates. All three graphs were generated using data from the 318

900 unique cluster types with ≥ 4 unique gene homolog groups that are associated with 47

901 Pleosporalean fungi and 2 as yet unclassified fungi found within the Pleosporalean clade on the

902 phylogenomic species tree in Figure 2.
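Two quantities used in this figure can be sketched in a few lines of Python: the Sorensen dissimilarity between two cluster repertoires (panel a) and the total branch length associated with one repertoire (panel b). The sketch assumes that each repertoire is a set of cluster-type labels and that the cluster-type tree is stored as a map from each node to its parent and branch length. It also assumes that branches are summed all the way to the root. These representations, all names, and the toy values are hypothetical and are not the study's own code.

```python
def sorensen_dissimilarity(rep_a, rep_b):
    """Sorensen dissimilarity between two presence/absence repertoires (sets)."""
    total = len(rep_a) + len(rep_b)
    if total == 0:
        return 0.0  # convention chosen here for two empty repertoires
    return 1.0 - 2.0 * len(rep_a & rep_b) / total


def repertoire_diversity(repertoire, tree):
    """Sum the lengths of all distinct branches linking the repertoire's
    cluster types to the root (assumption: root-inclusive total branch length).

    tree maps each node to (parent, branch_length); the root is absent from the map.
    """
    visited = set()
    total = 0.0
    for node in repertoire:
        # Walk towards the root, stopping once a branch was already counted.
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


# Toy tree of cluster types: ((A:0.25, B:0.25)n1:0.5, C:0.75)root
tree = {"A": ("n1", 0.25), "B": ("n1", 0.25), "n1": ("root", 0.5), "C": ("root", 0.75)}
genome_1 = {"A", "B"}
genome_2 = {"B", "C"}

print(sorensen_dissimilarity(genome_1, genome_2))  # 0.5
print(repertoire_diversity(genome_1, tree))        # 1.0
print(repertoire_diversity(genome_2, tree))        # 1.5
```

In this toy case both genomes hold two cluster types, but the second genome spans more of the tree and therefore scores a higher repertoire diversity.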

903

904 Supporting information:

905 Table SA. Gene homolog groups (homolog groups) that are part of unexpected co-occurrences.

906 Table SB. Unexpected co-occurrences between gene homolog groups (homolog groups)

907 occurring in the vicinity of signature biosynthetic genes, and their frequency across different SM

908 classes.

909 Table SC. Positional information of all recovered CO-OCCUR clusters.

910 Table SD. Genomes used in this study.

911 Table SE. Cluster types (groups and orphans) detected by CO-OCCUR.

912 Table SF. BLAST-based annotation of CO-OCCUR clusters with known signature biosynthetic

913 genes from the MIBiG database.

914 Table SG. Cross-referencing clusters retrieved by CO-OCCUR, antiSMASH, and BLAST

915 searches of MIBiG database to determine percent recovery and discovery.

916 Table SH. Positional information of all recovered antiSMASH clusters.

917 Table SI. Cluster types (groups and orphans) detected by antiSMASH.

918 Table SJ. Positional information of all recovered SMURF clusters.

919 Table SK. Cluster types (groups and orphans) detected by SMURF.

920 Table SL. Overlapping and complementary recovery of clustered genes of interest using

921 antiSMASH, SMURF, and CO-OCCUR.

922 Table SM. Positional information of all clusters recovered with a BLASTp search of the MIBiG

923 database.

924

925 References:

Agrawal AA, Hastings AP, Johnson MT, Maron JL, Salminen JP. Insect herbivores drive real-time ecological and evolutionary change in plant populations. Science. 2012;338:6103.

Akimitsu K, Tsuge T, Kodama M, Yamamoto M, Otani H. Alternaria host-selective toxins: determinant factors of plant disease. Journal of General Plant Pathology. 2014;80:2.

Ali A, Caggia S, Matesic DF, Khan SA. Chaetoglobosin K, an Akt pathway inhibitor, prevents proliferation and migration of prostate carcinoma cells. 2015.

Baselga A, Orme CDL. betapart: an R package for the study of beta diversity. Methods in ecology and evolution. 2012;3:5.

Baselga A. The relationship between species replacement, dissimilarity derived from nestedness, and nestedness. Global Ecology and Biogeography. 2012;21:12.

Beimforde C, Feldberg K, Nylinder S, Rikkinen J, Tuovila H, Dörfelt H, Gube M, Jackson DJ, Reitner J, Seyfullah LJ. Estimating the Phanerozoic history of the lineages: combining fossil and molecular data. Molecular phylogenetics and evolution. 2014;78.

Blin K, Wolf T, Chevrette MG, Lu X, Schwalen CJ, Kautsar SA, ... Medema MH. antiSMASH 4.0—improvements in chemistry prediction and gene cluster boundary identification. Nucleic acids research. 2017;45:W1.

Bok JW, Chung D, Balajee SA, Marr KA, Andes D, Nielsen KF, ... Keller NP. GliZ, a transcriptional regulator of gliotoxin biosynthesis, contributes to Aspergillus fumigatus virulence. Infection and immunity. 2006;74:12.

Bradshaw RE, Slot JC, Moore GG, Chettri P, de Wit PJ, Ehrlich KC, ... Cox MP. Fragmentation of an aflatoxin-like gene cluster in a forest pathogen. New Phytologist. 2013;198:2.

Chen H, Lee MH, Daub ME, Chung KR. Molecular analysis of the cercosporin biosynthetic gene cluster in Cercospora nicotianae. Molecular microbiology. 2007;64:3.

Choudoir MJ, Pepe-Ranney C, Buckley DH. Diversification of secondary metabolite biosynthetic gene clusters coincides with lineage divergence in Streptomyces. 2018;7:1.

Cimermancic P, Medema MH, Claesen J, Kurita K, Brown LCW, Mavrommatis K, ... Birren BW. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell. 2014;158:2.

Ciuffetti LM, Manning VA, Pandelova I, Betts MF, Martinez JP. Host-selective toxins, Ptr ToxA and Ptr ToxB, as necrotrophic effectors in the Pyrenophora tritici-repentis– interaction. New Phytologist. 2010 Sep;187(4):911-9.

Coleman JJ, Mylonakis E. Efflux in fungi: la piece de resistance. PLoS Pathogens. 2009;5:6.

Condon BJ, Elliott C, González JB, Yun SH, Akagi Y, Wiesner-Hanks T, Kodama M, Turgeon BG. Clues to an evolutionary mystery: the genes for T-Toxin, enabler of the devastating 1970 Southern corn leaf blight epidemic, are present in ancestral species, suggesting an ancient origin. Molecular plant-microbe interactions. 2018 Nov 12;31(11):1154-65.

Crameri R, Garbani M, Rhyner C, Huitema C. Fungi: the neglected allergenic sources. Allergy. 2014;69:2.

Daly J. The host-specific toxins of Helminthosporia. Plant Infection: The Physiological and Biochemical Basis. Asada Y, Bushnell WR, Ouchi S, and Vance CP. Berlin, Springer-Verlag. 1982.

De Jonge R, Ebert MK, Huitt-Roehl CR, Pal P, Suttle JC, Spanner RE, ... Thomma BP. Gene cluster conservation provides insight into cercosporin biosynthesis and extends production to the genus Colletotrichum. Proceedings of the National Academy of Sciences. 2018;115:24.

Del Carratore F, Zych K, Cummings M, Takano E, Medema MH, Breitling R. Computational identification of co-evolving multi-gene modules in microbial biosynthetic gene clusters. Communications Biology. 2019;2:1.

Dhillon B, Feau N, Aerts AL, Beauseigle S, Bernier L, Copeland A, ... LaButti KM. Horizontal gene transfer and gene dosage drives adaptation to wood colonization in a tree pathogen. Proceedings of the National Academy of Sciences. 2015;112:11.

Firn RD, Jones CG. Natural products–a simple model to explain chemical diversity. Natural product reports. 2003;20:4.

Fritz SA, Purvis A. Selectivity in mammalian extinction risk and threat types: a new measure of phylogenetic signal strength in binary traits. Conservation Biology. 2010;24:4.

Fujii I, Yoshida N, Shimomaki S, Oikawa H, Ebizuka Y. An iterative type I polyketide synthase PKSN catalyzes synthesis of the decaketide alternapyrone with regio-specific octa-methylation. Chemistry & biology. 2005;12:12.

Gardiner DM, Cozijnsen AJ, Wilson LM, Pedras MS, Howlett BJ. The sirodesmin biosynthetic gene cluster of the plant pathogenic Leptosphaeria maculans. Molecular microbiology. 2004 Sep;53(5):1307-18.

Gardiner DM, Waring P, Howlett BJ. The epipolythiodioxopiperazine (ETP) class of fungal toxins: distribution, mode of action, functions and biosynthesis. Microbiology. 2005;151:4.

Glassmire AE, Jeffrey CS, Forister ML, Parchman TL, Nice CC, Jahner JP, ... Leonard MD. Intraspecific phytochemical variation shapes community and population structure for specialist caterpillars. New Phytologist. 2016;212:1.

Gluck-Thaler E, Slot JC. Specialized plant biochemistry drives gene clustering in fungi. The ISME journal. 2018;12:7.

Gluck-Thaler E, Vijayakumar V, Slot JC. Fungal adaptation to plant defences through convergent assembly of metabolic modules. Molecular ecology. 2018;27:24.

Goodwin SB, M'Barek SB, Dhillon B, Wittenberg AH, Crane CF, Hane JK, ... Antoniw J. Finished genome of the fungal wheat pathogen Mycosphaerella graminicola reveals dispensome structure, chromosome plasticity, and stealth pathogenesis. PLoS genetics. 2011;7:6.

Grandaubert J, Lowe RG, Soyer JL, Schoch CL, Van de Wouw AP, Fudal I, ... Linglin J. Transposable element-assisted evolution and adaptation to host plant within the Leptosphaeria maculans-Leptosphaeria biglobosa species complex of fungal pathogens. BMC genomics. 2014;15:1.

Grigoriev IV, Nikitin R, Haridas S, Kuo A, Ohm R, Otillar R, Riley R, Salamov A, Zhao X, Korzeniewski F, Smirnova T. MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic acids research. 2014 Jan 1;42(D1):D699-704.

Hane JK, Rouxel T, Howlett BJ, Kema GH, Goodwin SB, Oliver RP. A novel mode of chromosomal evolution peculiar to filamentous Ascomycete fungi. Genome biology. 2011;12:5.

Haridas S, Albert R, Binder M, Bloem J, LaButti K, Salamov A, Andreopoulos B, Baker SE, Barry K, Bills G, Bluhm BH, Cannon C, Castanera R, Culley DE, Daum C, Ezra D, González JB, Henrissat B, Kuo A, Liang C, Lipzen A, Lutzoni F, Magnuson J, Mondo S, Nolan M, Ohm RA, Pangilinan J, Park H-J, Ramírez L, Alfaro M, Sun H, Tritt A, Yoshinaga Y, Zwiers L-H, Turgeon BG, Goodwin SB, Spatafora JW, Crous PW, Grigoriev IV. 101 Dothideomycetes genomes: a test case for predicting lifestyles and emergence of pathogens. Studies in Mycology, in press.

Hartmann FE, McDonald BA, Croll D. Genome-wide evidence for divergent selection between populations of a major agricultural pathogen. Molecular ecology. 2018;27:12.

Hoffmeister D, Keller NP. Natural products of filamentous fungi: enzymes, genes, and their regulation. Natural product reports. 2007;24:2.

Holeski LM, Hillstrom ML, Whitham TG, Lindroth RL. Relative importance of genetic, ontogenetic, induction, and seasonal variation in producing a multivariate defense phenotype in a foundation tree species. Oecologia. 2012;170:3.

Holeski LM, Keefover-Ring K, Bowers MD, Harnenz ZT, Lindroth RL. Patterns of phytochemical variation in Mimulus guttatus (yellow monkeyflower). Journal of chemical ecology. 2013;39:4.

Horn BW. Ecology and population biology of aflatoxigenic fungi in soil. Journal of Toxicology-Toxin Reviews. 2003;22:2-3.

Hsieh TC, Ma KH, Chao A. iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods in Ecology and Evolution. 2016;7:12.

Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Von Mering C, Bork P. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Molecular biology and evolution. 2017;34:8.

Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, ... Jensen LJ. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research. 2016;44:D1.

Jiang C, Song J, Zhang J, Yang Q. New production process of the antifungal chaetoglobosin A using cornstalks. Brazilian journal of microbiology. 2017;48:3.

Kasahara K, Miyamoto T, Fujimoto T, Oguri H, Tokiwano T, Oikawa H, ... Fujii I. Solanapyrone synthase, a possible Diels–Alderase and iterative type I polyketide synthase encoded in a biosynthetic gene cluster from Alternaria solani. ChemBioChem. 2010;11:9.

Kaur S. Phytotoxicity of solanapyrones produced by the fungus Ascochyta rabiei and their possible role in blight of chickpea (Cicer arietinum). Plant Science. 1995;109:1.

Keller NP. Translating biosynthetic gene clusters into fungal armor and weaponry. Nature chemical biology. 2015;11:9.

Khaldi N, Seifuddin FT, Turner G, Haft D, Nierman WC, Wolfe KH, Fedorova ND. SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genetics and Biology. 2010;47:9.

Kurmayer R, Blom JF, Deng L, Pernthaler J. Integrating phylogeny, geographic niche partitioning and secondary metabolite synthesis in bloom-forming Planktothrix. The ISME journal. 2015;9:4.

Kursar TA, Dexter KG, Lokvam J, Pennington RT, Richardson JE, Weber MG, ... Coley PD. The evolution of antiherbivore defenses and their contribution to species coexistence in the tropical tree genus Inga. Proceedings of the National Academy of Sciences. 2009;106:43.

Larsson J. Eulerr: Area-Proportional Euler and Venn Diagrams with Ellipses. 2018. R package version 3.

Li D, Baldwin IT, Gaquerel E. Navigating natural variation in herbivory-induced secondary metabolism in coyote tobacco populations using MS/MS structural analysis. Proceedings of the National Academy of Sciences. 2015;112:30.

Lind AL, Wisecaver JH, Lameiras C, Wiemann P, Palmer JM, Keller NP, ... Rokas A. Drivers of genetic diversity in secondary metabolic gene clusters within a fungal species. PLoS biology. 2017;15:11.

Manning VA, Pandelova I, Dhillon B, Wilhelm LJ, Goodwin SB, Berlin AM, ... Holman WH. Comparative genomics of a plant-pathogenic fungus, Pyrenophora tritici-repentis, reveals transduplication and the impact of repeat elements on pathogenicity and population divergence. G3: Genes, Genomes, Genetics. 2013;3:1.

Menke J, Dong Y, Kistler HC. Fusarium graminearum Tri12p influences virulence to wheat and trichothecene accumulation. Molecular plant-microbe interactions. 2012;25:11.

Mizushina Y, Kamisuki S, Kasai N, Shimazaki N, Takemura M, Asahara H, ... Sugawara F. A plant phytotoxin, solanapyrone A, is an inhibitor of DNA polymerase β and λ. Journal of Biological Chemistry. 2002;277:1.

Newman AG, Townsend CA. Molecular characterization of the cercosporin biosynthetic pathway in the fungal plant pathogen Cercospora nicotianae. Journal of the American Chemical Society. 2016;138:12.

Nielsen JC, Grijseels S, Prigent S, Ji B, Dainat J, Nielsen KF, ... Nielsen J. Global analysis of biosynthetic gene clusters reveals vast potential of secondary metabolite production in Penicillium species. Nature microbiology. 2017;2:6.

Ohm RA, Feau N, Henrissat B, Schoch CL, Horwitz BA, Barry KW, Condon BJ, Copeland AC, Dhillon B, Glaser F, Hesse CN. Diverse lifestyles and strategies of plant pathogenesis encoded in the genomes of eighteen Dothideomycetes fungi. PLoS Pathogens. 2012 Dec;8(12).

Oksanen J, Kindt R, Legendre P, O’Hara B, Stevens MH, Oksanen MJ, Solymos P, Wagner H. The vegan package. Community ecology package. 2007;10.

Olarte RA, Menke J, Zhang Y, Sullivan S, Slot JC, Huang Y, ... Bushley KE. Chromosome rearrangements shape the diversification of secondary metabolism in the cyclosporin producing fungus Tolypocladium inflatum. BMC genomics. 2019;20:1.

Oliver RP, Friesen TL, Faris JD, Solomon PS. Stagonospora nodorum: from pathology to genomics and host resistance. Annual review of phytopathology. 2012;50.

Orme D, Freckleton R, Thomas G, Petzoldt T, Fritz S, Isaac N, ... Pearse W. Caper: comparative analyses of phylogenetics and evolution in R. R package version 0.5. 2012.

Pandelova I, Figueroa M, Wilhelm LJ, Manning VA, Mankaney AN, Mockler TC, Ciuffetti LM. Host-selective toxins of Pyrenophora tritici-repentis induce common responses associated with host susceptibility. PLoS One. 2012;7:7.

Patron NJ, Waller RF, Cozijnsen AJ, Straney DC, Gardiner DM, Nierman WC, Howlett BJ. Origin and distribution of epipolythiodioxopiperazine (ETP) gene clusters in filamentous ascomycetes. BMC Evolutionary Biology. 2007;7:1.

Proctor RH, McCormick SP, Kim HS, Cardoza RE, Stanley AM, Lindo L, ... Alexander NJ. Evolution of structural diversity of trichothecenes, a family of toxins produced by plant pathogenic and entomopathogenic fungi. PLoS pathogens. 2018;14:4.

Reynolds HT, Vijayakumar V, Gluck-Thaler E, Korotkin HB, Matheny PB, Slot JC. Horizontal gene cluster transfer increased hallucinogenic mushroom diversity. Evolution letters. 2018;2:2.

Ruibal C, Gueidan C, Selbmann L, Gorbushina AA, Crous PW, Groenewald JZ, ... Staley JT. Phylogeny of rock-inhabiting fungi related to Dothideomycetes. Studies in Mycology. 2009;64.

Schoch CL, Crous PW, Groenewald JZ, Boehm EWA, Burgess TI, De Gruyter J, ... Harada Y. A class-wide phylogenetic assessment of Dothideomycetes. Studies in mycology. 2009;64.

Schümann J, Hertweck C. Molecular basis of cytochalasan biosynthesis in fungi: gene cluster analysis and evidence for the involvement of a PKS-NRPS hybrid synthase by RNA silencing. Journal of the American Chemical Society. 2007;129:31.

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, ... Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research. 2003;13:11.

Shi-Kunne X, Faino L, van den Berg GC, Thomma BP, Seidl MF. Evolution within the fungal genus Verticillium is characterized by chromosomal rearrangement and gene loss. Environmental microbiology. 2018;20:4.

Slot JC. Fungal gene cluster diversity and evolution. In Advances in genetics (Vol. 100, pp. 141-178). Academic Press. 2017.

Slot JC, Gluck-Thaler E. Metabolic gene clusters, fungal diversity, and the generation of accessory functions. Current opinion in genetics & development. 2019;58.

Song W, Qiao X, Chen K, Wang Y, Ji S, Feng J, ... Ye M. Biosynthesis-based quantitative analysis of 151 secondary metabolites of licorice to differentiate medicinal Glycyrrhiza species and their hybrids. Analytical chemistry. 2017;89:5.

Spatafora JW, Bushley KE. Phylogenomics and evolution of secondary metabolism in plant-associated fungi. Current opinion in plant biology. 2015;26.

Spatafora JW, Owensby CA, Douhan GW, Boehm EW, Schoch CL. Phylogenetic placement of the ectomycorrhizal genus Cenococcum in (Dothideomycetes). Mycologia. 2012;104:3.

Suetrong S, Boonyuen N, Pang KL, Ueapattanakit J, Klaysuban A, Sri-indrasutdhi V, ... Jones EG. A taxonomic revision and phylogenetic reconstruction of the Jahnulales (Dothideomycetes), and the new family Manglicolaceae. Fungal Diversity. 2011;51:1.

Theobald S, Vesth TC, Rendsvig JK, Nielsen KF, Riley R, de Abreu LM, ... Hoof JB. Uncovering secondary metabolite evolution and biosynthesis using gene cluster networks and genetic dereplication. Scientific reports. 2018;8:1.

Turgeon BG, Baker SE. Genetic and genomic dissection of the Cochliobolus heterostrophus Tox1 locus controlling biosynthesis of the polyketide virulence factor T-toxin. Advances in genetics. 2007;57.

Vesth TC, Nybo JL, Theobald S, Frisvad JC, Larsen TO, Nielsen KF, ... Gladden JM. Investigation of inter-and intraspecies variation through genome sequencing of Aspergillus section Nigri. Nature Genetics. 2018;50:12.

Villani A, Proctor RH, Kim HS, Brown DW, Logrieco AF, Amatulli MT, ... Susca A. Variation in secondary metabolite production potential in the Fusarium incarnatum-equiseti species complex revealed by comparative analysis of 13 genomes. BMC genomics. 2019;20:1.

Walton JD. Host-selective toxins: Agents of compatibility. Plant Cell. 1996;8:10.

Walton JD, Panaccione DG. Host-selective toxins and disease specificity: perspectives and progress. Annual review of phytopathology. 1993;31:1.

Wang G, Liu Z, Lin R, Li E, Mao Z, Ling J, ... Xie B. Biosynthesis of leucinostatins in bio-control fungus Purpureocillium lilacinum and their inhibition on Phytophthora revealed by genome mining. PLoS pathogens. 2016;12:7.

Wang JS, Tang L. Epidemiology of aflatoxin exposure and human liver cancer. Journal of Toxicology: Toxin Reviews. 2004;23:2-3.

Weinhold A, Ullah C, Dressel S, Schoettner M, Gase K, Gaquerel E, ... Baldwin IT. O-acyl sugars protect a wild tobacco from both native fungal pathogens and a specialist herbivore. Plant physiology. 2017;174:1.

Wight WD, Kim KH, Lawrence CB, Walton JD. Biosynthesis and role in virulence of the histone deacetylase inhibitor depudecin from Alternaria brassicicola. Molecular plant-microbe interactions. 2009;22:10.

Wijayawardene NN, Hyde KD, Lumbsch HT, Liu JK, Maharachchikumbura SS, Ekanayaka AH, ... Phookamsak R. Outline of ascomycota: 2017. Fungal Diversity. 2018;88:1.

Wisecaver JH, Slot JC, Rokas A. The evolution of fungal metabolic pathways. PLoS Genetics. 2014;10:12.

Wolpert TJ, Dunkle LD, Ciuffetti LM. Host-selective toxins and avirulence determinants: what's in a name?. Annual review of phytopathology. 2002;40:1.

Xu YJ, Luo F, Li B, Shang Y, Wang C. Metabolic conservation and diversification of Metarhizium species correlate with fungal host-specificity. Frontiers in microbiology. 2016;7.

Ziemert N, Lechner A, Wietz M, Millán-Aguiñaga N, Chavarria KL, Jensen PR. Diversity and evolution of secondary metabolism in the marine actinomycete genus Salinispora. Proceedings of the National Academy of Sciences. 2014;111:12.

Züst T, Heichinger C, Grossniklaus U, Harrington R, Kliebenstein DJ, Turnbull LA. Natural enemies drive geographic variation in plant defenses. Science. 2012;338:6103.

Figure SA. [Flowchart of the CO-OCCUR pipeline; image not reproduced.] Input: annotated genomes with genes clustered into homolog groups (HGs). Null model pipeline: randomly sample 2 genes occurring within 6 genes of each other; retrieve the HG to which each gene belongs and the number of genes in each HG; repeat 500,000 times (without replacement); bin sampled HG pairs into categories based on the total number of genes in each HG, giving random distributions of HG co-occurrences. Secondary metabolite (SM) clustering pipeline: predict initial SM clusters using SMURF; extend SM clusters to contain neighboring genes (within a 6 gene distance) belonging to an HG found in another SM cluster; count the number of times members of each HG are found within 6 genes of each other across the complete dataset (101 Dothideomycete genomes); extract HGs present in SMURF clusters that co-occur an unexpected number of times across all Dothideomycete genomes compared with the random distribution (unexpected HG co-occurrences); conduct a de novo search for all clusters consisting of genes belonging to HGs that are part of unexpected co-occurrences (new SM clusters, 3399 total); group together new SM clusters that have ~90% of their genes in common (719 total). Clusters with similar clusters form cluster groups (CGs) and those without form orphan clusters (OCs); the flowchart gives totals of 422 and 297 for these two categories. Downstream steps: BLAST CGs and OCs against a known cluster database (MIBiG, local database) to obtain MIBiG-annotated CGs and OCs; count the frequency of all unexpected HG co-occurrences across all CGs (co-occurrence network of HGs); count the number of CGs per genome (average 33.7; CG distribution across all genomes); group together genomes based on Raup-Crick dissimilarity in their CG repertoire (multi-cluster profiles).

Figure SB

antiSMASH 5416 (887)

1408 (200*) 3326 (1301*) 3873 (1468*)

smurf co-occur 1553 1254 6051 (290) (200) (796)

[Legend: Number of proteins (Proteins participating in SM biosynthesis, transport and catabolism); * = p(enriched) < 0.01, Bonferroni corrected]

Figure SC
[Figure: phylogeny of the sampled genomes, with tips labeled by order or class and by species or strain (scale bar 0.10), shown alongside columns labeled by cluster group in the form type_groupN_sM, for example DMAT+NRPS_group30_s9 and PKS_group67_s7.]

Figure SD
[Figure: a phylogeny with the same tip labels (scale bar 0.10), shown alongside columns labeled by cluster group, for example NRPS.Like.PKS_group60_s7; the legend gives Fritz and Purvis' D with scale values 0, 1, and 2.5.]

Figure SE Recovery of 87 melanin cluster loci

[Plot: y-axis, % total proteins recovered at locus (0.00 to 1.00); x-axis, cluster detection algorithm (co-occur, SMURF, antiSMASH).]

Figure SF Phylogenetic distance vs. dissimilarity in cluster repertoire in the Pleosporales

[Plot: y-axis, Sørensen dissimilarity (0.25 to 1.00); x-axis, phylogenetic distance (0.0 to 0.4).]

Figure SG
[Figure: phylogeny of the sampled genomes, with tips labeled by species or strain and by order or class and with order-level clades labeled (scale bar 0.10), shown alongside a per-genome plot of the number of refined clusters with a signature gene (axis 0 to 60).]

Figure SH Partitioning dissimilarity between cluster repertoires (Pleosporales)
[Plot: y-axis, density (0 to 60); x-axis, dissimilarity estimates across all cluster repertoires (0.0 to 0.9). Series: Sørensen dissimilarity (total dissimilarity across all repertoires); turnover component of Sørensen dissimilarity; nestedness component of Sørensen dissimilarity.]
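For readers unfamiliar with the partition named in Figure SH, the sketch below shows the standard pairwise decomposition of Sørensen dissimilarity into a turnover (Simpson) component and a nestedness component. It is an illustration only, not the authors' pipeline: the figure reports estimates across all repertoires, whereas this sketch handles a single pair, and the function name, the genome variables, and the assignment of cluster groups to genomes are hypothetical.

```python
def sorensen_partition(repertoire_a, repertoire_b):
    """Return (total, turnover, nestedness) for two cluster repertoires.

    Each repertoire is an iterable of cluster group labels (presence only).
    total      = (b + c) / (2a + b + c)      Sorensen dissimilarity
    turnover   = min(b, c) / (a + min(b, c)) Simpson dissimilarity
    nestedness = total - turnover
    where a = shared groups, b and c = groups unique to each repertoire.
    """
    set_a, set_b = set(repertoire_a), set(repertoire_b)
    shared = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    if shared + only_a + only_b == 0:
        raise ValueError("both repertoires are empty")
    total = (only_a + only_b) / (2 * shared + only_a + only_b)
    smaller = min(only_a, only_b)
    # Assumption: when one repertoire is empty the turnover term is 0/0;
    # it is set to 0.0 here so that all dissimilarity counts as nestedness.
    turnover = smaller / (shared + smaller) if shared + smaller > 0 else 0.0
    return total, turnover, total - turnover


# Hypothetical repertoires; labels follow the type_groupN form used in the figures.
genome_x = {"PKS_group6", "PKS_group10", "NRPS_group26", "DMAT+NRPS_group1"}
genome_y = {"PKS_group6", "PKS_group10"}
genome_z = {"PKS_group6", "TC_group184"}

nested_pair = sorensen_partition(genome_x, genome_y)
turnover_pair = sorensen_partition(genome_y, genome_z)
```

In the first pair one repertoire is a strict subset of the other, so all of the dissimilarity falls in the nestedness component. In the second pair the repertoires are the same size and each has one unique group, so all of it falls in the turnover component.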

Figure SI Relationship between gene homolog group size and cluster composition diversity (Raup-Crick)

[Plot: y-axis, Raup-Crick dissimilarity (0 to 30); x-axis, gene homolog group size (0 to 900). Labeled points: MCL000001, MCL000002, MCL000003, MCL000005, MCL000006, MCL000016, MCL000017, MCL000033, MCL000109, MCL000176, MCL000193, MCL000253, MCL000357, MCL006196, MCL007003, MCL007271, MCL007288.]