<<

G3: Genes|Genomes|Genetics Early Online, published on August 15, 2018 as doi:10.1534/g3.118.200656

1 2 3 4 5 Early diverging insect-pathogenic fungi of the order possess diverse 6 and unique subtilisin-like serine proteases 7 8 9 Authors 10 Jonathan A. Arnesen1, Joanna Malagocka1, Andrii Gryganskyi2, Igor V. Grigoriev3, Kerstin 11 Voigt4,5, Jason E. Stajich6 (orcid 0000-0002-7591-0020) and Henrik H. De Fine Licht1* 12 (orcid 0000-0003-3326-5729) 13 14 15 1Section for Organismal Biology, Department of Plant and Environmental Sciences, 16 University of Copenhagen, Thorvaldsenvej 40, 1871 Frederiksberg, Denmark. 17 2Department of Biology, Duke University, Durham, North Carolina, USA. 18 3US Department of Energy Joint Genome Institute, Walnut Creek, California, USA. 19 4Jena Microbial Resource Collection (JMRC), Leibniz Institute for Natural Product Research 20 and Infection Biology - Hans Knoell Institute, Adolf-Reichwein-Str.23, 07745 Jena, 21 Germany. 22 5Institute of Microbiology, Friedrich Schiller University, Neugasse 25, 07743 Jena, Germany. 23 6Department of Plant Pathology and Microbiology, University of California, Riverside, 24 California, USA. 25 26 27 *Author for correspondence: 28 Section for Organismal Biology, Department of Plant and Environmental Sciences, 29 University of Copenhagen, Thorvaldsenvej 40, 1871 Frederiksberg, Denmark. Email: 30 [email protected], Tel: +45 35320097 31 32

1

© The Author(s) 2013. Published by the Genetics Society of America. 33 Abstract 34 Insect-pathogenic fungi use subtilisin-like serine proteases (SLSPs) to degrade chitin- 35 associated proteins in the insect procuticle. Most insect-pathogenic fungi in the order 36 Hypocreales () are generalist with a broad host-range, and most species 37 possess a high number of SLSPs. The other major clade of insect-pathogenic fungi is part of 38 the subphylum Entomophthoromycotina (Zoopagomycota, formerly ) which 39 consists of high host-specificity insect-pathogenic fungi that naturally only infect a single or 40 very few host species. The extent to which insect-pathogenic fungi in the order 41 Entomophthorales rely on SLSPs is unknown. Here we take advantage of recently available 42 transcriptomic and genomic datasets from four genera within Entomophthoromycotina: the 43 saprobic or opportunistic pathogens Basidiobolus meristosporus, Conidiobolus coronatus, C. 44 thromboides, C. incongruus, and the host-specific insect pathogens Entomophthora muscae 45 and Pandora formicae, specific pathogens of house flies (Muscae domestica) and wood ants 46 (Formica polyctena), respectively. In total 154 SLSP from six fungi in the subphylum 47 Entomophthoromycotina were identified: E. muscae (n = 22), P. formicae (n = 6), B. 48 meristosporus (n = 60), C. thromboides (n = 18), C. coronatus (n = 36), and C. incongruus (n 49 = 12). A unique group of 11 SLSPs was discovered in the genomes of the obligate biotrophic 50 fungi E. muscae, P. formicae and the saprobic human pathogen C. incongruus that loosely 51 resembles bacillopeptidase F-like SLSPs. Phylogenetics and protein domain analysis show 52 this class represents a unique group of SLSPs so far only observed among Bacteria, 53 Oomycetes and early diverging fungi such as Cryptomycota, , and 54 Entomophthoromycotina. This group of SLSPs is missing in the sister fungal lineages of 55 and the fungal phyla Mucoromyocta, Ascomycota and 56 fungi suggesting interesting gene loss patterns. 57 58 Keywords 59 Subtilase; insect-pathogen; early-diverging fungi; proteases; phylogenomics 60 61 62

2 63 1. Introduction 64 Insect pathogenic fungi use a broad array of enzymes to penetrate the host cuticle and gain 65 entry to the soft tissues inside (Charnley, 2003; St. Leger et al., 1986b). In many cases, serine 66 proteases are among the first enzymes to be secreted in the early stages of infection in order 67 to cleave and open up chitin-associated proteins in the procuticle (St. Leger et al., 1986a; 68 Vega et al., 2012), which later is followed by extensive lipase and chitinase enzymatic 69 secretions (Charnley, 2003). In particular, subtilisin-like serine proteases (SLSPs) have been 70 considered important virulence factors in pathogenic fungi (Muszewska et al., 2011). The 71 first SLSPs from insect pathogenic fungi were identified in Metarhizium anisopliae 72 (ARSEF2575), which secretes SLSPs as some of the key proteases during fungal growth on 73 insect cuticle (Charnley, 2003; St. Leger et al., 1986a). Comparative genomic approaches 74 have identified significant expansions of the SLSP gene family in the Metarhizium 75 (Bagga et al., 2004; Hu et al., 2014), the insect pathogenic Beauveria bassiana (Xiao 76 et al., 2012), two nematode-trapping fungi Monacrosporium haptotylum and Arthrobotrys 77 oligospora that are able to penetrate the chitinaceous cell wall of soil nematodes (Meerupvati 78 et al. 2013), and dermatophytic fungi such as Arthroderma benhamiae and 79 verrucosum that can cause nail and skin infections in humans and animals (Burmester et al., 80 2011; Desjardins et al., 2011; Martinez et al., 2012; Sharpton et al., 2009). Fungi that are able 81 to utilize chitin-rich substrates, including many insect pathogenic fungi, thus appear to often 82 be associated with a diversified and expanded set of SLSPs. 83 84 Although SLSPs are expanded among insect pathogenic fungi, this group of proteases is 85 ubiquitous among eukaryotic organisms. Most SLSPs are secreted externally or localized to 86 vacuoles, and especially in saprobic and symbiotic fungi SLSPs constitute an important 87 component of the secretome (Li et al., 2017). According to the MEROPS peptidase 88 classification, the S8 family of SLSPs together with the S53 family of serine-carboxyl 89 proteases make up the SB clan of subtilases (Rawlings et al., 2016). The S8 family of SLSPs 90 is characterized by an Asp-His-Ser catalytic triad (DHS triad), which forms the active site 91 and is further divided into two subfamilies S8A and S8B. Subfamily S8A contains most S8 92 representatives, including the well-known Proteinase K enzyme that is widely used in 93 laboratories as a broad-spectrum protease. The S8B SLSPs are kexins and furins which 94 cleave peptides and protohormones (Jalving et al., 2000; Muszewska et al., 2017, 2011). 95 Based on characteristic protein domain architectures and protein motifs surrounding the 96 active site residues, the large S8A subfamily of SLSPs is further divided into several groups

3 97 such as proteinase-K and pyrolysin. Besides these two major groups of proteinase K-like and 98 pyrolisin subfamilies, six new groups of subtilase genes designated new 1 to new 6 have 99 recently been found (Li et al., 2017; Muszewska et al., 2011). The analysis of fungal genome 100 data from a wide taxonomic range has shown that the size of the proteinase K gene family 101 has expanded independently in fungi pathogenic to invertebrates (Hypocreales) and 102 vertebrates () (Muszewska et al., 2017; Sharpton et al., 2009). Closely related 103 invasive human-pathogenic fungi, however, do not show the same expansions and related 104 pathogens and non-pathogens can show the same expansions (Muszewska et al., 2011; 105 Whiston and Taylor, 2016). This suggests that the number of SLSPs that a fungus possess is 106 not directly related to pathogenicity, but instead is associated with the use of dead or alive 107 animal tissue as growth substrate (Li et al., 2017; Muszewska et al., 2011). 108 109 Most anamorphic insect-pathogenic fungi in the order Hypocreales (Ascomycota) are 110 generalist species with a broad host-range capable of infecting most major orders of insects 111 (e.g. M. robertsii and B. bassiana) or specific to larger phylogenetic groups (e.g. the locust- 112 specific M. acridum or the coleopteran pathogen B. brongniartii) (Boomsma et al., 2014; Hu 113 et al., 2014). The above-mentioned inferences of fungal SLSP evolution rely almost 114 exclusively on insights from Ascomycota, and consequently have strong sampling bias 115 towards generalist insect-pathogenic fungi. In contrast, the other major clade of insect- 116 pathogenic fungi in the subphylum Entomophthoromycotina (Zoopagomycota, formerly 117 Zygomycota) consists almost exclusively of insect-pathogens and many are extremely host- 118 specific, naturally only infecting a single or very few host species (Spatafora et al., 2016). 119 The dearth of genomic data for Entomophthoromycotina has previously precluded their 120 inclusion in comparative genomic analyses (De Fine Licht et al., 2016; Gryganskyi and 121 Muszewska, 2014). Here we take advantage of recently available transcriptomic and genomic 122 datasets from four genera within Entomophthoromycotina: the saprobic Basidiobolus 123 meristosporus, the saprobic and opportunistic pathogens, Conidiobolus coronatus, C. 124 thromboides, C. incongruus, and the host-specific insect pathogens Entomophthora muscae 125 and Pandora formicae, specific pathogens of house flies (Muscae domestica) and wood ants 126 (Formica polyctena), respectively. We use phylogenetics and protein domain analysis to 127 show that the obligate biotrophic fungi E. muscae, P. formicae and the saprobic human 128 pathogen C. incongruous, in addition to more “classical” fungal SLSPs, harbor a unique 129 group of SLSPs that loosely resembles bacillopeptidase F-like SLSPs. 130

4 131 2. Materials and Methods 132 Sequence database searches for subtilisin-like serine proteases 133 We identified putative subtilisin-like serine proteases (SLSPs) from six fungi in the 134 subphylum Entomophthoromycotina: Entomophthora muscae, Pandora formicae, 135 Basidiobolus meristosporus, Conidiobolus coronatus, C. incongruus and C. thromboides. 136 First, protein family (pfam) domains were identified in the de-novo assembled transcriptomes 137 of E. muscae KVL-14-117 (De Fine Licht et al., 2017) and P. formicae (Małagocka et al., 138 2015) using profile Hidden Markov Models with hmmscan searches (E-value < 1e-10) 139 against the PFAM-A database ver. 31.0 using HMMER ver. 3 (Eddy, 1998; Finn et al., 140 2016). All sequences in the transcriptome datasets containing the S8 subtilisin-like protease 141 domain (PF00082) were identified and included in further analyses. Second, all sequences 142 that contain the PF00082 domain were obtained from the genomes of B. meristosporus CBS 143 931.73 (Mondo et al., 2017), C. coronatus NRRL28638 (Chang et al., 2015), and C. 144 thromboides FSU 785 from the US Department of Energy Joint Genome Institute MycoCosm 145 genome portal (http://jgi.doe.gov/-fungi). Third, predicted coding regions in the genome 146 sequence of C. incongruus CDC-B7586 (Chibucos et al., 2016), were searched for the 147 presence of the S8 subtilisin-like protease domain (PF00082) as described above. 148 149 Sequences encoding an incomplete Asp-His-Ser catalytic triad (DHS triad) characteristic of 150 S8 family proteases were regarded as potential pseudogenes and excluded from further 151 analysis. Although a subset may represent neofunctionalizations, the risk of including false- 152 positive SLSPs in down-stream analyses were considered too high. A preliminary protein 153 alignment made with ClustalW (Larkin et al., 2007) using default parameters and 154 construction of a Neighbour-Joining tree using Geneious 4.8.5 (Kearse et al., 2012) revealed 155 a highly divergent group of P. formicae SLSP-sequences that had significant homology with 156 insect proteases (blastp, E-value < 1e-6, ncbi-nr protein database, accessed June 2017). These 157 putative insect-sequences likely originate from the ant host, Formica polyctena, and 158 represents host contamination that were not filtered out from the dual-RNAseq reads used to 159 construct the P. formicae transcriptome (Małagocka et al., 2015). These divergent sequences 160 were therefore removed and excluded from further analysis. 161 162 Protein domain architecture and sequence clustering 163 The domain architectures of all putative SLSPs identified within Entomophthoromycotina 164 were predicted using Pfam domain annotation. The presence of putative signal peptides for

5 165 extracellular secretion were predicted using SignalP (Petersen et al., 2011). A Markov 166 Clustering Algorithm (MCL) was used to identify clusters of similar proteins among putative 167 SLSPs identified within Entomophthoromycotina. Clustering using MCL is based on a graph 168 constructed by an all-vs-all-BLAST of SLSPs (BLASTP, E-value < 1e-10). The Tribe-MCL 169 protocol (Enright et al., 2002) as implemented in the Spectral Clustering of Protein 170 Sequences (SCPS) program (Nepusz et al., 2010) was used with inflation = 2.0. The inflation 171 parameter is typically set between 1.2 – 5.0 (Nepusz et al., 2010), and controls the “tightness” 172 of the sequence matrix, with lower values leading to fewer clusters and higher values to more 173 sequence clusters. To putatively assign protease function to the newly identified 174 Entomophthoromycotina sequence clusters, the Tribe-MCL protocol was used to identify 175 clusters of similar proteins between the putative SLSPs identified within 176 Entomophthoromycotina and 20,806 protease sequences belonging to the peptidase 177 subfamily S8A obtained from the MEROPS database, accessed November 2017 (Rawlings et 178 al., 2016). Investigation of the MEROPS protease sequences that clustered together with the 179 identified Entomophthoromycotina sequence clusters allowed putative protease holotype 180 information to be assigned to the identified clusters. 181 182 Phylogenetic analysis 183 All identified putative Entomophthoromycotina SLSP coding nucleotide sequences were 184 aligned in frame to preserve codon structure using MAFFT (Larkin et al., 2007). Unreliable 185 codon-columns with a Guidance2 score below 0.90 in the multiple sequence alignment were 186 removed (Penn et al., 2010). The best model for phylogenetic analysis was selected by 187 running PhyML with GTR as substitution model and with-or-without Gamma parameter and 188 a proportion of invariable sites (Guindon and Gascuel, 2003). The optimal substitution model 189 based on the Bayesian Information Criterion (BIC) score (GTR+G) was determined using 190 TOPALI 2.5 (Milne et al., 2009) and used in maximum likelihood analysis using RaxML 191 with 10,000 bootstrap runs (Stamatakis, 2014). 192 193 To identify branches that potentially contain signatures of positive selection among SLSP 194 sequences we used maximum likelihood estimates of the dN/dS ratio (ω) for each site 195 (codon) along protein sequences. A specific lineage (branch) was tested independently for 196 positive selection (ω > 1) on individual sites by applying a neutral model that allows ω to 197 vary between 0 – 1 and a selection model that also incorporates sites with ω > 1 using the

6 198 software codeml implemented in PAML 4.4 (Yang, 2007). Statistical significance was 199 determined with a likelihood ratio test of these two models for the tested lineage. 200 201 To infer phylogenetic relationship of identified putative Entomophthoromycotina SLSPs with 202 fungal SLSPs from other invertebrate-associated ascomycete fungi, two approaches were 203 used. First, all S8A SLSPs in the MEROPS database from the insect-pathogenic genera 204 Metarhizium (n = 240), Cordyceps (including Beauveria) (n = 44), Ophiocordyceps (n = 11), 205 and the nematode-trapping genera Arthrobotrys (n = 33), and Monachrosporium (n = 8), 206 were clustered by sequence identity with the identified SLSPs from Entomophthoromycotina 207 (n = 152) using the MCL approach previously described. Second, from the three identified 208 SLSP groups (A, B, and C) within Entomophthoromycotina, 10 representative group A 209 SLSPs and all sequences from group B (n = 11) and group C (n = 11) were searched against 210 the entire S8A MEROPS database (blastp, e-value = 1e-10). The non-redundant top-ten hits 211 from this search (n = 149) were aligned with the Entomophthoromycotina query SLSPs (n = 212 32) and representative invertebrate-associated SLSPs from the clustering analysis (n = 350) 213 using MAFFT. Unreliable columns with a Guidance2 score below 0.80 in the multiple 214 sequence alignment were removed (Penn et al., 2010). The optimal protein substitution model 215 based on the Bayesian Information Criterion (BIC) score (LG+i) was determined using 216 ModelFinder (Kalyaanamoorthy et al., 2017) and used in maximum likelihood analysis using 217 RaxML with 100 bootstrap runs (Stamatakis, 2014). 218 219 Data availability 220 All data used are available through the US Department of Energy Joint Genome Institute 221 MycoCosm genome portal (http://jgi.doe.gov/-fungi), or has been previously published 222 (Małagocka et al., 2015; Chibucos et al., 2016; De Fine Licht et al., 2017). SLSP sequence 223 data and Tribe-MCL outputs analyzed in this manuscript are provided as a zip-compressed 224 single file ArnesenEtalSupplementaryData.zip in the supplementary material. 225 226 3. Results 227 We identified 154 SLSP sequences from six fungi in the subphylum 228 Entomophthoromycotina: E. muscae (n = 22), P. formicae (n = 6), B. meristosporus (n = 60), 229 C. thromboides (n = 18), C. coronatus (n = 36), and C. incongruus (n = 12). Close inspection 230 of the active site residues revealed two C. incongruus sequences (Ci7229 and Ci12055), 231 which contained the active site DHS residues in the motifs Asp-Asp-Gly, His-Gly-Thr-Arg,

7 232 and Gly-Thr-Ser-Ala/Val-Ala/Ser-Pro characteristic of the S8B subfamily of S8 proteases. 233 These two sequences also contained the P domain (PF01483) indicating that they are S8B 234 kexin proteases. All other 152 identified Entomophthoromycotina S8 protease sequences 235 contained active site residues closely resembling the motifs Asp-Thr/Ser-Gly, His-Gly-Thr- 236 His, and Gly-Thr-Ser-Met-Ala-Xaa-Pro characteristic of the S8A subfamily. Cluster analysis 237 of these 152 S8A-protease sequences identified three groups of proteins that were designated 238 as group A, B and C, with 130, 11 and 11 sequences in each cluster, respectively (Fig. 1). 239 These group memberships do not change when the Tribe-MCL inflation parameters varied 240 within a range of 1.5 – 6.0 suggesting that these three groups were clearly distinct. 241 Phylogenetic analysis of the identified 152 S8A SLSPs using maximum likelihood methods 242 also recovered the same three distinct lineages (Fig. 1). Evidence of positive selection acting 243 on specific enzyme residues on the branches leading to these clusters were not detected with 244 branch-site tests (2ΔlnL > 1.53, P > 0.217). 245 246 To further characterize the three SLSP groups, the protein domain architecture of each of the 247 152 protease sequences were analyzed. The presence of a proteinase-associated (PA, 248 pfam:PF02225) domain was only found in Group B that strongly suggested this cluster with 249 11 members is comprised of pyrolisin and osf proteases (Muszewska et al., 2011). An 250 additional Tribe-MCL cluster analysis (inflation = 1.2) of the 152 Entomophthoromycotina 251 and all S8A proteases in the MEROPS database clustered these 11 Entomophthoromycotina 252 sequences into a group of 779 MEROPS proteases. This group of proteases contained 42 253 members of the fungal S08.139 (PoSl-(Pleurotus ostreatus)-type peptidase) holotype 254 (Supplementary data). The protein domain architecture of group A contained 130 255 Entomopthoromycotina SLSPs, and many of these 130 SLSPs contained a secretory signal 256 and/or a peptidase inhibitor (Inhibitor_I9, pfam: PF05922) domain. The 130 257 Entomophthoromycotina SLSPs clustered with 3,046 MEROPS proteins of well-known 258 fungal entomopathogenic protease holotypes (Supplementary data), including cuticle- 259 degrading peptidase of nematode-trapping fungi (S08.120), cuticle-degrading peptidase of 260 insect-pathogenic fungi in the genus Metarhizium (S08.056), and subtilisin-like peptidase 3 261 (Microsporum-type; S08.115). The 130 Entomophthoromycotina Group A SLSPs thus 262 belong to the common Proteinase K group of S8A proteases (Muszewska et al., 2011) based 263 on conservation of the active site residues (Fig. 2), and cluster membership of the MEROPS 264 Tribe-MCL analysis. 265

8 266 The MEROPS Tribe-MCL analysis identified a third group (C) of 11 267 Entomophthoromycotina SLSPs, which only contained members from the order 268 Entomophthorales and clustered with 402 MEROPS proteins (Supplementary data). Of these, 269 386 were classified as un-assigned subfamily S8A peptidases (S08.UPA) and the remaining 270 15 assigned to the bacillopeptidase F holotype (S08.017; Supplementary data). All 402 271 MEROPS proteases in this group originated from either Bacteria or Oomycetes, except for 272 two proteases from the Fungi Rozella allomycis (Cryptomycota) and Mitosporidium daphnia 273 (Microsporidia), respectively (Fig. 3, 4, Supplementary data). The group C Entomophthorales 274 SLSPs clustered as sister to these two Cryptomycota and Microsporidia proteins with strong 275 support (ML bootstrap value = 99) in the phylogenetic analysis of the protein sequences (Fig. 276 3), in concordance with this group of SLSPs being an outlier from all other previously known 277 fungal S8A SLSPs. The monophyly of group C Entomophthorales SLSPs together with 278 Cryptomycota and Microsporidia proteins were further supported in the phylogenetic analysis 279 including all three Entomophthoromycotina groups (Fig. 6) and when compared to 280 invertebrate-associated ascomycete SLSPs (Fig. 5). 281 282 4. Discussion 283 Subtilisin-like serine proteases (SLSPs) have many roles in fungal biology and are known to 284 be involved in host–pathogen interactions. Independent expansion of copy number and 285 diversification of SLSPs is widespread among animal pathogenic (Ascomycota and 286 Basidiomycota) (Li et al., 2017). The repeated expansion of SLSPs among the generalist 287 insect-pathogenic hypocrealean fungi has been interpreted as an adaptation to enable 288 infection of insect hosts (Muszewska et al., 2011), whereas comparatively little is known 289 about the evolution and diversification of SLSPs among the early diverging fungal clades. To 290 understand the evolution of SLSPs among the vertebrate and arthropod pathogenic fungi in 291 the subphylum Entomophthoromycotina, we searched available genomic and transcriptomic 292 sequence data to identify all Entomophthoromycotina genes with SLSP domains. We found 293 154 Entomophthoromycotina SLSPs, of which two copies were classified as S8B kexin 294 SLSPs. The remaining 152 S8A SLSPs were clustered by sequence similarity and compared 295 by phylogenetic analysis to show that the majority of the SLSPs (n = 130) are similar to and 296 cluster together with “classical” proteinase-K-like fungal S8A SLSPs (Fig. 1). A statistical 297 test for a significant expansion of SLSP copy number among the insect-pathogenic 298 Entomophthoromycotina was not explicitly performed in the present analysis due to 299 uncertainty of total gene numbers from transcriptomic data sets of the specialist insect-

9 300 pathogens E. muscae and P. formicae. In the sampled transcriptomes, the number of 301 transcripts is likely larger than the genome gene count due to splice variants, post- 302 transcriptional modifications, and allelic variants assembling into multiple transcripts per 303 gene. In addition, the assembled transcripts only reflect actively transcribed genes expressed 304 in the sampled conditions and time points, and may underestimate the actual number of 305 genes. These confounding factors impact the estimated number of genes and make 306 quantitative comparative analyses of gene family size between transcriptomes unreliable. 307 308 We did identify 11 SLSPs that cluster together with 402 un-annotated or Bacillopeptidase F- 309 like SLSPs primarily from bacteria and Oomycetes (Fig. 3), but also including two fungal 310 protease sequences from the early-diverging Cryptomycota R. allomycis and microsporidium 311 M. daphnia lineages (Fig. 4). These observations remained consistent even when exploring 312 variation in the inflation parameter, which controls the “tightness” in the cluster analysis. The 313 entomophthoralean and oomycetous S8A SLSPs form separate clades within this 402- 314 sequence cluster of primarily bacterial proteases. Instead, the 11 Entomophthorales SLSPs 315 form a distinct lineage together with the two proteases from R. allomycis and M. daphnia 316 (Fig. 3, 6). This indicates that within this group, the Entomophthorales and Oomycete SLSPs 317 both share a most recent common ancestor with independent bacterial proteases and the 11 318 entomophthoralean proteases, together with the two SLSPs from Cryptomycota and 319 Microsporidia, are a unique group of proteases exclusive to some of the early diverging 320 fungal lineages. 321 322 Functional annotation indicates apparent protease activity based on sequence similarity, but 323 molecular function of the novel 11 SLSPs in group C is unknown. Eight of these SLSPs 324 possess a signal peptide that suggest external secretion and thus indicative of a function on 325 the immediate environment of the fungi, whereas the remaining three SLSPs might not be 326 secreted or represent incomplete sequence models. Apart from the canonical protease S8 327 domain (PF00082), no other Pfam domains were found among this group C SLSPs. Searches 328 against InterPro databases similarly did not reveal any other protein domains apart from the 329 protease S8 SLSP domain (PRINTS: subtilisin serine protease family (S8) signature 330 (PR00723), InterPro: peptidase S8, subtilisin-related (IPR015500), and ProSitePatterns: 331 serine proteases, subtilase family (PS00138)). Notably, none of these SLSPs contain the 332 proteinase inhibitor i9 domain (PF05922) often found among the classical protease K-like 333 SLSPs (Muszewska et al., 2011). Extensive diversification of the amino acids immediately

10 334 surrounding the active site residues in the DHS triad (Fig. 2), further suggests that the new 335 group C of SLSPs have evolved a different function than the “classical” fungal SLSPs in 336 group A. Out of the five Entomophthoromycotina species analysed here, only three: C. 337 incongruus, E. muscae and P. formicae within the order Entomophthorales contained 338 members in the new group C SLSPs (Fig. 1). The unequal phylogenetic presence of the group 339 C SLSPs could be indicative of specific functions related to niche adaptation. The two insect- 340 pathogenic fungi specialized on house flies (E. muscae) and wood ants (P. formicae) contain 341 five and two of the novel group C SLSPs, respectively (Fig. 1). However, the soil saprobe 342 and opportunistic human pathogen C. incongruus also contains four group C SLSPs, which 343 provide evidence that group C SLSPs are unrelated to host-specific evolution of the specialist 344 insect-pathogenic entomophthoralean fungi. 345 346 The novel group C SLSPs were absent from the sequenced genome of C. thromboides, 347 implying that the uneven taxonomic presence and absence of particular SLSPs within 348 Entomophthoromycotina taxa are unlikely to be due to sequencing or sampling artifacts of 349 data. This is further supported by the absence of group C SLSP S8A-proteases in 350 Basidiomycota or Ascomycota (Fig. 3). However, the genomes of ascomycete insect- 351 pathogenic hypocrealean and nematode-trapping fungi within Ascomycota contain different 352 SLSP’s missing in Entomphthoromycotina (Fig. 5). The present analysis shows that the two 353 major groups of insect-pathogenic fungi within Ascomycota and Entomophthoromycotina 354 contain a similar complement of SLSPs, but each clade also possesses unique sets of these 355 proteases that most likely evolved independently (Fig. 6). Further studies including genomic 356 comparisons of the host-specific insect-pathogenic fungi within Entomophthoromycotina will 357 likely shed interesting new light on the gene content of these early diverging fungi (De Fine 358 Licht et al., 2016). The presence of unusual genome organization, polyploidy and large 359 genomes in many host-specific insect-pathogenic species within Entomophthoromycotina has 360 previously been a hindrance to genome sequencing (Gryganskyi and Muszewska, 2014). 361 However, this study highlights examples of many new proteins and enzymes that may be 362 discovered through genome sequencing and data mining within Entomophthoromycotina. 363 364 365 366 Acknowledgements

11 367 The work conducted by the U.S. Department of Energy Joint Genome Institute is supported 368 by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02- 369 05CH11231. Work by JES and AG was partially supported by funding from the National 370 Science Foundation (DEB-1441715) to JES. KV expresses her gratitude to generous financial 371 support from CRC/TR 124 Pathogenic fungi and their human host: Networks of Interaction - 372 FungiNet. (Z1 and A6) granted by the Deutsche Forschungsgemeinschaft. HHDFL was 373 supported by the Villum Foundation (grant no. 10122). 374 375 376 References 377 378 Bagga S., Hu G., Screen S. E., St. Leger R. J., 2004 Reconstructing the diversification of 379 subtilisins in the pathogenic fungus Metarhizium anisopliae. Gene 324: 159–169. 380 doi:10.1016/j.gene.2003.09.031 381 Boomsma J. J., Jensen A. B., Meyling N. V, Eilenberg J., 2014 Evolutionary interaction 382 networks of insect pathogenic fungi. Annu. Rev. Entomol. 59: 467–85. 383 doi:10.1146/annurev-ento-011613-162054 384 Burmester A., Shelest E., Glöckner G., Heddergott C., Schindler S., et al., 2011 Comparative 385 and functional genomics provide insights into the pathogenicity of dermatophytic fungi. 386 Genome Biol. 12: R7. doi:10.1186/gb-2011-12-1-r7 387 Chang Y., Wang S., Sekimoto S., Aerts A., Choi C., et al., 2015 Phylogenomic analyses 388 indicate that early fungi evolved digesting cell walls of algal ancestors of land plants. 389 Genome Biol. Evol. 7: 1590–1601. doi:10.1093/gbe/evv090 390 Charnley A. K., 2003 Fungal pathogens of insects: Cuticle degrading enzymes and toxins. 391 Adv. Bot. Res. 40: 241–321. doi:10.1016/S0065-2296(05)40006-3 392 Chibucos M. C., Soliman S., Gebremariam T., Lee H., Daugherty S., et al., 2016 An 393 integrated genomic and transcriptomic survey of mucormycosis-causing fungi. Nat. 394 Commun. 7: 12218. doi:10.1038/ncomms12218 395 De Fine Licht H. H., Hajek A. E., Eilenberg J., Jensen A. B., 2016 Utilizing genomics to 396 study entomopathogenicity in the fungal phylum : A review of 397 current genetic resources. Adv. Genet. 94: 41–65. doi:10.1016/bs.adgen.2016.01.003 398 De Fine Licht H. H., Jensen A. B., Eilenberg J., 2017 Comparative transcriptomics reveal 399 host-specific nucleotide variation in entomophthoralean fungi. Mol. Ecol. 26: 2092– 400 2110. doi:10.1111/mec.13863 401 Desjardins C. A., Champion M. D., Holder J. W., Muszewska A., Goldberg J., et al., 2011 402 Comparative genomic analysis of human fungal pathogens causing 403 paracoccidioidomycosis. PLoS Genet. 7: e1002345. doi:10.1371/journal.pgen.1002345 404 Eddy S., 1998 Profile hidden Markov models. Bioinformatics 14: 755–763. doi:btb114 [pii] 405 Enright A. J., Van Dongen S., Ouzounis C. A., 2002 An efficient algorithm for large-scale 406 detection of protein families. Nucleic Acids Res. 30: 1575–1584. doi:doi: 407 10.1093/nar/30.7.1575 408 Finn R. D., Coggill P., Eberhardt R. Y., Eddy S. R., Mistry J., et al. 2016 The Pfam protein 409 families database: towards a more sustainable future. Nucleic Acids Res. 44: D279– 410 D285. doi:10.1093/nar/gkv1344 411 Gryganskyi A. P., Muszewska A., 2014 Whole Genome Sequencing and the Zygomycota.

12 412 Fungal Genomics Biol. 4: 10–12. doi:10.4172/2165-8056.1000e116 413 Guindon S., Gascuel O., 2003 A simple, fast, and accurate algorithm to estimate large 414 phylogenies by maximum likelihood. Syst. Biol. 52: 696–704. 415 doi:10.1080/10635150390235520 416 Hu X., Xiao G., Zheng P., Shang Y., Su Y., et al., 2014 Trajectory and genomic determinants 417 of fungal-pathogen speciation and host adaptation. Proc. Natl. Acad. Sci. 111: 16796– 418 16801. doi:10.1073/pnas.1412662111 419 Jalving R., Van De Vondervoort P. J. I., Visser J., Schaap P. J., 2000 Characterization of the 420 kexin-like maturase of Aspergillus niger. Appl. Environ. Microbiol. 66: 363–368. 421 doi:10.1128/AEM.66.1.363-368.2000.Updated 422 Kalyaanamoorthy S., Minh B. Q., Wong T. K. F., von Haeseler A., Jermiin L. S., 2017 423 ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 424 14: 587–589. doi:10.1038/nmeth.4285 425 Kearse M., Moir R., Wilson A., Stones-Havas S., Cheung M., et al., 2012 Geneious Basic: 426 An integrated and extendable desktop software platform for the organization and 427 analysis of sequence data. Bioinformatics 28: 1647–1649. 428 doi:10.1093/bioinformatics/bts199 429 Larkin M. A., Blackshields G., Brown N. P., Chenna R., McGettigan P. A., et al., 2007 430 Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947–2948. 431 doi:10.1093/bioinformatics/btm404 432 Li J., Gu F., Wu R., Yang J. K., Zhang K. Q., 2017 Phylogenomic evolutionary surveys of 433 subtilase superfamily genes in fungi. Sci. Rep. 7: 1–15. doi:10.1038/srep45456 434 Małagocka J., Grell M. N., Lange L., Eilenberg J., Jensen A. B., 2015 Transcriptome of an 435 entomophthoralean fungus (Pandora formicae) shows molecular machinery adjusted for 436 successful host exploitation and transmission. J. Invertebr. Pathol. 128: 47–56. 437 doi:10.1016/j.jip.2015.05.001 438 Martinez D. A., Oliver B. G., Gräser Y., Goldberg J. M., Li W., et al., 2012 Comparative 439 genome analysis of Trichophyton rubrum and related reveals candidate 440 genes involved in infection. MBio 3: e00259-12. doi:10.1128/mBio.00259-12 441 Milne I., Lindner D., Bayer M., Husmeier D., McGuire G., et al., 2009 TOPALi v2: a rich 442 graphical interface for evolutionary analyses of multiple alignments on HPC clusters and 443 multi-core desktops. Bioinformatics 25: 126–127. doi:10.1093/bioinformatics/btn575 444 Mondo S. J., Dannebaum R. O., Kuo R. C., Louie K. B., Bewick A. J., et al. 2017 445 Widespread adenine N6-methylation of active genes in fungi. Nat. Genet. 49: 964–968. 446 doi:10.1038/ng.3859 447 Muszewska A., Stepniewska-Dziubinska M. M., Steczkiewicz K., Pawlowska J., Dziedzic, 448 A., et al., 2017 Fungal lifestyle reflected in serine protease repertoire. Sci. Rep. 7: 9147. 449 doi:10.1038/s41598-017-09644-w 450 Muszewska A., Taylor J. W., Szczesny P., Grynberg M., 2011 Independent subtilases 451 expansions in fungi associated with animals. Mol. Biol. Evol. 28: 3395–3404. 452 doi:10.1093/molbev/msr176 453 Nepusz T., Sasidharan R., Paccanaro A., 2010 SCPS: a fast implementation of a spectral 454 method for detecting protein families on a genome-wide scale. BMC Bioinformatics 11: 455 120. doi:10.1186/1471-2105-11-120 456 Penn O., Privman E., Ashkenazy H., Landan G., Graur D., et al., 2010 GUIDANCE: A web 457 server for assessing alignment confidence scores. Nucleic Acids Res. 38: 458 doi:10.1093/nar/gkq443 459 Petersen T. N., Brunak S., von Heijne G., Nielsen H., 2011 SignalP 4.0: discriminating signal 460 peptides from transmembrane regions. Nat. Methods 8: 785–786. 461 doi:10.1038/nmeth.1701

13 462 Rawlings N. D., Barrett A. J., Finn R., 2016 Twenty years of the MEROPS database of 463 proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 44: D343–D350. 464 doi:10.1093/nar/gkv1118 465 Sharpton T. J., Stajich J. E., Rounsley S. D., Gardner M. J., Wortman J. R., et al. 2009 466 Comparative genomic analyses of the human fungal pathogens Coccidioides and their 467 relatives. Genome Res. 19: 1722–1731. doi:10.1101/gr.087551.108. 468 Spatafora J. W., Chang Y., Benny G. L., Lazarus K., Smith M. E., et al., 2016. A phylum- 469 level phylogenetic classification of zygomycete fungi based on genome-scale data. 470 Mycologia 108: 1028–1046. doi:10.3852/16-042 471 St. Leger R. J., Charnley A. K., Cooper R. M., 1986a. Cuticle-degrading enzymes of 472 entomopathogenic fungi: Synthesis in culture on cuticle. J. Invertebr. Pathol. 48: 85–95. 473 doi:10.1016/0022-2011(86)90146-1 474 St. Leger, R. J., Cooper, R. M., Charnley, A. K., 1986b. Cuticle-degrading enzymes of 475 entomopathogenic fungi: Cuticle degradation in vitro by enzymes from 476 entomopathogens. J. Invertebr. Pathol. 177: 167–177. doi:10.1016/0022- 477 2011(86)90043-1 478 Stamatakis A., 2014 RAxML version 8: A tool for phylogenetic analysis and post-analysis of 479 large phylogenies. Bioinformatics 30: 1312–1313. doi:10.1093/bioinformatics/btu033 480 Vega F. E., Meyling N. V., Luangsa-Ard J. J., Blackwell M., 2012 Fungal entomopathogens, 481 in: Vega, F.E., Kaya, H. (Eds.), Insect Pathology. Elsevier Inc., pp. 171–220. 482 doi:10.1016/B978-0-12-384984-7.00006-3 483 Whiston E., Taylor J. W., 2016 Comparative phylogenomics of pathogenic and 484 nonpathogenic species. G3 Genes|Genomes|Genetics 6: 235–244. 485 doi:10.1534/g3.115.022806 486 Xiao G., Ying S.-H., Zheng P., Wang Z.-L., Zhang S., et al., 2012 Genomic perspectives on 487 the evolution of fungal entomopathogenicity in Beauveria bassiana. Sci. Rep. 2: 488 doi:10.1038/srep00483 489 Yang Z., 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24: 490 1586–1591. doi:10.1093/molbev/msm088 491

14 492 493 Figure 1 494

495 496 497 Figure 1. Maximum likelihood phylogeny calculated with RAxML and based on a 2,379 bp 498 alignment of 152 subtilisin-like serine protease codon nucleotide sequences from 499 Entomophthoromycotina that contain the peptidase S8/S53-subtilisin (PF00082) domain. 500 Branches are coloured for eachs species as Entomophthora muscae (Blue), Pandora formicae 501 (Purple), Conidiobolus coronatus (Pink), C. thromboides (brown), C. incongruus (orange), 502 and B. meristoporus (Green). For each SLSP, the accession number and protein domains 503 additional to PF00082 are shown. The three identified clusters: Protease K cluster (A), 504 Pyrolysin/osf protease cluster (B), and the new bacillopeptidase-like Entomophthorales 505 cluster (C), are marked in the grey circle surrounding the tree and with grey background for 506 cluster B and C. 507

15 508 Figure 2

509 510 Figure 2. Active site and domain co-occurrence variability of the three Tribe-MCL clusters 511 identified among 152 Entomophthoromycotina subtilisin-like serine proteases. The columns 512 DTG, GHGTH, and SGTS represents the closest amino acid sequence for each of the amino 513 acids from the DHS catalytic triad. A. Amino acid alignment of the active site residues for 514 the three identified groups (A-C) of SLSPs within Entomophthoromycotina. Accession codes 515 are color coded as: Orange – C. incongruus, Blue – E. muscae, and Purple – P. formicae. B. 516 Sequence motifs of the active site residues for each group. 517

16 518 Figure 3

519 520 Figure 3. Mid-point rooted maximum likelihood phylogeny calculated with RAxML and 521 based on a (479 amino acid) alignment of 413 protein subtilisin-like serine protease 522 sequences, which belonged to group C in the Tribe-MCL analysis (see text for details). 523 Bootstrap values >70 from 1000 iterations are shown for non-terminal deeper nodes. 524

17 525 Figure 4 526 527

528 529 530 Figure 4. Schematic phylogeny and classification of the early-diverging fungi and related 531 taxonomic groups principally based on Spatafora et al. (2016). Branch lengths are not 532 proportional to genetic distances. 533

18 534 Figure 5 535 536

537 538 539 Figure 5. Venn diagram showing taxonomic distribution of subtilisin-like serine protease 540 clusters of major insect and nematode-pathogenic fungal genera. Black circles correspond to 541 clusters with number of S8A proteases for each cluster, and the placement within the Venn 542 diagram correspond to the taxonomic groups contributing sequences to a specific cluster. The 543 asterisk (*) marks the 11 members of the new bacillo-peptidase like cluster C described 544 within the order Entomophthorales (subphylum: Entomophthoromycotina). 545 Entomophthoromycotina encompasses SLSP’s found in the genera: Basidiobolus, 546 Conidiobolus, Entomophthora, and Pandora. Hypocrealean entities consist of SLSP’s in the 547 MEROPS database from the genera: Cordyceps, Metarhizium, Ophiocordyceps, and 548 Nematode trapping fungi are MEROPS SLSP’s found in the genera: Arthrobotrys and 549 Monacrosporium. 550 551

19 552 Figure 6

553 554 555 Figure 6. Maximum likelihood phylogeny calculated with RAxML and based on a (276 556 amino acid) alignment of 500 protein subtilisin-like serine protease sequences. SLPSs within 557 Entomophthoromyctina from group B and C, and 10 representative SLSPs from group A are 558 included together with the 10 most similar SLSPs in the MEROPS database for each group 559 A, B, and C, respectively. To show the relationship between the three 560 Entomophthoromycotina SLSP groups, representative SLSPs from insect-pathogenic and 561 nematode-trapping fungi were included (see text for details). Bootstrap values >70 from 100 562 iterations are shown for non-terminal nodes. The tree is rooted with the S8B subfamily type 563 Kexin from S. cerevisiae (UniProt: P13134) 564

20